Papers from 08 to 12 September, 2025

Here are the personalized paper recommendations, sorted by relevance.
Economics of Productivity
Abstract
The rapid adoption of autonomous AI agents is giving rise to a new economic layer where agents transact and coordinate at scales and speeds beyond direct human oversight. We propose the "sandbox economy" as a framework for analyzing this emergent system, characterizing it along two key dimensions: its origins (emergent vs. intentional) and its degree of separateness from the established human economy (permeable vs. impermeable). Our current trajectory points toward a spontaneous emergence of a vast and highly permeable AI agent economy, presenting us with opportunities for an unprecedented degree of coordination as well as significant challenges, including systemic economic risk and exacerbated inequality. Here we discuss a number of possible design choices that may lead to safely steerable AI agent markets. In particular, we consider auction mechanisms for fair resource allocation and preference resolution, the design of AI "mission economies" to coordinate around achieving collective goals, and socio-technical infrastructure needed to ensure trust, safety, and accountability. By doing this, we argue for the proactive design of steerable agent markets to ensure the coming technological shift aligns with humanity's long-term collective flourishing.
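
The abstract mentions auction mechanisms for fair resource allocation without specifying which ones; as a minimal illustration of the kind of mechanism a steerable agent market could use, here is a sketch of a sealed-bid second-price (Vickrey) auction, under which truthful bidding is a dominant strategy for each agent. All names are hypothetical, not the paper's design.

    # Sketch of a sealed-bid second-price (Vickrey) auction, one candidate
    # mechanism for fair resource allocation among AI agents. Illustrative
    # only; the paper's actual mechanism designs are not reproduced here.
    def vickrey_auction(bids):
        """bids: {agent_id: bid}. The highest bidder wins,
        but pays only the second-highest bid."""
        if len(bids) < 2:
            raise ValueError("need at least two bidders")
        ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
        winner = ranked[0][0]
        price = ranked[1][1]  # second-highest bid sets the price
        return winner, price

    winner, price = vickrey_auction({"agent_a": 9.0, "agent_b": 7.5, "agent_c": 4.0})
    print(winner, price)  # agent_a wins and pays 7.5

Because the price is set by the runner-up bid, no agent can gain by misreporting its value, a property that makes this family of mechanisms a natural fit for preference resolution among autonomous agents.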
NYU Stern, Information Systems
Abstract
Generative AI is a technology that depends in part on human participation in training and improving its automation potential. We focus on the development of an "AI twin" that could complement its creator's efforts, enabling them to produce higher-quality output in their individual style. However, AI twins could also, over time, replace individual humans. We analyze this trade-off using a principal-agent model in which agents have the opportunity to make investments into training an AI twin that lead to a lower cost of effort, a higher probability of success, or both. We propose a new framework to situate the model, in which tasks vary in the ease with which AI output can be improved by the human (their "editability") and in the extent to which a non-expert can assess the quality of output (their "verifiability"). Our synthesis of recent empirical studies indicates that productivity gains from the use of generative AI are higher overall when task editability is higher, while non-experts enjoy greater relative productivity gains on tasks with higher verifiability. We show that during investment a strategic agent will trade off improvements in quality against ease of effort to preserve their wage bargaining power. Tasks with high verifiability and low editability are most aligned with a worker's incentives to train their twin, but for tasks where the stakes are low, this alignment is constrained by the risk of displacement. Our results suggest that sustained improvements in company-sponsored generative AI will require nuanced design of human incentives, and that public policy encouraging a balance between worker returns and generative AI improvements could yield more sustained long-run productivity gains.
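
The abstract gives no functional forms; the toy calculation below (every parameter and functional form is an assumption made for illustration, not the paper's model) shows how an agent weighing a higher success probability and a lower effort cost against displacement risk can settle on an interior level of investment in its AI twin.

    # Toy numerical illustration of the AI-twin investment trade-off.
    # All functional forms and parameters are assumptions for illustration;
    # this is not the paper's principal-agent model.
    import numpy as np

    wage, effort_cost, outside_option = 10.0, 4.0, 2.0
    t = np.linspace(0.0, 1.0, 101)        # investment in training the twin
    p_success = 0.5 + 0.4 * t             # twin raises probability of success
    cost = effort_cost * (1.0 - 0.6 * t)  # twin lowers the cost of effort
    p_displaced = 0.5 * t**2              # but raises the risk of displacement

    payoff = (1 - p_displaced) * (p_success * wage - cost) + p_displaced * outside_option
    print(f"payoff-maximizing investment: t = {t[np.argmax(payoff)]:.2f}")
    # With these assumed numbers the optimum is interior (t ~ 0.87):
    # the agent stops short of fully training its twin to limit
    # displacement risk and preserve wage bargaining power.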
AI for Productivity Tools
Google DeepMind and others
Abstract
The cycle of scientific discovery is frequently bottlenecked by the slow, manual creation of software to support computational experiments. To address this, we present an AI system that creates expert-level scientific software whose goal is to maximize a quality metric. The system uses a Large Language Model (LLM) and Tree Search (TS) to systematically improve the quality metric and intelligently navigate the large space of possible solutions. The system achieves expert-level results when it explores and integrates complex research ideas from external sources. The effectiveness of tree search is demonstrated across a wide range of benchmarks. In bioinformatics, it discovered 40 novel methods for single-cell data analysis that outperformed the top human-developed methods on a public leaderboard. In epidemiology, it generated 14 models that outperformed the CDC ensemble and all other individual models for forecasting COVID-19 hospitalizations. Our method also produced state-of-the-art software for geospatial analysis, neural activity prediction in zebrafish, time series forecasting and numerical solution of integrals. By devising and implementing novel solutions to diverse tasks, the system represents a significant step towards accelerating scientific progress.
AI Insights
  • The batch-integration task forbids use of the cell_type label, forcing reliance on scanpy, sklearn, numpy, scipy, tensorflow, torch, or jax.
  • Evaluation spans ASW Batch, ASW Label, ARI, NMI, kBET, iLISI, and PCR for batch‑integration quality.
  • Scalable batch removal is achieved by blending CCA, MNN, and BBKNN insights into a unified, GPU‑friendly pipeline.
  • The system’s tree search explores millions of code variants, pruning by a quality metric that balances accuracy and computational cost (a minimal sketch follows this list).
  • Future work could integrate JAX‑accelerated differentiable programming to learn batch‑removal parameters end‑to‑end.
  • A key challenge is handling datasets with many batches or complex technical noise without exploding memory usage.
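
A rough sketch of that search loop, with propose_variants standing in for the LLM rewriting step and quality for the task's metric; both are hypothetical stand-ins, not the system's actual components.

    # Minimal score-guided tree search over candidate programs.
    # propose_variants() stands in for the LLM's rewriting step and
    # quality() for the task's quality metric; both are hypothetical.
    import heapq

    def tree_search(seed, propose_variants, quality, budget=100, beam=5):
        """Repeatedly expand the best-scoring candidate, keeping a beam."""
        frontier = [(-quality(seed), seed)]  # negate scores for a max-heap
        best_score, best = quality(seed), seed
        for _ in range(budget):
            if not frontier:
                break
            _, program = heapq.heappop(frontier)
            for child in propose_variants(program):
                s = quality(child)
                if s > best_score:
                    best_score, best = s, child
                heapq.heappush(frontier, (-s, child))
            frontier = heapq.nsmallest(beam, frontier)  # prune to top beam
            heapq.heapify(frontier)
        return best, best_score

    # Toy usage: "programs" are integers, quality rewards closeness to 42.
    best, score = tree_search(0, lambda p: [p + 1, p + 7, p - 3],
                              lambda p: -abs(42 - p), budget=50)
    print(best, score)  # converges to 42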
Deakin University
Abstract
Organizations and educational institutions use time-bound assessment tasks to evaluate coding and problem-solving skills. These assessments measure not only the correctness of solutions but also their efficiency. Problem setters (educators/interviewers) are responsible for crafting these challenges, carefully balancing difficulty and relevance to create meaningful evaluation experiences. Conversely, problem solvers (students/interviewees) apply coding efficiency and logical thinking to arrive at correct solutions. Large language models (LLMs) can assist problem setters in generating diverse and challenging questions, but they can also undermine assessment integrity by giving problem solvers easy access to solutions. This paper introduces OpenCoderRank, an easy-to-use platform for simulating technical assessments. It acts as a bridge between problem setters and problem solvers, helping solvers prepare for time constraints and unfamiliar problems while letting setters self-host assessments, offering a no-cost, customizable solution for technical assessments in resource-constrained environments.
AI Insights
  • OpenCoderRank runs on Flask with SQLite, enabling rapid deployment on modest hardware (a minimal sketch of such a stack follows this list).
  • It auto‑judges MCQs and executes user code in isolated containers for SQL, Python, and Java.
  • Full‑screen mode, copy‑paste blocking, and random ordering protect against LLM‑assisted cheating.
  • Educators embed it in LMS to run timed drills that mirror real interview challenges.
  • Ideal for intra‑college contests, peer learning, and startup hiring tests, all free of charge.
  • Key reading: Bhushan et al. 2025 on LLM‑answer detection and Desmond et al. 2025 on LLM‑as‑a‑judge.
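
For readers unfamiliar with that stack, here is a minimal Flask-plus-SQLite sketch; the route and schema are hypothetical illustrations, not OpenCoderRank's actual code.

    # Minimal Flask + SQLite sketch of the kind of stack OpenCoderRank
    # reportedly uses. Route and schema are hypothetical illustrations.
    import sqlite3
    from flask import Flask, jsonify

    app = Flask(__name__)
    DB = "assessments.db"

    def init_db():
        with sqlite3.connect(DB) as conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS questions ("
                "id INTEGER PRIMARY KEY, prompt TEXT, kind TEXT)"  # kind: 'mcq' or 'code'
            )

    @app.route("/questions")
    def questions():
        with sqlite3.connect(DB) as conn:
            rows = conn.execute("SELECT id, prompt, kind FROM questions").fetchall()
        return jsonify([{"id": r[0], "prompt": r[1], "kind": r[2]} for r in rows])

    if __name__ == "__main__":
        init_db()
        app.run(host="127.0.0.1", port=5000)  # modest hardware is enough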
LLMs for Productivity
Abstract
The success and wide adoption of generative AI (GenAI), particularly large language models (LLMs), have attracted the attention of cybercriminals seeking to abuse models, steal sensitive data, or disrupt services. Moreover, securing LLM-based systems is a great challenge, as both traditional threats to software applications and threats targeting LLMs and their integration must be mitigated. In this survey, we shed light on the security and privacy concerns of such LLM-based systems by performing a systematic review and comprehensive categorization of threats and defensive strategies across the entire software and LLM life cycles. We analyze real-world scenarios with distinct characteristics of LLM usage, spanning development to operation. In addition, threats are classified according to their severity level and the scenarios to which they pertain, facilitating identification of the most relevant threats. Recommended defense strategies are systematically categorized and mapped to the corresponding life cycle phase and the attack strategies they attenuate. This work paves the way for consumers and vendors to understand and efficiently mitigate risks when integrating LLMs into their solutions or organizations. It also enables the research community to benefit from the discussion of open challenges and edge cases that may hinder the secure and privacy-preserving adoption of LLM-based systems.
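
The abstract does not enumerate the taxonomy itself; one plausible way to represent such a mapping of threats to life cycle phase, severity, and mitigations is sketched below (entries are illustrative examples, not the survey's actual categorization).

    # Illustrative schema for mapping LLM-system threats to life cycle
    # phase, severity, and mitigations, in the spirit of the survey's
    # categorization. Entries are examples, not the survey's taxonomy.
    THREATS = {
        "prompt_injection": {
            "phase": "operation",
            "severity": "high",
            "mitigations": ["input sanitization", "privilege separation for tools"],
        },
        "training_data_poisoning": {
            "phase": "development",
            "severity": "high",
            "mitigations": ["dataset provenance checks", "anomaly filtering"],
        },
        "model_theft": {
            "phase": "operation",
            "severity": "medium",
            "mitigations": ["rate limiting", "output watermarking"],
        },
    }

    def mitigations_for(phase):
        """List recommended defenses for a given life cycle phase."""
        return {name: t["mitigations"] for name, t in THREATS.items()
                if t["phase"] == phase}

    print(mitigations_for("operation"))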
Oak Ridge National Laboratory
Abstract
Engineering curricula and standards cover many material and manufacturing options. However, engineers and designers are often unfamiliar with certain composite materials or manufacturing techniques. Large language models (LLMs) could potentially bridge this gap: their capacity to store and retrieve data from large databases gives them a breadth of knowledge across disciplines. However, their generalized knowledge base can lack targeted, industry-specific knowledge. To this end, we present two LLM-based applications built on the GPT-4 architecture: (1) the Composites Guide, a system that provides expert knowledge on composite materials and connects users with research and industry professionals who can provide additional support, and (2) the Equipment Assistant, a system that provides guidance for manufacturing tool operation and material characterization. By combining the knowledge of general AI models with industry-specific knowledge, both applications are intended to provide more meaningful information for engineers. In this paper, we discuss the development of the applications and evaluate them through a benchmark and two informal user studies. The benchmark analysis uses the ROUGE and BERTScore metrics to compare our models against GPT-4o; the results show that the proposed models perform similarly to or better than GPT-4o on both metrics. The two user studies supplement this quantitative evaluation by asking experts for qualitative, open-ended feedback on model performance over a set of domain-specific questions. Both studies highlight the potential for more detailed and specific responses from the Composites Guide and the Equipment Assistant.
AI Insights
  • The benchmark includes 100 Q&As on damping ratio, Poisson’s ratio, and thermal transport.
  • A further 100 Q&As per manufacturing process cover mold size, clamp opening, and electrical specs.
  • ROUGE and BERTScore benchmarking shows the specialized LLM matches or exceeds GPT‑4o (a scoring sketch follows this list).
  • Human ratings confirm higher accuracy and relevance for domain‑specific queries.
  • Weaknesses include occasional inaccurate fabrication steps and incomplete injection‑molding breakdowns.
  • Suggested reading: “Composites: Materials, Manufacturing and Design” and “Advanced Composites for Aerospace, Marine and Land Applications.”
  • Poisson’s ratio: transverse contraction over longitudinal extension in tensile testing (ν = −ε_transverse / ε_longitudinal).
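
For reference, this is how such a ROUGE/BERTScore comparison can be computed with the rouge-score and bert-score Python packages; the reference and candidate strings below are placeholders, not the paper's benchmark data.

    # Sketch of the kind of ROUGE/BERTScore comparison the paper reports.
    # The reference/candidate strings are placeholders, not benchmark data.
    from rouge_score import rouge_scorer
    from bert_score import score as bert_score

    reference = "Poisson's ratio is transverse contraction over longitudinal extension."
    candidate = "Poisson's ratio equals transverse strain divided by longitudinal strain."

    # ROUGE measures n-gram and longest-common-subsequence overlap.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    rouge = scorer.score(reference, candidate)
    print({k: round(v.fmeasure, 3) for k, v in rouge.items()})

    # BERTScore compares contextual embeddings rather than surface overlap.
    P, R, F1 = bert_score([candidate], [reference], lang="en")
    print(f"BERTScore F1: {F1.item():.3f}")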