Papers from 08 to 12 September, 2025
Here are the personalized paper recommendations, sorted by relevance.
Economics of Productivity
Abstract
The rapid adoption of autonomous AI agents is giving rise to a new economic
layer where agents transact and coordinate at scales and speeds beyond direct
human oversight. We propose the "sandbox economy" as a framework for analyzing
this emergent system, characterizing it along two key dimensions: its origins
(emergent vs. intentional) and its degree of separateness from the established
human economy (permeable vs. impermeable). Our current trajectory points toward
a spontaneous emergence of a vast and highly permeable AI agent economy,
presenting us with opportunities for an unprecedented degree of coordination as
well as significant challenges, including systemic economic risk and
exacerbated inequality. Here we discuss a number of possible design choices
that may lead to safely steerable AI agent markets. In particular, we consider
auction mechanisms for fair resource allocation and preference resolution, the
design of AI "mission economies" to coordinate around achieving collective
goals, and socio-technical infrastructure needed to ensure trust, safety, and
accountability. By doing this, we argue for the proactive design of steerable
agent markets to ensure the coming technological shift aligns with humanity's
long-term collective flourishing.
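The abstract leaves the auction mechanisms unspecified; one standard candidate for fair resource allocation among agents is a sealed-bid second-price (Vickrey) auction, sketched below. The agent names and bid values are invented for illustration, not taken from the paper.

```python
def second_price_auction(bids):
    """Sealed-bid second-price (Vickrey) auction.

    bids: dict mapping agent id -> bid amount.
    The highest bidder wins but pays the second-highest bid,
    which makes truthful bidding a dominant strategy.
    Returns (winner, price).
    """
    if len(bids) < 2:
        raise ValueError("need at least two bidders")
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1]  # second-highest bid sets the price
    return winner, price

winner, price = second_price_auction(
    {"agent_a": 10.0, "agent_b": 7.5, "agent_c": 9.0}
)
# agent_a wins and pays 9.0 (the second-highest bid)
```

Truthfulness is the property that makes this mechanism attractive for agent markets: no agent can gain by misreporting its valuation of the resource.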
NYU Stern, Information Sy
Abstract
Generative AI is a technology which depends in part on participation by
humans in training and improving the automation potential. We focus on the
development of an "AI twin" that could complement its creator's efforts,
enabling them to produce higher-quality output in their individual style.
However, AI twins could also, over time, replace individual humans. We analyze
this trade-off using a principal-agent model in which agents have the
opportunity to make investments into training an AI twin that lead to a lower
cost of effort, a higher probability of success, or both. We propose a new
framework to situate the model, in which the tasks performed vary in the ease
with which AI output can be improved by the human (the "editability") and in
the extent to which a non-expert can assess the quality of the output (its
"verifiability"). Our synthesis of recent empirical studies indicates that
productivity gains from the use of generative AI are higher overall when task
editability is higher, while non-experts enjoy greater relative productivity
gains for tasks with higher verifiability. We show that during investment a
strategic agent will trade off improvements in quality and ease of effort to
preserve their wage bargaining power. Tasks with high verifiability and low
editability are most aligned with a worker's incentives to train their twin,
but for tasks where the stakes are low, this alignment is constrained by the
risk of displacement. Our results suggest that sustained improvements in
company-sponsored generative AI will require nuanced design of human
incentives, and that public policy which encourages balancing worker returns
with generative AI improvements could yield more sustained long-run
productivity gains.
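The paper's formal principal-agent model is not reproduced in the abstract; the toy sketch below only illustrates the shape of the trade-off it describes, where investing in an AI twin raises the probability of success but erodes wage bargaining power through displacement risk. Every functional form and parameter here is my own assumption.

```python
def agent_payoff(x, base_p=0.5, gain=0.4, wage=1.0, effort_cost=0.3, displace=0.8):
    """Toy payoff for a worker who invests a fraction x in training an AI twin.

    Investment raises the probability of task success (quality channel)
    but also raises displacement risk, eroding wage bargaining power.
    Illustrative assumptions only, not the paper's model.
    """
    p_success = base_p + gain * x      # twin raises success probability
    bargaining = 1.0 - displace * x    # displacement risk erodes bargaining power
    return p_success * wage * bargaining - effort_cost * (1 - x)

# Grid-search the investment level that maximizes the worker's payoff.
best_x = max((i / 100 for i in range(101)), key=agent_payoff)
```

With these parameters the optimum is interior (0 < best_x < 1): the strategic agent stops short of full investment to preserve bargaining power, echoing the trade-off stated in the abstract.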
AI for Productivity Tools
Google DeepMind and other
Abstract
The cycle of scientific discovery is frequently bottlenecked by the slow,
manual creation of software to support computational experiments. To address
this, we present an AI system that creates expert-level scientific software
whose goal is to maximize a quality metric. The system uses a Large Language
Model (LLM) and Tree Search (TS) to systematically improve the quality metric
and intelligently navigate the large space of possible solutions. The system
achieves expert-level results when it explores and integrates complex research
ideas from external sources. The effectiveness of tree search is demonstrated
across a wide range of benchmarks. In bioinformatics, it discovered 40 novel
methods for single-cell data analysis that outperformed the top human-developed
methods on a public leaderboard. In epidemiology, it generated 14 models that
outperformed the CDC ensemble and all other individual models for forecasting
COVID-19 hospitalizations. Our method also produced state-of-the-art software
for geospatial analysis, neural activity prediction in zebrafish, time series
forecasting and numerical solution of integrals. By devising and implementing
novel solutions to diverse tasks, the system represents a significant step
towards accelerating scientific progress.
AI Insights - Integration forbids cell_type usage, forcing reliance on scanpy, sklearn, numpy, scipy, tensorflow, torch, or jax.
- Evaluation spans ASW Batch, ASW Label, ARI, NMI, kBET, iLISI, and PCR for batch-integration quality.
- Scalable batch removal is achieved by blending CCA, MNN, and BBKNN insights into a unified, GPU-friendly pipeline.
- The system's tree search explores millions of code variants, pruning by a quality metric that balances accuracy and computational cost.
- Future work could integrate JAX-accelerated differentiable programming to learn batch-removal parameters end-to-end.
- A key challenge is handling datasets with many batches or complex technical noise without exploding memory usage.
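The system's internals are not spelled out in the abstract; the sketch below shows only the general shape of a best-first tree search over candidate solutions, with a stand-in `mutate` function playing the role of the LLM proposing revisions and `score` as the quality metric being maximized.

```python
import heapq
import random

def tree_search(seed, score, mutate, budget=200, beam=5):
    """Minimal best-first search over candidate solutions.

    `mutate` stands in for an LLM proposing revisions; `score` is the
    quality metric. The frontier is pruned to the top `beam` nodes,
    mirroring pruning by a quality metric.
    """
    frontier = [(-score(seed), seed)]        # max-heap via negated scores
    best_s, best = score(seed), seed
    for _ in range(budget):
        if not frontier:
            break
        _, node = heapq.heappop(frontier)    # expand the current best node
        for child in mutate(node):
            s = score(child)
            if s > best_s:
                best_s, best = s, child
            heapq.heappush(frontier, (-s, child))
        frontier = heapq.nsmallest(beam, frontier)  # prune to beam width
        heapq.heapify(frontier)
    return best, best_s

# Toy instance: maximize -(x - 3)^2 by perturbing a number.
random.seed(0)
best, s = tree_search(
    seed=0.0,
    score=lambda x: -(x - 3.0) ** 2,
    mutate=lambda x: [x + random.uniform(-1, 1) for _ in range(4)],
)
```

In the real system each node would be a full program and scoring would mean running it on the benchmark, but the search skeleton, expand the most promising candidate and keep a bounded frontier, is the same.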
Deakin University
Abstract
Organizations and educational institutions use time-bound assessment tasks to
evaluate coding and problem-solving skills. These assessments measure not only
the correctness of the solutions, but also their efficiency. Problem setters
(educator/interviewer) are responsible for crafting these challenges, carefully
balancing difficulty and relevance to create meaningful evaluation experiences.
Conversely, problem solvers (student/interviewee) apply coding efficiency and
logical thinking to arrive at correct solutions. In the era of Large Language
Models (LLMs), LLMs assist problem setters in generating diverse and
challenging questions, but they can undermine assessment integrity for problem
solvers by providing easy access to solutions. This paper introduces
OpenCoderRank, an easy-to-use platform designed to simulate technical
assessments. It acts as a bridge between problem setters and problem solvers,
helping solvers prepare for time constraints and unfamiliar problems while
allowing setters to self-host assessments, offering a no-cost and customizable
solution for technical assessments in resource-constrained environments.
AI Insights - OpenCoderRank runs on Flask with SQLite, enabling rapid deployment on modest hardware.
- It auto-judges MCQs and executes user code in isolated containers for SQL, Python, and Java.
- Full-screen mode, copy-paste blocking, and random ordering protect against LLM-assisted cheating.
- Educators can embed it in an LMS to run timed drills that mirror real interview challenges.
- Ideal for intra-college contests, peer learning, and startup hiring tests, all free of charge.
- Key reading: Bhushan et al. 2025 on LLM-answer detection and Desmond et al. 2025 on LLM-as-a-judge.
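OpenCoderRank's judging code is not shown here; as a minimal sketch of the auto-judging step, the snippet below runs a submitted Python solution in a subprocess with a time limit and compares its output to the expected answer. A subprocess timeout is only a rough stand-in for the isolated containers the platform reportedly uses.

```python
import subprocess
import sys

def judge(code, stdin_data, expected, timeout=2.0):
    """Run submitted Python code in a child process and grade its output.

    Returns 'AC' (accepted), 'WA' (wrong answer), or 'TLE' (time limit
    exceeded). A real judge would sandbox execution in a container with
    resource limits; this sketch provides only a wall-clock timeout.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin_data,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "TLE"
    return "AC" if proc.stdout.strip() == expected.strip() else "WA"

verdict = judge("print(int(input()) * 2)", "21", "42")
# verdict == "AC"
```

The same loop extends naturally to per-test-case grading: run the submission once per input file and aggregate the verdicts.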
LLMs for Productivity
Abstract
The success and wide adoption of generative AI (GenAI), particularly large
language models (LLMs), has attracted the attention of cybercriminals seeking
to abuse models, steal sensitive data, or disrupt services. Moreover, providing
security to LLM-based systems is a great challenge, as both traditional threats
to software applications and threats targeting LLMs and their integration must
be mitigated. In this survey, we shed light on security and privacy concerns of
such LLM-based systems by performing a systematic review and comprehensive
categorization of threats and defensive strategies considering the entire
software and LLM life cycles. We analyze real-world scenarios with distinct
characteristics of LLM usage, spanning from development to operation. In
addition, threats are classified according to their severity level and to which
scenarios they pertain, facilitating the identification of the most relevant
threats. Recommended defense strategies are systematically categorized and
mapped to the corresponding life cycle phase and possible attack strategies
they attenuate. This work paves the way for consumers and vendors to understand
and efficiently mitigate risks during integration of LLMs in their respective
solutions or organizations. It also enables the research community to benefit
from the discussion of open challenges and edge cases that may hinder the
secure and privacy-preserving adoption of LLM-based systems.
Oak Ridge National Laboratory
Abstract
Engineering educational curriculum and standards cover many material and
manufacturing options. However, engineers and designers are often unfamiliar
with certain composite materials or manufacturing techniques. Large language
models (LLMs) could potentially bridge the gap. Their capacity to store and
retrieve data from large databases provides them with a breadth of knowledge
across disciplines. However, their generalized knowledge base can lack
targeted, industry-specific knowledge. To this end, we present two LLM-based
applications based on the GPT-4 architecture: (1) The Composites Guide: a
system that provides expert knowledge on composites material and connects users
with research and industry professionals who can provide additional support and
(2) The Equipment Assistant: a system that provides guidance for manufacturing
tool operation and material characterization. By combining the knowledge of
general AI models with industry-specific knowledge, both applications are
intended to provide more meaningful information for engineers. In this paper,
we discuss the development of the applications and evaluate them through a
benchmark and two informal user studies. The benchmark analysis uses the Rouge
and Bertscore metrics to evaluate our model performance against GPT-4o. The
results show that the proposed models perform similarly to or better than
GPT-4o on the ROUGE and BERTScore metrics. The two user studies supplement this
quantitative evaluation by asking experts to provide qualitative and open-ended
feedback about our model performance on a set of domain-specific questions. The
results of both studies highlight a potential for more detailed and specific
responses with the Composites Guide and the Equipment Assistant.
AI Insights - 100 Q&As on damping ratio, Poisson's ratio, and thermal transport.
- 100 Q&As per process covering mold size, clamp opening, and electrical specs.
- ROUGE and BERTScore benchmarking shows the specialized LLM matches or exceeds GPT-4o.
- Human ratings confirm higher accuracy and relevance for domain-specific queries.
- Weaknesses include occasional inaccurate fabrication steps and incomplete injection-molding breakdowns.
- Suggested reading: "Composites: Materials, Manufacturing and Design" and "Advanced Composites for Aerospace, Marine and Land Applications."
- Poisson's Ratio: transverse contraction over longitudinal extension in tensile testing.
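ROUGE is used for benchmarking above but never defined; ROUGE-1 is the unigram-overlap F1 between a candidate and a reference text. The hand-rolled sketch below uses plain whitespace tokenization and skips the stemming of standard implementations, so it only approximates library scores.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """ROUGE-1 F1: unigram overlap between candidate and reference.

    Precision = overlap / candidate length, recall = overlap / reference
    length; F1 is their harmonic mean. Whitespace tokenization only.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection of unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the resin cures at room temperature",
                  "the resin cures slowly at room temperature")
# score == 12/13, roughly 0.923
```

BERTScore, by contrast, matches contextual embeddings rather than surface unigrams, which is why the two metrics are usually reported together.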