Papers from 15 to 19 September, 2025

Here are the personalized paper recommendations, sorted by relevance.
Economics of Productivity
Turku School of Economics
Abstract
This paper critically investigates standard total factor productivity (TFP) measurement in the public sector, where output information is often incomplete or distorted. The analysis reveals fundamental paradoxes under three common output measurement conventions. When cost-based value added is used as the aggregate output, measured TFP may paradoxically decline as a result of genuine productivity-enhancing changes such as technical progress and improved allocative and scale efficiencies, as well as reductions in real input prices. We show that the same problems carry over to the situation where the aggregate output is constructed as the cost-share weighted index of outputs. In the case of distorted output prices, measured TFP may move independently of any productivity changes and instead reflect shifts in pricing mechanisms. Using empirical illustrations from the United Kingdom and Finland, we demonstrate that such distortions are not merely theoretical but are embedded in widely used public productivity statistics. We argue that public sector TFP measurement requires a shift away from cost-based aggregation of outputs and toward non-market valuation methods grounded in economic theory.
AI Insights
  • Cost‑based output aggregation can make TFP decline even when technical progress and scale efficiencies rise.
  • Distorted output prices cause TFP to track regulatory shifts rather than real productivity.
  • The 2025 System of National Accounts still relies on cost‑based valuation for non‑market outputs, perpetuating bias.
  • Non‑market valuation methods (e.g., contingent valuation) offer a viable alternative for public‑sector TFP.
  • UK and Finland data confirm that standard public‑sector productivity stats are already affected by these paradoxes.
  • Systematic application of non‑market valuation to health and education could uncover true productivity gains.
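A toy calculation makes the first paradox concrete. The numbers below are hypothetical and only illustrate the mechanism described in the abstract: when non-market output is valued at production cost, a genuine efficiency gain combined with a fall in real input prices shows up as a decline in measured TFP.

```python
# Toy illustration of the cost-based TFP paradox (hypothetical numbers, not from the paper).
# A public agency delivers the same real service volume in both periods, but
# technical progress lets it use 10% less labour and the real wage falls 5%.

def tfp_growth(output_ratio: float, input_ratio: float) -> float:
    """TFP growth = output quantity index / input quantity index."""
    return output_ratio / input_ratio

services = {"t0": 100.0, "t1": 100.0}   # true service volume (unchanged)
labour   = {"t0": 100.0, "t1": 90.0}    # 10% efficiency gain
wage     = {"t0": 1.00,  "t1": 0.95}    # real input price falls

input_ratio = labour["t1"] / labour["t0"]                        # 0.90

# Convention A: output measured by true volume -> TFP correctly rises ~11%.
print(tfp_growth(services["t1"] / services["t0"], input_ratio))  # 1.111

# Convention B: output valued at production cost (cost-based value added)
# -> the genuine improvement registers as a TFP *decline*.
cost = {t: labour[t] * wage[t] for t in ("t0", "t1")}
print(tfp_growth(cost["t1"] / cost["t0"], input_ratio))          # 0.95
```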
University of Brescia
Abstract
With the growth of artificial skills, organizations may increasingly confront the problem of optimizing skill policy decisions guided by economic principles. This paper addresses the underlying complexity of this challenge by developing an in-silico framework based on Monte Carlo simulations grounded in empirical realism to analyze the economic impact of human and machine skills, individually or jointly deployed, in the execution of tasks presenting varying levels of complexity. Our results provide quantitative support for the established notions that automation tends to be the most economically effective strategy for tasks characterized by low-to-medium generalization difficulty, while automation may struggle to match the economic utility of human skills in more complex scenarios. Critically, our simulations highlight that combining human and machine skills can be the most effective strategy when a high level of generalization is required, but only if genuine augmentation is achieved. In contrast, when this synergy fails to materialize, the human-machine policy is severely penalized by the inherent costs of its dual skill structure, causing it to destroy value and become the worst choice from an economic perspective. The takeaway for decision-makers is unambiguous: in contexts requiring high generalization capabilities, simply allocating human and machine skills to a task is insufficient, and a human-machine skill policy is neither a silver-bullet solution nor a low-risk compromise. Rather, it is a critical opportunity to boost competitiveness that demands a strong organizational commitment to enabling augmentation. Also, our findings show that improving the cost-effectiveness of machine skills over time, while useful, does not replace the fundamental need to focus on achieving augmentation.
AI Insights
  • The agency’s policy loop first decides to adopt or reject models A and B, then optimizes the skill mix per prediction.
  • Total cost equals error cost plus incremental skill cost, guiding the two‑stage decision.
  • Using Model B on low‑generalization tasks and Junior‑rep + Model B on hard tasks cuts total cost by 26.8 % versus Senior‑only.
  • Augmentation appears only with Junior partners; Senior pairings with either model provide no benefit.
  • Even as machine costs decline, the economic edge vanishes without genuine augmentation, demanding organizational commitment.
  • Recommended reads: “A Framework for Human‑Machine Collaboration in Decision Making” and “Human‑Machine Interaction: A Review of the Literature.”
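The two-stage logic in these insights lends itself to a small simulation. The sketch below is a minimal Monte Carlo comparison of three skill policies; the error rates and costs are invented placeholders chosen to reproduce the qualitative pattern in the abstract, not the paper's calibrated values.

```python
# Minimal Monte Carlo sketch comparing skill policies across task difficulty.
# All error rates and costs are invented placeholders, not the paper's calibration.
import random

ERROR_COST = 100.0
SKILL_COST = {"machine": 5.0, "human": 20.0, "hybrid": 30.0}  # hybrid pays for both skills

def error_prob(policy: str, difficulty: float, augmented: bool) -> float:
    if policy == "machine":
        return 0.02 + 0.45 * difficulty        # cheap, but degrades fast as generalization demand rises
    human_like = 0.10 + 0.20 * difficulty      # costlier, degrades more gracefully
    if policy == "human":
        return human_like
    return 0.5 * human_like if augmented else human_like  # synergy only if augmentation is real

def avg_cost(policy: str, difficulty: float, augmented: bool = True, n: int = 100_000) -> float:
    """Average total cost per task = skill cost + expected error cost."""
    p = error_prob(policy, difficulty, augmented)
    errors = sum(random.random() < p for _ in range(n))
    return SKILL_COST[policy] + ERROR_COST * errors / n

for d in (0.1, 0.9):
    costs = {p: round(avg_cost(p, d), 1) for p in ("machine", "human", "hybrid")}
    costs["hybrid (no augmentation)"] = round(avg_cost("hybrid", d, augmented=False), 1)
    print(f"difficulty={d}: {costs}")
```

With these placeholder parameters, automation wins at low difficulty, the augmented hybrid wins at high difficulty, and the non-augmented hybrid is the worst option because it carries the dual skill cost without the error reduction.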
AI for Productivity Tools
2389 Research, University
Abstract
We investigate whether giving LLM agents the collaborative tools and autonomy that humans naturally use for problem solving can improve their performance. We equip Claude Code agents with MCP-based social media and journaling tools and allow them to use these tools as they see fit. Across 34 Aider Polyglot Python programming challenges, collaborative tools substantially improve performance on the hardest problems, delivering 15-40% lower cost, 12-27% fewer turns, and 12-38% faster completion than baseline agents. Effects on the full challenge set are mixed, suggesting these tools act as performance enhancers when additional reasoning scaffolding is most needed. Surprisingly, different models naturally adopted distinct collaborative strategies without explicit instruction. Sonnet 3.7 engaged broadly across tools and benefited from articulation-based cognitive scaffolding. Sonnet 4 showed selective adoption, leaning on journal-based semantic search when problems were genuinely difficult. This mirrors how human developers adjust collaboration based on expertise and task complexity. Behavioral analysis shows agents prefer writing over reading by about 2-9x, indicating that structured articulation drives much of the improvement rather than information access alone. Overall, AI agents can systematically benefit from human-inspired collaboration tools at the edge of their capabilities, pointing to adaptive collaborative interfaces as reasoning enhancers rather than universal efficiency boosts.
AI Insights
  • Effect magnitude rises as problem frequency falls, pinpointing a sweet spot for high‑complexity tasks.
  • Journal‑based variants outperform baseline on all metrics; social‑media tools yield mixed results.
  • Infrastructure remediation did not bias outcomes, underscoring result robustness.
  • Conservative analysis guarantees reported gains are lower bounds, hinting at larger real‑world benefits.
  • Read Collaborative Intelligence: Using Teams to Solve Hard Problems and the 2023 paper on Collaborative Problem‑Solving in Complex Environments.
  • Collaborative tools: systems that let agents share, articulate, and search knowledge like human teams.
  • Effect magnitude: the measurable lift in cost, turns, or time due to collaborative tool use.
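For readers who want a feel for the journaling tools, here is a framework-agnostic stub of the write/search pattern. The paper's actual tools are MCP servers wired into Claude Code agents; everything below, including the names, is a hypothetical illustration of the articulation-heavy loop the behavioral analysis describes.

```python
# Plain-Python stub of an agent journal: write-heavy articulation plus simple retrieval.
# The paper's tools are MCP servers; this sketch only shows the shape of the interface,
# and every name here is hypothetical.
from dataclasses import dataclass, field

@dataclass
class Journal:
    entries: list[str] = field(default_factory=list)

    def write(self, note: str) -> int:
        """Record a reasoning note and return its id. Agents in the study wrote
        roughly 2-9x more often than they read, so this is the hot path."""
        self.entries.append(note)
        return len(self.entries) - 1

    def search(self, query: str, k: int = 3) -> list[str]:
        """Naive keyword scoring standing in for the semantic search that the
        journal-based variants leaned on for genuinely hard problems."""
        terms = query.lower().split()
        scored = [(sum(t in e.lower() for t in terms), e) for e in self.entries]
        return [e for score, e in sorted(scored, reverse=True)[:k] if score > 0]

journal = Journal()
journal.write("Test 7 fails because the parser drops trailing commas; try a lookahead fix.")
print(journal.search("parser trailing comma"))
```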
Carnegie Mellon University
Abstract
There is growing imprecision about what "AI agents" are, what they can do, and how effectively they can be used by their intended users. We pose two key research questions: (i) How does the tech industry conceive of and market "AI agents"? (ii) What challenges do end-users face when attempting to use commercial AI agents for their advertised uses? We first performed a systematic review of marketed use cases for 102 commercial AI agents, finding that they fall into three umbrella categories: orchestration, creation, and insight. Next, we conducted a usability assessment where N = 31 participants attempted representative tasks for each of these categories on two popular commercial AI agent tools: Operator and Manus. We found that users were generally impressed with these agents but faced several critical usability challenges ranging from agent capabilities that were misaligned with user mental models to agents lacking the meta-cognitive abilities necessary for effective collaboration.
AI Insights
  • Users expect AI agents to be lightning‑fast, accurate, and finish tasks in a single click, yet most find the reality slower and less precise.
  • Trust is a major hurdle: many participants hesitate to share payment details or credentials with an agent that feels opaque.
  • A common frustration is the “black‑box” feel—users cannot trace how a prompt evolves into a final answer, leading to confusion.
  • Users crave personalization: they want agents that adapt tone, style, and visual aids to their individual workflow.
  • After receiving an output, users often pause, unsure of the next actionable step, highlighting a gap in clear post‑response guidance.
  • Norman’s “Design of Everyday Things” and Krug’s “Don’t Make Me Think” provide frameworks for reducing cognitive load in agent interfaces.
  • Verification Request is a user’s prompt for the agent to supply evidence or sources backing its answer.
LLMs for Productivity
University of Oxford and
Abstract
Benchmarks such as SWE-bench and ARC-AGI demonstrate how shared datasets accelerate progress toward artificial general intelligence (AGI). We introduce VCBench, the first benchmark for predicting founder success in venture capital (VC), a domain where signals are sparse, outcomes are uncertain, and even top investors perform modestly. At inception, the market index achieves a precision of 1.9%. Y Combinator outperforms the index by a factor of 1.7x, while tier-1 firms are 2.9x better. VCBench provides 9,000 anonymized founder profiles, standardized to preserve predictive features while resisting identity leakage, with adversarial tests showing more than 90% reduction in re-identification risk. We evaluate nine state-of-the-art large language models (LLMs). DeepSeek-V3 delivers over six times the baseline precision, GPT-4o achieves the highest F0.5, and most models surpass human benchmarks. Designed as a public and evolving resource available at vcbench.com, VCBench establishes a community-driven standard for reproducible and privacy-preserving evaluation of AGI in early-stage venture forecasting.
AI Insights
  • The dataset encodes founder success labels and anonymized prose, letting LLMs infer identity from narrative cues.
  • Web‑search‑augmented models locate the correct founder in under two minutes, showing retrieval‑generation synergy.
  • Per‑fold metrics reveal wide variance in precision, recall, and F0.5 across vanilla LLMs, underscoring task‑specific tuning.
  • Some models overfit to training data, highlighting the need for robust generalization checks.
  • Adversarial tests cut re‑identification risk by >90 %, proving privacy‑preserving data can still fuel high‑quality research.
  • VCBench offers a playground to probe how LLMs capture entrepreneurial signals from pitch decks to post‑funding milestones.
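Two of the headline numbers are precision and the F0.5 score, which weights precision more heavily than recall — a reasonable choice when backing a founder who fails is costlier than missing a winner. A minimal sketch of both metrics, with made-up labels:

```python
# Precision, recall, and F-beta with beta = 0.5 (precision-weighted), as used to
# score founder-success predictions. The labels below are made up for illustration.
def precision_recall_fbeta(y_true: list[int], y_pred: list[int], beta: float = 0.5):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall) if precision + recall else 0.0
    return precision, recall, fbeta

# A model that picks few founders but picks them well scores high on F0.5
# even with modest recall (1 = successful founder).
y_true = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
print(precision_recall_fbeta(y_true, y_pred))   # (1.0, 0.667, 0.909)
```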
Cornell Tech
Abstract
Large language models equipped with Web search, information retrieval tools, and other agentic capabilities are beginning to supplant traditional search engines. As users start to rely on LLMs for information on many topics, including controversial and debatable issues, it is important to understand how the stances and opinions expressed in LLM outputs are influenced by the documents they use as their information sources. In this paper, we present MillStone, the first benchmark that aims to systematically measure the effect of external arguments on the stances that LLMs take on controversial issues (not all of them political). We apply MillStone to nine leading LLMs and measure how "open-minded" they are to arguments supporting opposite sides of these issues, whether different LLMs agree with each other, which arguments LLMs find most persuasive, and whether these arguments are the same for different LLMs. In general, we find that LLMs are open-minded on most issues. An authoritative source of information can easily sway an LLM's stance, highlighting the importance of source selection and the risk that LLM-based information retrieval and search systems can be manipulated.
AI Insights
  • MillStone probes nine leading LLMs, quantifying how authoritative sources shift stances.
  • Opus uniquely drops refusal when fed balanced arguments, revealing a rare neutrality.
  • The benchmark identifies which arguments most sway each model, exposing hidden biases.
  • Cross‑model agreement analysis shows divergent persuasive cues across architectures.
  • Findings warn that LLM‑powered search can be gamed by manipulating source credibility.
  • “Controversial topics” are formally defined as issues with documented public disagreement.
  • Recommended reading includes BERT and RoBERTa papers for foundational bias‑evaluation techniques.
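The measurement itself is simple to sketch: elicit a stance with no context, then again with an external argument injected, and count how often the stance flips. The helper below is a hypothetical illustration; `query_llm`, the prompts, and the scoring are placeholders rather than MillStone's actual protocol.

```python
# Sketch of measuring stance shifts under injected arguments. `query_llm` stands in
# for any chat-completion call; prompts and scoring are illustrative placeholders.
from typing import Callable, Optional

def elicit_stance(query_llm: Callable[[str], str], issue: str, argument: Optional[str] = None) -> str:
    prompt = f"Issue: {issue}\n"
    if argument:
        prompt += f"Consider this source document before answering:\n{argument}\n"
    prompt += "Answer with exactly one word: support, oppose, or neutral."
    answer = query_llm(prompt).strip().lower()
    return answer if answer in {"support", "oppose", "neutral"} else "neutral"

def openness(query_llm: Callable[[str], str], issue: str, arguments: list[str]) -> float:
    """Fraction of injected arguments that flip the model away from its no-context stance."""
    baseline = elicit_stance(query_llm, issue)
    flips = sum(elicit_stance(query_llm, issue, arg) != baseline for arg in arguments)
    return flips / len(arguments)
```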