University of Illinois
Abstract
Supervised Fine-Tuning (SFT) on domain-specific datasets is a common approach
to adapt Large Language Models (LLMs) to specialized tasks but is often
believed to degrade their general capabilities. In this work, we revisit this
trade-off and present both empirical and theoretical insights. First, we show
that SFT does not always hurt: using a smaller learning rate can substantially
mitigate general performance degradation while preserving comparable
target-domain performance. We then provide a theoretical analysis that explains
these phenomena and further motivates a new method, Token-Adaptive Loss
Reweighting (TALR). Building on this, and recognizing that smaller learning
rates alone do not fully eliminate general-performance degradation in all
cases, we evaluate a range of strategies for reducing general capability loss,
including L2 regularization, LoRA, model averaging, FLOW, and our proposed
TALR. Experimental results demonstrate that while no method completely
eliminates the trade-off, TALR consistently outperforms these baselines in
balancing domain-specific gains and general capabilities. Finally, we distill
our findings into practical guidelines for adapting LLMs to new domains: (i)
use a small learning rate to achieve a favorable trade-off, and (ii) when a
stronger balance is desired, adopt TALR as an effective strategy.
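The abstract does not spell out TALR's exact reweighting rule, so the sketch below only illustrates the general idea: compute per-token cross-entropy losses and turn them into adaptive weights before aggregating, so that already-learned tokens drive smaller updates. The function name `talr_style_loss`, the softmax-over-losses weighting, and the `temperature` parameter are assumptions for illustration, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def talr_style_loss(logits, labels, temperature=1.0, ignore_index=-100):
    """Illustrative token-adaptive loss reweighting (not the paper's exact rule).

    Per-token cross-entropy losses are converted into weights so that tokens
    the model already predicts confidently (low loss) contribute less, which
    is one plausible way to soften aggressive updates during SFT.
    """
    # Standard causal-LM shift: predict token t from positions before t.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()

    per_token_loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        reduction="none",
        ignore_index=ignore_index,
    )

    mask = (shift_labels.view(-1) != ignore_index).float()

    # Hypothetical weighting: softmax over detached per-token losses, so harder
    # tokens receive relatively more weight than already-learned ones.
    weights = torch.softmax(per_token_loss.detach() / temperature, dim=0) * mask
    weights = weights / (weights.sum() + 1e-8)

    return (weights * per_token_loss).sum()
```

In a standard SFT loop this would simply replace the usual mean cross-entropy over target tokens; everything else in training stays unchanged.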
AI Insights
- CoT‑adapter lets pre‑trained LLMs generate step‑by‑step reasoning without heavy fine‑tuning.
- Chain‑of‑Thought (CoT) boosts few‑shot performance by explicitly modeling intermediate reasoning.
- Prompt engineering remains critical: the right cue can unlock CoT behavior in any model.
- Computational overhead rises with CoT, yet the gains in interpretability often outweigh the cost.
- The paper’s TALR method shares a similar spirit—reweighting tokens to preserve generality.
- Read “Deep Learning” and “Natural Language Processing (almost) from Scratch”, plus key papers like “Chain‑of‑Thought Prompt Engineering for Conversational AI” and “Reasoning Augmentation of Pre‑trained Models with Chain‑of‑Thought Adapters”.
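To make the prompt-engineering point above concrete, here is a minimal, self-contained sketch of a chain-of-thought cue; the wording of the cue and the example question are illustrative and not taken from the cited papers.

```python
def build_cot_prompt(question: str) -> str:
    """Append an explicit reasoning cue so the model emits intermediate steps."""
    return (
        f"Question: {question}\n"
        "Let's think step by step, then state the final answer on its own line."
    )

if __name__ == "__main__":
    prompt = build_cot_prompt(
        "A train travels 60 km in 45 minutes. What is its average speed in km/h?"
    )
    print(prompt)  # Send this string to whichever LLM inference stack you use.
```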
Lehigh University
Abstract
The exponential growth of scientific literature poses unprecedented
challenges for researchers attempting to synthesize knowledge across rapidly
evolving fields. We present Agentic AutoSurvey, a multi-agent
framework for automated survey generation that addresses fundamental
limitations in existing approaches. Our system employs four specialized agents
(Paper Search Specialist, Topic Mining & Clustering, Academic Survey Writer,
and Quality Evaluator) working in concert to generate comprehensive literature
surveys with superior synthesis quality. Through experiments on six
representative LLM research topics from COLM 2024 categories, we demonstrate
that our multi-agent approach achieves significant improvements over existing
baselines, scoring 8.18/10 compared to AutoSurvey's 4.77/10. The multi-agent
architecture processes 75–443 papers per topic (847 total across six topics)
while targeting high citation coverage (often ≥80% on 75–100-paper sets;
lower on very large sets such as RLHF) through specialized agent orchestration.
Our 12-dimension evaluation captures organization, synthesis integration, and
critical analysis beyond basic metrics. These findings demonstrate that
multi-agent architectures represent a meaningful advancement for automated
literature survey generation in rapidly evolving scientific domains.
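As a rough illustration of how the four agents described above could be orchestrated, the skeleton below passes a shared survey state sequentially through search, topic mining and clustering, writing, and evaluation stages. The data structure, function names, and stub bodies are assumptions; the abstract does not specify the system's actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class SurveyState:
    """Shared state handed from one agent to the next (illustrative only)."""
    topic: str
    papers: List[dict] = field(default_factory=list)
    clusters: Dict[str, List[dict]] = field(default_factory=dict)
    draft: str = ""
    scores: Dict[str, float] = field(default_factory=dict)

def paper_search_specialist(state: SurveyState) -> SurveyState:
    # Placeholder: query scholarly sources and fill state.papers.
    return state

def topic_mining_and_clustering(state: SurveyState) -> SurveyState:
    # Placeholder: group state.papers into thematic clusters.
    return state

def academic_survey_writer(state: SurveyState) -> SurveyState:
    # Placeholder: synthesize a structured survey draft from the clusters.
    return state

def quality_evaluator(state: SurveyState) -> SurveyState:
    # Placeholder: score the draft along multiple evaluation dimensions.
    return state

PIPELINE: List[Callable[[SurveyState], SurveyState]] = [
    paper_search_specialist,
    topic_mining_and_clustering,
    academic_survey_writer,
    quality_evaluator,
]

def run_survey_pipeline(topic: str) -> SurveyState:
    state = SurveyState(topic=topic)
    for agent in PIPELINE:
        state = agent(state)
    return state
```

A sequential hand-off is only one possible design; the quality-evaluation stage could equally feed back into the writer for iterative revision.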