Papers from 29 September to 3 October 2025

Here are your personalized paper recommendations, sorted by relevance.
AGI
Pingla Institute, Sydney
Abstract
Evaluation of potential AGI systems and methods is difficult due to the breadth of the engineering goal. We have no methods for perfect evaluation of the end state, and instead measure performance on small tests designed to provide directional indication that we are approaching AGI. In this work we argue that AGI evaluation methods have been dominated by a design philosophy that uses our intuitions of what intelligence is to create synthetic tasks, an approach that has performed poorly in the history of AI. Instead, we argue for an alternative design philosophy focused on evaluating robust task execution, which seeks to demonstrate AGI through competence. This perspective is developed from common practices in data science that are used to show that a system can be reliably deployed. We provide practical examples of what this would mean for AGI evaluation.
AI Insights
  • Out‑of‑time cross‑validation is proposed to guard against dataset shift in AGI benchmarks.
  • Leave‑one‑out distinguishability tests whether a model truly generalises beyond memorised samples.
  • Cluster‑data splits expose hidden structure that can inflate performance estimates.
  • The paper argues that robust task execution, not synthetic intuition‑driven tests, should define AGI competence.
  • It stresses a multidisciplinary panel—data scientists, AI theorists, philosophers—to audit evaluation protocols.
  • Practical examples illustrate how to deploy these data‑science checks in real‑world AGI trials.
  • The authors warn that over‑fitting and memorisation can masquerade as intelligence if cross‑validation is ignored.
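The data-science checks listed above (out-of-time validation and cluster-aware splits) can be illustrated with a minimal, stdlib-only sketch; the record layout and split helpers are illustrative assumptions, not code from the paper.

```python
from datetime import date

def out_of_time_split(records, cutoff):
    """Train on records strictly before `cutoff`, test on the rest.
    Guards against dataset shift: the test set simulates future data."""
    train = [r for r in records if r["date"] < cutoff]
    test = [r for r in records if r["date"] >= cutoff]
    return train, test

def cluster_split(records, held_out_cluster):
    """Hold out an entire cluster so shared structure cannot leak
    from train to test and inflate performance estimates."""
    train = [r for r in records if r["cluster"] != held_out_cluster]
    test = [r for r in records if r["cluster"] == held_out_cluster]
    return train, test

# Toy records standing in for benchmark items (hypothetical data).
records = [
    {"date": date(2024, 1, 1), "cluster": "a"},
    {"date": date(2024, 6, 1), "cluster": "a"},
    {"date": date(2025, 1, 1), "cluster": "b"},
    {"date": date(2025, 6, 1), "cluster": "b"},
]
train, test = out_of_time_split(records, date(2025, 1, 1))
```

A benchmark that only reports random-split scores can be re-audited with both helpers: if performance drops sharply under either split, the original estimate was likely inflated by shift or leakage.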
ESADE Business School
Abstract
This note extends Restrepo's (2025) model of economic growth under AGI by incorporating Moravec's Paradox, the observation that tasks requiring sensorimotor skills remain computationally expensive relative to cognitive tasks. We partition the task space into cognitive and physical components with differential automation costs, allowing infinite costs for some physical bottlenecks. Our key result shows that when physical tasks constitute economic bottlenecks with sufficiently high (or infinite) computational requirements, the labor share of income converges to a positive constant in the finite-compute regime (rather than zero). This fundamentally alters the distributional implications of AGI while preserving the growth dynamics for cognitive-intensive economies.
Job Displacement
Abstract
In this paper, I investigate the 2017 labor market reform in Benin, which reduced firing costs and allowed firms to renew short-term contracts indefinitely. Using micro-data from the Harmonized Household Living Standards Surveys and a two-way fixed effect approach with nearby countries as the control group, I assess the reform's impact on employment, worker tenure, contract types, and wages. My empirical results reveal a 2.6 percentage point (24.5 percent) increase in formal sector employment and a 2.8 percentage point (3.2 percent) reduction in informal employment. Formal sector tenure decreased by 0.23 months for short-term contract workers, reflecting higher turnover, while long-term contract tenure increased by 0.15 months. The likelihood of securing a permanent contract rose by 23.2 percentage points (41.6 percent) in the formal sector, indicating that firms used long-term contracts to retain high-productivity workers. Wages in the formal sector increased by 33.6 USD per month on average, with workers on short-term contracts experiencing a wage increase of 19.6 USD and those on long-term contracts seeing an increase of 23.4 USD. I complement these findings with a theoretical job search model, which explains the mechanisms through which lowered firing costs affected firm hiring decisions, market tightness, and the sorting of workers across sectors. This study provides robust evidence of labor market reallocation and highlights the complex trade-offs between flexibility, employment stability, and wages in a developing country context.
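The two-way fixed-effects design in this study reduces, in its simplest 2x2 form, to the textbook difference-in-differences contrast; the sketch below uses made-up employment rates, not the study's microdata.

```python
def did_estimate(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """2x2 difference-in-differences: the change in the treated group
    (Benin) minus the change in the control group (nearby countries)."""
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)

# Hypothetical formal-employment rates (fractions of workers),
# before and after the 2017 reform; numbers chosen for illustration.
effect = did_estimate(treat_pre=0.106, treat_post=0.132,
                      ctrl_pre=0.110, ctrl_post=0.110)
# `effect` here corresponds to a 2.6-percentage-point increase.
```

The paper's full specification adds country and year fixed effects over many periods, which absorb level differences and common shocks the same way the control-group subtraction does in this 2x2 case.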
Changes in the Labor Market
equitablegrowth.org
Abstract
Data from online job postings are difficult to access and are not built in a standard or transparent manner. Data included in the standard taxonomy and occupational information database (O*NET) are updated infrequently and based on small survey samples. We adopt O*NET as a framework for building natural language processing tools that extract structured information from job postings. We publish the Job Ad Analysis Toolkit (JAAT), a collection of open-source tools built for this purpose, and demonstrate its reliability and accuracy in out-of-sample and LLM-as-a-Judge testing. We extract more than 10 billion data points from more than 155 million online job ads provided by the National Labor Exchange (NLx) Research Hub, including O*NET tasks, occupation codes, tools, and technologies, as well as wages, skills, industry, and more features. We describe the construction of a dataset of occupation, state, and industry level features aggregated by monthly active jobs from 2015 - 2025. We illustrate the potential for research and future uses in education and workforce development.
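A toy illustration of the kind of structured extraction such a toolkit performs; this regex-based sketch is an assumption for illustration, not JAAT's actual method, which relies on trained NLP models.

```python
import re

def extract_wage(posting):
    """Pull a '$NN,NNN - $NN,NNN' style wage range (or a single
    dollar figure) out of free-text job-ad copy; returns a tuple
    of floats, or None if nothing matches."""
    m = re.search(r"\$([\d,]+(?:\.\d+)?)\s*(?:-|to)\s*\$([\d,]+(?:\.\d+)?)",
                  posting)
    if m:
        return tuple(float(g.replace(",", "")) for g in m.groups())
    m = re.search(r"\$([\d,]+(?:\.\d+)?)", posting)
    return (float(m.group(1).replace(",", "")),) if m else None

ad = "Registered Nurse needed. Salary $65,000 - $80,000 per year."
wage = extract_wage(ad)
```

Aggregating such extracted fields by occupation, state, and month is what turns 155 million raw ads into the panel dataset the abstract describes.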

We did not find much content matching your interests, so we have included some additional topics that are popular. Also be aware that if a topic is not present on arXiv, we will not be able to recommend it.

AI Agents
Zhejiang University
Abstract
The rise of LLM-powered agents is driving a fundamental transformation in services computing: from static, request-response functions to dynamic, goal-oriented, and autonomous multi-agent ecosystems. In response to this shift, we introduce Agentic Service Computing (ASC), a new paradigm that reimagines services as intelligent, self-adaptive, and socially embedded entities. This comprehensive survey presents a lifecycle-driven framework for ASC, structured around four core phases: Design, Deployment, Operation, and Evolution. We systematically analyze ASC through four foundational research dimensions: (1) Perception, Context, and Environment Modeling, (2) Autonomous Decision-Making and Task Execution, (3) Multi-Agent Collaboration and Organization, and (4) Evaluation, Value Alignment, and Trustworthiness. We examine how these dimensions are instantiated, integrated, and continuously adapted across the service lifecycle. Our synthesis reveals that agentic services are not merely assembled but orchestrated: contextual awareness enables robust deployment; autonomous reasoning supports real-time operation; collaborative structures emerge and evolve through interaction; and trustworthiness must be upheld as a cross-cutting, lifelong imperative. We further identify and discuss emerging trends shaping the future of ASC. By integrating classical principles of services computing with advances in LLM-based multi-agent systems, this work establishes a holistic and forward-looking foundation for ASC. It provides a unified reference for researchers and practitioners aiming to develop adaptive, accountable, and human-centered intelligent services.
AI Insights
  • Federated learning enables privacy‑preserving on‑device updates for agentic services.
  • Formal verification can guarantee safety of autonomous decision modules in multi‑agent ecosystems.
  • Dynamic resource schedulers adapt to workload shifts, preserving QoS in agentic clusters.
  • OpenAPI extensions for agentic interactions standardize cross‑domain collaboration.
  • Benchmarks that score explainability, latency, and trust guide agentic framework comparison.
  • Human‑in‑the‑loop UIs let users steer agentic goals while preserving autonomy.
  • Edge‑centric deployments cut latency and boost resilience for distributed agentic services.
Mass General Brigham, MIT
Abstract
Large language models (LLMs) integrated into agent-driven workflows hold immense promise for healthcare, yet a significant gap exists between their potential and practical implementation within clinical settings. To address this, we present a practitioner-oriented field manual for deploying generative agents that use electronic health record (EHR) data. This guide is informed by our experience deploying the "irAE-Agent", an automated system to detect immune-related adverse events from clinical notes at Mass General Brigham, and by structured interviews with 20 clinicians, engineers, and informatics leaders involved in the project. Our analysis reveals a critical misalignment in clinical AI development: less than 20% of our effort was dedicated to prompt engineering and model development, while over 80% was consumed by the sociotechnical work of implementation. We distill this effort into five "heavy lifts": data integration, model validation, ensuring economic value, managing system drift, and governance. By providing actionable solutions for each of these challenges, this field manual shifts the focus from algorithmic development to the essential infrastructure and implementation work required to bridge the "valley of death" and successfully translate generative AI from pilot projects into routine clinical care.
AI and Society
Yale Department of Economics
Abstract
We study how generative AI affects labor market signaling using the introduction of an AI-powered cover letter writing tool on Freelancer.com. Our data track both access to the tool and usage at the application level. Difference-in-differences estimates show that access to the AI tool increased textual alignment between cover letters and job posts (which we refer to as cover letter tailoring) and raised callback likelihoods. Workers with weaker pre-AI writing skills saw larger improvements in cover letters, indicating that AI substitutes for workers' own skills. Although only a minority of applications used the tool, the overall correlation between cover letter tailoring and callbacks fell by 51%, implying that cover letters became less informative signals of worker ability in the age of AI. Employers correspondingly shifted toward alternative signals, such as workers' past reviews, which became more predictive of hiring. Finally, within the treated group, greater time spent editing AI drafts was associated with higher hiring success.
AI Insights
  • A regression discontinuity design around the platform’s eligibility cutoff isolates the AI tool’s effect.
  • The screening model shows AI tailoring substitutes weaker writers’ effort, reshaping employer weighting.
  • Hiring probability rises 3.5 percentage points, especially for high‑skill candidates.
  • AI cuts editing time by 2.5 minutes per application, yet more editing boosts success.
  • Post‑AI, tailoring‑callback correlation drops 51%, shifting focus to past review scores.
  • Findings suggest policy incentives for AI adoption but warn that over‑reliance may erode narrative value.
  • Read “The Impact of Artificial Intelligence on Society” and the Journal of Labor Economics’ AI special issue for deeper context.
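The "textual alignment" measure behind cover letter tailoring can be approximated with a bag-of-words cosine similarity between a letter and the job post; this stdlib sketch captures the general shape of such a metric and is not the authors' exact implementation.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Bag-of-words cosine similarity as a rough tailoring score
    in [0, 1]: 0 = no shared vocabulary, 1 = identical word counts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical texts for illustration.
job_post = "python developer with machine learning experience"
tailored = "experienced python developer skilled in machine learning"
generic = "hardworking team player seeking new opportunities"
```

A tailored letter shares vocabulary with the posting and scores well above a generic one, which is exactly the signal the paper finds employers discounting once AI can produce it cheaply.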
Abstract
Self-recognition is a crucial metacognitive capability for AI systems, relevant not only for psychological analysis but also for safety, particularly in evaluative scenarios. Motivated by contradictory interpretations of whether models possess self-recognition (Panickssery et al., 2024; Davidson et al., 2024), we introduce a systematic evaluation framework that can be easily applied and updated. Specifically, we measure how well 10 contemporary large language models (LLMs) can identify their own generated text versus text from other models through two tasks: binary self-recognition and exact model prediction. In contrast to prior claims, our results reveal a consistent failure in self-recognition. Only 4 out of 10 models predict themselves as generators, and the performance is rarely above random chance. Additionally, models exhibit a strong bias toward predicting GPT and Claude families. We also provide the first evaluation of model awareness of their own and others' existence, as well as the reasoning behind their choices in self-recognition. We find that models demonstrate some knowledge of their own existence and of other models, but their reasoning reveals a hierarchical bias. They appear to assume that GPT, Claude, and occasionally Gemini are the top-tier models, often associating high-quality text with them. We conclude by discussing the implications of our findings on AI safety and future directions to develop appropriate AI self-awareness.
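The binary self-recognition task can be scored as simple agreement against a 0.5 chance baseline; the per-sample labels below are made up for illustration and are not the paper's data.

```python
def self_recognition_accuracy(predictions, labels):
    """Fraction of samples where the model's claim ('I wrote this'
    vs. 'another model wrote this') matches the true authorship.
    Chance level for this binary task is 0.5."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical outputs: 1 = "I wrote this", 0 = "another model did".
labels = [1, 1, 0, 0, 1, 0, 0, 1]
predictions = [0, 1, 1, 0, 0, 1, 0, 0]  # a model guessing poorly
acc = self_recognition_accuracy(predictions, labels)
```

Scores at or below 0.5 on held-out generations are the kind of "rarely above random chance" result the abstract reports.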
Research Automation with AI
MIT Media Lab, USA
Abstract
Human-AI interaction researchers face an overwhelming challenge: synthesizing insights from thousands of empirical studies to understand how AI impacts people and inform effective design. Existing approaches to literature reviews cluster papers by similarity, keywords, or citations, missing the crucial cause-and-effect relationships that reveal how design decisions impact user outcomes. We introduce the Atlas of Human-AI Interaction, an interactive web interface that provides the first systematic mapping of empirical findings across 1,000+ HCI papers using LLM-powered knowledge extraction. Our approach identifies causal relationships and visualizes them through an AI-enabled interactive web interface as a navigable knowledge graph. We extracted 2,037 empirical findings, revealing research topic clusters, common themes, and disconnected areas. Expert evaluation with 20 researchers revealed the system's effectiveness for discovering research gaps. This work demonstrates how AI can transform literature synthesis itself, offering a scalable framework for evidence-based design and opening new possibilities for computational meta-science across HCI and beyond.
Northwestern University
Abstract
Investigative journalists routinely confront large document collections. Large language models (LLMs) with retrieval-augmented generation (RAG) capabilities promise to accelerate the process of document discovery, but newsroom adoption remains limited due to hallucination risks, verification burden, and data privacy concerns. We present a journalist-centered approach to LLM-powered document search that prioritizes transparency and editorial control through a five-stage pipeline -- corpus summarization, search planning, parallel thread execution, quality evaluation, and synthesis -- using small, locally-deployable language models that preserve data security and maintain complete auditability through explicit citation chains. Evaluating three quantized models (Gemma 3 12B, Qwen 3 14B, and GPT-OSS 20B) on two corpora, we find substantial variation in reliability. All models achieved high citation validity and ran effectively on standard desktop hardware (e.g., 24 GB of memory), demonstrating feasibility for resource-constrained newsrooms. However, systematic challenges emerged, including error propagation through multi-stage synthesis and dramatic performance variation based on training data overlap with corpus content. These findings suggest that effective newsroom AI deployment requires careful model selection and system design, alongside human oversight for maintaining standards of accuracy and accountability.
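The five-stage pipeline can be sketched as composable stages threading a state dict that carries an explicit citation chain for auditability; the stage stubs below are hypothetical placeholders, not the authors' implementation.

```python
def run_pipeline(corpus, query, stages):
    """Run stages in order, threading a shared state dict that keeps
    an explicit citation chain so every claim stays auditable."""
    state = {"corpus": corpus, "query": query, "citations": []}
    for stage in stages:
        state = stage(state)
    return state

def summarize(state):
    # Stage 1 stub: corpus summarization.
    state["summary"] = f"{len(state['corpus'])} documents"
    return state

def synthesize(state):
    # Final stage stub: every claim records its source document.
    state["citations"] = [f"doc:{i}" for i, _ in enumerate(state["corpus"])]
    state["answer"] = f"Synthesis of {state['summary']}"
    return state

result = run_pipeline(["memo.txt", "email.txt"], "who knew?",
                      [summarize, synthesize])
```

In the paper's full design, search planning, parallel thread execution, and quality evaluation would sit between these two stubs, each reading and extending the same citation chain.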
AGI: Artificial General Intelligence
Zhejiang University, Alibaba
Abstract
Data-analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI. Current approaches, however, rely heavily on prompt engineering over proprietary models, while open-source models struggle with the diverse-format, large-scale data files and long-horizon, multi-step reasoning that real-world analytics demands. This paper introduces DataMind, a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents. DataMind tackles three key challenges in building open-source data-analytic agents: insufficient data resources, improper training strategy, and unstable code-based multi-turn rollout. Concretely, DataMind applies 1) a fine-grained task taxonomy and a recursive easy-to-hard task composition mechanism to increase the diversity and difficulty of synthesized queries; 2) a knowledge-augmented trajectory sampling strategy followed by model-based and rule-based filtering; 3) a dynamically adjustable training objective combining both SFT and RL losses; 4) a memory-frugal and stable code-based multi-turn rollout framework. Built on DataMind, we curate DataMind-12K, a high-quality trajectory set spanning diverse domains, task categories, and data file formats for data-analytic tasks. Trained on DataMind-12K, our DataMind-14B achieves state-of-the-art with an average score of 71.16% on multiple data analysis benchmarks, outperforming the strongest proprietary baselines DeepSeek-V3.1 and GPT-5. Our DataMind-7B also performs best among all open-source models with a score of 68.10%. We also incorporate some empirical insights gained from our exploratory trials into the analysis experiments, aiming to provide actionable insights about agentic training for the community. We will release DataMind-12K, DataMind-7B, and DataMind-14B for the community's future research.
AI Insights
  • The paper reveals a modular prompt framework that orchestrates judge and trajectory sampling models for systematic evaluation of data‑analytic agents.
  • It introduces a hierarchical tagging scheme that encodes output formatting, enabling automated parsing of multi‑turn code rollouts.
  • The evaluation protocol explicitly separates question generation, answer generation, and answer assessment, mirroring a formal peer‑review workflow.
  • A key insight is that the prompts embed domain‑specific constraints, allowing the agent to adapt to diverse data‑analysis tasks without manual re‑engineering.
  • The authors note that the formal language of the prompts may hinder accessibility, suggesting future work on user‑friendly interfaces.
  • The lack of contextual background in the prompts highlights a gap that could be bridged by integrating explanatory metadata into the prompt design.
  • The framework’s reliance on pre‑defined evaluation tags points to an opportunity for automated metric extraction and continuous learning.
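The "dynamically adjustable training objective combining both SFT and RL losses" can be sketched as a scheduled mixture of the two loss terms; the linear annealing rule below is an assumption for illustration, not DataMind's published schedule.

```python
def mixed_loss(sft_loss, rl_loss, step, total_steps):
    """Blend SFT and RL losses with a weight that anneals from
    mostly-SFT (imitation) early in training to mostly-RL (reward
    optimization) later. The linear decay is an assumed schedule."""
    lam = max(0.0, 1.0 - step / total_steps)
    return lam * sft_loss + (1.0 - lam) * rl_loss

# Hypothetical loss values at the start and end of training.
early = mixed_loss(sft_loss=2.0, rl_loss=0.5, step=0, total_steps=100)
late = mixed_loss(sft_loss=2.0, rl_loss=0.5, step=100, total_steps=100)
```

The design intuition is standard: imitation stabilizes early training before reward signals take over, which is one way a "dynamically adjustable" objective can be realized.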
Deep Learning
Nankai University, Tianjin
Abstract
In deep learning, dense layer connectivity has become a key design principle in deep neural networks (DNNs), enabling efficient information flow and strong performance across a range of applications. In this work, we model densely connected DNNs mathematically and analyze their learning problems in the deep-layer limit. For a broad applicability, we present our analysis in a framework setting of DNNs with densely connected layers and general non-local feature transformations (with local feature transformations as special cases) within layers, which is called dense non-local (DNL) framework and includes standard DenseNets and variants as special examples. In this formulation, the densely connected networks are modeled as nonlinear integral equations, in contrast to the ordinary differential equation viewpoint commonly adopted in prior works. We study the associated training problems from an optimal control perspective and prove convergence results from the network learning problem to its continuous-time counterpart. In particular, we show the convergence of optimal values and the subsequence convergence of minimizers, using a piecewise linear extension and $\Gamma$-convergence analysis. Our results provide a mathematical foundation for understanding densely connected DNNs and further suggest that such architectures can offer stability of training deep models.
AI Insights
  • Forward‑backward‑splitting networks converge in the deep‑layer limit, revealing new stability insights.
  • Learned primal‑dual schemes are modeled as dynamical systems with a linear operator K, enabling Lyapunov‑style analysis.
  • A piecewise‑linear extension links discrete layers to continuous time, yielding a Γ‑convergence proof of optimal values.
  • The framework includes DenseNet variants and non‑local feature transforms, suggesting unexplored hybrid architectures.
  • Brunner’s “Volterra Integral Equations” and Braides’ “Γ‑convergence for Beginners” are key resources for the theory.
  • The work builds on Haber, Lu, and Ruthotto’s PDE‑inspired DNN research, situating it in physics‑informed deep learning.
  • Though mathematically dense, the paper encourages experimenting with forward‑backward‑splitting layers for future stability gains.
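A minimal sketch of the integral-equation viewpoint, assuming the standard continuum limit of dense skip connections (the paper's exact kernel and notation may differ): because each layer aggregates transformed states from all earlier layers, the deep-layer limit is a Volterra-type nonlinear integral equation rather than the ODE used for residual networks.

```latex
% Assumed general form: the state at depth t depends on the
% transformed states at all earlier depths s < t.
x(t) = x(0) + \int_{0}^{t} f\bigl(t, s, x(s); \theta(s)\bigr)\, ds
```

Setting f to depend only on s and x(s) recovers the ODE limit of plain residual connections, which is why dense connectivity genuinely needs the integral-equation treatment.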
University College London
Abstract
The Information Bottleneck (IB) principle offers a compelling theoretical framework to understand how neural networks (NNs) learn. However, its practical utility has been constrained by unresolved theoretical ambiguities and significant challenges in accurate estimation. In this paper, we present a Generalized Information Bottleneck (GIB) framework that reformulates the original IB principle through the lens of synergy, i.e., the information obtainable only through joint processing of features. We provide theoretical and empirical evidence demonstrating that synergistic functions achieve superior generalization compared to their non-synergistic counterparts. Building on these foundations, we reformulate the IB using a computable definition of synergy based on the average interaction information (II) of each feature with those remaining. We demonstrate that the original IB objective is upper bounded by our GIB in the case of perfect estimation, ensuring compatibility with existing IB theory while addressing its limitations. Our experimental results demonstrate that GIB consistently exhibits compression phases across a wide range of architectures (including those with ReLU activations where the standard IB fails), while yielding interpretable dynamics in both CNNs and Transformers and aligning more closely with our understanding of adversarial robustness.
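For context, the original IB objective that the paper generalizes is the standard trade-off between compressing the input and predicting the target; the GIB then replaces the compression term with the synergy-based quantity described above (the exact GIB form is given in the paper, not here).

```latex
% Standard Information Bottleneck: learn a representation T of the
% input X that is maximally compressed while staying predictive of
% the target Y; beta controls the trade-off.
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta\, I(T;Y)
```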

Interests not found

We did not find any papers that match the interests below. Try other terms, and consider whether the content exists on arXiv.org.
  • AGI Applications
  • AGI Research
  • AGI Development
You can edit or add more interests any time.

Unsubscribe from these updates