Hi!

Your personalized paper recommendations for 01 to 05 December, 2025.
AI Air Consumption
Mercor
Paper visualization
Abstract
We introduce the first version of the AI Consumer Index (ACE), a benchmark for assessing whether frontier AI models can perform high-value consumer tasks. ACE contains a hidden heldout set of 400 test cases, split across four consumer activities: shopping, food, gaming, and DIY. We are also open sourcing 80 cases as a devset with a CC-BY license. For the ACE leaderboard we evaluated 10 frontier models (with websearch turned on) using a novel grading methodology that dynamically checks whether relevant parts of the response are grounded in the retrieved web sources. GPT 5 (Thinking = High) is the top-performing model, scoring 56.1%, followed by o3 Pro (Thinking = On) (55.2%) and GPT 5.1 (Thinking = High) (55.1%). Models differ across domains, and in Shopping the top model scores under 50%. For some requests (such as giving the correct price or providing working links), models are highly prone to hallucination. Overall, ACE shows a substantial gap between the performance of even the best models and consumers' AI needs.
AI Summary
  • The ACE benchmark is a comprehensive evaluation of conversational AI models, covering various domains such as DIY, food, gaming, and shopping. [3]
  • The benchmark consists of multiple workflows for each domain, with specific instructions and criteria for the models to follow. [3]
  • Gemini 2.5 Flash (On), Gemini 3 Pro (High), and o3 (On) also demonstrate strong performance across various domains. [3]
  • The benchmark highlights the strengths and weaknesses of each model, providing valuable insights for developers and researchers to improve their conversational AI systems. [3]
  • ACE-v1-heldout: A subset of the ACE benchmark used for evaluation, consisting of 100 cases per domain. [3]
  • Bootstrapped confidence intervals: A statistical method used to estimate the uncertainty of mean scores by resampling with replacement from the original dataset (a short sketch follows after this list). [3]
  • Domain: A specific category or area of expertise within the ACE benchmark, such as DIY, food, gaming, or shopping. [3]
  • Model: A conversational AI system being evaluated on the ACE benchmark, including models like Gemini 2.5 Flash (On), GPT-5 (High), and o3 (On). [3]
  • The ACE benchmark provides a comprehensive evaluation of conversational AI models across various domains. [3]
  • The results show that GPT-5 (High) and GPT-5.1 (High) perform exceptionally well in most domains, achieving high mean scores and confidence intervals. [2]
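As a minimal illustration of the bootstrapped confidence intervals mentioned above, the Python sketch below computes a percentile bootstrap CI for a domain's mean score. The per-case scores and all parameters are made up for the example; this is not the ACE evaluation code.

```python
import numpy as np

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-case scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Resample with replacement and record the mean of each resample.
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), (lo, hi)

# Illustrative per-case scores for one domain (values are made up).
shopping_scores = np.random.default_rng(1).uniform(0, 1, size=100)
mean, (lo, hi) = bootstrap_ci(shopping_scores)
print(f"mean={mean:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```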
Zhejiang University
Abstract
LLMs and Agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible and long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.
AI Summary
  • The proposed benchmark may not capture all aspects of human intelligence, such as common sense or creativity. [3]
  • These models are like super-smart computers that can understand and generate human-like text. [3]
  • The authors want to make sure these models can solve math problems, which is an important part of being intelligent. [3]
  • The paper discusses the challenges of evaluating large language models (LLMs) and proposes a new benchmark for measuring their performance. [2]
AI Energy Consumption
arXiv
Abstract
Energy system models are increasingly employed to guide long-term planning in multi-sectoral environments where decisions span electricity, heat, transport, land use, and industry. While these models provide rigorous quantitative insights, their outputs are often highly technical, making them difficult to interpret for non-expert stakeholders such as policymakers, planners, and the public. This communication gap limits the accessibility and practical impact of scenario-based modeling, particularly as energy transitions grow more complex with rising shares of renewables, sectoral integration, and deep uncertainties. To address this challenge, we propose the Renewable Energy Large Language Model (RE-LLM), a hybrid framework that integrates Large Language Models (LLMs) directly into the energy system modeling workflow. RE-LLM combines three core elements: (i) optimization-based scenario exploration, (ii) machine learning surrogates that accelerate computationally intensive simulations, and (iii) LLM-powered natural language generation that translates complex results into clear, stakeholder-oriented explanations. This integrated design not only reduces computational burden but also enhances interpretability, enabling real-time reasoning about trade-offs, sensitivities, and policy implications. The framework is adaptable across different optimization platforms and energy system models, ensuring broad applicability beyond the case study presented. By merging speed, rigor, and interpretability, RE-LLM advances a new paradigm of human-centric energy modeling. It enables interactive, multilingual, and accessible engagement with future energy pathways, ultimately bridging the final gap between data-driven analysis and actionable decision-making for sustainable transitions.
AI Summary
  • The paper presents a framework for horizon-based optimization and multi-factor sensitivity analysis using machine learning models. [2]
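To make the surrogate-plus-explanation idea in the RE-LLM abstract concrete, here is a hypothetical Python sketch: a fast regression surrogate is fitted to samples from a stand-in "expensive" simulation, queried for a new scenario, and the result is turned into a plain-language sentence (the step an LLM would handle in RE-LLM). The simulator, variables, and numbers are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in for an expensive energy-system simulation (hypothetical):
# maps (renewable share, carbon price) to total system cost.
def expensive_simulation(renewable_share, carbon_price):
    return 100 - 30 * renewable_share + 0.2 * carbon_price + np.random.normal(0, 1)

# 1) Sample a few scenarios and fit a fast surrogate.
rng = np.random.default_rng(0)
X = rng.uniform([0.2, 20], [0.9, 120], size=(200, 2))
y = np.array([expensive_simulation(r, c) for r, c in X])
surrogate = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# 2) Query the surrogate instead of the simulator for a new scenario.
scenario = np.array([[0.7, 80.0]])
cost = surrogate.predict(scenario)[0]

# 3) Template a stakeholder-oriented explanation (an LLM would do this in RE-LLM).
print(f"At a 70% renewable share and a carbon price of 80 EUR/t, "
      f"the surrogate estimates a total system cost of about {cost:.1f} units.")
```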
AI Impacts on Society
Polytechnic Institute of
Abstract
This article introduces the concept of the 'dual footprint' as a heuristic device to capture the commonalities and interdependencies between the different impacts of artificial intelligence (AI) on the natural and social surroundings that supply resources for its production and use. Two in-depth case studies, each illustrating international flows of raw materials and of data work services, portray the AI industry as a value chain that spans national boundaries and perpetuates inherited global inequalities. The countries that drive AI development generate a massive demand for inputs and trigger social costs that, through the value chain, largely fall on more peripheral actors. The arrangements in place distribute the costs and benefits of AI unequally, resulting in unsustainable practices and preventing the upward mobility of more disadvantaged countries. The dual footprint grasps how these environmental and social dimensions emanate from similar underlying socioeconomic processes and geographical trajectories.
AI Summary
  • The carbon (and water) footprints of data centre functioning, model training, and inference mainly occur in countries that lead AI development, such as the United States and France. [3]
  • The supply of data work for countries like the United States and France comes from areas with lower labour costs, including middle- and lower-income countries like Argentina and Madagascar. [3]
  • The 'dual' nature of the footprint is illuminated by the fact that the same country exports both mining products and data work services, with imports flowing towards countries leading the worldwide AI race. [3]
  • AI value chain: The series of activities involved in developing and deploying artificial intelligence systems, from raw materials extraction to software development and deployment. [3]
  • Carbon footprint: The amount of greenhouse gas emissions associated with a particular activity or product. [3]
  • The analysis takes a step back from stricter interpretations of the footprint concept as an accounting method and instead focuses on a bird's eye view, revealing who is impacted by pressure on resources and related effects spread along the AI value chain. [2]
University of Tunis
Abstract
Two 2025 publications, "AI 2027" (Kokotajlo et al., 2025) and "If Anyone Builds It, Everyone Dies" (Yudkowsky & Soares, 2025), assert that superintelligent artificial intelligence will almost certainly destroy or render humanity obsolete within the next decade. Both rest on the classic chain formulated by Good (1965) and Bostrom (2014): intelligence explosion, superintelligence, lethal misalignment. This article subjects each link to the empirical record of 2023-2025. Sixty years after Good's speculation, none of the required phenomena (sustained recursive self-improvement, autonomous strategic awareness, or intractable lethal misalignment) have been observed. Current generative models remain narrow, statistically trained artefacts: powerful, opaque, and imperfect, but devoid of the properties that would make the catastrophic scenarios plausible. Following Whittaker (2025a, 2025b, 2025c) and Zuboff (2019, 2025), we argue that the existential-risk thesis functions primarily as an ideological distraction from the ongoing consolidation of surveillance capitalism and extreme concentration of computational power. The thesis is further inflated by the 2025 AI speculative bubble, where trillions in investments in rapidly depreciating "digital lettuce" hardware (McWilliams, 2025) mask lagging revenues and jobless growth rather than heralding superintelligence. The thesis remains, in November 2025, a speculative hypothesis amplified by a speculative financial bubble rather than a demonstrated probability.
AI Water Consumption
Stanford University
Paper visualization
Abstract
The energy transition through increased electrification has put the world's attention on critical mineral exploration. Even with increased investments, a decrease in new discoveries has taken place over the last two decades. Here I propose a solution to this problem in which AI is implemented as the enabler of a rigorous scientific method for mineral exploration that aims to reduce cognitive bias and false positives and drive down the cost of exploration. I propose a new scientific method that is based on a philosophical approach founded on the principles of Bayesianism and falsification. In this approach, data acquisition is in the first place seen as a means to falsify human-generated hypotheses. The decision of what data to acquire next is quantified with verifiable metrics and based on rational decision making. A practical protocol is provided that can be used as a template in any exploration campaign. However, in order to make this protocol practical, various forms of artificial intelligence are needed. I will argue that the most important forms are (one) novel unsupervised learning methods that collaborate with domain experts to better understand data and generate multiple competing geological hypotheses, and (two) human-in-the-loop AI algorithms that can optimally plan various geological, geophysical, geochemical and drilling data acquisition, where uncertainty reduction of geological hypotheses precedes the uncertainty reduction on grade and tonnage.
AI Summary
  • Efficacy of information (EI): a metric that quantifies how much future data will reduce uncertainty on average on some quantity of interest (a toy calculation follows after this list). [3]
  • The author advocates for a new scientific method for mineral exploration, focusing on decision-making rather than traditional geophysical inversion. [2]
  • Epistemic uncertainty: the lack of understanding we still have about the nature of orebodies. [1]
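The "efficacy of information" idea above can be illustrated with a toy Bayesian calculation: the expected reduction in entropy over competing geological hypotheses from one planned binary survey. The priors and likelihoods below are invented; this is not the paper's protocol.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Two competing geological hypotheses with prior probabilities (made up).
prior = np.array([0.6, 0.4])
# Likelihood of a binary survey outcome ("anomaly present") under each hypothesis.
p_anomaly_given_h = np.array([0.8, 0.3])

# Expected efficacy of information: prior entropy minus expected posterior entropy.
p_anomaly = (prior * p_anomaly_given_h).sum()
post_if_anomaly = prior * p_anomaly_given_h / p_anomaly
post_if_none = prior * (1 - p_anomaly_given_h) / (1 - p_anomaly)
expected_posterior_entropy = (p_anomaly * entropy(post_if_anomaly)
                              + (1 - p_anomaly) * entropy(post_if_none))
ei = entropy(prior) - expected_posterior_entropy
print(f"Expected uncertainty reduction from the survey: {ei:.3f} bits")
```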
AI for Social Equality
University of Bologna
Paper visualization
Abstract
Machines have at times equalized physical strength by substituting for human effort, and at other times amplified these differences. Artificial intelligence (AI) may likewise narrow or widen disparities in cognitive ability. Recent evidence from the Information and Communication Technology (ICT) revolution suggests that computers increased inequality by education but reduced it by cognitive ability. Early research on generative AI shows larger productivity gains for less-skilled than for high-skilled workers. Whether AI ultimately acts as an equalizer or an amplifier of human cognitive differences is especially crucial for education systems, which must decide whether -- and how -- to allow students to use AI in coursework and exams. This decision is urgent because employers value workers who can leverage AI effectively rather than operate independently of it.
AI Summary
  • The article discusses the impact of Artificial Intelligence (AI) on human cognition and creativity, with a focus on its effects on education and policy. [3]
  • A study by Ichino et al. found that college attendance has increased social mobility within and across generations in the UK between 1960 and 2004. [3]
  • Acemoglu and Autor argue that skills, tasks, and technologies have implications for employment and earnings, while Tafti's work suggests that technology, skills, and performance are interconnected. [3]
  • The article also references research on the effects of AI feedback on learning, skill gaps, and intellectual diversity, as well as studies on generative AI at work and its impact on productivity and creativity. [3]
  • Finally, it mentions a study by Kosmyna et al. that found accumulation of cognitive debt when using an AI assistant for essay writing tasks. [3]
  • Cognitive ability: The capacity to process information, learn from experience, and adapt to new situations. [3]
  • Generative AI: A type of AI that can create new content, such as text, images, or music, based on a given prompt or input. [3]
  • Creativity: The ability to generate new ideas, products, or solutions. [1]
IBM
Abstract
The rapid deployment of large language model (LLM)-based agents introduces a new class of risks, driven by their capacity for autonomous planning, multi-step tool integration, and emergent interactions. This raises risk factors that existing governance approaches, which remain fragmented, struggle to address: existing frameworks are largely static and taxonomy-driven, and they lack an integrated end-to-end pipeline from risk identification to operational assurance, especially for agentic platforms. We propose AGENTSAFE, a practical governance framework for LLM-based agentic systems. The framework operationalises the AI Risk Repository into design, runtime, and audit controls, offering a governance framework for risk identification and assurance. The proposed framework, AGENTSAFE, profiles agentic loops (plan -> act -> observe -> reflect) and toolchains, and maps risks onto structured taxonomies extended with agent-specific vulnerabilities. It introduces safeguards that constrain risky behaviours, escalates high-impact actions to human oversight, and evaluates systems through pre-deployment scenario banks spanning security, privacy, fairness, and systemic safety. During deployment, AGENTSAFE ensures continuous governance through semantic telemetry, dynamic authorization, anomaly detection, and interruptibility mechanisms. Provenance and accountability are reinforced through cryptographic tracing and organizational controls, enabling measurable, auditable assurance across the lifecycle of agentic AI systems. The key contributions of this paper are: (1) a unified governance framework that translates risk taxonomies into actionable design, runtime, and audit controls; (2) an Agent Safety Evaluation methodology that provides measurable pre-deployment assurance; and (3) a set of runtime governance and accountability mechanisms that institutionalise trust in agentic AI ecosystems.
AI Summary
  • AGENTSAFE is an ethics-grounded governance framework that translates abstract safety principles into concrete, testable, and auditable practices. [2]
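As a rough sketch of the "escalate high-impact actions to human oversight" control described above, the snippet below gates one act step of an agent loop on a risk score and a human approver callback. The risk scores, threshold, and action names are hypothetical, not AGENTSAFE's implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    name: str
    risk: float  # 0.0 (benign) to 1.0 (high impact), assigned upstream by a risk profiler

def governed_step(action: Action,
                  execute: Callable[[Action], str],
                  approver: Callable[[Action], bool],
                  risk_threshold: float = 0.7) -> str:
    """Execute one 'act' step, escalating high-impact actions to a human approver."""
    if action.risk >= risk_threshold and not approver(action):
        return f"blocked pending review: {action.name}"
    return execute(action)

# Illustrative use with stand-in callables.
execute = lambda a: f"executed: {a.name}"
deny_all = lambda a: False   # stand-in for a human reviewer who has not approved yet

print(governed_step(Action("read_dashboard", 0.1), execute, deny_all))
print(governed_step(Action("rotate_prod_credentials", 0.9), execute, deny_all))
```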
AI for Social Equity
Massachusetts Institute
Abstract
Beneficial societal outcomes cannot be guaranteed by aligning individual AI systems with the intentions of their operators or users. Even an AI system that is perfectly aligned to the intentions of its operating organization can lead to bad outcomes if the goals of that organization are misaligned with those of other institutions and individuals. For this reason, we need full-stack alignment, the concurrent alignment of AI systems and the institutions that shape them with what people value. This can be done without imposing a particular vision of individual or collective flourishing. We argue that current approaches for representing values, such as utility functions, preference orderings, or unstructured text, struggle to address these and other issues effectively. They struggle to distinguish values from other signals, to support principled normative reasoning, and to model collective goods. We propose thick models of value will be needed. These structure the way values and norms are represented, enabling systems to distinguish enduring values from fleeting preferences, to model the social embedding of individual choices, and to reason normatively, applying values in new domains. We demonstrate this approach in five areas: AI value stewardship, normatively competent agents, win-win negotiation systems, meaning-preserving economic mechanisms, and democratic regulatory institutions.
AI for Social Fairness
University of Pittsburgh
Abstract
Trustworthy machine learning in healthcare requires strong predictive performance, fairness, and explanations. While it is known that improving fairness can affect predictive performance, little is known about how fairness improvements influence explainability, an essential ingredient for clinical trust. Clinicians may hesitate to rely on a model whose explanations shift after fairness constraints are applied. In this study, we examine how enhancing fairness through bias mitigation techniques reshapes Shapley-based feature rankings. We quantify changes in feature importance rankings after applying fairness constraints across three datasets: pediatric urinary tract infection risk, direct anticoagulant bleeding risk, and recidivism risk. We also evaluate multiple model classes on the stability of Shapley-based rankings. We find that increasing model fairness across racial subgroups can significantly alter feature importance rankings, sometimes in different ways across groups. These results highlight the need to jointly consider accuracy, fairness, and explainability in model assessment rather than in isolation.
AI Summary
  • Applying bias mitigation can substantially alter feature importance rankings, especially in complex, nonlinear models. [3]
  • Explainability: The ability to understand how a machine learning model makes predictions or decisions. [3]
  • SHAP (SHapley Additive exPlanations): An algorithm for explaining the output of a machine learning model by assigning a value to each feature for a specific prediction (a small ranking-comparison sketch follows after this list). [3]
  • The study highlights the complex trade-off between performance, fairness, and explainability in ML. [2]
  • EOD (Equal Opportunity Difference): A fairness metric that measures the gap in true positive rates between demographic groups. [1]
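To show how a shift in Shapley-based rankings might be quantified (as the study above does across fairness constraints), the sketch below compares two made-up importance vectors by rank displacement and Spearman correlation. The features and values are invented, not taken from the paper's datasets.

```python
import numpy as np
from scipy.stats import spearmanr

features = ["age", "prior_history", "lab_value", "sex", "site"]
# Mean |SHAP| importances before and after bias mitigation (values are made up).
shap_before = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
shap_after  = np.array([0.10, 0.28, 0.32, 0.12, 0.18])

rank_before = np.argsort(np.argsort(-shap_before))  # 0 = most important
rank_after  = np.argsort(np.argsort(-shap_after))

rho, _ = spearmanr(shap_before, shap_after)
print(f"Spearman correlation of importances: {rho:.2f}")
for f, rb, ra in zip(features, rank_before, rank_after):
    shift = int(ra) - int(rb)
    print(f"{f:>14}: rank {int(rb)} -> {int(ra)} (shift {shift:+d})")
```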
TU Delft
Abstract
This chapter discusses the ethics of generative AI. It provides a technical primer to show how generative AI affords experiencing technology as if it were human, and this affordance provides a fruitful focus for the philosophical ethics of generative AI. It then shows how generative AI can both aggravate and alleviate familiar ethical concerns in AI ethics, including responsibility, privacy, bias and fairness, and forms of alienation and exploitation. Finally, the chapter examines ethical questions that arise specifically from generative AI's mimetic generativity, such as debates about authorship and credit, the emergence of as-if social relationships with machines, and new forms of influence, persuasion, and manipulation.
AI Summary
  • Generative AI systems can produce outputs that resemble meaningful human expression, making it natural for users to experience their outputs as if they were intentional or expressive. [3]
  • The affordance of experience-as-real is a feature that distinguishes generative AI from other forms of machine learning and connects the analysis to existing work in moral psychology and the philosophy of technology. [3]
  • Generative AI turns abstract philosophical puzzles into urgent design problems. [3]
  • Affordance: A feature that makes it natural for users to experience a technological system in a certain way. [3]
  • In this case, the affordance is the ability of generative AI systems to produce outputs that resemble meaningful human expression. [3]
  • Generative AI: A type of machine learning that enables systems to generate new content, such as text or images, based on patterns and structures learned from large datasets. [3]
  • The affordance of experience-as-real is a key feature of generative AI systems that has significant ethical implications. [3]
  • Generative AI raises questions about the value of authorship, the basis of interpersonal relationships, and the nature of permissible influence. [3]
  • Reflection on generative AI may shed light on aspects of human agency and communication that are often taken for granted. [3]
  • The article does not provide a comprehensive overview of the current state of research on generative AI. [3]
  • Ethical evaluation must take account of how users interpret and respond to system behavior, not only of the systems' internal mechanisms. [2]
AI for Social Good
ulamai
Abstract
We extend the moduli-theoretic framework of psychometric batteries to the domain of dynamical systems. While previous work established the AAI capability score as a static functional on the space of agent representations, this paper formalizes the agent as a flow $ν_r$ parameterized by computational resource $r$, governed by a recursive Generator-Verifier-Updater (GVU) operator. We prove that this operator generates a vector field on the parameter manifold $Θ$, and we identify the coefficient of self-improvement $κ$ as the Lie derivative of the capability functional along this flow. The central contribution of this work is the derivation of the Variance Inequality, a spectral condition that is sufficient (under mild regularity) for the stability of self-improvement. We show that a sufficient condition for $κ> 0$ is that, up to curvature and step-size effects, the combined noise of generation and verification must be small enough. We then apply this formalism to unify the recent literature on Language Self-Play (LSP), Self-Correction, and Synthetic Data bootstrapping. We demonstrate that architectures such as STaR, SPIN, Reflexion, GANs and AlphaZero are specific topological realizations of the GVU operator that satisfy the Variance Inequality through filtration, adversarial discrimination, or grounding in formal systems.
AI Summary
  • The GVU framework is used to analyze the stability of self-improvement in AI systems. [3]
  • The Variance Inequality (Theorem 4.1) provides a sufficient condition for stable self-improvement, requiring a high Signal-to-Noise Ratio (SNR) for both the generator and the verifier (a toy simulation follows after this list). [3]
  • AI slop event (at parameter θ), AI slop mass, and slop regime: related terms the paper defines around the AI slop notion below. [3]
  • The paper provides a framework for understanding the stability of self-improvement in AI systems, highlighting the importance of high SNR for both generators and verifiers. [3]
  • The paper defines AI slop as a region where the internal Verifier ranks outputs among its top fraction, but they actually lie in the bottom fraction of the true battery score. [2]
  • The paper introduces the Generator-Verifier-Updater (GVU) operator, which models the interaction between a generator and its verifier. [1]
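A toy numerical illustration of the SNR intuition above (my own construction, not the paper's formalism): a scalar capability parameter is updated toward the candidate a noisy verifier ranks highest; with low verifier noise the loop improves steadily, while with very high verifier noise the average gain collapses.

```python
import numpy as np

def gvu_gain(gen_noise, ver_noise, steps=2000, n_candidates=8, lr=0.1, seed=0):
    """Toy Generator-Verifier-Updater loop on a 1-D 'capability' parameter theta.

    True quality of a candidate is -(theta - 5)^2.  The generator proposes noisy
    perturbations of theta, the verifier scores them with additive noise, and the
    updater keeps the candidate the verifier ranks highest.  Returns the net
    change in true quality over the run.
    """
    rng = np.random.default_rng(seed)
    quality = lambda t: -(t - 5.0) ** 2
    theta = 0.0
    for _ in range(steps):
        candidates = theta + lr * rng.normal(0.0, 1.0 + gen_noise, size=n_candidates)
        scores = quality(candidates) + rng.normal(0.0, ver_noise, size=n_candidates)
        theta = candidates[np.argmax(scores)]
    return quality(theta) - quality(0.0)

def mean_gain(ver_noise, runs=20):
    return np.mean([gvu_gain(gen_noise=0.5, ver_noise=ver_noise, seed=s) for s in range(runs)])

print("mean gain, low verifier noise :", round(mean_gain(0.1), 1))   # clearly positive
print("mean gain, high verifier noise:", round(mean_gain(50.0), 1))  # near zero or negative
```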
University of Florida
Abstract
As generative artificial intelligence (GAI) enters the mental health landscape, questions arise about how individuals weigh AI tools against human therapists. Drawing on the Health Belief Model (HBM), this study examined belief-based predictors of intention to use GAI and therapists across two populations: a university sample (N = 1,155) and a nationally representative adult sample (N = 651). Using repeated-measures ANOVA and LASSO regression, we found that therapists were consistently valued for emotional, relational, and personalization benefits, while GAI was favored for accessibility and affordability. Yet structural advantages alone did not predict adoption; emotional benefit and personalization emerged as decisive factors. Adoption patterns diverged across groups: students treated GAI as a complement, whereas national adults approached it as a substitute. Concerns about privacy and reliability constrained GAI use in both groups. These findings extend HBM to multi-modality contexts and highlight design implications for trustworthy, emotionally resonant digital mental health tools.
AI Summary
  • LASSO regression is a statistical method used to identify the most influential predictors from a comprehensive set of perceived benefits and barriers (a minimal sketch follows after this list). [3]
  • The study provides insights into the psychological reasoning behind help-seeking behavior, highlighting the importance of emotional benefits, personalization, affordability, and reliability in decision-making. [3]
  • The study found that individuals choose between GAI tools and human therapists based on different belief structures. [2]
  • The Health Belief Model (HBM) is a theoretical framework used to understand how individuals make decisions about their health behaviors. [1]
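A minimal sketch of the LASSO-based predictor selection described above, using scikit-learn's LassoCV on simulated survey data. The column names and the simulated outcome are hypothetical stand-ins for the study's belief items, not its actual variables.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Hypothetical survey data: belief items (1-7 Likert) and intention to use GAI.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "emotional_benefit":   rng.integers(1, 8, n),
    "personalization":     rng.integers(1, 8, n),
    "affordability":       rng.integers(1, 8, n),
    "accessibility":       rng.integers(1, 8, n),
    "privacy_concern":     rng.integers(1, 8, n),
    "reliability_concern": rng.integers(1, 8, n),
})
# Simulated outcome loosely tied to two predictors, just so the example runs.
df["intention_gai"] = (0.5 * df["emotional_benefit"] + 0.4 * df["personalization"]
                       + rng.normal(0, 1, n))

X = StandardScaler().fit_transform(df.drop(columns="intention_gai"))
lasso = LassoCV(cv=5, random_state=0).fit(X, df["intention_gai"])

for name, coef in zip(df.columns[:-1], lasso.coef_):
    flag = "selected" if abs(coef) > 1e-6 else "dropped"
    print(f"{name:>19}: coef={coef:+.3f} ({flag})")
```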
AI for Social Justice
University of Glasgow
Paper visualization
Abstract
Commercial or in-house developments of probabilistic AI systems are introduced in policing and the wider criminal justice (CJ) system worldwide, often on a force-by-force basis. We developed a systematic way to characterise probabilistic AI tools across the CJ stages in a form of mapping with the aim to provide a coherent presentation of the probabilistic AI ecosystem in CJ. We use the CJ system in England and Wales as a paradigm. This map will help us better understand the extent of AI's usage in this domain (how, when, and by whom), its purpose and potential benefits, its impact on people's lives, compare tools, and identify caveats (bias, obscured or misinterpreted probabilistic outputs, cumulative effects by AI systems feeding each other, and breaches in the protection of sensitive data), as well as opportunities for future implementations. In this paper we present our methodology for systematically mapping the probabilistic AI tools in CJ stages and characterising them based on the modes of data consumption or production. We also explain how we collect the data and present our initial findings. This research is ongoing and we are engaging with UK Police organisations, and government and legal bodies. Our findings so far suggest a strong reliance on private sector providers, and that there is a growing interest in generative technologies and specifically Large Language Models (LLMs).
AI Summary
  • The use of AI in the criminal justice system is widespread, with 64% of tools using analysis, 33% using synthesis, and 26% using generation. [3]
  • Inference mode: The way in which a tool uses AI to make decisions or predictions. [3]
  • Automated facial recognition has been used as evidence and trigger for police intervention. [3]
  • Probabilistic AI ecosystem: The use of artificial intelligence (AI) in the criminal justice system to make decisions or predictions. [2]
  • Analysis, Synthesis, and Generation are the three main inference modes. [1]
Kaiasm Ltd
Abstract
In this preprint, we present a collaborative human-AI approach to building an inspectable semantic layer for Agentic AI. AI agents first propose candidate knowledge structures from diverse data sources; domain experts then validate, correct, and extend these structures, with their feedback used to improve subsequent models. We show how this process captures tacit institutional knowledge, improves response quality and efficiency, and mitigates institutional amnesia. We argue for a shift from post-hoc explanation to justifiable Agentic AI, where decisions are grounded in explicit, inspectable evidence and reasoning accessible to both experts and non-specialists.
AI on Air
IBM
Abstract
The rapid shift from stateless large language models (LLMs) to autonomous, goal-driven agents raises a central question: When is agentic AI truly necessary? While agents enable multi-step reasoning, persistent memory, and tool orchestration, deploying them indiscriminately leads to higher cost, complexity, and risk. We present STRIDE (Systematic Task Reasoning Intelligence Deployment Evaluator), a framework that provides principled recommendations for selecting between three modalities: (i) direct LLM calls, (ii) guided AI assistants, and (iii) fully autonomous agentic AI. STRIDE integrates structured task decomposition, dynamism attribution, and self-reflection requirement analysis to produce an Agentic Suitability Score, ensuring that full agentic autonomy is reserved for tasks with inherent dynamism or evolving context. Evaluated across 30 real-world tasks spanning SRE, compliance, and enterprise automation, STRIDE achieved 92% accuracy in modality selection, reduced unnecessary agent deployments by 45%, and cut resource costs by 37%. Expert validation over six months in SRE and compliance domains confirmed its practical utility, with domain specialists agreeing that STRIDE effectively distinguishes between tasks requiring simple LLM calls, guided assistants, or full agentic autonomy. This work reframes agent adoption as a necessity-driven design decision, ensuring autonomy is applied only when its benefits justify the costs.
AI Summary
  • The framework can be used in conjunction with existing benchmarks to evaluate the performance of agentic AI systems. [3]
  • Future extensions to STRIDE will include multimodal tasks, reinforcement learning for weight tuning, and validation at enterprise scale. [3]
  • STRIDE's scoring functions are heuristic by design, striking a balance between interpretability and generality. [3]
  • STRIDE (Systematic Task Reasoning Intelligence Deployment Evaluator) is a framework that determines when tasks require agentic AI, AI assistants, or simple LLM calls. [2]
  • STRIDE integrates five analytical dimensions: structured task decomposition, dynamic reasoning and tool-interaction scoring, dynamism attribution analysis, self-reflection requirement assessment, and agentic suitability inference. [1]
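Since STRIDE's scoring functions are described as heuristic, the sketch below shows only the general shape of such a rule: five dimension scores are averaged into a suitability score and thresholded into the three modalities. The weights and thresholds are made up, not the paper's.

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    # Each dimension scored in [0, 1]; the names mirror STRIDE's dimensions,
    # but the scoring itself is hypothetical.
    decomposition_depth: float   # how many interdependent steps the task needs
    tool_interaction: float      # reliance on external tools / APIs
    dynamism: float              # how much the context changes during execution
    self_reflection: float       # need to revise plans based on outcomes
    risk_tolerance: float        # acceptable level of autonomous action

def suitability_score(t: TaskProfile) -> float:
    dims = [t.decomposition_depth, t.tool_interaction, t.dynamism,
            t.self_reflection, t.risk_tolerance]
    return sum(dims) / len(dims)   # uniform weights, purely illustrative

def recommend_modality(t: TaskProfile) -> str:
    s = suitability_score(t)
    if s < 0.35:                   # thresholds are made up
        return f"direct LLM call (score={s:.2f})"
    if s < 0.7:
        return f"guided AI assistant (score={s:.2f})"
    return f"agentic AI (score={s:.2f})"

print(recommend_modality(TaskProfile(0.2, 0.1, 0.1, 0.2, 0.3)))  # simple Q&A
print(recommend_modality(TaskProfile(0.9, 0.8, 0.9, 0.8, 0.6)))  # incident remediation
```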
Alibaba Cloud Computing
Abstract
The rapid advancement of Large Language Models (LLMs) is catalyzing a shift towards autonomous AI Agents capable of executing complex, multi-step tasks. However, these agents remain brittle when faced with real-world exceptions, making Human-in-the-Loop (HITL) supervision essential for mission-critical applications. In this paper, we present AgentBay, a novel sandbox service designed from the ground up for hybrid interaction. AgentBay provides secure, isolated execution environments spanning Windows, Linux, Android, Web Browsers, and Code interpreters. Its core contribution is a unified session accessible via a hybrid control interface: An AI agent can interact programmatically via mainstream interfaces (MCP, Open Source SDK), while a human operator can, at any moment, seamlessly take over full manual control. This seamless intervention is enabled by Adaptive Streaming Protocol (ASP). Unlike traditional VNC/RDP, ASP is specifically engineered for this hybrid use case, delivering an ultra-low-latency, smoother user experience that remains resilient even in weak network environments. It achieves this by dynamically blending command-based and video-based streaming, adapting its encoding strategy based on network conditions and the current controller (AI or human). Our evaluation demonstrates strong results in security, performance, and task completion rates. In a benchmark of complex tasks, the AgentBay (Agent + Human) model achieved more than 48% success rate improvement. Furthermore, our ASP protocol reduces bandwidth consumption by up to 50% compared to standard RDP, and in end-to-end latency with around 5% reduction, especially under poor network conditions. We posit that AgentBay provides a foundational primitive for building the next generation of reliable, human-supervised autonomous systems.
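As a hypothetical illustration of ASP's adaptive blending of command-based and video-based streaming, the function below picks an encoding mode from bandwidth and the current controller. The policy and thresholds are invented, not the protocol's actual logic.

```python
def choose_encoding(bandwidth_kbps: float, controller: str) -> str:
    """Pick a streaming mode from network conditions and the current controller.

    A purely illustrative policy: human operators get video whenever bandwidth
    allows (smoother visual feedback), AI controllers prefer lightweight
    command-based updates, and very poor links fall back to commands only.
    """
    if bandwidth_kbps < 300:
        return "command-only"   # survive weak networks
    if controller == "human":
        return "video" if bandwidth_kbps > 1500 else "hybrid (video keyframes + commands)"
    return "command-first (video on demand)"

for bw, who in [(200, "ai"), (800, "human"), (5000, "human"), (5000, "ai")]:
    print(f"{who:>5} @ {bw:>5} kbps -> {choose_encoding(bw, who)}")
```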
AI on Education
Mohamed bin Zayed University of Artificial Intelligence
Paper visualization
Abstract
We present AITutor-EvalKit, an application that uses language technology to evaluate the pedagogical quality of AI tutors, provides software for demonstration and evaluation, as well as model inspection and data visualization. This tool is aimed at education stakeholders as well as *ACL community at large, as it supports learning and can also be used to collect user feedback and annotations.
AI Summary
  • LoMTL uses a multi-task learning approach with LoRA-based fine-tuning and has 2 billion parameters, which is approximately 0.7% of the parameter count of the top-performing BJTU model. [3]
  • The LoMTL model outperforms GPT-5 on the Mistake Identification and Actionability dimensions but underperforms by 3 and 6 percentage points in terms of macro-F1 for Mistake Location and Providing Guidance, respectively (a macro-F1 computation sketch follows after this list). [2]
  • The LoMTL model is a lightweight, automated evaluation model for AI tutors that achieves comparable performance to top-performing teams in the BEA shared task while being significantly more efficient. [1]
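The comparison above is in terms of per-dimension macro-F1; the sketch below computes that metric with scikit-learn on made-up labels for two of the tutor-evaluation dimensions. The three-way label scheme and the data are illustrative, not taken from the shared task.

```python
from sklearn.metrics import f1_score

# Made-up gold and predicted labels for two evaluation dimensions.
gold = {
    "Mistake_Identification": ["Yes", "No", "Yes", "To some extent", "Yes", "No"],
    "Providing_Guidance":     ["Yes", "Yes", "No", "No", "To some extent", "Yes"],
}
pred = {
    "Mistake_Identification": ["Yes", "No", "No", "To some extent", "Yes", "Yes"],
    "Providing_Guidance":     ["Yes", "No", "No", "No", "Yes", "Yes"],
}

for dim in gold:
    macro_f1 = f1_score(gold[dim], pred[dim], average="macro")
    print(f"{dim}: macro-F1 = {macro_f1:.3f}")
```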
Fondazione Bruno Kessler
Abstract
The rise of Artificial Intelligence (AI) language technologies, particularly generative AI (GenAI) chatbots accessible via conversational interfaces, is transforming digital interactions. While these tools hold societal promise, they also risk widening digital divides due to uneven adoption and low awareness of their limitations. This study presents the first comprehensive empirical mapping of GenAI adoption, usage patterns, and literacy in Italy, based on newly collected survey data from 1,906 Italian-speaking adults. Our findings reveal widespread adoption for both work and personal use, including sensitive tasks like emotional support and medical advice. Crucially, GenAI is supplanting other technologies to become a primary information source: this trend persists despite low user digital literacy, posing a risk as users struggle to recognize errors or misinformation. Moreover, we identify a significant gender divide -- particularly pronounced in older generations -- where women are half as likely to adopt GenAI and use it less frequently than men. While we find literacy to be a key predictor of adoption, it only partially explains this disparity, suggesting that other barriers are at play. Overall, our data provide granular insights into the multipurpose usage of GenAI, highlighting the dual need for targeted educational initiatives and further investigation into the underlying barriers to equitable participation that competence alone cannot explain.
AI Summary
  • There was a notable generational divide in LT literacy, with younger cohorts showing improved competence but older age groups experiencing significant deterioration. [3]
  • Users reported encountering errors or biased examples in the output of GenAI chatbots, with 39.5% confirming such experiences and 24.3% unsure. [3]
  • GenAI: generative AI. LTs: Language Technologies. [3]
  • The study highlights a significant knowledge gap in LT literacy among the Italian population, with limited prior education and self-assessed competence. [3]
  • The majority of participants reported receiving no formal training on Language Technologies (LTs) or AI, with 76.2% of non-users reporting no training at all. [2]
AI on Energy
Scuola Superiore Meridionale
Abstract
The ability to compose acquired skills to plan and execute behaviors is a hallmark of natural intelligence. Yet, despite remarkable cross-disciplinary efforts, a principled account of how task structure shapes gating and how such computations could be delivered in neural circuits, remains elusive. Here we introduce GateMod, an interpretable theoretically grounded computational model linking the emergence of gating to the underlying decision-making task, and to a neural circuit architecture. We first develop GateFrame, a normative framework casting policy gating into the minimization of the free energy. This framework, relating gating rules to task, applies broadly across neuroscience, cognitive and computational sciences. We then derive GateFlow, a continuous-time energy based dynamics that provably converges to GateFrame optimal solution. Convergence, exponential and global, follows from a contractivity property that also yields robustness and other desirable properties. Finally, we derive a neural circuit from GateFlow, GateNet. This is a soft-competitive recurrent circuit whose components perform local and contextual computations consistent with known dendritic and neural processing motifs. We evaluate GateMod across two different settings: collective behaviors in multi-agent systems and human decision-making in multi-armed bandits. In all settings, GateMod provides interpretable mechanistic explanations of gating and quantitatively matches or outperforms established models. GateMod offers a unifying framework for neural policy gating, linking task objectives, dynamical computation, and circuit-level mechanisms. It provides a framework to understand gating in natural agents beyond current explanations and to equip machines with this ability.
AI Summary
  • The model is grounded in a normative framework, GateFrame, which is flexible enough to capture various decision-making problems across behavioral/cognitive sciences, neuroscience, and machine learning. [3]
  • GateNet is a soft-competitive recurrent neural circuit implementing GateFrame, with each element clear and interpretable in view of the agent task. [3]
  • The approach harmonizes decision-making, complex dynamical systems, and neural principles. [3]
  • Two domains were considered for evaluation: collective behaviors and multi-armed bandits tasks. [3]
  • GateNet: Soft-competitive recurrent neural circuit implementing GateFrame. [3]
  • Results show that GateMod can recover well-known phenomena in collective behaviors and provide insights into exploration/exploitation balance in multi-armed bandits tasks. [2]
  • GateMod is an interpretable computational model that links the emergence of gating mechanisms to decision-making tasks and neural circuits. [1]
AI on Food
Stanford University
Abstract
Generative AI systems may pose serious risks to individuals vulnerable to eating disorders. Existing safeguards tend to overlook subtle but clinically significant cues, leaving many risks unaddressed. To better understand the nature of these risks, we conducted semi-structured interviews with 15 clinicians, researchers, and advocates with expertise in eating disorders. Using abductive qualitative analysis, we developed an expert-guided taxonomy of generative AI risks across seven categories: (1) providing generalized health advice; (2) encouraging disordered behaviors; (3) supporting symptom concealment; (4) creating thinspiration; (5) reinforcing negative self-beliefs; (6) promoting excessive focus on the body; and (7) perpetuating narrow views about eating disorders. Our results demonstrate how certain user interactions with generative AI systems intersect with clinical features of eating disorders in ways that may intensify risk. We discuss implications of our work, including approaches for risk assessment, safeguard design, and participatory evaluation practices with domain experts.
AI Summary
  • Generative AI tools can amplify harm by providing personalized and authoritative advice on concealing disordered eating, creating thinspiration images, reinforcing negative self-beliefs, and promoting unhealthy appearance standards. [3]
  • AI tools capable of offering more personalized or socially responsive interactions may exert an even stronger pull toward harmful appearance standards for individuals with anorexia who struggle with interpreting subtle social cues. [3]
  • Thinspiration: AI-generated content that inspires or pressures individuals to conform to idealized body standards, often through aggressive weight loss, body transformation, or culturally accepted and endorsed dieting practices. [3]
  • AI systems can feel more private and personalized than advice from peers, which could reduce feelings of stigma that might prevent someone from seeking guidance on concealing disordered eating. [3]
  • The risks of generative AI for eating disorders are not limited to images; textual content can also be harmful, particularly when it presents restrictive eating patterns as aspirational. [2]
AI on Healthcare
Georgia Institute of Technology
Abstract
The rapid growth of Artificial Intelligence (AI) in healthcare has sparked interest in Trustworthy AI and AI Implementation Science, both of which are essential for accelerating clinical adoption. However, strict regulations, gaps between research and clinical settings, and challenges in evaluating AI systems continue to hinder real-world implementation. This study presents an AI implementation case study within Shriners Children's (SC), a large multisite pediatric system, showcasing the modernization of SC's Research Data Warehouse (RDW) to OMOP CDM v5.4 within a secure Microsoft Fabric environment. We introduce a Python-based data quality assessment tool compatible with SC's infrastructure, extending OHDSI's R/Java-based Data Quality Dashboard (DQD) and integrating Trustworthy AI principles using the METRIC framework. This extension enhances data quality evaluation by addressing informative missingness, redundancy, timeliness, and distributional consistency. We also compare systematic and case-specific AI implementation strategies for Craniofacial Microsomia (CFM) using the FHIR standard. Our contributions include a real-world evaluation of AI implementations, integration of Trustworthy AI principles into data quality assessment, and insights into hybrid implementation strategies that blend systematic infrastructure with use-case-driven approaches to advance AI in healthcare.
AI Summary
  • They also discuss the use of Observational Health Data Sciences and Informatics (OHDSI) and the OMOP Common Data Model (CDM) for data harmonization and analysis. [3]
  • Data Quality: refers to the accuracy, completeness, and consistency of data used in analysis or decision-making. [3]
  • Harmonization: refers to the process of converting data from different sources into a common format for analysis or comparison. [3]
  • The article discusses the challenges of implementing artificial intelligence (AI) in healthcare, particularly in terms of data quality and harmonization. [2]
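The abstract describes a Python-based data quality tool covering informative missingness, timeliness, and distributional consistency; the sketch below shows generic versions of such checks on a toy OMOP-like table. The table, columns, and reference date are invented, and this is not the authors' tool.

```python
import pandas as pd
from scipy.stats import ks_2samp

# Toy OMOP-like measurement table (values are made up).
df = pd.DataFrame({
    "person_id": range(8),
    "value_as_number": [1.1, None, 0.9, None, 1.3, 1.0, None, 1.2],
    "measurement_date": pd.to_datetime(
        ["2024-01-02", "2024-01-05", "2023-12-30", "2024-02-11",
         "2024-03-01", "2024-03-02", "2024-03-05", "2024-03-06"]),
    "site": ["A", "A", "A", "A", "B", "B", "B", "B"],
})

# Completeness: share of missing values per column.
missing_rate = df.isna().mean()
print("missing rate per column:\n", missing_rate, sep="")

# Timeliness: how stale the newest record is relative to a reference date.
lag_days = (pd.Timestamp("2024-04-01") - df["measurement_date"].max()).days
print("days since most recent measurement:", lag_days)

# Distributional consistency across sites via a two-sample KS test.
a = df.loc[df.site == "A", "value_as_number"].dropna()
b = df.loc[df.site == "B", "value_as_number"].dropna()
stat, p = ks_2samp(a, b)
print(f"KS test across sites: statistic={stat:.2f}, p={p:.2f}")
```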
University of Colorado An
Abstract
Medical imaging data plays a vital role in disease diagnosis, monitoring, and clinical research discovery. Biomedical data managers and clinical researchers must navigate a complex landscape of medical imaging infrastructure, input/output tools and data reliability workflow configurations taking months to operationalize. While standard formats exist for medical imaging data, standard operating procedures (SOPs) for data management are lacking. These data management SOPs are key for developing Findable, Accessible, Interoperable, and Reusable (FAIR) data, a prerequisite for AI-ready datasets. The National Institutes of Health (NIH) Bridge to Artificial Intelligence (Bridge2AI) Standards Working Group members and domain-expert stakeholders from the Bridge2AI Grand Challenges teams developed data management SOPs for the Digital Imaging and Communications in Medicine (DICOM) format. We describe novel SOPs applying to both static and cutting edge video imaging modalities. We emphasize steps required for centralized data aggregation, validation, and de-identification, including a review of new defacing methods for facial DICOM scans, anticipating adversarial AI/ML data re-identification methods. Data management vignettes based on Bridge2AI datasets include example parameters for efficient capture of a wide modality spectrum, including datasets from new ophthalmology retinal scans DICOM modalities.

Interests not found

We did not find any papers that match the interests below. Try other terms, and also consider whether the content exists on arxiv.org.
  • AI for Society
You can edit or add more interests any time.