🎯 Top Personalized Recommendations
RPTU Kaiserslautern-Landau
Why we think this paper is great for you:
This paper directly addresses the energy footprint of AI systems, which is a key area of interest for you. It provides valuable insights into benchmarking and estimating the energy consumption of powerful AI agents.
Abstract
Web agents, like OpenAI's Operator and Google's Project Mariner, are powerful
agentic systems pushing the boundaries of Large Language Models (LLMs). They can
autonomously interact with the internet at the user's behest, such as
navigating websites, filling search masks, and comparing price lists. Though
web agent research is thriving, induced sustainability issues remain largely
unexplored. To highlight the urgency of this issue, we provide an initial
exploration of the energy and $CO_2$ cost associated with web agents from both
a theoretical perspective (via estimation) and an empirical one (via benchmarking).
Our results show how different philosophies in web agent creation can severely
impact the associated expended energy, and that more energy consumed does not
necessarily equate to better results. We highlight the lack of transparency
around the model parameters and internal processes of some web agents as a
limiting factor when estimating energy consumption. Our work contributes
towards a shift in how web agents are evaluated, advocating for dedicated
metrics that measure energy consumption in benchmarks.
AI Summary
- Web agent design philosophies severely impact energy consumption, where more energy consumed does not necessarily equate to better performance, as demonstrated by AutoWebGLM being both the most energy-efficient and best-performing agent on Mind2Web. [3]
- Theoretical estimation of energy consumption for proprietary LLM-driven web agents is highly unreliable, showing discrepancies up to a factor of 7 compared to empirical measurements, primarily due to a lack of transparency in model parameters and internal processes. [3]
- The absence of energy consumption metrics in current web agent benchmarks creates a disincentive for developers to prioritize sustainable implementations, necessitating the adoption of energy per benchmark as a core evaluation metric. [3]
- A critical problem is the lack of transparency regarding web agent energy consumption to end-users, suggesting that displaying estimated CO2 emissions per task could raise awareness and guide users towards more sustainable agents. [3]
- Step Success Rate (SSR): The de facto standard performance metric for the Mind2Web benchmark, representing the ratio of successful steps taken towards completing a task to the total number of steps. [3]
- Empirical energy benchmarking for open-source LLM-driven web agents is feasible and should be integrated into standard evaluation practices to provide a holistic assessment of performance and sustainability. [2]
- Effective preprocessing and the strategic use of smaller, fine-tuned open-source LLMs (e.g., MindAct's approach) are critical for achieving significant energy efficiency compared to relying on large proprietary models like GPT-4. [2]
- To enable comparability for proprietary LLM-driven agents where direct benchmarking is impossible, developers should report at least the energy consumption per token and the total number of tokens consumed for established benchmarks. [2]
- Energy per token: A metric used to evaluate the trade-off between performance and sustainability, particularly for LLM inference, by quantifying the energy consumed per processed token; a toy computation of this and SSR appears after this list. [2]
- Web Agents: LLM-powered agentic systems capable of autonomously interacting with the internet, such as navigating websites and filling forms, mimicking human user behavior. [1]
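To make the Step Success Rate and energy-per-token metrics above concrete, here is a minimal Python sketch of how they could be computed for a single web-agent run; the function names, the per-token energy figure, and the grid carbon intensity are illustrative assumptions, not values reported in the paper.

```python
# Minimal sketch of the two metrics named in the summary above; every number
# below is an illustrative assumption, not a measurement from the paper.

def step_success_rate(successful_steps: int, total_steps: int) -> float:
    """SSR: fraction of benchmark steps the agent completed correctly."""
    return successful_steps / total_steps

def estimated_energy_kwh(tokens_processed: int, energy_per_token_kwh: float) -> float:
    """Theoretical estimate: total tokens times an assumed per-token energy cost."""
    return tokens_processed * energy_per_token_kwh

def estimated_co2_kg(energy_kwh: float, grid_kg_co2_per_kwh: float) -> float:
    """Convert energy to CO2 using the carbon intensity of the local grid."""
    return energy_kwh * grid_kg_co2_per_kwh

if __name__ == "__main__":
    ssr = step_success_rate(successful_steps=42, total_steps=60)
    energy = estimated_energy_kwh(tokens_processed=250_000,
                                  energy_per_token_kwh=3e-6)      # assumed value
    co2 = estimated_co2_kg(energy, grid_kg_co2_per_kwh=0.4)       # assumed grid mix
    print(f"SSR={ssr:.1%}  energy={energy:.3f} kWh  CO2={co2 * 1000:.1f} g")
```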
Harvard University
Why we think this paper is great for you:
This paper directly tackles algorithmic fairness in criminal justice, a critical area aligning with your strong interest in AI for social justice and fairness. It explores different fairness concepts and their implications in high-stakes domains.
Abstract
Algorithmic fairness has grown rapidly as a research area, yet key concepts
remain unsettled, especially in criminal justice. We review group, individual,
and process fairness and map the conditions under which they conflict. We then
develop a simple modification to standard group fairness. Rather than exact
parity across protected groups, we minimize a weighted error loss while keeping
differences in false negative rates within a small tolerance. This makes
solutions easier to find, can raise predictive accuracy, and surfaces the
ethical choice of error costs. We situate this proposal within three classes of
critique: biased and incomplete data, latent affirmative action, and the
explosion of subgroup constraints. Finally, we offer a practical framework for
deployment in public decision systems built on three pillars: need-based
decisions, transparency and accountability, and narrowly tailored definitions
and solutions. Together, these elements link technical design to legitimacy and
provide actionable guidance for agencies that use risk assessment and related
tools.
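A hedged formalization of the proposed relaxation, written in our own notation rather than the paper's: instead of requiring exact parity across protected groups $a$ and $b$, a classifier $f$ minimizes a weighted error loss while the false-negative-rate gap is held within a small tolerance $\varepsilon$,

$$\min_{f} \;\; \lambda_{\mathrm{FP}}\,\mathrm{FP}(f) + \lambda_{\mathrm{FN}}\,\mathrm{FN}(f) \quad \text{subject to} \quad \left|\mathrm{FNR}_a(f) - \mathrm{FNR}_b(f)\right| \le \varepsilon,$$

where choosing the weights $\lambda_{\mathrm{FP}}$ and $\lambda_{\mathrm{FN}}$ makes the ethical pricing of error costs explicit, and $\varepsilon = 0$ recovers exact group parity.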
INTELI - Instituto de Tecnologia e Liderança
Why we think this paper is great for you:
This paper directly explores the evolving role of AI in education, a topic you've expressed significant interest in. It discusses how AI is transforming learning environments and human intellectual labor.
Abstract
The debate over whether "thinking machines" could replace human intellectual
labor has existed in both public and expert discussions since the mid-twentieth
century, when the concept and terminology of Artificial Intelligence (AI) first
emerged. For decades, this idea remained largely theoretical. However, with the
recent advent of Generative AI - particularly Large Language Models (LLMs) -
and the widespread adoption of tools such as ChatGPT, the issue has become a
practical reality. Many fields that rely on human intellectual effort are now
being reshaped by AI tools that both expand human capabilities and challenge
the necessity of certain forms of work once deemed uniquely human but now
easily automated. Education, somewhat unexpectedly, faces a pivotal
responsibility: to devise long-term strategies for cultivating human skills
that will remain relevant in an era of pervasive AI in the intellectual domain.
In this context, we identify the limitations of current AI systems - especially
those rooted in LLM technology - argue that the fundamental causes of these
weaknesses cannot be resolved through existing methods, and propose directions
within the constructivist paradigm for transforming education to preserve the
long-term advantages of human intelligence over AI tools.
German Cancer Research Center (DKFZ)
Why we think this paper is great for you:
This paper addresses critical challenges in the development and adoption of AI in surgery, which is highly relevant to your interest in AI's application in healthcare. It highlights the importance of robust validation practices for clinical AI.
Abstract
Surgical data science (SDS) is rapidly advancing, yet clinical adoption of
artificial intelligence (AI) in surgery remains severely limited, with
inadequate validation emerging as a key obstacle. In fact, existing validation
practices often neglect the temporal and hierarchical structure of
intraoperative videos, producing misleading, unstable, or clinically irrelevant
results. In a pioneering, consensus-driven effort, we introduce the first
comprehensive catalog of validation pitfalls in AI-based surgical video
analysis that was derived from a multi-stage Delphi process with 91
international experts. The collected pitfalls span three categories: (1) data
(e.g., incomplete annotation, spurious correlations), (2) metric selection and
configuration (e.g., neglect of temporal stability, mismatch with clinical
needs), and (3) aggregation and reporting (e.g., clinically uninformative
aggregation, failure to account for frame dependencies in hierarchical data
structures). A systematic review of surgical AI papers reveals that these
pitfalls are widespread in current practice, with the majority of studies
failing to account for temporal dynamics or hierarchical data structure, or
relying on clinically uninformative metrics. Experiments on real surgical video
datasets provide the first empirical evidence that ignoring temporal and
hierarchical data structures can lead to drastic understatement of uncertainty,
obscure critical failure modes, and even alter algorithm rankings. This work
establishes a framework for the rigorous validation of surgical video analysis
algorithms, providing a foundation for safe clinical translation, benchmarking,
regulatory review, and future reporting standards in the field.
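To illustrate the hierarchical-data pitfall described above, here is a toy Python sketch, entirely synthetic and not drawn from the paper's experiments, that compares a naive frame-level bootstrap of a per-frame accuracy metric with a video-level (clustered) bootstrap; because frames within a video are correlated, the clustered interval is typically much wider, showing how ignoring the hierarchy understates uncertainty.

```python
# Illustrative only: frame-level vs video-level bootstrap of a per-frame
# accuracy metric on synthetic data; nothing here comes from the paper.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup: 20 videos, 300 frames each; per-video difficulty varies,
# so frames within a video are correlated.
n_videos, frames_per_video = 20, 300
video_acc = rng.beta(8, 2, size=n_videos)                    # per-video accuracy
frames = [rng.random(frames_per_video) < p for p in video_acc]
correct = np.concatenate(frames)                             # flat frame labels

def bootstrap_ci(samples, alpha=0.05):
    return tuple(np.quantile(samples, [alpha / 2, 1 - alpha / 2]))

# 1) Naive frame-level bootstrap: resamples frames as if they were independent.
frame_stats = [correct[rng.integers(0, correct.size, correct.size)].mean()
               for _ in range(2000)]

# 2) Clustered (video-level) bootstrap: resamples whole videos.
video_means = np.array([v.mean() for v in frames])
video_stats = [video_means[rng.integers(0, n_videos, n_videos)].mean()
               for _ in range(2000)]

print("frame-level 95% CI:", bootstrap_ci(frame_stats))
print("video-level 95% CI:", bootstrap_ci(video_stats))      # noticeably wider
```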
The University of Melbourne
Why we think this paper is great for you:
This paper directly addresses agentic AI applications within the transportation sector, aligning perfectly with your interest in AI's role in transportation. It proposes a novel approach for intention recognition in these systems.
Abstract
In this study, a modular, data-free pipeline for multi-label intention
recognition is proposed for agentic AI applications in transportation. Unlike
traditional intent recognition systems that depend on large, annotated corpora
and often struggle with fine-grained, multi-label discrimination, our approach
eliminates the need for costly data collection while enhancing the accuracy of
multi-label intention understanding. Specifically, the overall pipeline, named
DMTC, consists of three steps: 1) using prompt engineering to guide large
language models (LLMs) to generate diverse synthetic queries in different
transport scenarios; 2) encoding each textual query with a Sentence-T5 model to
obtain compact semantic embeddings; 3) training a lightweight classifier using
a novel online focal-contrastive (OFC) loss that emphasizes hard samples and
maximizes inter-class separability. The applicability of the proposed pipeline
is demonstrated in an agentic AI application in the maritime transportation
context. Extensive experiments show that DMTC achieves a Hamming loss of 5.35%
and an AUC of 95.92%, outperforming state-of-the-art multi-label classifiers
and recent end-to-end SOTA LLM-based baselines. Further analysis reveals that
Sentence-T5 embeddings improve subset accuracy by at least 3.29% over
alternative encoders, and integrating the OFC loss yields an additional 0.98%
gain compared to standard contrastive objectives. In conclusion, our system
seamlessly routes user queries to task-specific modules (e.g., ETA information,
traffic risk evaluation, and other typical scenarios in the transportation
domain), laying the groundwork for fully autonomous, intention-aware agents
without costly manual labelling.
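The three-step pipeline lends itself to a compact sketch. The version below is a simplification under stated assumptions: step 1 (LLM-generated synthetic queries) is stubbed with a few hand-written examples, step 2 uses the public sentence-transformers Sentence-T5 checkpoint, and step 3 substitutes a plain one-vs-rest logistic regression for the paper's lightweight classifier trained with the online focal-contrastive (OFC) loss; the intent labels are likewise hypothetical.

```python
# Simplified, hedged sketch of a DMTC-style pipeline; labels, queries, and the
# classifier are illustrative stand-ins, not the authors' implementation.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import hamming_loss

LABELS = ["eta_information", "traffic_risk", "route_planning"]   # assumed intents

# Step 1 (stub): in the paper, prompt-engineered LLMs generate these queries.
queries = [
    ("When will the vessel arrive at Rotterdam?",                [1, 0, 0]),
    ("Is there congestion risk near the strait tonight?",        [0, 1, 0]),
    ("Plan a route avoiding the storm and give me the new ETA.", [1, 0, 1]),
    ("How risky is the crossing and when do we get there?",      [1, 1, 0]),
]
texts = [q for q, _ in queries]
y = np.array([labels for _, labels in queries])

# Step 2: encode each query into a compact semantic embedding with Sentence-T5.
encoder = SentenceTransformer("sentence-transformers/sentence-t5-base")
X = encoder.encode(texts)

# Step 3: train a lightweight multi-label classifier on the embeddings
# (a one-vs-rest logistic regression here, instead of the OFC-trained head).
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

pred = clf.predict(X)
print("Hamming loss on the toy set:", hamming_loss(y, pred))
```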
University of Washington
Why we think this paper is great for you:
This paper presents an AI application for food logging, directly addressing your interest in AI's role in food and its connections to health outcomes. It offers practical insights into AI tools for dietary management.
Abstract
Food logging, both self-directed and prescribed, plays a critical role in
uncovering correlations between diet, medical, fitness, and health outcomes.
Through conversations with nutritional experts and individuals who practice
dietary tracking, we find current logging methods, such as handwritten and
app-based journaling, are inflexible and result in low adherence and
potentially inaccurate nutritional summaries. These findings, corroborated by
prior literature, emphasize the urgent need for improved food logging methods.
In response, we propose SnappyMeal, an AI-powered dietary tracking system that
leverages multimodal inputs to enable users to more flexibly log their food
intake. SnappyMeal introduces goal-dependent follow-up questions to
intelligently seek missing context from the user and information retrieval from
user grocery receipts and nutritional databases to improve accuracy. We
evaluate SnappyMeal through publicly available nutrition benchmarks and a
multi-user, 3-week, in-the-wild deployment capturing over 500 logged food
instances. Users strongly praised the multiple available input methods and
reported high perceived accuracy. These insights suggest that multimodal AI
systems can be leveraged to significantly improve dietary tracking flexibility
and context-awareness, laying the groundwork for a new class of intelligent
self-tracking applications.
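As a purely illustrative sketch of the interaction flow described above, the snippet below shows how goal-dependent follow-up questions and a nutrition lookup could be combined; the data structures, field names, and the tiny nutrition table are hypothetical, not SnappyMeal's actual implementation.

```python
# Hypothetical flow for a SnappyMeal-style logger; all names and the nutrition
# table are illustrative assumptions, not the authors' code.
from dataclasses import dataclass, field

# Stand-in nutrition table (kcal per serving); a real system would query a
# nutritional database and the user's grocery receipts.
NUTRITION_DB = {"oatmeal": 150, "latte": 190}

@dataclass
class LogEntry:
    description: str
    portion: float | None = None              # servings; None = missing context
    items: list[str] = field(default_factory=list)

def follow_up_questions(entry: LogEntry) -> list[str]:
    """Goal-dependent prompts that request only the context that is missing."""
    questions = []
    if not entry.items:
        questions.append("What was in it? (e.g., items from your grocery receipt)")
    if entry.portion is None:
        questions.append("Roughly how much of it did you have?")
    return questions

def estimate_calories(entry: LogEntry) -> float:
    """Combine resolved items and portion size with the nutrition lookup."""
    per_serving = sum(NUTRITION_DB.get(item, 0) for item in entry.items)
    return per_serving * (entry.portion or 1.0)

entry = LogEntry(description="breakfast photo")
print(follow_up_questions(entry))              # asked before finalizing the log
entry.items, entry.portion = ["oatmeal", "latte"], 1.0
print(estimate_calories(entry), "kcal")
```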
KFUPM - King Fahd University of Petroleum and Minerals
Why we think this paper is great for you:
This paper investigates the critical alignment of AI systems with human values, which is central to your interest in responsible AI and its societal implications. It explores how LLMs can be developed to uphold ethical principles.
Abstract
Large Language Models (LLMs) are increasingly employed in software
engineering tasks such as requirements elicitation, design, and evaluation,
raising critical questions regarding their alignment with human judgments on
responsible AI values. This study investigates how closely LLMs' value
preferences align with those of two human groups: a US-representative sample
and AI practitioners. We evaluate 23 LLMs across four tasks: (T1) selecting key
responsible AI values, (T2) rating their importance in specific contexts, (T3)
resolving trade-offs between competing values, and (T4) prioritizing software
requirements that embody those values. The results show that LLMs generally
align more closely with AI practitioners than with the US-representative
sample, emphasizing fairness, privacy, transparency, safety, and
accountability. However, inconsistencies appear between the values that LLMs
claim to uphold (Tasks 1-3) and the way they prioritize requirements (Task 4),
revealing gaps in faithfulness between stated and applied behavior. These
findings highlight the practical risk of relying on LLMs in requirements
engineering without human oversight and motivate the need for systematic
approaches to benchmark, interpret, and monitor value alignment in AI-assisted
software development.
AI for Social Good
University of Texas at Dallas
Abstract
Effective human-AI collaboration requires humans to accurately gauge AI
capabilities and calibrate their trust accordingly. Humans often have
context-dependent private information, referred to as Unique Human Knowledge
(UHK), that is crucial for deciding whether to accept or override AI's
recommendations. We examine how displaying AI reasoning affects trust and UHK
utilization through a pre-registered, incentive-compatible experiment (N =
752). We find that revealing AI reasoning, whether brief or extensive, acts as
a powerful persuasive heuristic that significantly increases trust and
agreement with AI recommendations. Rather than helping participants
appropriately calibrate their trust, this transparency induces over-trust that
crowds out UHK utilization. Our results highlight the need for careful
consideration when revealing AI reasoning and call for better information
design in human-AI collaboration systems.
Tsinghua University, Mon
Abstract
Generative AI is increasingly positioned as a peer in collaborative learning,
yet its effects on ethical deliberation remain unclear. We report a
between-subjects experiment with university students (N=217) who discussed an
autonomous-vehicle dilemma in triads under three conditions: human-only
control, supportive AI teammate, or contrarian AI teammate. Using moral
foundations lexicons, argumentative coding from the argumentative knowledge
construction framework, semantic trajectory modelling with BERTopic and dynamic
time warping, and epistemic network analysis, we traced how AI personas reshape
moral discourse. Supportive AIs increased grounded/qualified claims relative to
control, consolidating integrative reasoning around care/fairness, while
contrarian AIs modestly broadened moral framing and sustained value pluralism.
Both AI conditions reduced thematic drift compared with human-only groups,
indicating more stable topical focus. Post-discussion justification complexity
was only weakly predicted by moral framing and reasoning quality, and shifts in
final moral decisions were driven primarily by participants' initial stance
rather than condition. Overall, AI teammates altered the process, the
distribution and connection of moral frames and argument quality, more than the
outcome of moral choice, highlighting the potential of generative AI agents as
teammates for eliciting reflective, pluralistic moral reasoning in
collaborative learning.
AI on Food
John A. Paulson School of Engineering and Applied Sciences
Abstract
Scientific experimentation and manufacturing rely on complex, multi-step procedures
that demand continuous human expertise for precise execution and
decision-making. Despite advances in machine learning and automation,
conventional models remain confined to virtual domains, while real-world
experimentation and manufacturing still rely on human supervision and expertise. This
gap between machine intelligence and physical execution limits reproducibility,
scalability, and accessibility across scientific and manufacturing workflows.
Here, we introduce human-AI co-embodied intelligence, a new form of physical AI
that unites human users, agentic AI, and wearable hardware into an integrated
system for real-world experimentation and intelligent manufacturing. In this paradigm,
humans provide precise execution and control, while agentic AI contributes
memory, contextual reasoning, adaptive planning, and real-time feedback. The
wearable interface continuously captures the experimental and manufacturing
processes and facilitates seamless communication between humans and AI for
corrective guidance and interpretable collaboration. As a demonstration, we
present the Agentic-Physical Experimentation (APEX) system, coupling agentic
reasoning with physical execution through mixed reality. APEX observes and
interprets human actions, aligns them with standard operating procedures,
provides 3D visual guidance, and analyzes every step. Implemented in a
cleanroom for flexible electronics fabrication, the APEX system achieves
context-aware reasoning with accuracy exceeding general multimodal large
language models, corrects errors in real time, and transfers expertise to
beginners. These results establish a new class of agentic-physical-human
intelligence that extends agentic reasoning beyond computation into the
physical domain, transforming scientific research and manufacturing into
autonomous, traceable, interpretable, and scalable processes.