🎯 Top Personalized Recommendations
Adobe Inc
Why we think this paper is great for you:
This paper directly explores the implementation and evaluation of an AI assistant with human involvement, offering practical insights into building effective interactive AI systems. You will find its focus on real-world application particularly relevant for understanding human-AI collaboration.
Abstract
Generative AI assistants offer significant potential to enhance productivity,
streamline information access, and improve user experience in enterprise
contexts. In this work, we present Summit Concierge, a domain-specific AI
assistant developed for Adobe Summit. The assistant handles a wide range of
event-related queries and operates under real-world constraints such as data
sparsity, quality assurance, and rapid deployment. To address these challenges,
we adopt a human-in-the-loop development workflow that combines prompt
engineering, retrieval grounding, and lightweight human validation. We describe
the system architecture, development process, and real-world deployment
outcomes. Our experience shows that agile, feedback-driven development enables
scalable and reliable AI assistants, even in cold-start scenarios.
AI Summary
- Agile, feedback-driven development, including daily human-in-the-loop reviews and continuous monitoring of user feedback, enables rapid identification and resolution of issues in production. [3]
- Product Knowledge: General information queries answered using unstructured content, such as guidebooks. [3]
- Human-in-the-loop (HITL) development is crucial for rapidly deploying reliable, domain-specific generative AI assistants, especially in cold-start scenarios with data sparsity. [2]
- Combining prompt engineering, documentation-aware retrieval, and synthetic data augmentation effectively bootstraps AI assistants when historical interaction data is limited. [2]
- A multi-faceted evaluation framework, integrating correctness scoring, side-by-side comparisons, and brand compliance checks with LLM-as-judge and human validation, ensures high-quality responses. [2]
- Reasoning-oriented LLM prompt rewriting with chain-of-thought and few-shot examples significantly improves handling of ambiguous multi-turn dialogues, reducing rewrite error rates and enhancing routing accuracy. [2]
- The system demonstrated practical benefits in a real-world deployment, improving user experience and reducing operational overhead by effectively managing event-related queries. [2]
- Summit Concierge: A domain-specific generative AI assistant developed for Adobe Summit to handle event-related queries. [2]
- Human-in-the-loop (HITL) development paradigm: An iterative development approach that integrates human expertise for data curation, response validation, and quality monitoring to ensure reliability and rapid iteration. [2]
- Leveraging tools like SQLSynth for programmatic question generation from database schemas guarantees in-scope and diverse evaluation datasets, critical for stress testing intent detection and NL2SQL modules. [1]
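The summary above mentions combining correctness scoring and brand compliance checks with LLM-as-judge and human validation. A minimal Python sketch of how such a review gate might look is given here. It is an illustration only: the Verdict fields, the 0.8 threshold, and the helper names needs_human_review and triage are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """Hypothetical record of the automated checks for one assistant response."""
    correctness: float      # judge score, assumed to lie in [0, 1]
    brand_compliant: bool   # outcome of an assumed brand-compliance check


def needs_human_review(verdict: Verdict, threshold: float = 0.8) -> bool:
    """Flag a response when it fails the brand check or scores below threshold."""
    return (not verdict.brand_compliant) or verdict.correctness < threshold


def triage(verdicts: dict[str, Verdict]) -> tuple[list[str], list[str]]:
    """Split response ids into (auto_accepted, flagged_for_human_review)."""
    accepted: list[str] = []
    flagged: list[str] = []
    for response_id, verdict in verdicts.items():
        if needs_human_review(verdict):
            flagged.append(response_id)
        else:
            accepted.append(response_id)
    return accepted, flagged
```

In a setup like this, only the flagged responses would go to a human reviewer, which is one way to keep validation lightweight.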
KFUPM King Fahd Univeris
Why we think this paper is great for you:
This work investigates the crucial alignment between AI models and human values, which is essential for developing responsible and trustworthy AI. It provides valuable perspectives on how human judgment guides the ethical development of AI systems.
Abstract
Large Language Models (LLMs) are increasingly employed in software
engineering tasks such as requirements elicitation, design, and evaluation,
raising critical questions regarding their alignment with human judgments on
responsible AI values. This study investigates how closely LLMs' value
preferences align with those of two human groups: a US-representative sample
and AI practitioners. We evaluate 23 LLMs across four tasks: (T1) selecting key
responsible AI values, (T2) rating their importance in specific contexts, (T3)
resolving trade-offs between competing values, and (T4) prioritizing software
requirements that embody those values. The results show that LLMs generally
align more closely with AI practitioners than with the US-representative
sample, emphasizing fairness, privacy, transparency, safety, and
accountability. However, inconsistencies appear between the values that LLMs
claim to uphold (Tasks 1-3) and the way they prioritize requirements (Task 4),
revealing gaps in faithfulness between stated and applied behavior. These
findings highlight the practical risk of relying on LLMs in requirements
engineering without human oversight and motivate the need for systematic
approaches to benchmark, interpret, and monitor value alignment in AI-assisted
software development.
The University of Tokyo
Why we think this paper is great for you:
You'll appreciate this paper's focus on ensuring trustworthy and sustainable AI progress by understanding the capabilities and risks of autonomous AI systems. It highlights the importance of oversight in advanced AI applications.
Abstract
Understanding the current capabilities and risks of AI Scientist systems is
essential for ensuring trustworthy and sustainable AI-driven scientific
progress while preserving the integrity of the academic ecosystem. To this end,
we develop Jr. AI Scientist, a state-of-the-art autonomous AI scientist system
that mimics the core research workflow of a novice student researcher: Given
the baseline paper from the human mentor, it analyzes its limitations,
formulates novel hypotheses for improvement, validates them through rigorous
experimentation, and writes a paper with the results. Unlike previous
approaches that assume full automation or operate on small-scale code, Jr. AI
Scientist follows a well-defined research workflow and leverages modern coding
agents to handle complex, multi-file implementations, leading to scientifically
valuable contributions. For evaluation, we conducted automated assessments
using AI Reviewers, author-led evaluations, and submissions to Agents4Science,
a venue dedicated to AI-driven scientific contributions. The findings
demonstrate that Jr. AI Scientist generates papers receiving higher review
scores than existing fully automated systems. Nevertheless, we identify
important limitations from both the author evaluation and the Agents4Science
reviews, indicating the potential risks of directly applying current AI
Scientist systems and key challenges for future research. Finally, we
comprehensively report various risks identified during development. We hope
these insights will deepen understanding of current progress and risks in AI
Scientist development.
Stanford University, USC
Why we think this paper is great for you:
This paper delves into systems for human control of humanoid robots, offering insights into direct human interaction with complex autonomous entities. It explores the practical aspects of human-system interfaces in robotics.
Abstract
Large-scale data has driven breakthroughs in robotics, from language models
to vision-language-action models in bimanual manipulation. However, humanoid
robotics lacks equally effective data collection frameworks. Existing humanoid
teleoperation systems either use decoupled control or depend on expensive
motion capture setups. We introduce TWIST2, a portable, mocap-free humanoid
teleoperation and data collection system that preserves full whole-body control
while advancing scalability. Our system leverages PICO4U VR for obtaining
real-time whole-body human motions, with a custom 2-DoF robot neck (cost around
$250) for egocentric vision, enabling holistic human-to-humanoid control. We
demonstrate long-horizon dexterous and mobile humanoid skills and we can
collect 100 demonstrations in 15 minutes with an almost 100% success rate.
Building on this pipeline, we propose a hierarchical visuomotor policy
framework that autonomously controls the full humanoid body based on egocentric
vision. Our visuomotor policy successfully demonstrates whole-body dexterous
manipulation and dynamic kicking tasks. The entire system is fully reproducible
and open-sourced at https://yanjieze.com/TWIST2 . Our collected dataset is also
open-sourced at https://twist-data.github.io .
Edison Scientific Inc, 1
Why we think this paper is great for you:
This paper discusses the limitations of autonomous AI agents in scientific discovery, implicitly suggesting areas where human guidance or intervention could enhance long-term effectiveness. It provides context for where human oversight becomes critical.
Abstract
Data-driven scientific discovery requires iterative cycles of literature
search, hypothesis generation, and data analysis. Substantial progress has been
made towards AI agents that can automate scientific research, but all such
agents remain limited in the number of actions they can take before losing
coherence, thus limiting the depth of their findings. Here we present Kosmos,
an AI scientist that automates data-driven discovery. Given an open-ended
objective and a dataset, Kosmos runs for up to 12 hours performing cycles of
parallel data analysis, literature search, and hypothesis generation before
synthesizing discoveries into scientific reports. Unlike prior systems, Kosmos
uses a structured world model to share information between a data analysis
agent and a literature search agent. The world model enables Kosmos to
coherently pursue the specified objective over 200 agent rollouts, collectively
executing an average of 42,000 lines of code and reading 1,500 papers per run.
Kosmos cites all statements in its reports with code or primary literature,
ensuring its reasoning is traceable. Independent scientists found 79.4% of
statements in Kosmos reports to be accurate, and collaborators reported that a
single 20-cycle Kosmos run performed the equivalent of 6 months of their own
research time on average. Furthermore, collaborators reported that the number
of valuable scientific findings generated scales linearly with Kosmos cycles
(tested up to 20 cycles). We highlight seven discoveries made by Kosmos that
span metabolomics, materials science, neuroscience, and statistical genetics.
Three discoveries independently reproduce findings from preprinted or
unpublished manuscripts that were not accessed by Kosmos at runtime, while four
make novel contributions to the scientific literature.
Shanghai Jiaotong Univer
Why we think this paper is great for you:
This research focuses on evaluating the problem-solving abilities of AI agents in complex tasks, which often necessitates human-defined metrics or expert assessment. It offers insights into how human evaluation contributes to improving AI performance.
Abstract
Large language model (LLM) agents have exhibited strong problem-solving
competence across domains like research and coding. Yet, it remains
underexplored whether LLM agents can tackle compounding real-world problems
that require a diverse set of tools to complete. Given a broad, heterogeneous
tool repository, LLM agents must not only select appropriate tools based on
task planning analysis but also strategically schedule the execution order to
ensure efficiency. This paper introduces TPS-Bench to benchmark the ability of
LLM agents in solving such problems that demand Tool Planning and Scheduling.
TPS-Bench collects 200 compounding tasks of two difficulty levels, based on a
tool repository containing hundreds of model context protocol (MCP) tools. In
particular, each task is composed of multiple subtasks, such as web search, map
navigation, calendar checking, etc., and each subtask can be completed by a
basic tool. Our evaluation emphasizes both task completion rate and efficiency.
The empirical studies on popular closed-source and open-source LLMs indicate
that most models can perform reasonable tool planning, but differ in
scheduling. For example, GLM-4.5 achieves a superior task completion rate
of 64.72% with extensive sequential tool calls, and therefore suffers from
a significantly longer execution time. By contrast, GPT-4o prioritizes parallel
tool calls but achieves only a 45.08% completion rate. Considering that
reinforcement learning (RL) can be a viable way to improve the scheduling
efficiency without compromising performance, we perform an initial study on
Qwen3-1.7B and observe a 14% reduction in execution time alongside a 6% gain in
task completion rate based on merely 100 RL training samples. Our code is
available at https://github.com/hanwenxu1/mcp-agent.
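The trade-off between sequential and parallel tool calls described in this abstract can be illustrated with a small Python sketch. This is not TPS-Bench code: call_tool is a hypothetical stand-in that only sleeps, the tool names and latencies are made up, and the subtasks are assumed to be independent of one another.

```python
import asyncio
import time


async def call_tool(name: str, latency: float) -> str:
    """Hypothetical tool call; the sleep simulates network or compute latency."""
    await asyncio.sleep(latency)
    return f"{name}: done"


async def run_sequential(tools: dict[str, float]) -> list[str]:
    """Call tools one after another; elapsed time is roughly the sum of latencies."""
    return [await call_tool(name, latency) for name, latency in tools.items()]


async def run_parallel(tools: dict[str, float]) -> list[str]:
    """Call independent tools concurrently; elapsed time is roughly the slowest call."""
    calls = (call_tool(name, latency) for name, latency in tools.items())
    return list(await asyncio.gather(*calls))


def timed(coro) -> tuple[list[str], float]:
    """Run a coroutine to completion and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro)
    return results, time.perf_counter() - start


if __name__ == "__main__":
    # Made-up subtasks with made-up latencies in seconds.
    tools = {"web_search": 0.05, "map_navigation": 0.05, "calendar_check": 0.05}
    seq_results, seq_time = timed(run_sequential(tools))
    par_results, par_time = timed(run_parallel(tools))
    print(f"sequential: {seq_time:.3f}s, parallel: {par_time:.3f}s")
```

Both schedules return the same results in the same order, since asyncio.gather preserves the order of its arguments. Only the elapsed time differs, and parallel execution is valid only when no subtask depends on another's output.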
University of Washington
Why we think this paper is great for you:
This paper addresses the brittleness of AI agents in complex environments, a common challenge that often requires human intervention or feedback to ensure robustness. It highlights scenarios where human input can significantly improve agent reliability.
Abstract
LLM agents excel in compact environments requiring deep reasoning but remain
brittle when operating in broader, more complex contexts that demand robustness
across diverse tools and schemas. Building bespoke environments for training is
heavy, brittle, and limits progress. In this paper, we demonstrate that LLMs
can simulate realistic environment feedback without access to actual testbed
data or APIs. Inspired by this capability, we propose two frameworks:
Simia-SFT, a pipeline that synthesizes SFT data by amplifying small seed sets
into diverse trajectories in an environment-agnostic manner, and Simia-RL, a
framework that enables RL training without real environment implementations
through LLM-simulated feedback. Fine-tuning open models yields consistent
improvements across multiple benchmarks, surpassing GPT-4o and approaching
o4-mini on $\tau^2$-Bench. Together, Simia-SFT and Simia-RL enable scalable
agent training without environment engineering, replacing heavy and brittle
implementations with flexible LLM-based simulation.
AGI: Artificial General Intelligence
Abstract
Geothermal field development typically involves complex processes, each
requiring multidisciplinary expertise. Thus, decision-making
often demands the integration of geological, geophysical, reservoir
engineering, and operational data under tight time constraints. We present
Geothermal Analytics and Intelligent Agent, or GAIA, an AI-based system for
automation and assistance in geothermal field development. GAIA consists of
three core components: GAIA Agent, GAIA Chat, and GAIA Digital Twin, or DT,
which together constitute an agentic retrieval-augmented generation (RAG)
workflow. Specifically, GAIA Agent, powered by a pre-trained large language
model (LLM), designs and manages task pipelines by autonomously querying
knowledge bases and orchestrating multi-step analyses. GAIA DT encapsulates
classical and surrogate physics models, which, combined with built-in
domain-specific subroutines and visualization tools, enable predictive modeling
of geothermal systems. Lastly, GAIA Chat serves as a web-based interface for
users, featuring a ChatGPT-like layout with additional functionalities such as
interactive visualizations, parameter controls, and in-context document
retrieval. To ensure GAIA's specialized capability for handling complex
geothermal-related tasks, we curate a benchmark test set comprising various
geothermal-related use cases, and we rigorously and continuously evaluate the
system's performance. We envision GAIA as a pioneering step toward intelligent
geothermal field development, capable of assisting human experts in
decision-making, accelerating project workflows, and ultimately enabling
automation of the development process.
Deep Learning
City St Georges, Univer
Abstract
Artificial Intelligence (AI) is a powerful new language of science as
evidenced by recent Nobel Prizes in chemistry and physics that recognized
contributions to AI applied to those areas. Yet, this new language lacks
semantics, which makes AI's scientific discoveries unsatisfactory at best. With
the purpose of uncovering new facts but also improving our understanding of the
world, AI-based science requires formalization through a framework capable of
translating insight into comprehensible scientific knowledge. In this paper, we
argue that logic offers an adequate framework. In particular, we use logic in a
neurosymbolic framework to offer a much needed semantics for deep learning, the
neural network-based technology of current AI. Deep learning and neurosymbolic
AI lack a general set of conditions to ensure that desirable properties are
satisfied. Instead, there is a plethora of encoding and knowledge extraction
approaches designed for particular cases. To rectify this, we introduced a
framework for semantic encoding, making explicit the mapping between neural
networks and logic, and characterizing the common ingredients of the various
existing approaches. In this paper, we describe succinctly and exemplify how
logical semantics and neural networks are linked through this framework, we
review some of the most prominent approaches and techniques developed for
neural encoding and knowledge extraction, provide a formal definition of our
framework, and discuss some of the difficulties of identifying a semantic
encoding in practice in light of analogous problems in the philosophy of mind.
VISTAMILK, Dublin City Un
Abstract
Grasslands, constituting the world's second-largest terrestrial carbon sink,
play a crucial role in biodiversity and the regulation of the carbon cycle.
Currently, the Irish dairy sector, a significant economic contributor, grapples
with challenges related to profitability and sustainability. Presently, grass
growth forecasting relies on impractical mechanistic models. In response, we
propose deep learning models tailored for univariate datasets, presenting
cost-effective alternatives. Notably, a temporal convolutional network designed
for forecasting Perennial Ryegrass growth in Cork exhibits high performance,
leveraging historical grass height data with RMSE of 2.74 and MAE of 3.46.
Validation across a comprehensive dataset spanning 1,757 weeks over 34 years
provides insights into optimal model configurations. This study enhances our
understanding of model behavior, thereby improving reliability in grass growth
forecasting and contributing to the advancement of sustainable dairy farming
practices.
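For readers unfamiliar with the two error metrics quoted in this abstract, the following Python sketch shows how RMSE and MAE are conventionally computed. The sample values are invented for illustration and have no connection to the paper's dataset.

```python
import math


def mae(actual: list[float], predicted: list[float]) -> float:
    """Mean absolute error: the average magnitude of the forecast errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual: list[float], predicted: list[float]) -> float:
    """Root mean squared error: square root of the average squared error."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


if __name__ == "__main__":
    # Invented weekly grass-height observations and forecasts.
    actual = [10.0, 12.0, 14.0]
    predicted = [10.0, 12.0, 17.0]
    print("MAE:", mae(actual, predicted))
    print("RMSE:", rmse(actual, predicted))
```

Because RMSE squares each error before averaging, it penalizes a few large misses more heavily than MAE does.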
We did not find much content matching your interests, so we've included some additional popular topics.
Also be aware that if a topic is not present on arXiv, we won't be able to recommend it.
AI Agents
University of Washington
Abstract
LLM agents excel in compact environments requiring deep reasoning but remain
brittle when operating in broader, more complex contexts that demand robustness
across diverse tools and schemas. Building bespoke environments for training is
heavy, brittle, and limits progress. In this paper, we demonstrate that LLMs
can simulate realistic environment feedback without access to actual testbed
data or APIs. Inspired by this capability, we propose two frameworks:
Simia-SFT, a pipeline that synthesizes SFT data by amplifying small seed sets
into diverse trajectories in an environment-agnostic manner, and Simia-RL, a
framework that enables RL training without real environment implementations
through LLM-simulated feedback. Fine-tuning open models yields consistent
improvements across multiple benchmarks, surpassing GPT-4o and approaching
o4-mini on $\tau^2$-Bench. Together, Simia-SFT and Simia-RL enable scalable
agent training without environment engineering, replacing heavy and brittle
implementations with flexible LLM-based simulation.
Shanghai Jiaotong Univer
Abstract
Large language model (LLM) agents have exhibited strong problem-solving
competence across domains like research and coding. Yet, it remains
underexplored whether LLM agents can tackle compounding real-world problems
that require a diverse set of tools to complete. Given a broad, heterogeneous
tool repository, LLM agents must not only select appropriate tools based on
task planning analysis but also strategically schedule the execution order to
ensure efficiency. This paper introduces TPS-Bench to benchmark the ability of
LLM agents in solving such problems that demand Tool Planning and Scheduling.
TPS-Bench collects 200 compounding tasks of two difficulty levels, based on a
tool repository containing hundreds of model context protocol (MCP) tools. In
particular, each task is composed of multiple subtasks, such as web search, map
navigation, calendar checking, etc., and each subtask can be completed by a
basic tool. Our evaluation emphasizes both task completion rate and efficiency.
The empirical studies on popular closed-source and open-source LLMs indicate
that most models can perform reasonable tool planning, but differ in
scheduling. For example, GLM-4.5 achieves a superior task completion rate
of 64.72% with extensive sequential tool calls, and therefore suffers from
a significantly longer execution time. By contrast, GPT-4o prioritizes parallel
tool calls but achieves only a 45.08% completion rate. Considering that
reinforcement learning (RL) can be a viable way to improve the scheduling
efficiency without compromising performance, we perform an initial study on
Qwen3-1.7B and observe a 14% reduction in execution time alongside a 6% gain in
task completion rate based on merely 100 RL training samples. Our code is
available at https://github.com/hanwenxu1/mcp-agent.
AI and Society
KFUPM King Fahd Univeris
Abstract
Large Language Models (LLMs) are increasingly employed in software
engineering tasks such as requirements elicitation, design, and evaluation,
raising critical questions regarding their alignment with human judgments on
responsible AI values. This study investigates how closely LLMs' value
preferences align with those of two human groups: a US-representative sample
and AI practitioners. We evaluate 23 LLMs across four tasks: (T1) selecting key
responsible AI values, (T2) rating their importance in specific contexts, (T3)
resolving trade-offs between competing values, and (T4) prioritizing software
requirements that embody those values. The results show that LLMs generally
align more closely with AI practitioners than with the US-representative
sample, emphasizing fairness, privacy, transparency, safety, and
accountability. However, inconsistencies appear between the values that LLMs
claim to uphold (Tasks 1-3) and the way they prioritize requirements (Task 4),
revealing gaps in faithfulness between stated and applied behavior. These
findings highlight the practical risk of relying on LLMs in requirements
engineering without human oversight and motivate the need for systematic
approaches to benchmark, interpret, and monitor value alignment in AI-assisted
software development.
The University of Tokyo
Abstract
Understanding the current capabilities and risks of AI Scientist systems is
essential for ensuring trustworthy and sustainable AI-driven scientific
progress while preserving the integrity of the academic ecosystem. To this end,
we develop Jr. AI Scientist, a state-of-the-art autonomous AI scientist system
that mimics the core research workflow of a novice student researcher: Given
the baseline paper from the human mentor, it analyzes its limitations,
formulates novel hypotheses for improvement, validates them through rigorous
experimentation, and writes a paper with the results. Unlike previous
approaches that assume full automation or operate on small-scale code, Jr. AI
Scientist follows a well-defined research workflow and leverages modern coding
agents to handle complex, multi-file implementations, leading to scientifically
valuable contributions. For evaluation, we conducted automated assessments
using AI Reviewers, author-led evaluations, and submissions to Agents4Science,
a venue dedicated to AI-driven scientific contributions. The findings
demonstrate that Jr. AI Scientist generates papers receiving higher review
scores than existing fully automated systems. Nevertheless, we identify
important limitations from both the author evaluation and the Agents4Science
reviews, indicating the potential risks of directly applying current AI
Scientist systems and key challenges for future research. Finally, we
comprehensively report various risks identified during development. We hope
these insights will deepen understanding of current progress and risks in AI
Scientist development.
Research Automation with AI
Edison Scientific Inc, 1
Abstract
Data-driven scientific discovery requires iterative cycles of literature
search, hypothesis generation, and data analysis. Substantial progress has been
made towards AI agents that can automate scientific research, but all such
agents remain limited in the number of actions they can take before losing
coherence, thus limiting the depth of their findings. Here we present Kosmos,
an AI scientist that automates data-driven discovery. Given an open-ended
objective and a dataset, Kosmos runs for up to 12 hours performing cycles of
parallel data analysis, literature search, and hypothesis generation before
synthesizing discoveries into scientific reports. Unlike prior systems, Kosmos
uses a structured world model to share information between a data analysis
agent and a literature search agent. The world model enables Kosmos to
coherently pursue the specified objective over 200 agent rollouts, collectively
executing an average of 42,000 lines of code and reading 1,500 papers per run.
Kosmos cites all statements in its reports with code or primary literature,
ensuring its reasoning is traceable. Independent scientists found 79.4% of
statements in Kosmos reports to be accurate, and collaborators reported that a
single 20-cycle Kosmos run performed the equivalent of 6 months of their own
research time on average. Furthermore, collaborators reported that the number
of valuable scientific findings generated scales linearly with Kosmos cycles
(tested up to 20 cycles). We highlight seven discoveries made by Kosmos that
span metabolomics, materials science, neuroscience, and statistical genetics.
Three discoveries independently reproduce findings from preprinted or
unpublished manuscripts that were not accessed by Kosmos at runtime, while four
make novel contributions to the scientific literature.