LLMs for AI Agents

Agentic LLMs for Question Answering over Tabular Data

Rishit Tyagityagirishit2

Abstract
Question Answering over Tabular Data (Table QA) presents unique challenges due to the diverse structure, size, and data types of real-world tables. The SemEval 2025 Task 8 (DataBench) introduced a benchmark composed of large-scale, domain-diverse datasets to evaluate the ability of models to accurately answer structured queries. We propose a Natural Language to SQL (NL-to-SQL) approach leveraging large language models (LLMs) such as GPT-4o, GPT-4o-mini, and DeepSeek v2:16b to generate SQL queries dynamically. Our system follows a multi-stage pipeline involving example selection, SQL query generation, answer extraction, verification, and iterative refinement. Experiments demonstrate the effectiveness of our approach, achieving 70.5\% accuracy on DataBench QA and 71.6\% on DataBench Lite QA, significantly surpassing baseline scores of 26\% and 27\% respectively. This paper details our methodology, experimental results, and alternative approaches, providing insights into the strengths and limitations of LLM-driven Table QA.

AI Insights

Agentic LLMs autonomously select relevant table rows before generating SQL, cutting hallucinations.
The pipeline includes a self‑verification step that re‑executes the query and checks results against a sanity‑check heuristic.
Iterative refinement employs a lightweight LLM loop that rewrites failing queries until a confidence threshold is met.
Experiments show GPT‑4o‑mini matches GPT‑4o performance when guided by the agentic framework, despite its smaller size.
A fine‑grained error taxonomy distinguishes schema‑mapping, aggregation, and join mistakes, guiding targeted improvements.
Cross‑domain transfer tests reveal the model retains 65 % of its accuracy on unseen medical tables, highlighting robustness.
Recommended reading: “SQL Generation with Large Language Models” (ACL 2023) and the DataBench GitHub repository.

👍 👎 ♥ Save

Evaluating LLMs Without Oracle Feedback: Agentic Annotation Evaluation Through Unsupervised Consistency Signals

University of Technology

Abstract
Large Language Models (LLMs), when paired with prompt-based tasks, have significantly reduced data annotation costs and reliance on human annotators. However, evaluating the quality of their annotations remains challenging in dynamic, unsupervised environments where oracle feedback is scarce and conventional methods fail. To address this challenge, we propose a novel agentic annotation paradigm, where a student model collaborates with a noisy teacher (the LLM) to assess and refine annotation quality without relying on oracle feedback. The student model, acting as an unsupervised feedback mechanism, employs a user preference-based majority voting strategy to evaluate the consistency of the LLM outputs. To systematically measure the reliability of LLM-generated annotations, we introduce the Consistent and Inconsistent (CAI) Ratio, a novel unsupervised evaluation metric. The CAI Ratio not only quantifies the annotation quality of the noisy teacher under limited user preferences but also plays a critical role in model selection, enabling the identification of robust LLMs in dynamic, unsupervised environments. Applied to ten open-domain NLP datasets across four LLMs, the CAI Ratio demonstrates a strong positive correlation with LLM accuracy, establishing it as an essential tool for unsupervised evaluation and model selection in real-world settings.

AI Insights

CAI Ratio multiplies consistency count by accuracy on those samples, turning agreement into a quality score.
A student model votes on user‑preferred outputs, turning noisy LLM answers into a self‑checking oracle.
Only two passes of the teacher per dataset are needed, slashing annotation cost while keeping strength.
Across ten open‑domain benchmarks, CAI correlates strongly with downstream accuracy, proving its predictive power.
Majority‑voting is agnostic to task type, making it a universal unsupervised sanity check.
For deeper dives, “Attention Is All You Need” and HuggingFace Transformers illuminate the backbone behind the teacher.
Curious about implementation? Check huggingface/transformers for ready‑to‑run scripts.

AI Agents

👍 👎 ♥ Save

LightAgent: Production-level Open-source Agentic AI Framework

Shanghai University of F

Rate this image: 😍 👍 👎

Abstract
With the rapid advancement of large language models (LLMs), Multi-agent Systems (MAS) have achieved significant progress in various application scenarios. However, substantial challenges remain in designing versatile, robust, and efficient platforms for agent deployment. To address these limitations, we propose \textbf{LightAgent}, a lightweight yet powerful agentic framework, effectively resolving the trade-off between flexibility and simplicity found in existing frameworks. LightAgent integrates core functionalities such as Memory (mem0), Tools, and Tree of Thought (ToT), while maintaining an extremely lightweight structure. As a fully open-source solution, it seamlessly integrates with mainstream chat platforms, enabling developers to easily build self-learning agents. We have released LightAgent at \href{https://github.com/wxai-space/LightAgent}{https://github.com/wxai-space/LightAgent}

AI Insights

LightAgent’s swarm design lets dozens of agents coordinate via one LightSwarm instance, boosting throughput.
Each agent carries a distinct instruction set, enabling domain‑specific roles such as code synthesis or data retrieval.
A built‑in text UI turns user prompts into executable code snippets, streamlining rapid prototyping.
Tree‑of‑Thought logic lets agents iteratively refine plans, cutting hallucinations and improving accuracy.
The lightweight core keeps memory usage under 200 MB on a single GPU while still supporting custom tool plugins.
Advanced features can be daunting for beginners, and highly specialized tasks may still need manual tuning.
LightAgent has been applied to robotics, finance, and healthcare, proving its versatility beyond chat‑bot demos.

👍 👎 ♥ Save

We Need a New Ethics for a World of AI Agents

Abstract
The deployment of capable AI agents raises fresh questions about safety, human-machine relationships and social coordination. We argue for greater engagement by scientists, scholars, engineers and policymakers with the implications of a world increasingly populated by AI agents. We explore key challenges that must be addressed to ensure that interactions between humans and agents, and among agents themselves, remain broadly beneficial.

AI and Society

👍 👎 ♥ Save

AI for Scientific Discovery is a Social Problem

Hugging Face

Rate this image: 😍 👍 👎

Abstract
Artificial intelligence promises to accelerate scientific discovery, yet its benefits remain unevenly distributed. While technical obstacles such as scarce data, fragmented standards, and unequal access to computation are significant, we argue that the primary barriers are social and institutional. Narratives that defer progress to speculative "AI scientists," the undervaluing of data and infrastructure contributions, misaligned incentives, and gaps between domain experts and machine learning researchers all constrain impact. We highlight four interconnected challenges: community dysfunction, research priorities misaligned with upstream needs, data fragmentation, and infrastructure inequities. We argue that their roots lie in cultural and organizational practices. Addressing them requires not only technical innovation but also intentional community-building, cross-disciplinary education, shared benchmarks, and accessible infrastructure. We call for reframing AI for science as a collective social project, where sustainable collaboration and equitable participation are treated as prerequisites for technical progress.

AI Insights

Democratizing advanced cyberinfrastructure unlocks responsible AI research across global labs.
Only 5 % of Africa’s AI talent accesses sufficient compute, underscoring regional inequity.
Pre‑trained transformer models now generate multi‑omics, multi‑species, multi‑tissue samples.
Quantization‑aware training yields efficient neural PDE‑solvers showcased at recent conferences.
The FAIR Guiding Principles guide scientific data stewardship, enhancing reproducibility.
MAGE‑Tab’s spreadsheet‑based format standardizes microarray data for seamless sharing.
Resources like The Human Cell Atlas and pymatgen empower interdisciplinary material‑genomics research.

Research Automation with AI

👍 👎 ♥ Save

The More You Automate, the Less You See: Hidden Pitfalls of AI Scientist Systems

Carnegie Mellon Universt

Abstract
AI scientist systems, capable of autonomously executing the full research workflow from hypothesis generation and experimentation to paper writing, hold significant potential for accelerating scientific discovery. However, the internal workflow of these systems have not been closely examined. This lack of scrutiny poses a risk of introducing flaws that could undermine the integrity, reliability, and trustworthiness of their research outputs. In this paper, we identify four potential failure modes in contemporary AI scientist systems: inappropriate benchmark selection, data leakage, metric misuse, and post-hoc selection bias. To examine these risks, we design controlled experiments that isolate each failure mode while addressing challenges unique to evaluating AI scientist systems. Our assessment of two prominent open-source AI scientist systems reveals the presence of several failures, across a spectrum of severity, which can be easily overlooked in practice. Finally, we demonstrate that access to trace logs and code from the full automated workflow enables far more effective detection of such failures than examining the final paper alone. We thus recommend journals and conferences evaluating AI-generated research to mandate submission of these artifacts alongside the paper to ensure transparency, accountability, and reproducibility.

👍 👎 ♥ Save

Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents

Stanford University

Abstract
We introduce Paper2Agent, an automated framework that converts research papers into AI agents. Paper2Agent transforms research output from passive artifacts into active systems that can accelerate downstream use, adoption, and discovery. Conventional research papers require readers to invest substantial effort to understand and adapt a paper's code, data, and methods to their own work, creating barriers to dissemination and reuse. Paper2Agent addresses this challenge by automatically converting a paper into an AI agent that acts as a knowledgeable research assistant. It systematically analyzes the paper and the associated codebase using multiple agents to construct a Model Context Protocol (MCP) server, then iteratively generates and runs tests to refine and robustify the resulting MCP. These paper MCPs can then be flexibly connected to a chat agent (e.g. Claude Code) to carry out complex scientific queries through natural language while invoking tools and workflows from the original paper. We demonstrate Paper2Agent's effectiveness in creating reliable and capable paper agents through in-depth case studies. Paper2Agent created an agent that leverages AlphaGenome to interpret genomic variants and agents based on ScanPy and TISSUE to carry out single-cell and spatial transcriptomics analyses. We validate that these paper agents can reproduce the original paper's results and can correctly carry out novel user queries. By turning static papers into dynamic, interactive AI agents, Paper2Agent introduces a new paradigm for knowledge dissemination and a foundation for the collaborative ecosystem of AI co-scientists.

AI Insights

Paper2Agent’s six‑step pipeline: locate code, set up environment, discover tutorials, audit execution, extract tools, assemble MCP server.
The orchestrator agent coordinates sub‑agents, ensuring reliable execution of complex workflows across the MCP ecosystem.
An MCP server exposes a paper’s tools via a standardized API, enabling reproducible, production‑ready analysis.
Agents built for AlphaGenome, TISSUE, and Scanpy reproduced original results and answered novel queries.
Generating agents demands high computational resources for large datasets, a noted limitation.

Deep Learning

👍 👎 ♥ Save

Towards Interpretable Deep Neural Networks for Tabular Data

Marburg University

Abstract
Tabular data is the foundation of many applications in fields such as finance and healthcare. Although DNNs tailored for tabular data achieve competitive predictive performance, they are blackboxes with little interpretability. We introduce XNNTab, a neural architecture that uses a sparse autoencoder (SAE) to learn a dictionary of monosemantic features within the latent space used for prediction. Using an automated method, we assign human-interpretable semantics to these features. This allows us to represent predictions as linear combinations of semantically meaningful components. Empirical evaluations demonstrate that XNNTab attains performance on par with or exceeding that of state-of-the-art, black-box neural models and classical machine learning approaches while being fully interpretable.

AI Insights

XNNTab’s sparse autoencoder learns monosemantic dictionary features that map to human‑readable rules.
On the ADULT benchmark, these dictionary features are generated by applying data‑driven rules to age, education, and capital gain.
In the CHURN dataset, rule‑derived dictionary features uncover subtle customer‑attrition signals missed by conventional models.
Empirical tests show XNNTab matches or exceeds black‑box DNNs while providing transparent linear explanations.
The approach depends heavily on training‑data quality, so noisy or biased data can distort dictionary semantics.
Future work may automate rule discovery or use transfer learning to broaden applicability across domains.
The subjectivity in rule selection still poses a challenge for reproducibility and generalization.

👍 👎 ♥ Save

An Interpretable Deep Learning Model for General Insurance Pricing

UNSW Sydney NSW 2052, AU

Abstract
This paper introduces the Actuarial Neural Additive Model, an inherently interpretable deep learning model for general insurance pricing that offers fully transparent and interpretable results while retaining the strong predictive power of neural networks. This model assigns a dedicated neural network (or subnetwork) to each individual covariate and pairwise interaction term to independently learn its impact on the modeled output while implementing various architectural constraints to allow for essential interpretability (e.g. sparsity) and practical requirements (e.g. smoothness, monotonicity) in insurance applications. The development of our model is grounded in a solid foundation, where we establish a concrete definition of interpretability within the insurance context, complemented by a rigorous mathematical framework. Comparisons in terms of prediction accuracy are made with traditional actuarial and state-of-the-art machine learning methods using both synthetic and real insurance datasets. The results show that the proposed model outperforms other methods in most cases while offering complete transparency in its internal logic, underscoring the strong interpretability and predictive capability.

Help us improve your experience!