Papers from 06 to 10 October, 2025

Here are the personalized paper recommendations, sorted by relevance.
Travel Ranking
Abstract
Travel planning is a valuable yet complex task that poses significant challenges even for advanced large language models (LLMs). While recent benchmarks have advanced in evaluating LLMs' planning capabilities, they often fall short in evaluating feasibility, reliability, and engagement of travel plans. We introduce a comprehensive benchmark for travel planning that unifies fine-grained criteria into a single reward, enabling direct comparison of plan quality and seamless integration with reinforcement learning (RL). Our evaluator achieves moderate agreement with travel-expert annotations (60.75%) and outperforms multiple LLM-as-judge baselines. We further release a large-scale dataset of 4,870 queries including 219 real-world, free-form requests for generalization to authentic user intent. Using this benchmark, we conduct extensive experiments across diverse methods and LLMs, including test-time computation, neuro-symbolic approaches, supervised fine-tuning, and RL via GRPO. Across base models, RL generally improves itinerary feasibility over prompt-only and supervised baselines, yielding higher unified reward scores.
Abstract
As the world grows increasingly connected, infectious disease transmission and outbreaks become a pressing global concern for public health officials and policymakers. While policy interventions to contain and prevent the spread of disease have been proposed and implemented, there has been little rigorous quantitative analysis of the effectiveness of such interventions. In this paper, we study the susceptible-infected-recovered (SIR) infection process on a dynamic network model that models two communities with travel between them. In particular, we consider two Erdős–Rényi graphs where edges are dynamically changing based on node travel between the graphs. We characterize the time evolution of the outbreaks in both communities and pin down the time at which the infection first reaches the second community. Finally, we analyze two interventions, social distancing and travel bans, and show that while social distancing is effective at reducing the burden of the disease in the second community, travel bans are not.
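The two-community travel dynamic in this abstract can be illustrated with a toy discrete-time SIR compartment model. This is a mean-field sketch, not the paper's dynamic Erdős–Rényi graph process, and all parameters (`beta`, `gamma`, `travel`) are illustrative:

```python
def sir_two_communities(n=1000, beta=0.3, gamma=0.1, travel=0.01, steps=200):
    """Toy discrete-time SIR for two coupled communities.

    Community 1 starts with one infected individual; infection can
    reach community 2 only through the cross-community travel term.
    (Illustrative mean-field sketch, not the paper's graph process.)
    """
    s = [1 - 1 / n, 1.0]   # susceptible fractions per community
    i = [1 / n, 0.0]       # infected fractions
    r = [0.0, 0.0]         # recovered fractions
    first_hit = None       # first step infection appears in community 2
    for t in range(steps):
        # infectious pressure mixes local and travelling infecteds
        press = [(1 - travel) * i[0] + travel * i[1],
                 (1 - travel) * i[1] + travel * i[0]]
        for c in (0, 1):
            new_inf = beta * s[c] * press[c]
            new_rec = gamma * i[c]
            s[c] -= new_inf
            i[c] += new_inf - new_rec
            r[c] += new_rec
        if first_hit is None and i[1] > 1 / n:
            first_hit = t
    return r, first_hit
```

In this toy, setting `travel=0` from the start keeps community 2 untouched, while lowering `beta` (social distancing) shrinks the final outbreak size in both communities. The sketch only shows the mechanics; the paper's actual finding about the ineffectiveness of travel bans rests on its full dynamic-graph analysis.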
Travel Recommendations
University of Michigan
Abstract
This study presents a novel small-area estimation framework to enhance urban transportation planning through detailed characterization of travel behavior. Our approach improves on the four-step travel model by employing publicly available microdata files and machine learning methods to predict travel behavior for a representative, synthetic population at small geographic areas. This approach enables high-resolution estimation of trip generation, trip distribution, mode choice, and route assignment. Validation using ACS/PUMS work-commute datasets demonstrates that our framework achieves higher accuracy compared to conventional approaches. The resulting granular insights enable the tailoring of interventions to address localized situations and support a range of policy applications and targeted interventions, including the optimal placement of micro-fulfillment centers, effective curb-space management, and the design of more inclusive transportation solutions, particularly for vulnerable communities.
Travel Personalization
Department of Science and
Abstract
Accurate and timely travel information is an asset for enhancing passenger travel experience during normal traffic, and for mitigating discomfort during disruptions. With longer and more frequent disruptions as well as increasing ridership, traffic delays can incur substantial costs for passengers and other transport stakeholders, e.g., operators and infrastructure managers. Such costs can, however, be reduced thanks to effective travel information strategies during traffic disruptions. In this paper, we introduce an evaluation model to assess the value of travel information under different scenarios. Focusing on real-time travel information to train passengers, accessibility benefits are quantified in monetary terms based on historical delay distributions, timing of travel information (pre/on-trip), and ridership. Using a case study from the Swedish railways, the model is showcased and applied to a commuter line in Stockholm. The experimental results indicate individual valuations that are higher than references and savings at the system level of at least 23% of the delay costs. Further testing of the model, e.g., on larger-scale scenarios and including transfer trips, is a possible direction for future work.
AI Insights
  • Real‑time info shifts passenger route choices, cutting last‑minute transfers during disruptions.
  • Passengers value pre‑trip alerts more than on‑trip updates, showing a psychological benefit beyond cost savings.
  • The 23% savings estimate assumes a 10% rise in on‑time arrivals, illustrating system‑wide benefits of timely updates.
  • Small sample size and self‑reported data caution against generalizing findings to national networks.
  • Future work could add transfer‑trip dynamics, potentially boosting the 23% savings figure.
  • Comparative studies show real‑time info can outpace comfort or convenience in passenger valuation, reinforcing this study’s insights.
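The monetization step in this abstract can be sketched as a back-of-the-envelope calculation: expected delay from a historical distribution, multiplied by a value of time and ridership. The numbers and the `avoided_share` parameter below are illustrative assumptions, not the paper's Swedish-railways calibration:

```python
def delay_cost(delays_min, probs, value_per_min, riders):
    """Expected per-departure delay cost, valued in money.

    delays_min / probs: a historical delay distribution (minutes),
    value_per_min: passenger value of time (currency per minute),
    riders: passengers affected per departure.
    All inputs here are illustrative, not the paper's calibration.
    """
    expected_delay = sum(d * p for d, p in zip(delays_min, probs))
    return expected_delay * value_per_min * riders

def info_savings(base_cost, avoided_share):
    """Savings if timely travel information lets passengers avoid a
    share of the delay cost (hypothetical parameter)."""
    return base_cost * avoided_share

# hypothetical commuter-line numbers
cost = delay_cost([0, 5, 15, 60], [0.7, 0.2, 0.08, 0.02], 1.0, 1000)
saved = info_savings(cost, 0.23)  # the paper reports savings of at least 23%
```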

We did not find much content matching your interests, so we've included some additional topics that are popular. Also be aware that if a topic is not present on arXiv, we won't be able to recommend it.

AI Agents
Abstract
Data lakehouses run sensitive workloads, where AI-driven automation raises concerns about trust, correctness, and governance. We argue that API-first, programmable lakehouses provide the right abstractions for safe-by-design, agentic workflows. Using Bauplan as a case study, we show how data branching and declarative environments extend naturally to agents, enabling reproducibility and observability while reducing the attack surface. We present a proof-of-concept in which agents repair data pipelines using correctness checks inspired by proof-carrying code. Our prototype demonstrates that untrusted AI agents can operate safely on production data and outlines a path toward a fully agentic lakehouse.
Aitomatic, Inc.
Abstract
Large language models (LLMs) have empowered AI agents to tackle increasingly complex tasks. However, most existing agents remain limited to static planning and brittle interactions, falling short of true collaboration or adaptive reasoning. We introduce ProSEA, a modular, general-purpose multi-agent framework designed for iterative problem solving through exploration and plan evolution. ProSEA features a hierarchical architecture in which a Manager Agent orchestrates domain-specialized Expert Agents, decomposes tasks, and adaptively replans based on structured feedback from failed attempts. Unlike prior systems, ProSEA agents report not only success or failure but also detailed reasons for failure and newly discovered constraints, enabling dynamic plan refinement informed by exploratory traces. The framework operates autonomously but supports seamless integration with human collaborators when needed. Experiments on the challenging FinanceBench benchmark demonstrate that ProSEA, even without human feedback, outperforms state-of-the-art baselines and achieves robust performance across reasoning-heavy tasks. These results underscore ProSEA's potential as a foundation for more transparent, adaptive, and human-aligned AI agents.
AI and Society
Université de Montréal
Abstract
Artificial intelligence systems increasingly mediate knowledge, communication, and decision making. Development and governance remain concentrated within a small set of firms and states, raising concerns that technologies may encode narrow interests and limit public agency. Capability benchmarks for language, vision, and coding are common, yet public, auditable measures of pluralistic governance are rare. We define AI pluralism as the degree to which affected stakeholders can shape objectives, data practices, safeguards, and deployment. We present the AI Pluralism Index (AIPI), a transparent, evidence-based instrument that evaluates producers and system families across four pillars: participatory governance, inclusivity and diversity, transparency, and accountability. AIPI codes verifiable practices from public artifacts and independent evaluations, explicitly handling "Unknown" evidence to report both lower-bound ("evidence") and known-only scores with coverage. We formalize the measurement model; implement a reproducible pipeline that integrates structured web and repository analysis, external assessments, and expert interviews; and assess reliability with inter-rater agreement, coverage reporting, cross-index correlations, and sensitivity analysis. The protocol, codebook, scoring scripts, and evidence graph are maintained openly with versioned releases and a public adjudication process. We report pilot provider results and situate AIPI relative to adjacent transparency, safety, and governance frameworks. The index aims to steer incentives toward pluralistic practice and to equip policymakers, procurers, and the public with comparable evidence.
AI Insights
  • Imagine model cards closing the AI accountability gap by transparently reporting model behavior.
  • OECD AI Recommendation pushes for human‑centered, explainable, and fair AI.
  • UNESCO Ethics Recommendation embeds human values to turn AI into societal good.
  • HELM from Stanford’s CRFM holistically benchmarks language models on safety and impact.
  • NIST AI RMF offers a risk‑management cycle for responsible AI governance.
  • WCAG 2.2 ensures AI interfaces are accessible to users with disabilities.
  • Krippendorff’s content‑analysis method quantifies stakeholder participation in AI governance.
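The abstract's two scoring views — the lower-bound "evidence" score, the known-only score, and coverage — can be sketched as follows. The practice names and the exact aggregation are hypothetical illustrations, not AIPI's actual codebook:

```python
def pluralism_scores(practices):
    """Score a provider over coded practices.

    practices maps practice -> True (met), False (not met), or
    None ('Unknown' evidence).  Returns the lower-bound 'evidence'
    score (Unknown counted as not met), the known-only score, and
    coverage.  (Illustrative aggregation, not AIPI's codebook.)
    """
    total = len(practices)
    known = [v for v in practices.values() if v is not None]
    met = sum(1 for v in known if v)
    return {
        "evidence": met / total,                        # lower bound
        "known_only": met / len(known) if known else 0.0,
        "coverage": len(known) / total,
    }

# hypothetical coded practices for one provider
scores = pluralism_scores({
    "model_card": True,
    "external_audit": False,
    "appeals_process": None,   # 'Unknown' evidence
    "stakeholder_input": True,
})
```

Reporting both views keeps missing evidence visible: the gap between the "evidence" and known-only scores shrinks as coverage improves.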
Abstract
This is a skeptical overview of the literature on AI consciousness. We will soon create AI systems that are conscious according to some influential, mainstream theories of consciousness but are not conscious according to other influential, mainstream theories of consciousness. We will not be in a position to know which theories are correct and whether we are surrounded by AI systems as richly and meaningfully conscious as human beings or instead only by systems as experientially blank as toasters. None of the standard arguments either for or against AI consciousness takes us far.
Table of Contents
Chapter One: Hills and Fog
Chapter Two: What Is Consciousness? What Is AI?
Chapter Three: Ten Possibly Essential Features of Consciousness
Chapter Four: Against Introspective and Conceptual Arguments for Essential Features
Chapter Five: Materialism and Functionalism
Chapter Six: The Turing Test and the Chinese Room
Chapter Seven: The Mimicry Argument Against AI Consciousness
Chapter Eight: Global Workspace Theories and Higher Order Theories
Chapter Nine: Integrated Information, Local Recurrence, Associative Learning, and Iterative Natural Kinds
Chapter Ten: Does Biological Substrate Matter?
Chapter Eleven: The Problem of Strange Intelligence
Chapter Twelve: The Leapfrog Hypothesis and the Social Semi-Solution
Research Automation with AI
York University, North YO
Abstract
In the digital era, the exponential growth of scientific publications has made it increasingly difficult for researchers to efficiently identify and access relevant work. This paper presents an automated framework for research article classification and recommendation that leverages Natural Language Processing (NLP) techniques and machine learning. Using a large-scale arXiv.org dataset spanning more than three decades, we evaluate multiple feature extraction approaches (TF–IDF, Count Vectorizer, Sentence-BERT, USE, Mirror-BERT) in combination with diverse machine learning classifiers (Logistic Regression, SVM, Naïve Bayes, Random Forest, Gradient Boosted Trees, and k-Nearest Neighbour). Our experiments show that Logistic Regression with TF–IDF consistently yields the best classification performance, achieving an accuracy of 69%. To complement classification, we incorporate a recommendation module based on the cosine similarity of vectorized articles, enabling efficient retrieval of related research papers. The proposed system directly addresses the challenge of information overload in digital libraries and demonstrates a scalable, data-driven solution to support literature discovery.
AI Insights
  • Hybrid ensemble of Logistic Regression, SVM, and Random Forest boosts accuracy beyond single models!
  • Cross‑dataset validation on arXiv, PubMed, and CiteSeer demonstrates robust generalizability.
  • User‑feedback loops enable adaptive re‑ranking, refining recommendations over time!
  • Word2Vec and GloVe embeddings enrich semantic vectors, improving classification precision.
  • Deep‑learning extraction of patent semantics showcases the framework’s extensibility!
  • The study omits bias analysis and detailed preprocessing, highlighting future research gaps.
  • Recommended reading: LDA for topic modeling and the WebFind tool for global paper discovery.
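The recommendation module the abstract describes — cosine similarity over TF-IDF-vectorized articles — can be sketched in a few lines. This is a from-scratch toy for illustration (real systems use library implementations), and the IDF smoothing used below is one common variant, assumed here rather than taken from the paper:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dicts) for a tiny corpus.
    Uses the smoothed variant idf = log(n/df) + 1 (an assumption)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: tf[t] / len(toks) * idf[t] for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(query_idx, vecs, k=2):
    """Rank the other articles by similarity to the query article."""
    sims = [(cosine(vecs[query_idx], v), j)
            for j, v in enumerate(vecs) if j != query_idx]
    return [j for _, j in sorted(sims, reverse=True)[:k]]
```

For example, with `docs = ["deep learning for vision", "deep learning for text", "baking bread at home"]`, `recommend(0, tfidf_vectors(docs), k=1)` returns the index of the other deep-learning article.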
University of Illinois at
Abstract
Automatic research with Large Language Models (LLMs) is rapidly gaining importance, driving the development of increasingly complex workflows involving multi-agent systems, planning, tool usage, code execution, and human-agent interaction to accelerate research processes. However, as more researchers and developers begin to use and build upon these tools and platforms, the complexity and difficulty of extending and maintaining such agentic workflows have become a significant challenge, particularly as algorithms and architectures continue to advance. To address this growing complexity, TinyScientist identifies the essential components of the automatic research workflow and proposes an interactive, extensible, and controllable framework that easily adapts to new tools and supports iterative growth. We provide an open-source codebase, an interactive web demonstration, and a PyPI Python package to make state-of-the-art auto-research pipelines broadly accessible to every researcher and developer.
AI Insights
  • TinyScientist’s “checker” can automatically assess task risk, flagging potential safety issues before execution.
  • Its “drawer” component produces ML‑centric diagrams on the fly, easing visual communication in papers.
  • The framework ships with evaluation rubrics that score content richness, reference quality, clarity, depth, and completeness on a 1‑5 scale.
  • A full ML pipeline—data collection, cleaning, feature engineering, training, evaluation, deployment—is built into the system for end‑to‑end reproducibility.
  • The paper cites meta‑learning advances such as Neural Tangent Kernel methods and memory‑augmented networks, highlighting their cross‑domain success.
  • Users should note that the checker’s risk scores can be imperfect and the generated diagrams may need manual tweaking.
  • TinyScientist’s open‑source Python package and interactive web demo make state‑of‑the‑art auto‑research pipelines accessible to all.
AGI: Artificial General Intelligence
Princeton University
Abstract
Today's AI models learn primarily through mimicry and sharpening, so it is not surprising that they struggle to solve problems beyond the limits set by existing data. To solve novel problems, agents should acquire skills for exploring and learning through experience. Finding a scalable learning mechanism for developing agents that learn through interaction remains a major open problem. In this work, we introduce BuilderBench, a benchmark to accelerate research into agent pre-training that centers open-ended exploration. BuilderBench requires agents to learn how to build any structure using blocks. BuilderBench is equipped with (1) a hardware accelerated simulator of a robotic agent interacting with various physical blocks, and (2) a task-suite with over 42 diverse target structures that are carefully curated to test an understanding of physics, mathematics, and long-horizon planning. During training, agents have to explore and learn general principles about the environment without any external supervision. During evaluation, agents have to build the unseen target structures from the task suite. Solving these tasks requires a sort of embodied reasoning that is not reflected in words but rather in actions, experimenting with different strategies and piecing them together. Our experiments show that many of these tasks challenge the current iteration of algorithms. Hence, we also provide a "training wheels" protocol, in which agents are trained and evaluated to build a single target structure from the task suite. Finally, we provide single-file implementations of six different algorithms as a reference point for researchers.
AI Insights
  • BuilderBench tasks explicitly probe a gripper’s pick‑and‑place precision, sequential logic, and packing‑problem solving in a physics‑rich simulation.
  • The benchmark includes scaffolding challenges that force agents to build temporary support structures for stability.
  • Adaptive decision‑making is tested by varying block configurations, compelling agents to react to changing environments.
  • The platform supplies a full toolchain for task creation, simulation, and performance analysis, enabling rapid prototyping.
  • Recommended reading: “Robotics: Modelling, Planning and Control” and surveys on robot learning from demonstration for foundational theory.
  • Key literature: “Learning to Grasp and Manipulate Objects with a Robotic Hand” and “Building Support Structures with a Robotic Gripper” provide state‑of‑the‑art methods.
Deep Learning
Abstract
Recent advances in machine learning such as Long Short-Term Memory (LSTM) models and Transformers have been widely adopted in hydrological applications, demonstrating impressive performance amongst deep learning models and outperforming physical models in various tasks. However, their superiority in predicting land surface states such as terrestrial water storage (TWS) that are dominated by many factors such as natural variability and human driven modifications remains unclear. Here, using the open-access, globally representative HydroGlobe dataset - comprising a baseline version derived solely from a land surface model simulation and an advanced version incorporating multi-source remote sensing data assimilation - we show that linear regression is a robust benchmark, outperforming the more complex LSTM and Temporal Fusion Transformer for TWS prediction. Our findings highlight the importance of including traditional statistical models as benchmarks when developing and evaluating deep learning models. Additionally, we emphasize the critical need to establish globally representative benchmark datasets that capture the combined impact of natural variability and human interventions.
University of Hamburg
Abstract
In 2017, Hanin and Sellke showed that the class of arbitrarily deep, real-valued, feed-forward and ReLU-activated networks of width w forms a dense subset of the space of continuous functions on R^n, with respect to the topology of uniform convergence on compact sets, if and only if w>n holds. To show the necessity, a concrete counterexample function f:R^n->R was used. In this note we approximate this very f by neural networks in the two cases w=n and w=n+1 around the aforementioned threshold. We study how the approximation quality behaves as we vary the depth and which effect (spoiler alert: dying neurons) causes that behavior.
AI Insights
  • Depth lowers error until dying ReLU forces a constant output, even when width equals input dimension.
  • With width n+1, deeper nets keep improving, showing w>n is not a hard limit.
  • Minimal‑width ReLU nets can approximate any continuous function, confirming Hanin & Sellke’s theorem.
  • The constant N0≡1/8 is the best uniform approximator for the counterexample, achieving error 1/8 for all depths.
  • Experiments show the depth‑benefit plateau occurs earlier in higher dimensions due to dying neurons.
  • Beise et al.’s decision‑region analysis explains constant outputs in narrow deep nets.
  • Bresler & Nagaraj’s sharp representation theorems give a depth‑dependence framework matching the results.
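The dying-ReLU collapse the note studies can already be seen in the simplest case n = 1, w = 1: a width-one chain of ReLU layers on R. The weights below are hand-picked for illustration, not the note's trained networks:

```python
def relu(x):
    return max(0.0, x)

def narrow_net(x, layers):
    """Width-1 ReLU chain on R: each layer (a, b) maps h -> relu(a*h + b)."""
    h = x
    for a, b in layers:
        h = relu(a * h + b)
    return h

# The middle layer has a < 0 and b = 0, so it maps every non-negative
# activation to 0: from there on, the output no longer depends on x.
dead = [(1.0, 0.0), (-1.0, 0.0), (2.0, 0.5)]
outputs = {narrow_net(x, dead) for x in (-2.0, -0.5, 0.0, 1.0, 3.0)}
# every input collapses to the same constant value
```

Once a layer's pre-activation is non-positive for all inputs of interest, the rest of the network sees a constant, which is the mechanism behind the depth-benefit plateau at width w = n; the insights above note that one extra unit (w = n + 1) lets deeper nets keep improving.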

Interests not found

We did not find any papers matching the interests below. Try other terms, and consider whether the content exists on arxiv.org.
  • Travel
  • Travel Itinerary Creation
  • Travel Planning
  • Travel Search
  • Travel Industry
You can edit or add more interests any time.
