Hi!

Your personalized paper recommendations for 19 to 23 January, 2026.

Towards Execution-Grounded Automated AI Research

Stanford University

Rate paper: 👍 👎 ♥ Save

AI Insights

LLM: Large Language Model RL: Reinforcement Learning ML engineering tasks: Machine learning tasks that heavily depend on feature engineering and hyper-parameter tuning rather than algorithm development. (ML: 0.97)👍👎
The paper demonstrates the feasibility and potential of automated execution feedback loops in LLM research problems, but highlights remaining limitations that need to be addressed. (ML: 0.96)👍👎
Execution grounding for code: The idea of learning from execution feedback in the code generation domain. (ML: 0.96)👍👎
Future work should focus on improving generalizability testing, exploring richer learning signals from execution trajectories, developing more capable execution agents, and incorporating alternative metrics such as idea novelty and interestingness. (ML: 0.95)👍👎
They find that models tend to converge on simple ideas to improve the average reward but lose diversity and do not improve the upper-bound. (ML: 0.95)👍👎
The paper presents a large-scale parallel executor for automatically executing model-generated ideas to verify their effectiveness on open-ended LLM research problems. (ML: 0.92)👍👎
The authors analyze the effectiveness of execution-guided evolutionary search and reinforcement learning with execution rewards. (ML: 0.86)👍👎
The paper highlights the limitations of current experiments, including a lack of generalizability testing, limited exploration incentives in RL objectives, and noise in the reward signal due to the execution agent's capabilities. (ML: 0.84)👍👎

Abstract
Automated AI research holds great potential to accelerate scientific discovery. However, current LLMs often generate plausible-looking but ineffective ideas. Execution grounding may help, but it is unclear whether automated execution is feasible and whether LLMs can learn from the execution feedback. To investigate these, we first build an automated executor to implement ideas and launch large-scale parallel GPU experiments to verify their effectiveness. We then convert two realistic research problems - LLM pre-training and post-training - into execution environments and demonstrate that our automated executor can implement a large fraction of the ideas sampled from frontier LLMs. We analyze two methods to learn from the execution feedback: evolutionary search and reinforcement learning. Execution-guided evolutionary search is sample-efficient: it finds a method that significantly outperforms the GRPO baseline (69.4% vs 48.0%) on post-training, and finds a pre-training recipe that outperforms the nanoGPT baseline (19.7 minutes vs 35.9 minutes) on pre-training, all within just ten search epochs. Frontier LLMs often generate meaningful algorithmic ideas during search, but they tend to saturate early and only occasionally exhibit scaling trends. Reinforcement learning from execution reward, on the other hand, suffers from mode collapse. It successfully improves the average reward of the ideator model but not the upper-bound, due to models converging on simple ideas. We thoroughly analyze the executed ideas and training dynamics to facilitate future efforts towards execution-grounded automated AI research.

Why we are recommending this paper?
Because research automation with ai is a popular topic and you have less than 3 interests with available recommendations

This paper directly addresses the need for practical advancements in automated AI research, a key area for career development within data science. Exploring execution grounding offers valuable insights into how LLMs can be effectively utilized for accelerating scientific discovery, aligning with your interest in data science career paths.

APEX-Agents

Mercor

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

McNemar's exact test: A statistical test used to compare the performance of two related samples. (ML: 0.97)👍👎
Pass@1: The proportion of tasks completed correctly by an agent. (ML: 0.95)👍👎
Significance tests using McNemar's exact test with Benjamini-Hochberg correction show that Kimi-K2-Thinking significantly outperforms Gemini-3-flash-preview (p=5.68e-23), GPT-OSS-120B (p=1.0000), and GPT-5.2 (p=7.29e-10). (ML: 0.95)👍👎
The APEX–Agents benchmark highlights the importance of developing AI models that can perform complex tasks in various professional domains, with a focus on toolbelt approaches, context window management, and intentional termination. (ML: 0.94)👍👎
Benjamini-Hochberg correction: A method for controlling false discovery rate in multiple testing. (ML: 0.94)👍👎
The APEX–Agents benchmark is a comprehensive evaluation of AI models' ability to perform complex tasks in various professional domains. (ML: 0.93)👍👎
The most frequently used tools by agents are code execution (256,000), add tool to the toolbelt (200,000), list files in the file system (163,874), read spreadsheet tab (127,000), and search the PDF (86,000). (ML: 0.93)👍👎
The benchmark consists of 227 tasks, covering finance, law, and management consulting, with each task requiring the model to complete a specific task using a set of provided tools. (ML: 0.89)👍👎
The top-performing models on the APEX–Agents benchmark are Gemini 3 Flash, GPT-5.2, and Kimi K2 Thinking, with Pass@1 scores of 0.555, 0.497, and 0.391 respectively. (ML: 0.88)👍👎
ReAct paradigm: A toolbelt approach where reasoning and acting are interleaved in a single loop. (ML: 0.79)👍👎

Abstract
We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. APEX-Agents requires agents to navigate realistic work environments with files and tools. We test eight agents for the leaderboard using Pass@1. Gemini 3 Flash (Thinking=High) achieves the highest score of 24.0%, followed by GPT-5.2 (Thinking=High), Claude Opus 4.5 (Thinking=High), and Gemini 3 Pro (Thinking=High). We open source the APEX-Agents benchmark (n=480) with all prompts, rubrics, gold outputs, files, and metadata. We also open-source Archipelago, our infrastructure for agent execution and evaluation.

Why we are recommending this paper?
Because ai agents is a popular topic and you have less than 3 interests with available recommendations

APEX-Agents focuses on evaluating AI agents across complex, real-world tasks, mirroring the challenges and opportunities within data science automation. Understanding how agents navigate diverse problems is crucial for developing effective data science solutions, aligning with your career interests.

DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems

Hong Kong Polytechnic University

Rate paper: 👍 👎 ♥ Save

AI Insights

The task granularity is flexible, and every reasoning chain must start from the raw data or a logically prior step. (ML: 0.97)👍👎
The instructions may be too complex or detailed for some users, potentially leading to confusion. (ML: 0.95)👍👎
The provided Jupyter Notebook content is a template for generating data science questions based on an answered notebook. (ML: 0.95)👍👎
QRA: Question-Reasoning-Answer triplet JSON: JavaScript Object Notation Generating high-quality data science questions based on an answered notebook requires careful analysis and adherence to specific guidelines. (ML: 0.94)👍👎
The output format requires a valid JSON object with specific keys such as 'data_type', 'domain', 'task_type', 'language', 'question', 'reasoning', 'answer', 'best_score (Optional)', and 'confidence'. (ML: 0.89)👍👎
The final output must be a valid JSON object with the specified structure. (ML: 0.82)👍👎
The instructions provide detailed guidelines for generating QRA triplets, including the importance of not mentioning the notebook and ensuring diversity across task types. (ML: 0.79)👍👎
The output format must conform to a valid JSON object with specified keys, ensuring that the generated QRA triplets are accurate and comprehensive. (ML: 0.77)👍👎

Abstract
Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multiple taxonomies and lack standard answers, poses a significant challenge for evaluation. To address this, we introduce DSAEval, a benchmark comprising 641 real-world data science problems grounded in 285 diverse datasets, covering both structured and unstructured data (e.g., vision and text). DSAEval incorporates three distinctive features: (1) Multimodal Environment Perception, which enables agents to interpret observations from multiple modalities including text and vision; (2) Multi-Query Interactions, which mirror the iterative and cumulative nature of real-world data science projects; and (3) Multi-Dimensional Evaluation, which provides a holistic assessment across reasoning, code, and results. We systematically evaluate 11 advanced agentic LLMs using DSAEval. Our results show that Claude-Sonnet-4.5 achieves the strongest overall performance, GPT-5.2 is the most efficient, and MiMo-V2-Flash is the most cost-effective. We further demonstrate that multimodal perception consistently improves performance on vision-related tasks, with gains ranging from 2.04% to 11.30%. Overall, while current data science agents perform well on structured data and routine data anlysis workflows, substantial challenges remain in unstructured domains. Finally, we offer critical insights and outline future research directions to advance the development of data science agents.

Why we are recommending this paper?
Due to your Interest in Data Career Path

This work directly tackles the evaluation of data science agents, a critical area for understanding and deploying AI in data science workflows. It's a foundational paper in the emerging field of data science agents, which is highly relevant to your career exploration.

Creativity in the Age of AI: Rethinking the Role of Intentional Agency

University of Amsterdam

Rate paper: 👍 👎 ♥ Save

AI Insights

The consistency requirement proposed by the authors is not just statistical frequency but having context-relative grounds for expecting further outputs of comparable novelty and value. (ML: 0.97)👍👎
The concept of creativity should remain flexible across different domains of creativity, and the indeterminacy of the consistency requirement allows for this flexibility. (ML: 0.96)👍👎
The consistency requirement proposed by the authors is a more inclusive and functional approach to defining creativity, allowing for non-human natural processes to be labelled 'creative'. (ML: 0.96)👍👎
The consistency requirement proposed by the authors may not be applicable in all contexts, especially where authenticity conditions the value of the products being generated or examined. (ML: 0.94)👍👎
The IAC has functional value in specific local contexts, such as cognitive science, jurisprudence, and certain domains of creative practice where authenticity conditions the value of the products being generated or examined. (ML: 0.94)👍👎
New Standard Definition (NSD) of Creativity: An object is creative if it is novel, valuable, and the product of a system that can consistently generate novel and valuable objects. (ML: 0.92)👍👎
The NSD states that an object is creative if it is novel, valuable, and the product of a system that can consistently generate novel and valuable objects. (ML: 0.91)👍👎
The article proposes a new standard definition (NSD) of creativity, which drops the intentional agency condition (IAC) as a necessary condition of creativity. (ML: 0.89)👍👎
The article does not provide a comprehensive account of where the IAC ought to be applied. (ML: 0.89)👍👎
The IAC should be excluded from our definition of the genus of creativity but retained as a means of distinguishing between certain species of creativity. (ML: 0.89)👍👎
Intentional Agency Condition (IAC): A necessary condition of creativity that requires an agent to intentionally endeavor to express themselves. (ML: 0.82)👍👎

Abstract
Many theorists of creativity maintain that intentional agency is a necessary condition of creativity. We argue that this requirement, which we call the Intentional Agency Condition (IAC), should be rejected as a general condition of creativity, while retaining its relevance in specific contexts. We show that recent advances in generative AI have rendered the IAC increasingly problematic, both descriptively and functionally. We offer two reasons for abandoning it at the general level. First, we present corpus evidence indicating that authors and journalists are increasingly comfortable ascribing creativity to generative AI, despite its lack of intentional agency. This development places pressure on the linguistic intuitions that have traditionally been taken to support the IAC. Second, drawing on the method of conceptual engineering, we argue that the IAC no longer fulfils its core social function. Rather than facilitating the identification and encouragement of reliable sources of novel and valuable products, it now feeds into biases that distort our assessments of AI-generated outputs. We therefore propose replacing the IAC with a consistency requirement, according to which creativity tracks the reliable generation of novel and valuable products. Nonetheless, we explain why the IAC should be retained in specific local domains.

Why we are recommending this paper?
Because ai and society is a popular topic and you have less than 3 interests with available recommendations

This paper delves into the fundamental question of agency in AI systems, a topic increasingly important as AI becomes more integrated into creative processes. Exploring the role of intentionality offers valuable context for understanding the future of AI and its impact on data science applications.

Emergent, not Immanent: A Baradian Reading of Explainable AI

Sony

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

The paper concludes that current XAI methods are based on flawed assumptions and lack a clear understanding of the relationship between humans and machines. (ML: 0.98)👍👎
Apparatuses: The technical tools, methods, and narratives that constitute what is made intelligible and what is excluded from intelligibility in XAI practices. (ML: 0.97)👍👎
The paper critiques the current state of Explainable AI (XAI) methods, arguing that they are based on flawed assumptions and lack a clear understanding of the relationship between humans and machines. (ML: 0.97)👍👎
The paper highlights the limitations of current XAI methods, including their reliance on simplifications and abstractions that erase the original system, and their failure to account for human-machine incommensurability. (ML: 0.96)👍👎
The authors propose an agential realist approach to XAI, which views interpretation as a relational co-production of interpretable phenomena through intra-actions between human and non-human agencies. (ML: 0.96)👍👎
Agential cut: The moment at which an interpretive apparatus enacts a relational co-production of interpretable phenomena through intra-actions between human and non-human agencies. (ML: 0.96)👍👎
Agential realism: A philosophical framework that views knowledge as an intra-action between human and non-human agencies. (ML: 0.94)👍👎
Intra-action: The process by which human and non-human agencies co-produce interpretable phenomena through their entanglements. (ML: 0.92)👍👎
The authors suggest that a diffractive optic offers a more philosophically robust reading of XAI practices, one that acknowledges the emergent nature of interpretation and the importance of situated contexts. (ML: 0.90)👍👎
This approach challenges the dominant reflectivity and refractivity optics in XAI, which assume that meaning pre-exists the practices and beings that produce it. (ML: 0.75)👍👎

Abstract
Explainable AI (XAI) is frequently positioned as a technical problem of revealing the inner workings of an AI model. This position is affected by unexamined onto-epistemological assumptions: meaning is treated as immanent to the model, the explainer is positioned outside the system, and a causal structure is presumed recoverable through computational techniques. In this paper, we draw on Barad's agential realism to develop an alternative onto-epistemology of XAI. We propose that interpretations are material-discursive performances that emerge from situated entanglements of the AI model with humans, context, and the interpretative apparatus. To develop this position, we read a comprehensive set of XAI methods through agential realism and reveal the assumptions and limitations that underpin several of these methods. We then articulate the framework's ethical dimension and propose design directions for XAI interfaces that support emergent interpretation, using a speculative text-to-music interface as a case study.

Why we are recommending this paper?
Because ai and society is a popular topic and you have less than 3 interests with available recommendations

This paper examines the underlying assumptions of Explainable AI (XAI), a critical area for understanding the technical challenges and limitations of AI systems. Considering these foundational questions is essential for navigating the evolving landscape of AI and its applications within data science.

LLM-in-Sandbox Elicits General Agentic Intelligence

Renmin University of China

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

Agentic capabilities: Fundamental skills like exploration, tool use, and self-verification. (ML: 0.96)👍👎
Current results have limitations, such as generated videos being limited to simple animations and composed music lacking expressiveness and creativity. (ML: 0.95)👍👎
The agentic capability benchmark provided by LLM-in-Sandbox can be used to evaluate models' ability to leverage computational environments. (ML: 0.94)👍👎
Strong LLMs exhibit emergent capabilities to leverage the sandbox environment for general tasks. (ML: 0.92)👍👎
LLM-in-Sandbox can be used as an agentic capability benchmark, measuring fundamental skills like exploration, tool use, and self-verification. (ML: 0.91)👍👎
The metric ∆=LLM-in-Sandbox−LLM offers a meaningful indicator of a model's ability to leverage computational environments. (ML: 0.90)👍👎
LLM-in-Sandbox has the potential to become the default paradigm for serving LLMs, enabling them to perform general tasks and produce actual outputs rather than text descriptions. (ML: 0.88)👍👎
LLM-in-Sandbox: A paradigm that grants LLMs access to a virtual computer and enables them to leverage this environment for general tasks. (ML: 0.85)👍👎
Sandbox-native model training: Training models to interact with the sandbox environment as a first-class objective. (ML: 0.82)👍👎
LLM-in-Sandbox is a paradigm that grants LLMs access to a virtual computer and enables them to leverage this environment for general tasks. (ML: 0.80)👍👎

Abstract
We introduce LLM-in-Sandbox, enabling LLMs to explore within a code sandbox (i.e., a virtual computer), to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs, without additional training, exhibit generalization capabilities to leverage the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, leverage the file system to handle long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be enhanced through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which uses only non-agentic data to train models for sandbox exploration. Experiments demonstrate that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization spanning mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze LLM-in-Sandbox's efficiency from computational and system perspectives, and open-source it as a Python package to facilitate real-world deployment.

Why we are recommending this paper?
Because ai agents is a popular topic and you have less than 3 interests with available recommendations

AI Agents vs. Human Investigators: Balancing Automation, Security, and Expertise in Cyber Forensic Analysis

Florida Institute of Technology

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

However, AI also has its limitations and challenges, including the issue of impostor bias, where AI systems may mistakenly identify a legitimate file or activity as malicious. (ML: 0.96)👍👎
Limited accuracy: AI systems may not always accurately identify malicious files or activities. (ML: 0.96)👍👎
Impostor bias: AI systems may mistakenly identify a legitimate file or activity as malicious. (ML: 0.94)👍👎
To address these challenges, researchers are working on developing more accurate and reliable AI systems for digital forensics. (ML: 0.93)👍👎
To address these challenges, researchers are working on developing more accurate and reliable AI systems for digital forensics. (ML: 0.93)👍👎
Artificial Intelligence (AI): A type of computer system that can perform tasks that would typically require human intelligence, such as learning, problem-solving, and decision-making. (ML: 0.92)👍👎
Digital Forensics: The process of collecting, analyzing, and preserving evidence related to cybercrime and other digital crimes. (ML: 0.92)👍👎
The use of artificial intelligence (AI) in digital forensics is becoming increasingly important as cybercrime continues to grow. (ML: 0.92)👍👎
The use of AI in digital forensics is becoming increasingly important, but it also has its limitations and challenges. (ML: 0.90)👍👎
The use of AI in digital forensics is becoming increasingly important, but it also has its limitations and challenges. (ML: 0.90)👍👎

Abstract
In an era where cyber threats are rapidly evolving, the reliability of cyber forensic analysis has become increasingly critical for effective digital investigations and cybersecurity responses. AI agents are being adopted across digital forensic practices due to their ability to automate processes such as anomaly detection, evidence classification, and behavioral pattern recognition, significantly enhancing scalability and reducing investigation timelines. However, the characteristics that make AI indispensable also introduce notable risks. AI systems, often trained on biased or incomplete datasets, can produce misleading results, including false positives and false negatives, thereby jeopardizing the integrity of forensic investigations. This study presents a meticulous comparative analysis of the effectiveness of the most used AI agent, ChatGPT, and human forensic investigators in the realm of cyber forensic analysis. Our research reveals critical limitations within AI-driven approaches, demonstrating scenarios in which sophisticated or novel cyber threats remain undetected due to the rigid pattern-based nature of AI systems. Conversely, our analysis highlights the crucial role that human forensic investigators play in mitigating these risks. Through adaptive decision-making, ethical reasoning, and contextual understanding, human investigators effectively identify subtle anomalies and threats that may evade automated detection systems. To reinforce our findings, we conducted comprehensive reliability testing of forensic techniques using multiple cyber threat scenarios. These tests confirmed that while AI agents significantly improve the efficiency of routine analyses, human oversight remains crucial in ensuring accuracy and comprehensiveness of the results.

Why we are recommending this paper?
Because research automation with ai is a popular topic and you have less than 3 interests with available recommendations

Benchmarking Deep Learning Models for Raman Spectroscopy Across Open-Source Datasets

Purdue University

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

The authors acknowledge that their study is limited by its reliance on a small number of datasets. (ML: 0.99)👍👎
Domain shift: a phenomenon where the distribution of data in the training set differs from that of the testing set. (ML: 0.98)👍👎
The development of DSCF highlights the need for large-scale and diverse training data. (ML: 0.97)👍👎
Recent works have proposed unsupervised domain adaptation frameworks, but their effectiveness beyond the originally reported datasets are yet to be independently evaluated. (ML: 0.95)👍👎
The results of this benchmarking experiment have shown that classifying test samples that are in-distribution to the training dataset is significantly easier than test samples suffering from distribution shift due to changes in instruments and acquisition conditions, and additional contaminants. (ML: 0.94)👍👎
Foundation model: a pre-trained model that can be fine-tuned for specific tasks, often using transfer learning. (ML: 0.92)👍👎
SANet demonstrated the best overall performance across the datasets. (ML: 0.84)👍👎
The study benchmarks only five architectures and relies on minimal spectral pre-processing. (ML: 0.77)👍👎
Existing open-source Raman datasets are often restricted in size, chemical diversity or experimental variability. (ML: 0.67)👍👎
Creating large, curated experimental Raman spectral datasets that span multiple instruments, materials and measurement settings is key to developing a Raman-specific foundation model. (ML: 0.61)👍👎
Raman spectroscopy: a technique used to analyze the vibrational modes of molecules. (ML: 0.52)👍👎

Abstract
Deep learning classifiers for Raman spectroscopy are increasingly reported to outperform classical chemometric approaches. However their evaluations are often conducted in isolation or compared against traditional machine learning methods or trivially adapted vision-based architectures that were not originally proposed for Raman spectroscopy. As a result, direct comparisons between existing deep learning models developed specifically for Raman spectral analysis on shared open-source datasets remain scarce. To the best of our knowledge, this study presents one of the first systematic benchmarks comparing three or more published Raman-specific deep learning classifiers across multiple open-source Raman datasets. We evaluate five representative deep learning architectures under a unified training and hyperparameter tuning protocol across three open-source Raman datasets selected to support standard evaluation, fine-tuning, and explicit distribution-shift testing. We report classification accuracies and macro-averaged F1 scores to provide a fair and reproducible comparison of deep learning models for Raman spectra based classification.

Why we are recommending this paper?
Because deep learning is a popular topic and you have less than 3 interests with available recommendations

Deep Learning Approaches to Quantum Error Mitigation

Quantinuum Ltd

Rate paper: 👍 👎 ♥ Save

AI Insights

L1 Relative Change (L1RC): A measure of the difference between two probability distributions. (ML: 0.98)👍👎
Signal-to-Noise Ratio (SNR): The ratio of the signal power to the noise power in a system. (ML: 0.93)👍👎
However, on Real Pauli data the advantage clearly shifts toward the ML-based models, which outperform all baselines in both median L1 relative change and fraction of improved circuits. (ML: 0.93)👍👎
Deep learning models can learn corrections directly from data gathered during circuit runs, more easily capturing correlations. (ML: 0.88)👍👎
The best performing models are comparable to the best baseline methods on Simulated data (both Pauli and Random). (ML: 0.87)👍👎
It is defined as the L1 norm of the difference between the two distributions. (ML: 0.87)👍👎
The learned mapping from P noisy and circuit features to P ideal captures a richer structure that goes beyond coarse depolarization or measurement-error mitigation. (ML: 0.81)👍👎
The PERCEIVER model consistently achieves as good or greater median performance than the baseline mitigation techniques for Pauli circuits. (ML: 0.80)👍👎
The deep learning approaches can generalize across noise regimes, device generations, and circuit families without relying on a predefined noise model. (ML: 0.79)👍👎
The deep learning approaches can generalize across noise regimes, device generations, and circuit families without relying on a predefined noise model. (ML: 0.79)👍👎
The baseline methods retain value as lightweight, interpretable mitigation techniques, particularly for structured, low-depth circuits. (ML: 0.61)👍👎

Abstract
We present a systematic investigation of deep learning methods applied to quantum error mitigation of noisy output probability distributions from measured quantum circuits. We compare different architectures, from fully connected neural networks to transformers, and we test different design/training modalities, identifying sequence-to-sequence, attention-based models as the most effective on our datasets. These models consistently produce mitigated distributions that are closer to the ideal outputs when tested on both simulated and real device data obtained from IBM superconducting quantum processing units (QPU) up to five qubits. Across several different circuit depths, our approach outperforms other baseline error mitigation techniques. We perform a series of ablation studies to examine: how different input features (circuit, device properties, noisy output statistics) affect performance; cross-dataset generalization across circuit families; and transfer learning to a different IBM QPU. We observe that generalization performance across similar devices with the same architecture works effectively, without needing to fully retrain models.

Why we are recommending this paper?
Because deep learning is a popular topic and you have less than 3 interests with available recommendations

Controllable Layered Image Generation for Real-World Editing

UC Santa Cruz

Rate paper: 👍 👎 ♥ Save

AI Insights

However, they often require large amounts of data and computational resources to train, which can be a limitation. (ML: 0.98)👍👎
require large amounts of data and computational resources to train The use of text-to-image diffusion models for image editing has been explored by several researchers, including those who have developed datasets such as Qwen-Image and Omnigen2. (ML: 0.94)👍👎
Text-to-image diffusion models have become increasingly popular in recent years, with many researchers exploring their potential applications. (ML: 0.93)👍👎
The unreasonable effectiveness of deep features as a perceptual metric Text-to-image diffusion models have become increasingly popular in recent years, with many researchers exploring their potential applications. (ML: 0.92)👍👎
Text-to-image diffusion models are a type of artificial intelligence that can generate images from text descriptions. (ML: 0.91)👍👎
They have many potential applications, but require large amounts of data and computational resources to train. (ML: 0.91)👍👎
These models can be used for various tasks such as image editing, object removal, and text-to-image synthesis. (ML: 0.89)👍👎
These models can be used for various tasks such as object removal, text-to-image synthesis, and instruction-guided image editing. (ML: 0.88)👍👎

Abstract
Recent image generation models have shown impressive progress, yet they often struggle to yield controllable and consistent results when users attempt to edit specific elements within an existing image. Layered representations enable flexible, user-driven content creation, but existing approaches often fail to produce layers with coherent compositing relationships, and their object layers typically lack realistic visual effects such as shadows and reflections. To overcome these limitations, we propose LASAGNA, a novel, unified framework that generates an image jointly with its composing layers--a photorealistic background and a high-quality transparent foreground with compelling visual effects. Unlike prior work, LASAGNA efficiently learns correct image composition from a wide range of conditioning inputs--text prompts, foreground, background, and location masks--offering greater controllability for real-world applications. To enable this, we introduce LASAGNA-48K, a new dataset composed of clean backgrounds and RGBA foregrounds with physically grounded visual effects. We also propose LASAGNABENCH, the first benchmark for layer editing. We demonstrate that LASAGNA excels in generating highly consistent and coherent results across multiple image layers simultaneously, enabling diverse post-editing applications that accurately preserve identity and visual effects. LASAGNA-48K and LASAGNABENCH will be publicly released to foster open research in the community. The project page is https://rayjryang.github.io/LASAGNA-Page/.

Why we are recommending this paper?
Because image and video generation is a popular topic and you have less than 3 interests with available recommendations

StableWorld: Towards Stable and Consistent Long Interactive Video Generation

Nanjing University NJU

Rate paper: 👍 👎 ♥ Save

AI Insights

StableWorld also alleviates error accumulation in autoregressive video generation, resulting in more stable, consistent, and higher-quality long videos. (ML: 0.87)👍👎
Autoregressive video generation: A technique where each frame is generated based on the previous one(s), often leading to error accumulation. (ML: 0.85)👍👎
Geometric similarity: A measure of how similar two frames are based on their geometric structure. (ML: 0.82)👍👎
Sliding window approach: A method where a fixed-size window is moved over the sequence, and the most recent frames are used to generate new ones. (ML: 0.82)👍👎
The paper proposes a method called StableWorld for long-horizon interactive video generation, which aims to prevent error accumulation and maintain temporal consistency. (ML: 0.81)👍👎
StableWorld effectively prevents cumulative errors by continuously filtering out degraded frames while maintaining coherent motion, resulting in more stable and temporally consistent interactive video sequences. (ML: 0.77)👍👎
The method's ability to identify and discard a large number of drifted frames during generation has the potential to reduce training cost and aligns naturally with future extensions toward memory-augmented world models. (ML: 0.76)👍👎
ORB (Oriented FAST and Rotated BRIEF): A feature detector that extracts keypoints with their descriptors for matching purposes. (ML: 0.63)👍👎
StableWorld uses a sliding window approach with dynamic frame eviction based on geometric similarity computed using ORB features. (ML: 0.63)👍👎
The method is evaluated on several benchmarks, including Matrix-Game 2.0, Hunyuan-Gamecraft 1.0, Open-Oasis, and Self-Forcing, showing improved stability and consistency in long-horizon generation. (ML: 0.59)👍👎

Abstract
In this paper, we explore the overlooked challenge of stability and temporal consistency in interactive video generation, which synthesizes dynamic and controllable video worlds through interactive behaviors such as camera movements and text prompts. Despite remarkable progress in world modeling, current methods still suffer from severe instability and temporal degradation, often leading to spatial drift and scene collapse during long-horizon interactions. To better understand this issue, we initially investigate the underlying causes of instability and identify that the major source of error accumulation originates from the same scene, where generated frames gradually deviate from the initial clean state and propagate errors to subsequent frames. Building upon this observation, we propose a simple yet effective method, \textbf{StableWorld}, a Dynamic Frame Eviction Mechanism. By continuously filtering out degraded frames while retaining geometrically consistent ones, StableWorld effectively prevents cumulative drift at its source, leading to more stable and temporal consistency of interactive generation. Promising results on multiple interactive video models, \eg, Matrix-Game, Open-Oasis, and Hunyuan-GameCraft, demonstrate that StableWorld is model-agnostic and can be applied to different interactive video generation frameworks to substantially improve stability, temporal consistency, and generalization across diverse interactive scenarios.

Why we are recommending this paper?
Because image and video generation is a popular topic and you have less than 3 interests with available recommendations

We did not find tons of content matching your interests we've included some additional topics that are popular. Also be aware that if the topics is not present in arxiv we wont be able to recommend it.

APEX-Agents

Mercor

Rate paper: 👍 👎 ♥ Save

AI Insights

McNemar's exact test: A statistical test used to compare the performance of two related samples. (ML: 0.97)👍👎
Pass@1: The proportion of tasks completed correctly by an agent. (ML: 0.95)👍👎
Significance tests using McNemar's exact test with Benjamini-Hochberg correction show that Kimi-K2-Thinking significantly outperforms Gemini-3-flash-preview (p=5.68e-23), GPT-OSS-120B (p=1.0000), and GPT-5.2 (p=7.29e-10). (ML: 0.95)👍👎
The APEX–Agents benchmark highlights the importance of developing AI models that can perform complex tasks in various professional domains, with a focus on toolbelt approaches, context window management, and intentional termination. (ML: 0.94)👍👎
Benjamini-Hochberg correction: A method for controlling false discovery rate in multiple testing. (ML: 0.94)👍👎
The APEX–Agents benchmark is a comprehensive evaluation of AI models' ability to perform complex tasks in various professional domains. (ML: 0.93)👍👎
The most frequently used tools by agents are code execution (256,000), add tool to the toolbelt (200,000), list files in the file system (163,874), read spreadsheet tab (127,000), and search the PDF (86,000). (ML: 0.93)👍👎
The benchmark consists of 227 tasks, covering finance, law, and management consulting, with each task requiring the model to complete a specific task using a set of provided tools. (ML: 0.89)👍👎
The top-performing models on the APEX–Agents benchmark are Gemini 3 Flash, GPT-5.2, and Kimi K2 Thinking, with Pass@1 scores of 0.555, 0.497, and 0.391 respectively. (ML: 0.88)👍👎
ReAct paradigm: A toolbelt approach where reasoning and acting are interleaved in a single loop. (ML: 0.79)👍👎

Why we are recommending this paper?
Because ai agents is a popular topic and you have less than 3 interests with available recommendations

LLM-in-Sandbox Elicits General Agentic Intelligence

Renmin University of China

Rate paper: 👍 👎 ♥ Save

AI Insights

Agentic capabilities: Fundamental skills like exploration, tool use, and self-verification. (ML: 0.96)👍👎
Current results have limitations, such as generated videos being limited to simple animations and composed music lacking expressiveness and creativity. (ML: 0.95)👍👎
The agentic capability benchmark provided by LLM-in-Sandbox can be used to evaluate models' ability to leverage computational environments. (ML: 0.94)👍👎
Strong LLMs exhibit emergent capabilities to leverage the sandbox environment for general tasks. (ML: 0.92)👍👎
LLM-in-Sandbox can be used as an agentic capability benchmark, measuring fundamental skills like exploration, tool use, and self-verification. (ML: 0.91)👍👎
The metric ∆=LLM-in-Sandbox−LLM offers a meaningful indicator of a model's ability to leverage computational environments. (ML: 0.90)👍👎
LLM-in-Sandbox has the potential to become the default paradigm for serving LLMs, enabling them to perform general tasks and produce actual outputs rather than text descriptions. (ML: 0.88)👍👎
LLM-in-Sandbox: A paradigm that grants LLMs access to a virtual computer and enables them to leverage this environment for general tasks. (ML: 0.85)👍👎
Sandbox-native model training: Training models to interact with the sandbox environment as a first-class objective. (ML: 0.82)👍👎
LLM-in-Sandbox is a paradigm that grants LLMs access to a virtual computer and enables them to leverage this environment for general tasks. (ML: 0.80)👍👎

Why we are recommending this paper?
Because ai agents is a popular topic and you have less than 3 interests with available recommendations

Creativity in the Age of AI: Rethinking the Role of Intentional Agency

University of Amsterdam

Rate paper: 👍 👎 ♥ Save

AI Insights

The consistency requirement proposed by the authors is not just statistical frequency but having context-relative grounds for expecting further outputs of comparable novelty and value. (ML: 0.97)👍👎
The concept of creativity should remain flexible across different domains of creativity, and the indeterminacy of the consistency requirement allows for this flexibility. (ML: 0.96)👍👎
The consistency requirement proposed by the authors is a more inclusive and functional approach to defining creativity, allowing for non-human natural processes to be labelled 'creative'. (ML: 0.96)👍👎
The consistency requirement proposed by the authors may not be applicable in all contexts, especially where authenticity conditions the value of the products being generated or examined. (ML: 0.94)👍👎
The IAC has functional value in specific local contexts, such as cognitive science, jurisprudence, and certain domains of creative practice where authenticity conditions the value of the products being generated or examined. (ML: 0.94)👍👎
New Standard Definition (NSD) of Creativity: An object is creative if it is novel, valuable, and the product of a system that can consistently generate novel and valuable objects. (ML: 0.92)👍👎
The NSD states that an object is creative if it is novel, valuable, and the product of a system that can consistently generate novel and valuable objects. (ML: 0.91)👍👎
The article proposes a new standard definition (NSD) of creativity, which drops the intentional agency condition (IAC) as a necessary condition of creativity. (ML: 0.89)👍👎
The article does not provide a comprehensive account of where the IAC ought to be applied. (ML: 0.89)👍👎
The IAC should be excluded from our definition of the genus of creativity but retained as a means of distinguishing between certain species of creativity. (ML: 0.89)👍👎
Intentional Agency Condition (IAC): A necessary condition of creativity that requires an agent to intentionally endeavor to express themselves. (ML: 0.82)👍👎

Why we are recommending this paper?
Because ai and society is a popular topic and you have less than 3 interests with available recommendations

Emergent, not Immanent: A Baradian Reading of Explainable AI

Sony

Rate paper: 👍 👎 ♥ Save

AI Insights

The paper concludes that current XAI methods are based on flawed assumptions and lack a clear understanding of the relationship between humans and machines. (ML: 0.98)👍👎
Apparatuses: The technical tools, methods, and narratives that constitute what is made intelligible and what is excluded from intelligibility in XAI practices. (ML: 0.97)👍👎
The paper critiques the current state of Explainable AI (XAI) methods, arguing that they are based on flawed assumptions and lack a clear understanding of the relationship between humans and machines. (ML: 0.97)👍👎
The paper highlights the limitations of current XAI methods, including their reliance on simplifications and abstractions that erase the original system, and their failure to account for human-machine incommensurability. (ML: 0.96)👍👎
The authors propose an agential realist approach to XAI, which views interpretation as a relational co-production of interpretable phenomena through intra-actions between human and non-human agencies. (ML: 0.96)👍👎
Agential cut: The moment at which an interpretive apparatus enacts a relational co-production of interpretable phenomena through intra-actions between human and non-human agencies. (ML: 0.96)👍👎
Agential realism: A philosophical framework that views knowledge as an intra-action between human and non-human agencies. (ML: 0.94)👍👎
Intra-action: The process by which human and non-human agencies co-produce interpretable phenomena through their entanglements. (ML: 0.92)👍👎
The authors suggest that a diffractive optic offers a more philosophically robust reading of XAI practices, one that acknowledges the emergent nature of interpretation and the importance of situated contexts. (ML: 0.90)👍👎
This approach challenges the dominant reflectivity and refractivity optics in XAI, which assume that meaning pre-exists the practices and beings that produce it. (ML: 0.75)👍👎

Why we are recommending this paper?
Because ai and society is a popular topic and you have less than 3 interests with available recommendations

Towards Execution-Grounded Automated AI Research

Stanford University

Rate paper: 👍 👎 ♥ Save

AI Insights

LLM: Large Language Model RL: Reinforcement Learning ML engineering tasks: Machine learning tasks that heavily depend on feature engineering and hyper-parameter tuning rather than algorithm development. (ML: 0.97)👍👎
The paper demonstrates the feasibility and potential of automated execution feedback loops in LLM research problems, but highlights remaining limitations that need to be addressed. (ML: 0.96)👍👎
Execution grounding for code: The idea of learning from execution feedback in the code generation domain. (ML: 0.96)👍👎
Future work should focus on improving generalizability testing, exploring richer learning signals from execution trajectories, developing more capable execution agents, and incorporating alternative metrics such as idea novelty and interestingness. (ML: 0.95)👍👎
They find that models tend to converge on simple ideas to improve the average reward but lose diversity and do not improve the upper-bound. (ML: 0.95)👍👎
The paper presents a large-scale parallel executor for automatically executing model-generated ideas to verify their effectiveness on open-ended LLM research problems. (ML: 0.92)👍👎
The authors analyze the effectiveness of execution-guided evolutionary search and reinforcement learning with execution rewards. (ML: 0.86)👍👎
The paper highlights the limitations of current experiments, including a lack of generalizability testing, limited exploration incentives in RL objectives, and noise in the reward signal due to the execution agent's capabilities. (ML: 0.84)👍👎

Why we are recommending this paper?
Because research automation with ai is a popular topic and you have less than 3 interests with available recommendations

AI Agents vs. Human Investigators: Balancing Automation, Security, and Expertise in Cyber Forensic Analysis

Florida Institute of Technology

Rate paper: 👍 👎 ♥ Save

AI Insights

However, AI also has its limitations and challenges, including the issue of impostor bias, where AI systems may mistakenly identify a legitimate file or activity as malicious. (ML: 0.96)👍👎
Limited accuracy: AI systems may not always accurately identify malicious files or activities. (ML: 0.96)👍👎
Impostor bias: AI systems may mistakenly identify a legitimate file or activity as malicious. (ML: 0.94)👍👎
To address these challenges, researchers are working on developing more accurate and reliable AI systems for digital forensics. (ML: 0.93)👍👎
To address these challenges, researchers are working on developing more accurate and reliable AI systems for digital forensics. (ML: 0.93)👍👎
Artificial Intelligence (AI): A type of computer system that can perform tasks that would typically require human intelligence, such as learning, problem-solving, and decision-making. (ML: 0.92)👍👎
Digital Forensics: The process of collecting, analyzing, and preserving evidence related to cybercrime and other digital crimes. (ML: 0.92)👍👎
The use of artificial intelligence (AI) in digital forensics is becoming increasingly important as cybercrime continues to grow. (ML: 0.92)👍👎
The use of AI in digital forensics is becoming increasingly important, but it also has its limitations and challenges. (ML: 0.90)👍👎
The use of AI in digital forensics is becoming increasingly important, but it also has its limitations and challenges. (ML: 0.90)👍👎

Why we are recommending this paper?
Because research automation with ai is a popular topic and you have less than 3 interests with available recommendations

Benchmarking Deep Learning Models for Raman Spectroscopy Across Open-Source Datasets

Purdue University

Rate paper: 👍 👎 ♥ Save

AI Insights

The authors acknowledge that their study is limited by its reliance on a small number of datasets. (ML: 0.99)👍👎
Domain shift: a phenomenon where the distribution of data in the training set differs from that of the testing set. (ML: 0.98)👍👎
The development of DSCF highlights the need for large-scale and diverse training data. (ML: 0.97)👍👎
Recent works have proposed unsupervised domain adaptation frameworks, but their effectiveness beyond the originally reported datasets are yet to be independently evaluated. (ML: 0.95)👍👎
The results of this benchmarking experiment have shown that classifying test samples that are in-distribution to the training dataset is significantly easier than test samples suffering from distribution shift due to changes in instruments and acquisition conditions, and additional contaminants. (ML: 0.94)👍👎
Foundation model: a pre-trained model that can be fine-tuned for specific tasks, often using transfer learning. (ML: 0.92)👍👎
SANet demonstrated the best overall performance across the datasets. (ML: 0.84)👍👎
The study benchmarks only five architectures and relies on minimal spectral pre-processing. (ML: 0.77)👍👎
Existing open-source Raman datasets are often restricted in size, chemical diversity or experimental variability. (ML: 0.67)👍👎
Creating large, curated experimental Raman spectral datasets that span multiple instruments, materials and measurement settings is key to developing a Raman-specific foundation model. (ML: 0.61)👍👎
Raman spectroscopy: a technique used to analyze the vibrational modes of molecules. (ML: 0.52)👍👎

Why we are recommending this paper?
Because deep learning is a popular topic and you have less than 3 interests with available recommendations

Deep Learning Approaches to Quantum Error Mitigation

Quantinuum Ltd

Rate paper: 👍 👎 ♥ Save

AI Insights

L1 Relative Change (L1RC): A measure of the difference between two probability distributions. (ML: 0.98)👍👎
Signal-to-Noise Ratio (SNR): The ratio of the signal power to the noise power in a system. (ML: 0.93)👍👎
However, on Real Pauli data the advantage clearly shifts toward the ML-based models, which outperform all baselines in both median L1 relative change and fraction of improved circuits. (ML: 0.93)👍👎
Deep learning models can learn corrections directly from data gathered during circuit runs, more easily capturing correlations. (ML: 0.88)👍👎
The best performing models are comparable to the best baseline methods on Simulated data (both Pauli and Random). (ML: 0.87)👍👎
It is defined as the L1 norm of the difference between the two distributions. (ML: 0.87)👍👎
The learned mapping from P noisy and circuit features to P ideal captures a richer structure that goes beyond coarse depolarization or measurement-error mitigation. (ML: 0.81)👍👎
The PERCEIVER model consistently achieves as good or greater median performance than the baseline mitigation techniques for Pauli circuits. (ML: 0.80)👍👎
The deep learning approaches can generalize across noise regimes, device generations, and circuit families without relying on a predefined noise model. (ML: 0.79)👍👎
The deep learning approaches can generalize across noise regimes, device generations, and circuit families without relying on a predefined noise model. (ML: 0.79)👍👎
The baseline methods retain value as lightweight, interpretable mitigation techniques, particularly for structured, low-depth circuits. (ML: 0.61)👍👎

Why we are recommending this paper?
Because deep learning is a popular topic and you have less than 3 interests with available recommendations

Controllable Layered Image Generation for Real-World Editing

UC Santa Cruz

Rate paper: 👍 👎 ♥ Save

AI Insights

However, they often require large amounts of data and computational resources to train, which can be a limitation. (ML: 0.98)👍👎
require large amounts of data and computational resources to train The use of text-to-image diffusion models for image editing has been explored by several researchers, including those who have developed datasets such as Qwen-Image and Omnigen2. (ML: 0.94)👍👎
Text-to-image diffusion models have become increasingly popular in recent years, with many researchers exploring their potential applications. (ML: 0.93)👍👎
The unreasonable effectiveness of deep features as a perceptual metric Text-to-image diffusion models have become increasingly popular in recent years, with many researchers exploring their potential applications. (ML: 0.92)👍👎
Text-to-image diffusion models are a type of artificial intelligence that can generate images from text descriptions. (ML: 0.91)👍👎
They have many potential applications, but require large amounts of data and computational resources to train. (ML: 0.91)👍👎
These models can be used for various tasks such as image editing, object removal, and text-to-image synthesis. (ML: 0.89)👍👎
These models can be used for various tasks such as object removal, text-to-image synthesis, and instruction-guided image editing. (ML: 0.88)👍👎

Why we are recommending this paper?
Because image and video generation is a popular topic and you have less than 3 interests with available recommendations

StableWorld: Towards Stable and Consistent Long Interactive Video Generation

Nanjing University NJU

Rate paper: 👍 👎 ♥ Save

AI Insights

StableWorld also alleviates error accumulation in autoregressive video generation, resulting in more stable, consistent, and higher-quality long videos. (ML: 0.87)👍👎
Autoregressive video generation: A technique where each frame is generated based on the previous one(s), often leading to error accumulation. (ML: 0.85)👍👎
Geometric similarity: A measure of how similar two frames are based on their geometric structure. (ML: 0.82)👍👎
Sliding window approach: A method where a fixed-size window is moved over the sequence, and the most recent frames are used to generate new ones. (ML: 0.82)👍👎
The paper proposes a method called StableWorld for long-horizon interactive video generation, which aims to prevent error accumulation and maintain temporal consistency. (ML: 0.81)👍👎
StableWorld effectively prevents cumulative errors by continuously filtering out degraded frames while maintaining coherent motion, resulting in more stable and temporally consistent interactive video sequences. (ML: 0.77)👍👎
The method's ability to identify and discard a large number of drifted frames during generation has the potential to reduce training cost and aligns naturally with future extensions toward memory-augmented world models. (ML: 0.76)👍👎
ORB (Oriented FAST and Rotated BRIEF): A feature detector that extracts keypoints with their descriptors for matching purposes. (ML: 0.63)👍👎
StableWorld uses a sliding window approach with dynamic frame eviction based on geometric similarity computed using ORB features. (ML: 0.63)👍👎
The method is evaluated on several benchmarks, including Matrix-Game 2.0, Hunyuan-Gamecraft 1.0, Open-Oasis, and Self-Forcing, showing improved stability and consistency in long-horizon generation. (ML: 0.59)👍👎

Why we are recommending this paper?
Because image and video generation is a popular topic and you have less than 3 interests with available recommendations

Interests not found

We did not find any papers that match the below interests. Try other terms also consider if the content exists in arxiv.org.

Data Science Career Guidance
Data Career Development
Data Science Career Advice
Data Careers

You can edit or add more interests any time.

💬 Help Shape Our Pricing

We're exploring pricing options to make this project sustainable. Take 3 minutes to share what you'd be willing to pay (if anything). Your input guides our future investment.

Share Your Feedback

Help us improve your experience!

This project is on its early stages your feedback can be pivotal on the future of the project. Let us know what you think about this week's papers and suggestions!

Give Feedback