Hi!

Your personalized paper recommendations for 19–23 January 2026.
Florida Institute of Technology
Rate paper: 👍 👎 ♥ Save
Paper visualization
Rate image: 👍 👎
AI Insights
  • AI also has its limitations and challenges, including the issue of impostor bias, where AI systems may mistakenly identify a legitimate file or activity as malicious. (ML: 0.96)👍👎
  • Limited accuracy: AI systems may not always accurately identify malicious files or activities. (ML: 0.96)👍👎
  • Impostor bias: AI systems may mistakenly identify a legitimate file or activity as malicious. (ML: 0.94)👍👎
  • To address these challenges, researchers are working on developing more accurate and reliable AI systems for digital forensics. (ML: 0.93)👍👎
  • Artificial Intelligence (AI): A type of computer system that can perform tasks that would typically require human intelligence, such as learning, problem-solving, and decision-making. (ML: 0.92)👍👎
  • Digital Forensics: The process of collecting, analyzing, and preserving evidence related to cybercrime and other digital crimes. (ML: 0.92)👍👎
  • The use of artificial intelligence (AI) in digital forensics is becoming increasingly important as cybercrime continues to grow. (ML: 0.92)👍👎
  • The use of AI in digital forensics is becoming increasingly important, but it also has its limitations and challenges. (ML: 0.90)👍👎
Abstract
In an era where cyber threats are rapidly evolving, the reliability of cyber forensic analysis has become increasingly critical for effective digital investigations and cybersecurity responses. AI agents are being adopted across digital forensic practices due to their ability to automate processes such as anomaly detection, evidence classification, and behavioral pattern recognition, significantly enhancing scalability and reducing investigation timelines. However, the characteristics that make AI indispensable also introduce notable risks. AI systems, often trained on biased or incomplete datasets, can produce misleading results, including false positives and false negatives, thereby jeopardizing the integrity of forensic investigations. This study presents a meticulous comparative analysis of the effectiveness of the most widely used AI agent, ChatGPT, and human forensic investigators in the realm of cyber forensic analysis. Our research reveals critical limitations within AI-driven approaches, demonstrating scenarios in which sophisticated or novel cyber threats remain undetected due to the rigid pattern-based nature of AI systems. Conversely, our analysis highlights the crucial role that human forensic investigators play in mitigating these risks. Through adaptive decision-making, ethical reasoning, and contextual understanding, human investigators effectively identify subtle anomalies and threats that may evade automated detection systems. To reinforce our findings, we conducted comprehensive reliability testing of forensic techniques using multiple cyber threat scenarios. These tests confirmed that while AI agents significantly improve the efficiency of routine analyses, human oversight remains crucial in ensuring the accuracy and comprehensiveness of the results.
Why are we recommending this paper?
Because research automation with AI is a popular topic and you have fewer than 3 interests with available recommendations

This paper directly addresses the application of AI, specifically agents, within a critical domain – cyber forensic analysis – aligning with the user's interest in AI for product management and security. It explores the balance between automation and human expertise, a key consideration when defining product strategies.
Siemens AG
Rate paper: 👍 👎 ♥ Save
Paper visualization
Rate image: 👍 👎
AI Insights
  • The framework enables non-experts using simple prompts to generate visualizations that correctly apply nuanced expert rules, bridging the expertise gap and mitigating the expert bottleneck. (ML: 0.98)👍👎
  • The solution represents a significant step towards democratizing access to specialized expertise through an agent, enabling more efficient and effective data analysis across industries. (ML: 0.95)👍👎
  • Evaluator assessments validate the practical impact of the framework, with baseline outputs deemed unreadable and proposed system outputs receiving praise for showing the optimization process clearly. (ML: 0.94)👍👎
  • The paper proposes a framework for capturing expert domain knowledge and leveraging it to construct LLM-based AI agents capable of autonomous expert-level performance. (ML: 0.90)👍👎
  • The framework integrates a Retrieval-Augmented Generation (RAG) system, codified expert rules, and visualization design principles directly into the Agent. (ML: 0.89)👍👎
  • LLM: Large Language Model. RAG: Retrieval-Augmented Generation system. Physics-agnostic design pattern: a design approach that decouples visualization rules from specific physical phenomena. The research contributes a robust AI agent for visualization generation and a systematic, validated framework for engineering AI agents with human expert domain knowledge. (ML: 0.85)👍👎
  • Technical validation demonstrates the framework's effectiveness, achieving 206% improvement in output quality across five scenarios spanning three simulation domains. (ML: 0.84)👍👎
Abstract
Critical domain knowledge typically resides with few experts, creating organizational bottlenecks in scalability and decision-making. Non-experts struggle to create effective visualizations, leading to suboptimal insights and diverting expert time. This paper investigates how to capture and embed human domain knowledge into AI agent systems through an industrial case study. We propose a software engineering framework to capture human domain knowledge for engineering AI agents in simulation data visualization by augmenting a Large Language Model (LLM) with a request classifier, Retrieval-Augmented Generation (RAG) system for code generation, codified expert rules, and visualization design principles unified in an agent demonstrating autonomous, reactive, proactive, and social behavior. Evaluation across five scenarios spanning multiple engineering domains with 12 evaluators demonstrates 206% improvement in output quality, with our agent achieving expert-level ratings in all cases versus baseline's poor performance, while maintaining superior code quality with lower variance. Our contributions are: an automated agent-based system for visualization generation and a validated framework for systematically capturing human domain knowledge and codifying tacit expert knowledge into AI agents, demonstrating that non-experts can achieve expert-level outcomes in specialized domains.
Why are we recommending this paper?
Due to your interest in AI for Product Management

This work focuses on embedding expert knowledge into AI agents, a core element of product strategy and vision setting. The framework presented offers a practical approach to scaling intelligent systems, directly relevant to the user's interest in AI for product management.
Binghamton University, State University of New York
Rate paper: 👍 👎 ♥ Save
Paper visualization
Rate image: 👍 👎
AI Insights
  • The development of agentic AI systems may be hindered by the lack of standardization and interoperability between different AI technologies. (ML: 0.93)👍👎
  • The proposed framework may require significant computational resources and data storage capacity. (ML: 0.92)👍👎
  • Neuro-symbolic AI is a key technology for developing agentic systems that can reason, learn, and interact with humans. (ML: 0.92)👍👎
  • Neuro-symbolic AI: A type of AI that combines the strengths of neural networks and symbolic reasoning to enable more efficient and effective decision-making. (ML: 0.91)👍👎
  • Agentic AI has been studied in various fields, including computer science, cognitive psychology, and philosophy. (ML: 0.90)👍👎
  • The application of agentic AI in business processes can lead to improved productivity, reduced costs, and enhanced customer satisfaction. (ML: 0.89)👍👎
  • The proposed framework for designing agentic AI systems using neuro-symbolic AI has the potential to revolutionize business process optimization by enabling more accurate and efficient decision-making. (ML: 0.86)👍👎
  • The authors propose a framework for designing agentic AI systems using neuro-symbolic AI and provide an example of its application in a business process optimization problem. (ML: 0.86)👍👎
  • The paper discusses the concept of agentic AI and its application in business processes. (ML: 0.83)👍👎
  • Agentic AI: A type of artificial intelligence that is capable of reasoning, learning, and interacting with humans in a way that is similar to human agency. (ML: 0.80)👍👎
Abstract
Current business environments require organizations to continuously reconfigure cross-functional processes, yet enterprise systems are still organized around siloed departments, rigid workflows, and hard-coded automation. Meanwhile, large language models (LLMs) excel at interpreting natural language and unstructured data but lack deterministic, verifiable execution of complex business logic. To address this gap, here we introduce AUTOBUS, an Autonomous Business System that integrates LLM-based AI agents, predicate-logic programming, and business-semantics-centric enterprise data into a coherent neuro-symbolic AI architecture for orchestrating end-to-end business initiatives. AUTOBUS models an initiative as a network of tasks with explicit pre/post conditions, required data, evaluation rules, and API-level actions. Enterprise data is organized as a knowledge graph whose entities, relationships, and constraints are translated into logic facts and foundational rules, providing the semantic grounding for task reasoning. Core AI agents synthesize task instructions, enterprise semantics, and available tools into task-specific logic programs, which are executed by a logic engine that enforces constraints, coordinates auxiliary tools, and orchestrates execution of actions and outcomes. Humans define and maintain the semantics, policies, and task instructions, curate tools, and supervise high-impact or ambiguous decisions, ensuring accountability and adaptability. We detail the AUTOBUS architecture, the anatomy of the AI agent generated logic programs, and the role of humans and auxiliary tools in the lifecycle of a business initiative.
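The task model the abstract describes, explicit pre/post conditions wrapped around API-level actions and enforced by a logic engine, can be sketched in a few lines of Python. This is a minimal illustrative sketch with hypothetical names, not the AUTOBUS implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    # Hypothetical names, illustrating the pre/post-condition task
    # model the abstract describes; not the AUTOBUS implementation.
    name: str
    precondition: Callable[[dict], bool]   # must hold before the action runs
    action: Callable[[dict], dict]         # API-level action on shared state
    postcondition: Callable[[dict], bool]  # verified after the action runs

def run_initiative(tasks, state):
    """Execute tasks in order, enforcing pre/post conditions the way a
    (much simplified) constraint-checking engine would."""
    for t in tasks:
        if not t.precondition(state):
            raise RuntimeError(f"precondition failed: {t.name}")
        state = t.action(state)
        if not t.postcondition(state):
            raise RuntimeError(f"postcondition failed: {t.name}")
    return state
```

The point of the sketch is the structure: the conditions are explicit, checkable objects rather than logic buried inside the action itself.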
Why are we recommending this paper?
Due to your interest in AI for Product Management

The paper’s exploration of autonomous business systems through neuro-symbolic AI aligns with the user’s interest in product strategy and vision setting for tech teams. It tackles the challenge of organizational reconfigurations, a complex area central to strategic product development.
Mercor
Rate paper: 👍 👎 ♥ Save
Paper visualization
Rate image: 👍 👎
AI Insights
  • McNemar's exact test: A statistical test used to compare the performance of two related samples. (ML: 0.97)👍👎
  • Pass@1: The proportion of tasks completed correctly by an agent. (ML: 0.95)👍👎
  • Significance tests using McNemar's exact test with Benjamini-Hochberg correction show that Kimi-K2-Thinking differs significantly from Gemini-3-flash-preview (p=5.68e-23) and GPT-5.2 (p=7.29e-10), but not from GPT-OSS-120B (p=1.0000). (ML: 0.95)👍👎
  • The APEX–Agents benchmark highlights the importance of developing AI models that can perform complex tasks in various professional domains, with a focus on toolbelt approaches, context window management, and intentional termination. (ML: 0.94)👍👎
  • Benjamini-Hochberg correction: A method for controlling false discovery rate in multiple testing. (ML: 0.94)👍👎
  • The APEX–Agents benchmark is a comprehensive evaluation of AI models' ability to perform complex tasks in various professional domains. (ML: 0.93)👍👎
  • The most frequently used tools by agents are code execution (256,000), add tool to the toolbelt (200,000), list files in the file system (163,874), read spreadsheet tab (127,000), and search the PDF (86,000). (ML: 0.93)👍👎
  • The benchmark consists of 227 tasks, covering finance, law, and management consulting, with each task requiring the model to complete a specific task using a set of provided tools. (ML: 0.89)👍👎
  • The top-performing models on the APEX–Agents benchmark are Gemini 3 Flash, GPT-5.2, and Kimi K2 Thinking, with Pass@1 scores of 0.555, 0.497, and 0.391 respectively. (ML: 0.88)👍👎
  • ReAct paradigm: A toolbelt approach where reasoning and acting are interleaved in a single loop. (ML: 0.79)👍👎
Abstract
We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. APEX-Agents requires agents to navigate realistic work environments with files and tools. We test eight agents for the leaderboard using Pass@1. Gemini 3 Flash (Thinking=High) achieves the highest score of 24.0%, followed by GPT-5.2 (Thinking=High), Claude Opus 4.5 (Thinking=High), and Gemini 3 Pro (Thinking=High). We open source the APEX-Agents benchmark (n=480) with all prompts, rubrics, gold outputs, files, and metadata. We also open-source Archipelago, our infrastructure for agent execution and evaluation.
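The statistics behind the insight bullets, an exact McNemar test on per-task pass/fail outcomes with Benjamini-Hochberg correction across model pairs, are standard and easy to reproduce. A minimal sketch (our own code, not the benchmark's):

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar test on the discordant pair counts:
    b = tasks model A passed and model B failed, c = the reverse."""
    n, k = b + c, min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n
    return min(1.0, 2.0 * tail)  # doubled one-sided binomial tail

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (controls the false
    discovery rate across multiple pairwise comparisons)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted, running = [0.0] * m, 1.0
    # walk from the largest p-value down, enforcing monotonicity
    for rank, idx in reversed(list(enumerate(order, start=1))):
        running = min(running, pvals[idx] * m / rank)
        adjusted[idx] = running
    return adjusted
```

Only the discordant tasks (solved by exactly one of the two models) enter the test, which is why it suits paired Pass@1 comparisons on a shared task set.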
Why are we recommending this paper?
Because AI agents are a popular topic and you have fewer than 3 interests with available recommendations

APEX-Agents provides a benchmark for AI agent performance in realistic, complex tasks – mirroring the need for robust evaluation within product management. This directly supports the user’s interest in AI for product management and strategic decision-making.
Uppsala University
Rate paper: 👍 👎 ♥ Save
Paper visualization
Rate image: 👍 👎
AI Insights
  • SoftMoE: A variant of MoE models that uses a soft-gating mechanism to select the most relevant experts for each input. (ML: 0.95)👍👎
  • MoE models can generalize robustly in moderate-scale vision tasks when appropriately regularized. (ML: 0.93)👍👎
  • Mixture-of-Experts (MoE) models: A type of neural network architecture that combines multiple experts to make predictions. (ML: 0.92)👍👎
  • SparseMoE: A variant of MoE models that uses a sparse-gating mechanism to select only a subset of experts for each input. (ML: 0.92)👍👎
  • SoftMoE and SparseMoE architectures outperform the dense baseline on validation accuracy when expert utilization is properly regularized. (ML: 0.92)👍👎
  • Hessian-based curvature analysis: A method used to analyze the geometry of the loss surface in neural networks. (ML: 0.85)👍👎
  • The gap between theoretical and realized efficiency in sparse MoE models arises from the overhead of routing, selection, and aggregation operations in naive implementations. (ML: 0.74)👍👎
  • Hessian-based curvature analysis reveals that SoftMoE converges to solutions with higher local curvature, while Dense and SparseMoE occupy a similar sharpness regime. (ML: 0.74)👍👎
Abstract
Mixture-of-Experts (MoE) architectures enable conditional computation by routing inputs to multiple expert subnetworks and are often motivated as a mechanism for scaling large language models. In this project, we instead study MoE behavior in an image classification setting, focusing on predictive performance, expert utilization, and generalization. We compare dense, SoftMoE, and SparseMoE classifier heads on the CIFAR10 dataset under comparable model capacity. Both MoE variants achieve slightly higher validation accuracy than the dense baseline while maintaining balanced expert utilization through regularization, avoiding expert collapse. To analyze generalization, we compute Hessian-based sharpness metrics at convergence, including the largest eigenvalue and trace of the loss Hessian, evaluated on both training and test data. We find that SoftMoE exhibits higher sharpness by these metrics, while Dense and SparseMoE lie in a similar curvature regime, despite all models achieving comparable generalization performance. Complementary loss surface perturbation analyses reveal qualitative differences in non-local behavior under finite parameter perturbations between dense and MoE models, which help contextualize curvature-based measurements without directly explaining validation accuracy. We further evaluate empirical inference efficiency and show that naively implemented conditional routing does not yield inference speedups on modern hardware at this scale, highlighting the gap between theoretical and realized efficiency in sparse MoE models.
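The difference between the two MoE heads compared in the abstract comes down to the gate: SoftMoE blends every expert with softmax weights, while SparseMoE evaluates only the top-k experts and renormalizes their weights. A toy sketch of the two gating schemes (our own illustration, not the paper's code):

```python
import numpy as np

def soft_moe_head(x, W_gate, experts):
    """SoftMoE-style head: every expert runs; outputs are blended
    with softmax gate weights (a dense weighted mixture)."""
    logits = x @ W_gate
    gates = np.exp(logits - logits.max())
    gates /= gates.sum()
    return sum(g * f(x) for g, f in zip(gates, experts))

def sparse_moe_head(x, W_gate, experts, k=2):
    """SparseMoE-style head: only the top-k gated experts run;
    their weights are renormalized (conditional computation)."""
    logits = x @ W_gate
    top = np.argsort(logits)[-k:]
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()
    return sum(g * experts[i](x) for g, i in zip(gates, top))
```

With k equal to the number of experts the two heads coincide; the abstract's efficiency finding is precisely that skipping the unevaluated experts in the sparse case does not automatically translate into wall-clock speedups at this scale.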
Why are we recommending this paper?
Due to your interest in Vision Setting for Tech Teams

This paper investigates Mixture-of-Experts models, a significant advancement in scaling vision models – a key area of interest for the user. Understanding these architectures is crucial for evaluating and potentially implementing AI solutions within product development.
University College London
Rate paper: 👍 👎 ♥ Save
AI Insights
  • The study assumes that the pretraining objectives of vision foundation models can be aligned with downstream tasks, but this assumption may not hold in all cases. (ML: 0.98)👍👎
  • There is ongoing research into developing more robust and generalizable vision foundation models. (ML: 0.98)👍👎
  • Vision foundation model transferability is strongly influenced by the alignment between pretraining objectives and downstream tasks. (ML: 0.97)👍👎
  • The pretraining objectives of vision foundation models influence their ability to generalize to downstream tasks. (ML: 0.97)👍👎
  • Strengthening this pretraining-downstream alignment or developing approaches that are invariant to misalignment offers promising directions for future research. (ML: 0.94)👍👎
  • The study relies on a specific dataset (Promis) which may not be representative of all prostate cancer cases. (ML: 0.93)👍👎
  • Vision foundation models have been shown to be effective for a variety of tasks including image classification, object detection, and segmentation. (ML: 0.90)👍👎
  • MAE: Masked Autoencoder. DINOv2: learning robust visual features without supervision. Vision foundation models can be adapted to specific tasks by fine-tuning them on task-specific data. (ML: 0.85)👍👎
Abstract
Foundation models leverage large-scale pretraining to capture extensive knowledge, demonstrating generalization in a wide range of language tasks. By comparison, vision foundation models (VFMs) often exhibit uneven improvements across downstream tasks, despite substantial computational investment. We postulate that this limitation arises from a mismatch between pretraining objectives and the demands of downstream vision-and-imaging tasks. Pretraining strategies like masked image reconstruction or contrastive learning shape representations for tasks such as recovery of generic visual patterns or global semantic structures, which may not align with the task-specific requirements of downstream applications including segmentation, classification, or image synthesis. To investigate this in a concrete real-world clinical area, we assess two VFMs, a reconstruction-focused MAE-based model (ProFound) and a contrastive-learning-based model (ProViCNet), on five prostate multiparametric MR imaging tasks, examining how such task alignment influences transfer performance, i.e., from pretraining to fine-tuning. Our findings indicate that better alignment between pretraining and downstream tasks, measured by simple divergence metrics such as maximum-mean-discrepancy (MMD) between the same features before and after fine-tuning, correlates with greater performance improvements and faster convergence, emphasizing the importance of designing and analyzing pretraining objectives with downstream applicability in mind.
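The alignment measure the abstract leans on, maximum-mean-discrepancy between the same features before and after fine-tuning, has a compact kernel form. A minimal sketch with an RBF kernel (a biased V-statistic estimator; the paper's exact kernel choice is not specified here):

```python
import numpy as np

def mmd_rbf(X, Y, sigma=1.0):
    """Squared maximum-mean-discrepancy between feature sets X and Y
    (rows are samples) under an RBF kernel."""
    def k(A, B):
        # pairwise squared distances via broadcasting, then RBF kernel
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
```

Applied to pre- and post-fine-tuning features, a small MMD suggests fine-tuning barely moved the representation, i.e., the pretraining objective was already aligned with the downstream task.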
Why are we recommending this paper?
Due to your interest in Vision Setting for Tech Teams
Renmin University of China
Rate paper: 👍 👎 ♥ Save
AI Insights
  • Agentic capabilities: Fundamental skills like exploration, tool use, and self-verification. (ML: 0.96)👍👎
  • Current results have limitations, such as generated videos being limited to simple animations and composed music lacking expressiveness and creativity. (ML: 0.95)👍👎
  • The agentic capability benchmark provided by LLM-in-Sandbox can be used to evaluate models' ability to leverage computational environments. (ML: 0.94)👍👎
  • Strong LLMs exhibit emergent capabilities to leverage the sandbox environment for general tasks. (ML: 0.92)👍👎
  • The metric ∆=LLM-in-Sandbox−LLM offers a meaningful indicator of a model's ability to leverage computational environments. (ML: 0.90)👍👎
  • LLM-in-Sandbox has the potential to become the default paradigm for serving LLMs, enabling them to perform general tasks and produce actual outputs rather than text descriptions. (ML: 0.88)👍👎
  • LLM-in-Sandbox: A paradigm that grants LLMs access to a virtual computer and enables them to leverage this environment for general tasks. (ML: 0.85)👍👎
  • Sandbox-native model training: Training models to interact with the sandbox environment as a first-class objective. (ML: 0.82)👍👎
Abstract
We introduce LLM-in-Sandbox, enabling LLMs to explore within a code sandbox (i.e., a virtual computer), to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs, without additional training, exhibit generalization capabilities to leverage the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, leverage the file system to handle long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be enhanced through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which uses only non-agentic data to train models for sandbox exploration. Experiments demonstrate that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization spanning mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze LLM-in-Sandbox's efficiency from computational and system perspectives, and open-source it as a Python package to facilitate real-world deployment.
Why are we recommending this paper?
Because AI agents are a popular topic and you have fewer than 3 interests with available recommendations
University of Amsterdam
Rate paper: 👍 👎 ♥ Save
AI Insights
  • The consistency requirement proposed by the authors is not just statistical frequency but having context-relative grounds for expecting further outputs of comparable novelty and value. (ML: 0.97)👍👎
  • The concept of creativity should remain flexible across different domains of creativity, and the indeterminacy of the consistency requirement allows for this flexibility. (ML: 0.96)👍👎
  • The consistency requirement proposed by the authors is a more inclusive and functional approach to defining creativity, allowing for non-human natural processes to be labelled 'creative'. (ML: 0.96)👍👎
  • The consistency requirement proposed by the authors may not be applicable in all contexts, especially where authenticity conditions the value of the products being generated or examined. (ML: 0.94)👍👎
  • The IAC has functional value in specific local contexts, such as cognitive science, jurisprudence, and certain domains of creative practice where authenticity conditions the value of the products being generated or examined. (ML: 0.94)👍👎
  • New Standard Definition (NSD) of Creativity: An object is creative if it is novel, valuable, and the product of a system that can consistently generate novel and valuable objects. (ML: 0.92)👍👎
  • The article proposes a new standard definition (NSD) of creativity, which drops the intentional agency condition (IAC) as a necessary condition of creativity. (ML: 0.89)👍👎
  • The article does not provide a comprehensive account of where the IAC ought to be applied. (ML: 0.89)👍👎
  • The IAC should be excluded from our definition of the genus of creativity but retained as a means of distinguishing between certain species of creativity. (ML: 0.89)👍👎
  • Intentional Agency Condition (IAC): A necessary condition of creativity that requires an agent to intentionally endeavor to express themselves. (ML: 0.82)👍👎
Abstract
Many theorists of creativity maintain that intentional agency is a necessary condition of creativity. We argue that this requirement, which we call the Intentional Agency Condition (IAC), should be rejected as a general condition of creativity, while retaining its relevance in specific contexts. We show that recent advances in generative AI have rendered the IAC increasingly problematic, both descriptively and functionally. We offer two reasons for abandoning it at the general level. First, we present corpus evidence indicating that authors and journalists are increasingly comfortable ascribing creativity to generative AI, despite its lack of intentional agency. This development places pressure on the linguistic intuitions that have traditionally been taken to support the IAC. Second, drawing on the method of conceptual engineering, we argue that the IAC no longer fulfils its core social function. Rather than facilitating the identification and encouragement of reliable sources of novel and valuable products, it now feeds into biases that distort our assessments of AI-generated outputs. We therefore propose replacing the IAC with a consistency requirement, according to which creativity tracks the reliable generation of novel and valuable products. Nonetheless, we explain why the IAC should be retained in specific local domains.
Why are we recommending this paper?
Because AI and society is a popular topic and you have fewer than 3 interests with available recommendations
Sony
Rate paper: 👍 👎 ♥ Save
AI Insights
  • Apparatuses: The technical tools, methods, and narratives that constitute what is made intelligible and what is excluded from intelligibility in XAI practices. (ML: 0.97)👍👎
  • The paper critiques the current state of Explainable AI (XAI) methods, arguing that they are based on flawed assumptions and lack a clear understanding of the relationship between humans and machines. (ML: 0.97)👍👎
  • The paper highlights the limitations of current XAI methods, including their reliance on simplifications and abstractions that erase the original system, and their failure to account for human-machine incommensurability. (ML: 0.96)👍👎
  • The authors propose an agential realist approach to XAI, which views interpretation as a relational co-production of interpretable phenomena through intra-actions between human and non-human agencies. (ML: 0.96)👍👎
  • Agential cut: The moment at which an interpretive apparatus enacts a relational co-production of interpretable phenomena through intra-actions between human and non-human agencies. (ML: 0.96)👍👎
  • Agential realism: A philosophical framework that views knowledge as an intra-action between human and non-human agencies. (ML: 0.94)👍👎
  • Intra-action: The process by which human and non-human agencies co-produce interpretable phenomena through their entanglements. (ML: 0.92)👍👎
  • The authors suggest that a diffractive optic offers a more philosophically robust reading of XAI practices, one that acknowledges the emergent nature of interpretation and the importance of situated contexts. (ML: 0.90)👍👎
  • This approach challenges the dominant reflectivity and refractivity optics in XAI, which assume that meaning pre-exists the practices and beings that produce it. (ML: 0.75)👍👎
Abstract
Explainable AI (XAI) is frequently positioned as a technical problem of revealing the inner workings of an AI model. This position is affected by unexamined onto-epistemological assumptions: meaning is treated as immanent to the model, the explainer is positioned outside the system, and a causal structure is presumed recoverable through computational techniques. In this paper, we draw on Barad's agential realism to develop an alternative onto-epistemology of XAI. We propose that interpretations are material-discursive performances that emerge from situated entanglements of the AI model with humans, context, and the interpretative apparatus. To develop this position, we read a comprehensive set of XAI methods through agential realism and reveal the assumptions and limitations that underpin several of these methods. We then articulate the framework's ethical dimension and propose design directions for XAI interfaces that support emergent interpretation, using a speculative text-to-music interface as a case study.
Why are we recommending this paper?
Because AI and society is a popular topic and you have fewer than 3 interests with available recommendations
Stanford University
Rate paper: 👍 👎 ♥ Save
AI Insights
  • LLM: Large Language Model. RL: Reinforcement Learning. ML engineering tasks: machine learning tasks that depend heavily on feature engineering and hyper-parameter tuning rather than algorithm development. (ML: 0.97)👍👎
  • The paper demonstrates the feasibility and potential of automated execution feedback loops in LLM research problems, but highlights remaining limitations that need to be addressed. (ML: 0.96)👍👎
  • Execution grounding for code: The idea of learning from execution feedback in the code generation domain. (ML: 0.96)👍👎
  • Future work should focus on improving generalizability testing, exploring richer learning signals from execution trajectories, developing more capable execution agents, and incorporating alternative metrics such as idea novelty and interestingness. (ML: 0.95)👍👎
  • They find that models tend to converge on simple ideas to improve the average reward but lose diversity and do not improve the upper-bound. (ML: 0.95)👍👎
  • The paper presents a large-scale parallel executor for automatically executing model-generated ideas to verify their effectiveness on open-ended LLM research problems. (ML: 0.92)👍👎
  • The authors analyze the effectiveness of execution-guided evolutionary search and reinforcement learning with execution rewards. (ML: 0.86)👍👎
  • The paper highlights the limitations of current experiments, including a lack of generalizability testing, limited exploration incentives in RL objectives, and noise in the reward signal due to the execution agent's capabilities. (ML: 0.84)👍👎
Abstract
Automated AI research holds great potential to accelerate scientific discovery. However, current LLMs often generate plausible-looking but ineffective ideas. Execution grounding may help, but it is unclear whether automated execution is feasible and whether LLMs can learn from the execution feedback. To investigate these, we first build an automated executor to implement ideas and launch large-scale parallel GPU experiments to verify their effectiveness. We then convert two realistic research problems - LLM pre-training and post-training - into execution environments and demonstrate that our automated executor can implement a large fraction of the ideas sampled from frontier LLMs. We analyze two methods to learn from the execution feedback: evolutionary search and reinforcement learning. Execution-guided evolutionary search is sample-efficient: it finds a method that significantly outperforms the GRPO baseline (69.4% vs 48.0%) on post-training, and finds a pre-training recipe that outperforms the nanoGPT baseline (19.7 minutes vs 35.9 minutes) on pre-training, all within just ten search epochs. Frontier LLMs often generate meaningful algorithmic ideas during search, but they tend to saturate early and only occasionally exhibit scaling trends. Reinforcement learning from execution reward, on the other hand, suffers from mode collapse. It successfully improves the average reward of the ideator model but not the upper-bound, due to models converging on simple ideas. We thoroughly analyze the executed ideas and training dynamics to facilitate future efforts towards execution-grounded automated AI research.
Why are we recommending this paper?
Because research automation with AI is a popular topic and you have fewer than 3 interests with available recommendations.
Purdue University
Rate paper: 👍 👎 ♥ Save
AI Insights
  • The authors acknowledge that their study is limited by its reliance on a small number of datasets. (ML: 0.99)👍👎
  • Domain shift: a phenomenon where the distribution of data in the training set differs from that of the testing set. (ML: 0.98)👍👎
  • The development of DSCF highlights the need for large-scale and diverse training data. (ML: 0.97)👍👎
  • Recent works have proposed unsupervised domain adaptation frameworks, but their effectiveness beyond the originally reported datasets is yet to be independently evaluated. (ML: 0.95)👍👎
  • The results of this benchmarking experiment have shown that classifying test samples that are in-distribution to the training dataset is significantly easier than test samples suffering from distribution shift due to changes in instruments and acquisition conditions, and additional contaminants. (ML: 0.94)👍👎
  • Foundation model: a pre-trained model that can be fine-tuned for specific tasks, often using transfer learning. (ML: 0.92)👍👎
  • SANet demonstrated the best overall performance across the datasets. (ML: 0.84)👍👎
  • The study benchmarks only five architectures and relies on minimal spectral pre-processing. (ML: 0.77)👍👎
  • Existing open-source Raman datasets are often restricted in size, chemical diversity or experimental variability. (ML: 0.67)👍👎
  • Creating large, curated experimental Raman spectral datasets that span multiple instruments, materials and measurement settings is key to developing a Raman-specific foundation model. (ML: 0.61)👍👎
  • Raman spectroscopy: a technique used to analyze the vibrational modes of molecules. (ML: 0.52)👍👎
Abstract
Deep learning classifiers for Raman spectroscopy are increasingly reported to outperform classical chemometric approaches. However, their evaluations are often conducted in isolation or compared against traditional machine learning methods or trivially adapted vision-based architectures that were not originally proposed for Raman spectroscopy. As a result, direct comparisons between existing deep learning models developed specifically for Raman spectral analysis on shared open-source datasets remain scarce. To the best of our knowledge, this study presents one of the first systematic benchmarks comparing three or more published Raman-specific deep learning classifiers across multiple open-source Raman datasets. We evaluate five representative deep learning architectures under a unified training and hyperparameter tuning protocol across three open-source Raman datasets selected to support standard evaluation, fine-tuning, and explicit distribution-shift testing. We report classification accuracies and macro-averaged F1 scores to provide a fair and reproducible comparison of deep learning models for Raman spectra-based classification.
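The macro-averaged F1 score reported in this abstract is the unweighted mean of per-class F1 scores; a minimal sketch of that standard definition (the labels and predictions in any usage are illustrative, not from the paper):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: unweighted mean of per-class F1 scores,
    so rare classes count as much as common ones."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```

Unlike plain accuracy, this treats every class equally, which matters when spectral datasets have imbalanced class sizes.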
Why are we recommending this paper?
Because deep learning is a popular topic and you have fewer than 3 interests with available recommendations.
Quantinuum Ltd
Rate paper: 👍 👎 ♥ Save
AI Insights
  • L1 Relative Change (L1RC): A measure of the difference between two probability distributions. (ML: 0.98)👍👎
  • Signal-to-Noise Ratio (SNR): The ratio of the signal power to the noise power in a system. (ML: 0.93)👍👎
  • However, on Real Pauli data the advantage clearly shifts toward the ML-based models, which outperform all baselines in both median L1 relative change and fraction of improved circuits. (ML: 0.93)👍👎
  • Deep learning models can learn corrections directly from data gathered during circuit runs, more easily capturing correlations. (ML: 0.88)👍👎
  • The best performing models are comparable to the best baseline methods on Simulated data (both Pauli and Random). (ML: 0.87)👍👎
  • It is defined as the L1 norm of the difference between the two distributions. (ML: 0.87)👍👎
  • The learned mapping from P noisy and circuit features to P ideal captures a richer structure that goes beyond coarse depolarization or measurement-error mitigation. (ML: 0.81)👍👎
  • The PERCEIVER model consistently achieves as good or greater median performance than the baseline mitigation techniques for Pauli circuits. (ML: 0.80)👍👎
  • The deep learning approaches can generalize across noise regimes, device generations, and circuit families without relying on a predefined noise model. (ML: 0.79)👍👎
  • The baseline methods retain value as lightweight, interpretable mitigation techniques, particularly for structured, low-depth circuits. (ML: 0.61)👍👎
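The L1 relative change described in these insights can be read as the relative reduction in L1 distance to the ideal output distribution; a sketch under that assumed normalization (the paper's exact definition may differ, and the distributions in any usage are illustrative):

```python
def l1_distance(p, q):
    """L1 norm of the difference between two probability distributions,
    given as dicts mapping outcomes (e.g. bitstrings) to probabilities."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def l1_relative_change(ideal, noisy, mitigated):
    """Assumed reading of L1RC: relative reduction in L1 distance to the
    ideal distribution. Positive values mean mitigation moved the output
    closer to ideal; zero means no change."""
    before = l1_distance(noisy, ideal)
    after = l1_distance(mitigated, ideal)
    return (before - after) / before if before else 0.0
```

A "fraction of improved circuits" metric would then just count circuits where this value is positive.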
Abstract
We present a systematic investigation of deep learning methods applied to quantum error mitigation of noisy output probability distributions from measured quantum circuits. We compare different architectures, from fully connected neural networks to transformers, and we test different design/training modalities, identifying sequence-to-sequence, attention-based models as the most effective on our datasets. These models consistently produce mitigated distributions that are closer to the ideal outputs when tested on both simulated and real device data obtained from IBM superconducting quantum processing units (QPU) up to five qubits. Across several different circuit depths, our approach outperforms other baseline error mitigation techniques. We perform a series of ablation studies to examine: how different input features (circuit, device properties, noisy output statistics) affect performance; cross-dataset generalization across circuit families; and transfer learning to a different IBM QPU. We observe that generalization performance across similar devices with the same architecture works effectively, without needing to fully retrain models.
Why are we recommending this paper?
Because deep learning is a popular topic and you have fewer than 3 interests with available recommendations.
UC Santa Cruz
Rate paper: 👍 👎 ♥ Save
AI Insights
  • However, they often require large amounts of data and computational resources to train, which can be a limitation. (ML: 0.98)👍👎
  • The use of text-to-image diffusion models for image editing has been explored by several researchers, including those who have developed models such as Qwen-Image and Omnigen2. (ML: 0.94)👍👎
  • Text-to-image diffusion models have become increasingly popular in recent years, with many researchers exploring their potential applications. (ML: 0.93)👍👎
  • Text-to-image diffusion models are a type of artificial intelligence that can generate images from text descriptions. (ML: 0.91)👍👎
  • They have many potential applications, but require large amounts of data and computational resources to train. (ML: 0.91)👍👎
  • These models can be used for various tasks such as image editing, object removal, and text-to-image synthesis. (ML: 0.89)👍👎
  • These models can be used for various tasks such as object removal, text-to-image synthesis, and instruction-guided image editing. (ML: 0.88)👍👎
Abstract
Recent image generation models have shown impressive progress, yet they often struggle to yield controllable and consistent results when users attempt to edit specific elements within an existing image. Layered representations enable flexible, user-driven content creation, but existing approaches often fail to produce layers with coherent compositing relationships, and their object layers typically lack realistic visual effects such as shadows and reflections. To overcome these limitations, we propose LASAGNA, a novel, unified framework that generates an image jointly with its composing layers--a photorealistic background and a high-quality transparent foreground with compelling visual effects. Unlike prior work, LASAGNA efficiently learns correct image composition from a wide range of conditioning inputs--text prompts, foreground, background, and location masks--offering greater controllability for real-world applications. To enable this, we introduce LASAGNA-48K, a new dataset composed of clean backgrounds and RGBA foregrounds with physically grounded visual effects. We also propose LASAGNABENCH, the first benchmark for layer editing. We demonstrate that LASAGNA excels in generating highly consistent and coherent results across multiple image layers simultaneously, enabling diverse post-editing applications that accurately preserve identity and visual effects. LASAGNA-48K and LASAGNABENCH will be publicly released to foster open research in the community. The project page is https://rayjryang.github.io/LASAGNA-Page/.
Why are we recommending this paper?
Because image and video generation is a popular topic and you have fewer than 3 interests with available recommendations.
Nanjing University NJU
Rate paper: 👍 👎 ♥ Save
AI Insights
  • StableWorld also alleviates error accumulation in autoregressive video generation, resulting in more stable, consistent, and higher-quality long videos. (ML: 0.87)👍👎
  • Autoregressive video generation: A technique where each frame is generated based on the previous one(s), often leading to error accumulation. (ML: 0.85)👍👎
  • Geometric similarity: A measure of how similar two frames are based on their geometric structure. (ML: 0.82)👍👎
  • Sliding window approach: A method where a fixed-size window is moved over the sequence, and the most recent frames are used to generate new ones. (ML: 0.82)👍👎
  • The paper proposes a method called StableWorld for long-horizon interactive video generation, which aims to prevent error accumulation and maintain temporal consistency. (ML: 0.81)👍👎
  • StableWorld effectively prevents cumulative errors by continuously filtering out degraded frames while maintaining coherent motion, resulting in more stable and temporally consistent interactive video sequences. (ML: 0.77)👍👎
  • The method's ability to identify and discard a large number of drifted frames during generation has the potential to reduce training cost and aligns naturally with future extensions toward memory-augmented world models. (ML: 0.76)👍👎
  • ORB (Oriented FAST and Rotated BRIEF): A feature detector that extracts keypoints with their descriptors for matching purposes. (ML: 0.63)👍👎
  • StableWorld uses a sliding window approach with dynamic frame eviction based on geometric similarity computed using ORB features. (ML: 0.63)👍👎
  • The method is evaluated on several benchmarks, including Matrix-Game 2.0, Hunyuan-Gamecraft 1.0, Open-Oasis, and Self-Forcing, showing improved stability and consistency in long-horizon generation. (ML: 0.59)👍👎
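The dynamic frame eviction mechanism described in these insights can be illustrated generically; `similarity` here stands in for the ORB-based geometric score, and the window size and threshold are illustrative assumptions, not the paper's values:

```python
def evict_drifted_frames(frames, reference, similarity, window=8, threshold=0.6):
    """Sliding-window frame buffer that drops frames whose geometric
    similarity to a clean reference frame falls below a threshold, so
    degraded frames never condition the next generation step.
    `similarity(reference, frame)` stands in for an ORB feature-matching
    score in [0, 1]; window and threshold values are illustrative."""
    buffer = []
    for frame in frames:
        if similarity(reference, frame) >= threshold:
            buffer.append(frame)      # geometrically consistent: keep
        # else: drifted frame is evicted and never enters the context
        if len(buffer) > window:      # retain only the most recent frames
            buffer.pop(0)
    return buffer
```

The key idea from the paper survives even in this toy form: errors accumulate because drifted frames are fed back as context, so filtering them at the buffer level cuts the feedback loop at its source.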
Abstract
In this paper, we explore the overlooked challenge of stability and temporal consistency in interactive video generation, which synthesizes dynamic and controllable video worlds through interactive behaviors such as camera movements and text prompts. Despite remarkable progress in world modeling, current methods still suffer from severe instability and temporal degradation, often leading to spatial drift and scene collapse during long-horizon interactions. To better understand this issue, we initially investigate the underlying causes of instability and identify that the major source of error accumulation originates from the same scene, where generated frames gradually deviate from the initial clean state and propagate errors to subsequent frames. Building upon this observation, we propose a simple yet effective method, StableWorld, a Dynamic Frame Eviction Mechanism. By continuously filtering out degraded frames while retaining geometrically consistent ones, StableWorld effectively prevents cumulative drift at its source, leading to more stable and temporally consistent interactive generation. Promising results on multiple interactive video models, e.g., Matrix-Game, Open-Oasis, and Hunyuan-GameCraft, demonstrate that StableWorld is model-agnostic and can be applied to different interactive video generation frameworks to substantially improve stability, temporal consistency, and generalization across diverse interactive scenarios.
Why are we recommending this paper?
Because image and video generation is a popular topic and you have fewer than 3 interests with available recommendations.

We did not find much content matching your interests, so we've included some additional topics that are popular. Also be aware that if a topic is not present on arXiv, we won't be able to recommend it.

Mercor
Rate paper: 👍 👎 ♥ Save
AI Insights
  • McNemar's exact test: A statistical test used to compare the performance of two related samples. (ML: 0.97)👍👎
  • Pass@1: The proportion of tasks completed correctly by an agent. (ML: 0.95)👍👎
  • Significance tests using McNemar's exact test with Benjamini-Hochberg correction compare Kimi-K2-Thinking against Gemini-3-flash-preview (p=5.68e-23), GPT-OSS-120B (p=1.0000), and GPT-5.2 (p=7.29e-10); the GPT-OSS-120B comparison is not significant. (ML: 0.95)👍👎
  • The APEX–Agents benchmark highlights the importance of developing AI models that can perform complex tasks in various professional domains, with a focus on toolbelt approaches, context window management, and intentional termination. (ML: 0.94)👍👎
  • Benjamini-Hochberg correction: A method for controlling false discovery rate in multiple testing. (ML: 0.94)👍👎
  • The APEX–Agents benchmark is a comprehensive evaluation of AI models' ability to perform complex tasks in various professional domains. (ML: 0.93)👍👎
  • The most frequently used tools by agents are code execution (256,000), add tool to the toolbelt (200,000), list files in the file system (163,874), read spreadsheet tab (127,000), and search the PDF (86,000). (ML: 0.93)👍👎
  • The benchmark consists of 227 tasks covering finance, law, and management consulting, each requiring the model to complete a specific task using a set of provided tools. (ML: 0.89)👍👎
  • The top-performing models on the APEX–Agents benchmark are Gemini 3 Flash, GPT-5.2, and Kimi K2 Thinking, with Pass@1 scores of 0.555, 0.497, and 0.391 respectively. (ML: 0.88)👍👎
  • ReAct paradigm: A toolbelt approach where reasoning and acting are interleaved in a single loop. (ML: 0.79)👍👎
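The two statistical tools named in these insights follow standard textbook definitions; a sketch of those forms (the discordant counts and p-values in any usage are illustrative, not the paper's):

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar test on discordant counts:
    b = tasks only model A solved, c = tasks only model B solved.
    Under H0 each discordant task favors A with probability 1/2,
    so the test is an exact binomial on min(b, c) out of b + c."""
    n, k = b + c, min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (controls the false
    discovery rate across multiple comparisons)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):   # walk from largest p downward
        i = order[rank]
        running_min = min(running_min, pvals[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted
```

Equal discordant counts give p = 1.0 (no evidence either model is better), which is consistent with one of the reported comparisons above being non-significant.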
Abstract
We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. APEX-Agents requires agents to navigate realistic work environments with files and tools. We test eight agents for the leaderboard using Pass@1. Gemini 3 Flash (Thinking=High) achieves the highest score of 24.0%, followed by GPT-5.2 (Thinking=High), Claude Opus 4.5 (Thinking=High), and Gemini 3 Pro (Thinking=High). We open source the APEX-Agents benchmark (n=480) with all prompts, rubrics, gold outputs, files, and metadata. We also open-source Archipelago, our infrastructure for agent execution and evaluation.
Why are we recommending this paper?
Because AI agents is a popular topic and you have fewer than 3 interests with available recommendations.
Renmin University of China
Rate paper: 👍 👎 ♥ Save
AI Insights
  • Agentic capabilities: Fundamental skills like exploration, tool use, and self-verification. (ML: 0.96)👍👎
  • Current results have limitations, such as generated videos being limited to simple animations and composed music lacking expressiveness and creativity. (ML: 0.95)👍👎
  • The agentic capability benchmark provided by LLM-in-Sandbox can be used to evaluate models' ability to leverage computational environments. (ML: 0.94)👍👎
  • Strong LLMs exhibit emergent capabilities to leverage the sandbox environment for general tasks. (ML: 0.92)👍👎
  • LLM-in-Sandbox can be used as an agentic capability benchmark, measuring fundamental skills like exploration, tool use, and self-verification. (ML: 0.91)👍👎
  • The metric ∆=LLM-in-Sandbox−LLM offers a meaningful indicator of a model's ability to leverage computational environments. (ML: 0.90)👍👎
  • LLM-in-Sandbox has the potential to become the default paradigm for serving LLMs, enabling them to perform general tasks and produce actual outputs rather than text descriptions. (ML: 0.88)👍👎
  • LLM-in-Sandbox: A paradigm that grants LLMs access to a virtual computer and enables them to leverage this environment for general tasks. (ML: 0.85)👍👎
  • Sandbox-native model training: Training models to interact with the sandbox environment as a first-class objective. (ML: 0.82)👍👎
Abstract
We introduce LLM-in-Sandbox, enabling LLMs to explore within a code sandbox (i.e., a virtual computer), to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs, without additional training, exhibit generalization capabilities to leverage the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, leverage the file system to handle long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be enhanced through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which uses only non-agentic data to train models for sandbox exploration. Experiments demonstrate that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization spanning mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze LLM-in-Sandbox's efficiency from computational and system perspectives, and open-source it as a Python package to facilitate real-world deployment.
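The execute-and-read-back loop that LLM-in-Sandbox relies on can be sketched minimally; this assumes nothing about the paper's actual package API, and a real sandbox would add much stronger isolation (filesystem, network, resource limits) than a bare subprocess:

```python
import os
import subprocess
import sys
import tempfile

def run_in_sandbox(code, timeout=10):
    """Write model-generated Python code to a temp file and execute it
    in a separate interpreter process, returning (stdout, stderr, rc).
    This sketch only isolates the process and bounds its runtime; it is
    not the paper's implementation."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=timeout)
        return proc.stdout, proc.stderr, proc.returncode
    finally:
        os.unlink(path)
```

The point of the loop is that the model gets concrete outputs back (files, printed results, exit codes) rather than having to describe results in text, which is what enables the self-verification behavior the insights mention.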
Why are we recommending this paper?
Because AI agents is a popular topic and you have fewer than 3 interests with available recommendations.
University of Amsterdam
Rate paper: 👍 👎 ♥ Save
AI Insights
  • The consistency requirement proposed by the authors is not just statistical frequency but having context-relative grounds for expecting further outputs of comparable novelty and value. (ML: 0.97)👍👎
  • The concept of creativity should remain flexible across different domains of creativity, and the indeterminacy of the consistency requirement allows for this flexibility. (ML: 0.96)👍👎
  • The consistency requirement proposed by the authors is a more inclusive and functional approach to defining creativity, allowing for non-human natural processes to be labelled 'creative'. (ML: 0.96)👍👎
  • The consistency requirement proposed by the authors may not be applicable in all contexts, especially where authenticity conditions the value of the products being generated or examined. (ML: 0.94)👍👎
  • The IAC has functional value in specific local contexts, such as cognitive science, jurisprudence, and certain domains of creative practice where authenticity conditions the value of the products being generated or examined. (ML: 0.94)👍👎
  • New Standard Definition (NSD) of Creativity: An object is creative if it is novel, valuable, and the product of a system that can consistently generate novel and valuable objects. (ML: 0.92)👍👎
  • The article proposes a new standard definition (NSD) of creativity, which drops the intentional agency condition (IAC) as a necessary condition of creativity. (ML: 0.89)👍👎
  • The article does not provide a comprehensive account of where the IAC ought to be applied. (ML: 0.89)👍👎
  • The IAC should be excluded from our definition of the genus of creativity but retained as a means of distinguishing between certain species of creativity. (ML: 0.89)👍👎
  • Intentional Agency Condition (IAC): A necessary condition of creativity that requires an agent to intentionally endeavor to express themselves. (ML: 0.82)👍👎
Abstract
Many theorists of creativity maintain that intentional agency is a necessary condition of creativity. We argue that this requirement, which we call the Intentional Agency Condition (IAC), should be rejected as a general condition of creativity, while retaining its relevance in specific contexts. We show that recent advances in generative AI have rendered the IAC increasingly problematic, both descriptively and functionally. We offer two reasons for abandoning it at the general level. First, we present corpus evidence indicating that authors and journalists are increasingly comfortable ascribing creativity to generative AI, despite its lack of intentional agency. This development places pressure on the linguistic intuitions that have traditionally been taken to support the IAC. Second, drawing on the method of conceptual engineering, we argue that the IAC no longer fulfils its core social function. Rather than facilitating the identification and encouragement of reliable sources of novel and valuable products, it now feeds into biases that distort our assessments of AI-generated outputs. We therefore propose replacing the IAC with a consistency requirement, according to which creativity tracks the reliable generation of novel and valuable products. Nonetheless, we explain why the IAC should be retained in specific local domains.
Why are we recommending this paper?
Because AI and society is a popular topic and you have fewer than 3 interests with available recommendations.
Sony
Rate paper: 👍 👎 ♥ Save
AI Insights
  • The paper concludes that current XAI methods are based on flawed assumptions and lack a clear understanding of the relationship between humans and machines. (ML: 0.98)👍👎
  • Apparatuses: The technical tools, methods, and narratives that constitute what is made intelligible and what is excluded from intelligibility in XAI practices. (ML: 0.97)👍👎
  • The paper critiques the current state of Explainable AI (XAI) methods, arguing that they are based on flawed assumptions and lack a clear understanding of the relationship between humans and machines. (ML: 0.97)👍👎
  • The paper highlights the limitations of current XAI methods, including their reliance on simplifications and abstractions that erase the original system, and their failure to account for human-machine incommensurability. (ML: 0.96)👍👎
  • The authors propose an agential realist approach to XAI, which views interpretation as a relational co-production of interpretable phenomena through intra-actions between human and non-human agencies. (ML: 0.96)👍👎
  • Agential cut: The moment at which an interpretive apparatus enacts a relational co-production of interpretable phenomena through intra-actions between human and non-human agencies. (ML: 0.96)👍👎
  • Agential realism: A philosophical framework that views knowledge as an intra-action between human and non-human agencies. (ML: 0.94)👍👎
  • Intra-action: The process by which human and non-human agencies co-produce interpretable phenomena through their entanglements. (ML: 0.92)👍👎
  • The authors suggest that a diffractive optic offers a more philosophically robust reading of XAI practices, one that acknowledges the emergent nature of interpretation and the importance of situated contexts. (ML: 0.90)👍👎
  • This approach challenges the dominant reflectivity and refractivity optics in XAI, which assume that meaning pre-exists the practices and beings that produce it. (ML: 0.75)👍👎
Abstract
Explainable AI (XAI) is frequently positioned as a technical problem of revealing the inner workings of an AI model. This position is affected by unexamined onto-epistemological assumptions: meaning is treated as immanent to the model, the explainer is positioned outside the system, and a causal structure is presumed recoverable through computational techniques. In this paper, we draw on Barad's agential realism to develop an alternative onto-epistemology of XAI. We propose that interpretations are material-discursive performances that emerge from situated entanglements of the AI model with humans, context, and the interpretative apparatus. To develop this position, we read a comprehensive set of XAI methods through agential realism and reveal the assumptions and limitations that underpin several of these methods. We then articulate the framework's ethical dimension and propose design directions for XAI interfaces that support emergent interpretation, using a speculative text-to-music interface as a case study.
Why are we recommending this paper?
Because AI and society is a popular topic and you have fewer than 3 interests with available recommendations.
Stanford University
Rate paper: 👍 👎 ♥ Save
AI Insights
  • LLM: Large Language Model RL: Reinforcement Learning ML engineering tasks: Machine learning tasks that heavily depend on feature engineering and hyper-parameter tuning rather than algorithm development. (ML: 0.97)👍👎
  • The paper demonstrates the feasibility and potential of automated execution feedback loops in LLM research problems, but highlights remaining limitations that need to be addressed. (ML: 0.96)👍👎
  • Execution grounding for code: The idea of learning from execution feedback in the code generation domain. (ML: 0.96)👍👎
  • Future work should focus on improving generalizability testing, exploring richer learning signals from execution trajectories, developing more capable execution agents, and incorporating alternative metrics such as idea novelty and interestingness. (ML: 0.95)👍👎
  • They find that models tend to converge on simple ideas to improve the average reward but lose diversity and do not improve the upper-bound. (ML: 0.95)👍👎
  • The paper presents a large-scale parallel executor for automatically executing model-generated ideas to verify their effectiveness on open-ended LLM research problems. (ML: 0.92)👍👎
  • The authors analyze the effectiveness of execution-guided evolutionary search and reinforcement learning with execution rewards. (ML: 0.86)👍👎
  • The paper highlights the limitations of current experiments, including a lack of generalizability testing, limited exploration incentives in RL objectives, and noise in the reward signal due to the execution agent's capabilities. (ML: 0.84)👍👎
Abstract
Automated AI research holds great potential to accelerate scientific discovery. However, current LLMs often generate plausible-looking but ineffective ideas. Execution grounding may help, but it is unclear whether automated execution is feasible and whether LLMs can learn from the execution feedback. To investigate these, we first build an automated executor to implement ideas and launch large-scale parallel GPU experiments to verify their effectiveness. We then convert two realistic research problems - LLM pre-training and post-training - into execution environments and demonstrate that our automated executor can implement a large fraction of the ideas sampled from frontier LLMs. We analyze two methods to learn from the execution feedback: evolutionary search and reinforcement learning. Execution-guided evolutionary search is sample-efficient: it finds a method that significantly outperforms the GRPO baseline (69.4% vs 48.0%) on post-training, and finds a pre-training recipe that outperforms the nanoGPT baseline (19.7 minutes vs 35.9 minutes) on pre-training, all within just ten search epochs. Frontier LLMs often generate meaningful algorithmic ideas during search, but they tend to saturate early and only occasionally exhibit scaling trends. Reinforcement learning from execution reward, on the other hand, suffers from mode collapse. It successfully improves the average reward of the ideator model but not the upper-bound, due to models converging on simple ideas. We thoroughly analyze the executed ideas and training dynamics to facilitate future efforts towards execution-grounded automated AI research.
Why we are recommending this paper?
Because research automation with ai is a popular topic and you have less than 3 interests with available recommendations
Florida Institute of Technology
Rate paper: 👍 👎 ♥ Save
AI Insights
  • However, AI also has its limitations and challenges, including the issue of impostor bias, where AI systems may mistakenly identify a legitimate file or activity as malicious. (ML: 0.96)👍👎
  • Limited accuracy: AI systems may not always accurately identify malicious files or activities. (ML: 0.96)👍👎
  • Impostor bias: AI systems may mistakenly identify a legitimate file or activity as malicious. (ML: 0.94)👍👎
  • To address these challenges, researchers are working on developing more accurate and reliable AI systems for digital forensics. (ML: 0.93)👍👎
  • To address these challenges, researchers are working on developing more accurate and reliable AI systems for digital forensics. (ML: 0.93)👍👎
  • Artificial Intelligence (AI): A type of computer system that can perform tasks that would typically require human intelligence, such as learning, problem-solving, and decision-making. (ML: 0.92)👍👎
  • Digital Forensics: The process of collecting, analyzing, and preserving evidence related to cybercrime and other digital crimes. (ML: 0.92)👍👎
  • The use of artificial intelligence (AI) in digital forensics is becoming increasingly important as cybercrime continues to grow. (ML: 0.92)👍👎
  • The use of AI in digital forensics is becoming increasingly important, but it also has its limitations and challenges. (ML: 0.90)👍👎
  • The use of AI in digital forensics is becoming increasingly important, but it also has its limitations and challenges. (ML: 0.90)👍👎
Abstract
In an era where cyber threats are rapidly evolving, the reliability of cyber forensic analysis has become increasingly critical for effective digital investigations and cybersecurity responses. AI agents are being adopted across digital forensic practices due to their ability to automate processes such as anomaly detection, evidence classification, and behavioral pattern recognition, significantly enhancing scalability and reducing investigation timelines. However, the characteristics that make AI indispensable also introduce notable risks. AI systems, often trained on biased or incomplete datasets, can produce misleading results, including false positives and false negatives, thereby jeopardizing the integrity of forensic investigations. This study presents a meticulous comparative analysis of the effectiveness of the most used AI agent, ChatGPT, and human forensic investigators in the realm of cyber forensic analysis. Our research reveals critical limitations within AI-driven approaches, demonstrating scenarios in which sophisticated or novel cyber threats remain undetected due to the rigid pattern-based nature of AI systems. Conversely, our analysis highlights the crucial role that human forensic investigators play in mitigating these risks. Through adaptive decision-making, ethical reasoning, and contextual understanding, human investigators effectively identify subtle anomalies and threats that may evade automated detection systems. To reinforce our findings, we conducted comprehensive reliability testing of forensic techniques using multiple cyber threat scenarios. These tests confirmed that while AI agents significantly improve the efficiency of routine analyses, human oversight remains crucial in ensuring accuracy and comprehensiveness of the results.
Why we are recommending this paper?
Because research automation with ai is a popular topic and you have less than 3 interests with available recommendations
Purdue University
Rate paper: 👍 👎 ♥ Save
AI Insights
  • The authors acknowledge that their study is limited by its reliance on a small number of datasets. (ML: 0.99)
  • Domain shift: a phenomenon where the distribution of data in the training set differs from that of the testing set. (ML: 0.98)
  • The development of DSCF highlights the need for large-scale and diverse training data. (ML: 0.97)
  • Recent works have proposed unsupervised domain adaptation frameworks, but their effectiveness beyond the originally reported datasets is yet to be independently evaluated. (ML: 0.95)
  • The results of this benchmarking experiment show that classifying test samples that are in-distribution with the training dataset is significantly easier than classifying test samples suffering from distribution shift due to changes in instruments, acquisition conditions, and additional contaminants. (ML: 0.94)
  • Foundation model: a pre-trained model that can be fine-tuned for specific tasks, often using transfer learning. (ML: 0.92)
  • SANet demonstrated the best overall performance across the datasets. (ML: 0.84)
  • The study benchmarks only five architectures and relies on minimal spectral pre-processing. (ML: 0.77)
  • Existing open-source Raman datasets are often restricted in size, chemical diversity, or experimental variability. (ML: 0.67)
  • Creating large, curated experimental Raman spectral datasets that span multiple instruments, materials, and measurement settings is key to developing a Raman-specific foundation model. (ML: 0.61)
  • Raman spectroscopy: a technique used to analyze the vibrational modes of molecules. (ML: 0.52)
Abstract
Deep learning classifiers for Raman spectroscopy are increasingly reported to outperform classical chemometric approaches. However, their evaluations are often conducted in isolation or compared against traditional machine learning methods or trivially adapted vision-based architectures that were not originally proposed for Raman spectroscopy. As a result, direct comparisons between existing deep learning models developed specifically for Raman spectral analysis on shared open-source datasets remain scarce. To the best of our knowledge, this study presents one of the first systematic benchmarks comparing three or more published Raman-specific deep learning classifiers across multiple open-source Raman datasets. We evaluate five representative deep learning architectures under a unified training and hyperparameter tuning protocol across three open-source Raman datasets selected to support standard evaluation, fine-tuning, and explicit distribution-shift testing. We report classification accuracies and macro-averaged F1 scores to provide a fair and reproducible comparison of deep learning models for Raman-spectra-based classification.
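The macro-averaged F1 score reported in the abstract weights every class equally, which matters when Raman datasets are class-imbalanced. A minimal pure-Python sketch of the computation (not the authors' code; the helper name is ours):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute per-class F1 and average with equal weight."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        f1s.append(f1)
    return sum(f1s) / len(f1s)
```

Because each class contributes equally to the mean, a classifier that ignores a rare class is penalized far more than it would be under plain accuracy.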
Why are we recommending this paper?
Because deep learning is a popular topic and you have fewer than 3 interests with available recommendations.
Quantinuum Ltd
AI Insights
  • L1 Relative Change (L1RC): A measure of the difference between two probability distributions. (ML: 0.98)
  • Signal-to-Noise Ratio (SNR): The ratio of the signal power to the noise power in a system. (ML: 0.93)
  • However, on Real Pauli data the advantage clearly shifts toward the ML-based models, which outperform all baselines in both median L1 relative change and fraction of improved circuits. (ML: 0.93)
  • Deep learning models can learn corrections directly from data gathered during circuit runs, more easily capturing correlations. (ML: 0.88)
  • The best performing models are comparable to the best baseline methods on Simulated data (both Pauli and Random). (ML: 0.87)
  • It is defined as the L1 norm of the difference between the two distributions. (ML: 0.87)
  • The learned mapping from P noisy and circuit features to P ideal captures a richer structure that goes beyond coarse depolarization or measurement-error mitigation. (ML: 0.81)
  • The PERCEIVER model consistently achieves as good or greater median performance than the baseline mitigation techniques for Pauli circuits. (ML: 0.80)
  • The deep learning approaches can generalize across noise regimes, device generations, and circuit families without relying on a predefined noise model. (ML: 0.79)
  • The baseline methods retain value as lightweight, interpretable mitigation techniques, particularly for structured, low-depth circuits. (ML: 0.61)
Abstract
We present a systematic investigation of deep learning methods applied to quantum error mitigation of noisy output probability distributions from measured quantum circuits. We compare different architectures, from fully connected neural networks to transformers, and we test different design/training modalities, identifying sequence-to-sequence, attention-based models as the most effective on our datasets. These models consistently produce mitigated distributions that are closer to the ideal outputs when tested on both simulated and real device data obtained from IBM superconducting quantum processing units (QPUs) up to five qubits. Across several different circuit depths, our approach outperforms other baseline error mitigation techniques. We perform a series of ablation studies to examine: how different input features (circuit, device properties, noisy output statistics) affect performance; cross-dataset generalization across circuit families; and transfer learning to a different IBM QPU. We observe that generalization performance across similar devices with the same architecture works effectively, without needing to fully retrain models.
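The insights describe the L1 relative change as built on the L1 norm of the difference between two probability distributions. One plausible reading, where the relative change measures how much mitigation shrinks the L1 distance to the ideal distribution, can be sketched as follows (an assumed formulation for illustration, not the paper's definition; both function names are ours):

```python
def l1_dist(p, q):
    """L1 norm of the elementwise difference between two distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))

def l1_relative_change(p_ideal, p_noisy, p_mitigated):
    """Fractional reduction in L1 distance to the ideal distribution.
    Positive values mean mitigation moved the output closer to ideal."""
    before = l1_dist(p_noisy, p_ideal)
    after = l1_dist(p_mitigated, p_ideal)
    return (before - after) / before
```

For example, if a two-outcome circuit ideally yields [0.5, 0.5] but the device returns [0.9, 0.1], a mitigated output of [0.6, 0.4] recovers 75% of the L1 error.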
Why are we recommending this paper?
Because deep learning is a popular topic and you have fewer than 3 interests with available recommendations.
UC Santa Cruz
AI Insights
  • However, these models often require large amounts of data and computational resources to train, which can be a limitation. (ML: 0.98)
  • The use of text-to-image diffusion models for image editing has been explored by several researchers, including those who have developed datasets such as Qwen-Image and Omnigen2. (ML: 0.94)
  • Text-to-image diffusion models have become increasingly popular in recent years, with many researchers exploring their potential applications. (ML: 0.93)
  • Text-to-image diffusion models are a type of artificial intelligence that can generate images from text descriptions. (ML: 0.91)
  • These models can be used for various tasks such as image editing, object removal, text-to-image synthesis, and instruction-guided image editing. (ML: 0.89)
Abstract
Recent image generation models have shown impressive progress, yet they often struggle to yield controllable and consistent results when users attempt to edit specific elements within an existing image. Layered representations enable flexible, user-driven content creation, but existing approaches often fail to produce layers with coherent compositing relationships, and their object layers typically lack realistic visual effects such as shadows and reflections. To overcome these limitations, we propose LASAGNA, a novel, unified framework that generates an image jointly with its composing layers: a photorealistic background and a high-quality transparent foreground with compelling visual effects. Unlike prior work, LASAGNA efficiently learns correct image composition from a wide range of conditioning inputs (text prompts, foreground, background, and location masks), offering greater controllability for real-world applications. To enable this, we introduce LASAGNA-48K, a new dataset composed of clean backgrounds and RGBA foregrounds with physically grounded visual effects. We also propose LASAGNABENCH, the first benchmark for layer editing. We demonstrate that LASAGNA excels in generating highly consistent and coherent results across multiple image layers simultaneously, enabling diverse post-editing applications that accurately preserve identity and visual effects. LASAGNA-48K and LASAGNABENCH will be publicly released to foster open research in the community. The project page is https://rayjryang.github.io/LASAGNA-Page/.
Why are we recommending this paper?
Because image and video generation is a popular topic and you have fewer than 3 interests with available recommendations.
Nanjing University NJU
AI Insights
  • StableWorld also alleviates error accumulation in autoregressive video generation, resulting in more stable, consistent, and higher-quality long videos. (ML: 0.87)
  • Autoregressive video generation: A technique where each frame is generated based on the previous one(s), often leading to error accumulation. (ML: 0.85)
  • Geometric similarity: A measure of how similar two frames are based on their geometric structure. (ML: 0.82)
  • Sliding window approach: A method where a fixed-size window is moved over the sequence, and the most recent frames are used to generate new ones. (ML: 0.82)
  • The paper proposes a method called StableWorld for long-horizon interactive video generation, which aims to prevent error accumulation and maintain temporal consistency. (ML: 0.81)
  • StableWorld effectively prevents cumulative errors by continuously filtering out degraded frames while maintaining coherent motion, resulting in more stable and temporally consistent interactive video sequences. (ML: 0.77)
  • The method's ability to identify and discard a large number of drifted frames during generation has the potential to reduce training cost and aligns naturally with future extensions toward memory-augmented world models. (ML: 0.76)
  • ORB (Oriented FAST and Rotated BRIEF): A feature detector that extracts keypoints with their descriptors for matching purposes. (ML: 0.63)
  • StableWorld uses a sliding window approach with dynamic frame eviction based on geometric similarity computed using ORB features. (ML: 0.63)
  • The method is evaluated on several benchmarks, including Matrix-Game 2.0, Hunyuan-Gamecraft 1.0, Open-Oasis, and Self-Forcing, showing improved stability and consistency in long-horizon generation. (ML: 0.59)
Abstract
In this paper, we explore the overlooked challenge of stability and temporal consistency in interactive video generation, which synthesizes dynamic and controllable video worlds through interactive behaviors such as camera movements and text prompts. Despite remarkable progress in world modeling, current methods still suffer from severe instability and temporal degradation, often leading to spatial drift and scene collapse during long-horizon interactions. To better understand this issue, we initially investigate the underlying causes of instability and identify that the major source of error accumulation originates from the same scene, where generated frames gradually deviate from the initial clean state and propagate errors to subsequent frames. Building upon this observation, we propose a simple yet effective method, StableWorld, a Dynamic Frame Eviction Mechanism. By continuously filtering out degraded frames while retaining geometrically consistent ones, StableWorld effectively prevents cumulative drift at its source, leading to more stable and temporally consistent interactive generation. Promising results on multiple interactive video models, e.g., Matrix-Game, Open-Oasis, and Hunyuan-GameCraft, demonstrate that StableWorld is model-agnostic and can be applied to different interactive video generation frameworks to substantially improve stability, temporal consistency, and generalization across diverse interactive scenarios.
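The dynamic frame-eviction idea above (retain only frames that stay geometrically consistent with the recent context, so drifted frames cannot propagate errors) can be sketched with a generic similarity function standing in for the ORB-feature matching mentioned in the insights. The function name, similarity measure, and threshold here are illustrative assumptions, not the paper's implementation:

```python
def dynamic_frame_eviction(frames, similarity, threshold):
    """Keep frames whose geometric similarity to the last retained frame
    stays above `threshold`; evict drifted frames so they are never used
    as context for generating later frames."""
    kept = [frames[0]]  # the initial frame is taken as the clean reference
    for frame in frames[1:]:
        if similarity(kept[-1], frame) >= threshold:
            kept.append(frame)
    return kept
```

Treating frames as scalars with similarity `1 - |a - b|` for illustration, a sequence `[0.0, 0.1, 0.9, 0.2]` at threshold 0.7 keeps `[0.0, 0.1, 0.2]`: the drifted 0.9 frame is evicted, and later frames are compared against the last clean one rather than the outlier.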
Why are we recommending this paper?
Because image and video generation is a popular topic and you have fewer than 3 interests with available recommendations.

Interests not found

We did not find any papers that match the interests below. Try other terms, and consider whether the content exists on arxiv.org.
  • Product Roadmap
  • Product Management
  • Product Strategy
You can edit or add more interests any time.