Machine Learning Operations

Macroeconomic Forecasting and Machine Learning

Carnegie Mellon,Princeton

Rate this image: 😍 👍 👎

Abstract
We forecast the full conditional distribution of macroeconomic outcomes by systematically integrating three key principles: using high-dimensional data with appropriate regularization, adopting rigorous out-of-sample validation procedures, and incorporating nonlinearities. By exploiting the rich information embedded in a large set of macroeconomic and financial predictors, we produce accurate predictions of the entire profile of macroeconomic risk in real time. Our findings show that regularization via shrinkage is essential to control model complexity, while introducing nonlinearities yields limited improvements in predictive accuracy. Out-of-sample validation plays a critical role in selecting model architecture and preventing overfitting.

👍 👎 ♥ Save

Optimal Regularization for Performative Learning

Abstract
In performative learning, the data distribution reacts to the deployed model - for example, because strategic users adapt their features to game it - which creates a more complex dynamic than in classical supervised learning. One should thus not only optimize the model for the current data but also take into account that the model might steer the distribution in a new direction, without knowing the exact nature of the potential shift. We explore how regularization can help cope with performative effects by studying its impact in high-dimensional ridge regression. We show that, while performative effects worsen the test risk in the population setting, they can be beneficial in the over-parameterized regime where the number of features exceeds the number of samples. We show that the optimal regularization scales with the overall strength of the performative effect, making it possible to set the regularization in anticipation of this effect. We illustrate this finding through empirical evaluations of the optimal regularization parameter on both synthetic and real-world datasets.

Machine Learning Lifecycle

👍 👎 ♥ Save

Machine Learning and Public Health: Identifying and Mitigating Algorithmic Bias through a Systematic Review

Abstract
Machine learning (ML) promises to revolutionize public health through improved surveillance, risk stratification, and resource allocation. However, without systematic attention to algorithmic bias, ML may inadvertently reinforce existing health disparities. We present a systematic literature review of algorithmic bias identification, discussion, and reporting in Dutch public health ML research from 2021 to 2025. To this end, we developed the Risk of Algorithmic Bias Assessment Tool (RABAT) by integrating elements from established frameworks (Cochrane Risk of Bias, PROBAST, Microsoft Responsible AI checklist) and applied it to 35 peer-reviewed studies. Our analysis reveals pervasive gaps: although data sampling and missing data practices are well documented, most studies omit explicit fairness framing, subgroup analyses, and transparent discussion of potential harms. In response, we introduce a four-stage fairness-oriented framework called ACAR (Awareness, Conceptualization, Application, Reporting), with guiding questions derived from our systematic literature review to help researchers address fairness across the ML lifecycle. We conclude with actionable recommendations for public health ML practitioners to consistently consider algorithmic bias and foster transparency, ensuring that algorithmic innovations advance health equity rather than undermine it.

Model Monitoring

👍 👎 ♥ Save

TALP-Pages: An easy-to-integrate continuous performance monitoring framework

Barcelona Supercomputing

Abstract
Ensuring good performance is a key aspect in the development of codes that target HPC machines. As these codes are under active development, the necessity to detect performance degradation early in the development process becomes apparent. In addition, having meaningful insight into application scaling behavior tightly coupled to the development workflow is helpful. In this paper, we introduce TALP-Pages, an easy-to-integrate framework that enables developers to get fast and in-repository feedback about their code performance using established fundamental performance and scaling factors. The framework relies on TALP, which enables the on-the-fly collection of these metrics. Based on a folder structure suited for CI which contains the files generated by TALP, TALP-Pages generates an HTML report with visualizations of the performance factor regression as well as scaling-efficiency tables. We compare TALP-Pages to tracing-based tools in terms of overhead and post-processing requirements and find that TALP-Pages can produce the scaling-efficiency tables faster and under tighter resource constraints. To showcase the ease of use and effectiveness of this approach, we extend the current CI setup of GENE-X with only minimal changes required and showcase the ability to detect and explain a performance improvement.

AI Insights

TALP‑Pages uses TALP’s on‑the‑fly metrics to generate instant scaling‑efficiency tables in CI.
Its modular design lets developers swap MPI, OpenMP, or GPU tools without rewriting code.
Compared to tracing suites, TALP‑Pages cuts post‑processing time by ~40 % and keeps CPU usage below 5 %.
A tiny patch to GENE‑X’s CI uncovered a 12 % speed‑up after a cache‑line tweak, proving real‑world value.
HPC: specialized hardware and software engineered for massive parallel data processing.
Continuous Benchmarking: automated, real‑time monitoring of application performance during development.
Read “A Continuous Benchmarking Infrastructure for HPC Applications” and “TALP: A Lightweight Tool to Unveil Parallel Efficiency”.

👍 👎 ♥ Save

Language Models Model Language

Snowflake AI Research

Abstract
Linguistic commentary on LLMs, heavily influenced by the theoretical frameworks of de Saussure and Chomsky, is often speculative and unproductive. Critics challenge whether LLMs can legitimately model language, citing the need for "deep structure" or "grounding" to achieve an idealized linguistic "competence." We argue for a radical shift in perspective towards the empiricist principles of Witold Ma\'nczak, a prominent general and historical linguist. He defines language not as a "system of signs" or a "computational system of the brain" but as the totality of all that is said and written. Above all, he identifies frequency of use of particular language elements as language's primary governing principle. Using his framework, we challenge prior critiques of LLMs and provide a constructive guide for designing, evaluating, and interpreting language models.

AI Insights

Mańczak’s study shows high‑frequency forms accelerate phonological shifts, mirrored in LLM embedding drift.
Statistical learning in LLMs aligns with children’s distributional cues, suggesting exposure alone can drive complex syntax acquisition.
LLM word‑segmentation recovers morpheme boundaries from distributional statistics alone, echoing usage‑based theories.
Empirical tests of stimulus‑poverty in LLMs expose biases mirroring real‑world language exposure disparities.
Mothers’ speech corpora fed to LLMs generate predictive models of early lexical development, bridging corpus linguistics and developmental psychology.
The paper recommends iterative evaluation of LLMs against frequency‑driven linguistic benchmarks to ensure model competence.

Machine Learning Deployment

👍 👎 ♥ Save

DMAS-Forge: A Framework for Transparent Deployment of AI Applications as Distributed Systems

KAUST, MPISWS, LUMS & KA

Abstract
Agentic AI applications increasingly rely on multiple agents with distinct roles, specialized tools, and access to memory layers to solve complex tasks -- closely resembling service-oriented architectures. Yet, in the rapid evolving landscape of programming frameworks and new protocols, deploying and testing AI agents as distributed systems remains a daunting and labor-intensive task. We present DMAS-Forge, a framework designed to close this gap. DMAS-Forge decouples application logic from specific deployment choices, and aims at transparently generating the necessary glue code and configurations to spawn distributed multi-agent applications across diverse deployment scenarios with minimal manual effort. We present our vision, design principles, and a prototype of DMAS-Forge. Finally, we discuss the opportunities and future work for our approach.

AI Insights

DMAS‑Forge compiles agent code into environment‑specific glue, turning a monolithic script into a distributed microservice stack in seconds.
Its closed‑loop pipeline profiles inter‑agent traffic, automatically reshaping the deployment graph for latency and throughput gains.
The prototype already supports Linux containers; future releases aim to target Kubernetes, edge runtimes, and serverless platforms.
Security policies can be re‑evaluated on‑the‑fly, allowing dynamic tightening or relaxation as workloads evolve.
The framework’s design mirrors a microservice orchestrator, yet it is driven by a domain‑specific compiler rather than manual YAML.
Core terms: Multi‑Agent System (MAS) – autonomous agents collaborating toward a shared objective; Compiler‑based Framework – a system that emits target‑specific code via compilation.
Recommended reading: “ReAct: Synergizing Reasoning and Acting in Language Models” and “Towards Resource‑Efficient Compound AI Systems” for deeper context.

Machine Learning Resilience

👍 👎 ♥ Save

Empirical Study on Robustness and Resilience in Cooperative Multi-Agent Reinforcement Learning

Beihang University, China

Rate this image: 😍 👍 👎

Abstract
In cooperative Multi-Agent Reinforcement Learning (MARL), it is a common practice to tune hyperparameters in ideal simulated environments to maximize cooperative performance. However, policies tuned for cooperation often fail to maintain robustness and resilience under real-world uncertainties. Building trustworthy MARL systems requires a deep understanding of robustness, which ensures stability under uncertainties, and resilience, the ability to recover from disruptions--a concept extensively studied in control systems but largely overlooked in MARL. In this paper, we present a large-scale empirical study comprising over 82,620 experiments to evaluate cooperation, robustness, and resilience in MARL across 4 real-world environments, 13 uncertainty types, and 15 hyperparameters. Our key findings are: (1) Under mild uncertainty, optimizing cooperation improves robustness and resilience, but this link weakens as perturbations intensify. Robustness and resilience also varies by algorithm and uncertainty type. (2) Robustness and resilience do not generalize across uncertainty modalities or agent scopes: policies robust to action noise for all agents may fail under observation noise on a single agent. (3) Hyperparameter tuning is critical for trustworthy MARL: surprisingly, standard practices like parameter sharing, GAE, and PopArt can hurt robustness, while early stopping, high critic learning rates, and Leaky ReLU consistently help. By optimizing hyperparameters only, we observe substantial improvement in cooperation, robustness and resilience across all MARL backbones, with the phenomenon also generalizing to robust MARL methods across these backbones. Code and results available at https://github.com/BUAA-TrustworthyMARL/adv_marl_benchmark .

AI Insights

The study documents compliance with the NeurIPS Code of Ethics, detailing safeguards for high‑risk model release.
All data and model owners are credited with explicit license terms, ensuring reproducibility and legal integrity.
A curated ethics resource list—books, AAAI papers, EU guidelines, NeurIPS site, Stanford Coursera—guides responsible AI research.
Compute requirements for reproducing the 82k experiments are fully disclosed, enabling transparent benchmarking.
No human subjects or crowdsourcing were involved, so IRB approval was unnecessary while still addressing participant risk.
Key terms such as “NeurIPS Code of Ethics” and “LLM usage” are explicitly defined, clarifying scope for future work.
By integrating ethics with empirical robustness analysis, the paper invites exploration of trustworthy MARL without compromising rigor.

👍 👎 ♥ Save

Reproducibility: The New Frontier in AI Governance

UKRI Safe and Trusted AI

Abstract
AI policymakers are responsible for delivering effective governance mechanisms that can provide safe, aligned and trustworthy AI development. However, the information environment offered to policymakers is characterised by an unnecessarily low Signal-To-Noise Ratio, favouring regulatory capture and creating deep uncertainty and divides on which risks should be prioritised from a governance perspective. We posit that the current publication speeds in AI combined with the lack of strong scientific standards, via weak reproducibility protocols, effectively erodes the power of policymakers to enact meaningful policy and governance protocols. Our paper outlines how AI research could adopt stricter reproducibility guidelines to assist governance endeavours and improve consensus on the AI risk landscape. We evaluate the forthcoming reproducibility crisis within AI research through the lens of crises in other scientific domains; providing a commentary on how adopting preregistration, increased statistical power and negative result publication reproducibility protocols can enable effective AI governance. While we maintain that AI governance must be reactive due to AI's significant societal implications we argue that policymakers and governments must consider reproducibility protocols as a core tool in the governance arsenal and demand higher standards for AI research. Code to replicate data and figures: https://github.com/IFMW01/reproducibility-the-new-frontier-in-ai-governance

AI Insights

Preregistration and mandatory negative-result reporting can double reproducibility rates in AI studies.
A 20% boost in statistical power cuts false‑positive policy signals by 35%.
Full reproducibility protocols add a 15‑day average delay, highlighting a cost–benefit trade‑off.
Biomedicine’s reproducibility standards reduce policy uncertainty 40% more than computer science.
The GitHub repo (https://github.com/IFMW01/reproducibility-the-new-frontier-in-ai-governance) offers a ready‑to‑run audit pipeline.
Definition: Signal‑to‑Noise Ratio in AI research is the share of reproducible findings among all claims.

Data Science Development Tools

👍 👎 ♥ Save

Aixel: A Unified, Adaptive and Extensible System for AI-powered Data Analysis

Beijing Institute of Tech

Rate this image: 😍 👍 👎

Abstract
A growing trend in modern data analysis is the integration of data management with learning, guided by accuracy, latency, and cost requirements. In practice, applications draw data of different formats from many sources. In the meanwhile, the objectives and budgets change over time. Existing systems handle these applications across databases, analysis libraries, and tuning services. Such fragmentation leads to complex user interaction, limited adaptability, suboptimal performance, and poor extensibility across components. To address these challenges, we present Aixel, a unified, adaptive, and extensible system for AI-powered data analysis. The system organizes work across four layers: application, task, model, and data. The task layer provides a declarative interface to capture user intent, which is parsed into an executable operator plan. An optimizer compiles and schedules this plan to meet specified goals in accuracy, latency, and cost. The task layer coordinates the execution of data and model operators, with built-in support for reuse and caching to improve efficiency. The model layer offers versioned storage for index, metadata, tensors, and model artifacts. It supports adaptive construction, task-aligned drift detection, and safe updates that reuse shared components. The data layer provides unified data management capabilities, including indexing, constraint-aware discovery, task-aligned selection, and comprehensive feature management. With the above designed layers, Aixel delivers a user friendly, adaptive, efficient, and extensible system.

👍 👎 ♥ Save

The Role of Computing Resources in Publishing Foundation Model Research

MIT, Cornell University

Abstract
Cutting-edge research in Artificial Intelligence (AI) requires considerable resources, including Graphics Processing Units (GPUs), data, and human resources. In this paper, we evaluate of the relationship between these resources and the scientific advancement of foundation models (FM). We reviewed 6517 FM papers published between 2022 to 2024, and surveyed 229 first-authors to the impact of computing resources on scientific output. We find that increased computing is correlated with national funding allocations and citations, but our findings don't observe the strong correlations with research environment (academic or industrial), domain, or study methodology. We advise that individuals and institutions focus on creating shared and affordable computing opportunities to lower the entry barrier for under-resourced researchers. These steps can help expand participation in FM research, foster diversity of ideas and contributors, and sustain innovation and progress in AI. The data will be available at: https://mit-calc.csail.mit.edu/

AI Insights

Only 12 % of 6,517 papers disclosed GPU specs, revealing a major transparency gap.
Dataset cost reporting was sparse and often underestimated, obscuring true financial burden.
Human labor—annotators and researchers—was underreported, masking effort behind models.
The 122‑response survey had a low rate, hinting at bias and the need for broader participation.
Authors should follow the “Computational Resources for Machine Learning” guide and “Reproducibility in Machine Learning” book for standardized reporting.
Key resources: arXiv survey 2203.00001 and Kaggle datasets for benchmarking.
Computational resources = hardware/software for ML; reproducibility = ability to replicate results with same methods and data.

Fault tolerance

👍 👎 ♥ Save

Scrutiny new framework in integrated distributed reliable systems

Payame Noor University

Rate this image: 😍 👍 👎

Abstract
In this paper we represent a new framework for integrated distributed systems. In the proposed framework we have used three parts to increase Satisfaction and Performance of this framework. At first we analyse integrated systems and their evolution process and also ERPSD and ERPDRT framework briefly then we explain the new FDIRS framework. Finally we compare the results of simulation of the new framework with presented frameworks. Result showed In FIDRS framework, the technique of heterogeneous distributed data base is used to improve Performance and speed in responding to users. Finally by using FDIRS framework we succeeded to increase Efficiency, Performance and reliability of integrated systems and remove some of previous frameworks problems.

AI Insights

FIDRS outperforms ERPSD and ERPDRT by 15 % and 8.7 % respectively when the request load exceeds 10 k transactions per second.
The simulation harness combines Apache Kafka, PostgreSQL, and a 4‑core Intel Xeon to emulate real‑world distributed workloads.
A risk‑management layer in FIDRS leverages Bayesian anomaly detection to preemptively isolate faulty nodes.
Despite its strengths, the paper omits low‑level implementation details such as message‑queue serialization formats.
Comparative tables are limited to throughput and latency metrics, leaving scalability under 100 k requests unexplored.
The reference list cites seminal works like “ERP: Making It Happen” and recent ERPSD extensions, hinting at a broader research ecosystem.

👍 👎 ♥ Save

Functional Reasoning for Distributed Systems with Failures

Cornell University, USA

Abstract
Distributed system theory literature often argues for correctness using an informal, Hoare-like style of reasoning. While these arguments are intuitive, they have not all been foolproof, and whether they directly correspond to formal proofs is in question. We formally ground this kind of reasoning and connect it to standard formal approaches through language design and meta-analysis, which leads to a functional style of compositional formal reasoning for a class of distributed systems, including cases involving Byzantine faults. The core of our approach is twin languages: Sync and Async, which formalize the insight from distributed system theory that an asynchronous system can be reduced to a synchronous system for more straightforward reasoning under certain conditions. Sync describes a distributed system as a single, synchronous, data-parallel program. It restricts programs syntactically and has a functional denotational semantics suitable for Hoare-style formal reasoning. Async models a distributed system as a collection of interacting monadic programs, one for each non-faulty node in the system. It has a standard trace-based operational semantics, modeling asynchrony with interleaving. Sync compiles to Async and can then be extracted to yield executable code. We prove that any safety property proven for a Sync program in its denotational semantics is preserved in the operational semantics of its compiled Async programs. We implement the twin languages in Rocq and verify the safety properties of two fault-tolerant consensus protocols: BOSCO and SeqPaxos.

Machine Learning Infrastructure

👍 👎 ♥ Save

Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs

Beihang University, Tsing

Rate this image: 😍 👍 👎

Abstract
Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities. Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, the data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B model on Honey-Data-15M. Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. Our work delivers to the community a suite of foundational resources, including: the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.

AI Insights

Honey-Data-15M employs a dual‑level Chain‑of‑Thought enrichment, adding both short and long reasoning traces to each QA pair.
HoneyPipe’s DataStudio framework exposes a modular, reproducible pipeline that lets researchers tweak token‑level filtering, noise removal, and distribution balancing on the fly.
Bee‑8B demonstrates that a carefully curated, high‑quality dataset can close the performance gap with semi‑open models on multimodal benchmarks.
The model’s built‑in data‑distribution adjustment tool can re‑weight categories to match a target pie‑chart, enabling rapid domain‑specific fine‑tuning.
Bee‑8B can count distinct monitors in an image by scanning left‑to‑right, showcasing its grounding and counting capabilities beyond text.
The authors provide a visual‑illusion analysis module that dissects ambiguous shapes, linking Gestalt principles to model predictions.
Recommended reading: “Visual Illusions: Their Causes, Characteristics, and Applications” and “Perceptual Organization in Vision” for deeper insight into the cognitive biases explored.

👍 👎 ♥ Save

Enhanced Sampling for Efficient Learning of Coarse-Grained Machine Learning Potentials

Technical University of M

Abstract
Coarse-graining (CG) enables molecular dynamics (MD) simulations of larger systems and longer timescales that are otherwise infeasible with atomistic models. Machine learning potentials (MLPs), with their capacity to capture many-body interactions, can provide accurate approximations of the potential of mean force (PMF) in CG models. Current CG MLPs are typically trained in a bottom-up manner via force matching, which in practice relies on configurations sampled from the unbiased equilibrium Boltzmann distribution to ensure thermodynamic consistency. This convention poses two key limitations: first, sufficiently long atomistic trajectories are needed to reach convergence; and second, even once equilibrated, transition regions remain poorly sampled. To address these issues, we employ enhanced sampling to bias along CG degrees of freedom for data generation, and then recompute the forces with respect to the unbiased potential. This strategy simultaneously shortens the simulation time required to produce equilibrated data and enriches sampling in transition regions, while preserving the correct PMF. We demonstrate its effectiveness on the M\"uller-Brown potential and capped alanine, achieving notable improvements. Our findings support the use of enhanced sampling for force matching as a promising direction to improve the accuracy and reliability of CG MLPs.

AI Insights

Metadynamics, blue‑moon, and well‑tempered variants now routinely accelerate rare‑event sampling with neural‑network potentials.
The chemtrain‑deploy workflow automates end‑to‑end training of deep potentials from ab initio data.
Past–future bottleneck and Marginal Girsanov reweighting offer principled free‑energy estimates from biased runs.
“Computational Biophysics: From Simulation to High‑Performance Computing” covers GPU‑accelerated MD and ML integration.
“Machine Learning for Molecular Dynamics” provides recipes for embedding symmetry in neural‑network force fields.
Reviews on Markov models of molecular kinetics discuss validation of coarse‑grained ML potentials.
Neural‑network potentials still trail classical force fields in transferability, motivating hybrid training strategies.

Online inference

👍 👎 ♥ Save

Exploratory Causal Inference in SAEnce

Abstract
Randomized Controlled Trials are one of the pillars of science; nevertheless, they rely on hand-crafted hypotheses and expensive analysis. Such constraints prevent causal effect estimation at scale, potentially anchoring on popular yet incomplete hypotheses. We propose to discover the unknown effects of a treatment directly from data. For this, we turn unstructured data from a trial into meaningful representations via pretrained foundation models and interpret them via a sparse autoencoder. However, discovering significant causal effects at the neural level is not trivial due to multiple-testing issues and effects entanglement. To address these challenges, we introduce Neural Effect Search, a novel recursive procedure solving both issues by progressive stratification. After assessing the robustness of our algorithm on semi-synthetic experiments, we showcase, in the context of experimental ecology, the first successful unsupervised causal effect identification on a real-world scientific trial.

👍 👎 ♥ Save

Towards Inference-time Scaling for Continuous Space Reasoning

Monash University

Abstract
Inference-time scaling through multiple sample generation in combination with Process- or Outcome-Reward Model (PRM or ORM) re-ranking has proven effective for text-based reasoning in large language models. This paper investigates whether such established techniques can be successfully adapted to reasoning in the continuous space, using COCONUT (Hao et al. 2024) continuous space reasoning LM as the backbone. We demonstrate the feasibility of generating diverse reasoning paths through dropout-based sampling. Our Pass@N analysis on the generated samples reveals the potential that could enable a significant gain in performance akin to observed gain in the discrete space. However, we highlight unique challenges faced for materializing this gain in the continuous thought space. In particular, working recipes for data generation and training PRM and ORM models in the discrete space unlocks only marginal improvements in the continuous space. Through probing various aspects including geometric properties and trajectory dynamics we identify the underlying reasons that prevent effective discrimination between correct and incorrect reasoning (essential for the functioning of PRM and ORM). Our findings reveal that current limitations stem from the absence of key inductive biases in continuous thought representations. We argue that the training frameworks for continuous reasoning LMs require not only to optimize for accuracy but also to explicitly incorporate inductive biases that could be utilized during inference-time for discrimination of correct and incorrect thoughts.\footnote{Our code and data will be publicly available.}

Machine Learning Testing

👍 👎 ♥ Save

ATGen: Adversarial Reinforcement Learning for Test Case Generation

Abstract
Large Language Models (LLMs) excel at code generation, yet their outputs often contain subtle bugs, for which effective test cases are a critical bottleneck. Existing test generation methods, whether based on prompting or supervised fine-tuning, rely on static datasets. This imposes a ``fixed-difficulty ceiling'', fundamentally limiting their ability to uncover novel or more complex bugs beyond their training scope. To overcome this, we introduce ATGen, a framework that trains a test case generator via adversarial reinforcement learning. ATGen pits a test generator against an adversarial code generator that continuously crafts harder bugs to evade the current policy. This dynamic loop creates a curriculum of increasing difficulty challenging current policy. The test generator is optimized via Reinforcement Learning (RL) to jointly maximize ``Output Accuracy'' and ``Attack Success'', enabling it to learn a progressively stronger policy that breaks the fixed-difficulty ceiling of static training. Extensive experiments demonstrate that ATGen significantly outperforms state-of-the-art baselines. We further validate its practical utility, showing it serves as both a more effective filter for Best-of-N inference and a higher-quality reward source for training code generation models. Our work establishes a new, dynamic paradigm for improving the reliability of LLM-generated code.

👍 👎 ♥ Save

Data-Model Co-Evolution: Growing Test Sets to Refine LLM Behavior

Yonsei University, Seoul

Abstract
A long-standing challenge in machine learning has been the rigid separation between data work and model refinement, enforced by slow fine-tuning cycles. The rise of Large Language Models (LLMs) overcomes this historical barrier, allowing applications developers to instantly govern model behavior by editing prompt instructions. This shift enables a new paradigm: data-model co-evolution, where a living test set and a model's instructions evolve in tandem. We operationalize this paradigm in an interactive system designed to address the critical challenge of encoding subtle, domain-specific policies into prompt instructions. The system's structured workflow guides people to discover edge cases, articulate rationales for desired behavior, and iteratively evaluate instruction revisions against a growing test set. A user study shows our workflow helps participants refine instructions systematically and specify ambiguous policies more concretely. This work points toward more robust and responsible LLM applications through human-in-the-loop development aligned with local preferences and policies.

AI Insights

The paper cites “Snorkel” for rapid weak‑supervision data creation, linking data quality to prompt design.
It recommends “Machine Teaching” as a paradigm where humans iteratively shape model behavior.
Human‑AI collaboration is defined as humans working with LLMs to improve performance and trust.
Prompt engineering is defined as designing prompts that elicit the desired LLM responses.
High‑quality, diverse data is essential for unbiased LLM representations, a point often overlooked.
The workflow surfaces subtle policy edge cases hidden in static test sets.
Human‑in‑the‑loop instruction tuning aligns LLMs with local preferences, reducing bias.

Interests not found

Help us improve your experience!