Data Science Development Environment and Productivity

Prompting the Professoriate: A Qualitative Study of Instructor Perspectives on LLMs in Data Science Education

Abstract
Large Language Models (LLMs) have shifted in just a few years from novelty to ubiquity, raising fundamental questions for data science education. Tasks once used to teach coding, writing, and problem-solving can now be completed by LLMs, forcing educators to reconsider both pedagogy and assessment. To understand how instructors are adapting, we conducted semi-structured interviews with 42 instructors from 33 institutions in 10 countries in June and July 2025. Our qualitative analysis reveals a pragmatic mix of optimism and concern. Many respondents view LLMs as inevitable classroom tools -- comparable to calculators or Wikipedia -- while others worry about de-skilling, misplaced confidence, and uneven integration across institutions. Around 58 per cent have already introduced demonstrations, guided activities, or make extensive use of LLMs in their courses, though most expect change to remain slow and uneven. That said, 31 per cent have not used LLMs to teach students and do not plan to. We highlight some instructional innovations, including AI-aware assessments, reflective use of LLMs as tutors, and course-specific chatbots. By sharing these perspectives, we aim to help data science educators adapt collectively to ensure curricula keep pace with technological change.

Machine Learning Operations

👍 👎 ♥ Save

PIPES: A Meta-dataset of Machine Learning Pipelines

Universidade Federal de P

Abstract
Solutions to the Algorithm Selection Problem (ASP) in machine learning face the challenge of high computational costs associated with evaluating various algorithms' performances on a given dataset. To mitigate this cost, the meta-learning field can leverage previously executed experiments shared in online repositories such as OpenML. OpenML provides an extensive collection of machine learning experiments. However, an analysis of OpenML's records reveals limitations. It lacks diversity in pipelines, specifically when exploring data preprocessing steps/blocks, such as scaling or imputation, resulting in limited representation. Its experiments are often focused on a few popular techniques within each pipeline block, leading to an imbalanced sample. To overcome the observed limitations of OpenML, we propose PIPES, a collection of experiments involving multiple pipelines designed to represent all combinations of the selected sets of techniques, aiming at diversity and completeness. PIPES stores the results of experiments performed applying 9,408 pipelines to 300 datasets. It includes detailed information on the pipeline blocks, training and testing times, predictions, performances, and the eventual error messages. This comprehensive collection of results allows researchers to perform analyses across diverse and representative pipelines and datasets. PIPES also offers potential for expansion, as additional data and experiments can be incorporated to support the meta-learning community further. The data, code, supplementary material, and all experiments can be found at https://github.com/cynthiamaia/PIPES.git.

👍 👎 ♥ Save

Statistical Methods in Generative AI

University of Pennsylvann

Abstract
Generative Artificial Intelligence is emerging as an important technology, promising to be transformative in many areas. At the same time, generative AI techniques are based on sampling from probabilistic models, and by default, they come with no guarantees about correctness, safety, fairness, or other properties. Statistical methods offer a promising potential approach to improve the reliability of generative AI techniques. In addition, statistical methods are also promising for improving the quality and efficiency of AI evaluation, as well as for designing interventions and experiments in AI. In this paper, we review some of the existing work on these topics, explaining both the general statistical techniques used, as well as their applications to generative AI. We also discuss limitations and potential future directions.

AI Insights

Conformal prediction gives distribution‑free confidence bands for LLM outputs, turning uncertainty into a measurable metric.
Activation engineering tweaks hidden activations to steer language models toward desired styles without retraining.
Causal inference tools uncover hidden biases in LLM responses, enabling targeted debiasing.
The reliability of steering vectors—directional prompts that guide generation—remains largely uncharted, inviting fresh research.
Uncertainty quantification turns hallucinations into measurable risks, paving the way for safer AI deployment.
“Algorithmic Learning in a Random World” provides a rigorous statistical foundation for model behavior under randomness.
Conformal abstention lets LLMs refuse uncertain queries, dramatically reducing hallucination rates.

Machine Learning Lifecycle

👍 👎 ♥ Save

A Certifiable Machine Learning-Based Pipeline to Predict Fatigue Life of Aircraft Structures

Abstract
Fatigue life prediction is essential in both the design and operational phases of any aircraft, and in this sense safety in the aerospace industry requires early detection of fatigue cracks to prevent in-flight failures. Robust and precise fatigue life predictors are thus essential to ensure safety. Traditional engineering methods, while reliable, are time consuming and involve complex workflows, including steps such as conducting several Finite Element Method (FEM) simulations, deriving the expected loading spectrum, and applying cycle counting techniques like peak-valley or rainflow counting. These steps often require collaboration between multiple teams and tools, added to the computational time and effort required to achieve fatigue life predictions. Machine learning (ML) offers a promising complement to traditional fatigue life estimation methods, enabling faster iterations and generalization, providing quick estimates that guide decisions alongside conventional simulations. In this paper, we present a ML-based pipeline that aims to estimate the fatigue life of different aircraft wing locations given the flight parameters of the different missions that the aircraft will be operating throughout its operational life. We validate the pipeline in a realistic use case of fatigue life estimation, yielding accurate predictions alongside a thorough statistical validation and uncertainty quantification. Our pipeline constitutes a complement to traditional methodologies by reducing the amount of costly simulations and, thereby, lowering the required computational and human resources.

👍 👎 ♥ Save

When Secure Isn't: Assessing the Security of Machine Learning Model Sharing

Politecnico di Milano

Abstract
The rise of model-sharing through frameworks and dedicated hubs makes Machine Learning significantly more accessible. Despite their benefits, these tools expose users to underexplored security risks, while security awareness remains limited among both practitioners and developers. To enable a more security-conscious culture in Machine Learning model sharing, in this paper we evaluate the security posture of frameworks and hubs, assess whether security-oriented mechanisms offer real protection, and survey how users perceive the security narratives surrounding model sharing. Our evaluation shows that most frameworks and hubs address security risks partially at best, often by shifting responsibility to the user. More concerningly, our analysis of frameworks advertising security-oriented settings and complete model sharing uncovered six 0-day vulnerabilities enabling arbitrary code execution. Through this analysis, we debunk the misconceptions that the model-sharing problem is largely solved and that its security can be guaranteed by the file format used for sharing. As expected, our survey shows that the surrounding security narrative leads users to consider security-oriented settings as trustworthy, despite the weaknesses shown in this work. From this, we derive takeaways and suggestions to strengthen the security of model-sharing ecosystems.

AI Insights

Keras pre‑2.13 had CVE‑2024‑3660, enabling arbitrary code execution during model load.
Silent security fixes in Keras’s changelog never received CVE advisories or public notices.
Older Keras releases (e.g., 2.15.0) still out‑number the latest 3.9.0 in PyPI downloads, revealing legacy adoption.
Safe_mode, intended to block code execution, was disabled by default in many vulnerable releases.
The study combined PyPI download analytics with static code review to track vulnerability exposure over time.
“Secure Coding: Principles and Practices” is a must‑read for hardening model‑sharing pipelines.
Recent surveys like “Security Analysis of Deep Learning Systems” and “Machine Learning Model Insecurity: A Threat Landscape” deepen the context.

Model Monitoring

👍 👎 ♥ Save

Reasonable Experiments in Model-Based Systems Engineering

Abstract
With the current trend in Model-Based Systems Engineering towards Digital Engineering and early Validation & Verification, experiments are increasingly used to estimate system parameters and explore design decisions. Managing such experimental configuration metadata and results is of utmost importance in accelerating overall design effort. In particular, we observe it is important to 'intelligent-ly' reuse experiment-related data to save time and effort by not performing potentially superfluous, time-consuming, and resource-intensive experiments. In this work, we present a framework for managing experiments on digital and/or physical assets with a focus on case-based reasoning with domain knowledge to reuse experimental data efficiently by deciding whether an already-performed experiment (or associated answer) can be reused to answer a new (potentially different) question from the engineer/user without having to set up and perform a new experiment. We provide the general architecture for such an experiment manager and validate our approach using an industrial vehicular energy system-design case study.

👍 👎 ♥ Save

Detecting Model Drifts in Non-Stationary Environment Using Edit Operation Measures

Abstract
Reinforcement learning (RL) agents typically assume stationary environment dynamics. Yet in real-world applications such as healthcare, robotics, and finance, transition probabilities or reward functions may evolve, leading to model drift. This paper proposes a novel framework to detect such drifts by analyzing the distributional changes in sequences of agent behavior. Specifically, we introduce a suite of edit operation-based measures to quantify deviations between state-action trajectories generated under stationary and perturbed conditions. Our experiments demonstrate that these measures can effectively distinguish drifted from non-drifted scenarios, even under varying levels of noise, providing a practical tool for drift detection in non-stationary RL environments.

Machine Learning Resilience

👍 👎 ♥ Save

ForTIFAI: Fending Off Recursive Training Induced Failure for AI Models

UC San Diego, Stanford Un

Rate this image: 😍 👍 👎

Abstract
The increasing reliance on generative AI models has accelerated the generation rate of synthetic data, with some projections suggesting that most available new data for training could be machine-generated by 2030. This shift to a mainly synthetic content presents a critical challenge: repeated training in synthetic data leads to a phenomenon known as model collapse, where model performance degrades over generations of training, eventually rendering the models ineffective. Although prior studies have explored the causes and detection of model collapse, existing mitigation strategies remain limited. In this paper, we identify model overconfidence in their self-generated data as a key driver of collapse. Building on this observation, we propose a confidence-aware loss function that downweights high-confidence predictions during training. We introduce a novel loss function we call Truncated Cross Entropy (TCE). We demonstrate that TCE significantly delays model collapse in recursive training. We provide a model-agnostic framework that links the loss function design to model collapse mitigation and validate our approach both theoretically and empirically, showing that it can extend the model's fidelity interval before collapse by more than 2.3x. Finally, we show that our method generalizes across modalities. These findings suggest that the design of loss functions provides a simple yet powerful tool for preserving the quality of generative models in the era of increasing synthetic data.

AI Insights

Recursive training with a Gaussian Mixture Model can produce richly varied synthetic text, yet it also accelerates factual drift if unchecked.
The Truncated Cross Entropy loss selectively dampens high‑confidence predictions, effectively curbing the runaway confidence that fuels model collapse.
Auto‑regressive generators cannot backtrack once a mistake is made, turning a single error into a cascading error chain across generations.
Empirical results show that TCE extends the fidelity interval by over 2.3×, a substantial gain for long‑term generative pipelines.
The loss‑function framework is model‑agnostic, proving its utility across text, image, and audio modalities in preliminary tests.
A GMM‑based data loop can be tuned to balance diversity and factuality, offering a knob to control the trade‑off between creativity and correctness.
The study’s theoretical analysis links overconfidence to entropy collapse, providing a principled basis for designing future loss functions.

👍 👎 ♥ Save

\texttt{R$^\textbf{2}$AI}: Towards Resistant and Resilient AI in an Evolving World

Shanghai Artificial InteI

Abstract
In this position paper, we address the persistent gap between rapidly growing AI capabilities and lagging safety progress. Existing paradigms divide into ``Make AI Safe'', which applies post-hoc alignment and guardrails but remains brittle and reactive, and ``Make Safe AI'', which emphasizes intrinsic safety but struggles to address unforeseen risks in open-ended environments. We therefore propose \textit{safe-by-coevolution} as a new formulation of the ``Make Safe AI'' paradigm, inspired by biological immunity, in which safety becomes a dynamic, adversarial, and ongoing learning process. To operationalize this vision, we introduce \texttt{R$^2$AI} -- \textit{Resistant and Resilient AI} -- as a practical framework that unites resistance against known threats with resilience to unforeseen risks. \texttt{R$^2$AI} integrates \textit{fast and slow safe models}, adversarial simulation and verification through a \textit{safety wind tunnel}, and continual feedback loops that guide safety and capability to coevolve. We argue that this framework offers a scalable and proactive path to maintain continual safety in dynamic environments, addressing both near-term vulnerabilities and long-term existential risks as AI advances toward AGI and ASI.

AI Insights

Jailbreaking tricks large language models into generating harmful content, a rising NLP security concern.
Defenses like safety‑aware decoding, proactive safety reasoning, and circuit breakers limit LLM outputs before they reach users.
Adversarial training exposes models to malicious examples, strengthening their resistance to attacks.
Robust optimization shapes objectives for worst‑case scenarios, boosting resilience against unforeseen perturbations.
Trustworthy AI demands a multidisciplinary team spanning computer science, mathematics, philosophy, and social sciences.
Recommended reading: Vaswani et al.’s “Attention Is All You Need” and Yi et al.’s survey on jailbreak attacks.
For practical guidance, watch Andrew Ng’s “The Future of Artificial Intelligence” and Yann LeCun’s “How to Build a Safe AI System”.

Data Science Development Tools

👍 👎 ♥ Save

SQLGovernor: An LLM-powered SQL Toolkit for Real World Application

Tencent Inc, Peking Unv

Abstract
SQL queries in real world analytical environments, whether written by humans or generated automatically often suffer from syntax errors, inefficiency, or semantic misalignment, especially in complex OLAP scenarios. To address these challenges, we propose SQLGovernor, an LLM powered SQL toolkit that unifies multiple functionalities, including syntax correction, query rewriting, query modification, and consistency verification within a structured framework enhanced by knowledge management. SQLGovernor introduces a fragment wise processing strategy to enable fine grained rewriting and localized error correction, significantly reducing the cognitive load on the LLM. It further incorporates a hybrid self learning mechanism guided by expert feedback, allowing the system to continuously improve through DBMS output analysis and rule validation. Experiments on benchmarks such as BIRD and BIRD CRITIC, as well as industrial datasets, show that SQLGovernor consistently boosts the performance of base models by up to 10%, while minimizing reliance on manual expertise. Deployed in production environments, SQLGovernor demonstrates strong practical utility and effective performance.

AI Insights

SQLGovernor uses fragment‑wise processing to localize rewrites, lightening the LLM’s reasoning load.
A hybrid self‑learning loop, guided by expert feedback and DBMS logs, refines rewrite rules continuously.
Its rule‑validation engine cross‑checks queries against a curated knowledge base, catching semantic drift early.
By unifying syntax correction, rewriting, modification, and consistency checks, it replaces multiple separate tools.
Continuous learning from real‑world outcomes lets the model improve without manual tuning.
Production deployments show high reliability while cutting manual debugging time.
The knowledge‑management layer stores best‑practice patterns, enabling rapid adaptation to new schemas.

Fault tolerance

👍 👎 ♥ Save

Fault Tolerant Zero Forcing

WinstonSalem State Unvrs

Abstract
Zero forcing is an iterative graph coloring process studied for its wide array of applications. In this process the vertices of the graph are initially designated as filled or unfilled, and a zero forcing set is a set of initially filled vertices that results in all vertices filled after repeated application of a color change rule. The zero forcing number of a graph is the minimum cardinality of a zero forcing set. The zero forcing number has motivated the introduction of a host of variants defined by linear-algebraic or graph-theoretic contexts. We define a variant we term the $k$-fault tolerant zero forcing number, which is the minimum cardinality of a set $B$ such that every subset of $B$ of cardinality $|B|-k$ is a zero forcing set. We study the values of this parameter on various graph families, the behavior under various graph operations, and the number of iterations of the color change rule necessary to fill all vertices.

AI Insights

Leaky forcing extends zero forcing by allowing a bounded number of “leaks” while still guaranteeing full propagation.
The paper proves that leaky forcing number is bounded above by the tree‑width plus one, linking it to classic width parameters.
A novel upper bound on the minimum rank of skew‑symmetric matrices is derived via leaky forcing techniques.
Positive‑semidefinite zero forcing is shown to be equivalent to a restricted leaky forcing process on the graph’s complement.
The authors formulate several open problems, including characterizing graphs with equal leaky and ordinary zero‑forcing numbers.
Connections to resilient controllability in networked systems are explored, positioning leaky forcing as a tool for fault‑tolerant design.
Recommended reading includes the AIM monograph “Zero Forcing Sets and Minimum Rank of Graphs” and Barioli et al.’s survey on minimum‑rank problems.

👍 👎 ♥ Save

RAFFLES: Reasoning-based Attribution of Faults for LLM Systems

Capital One

Abstract
We have reached a critical roadblock in the development and enhancement of long-horizon, multi-component LLM agentic systems: it is incredibly tricky to identify where these systems break down and why. Evaluation capabilities that currently exist today (e.g., single pass LLM-as-a-judge) are limited in that they often focus on individual metrics or capabilities, end-to-end outcomes, and are narrowly grounded on the preferences of humans. We argue that to match the agentic capabilities, evaluation frameworks must also be able to reason, probe, iterate, and understand the complex logic passing through these systems over long horizons. In this paper, we present RAFFLES - an evaluation architecture that incorporates reasoning and iterative refinement. Specifically, RAFFLES operates as an iterative, multi-component pipeline, using a central Judge to systematically investigate faults and a set of specialized Evaluators to assess not only the system's components but also the quality of the reasoning by the Judge itself, thereby building a history of hypotheses. We tested RAFFLES against several baselines on the Who&When dataset, a benchmark designed to diagnose the "who" (agent) and "when" (step) of a system's failure. RAFFLES outperforms these baselines, achieving an agent-step fault pair accuracy of over 43% on the Algorithmically-Generated dataset (a substantial increase from the previously published best of 16.6%) and over 20% on the Hand-Crafted dataset (surpassing the previously published best of 8.8%). These results demonstrate a key step towards introducing automated fault detection for autonomous systems over labor-intensive manual human review.

AI Insights

RAFFLES’ Judge iteratively refines hypotheses, logging fault‑attribution reasoning.
Evaluators audit both component faults and the Judge’s own logic for soundness.
It probes long‑horizon agentic chains, revealing hidden error propagation beyond single‑pass judges.
The Who&When benchmark forces pinpointing of both faulty agent and exact step, a nuance missing from most metrics.
Data‑driven fault attribution can drift; RAFFLES’ reasoning layer counters bias via explicit hypothesis testing.
Key readings: “Introduction to Fault Tolerance” and the 2023 survey on fault attribution techniques.
Coursera and edX labs on fault‑tolerant systems complement RAFFLES’ automated diagnostics.

Machine Learning Validation

👍 👎 ♥ Save

Synthetic Dataset Evaluation Based on Generalized Cross Validation

Abstract
With the rapid advancement of synthetic dataset generation techniques, evaluating the quality of synthetic data has become a critical research focus. Robust evaluation not only drives innovations in data generation methods but also guides researchers in optimizing the utilization of these synthetic resources. However, current evaluation studies for synthetic datasets remain limited, lacking a universally accepted standard framework. To address this, this paper proposes a novel evaluation framework integrating generalized cross-validation experiments and domain transfer learning principles, enabling generalizable and comparable assessments of synthetic dataset quality. The framework involves training task-specific models (e.g., YOLOv5s) on both synthetic datasets and multiple real-world benchmarks (e.g., KITTI, BDD100K), forming a cross-performance matrix. Following normalization, a Generalized Cross-Validation (GCV) Matrix is constructed to quantify domain transferability. The framework introduces two key metrics. One measures the simulation quality by quantifying the similarity between synthetic data and real-world datasets, while another evaluates the transfer quality by assessing the diversity and coverage of synthetic data across various real-world scenarios. Experimental validation on Virtual KITTI demonstrates the effectiveness of our proposed framework and metrics in assessing synthetic data fidelity. This scalable and quantifiable evaluation solution overcomes traditional limitations, providing a principled approach to guide synthetic dataset optimization in artificial intelligence research.

Online inference

👍 👎 ♥ Save

Statistical Inference for Misspecified Contextual Bandits

University of WisconsinM

Abstract
Contextual bandit algorithms have transformed modern experimentation by enabling real-time adaptation for personalized treatment and efficient use of data. Yet these advantages create challenges for statistical inference due to adaptivity. A fundamental property that supports valid inference is policy convergence, meaning that action-selection probabilities converge in probability given the context. Convergence ensures replicability of adaptive experiments and stability of online algorithms. In this paper, we highlight a previously overlooked issue: widely used algorithms such as LinUCB may fail to converge when the reward model is misspecified, and such non-convergence creates fundamental obstacles for statistical inference. This issue is practically important, as misspecified models -- such as linear approximations of complex dynamic system -- are often employed in real-world adaptive experiments to balance bias and variance. Motivated by this insight, we propose and analyze a broad class of algorithms that are guaranteed to converge even under model misspecification. Building on this guarantee, we develop a general inference framework based on an inverse-probability-weighted Z-estimator (IPW-Z) and establish its asymptotic normality with a consistent variance estimator. Simulation studies confirm that the proposed method provides robust and data-efficient confidence intervals, and can outperform existing approaches that exist only in the special case of offline policy evaluation. Taken together, our results underscore the importance of designing adaptive algorithms with built-in convergence guarantees to enable stable experimentation and valid statistical inference in practice.

AI Insights

Thompson sampling’s Bayesian optimism extends to contextual bandits, achieving near‑optimal regret in high‑dimensional settings.
Neural‑network approximators in contextual bandits learn complex reward surfaces but amplify misspecification risks highlighted by LinUCB failures.
Off‑policy evaluation via importance sampling and doubly robust estimators yields unbiased policy value estimates without fresh data.
The exploration–exploitation trade‑off remains critical; data scarcity can force aggressive exploration that destabilizes convergence.
Inverse‑probability‑weighted Z‑estimators bridge adaptive experimentation and asymptotic theory, delivering confidence intervals that adapt to policy drift.

👍 👎 ♥ Save

Imitative Membership Inference Attack

Purdue University & NVIDA

Abstract
A Membership Inference Attack (MIA) assesses how much a target machine learning model reveals about its training data by determining whether specific query instances were part of the training set. State-of-the-art MIAs rely on training hundreds of shadow models that are independent of the target model, leading to significant computational overhead. In this paper, we introduce Imitative Membership Inference Attack (IMIA), which employs a novel imitative training technique to strategically construct a small number of target-informed imitative models that closely replicate the target model's behavior for inference. Extensive experimental results demonstrate that IMIA substantially outperforms existing MIAs in various attack settings while only requiring less than 5% of the computational cost of state-of-the-art approaches.

Machine Learning Testing

👍 👎 ♥ Save

Concolic Testing on Individual Fairness of Neural Network Models

National ChengChi Univer

Rate this image: 😍 👍 👎

Abstract
This paper introduces PyFair, a formal framework for evaluating and verifying individual fairness of Deep Neural Networks (DNNs). By adapting the concolic testing tool PyCT, we generate fairness-specific path constraints to systematically explore DNN behaviors. Our key innovation is a dual network architecture that enables comprehensive fairness assessments and provides completeness guarantees for certain network types. We evaluate PyFair on 25 benchmark models, including those enhanced by existing bias mitigation techniques. Results demonstrate PyFair's efficacy in detecting discriminatory instances and verifying fairness, while also revealing scalability challenges for complex models. This work advances algorithmic fairness in critical domains by offering a rigorous, systematic method for fairness testing and verification of pre-trained DNNs.

AI Insights

Fairify PyCT builds a 2‑DNN by duplicating each layer’s weights, enabling symmetry‑aware fairness checks.
In benchmarks, it finds bias in 18 of 25 models, outperforming PyFair by ~30 %.
Its concolic engine generates path constraints that isolate protected‑attribute violations, revealing subtle discrimination.
Runtime scales quadratically with depth, making it resource‑hungry for very deep networks.
Results differ by protected attribute; gender and age yield higher false‑positive rates than race.
For deeper dives, see “Fairness in Machine Learning: A Survey” and the repo at https://github.com/fairlearn/fairlearn.
The tool can be integrated into CI pipelines for continuous fairness monitoring during deployment.

Interests not found

Help us improve your experience!