🎯 Top Personalized Recommendations
University of Pennsylvania
AI Summary - The proposed assessments are designed to measure student learning outcomes more accurately than traditional multiple-choice tests. [3]
- The paper discusses the challenges posed by large language models (LLMs) in education, particularly in assessing student learning and academic integrity. [2]
- The authors propose a framework for designing such assessments, which includes elements of task complexity, interactivity, and feedback. [1]
- It highlights the need for more authentic assessments that reflect real-world problems and require critical thinking and problem-solving skills. [0]
Abstract
The rapid adoption of generative AI has undermined traditional modular assessments in computing education, creating a disconnect between academic evaluation and industry practice. This paper presents a theoretically grounded framework for designing AI-resilient assessments, supported by formal analysis and multi-year empirical validation.
We make three contributions. First, we establish two theoretical results: (1) assessments composed of interconnected problems, where outputs feed into subsequent stages, are more AI-resilient than modular assessments because current language models struggle with sustained multi-step reasoning and context; and (2) semi-structured problems with deterministic success criteria provide more reliable measures of student competency than fully open-ended projects, which allow AI systems to default to familiar solution patterns. These results challenge common policy and institutional guidance that promotes open-ended assessments as the primary safeguard for academic integrity.
Second, we validate these results using data from four university data science courses (N = 138). While students achieve near-perfect scores on AI-assisted modular homework, performance drops by roughly 30 percentage points on proctored exams, indicating substantial AI score inflation. Interconnected projects remain strongly correlated with modular assessments, suggesting they measure the same underlying skills while resisting AI misuse. Proctored exams show weaker alignment, implying they may assess test-taking ability rather than intended learning outcomes.
Third, we translate these findings into a practical assessment design framework. The proposed approach enables educators to create assessments that promote integrative thinking, reflect real-world AI-augmented workflows, and naturally resist trivial delegation to generative AI, thereby helping restore academic integrity.
Why we think this paper is great for you:
This paper directly addresses the need for robust assessments, aligning with the user's interest in machine learning resilience and testing. It offers a framework for designing evaluations that can withstand the challenges posed by AI, a key area of concern.
Tsinghua University
AI Summary - IRTest effectively reduces the surrogate-to-real gap with relatively few tests. [2]
Abstract
Testing and evaluating decision-making agents remains challenging due to unknown system architectures, limited access to internal states, and the vastness of high-dimensional scenario spaces. Existing testing approaches often rely on surrogate models of decision-making agents to generate large-scale scenario libraries; however, discrepancies between surrogate models and real decision-making agents significantly limit their generalizability and practical applicability. To address this challenge, this paper proposes intelligent resilience testing (IRTest), a unified online adaptive testing framework designed to rapidly adjust to diverse decision-making agents. IRTest initializes with an offline-trained surrogate prediction model and progressively reduces the surrogate-to-real gap during testing through two complementary adaptation mechanisms: (i) online neural fine-tuning in data-rich regimes, and (ii) lightweight importance-sampling-based weighting correction in data-limited regimes. A Bayesian optimization strategy, equipped with bias-corrected acquisition functions, guides scenario generation to balance exploration and exploitation in complex testing spaces. Extensive experiments across varying levels of task complexity and system heterogeneity demonstrate that IRTest consistently improves failure-discovery efficiency, testing robustness, and cross-system generalizability. These results highlight the potential of IRTest as a practical solution for scalable, adaptive, and resilient testing of decision-making agents.
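To make the data-limited adaptation mechanism concrete, here is a minimal, hypothetical Python sketch of an importance-style weighting correction: a cheap offline surrogate's failure predictions are rescaled using a small number of real-agent rollouts. All function names and the one-dimensional scenario space are invented for illustration; IRTest's actual mechanism also involves neural fine-tuning and Bayesian-optimization-guided scenario generation.

    import numpy as np

    rng = np.random.default_rng(0)

    def surrogate_failure_prob(x):
        """Offline-trained surrogate: a crude logistic score over a 1-D scenario space."""
        return 1.0 / (1.0 + np.exp(-(x - 2.0)))

    def real_rollout(x):
        """Expensive real-agent test; ground truth the surrogate only approximates."""
        return rng.random() < 1.0 / (1.0 + np.exp(-(x - 2.5)))

    # Data-limited regime: only a few dozen real rollouts are affordable.
    probe_x = rng.uniform(0.0, 5.0, size=30)
    real_fail = np.array([real_rollout(x) for x in probe_x], dtype=float)
    surr_p = surrogate_failure_prob(probe_x)

    # Lightweight multiplicative correction: ratio of observed to predicted
    # failure mass on the probes, with the output clipped into [0, 1].
    ratio = (real_fail.mean() + 1e-6) / (surr_p.mean() + 1e-6)

    def corrected_failure_prob(x):
        return np.clip(surrogate_failure_prob(x) * ratio, 0.0, 1.0)

    for x in np.linspace(0.0, 5.0, 6):
        print(f"x={x:.1f}  surrogate={surrogate_failure_prob(x):.2f}  "
              f"corrected={corrected_failure_prob(x):.2f}")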
Why we think this paper is great for you:
The focus on testing decision-making agents resonates with the user's interest in machine learning resilience and fault tolerance. The approach of using surrogate models is a common technique for evaluating complex systems, directly relevant to their interests.
Jheronimus Academy of Data Science
AI Summary - CORE business logic: The core functionality of the system, which is not part of the PORTS or ADAPTERS. [3]
- The OCEANGUARD tool is an extensible Machine Learning System (MLES) that aims to analyze and detect anomalies across multiple types of data from the maritime domain. [2]
- The authors faced two major challenges during development: generality, related to defining PORTS that are specific and dependency-agnostic, and separation of concerns, related to defining ADAPTERS that are distinct and logic-thin. [1]
Abstract
ML-Enabled Systems (MLES) are inherently complex since they require multiple components to achieve their business goal. This experience report showcases the software architecture reusability techniques applied while building Ocean Guard, an MLES for anomaly detection in the maritime domain. In particular, it highlights the challenges and lessons learned to reuse the Ports and Adapters pattern to support building multiple microservices from a single codebase. This experience report hopes to inspire software engineers, machine learning engineers, and data scientists to apply the Hexagonal Architecture pattern to build their MLES.
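For readers unfamiliar with the pattern, below is a minimal sketch, with invented class names and no relation to Ocean Guard's actual codebase, of the Hexagonal (Ports and Adapters) structure the report applies: the CORE depends only on an abstract PORT, and each microservice supplies its own logic-thin ADAPTER.

    from abc import ABC, abstractmethod
    from typing import Iterable

    class DataSourcePort(ABC):
        """PORT: a specific, dependency-agnostic contract seen by the CORE."""
        @abstractmethod
        def readings(self) -> Iterable[float]: ...

    class AnomalyDetector:
        """CORE business logic: depends only on the port, never on a concrete source."""
        def __init__(self, source: DataSourcePort, threshold: float = 1.5):
            self.source, self.threshold = source, threshold

        def anomalies(self) -> list:
            xs = list(self.source.readings())
            mean = sum(xs) / len(xs)
            std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5 or 1.0
            return [x for x in xs if abs(x - mean) / std > self.threshold]

    class InMemoryAdapter(DataSourcePort):
        """ADAPTER: logic-thin; a real microservice might wrap an AIS feed or a database."""
        def __init__(self, xs): self.xs = xs
        def readings(self): return self.xs

    detector = AnomalyDetector(InMemoryAdapter([1.0, 1.1, 0.9, 1.0, 42.0]))
    print(detector.anomalies())  # -> [42.0]

Swapping the in-memory adapter for one backed by a message queue or database leaves the detector untouched, which is the reusability payoff the report describes.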
Why we think this paper is great for you:
This paper's exploration of MLOps and reusability aligns strongly with the user's interest in MLOps, data science development environments, and building robust systems. The microservices architecture is a key component of modern MLOps.
Universidad del Pacífico
AI Summary - The text is a mathematical derivation related to Doubly Robust Estimation (DRE) in statistics. [3]
- It discusses the concept of Neyman orthogonality, empirical scores, and condition numbers in the context of DML estimators. [3]
- Doubly robust estimation is a way to estimate model parameters that combines two different estimation methods, and the text covers how the method works and some of its properties. [3]
Abstract
Standard Double Machine Learning (DML; Chernozhukov et al., 2018) confidence intervals can exhibit substantial finite-sample coverage distortions when the underlying score equations are ill-conditioned, even if nuisance functions are estimated with state-of-the-art methods. Focusing on the partially linear regression (PLR) model, we show that a simple, easily computed condition number for the orthogonal score, denoted $\kappa_{\mathrm{DML}} := 1/|J_\theta|$, largely determines when DML inference is reliable. Our first result derives a nonasymptotic, Berry-Esseen-type bound showing that the coverage error of the usual DML t-statistic is of order $n^{-1/2} + \sqrt{n}\,r_n$, where $r_n$ is the standard DML remainder term summarizing nuisance estimation error. Our second result provides a refined linearization in which both estimation error and confidence interval length scale as $\kappa_{\mathrm{DML}}/\sqrt{n} + \kappa_{\mathrm{DML}}\,r_n$, so that ill-conditioning directly inflates both variance and bias. These expansions yield three conditioning regimes (well-conditioned, moderately ill-conditioned, and severely ill-conditioned) and imply that informative, shrinking confidence sets require $\kappa_{\mathrm{DML}} = o_p(\sqrt{n})$ and $\kappa_{\mathrm{DML}}\,r_n \to 0$. We conduct Monte Carlo experiments across overlap levels, nuisance learners (OLS, Lasso, random forests), and both low- and high-dimensional ($p > n$) designs. Across these designs, $\kappa_{\mathrm{DML}}$ is highly predictive of finite-sample performance: well-conditioned designs with $\kappa_{\mathrm{DML}} < 1$ deliver near-nominal coverage with short intervals, whereas severely ill-conditioned designs can exhibit large bias and coverage around 40% for nominal 95% intervals, despite flexible nuisance fitting. We propose reporting $\kappa_{\mathrm{DML}}$ alongside DML estimates as a routine diagnostic of score conditioning, in direct analogy to condition-number checks and weak-instrument diagnostics in IV settings.
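As an illustration of how cheap the proposed diagnostic is to compute, the hypothetical sketch below (assuming the PLR model $Y = \theta D + g(X) + \varepsilon$ with cross-fitted linear nuisances; not the authors' code) estimates $J_\theta$ from the residualized treatment and reports $\kappa_{\mathrm{DML}} = 1/|J_\theta|$. The weak residual variation in $D$ is chosen deliberately so that $\kappa_{\mathrm{DML}}$ flags ill-conditioning.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold

    rng = np.random.default_rng(1)
    n, theta = 2000, 1.0
    X = rng.normal(size=(n, 5))
    # Treatment D is almost fully explained by X: little residual variation,
    # so J_theta = E[(D - m(X))^2] is small and kappa_DML is large.
    D = X @ np.array([0.5, -0.3, 0.2, 0.0, 0.1]) + 0.2 * rng.normal(size=n)
    Y = theta * D + X @ np.array([1.0, 1.0, -1.0, 0.5, 0.0]) + rng.normal(size=n)

    D_res, Y_res = np.empty(n), np.empty(n)
    for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        D_res[test] = D[test] - LinearRegression().fit(X[train], D[train]).predict(X[test])
        Y_res[test] = Y[test] - LinearRegression().fit(X[train], Y[train]).predict(X[test])

    J_theta = np.mean(D_res ** 2)      # score Jacobian (up to sign) in the PLR model
    kappa_dml = 1.0 / abs(J_theta)
    theta_hat = np.sum(D_res * Y_res) / np.sum(D_res ** 2)
    print(f"theta_hat = {theta_hat:.3f}, kappa_DML = {kappa_dml:.1f}")  # large kappa -> distrust the CI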
Why we think this paper is great for you:
The focus on finite-sample failures and condition numbers directly addresses the user's interest in machine learning resilience and diagnostics. Understanding these issues is crucial for building reliable ML systems.
Indian Institute of Technology
Abstract
We present a neural network-based framework for solving the quantitative group testing (QGT) problem that achieves both high decoding accuracy and structural verifiability. In QGT, the objective is to identify a small subset of defective items among $N$ candidates using only $M \ll N$ pooled tests, each reporting the number of defectives in the tested subset. We train a multi-layer perceptron to map noisy measurement vectors to binary defect indicators, achieving accurate and robust recovery even under sparse, bounded perturbations. Beyond accuracy, we show that the trained network implicitly learns the underlying pooling structure that links items to tests, allowing this structure to be recovered directly from the network's Jacobian. This indicates that the model does not merely memorize training patterns but internalizes the true combinatorial relationships governing QGT. Our findings reveal that standard feedforward architectures can learn verifiable inverse mappings in structured combinatorial recovery problems.
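A minimal sketch of both ideas, with illustrative sizes and a finite-difference Jacobian standing in for whatever the authors use, might look as follows: train an MLP to decode pooled counts, then check whether the decoder's input-output sensitivities line up with the pooling matrix $A$.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    N, M, k = 20, 10, 2                                   # items, tests, defectives
    A = (rng.random((M, N)) < 0.3).astype(float)          # pooling design: tests x items

    def sample(n):
        X = np.zeros((n, N))
        for row in X:
            row[rng.choice(N, size=k, replace=False)] = 1.0
        return X @ A.T, X                                 # measurements y = A x, targets x

    Y_tr, X_tr = sample(4000)
    mlp = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=400,
                       random_state=0).fit(Y_tr, X_tr)

    Y_te, X_te = sample(200)
    print("bitwise accuracy:", ((mlp.predict(Y_te) > 0.5) == X_te).mean())

    # Finite-difference Jacobian d x_hat / d y at one point; compare its pattern to A.
    y0, eps = Y_te[:1], 1e-3
    J = np.stack([(mlp.predict(y0 + e) - mlp.predict(y0))[0] / eps
                  for e in np.eye(M) * eps])              # shape (M, N)
    print("corr(Jacobian, A):", np.corrcoef(J.ravel(), A.ravel())[0, 1])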
Why we think this paper is great for you:
This paper's exploration of verifiable deep quantitative group testing aligns with the user's interest in machine learning testing and fault tolerance. The concept of structural verifiability is a key element in ensuring system robustness.
INSA Lyon
AI Summary - TriHaRd has high resilience against attacks, including those that slow down or speed up the clock, and reduces the attack power compared to Triad. [3]
- TriHaRd is a protocol that provides trusted time to TEE-based systems by verifying the TEE clock's consistency with peers in a Byzantine-resilient manner. [2]
- TEE (Trusted Execution Environment): A secure environment within a system where sensitive data can be processed without compromising security. [1]
Abstract
Accurately measuring time passing is critical for many applications. However, in Trusted Execution Environments (TEEs) such as Intel SGX, the time source is outside the Trusted Computing Base: a malicious host can manipulate the TEE's notion of time, jumping in time or affecting perceived time speed. Previous work (Triad) proposes protocols for TEEs to maintain a trustworthy time source by building a cluster of TEEs that collaborate with each other and with a remote Time Authority to maintain a continuous notion of passing time. However, such approaches still allow an attacker to control the operating system and arbitrarily manipulate their own TEE's perceived clock speed. An attacker can even propagate faster passage of time to honest machines participating in Triad's trusted time protocol, causing them to skip to timestamps arbitrarily far in the future. We propose TriHaRd, a TEE trusted time protocol achieving high resilience against clock speed and offset manipulations, notably through Byzantine-resilient clock updates and consistency checks. We empirically show that TriHaRd mitigates known attacks against Triad.
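As a flavor of what a Byzantine-resilient clock consistency check can look like (a hypothetical sketch, not TriHaRd's actual protocol), a TEE can trim the f most extreme peer reports on each side before comparing its local clock against the remainder:

    from statistics import median

    def clock_is_consistent(local_ts: float, peer_ts: list, f: int, tol: float) -> bool:
        """Tolerate up to f Byzantine peers by trimming the f highest and f lowest reports."""
        if len(peer_ts) <= 2 * f:
            raise ValueError("need more than 2f peer reports")
        trimmed = sorted(peer_ts)[f:len(peer_ts) - f]
        return abs(local_ts - median(trimmed)) <= tol

    honest = [100.0, 100.2, 99.9, 100.1, 100.0]
    attacked = honest + [500.0, 500.0]        # two peers report a fast-forwarded clock
    print(clock_is_consistent(100.05, attacked, f=2, tol=1.0))  # True: attack trimmed away
    print(clock_is_consistent(480.0, attacked, f=2, tol=1.0))   # False: jumped clock rejected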
Why we think this paper is great for you:
Given the user's interest in machine learning resilience, this paper's focus on time measurement in Trusted Execution Environments (TEEs) is highly relevant. TEEs are increasingly important for secure and reliable ML deployments.
University of Lisbon
AI Summary - The system aims to help players analyze and understand their gameplay data. [3]
- They developed a prototype that incorporates various visualization techniques, including network analysis, time-series plots, and scatterplots. [3]
- Visual analytics: The use of interactive visualizations to support analytical reasoning and decision-making. [3]
- Co-creation: A process where users collaborate with designers to create a product or service that meets their needs. [3]
- The system's effectiveness is attributed to its ability to provide actionable insights through interactive visualizations. [3]
- The paper discusses the design of a visual analytics system for Magic: The Gathering, a popular trading card game. [2]
Abstract
This paper presents the initial stages of a design study aimed at developing a dashboard to visualize gameplay data of the Commander format from Magic: The Gathering. We conducted a user-task analysis to identify requirements for a data visualization dashboard tailored to the Commander format. Afterwards, we proposed a design for the dashboard leveraging visualizations to address players' needs and pain points for typical data analysis tasks in the context domain. Then, we followed up with a structured user test to evaluate players' comprehension and preferences of data visualizations. Results show that players prioritize contextually relevant, outcome-driven metrics over peripheral ones, and that canonical charts like heatmaps and line charts support higher comprehension than complex ones such as scatterplots or icicle plots. Our findings also highlight the importance of localized views, user customization, and progressive disclosure, emphasizing that adaptability and contextual relevance are as essential as accuracy in effective dashboard design. Our study contributes practical design guidelines for data visualization in gaming contexts and highlights broader implications for engagement-driven dashboards.
Why we think this paper is great for you:
The paper's focus on visualizing gameplay data from Magic: The Gathering aligns with the user's interest in data science development tools and potentially data-driven insights within a complex system.
Machine Learning Infrastructure
Universidad de Guanajuato
Abstract
This document reports the sequence of practices and methodologies implemented during the Big Data course. It details the workflow beginning with the processing of the Epsilon dataset through group and individual strategies, followed by text analysis and classification with RestMex and movie feature analysis with IMDb. Finally, it describes the technical implementation of a distributed computing cluster with Apache Spark on Linux using Scala.
AI Summary - In the big data era, data completeness can be as important as algorithm sophistication. [3]
- Keywords: big data analytics, distributed computing, scalability, algorithm sophistication, data completeness. The chronological progression demonstrates that mastering big data requires a systematic approach. [3]
- The choice between local and distributed architectures is not merely about computational resources, but about the quality and completeness of the data available to the model. [2]
Texas
Abstract
Distributed machine learning systems require strong privacy guarantees, verifiable compliance, and scalable deployment across heterogeneous and multi-cloud environments. This work introduces a cloud-native privacy-preserving architecture that integrates federated learning, differential privacy, zero-knowledge compliance proofs, and adaptive governance powered by reinforcement learning. The framework supports secure model training and inference without centralizing sensitive data, while enabling cryptographically verifiable policy enforcement across institutions and cloud platforms. A full prototype deployed across hybrid Kubernetes clusters demonstrates reduced membership-inference risk, consistent enforcement of formal privacy budgets, and stable model performance under differential privacy. Experimental evaluation across multi-institution workloads shows that the architecture maintains utility with minimal overhead while providing continuous, risk-aware governance. The proposed framework establishes a practical foundation for deploying trustworthy and compliant distributed machine learning systems at scale.
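A minimal sketch of two of the building blocks named in the abstract, federated averaging plus differential privacy via clipped, noised updates, is shown below; all names are invented, and the cryptographic and governance layers are well beyond this toy.

    import numpy as np

    rng = np.random.default_rng(0)

    def dp_federated_round(client_updates, clip_norm=1.0, noise_mult=0.5):
        """Clip each client's update, average, then add Gaussian noise calibrated
        to the clipping norm (a standard DP-SGD-style recipe)."""
        clipped = [u * min(1.0, clip_norm / (np.linalg.norm(u) + 1e-12))
                   for u in client_updates]
        avg = np.mean(clipped, axis=0)
        sigma = noise_mult * clip_norm / len(client_updates)
        return avg + rng.normal(0.0, sigma, size=avg.shape)

    # Ten institutions each contribute a local model delta; raw data never moves.
    updates = [rng.normal(0.0, 0.3, size=4) for _ in range(10)]
    print(dp_federated_round(updates))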
Machine Learning Operations
KU Leuven
Abstract
Decision making often occurs in the presence of incomplete information, leading to the under- or overestimation of risk. Leveraging the observable information to learn the complete information is called nowcasting. In practice, incomplete information is often a consequence of reporting or observation delays. In this paper, we propose an expectation-maximisation (EM) framework for nowcasting that uses machine learning techniques to model both the occurrence as well as the reporting process of events. We allow for the inclusion of covariate information specific to the occurrence and reporting periods as well as characteristics related to the entity for which events occurred. We demonstrate how the maximisation step and the information flow between EM iterations can be tailored to leverage the predictive power of neural networks and (extreme) gradient boosting machines (XGBoost). With simulation experiments, we show that we can effectively model both the occurrence and reporting of events when dealing with high-dimensional covariate information. In the presence of non-linear effects, we show that our methodology outperforms existing EM-based nowcasting frameworks that use generalised linear models in the maximisation step. Finally, we apply the framework to the reporting of Argentinian Covid-19 cases, where the XGBoost-based approach again is most performant.
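Stripped of covariates and ML learners, the EM idea can be sketched as follows, as hypothetical code for the homogeneous Poisson-Multinomial case: the E-step imputes not-yet-reported counts, and the M-step re-estimates the occurrence rate and delay distribution in closed form.

    import numpy as np

    rng = np.random.default_rng(0)
    T, max_delay = 30, 4
    true_lam, true_p = 50.0, np.array([0.4, 0.3, 0.2, 0.1])

    occurred = rng.poisson(true_lam, size=T)
    reported = np.array([rng.multinomial(n, true_p) for n in occurred])  # counts by delay
    now = T - 1                                                          # reporting cutoff
    observed = np.array([[reported[t, d] if t + d <= now else 0.0
                          for d in range(max_delay)] for t in range(T)])

    lam, p = observed.sum(axis=1).mean(), np.full(max_delay, 1.0 / max_delay)
    for _ in range(50):
        # E-step: expected not-yet-reported counts under the current (lam, p)
        filled = observed.copy()
        for t in range(T):
            for d in range(max_delay):
                if t + d > now:
                    filled[t, d] = lam * p[d]
        # M-step: closed-form MLE updates for the Poisson rate and delay probabilities
        lam = filled.sum(axis=1).mean()
        p = filled.sum(axis=0) / filled.sum()

    print(f"nowcast lambda = {lam:.1f} (true {true_lam}), delay probs = {np.round(p, 2)}")

The paper's contribution is to replace the closed-form M-step with neural networks or XGBoost so that occurrence and reporting can depend on high-dimensional covariates.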
AI Summary - The authors generate simulated data with different specifications for the coefficient vectors, including both linear and non-linear effects. [3]
- The authors' findings have implications for various applications, including epidemiology and insurance. [3]
- The authors do not provide evidence to support the Poisson-Multinomial assumption in real-world data. [3]
- The paper presents a simulation study on event occurrence and reporting patterns using a Poisson-Multinomial model. [2]
- The simulation study relies heavily on the assumption that the data is generated from a Poisson-Multinomial distribution. [1]
University of California
Abstract
We analyze gradient descent with randomly weighted data points in a linear regression model, under a generic weighting distribution. This includes various forms of stochastic gradient descent and importance sampling, but also extends to weighting distributions with arbitrary continuous values, thereby providing a unified framework to analyze the impact of various kinds of noise on the training trajectory. We characterize the implicit regularization induced through the random weighting, connect it with weighted linear regression, and derive non-asymptotic bounds for convergence in first and second moments. Leveraging geometric moment contraction, we also investigate the stationary distribution induced by the added noise. Based on these results, we discuss how specific choices of weighting distribution influence both the underlying optimization problem and statistical properties of the resulting estimator, as well as some examples for which weightings that lead to fast convergence cause bad statistical performance.
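The object under study is easy to simulate; this hypothetical sketch runs gradient descent on a linear regression loss with fresh i.i.d. random weights on the data points at each step, where constant weights recover plain gradient descent and scaled Bernoulli weights mimic subsampled SGD:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 200, 5
    X = rng.normal(size=(n, d))
    beta_true = rng.normal(size=d)
    y = X @ beta_true + 0.1 * rng.normal(size=n)

    def weighted_gd(weight_sampler, lr=0.05, steps=500):
        beta = np.zeros(d)
        for _ in range(steps):
            w = weight_sampler(n)                      # fresh random weights each step
            beta -= lr * X.T @ (w * (X @ beta - y)) / n
        return beta

    for name, sampler in [("constant", lambda m: np.ones(m)),
                          ("exponential", lambda m: rng.exponential(1.0, m)),
                          ("Bernoulli (SGD-like)", lambda m: rng.binomial(1, 0.1, m) / 0.1)]:
        err = np.linalg.norm(weighted_gd(sampler) - beta_true)
        print(f"{name:20s} ||beta_hat - beta*|| = {err:.3f}")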
Machine Learning Deployment
Stanford
Abstract
Post-deployment monitoring of artificial intelligence (AI) systems in health care is essential to ensure their safety, quality, and sustained benefit, and to support governance decisions about which systems to update, modify, or decommission. Motivated by these needs, we developed a framework for monitoring deployed AI systems grounded in the mandate to take specific actions when they fail to behave as intended. This framework, which is now actively used at Stanford Health Care, is organized around three complementary principles: system integrity, performance, and impact. System integrity monitoring focuses on maximizing system uptime, detecting runtime errors, and identifying when changes to the surrounding IT ecosystem have unintended effects. Performance monitoring focuses on maintaining accurate system behavior in the face of changing health care practices (and thus input data) over time. Impact monitoring assesses whether a deployed system continues to have value in the form of benefit to clinicians and patients. Drawing on examples of deployed AI systems at our academic medical center, we provide practical guidance for creating monitoring plans based on these principles that specify which metrics to measure, when those metrics should be reviewed, who is responsible for acting when metrics change, and what concrete follow-up actions should be taken, for both traditional and generative AI. We also discuss challenges to implementing this framework, including the effort and cost of monitoring for health systems with limited resources and the difficulty of incorporating data-driven monitoring practices into complex organizations where conflicting priorities and definitions of success often coexist. This framework offers a practical template and starting point for health systems seeking to ensure that AI deployments remain safe and effective over time.
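One way to operationalize the three principles, shown here as a purely illustrative sketch with invented metric names and thresholds, is to encode each monitoring check with its owner and the follow-up action that fires when it trips:

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class MonitoringCheck:
        principle: str            # "integrity" | "performance" | "impact"
        metric: str
        owner: str
        trigger: Callable[[float], bool]
        action: str

    plan = [
        MonitoringCheck("integrity", "daily_runtime_error_rate", "ml-platform",
                        lambda v: v > 0.01, "page on-call, roll back model"),
        MonitoringCheck("performance", "auroc_rolling_30d", "clinical-ml-team",
                        lambda v: v < 0.80, "retrain and shadow-deploy"),
        MonitoringCheck("impact", "alerts_acted_on_fraction", "clinical-governance",
                        lambda v: v < 0.25, "review with clinicians; consider decommission"),
    ]

    latest = {"daily_runtime_error_rate": 0.002, "auroc_rolling_30d": 0.77,
              "alerts_acted_on_fraction": 0.4}
    for check in plan:
        if check.trigger(latest[check.metric]):
            print(f"[{check.principle}] {check.metric} tripped -> {check.action} ({check.owner})")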
AI Summary - Traditional and generative AI systems require unique monitoring considerations for deployment in clinical systems. [3]
- Performance monitoring: Evaluates the longitudinal accuracy and quality of AI system outputs to detect drift. [3]
- Impact monitoring: Verifies if the AI system produces sustained benefits to patients, health system staff, or health system finances over time. [3]
- The framework is applicable to both traditional and generative AI systems and can be tailored to specific use cases and deployments. [3]
- Post-deployment AI monitoring is crucial for ensuring the safety and effectiveness of AI systems in healthcare. [2]
Model Monitoring
Université Paris Cité
Abstract
In this article, we propose a generic screening method for selecting explanatory variables correlated with the response variable Y. We make no assumptions about the existence of a model that could link Y with a subset of explanatory variables, nor about the distribution of the variables. Our procedure can therefore be described as "model-free" and can be applied in a wide range of situations. In order to obtain precise theoretical guarantees (Sure Screening Property and control of the False Positive Rate), we establish a Berry-Esseen type inequality for the studentized statistic of the slope estimator. We illustrate our selection procedure using two simulated examples and a real data set.
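The procedure is simple to sketch: compute the studentized marginal slope statistic for every candidate variable and keep those that clear a threshold. The code below is hypothetical, with an illustrative Bonferroni-style cut; the paper's Berry-Esseen inequality is what makes such a threshold rigorous.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, p = 300, 50
    X = rng.normal(size=(n, p))
    y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=n)   # only X_0 and X_3 matter

    def studentized_slope(x, y):
        """t-statistic of the slope in the simple regression of y on x."""
        x_c, y_c = x - x.mean(), y - y.mean()
        slope = (x_c @ y_c) / (x_c @ x_c)
        resid = y_c - slope * x_c
        se = np.sqrt((resid @ resid) / (len(x) - 2) / (x_c @ x_c))
        return slope / se

    t_stats = np.array([studentized_slope(X[:, j], y) for j in range(p)])
    cut = stats.norm.ppf(1 - 0.05 / (2 * p))                 # Bonferroni-style threshold
    print("selected variables:", np.where(np.abs(t_stats) > cut)[0])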
AI Summary - The final bound obtained is a function of several parameters related to the distributions of X and Y, including their means, variances, and higher moments. [3]
- Berry-Esseen inequality: A mathematical inequality that provides an upper bound for the distance between the distribution of a sum of independent random variables and the normal distribution. [3]
- The derivation is complex and requires careful manipulation of inequalities and algebraic expressions. [3]
- The problem statement involves deriving an upper bound for the probability of a certain event related to random variables X and Y, with specific conditions on their distributions and parameters. [2]
- The solution involves applying various mathematical techniques to derive bounds for different terms in the expression, ultimately leading to the final bound obtained. [1]
National University of Singapore
Abstract
Relational Database Management Systems (RDBMS) manage complex, interrelated data and support a broad spectrum of analytical tasks. With the growing demand for predictive analytics, the deep integration of machine learning (ML) into RDBMS has become critical. However, a fundamental challenge hinders this evolution: conventional ML models are static and task-specific, whereas RDBMS environments are dynamic and must support diverse analytical queries. Each analytical task entails constructing a bespoke pipeline from scratch, which incurs significant development overhead and hence limits wide adoption of ML in analytics.
We present NeurIDA, an autonomous end-to-end system for in-database analytics that dynamically "tweaks" the best available base model to better serve a given analytical task. In particular, we propose a novel paradigm of dynamic in-database modeling to pre-train a composable base model architecture over the relational data. Upon receiving a task, NeurIDA formulates the task and data profile to dynamically select and configure relevant components from the pool of base models and shared model components for prediction. For a friendly user experience, NeurIDA supports natural language queries; it interprets user intent to construct structured task profiles, and generates analytical reports with dedicated LLM agents. By design, NeurIDA combines ease of use with effective and efficient in-database AI analytics. An extensive experimental study shows that NeurIDA consistently delivers up to a 12% improvement in AUC-ROC and a 25% relative reduction in MAE across ten tasks on five real-world datasets. The source code is available at https://github.com/Zrealshadow/NeurIDA
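As a rough, entirely invented illustration of the dynamic in-database modeling paradigm (not NeurIDA's actual selection logic), one can picture component selection as filtering a pool of pre-trained parts against a task profile:

    from dataclasses import dataclass

    @dataclass
    class Component:
        name: str
        task_types: set          # tasks this component supports
        tables: set              # relations it was pre-trained on (empty = any)

    pool = [
        Component("tuple_encoder_orders", {"classification", "regression"}, {"orders"}),
        Component("graph_encoder_users", {"classification"}, {"users", "orders"}),
        Component("regression_head", {"regression"}, set()),
        Component("classification_head", {"classification"}, set()),
    ]

    def configure(task_type: str, tables: set) -> list:
        """Pick every pooled component compatible with the task and its relations."""
        return [c for c in pool
                if task_type in c.task_types and (not c.tables or c.tables & tables)]

    profile = {"task_type": "classification", "tables": {"users", "orders"}}
    print([c.name for c in configure(profile["task_type"], profile["tables"])])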
AI Summary - NeurIDA outperforms all baselines across various tasks and datasets. [3]
- NeurIDA achieves state-of-the-art results in classification (AUC-ROC) and regression (MAE) metrics. [3]
- The effectiveness of NeurIDA is demonstrated through its ability to improve upon existing models, including those with automated learning approaches. [3]
- AUC-ROC: Area Under the Receiver Operating Characteristic Curve, a metric used for classification tasks to evaluate model performance. [3]
- MAE: Mean Absolute Error, a metric used for regression tasks to evaluate model performance. [3]
- Early Stopping: A technique used in training neural networks where the training process is stopped when the model's performance on the validation set starts to degrade. [3]
- The effectiveness of NeurIDA can be attributed to its ability to incorporate relational structure into tuple representations, leading to improved model performance. [3]
- The paper does not provide an extensive analysis of the computational complexity of NeurIDA. [3]
- The evaluation metrics used are limited to AUC-ROC and MAE, which may not capture other aspects of model performance. [3]