Hi!

Your personalized paper recommendations for 19 to 23 January, 2026.

Confident Rankings with Fewer Items: Adaptive LLM Evaluation with Continuous Scores

Trismik and University of Cambridge

Rate paper: 👍 👎 ♥ Save

AI Insights

The paper presents a novel approach to evaluating language models using Item Response Theory (IRT) and continuous IRT extension. (ML: 0.98)👍👎
The proposed approach provides a more comprehensive understanding of model performance, highlighting areas where models excel or struggle. (ML: 0.97)👍👎
The results show that the proposed approach can provide a more nuanced understanding of model performance, highlighting strengths and weaknesses in different areas. (ML: 0.97)👍👎
The method is applied to five benchmark datasets: BioLaySumm2025-PLOS, FLORES-Turkish-English, GovReport-Summarization, Nemotron-PII, and TruthfulQA. (ML: 0.97)👍👎
Continuous IRT extension: An adaptation of IRT that allows for continuous scores rather than binary ones, enabling more fine-grained analysis of model performance. (ML: 0.96)👍👎
The use of AI assistants in generating text and experimental details demonstrates their potential for automating tasks and improving research efficiency. (ML: 0.95)👍👎
Item Response Theory (IRT): A statistical framework used to analyze the relationship between items on a test or questionnaire and the latent traits or abilities they are intended to measure. (ML: 0.94)👍👎
The paper also explores the use of AI assistants to generate text and experimental details, demonstrating their potential for automating tasks. (ML: 0.92)👍👎
BERTScore: A metric used to evaluate the similarity between generated text and reference text, based on BERT embeddings. (ML: 0.91)👍👎
Heteroskedastic normal distribution: A probability distribution where the variance is not constant but depends on the mean. (ML: 0.87)👍👎

Abstract
Computerized Adaptive Testing (CAT) has proven effective for efficient LLM evaluation on multiple-choice benchmarks, but modern LLM evaluation increasingly relies on generation tasks where outputs are scored continuously rather than marked correct/incorrect. We present a principled extension of IRT-based adaptive testing to continuous bounded scores (ROUGE, BLEU, LLM-as-a-Judge) by replacing the Bernoulli response distribution with a heteroskedastic normal distribution. Building on this, we introduce an uncertainty aware ranker with adaptive stopping criteria that achieves reliable model ranking while testing as few items and as cheaply as possible. We validate our method on five benchmarks spanning n-gram-based, embedding-based, and LLM-as-judge metrics. Our method uses 2% of the items while improving ranking correlation by 0.12 τ over random sampling, with 95% accuracy on confident predictions.

Why we are recommending this paper?
Due to your Interest in Ranking

This paper explores continuous scoring, a key area for evaluating LLMs, aligning with your interest in personalization and deep learning. The focus on adaptive evaluation methods is directly relevant to improving ranking systems.

BanditLP: Large-Scale Stochastic Optimization for Personalized Recommendations

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

Adaptive heuristics: Heuristics that adapt to changing conditions or environments. (ML: 0.97)👍👎
The paper assumes that the true dominant reward and cost functions are known, which may not be realistic in practice. (ML: 0.95)👍👎
The authors propose a neural optimization approach using adaptive heuristics for intelligent marketing systems. (ML: 0.93)👍👎
The proposed approach can effectively handle complex marketing systems with multiple stakeholders and constraints. (ML: 0.92)👍👎
The offline experiment setup involves generating synthetic data with 500 users and 100 items, each assigned to one of five disjoint sets of items. (ML: 0.91)👍👎
Neural optimization approach: An approach that uses neural networks to optimize a function subject to constraints. (ML: 0.90)👍👎
Multi-stakeholder contextual bandit problem: A problem where the goal is to maximize the cumulative reward while satisfying multiple constraints. (ML: 0.89)👍👎
The paper presents a multi-stakeholder contextual bandit problem, where the goal is to maximize the cumulative reward while satisfying multiple constraints. (ML: 0.88)👍👎
The offline experiment setup is limited to 500 users and 100 items, which may not generalize well to larger-scale problems. (ML: 0.85)👍👎
The offline experiment results demonstrate the effectiveness of the approach in maximizing cumulative reward while satisfying constraints. (ML: 0.81)👍👎

Abstract
We present BanditLP, a scalable multi-stakeholder contextual bandit framework that unifies neural Thompson Sampling for learning objective-specific outcomes with a large-scale linear program for constrained action selection at serving time. The methodology is application-agnostic, compatible with arbitrary neural architectures, and deployable at web scale, with an LP solver capable of handling billions of variables. Experiments on public benchmarks and synthetic data show consistent gains over strong baselines. We apply this approach in LinkedIn's email marketing system and demonstrate business win, illustrating the value of integrated exploration and constrained optimization in production.

Why we are recommending this paper?
Due to your Interest in Personalization

BanditLP directly addresses personalized recommendations using a scalable framework, which aligns with your interest in ranking and personalization. The use of contextual bandits is a core concept in building effective recommendation systems.

HumanLLM: Towards Personalized Understanding and Simulation of Human Nature

University of Science and Technology of China

Rate paper: 👍 👎 ♥ Save

AI Insights

Machine psychology: A research trend advocating for new data collection methodologies, sophisticated simulation tasks, and human-centered training regimes to align LLMs with human cognition and behavior. (ML: 0.98)👍👎
Role-playing and theory-of-mind tasks are commonly used to assess whether models can understand and replicate human behavior. (ML: 0.98)👍👎
Further research is needed to improve the performance of human-centric LLMs and address the limitations of current models. (ML: 0.98)👍👎
The development of HumanLLM demonstrates a significant step towards bridging the gap between academic and social capabilities in LLMs. (ML: 0.97)👍👎
Human-centric LLMs are needed to capture the intricacies of real human behaviors. (ML: 0.96)👍👎
The need for systematic evaluation of LLMs' human-like abilities has given rise to specialized benchmarks and datasets. (ML: 0.96)👍👎
Current LLMs have limited ability to simulate individual personalities, motivations, and dynamic social contexts. (ML: 0.95)👍👎
Current LLMs are limited in their ability to simulate individual personalities, motivations, and dynamic social contexts. (ML: 0.94)👍👎
Human-centric LLMs are necessary for authentic social intelligence in applications that interact directly with humans. (ML: 0.92)👍👎
Human-centric LLMs: LLMs that prioritize capturing the intricacies of real human behaviors and structured persona-scenario-behavior data for advanced social simulation. (ML: 0.89)👍👎

Abstract
Motivated by the remarkable progress of large language models (LLMs) in objective tasks like mathematics and coding, there is growing interest in their potential to simulate human behavior--a capability with profound implications for transforming social science research and customer-centric business insights. However, LLMs often lack a nuanced understanding of human cognition and behavior, limiting their effectiveness in social simulation and personalized applications. We posit that this limitation stems from a fundamental misalignment: standard LLM pretraining on vast, uncontextualized web data does not capture the continuous, situated context of an individual's decisions, thoughts, and behaviors over time. To bridge this gap, we introduce HumanLLM, a foundation model designed for personalized understanding and simulation of individuals. We first construct the Cognitive Genome Dataset, a large-scale corpus curated from real-world user data on platforms like Reddit, Twitter, Blogger, and Amazon. Through a rigorous, multi-stage pipeline involving data filtering, synthesis, and quality control, we automatically extract over 5.5 million user logs to distill rich profiles, behaviors, and thinking patterns. We then formulate diverse learning tasks and perform supervised fine-tuning to empower the model to predict a wide range of individualized human behaviors, thoughts, and experiences. Comprehensive evaluations demonstrate that HumanLLM achieves superior performance in predicting user actions and inner thoughts, more accurately mimics user writing styles and preferences, and generates more authentic user profiles compared to base models. Furthermore, HumanLLM shows significant gains on out-of-domain social intelligence benchmarks, indicating enhanced generalization.

Why we are recommending this paper?
Due to your Interest in Personalization

This paper investigates simulating human behavior with LLMs, a fascinating area for understanding user preferences and behaviors. The focus on personalization makes it a strong fit for your interests.

BallotRank: A Condorcet Completion Method for Graphs

University of California

Rate paper: 👍 👎 ♥ Save

AI Insights

The method has been tested on real-world elections and compared to other popular Condorcet completion methods. (ML: 0.96)👍👎
The method has been shown to be more preferable than convergence voting (CV) in some cases, as it guarantees the selection of the Condorcet winner at d=1 when one exists. (ML: 0.92)👍👎
BallotRank is a novel social choice function that aggregates individual preferences into a ranking of candidates using the PageRank algorithm. (ML: 0.92)👍👎
Pareto criterion: if all voters prefer a candidate b over a, then the SWF does not rank b above a. (ML: 0.92)👍👎
BallotRank satisfies some social choice criteria, such as anonymity, neutrality, majority, non-dictatorship, Smith, Pareto, Condorcet loser, and later-no-harm, but fails others, like IIA, monotonicity, no-show/participation, and cloning. (ML: 0.92)👍👎
Social choice function: a function that aggregates individual preferences into a ranking or subset of winners. (ML: 0.91)👍👎
BallotRank is a social choice function that aggregates individual preferences into a ranking of candidates. (ML: 0.91)👍👎
It uses the PageRank algorithm to produce a distribution of weights, where each candidate's weight represents their likelihood of being ranked first. (ML: 0.90)👍👎
Smith set: the smallest non-empty subset of candidates such that every member defeats every non-member. (ML: 0.82)👍👎
Social welfare function (SWF): a function that produces a ranking of all candidates with ties allowed. (ML: 0.81)👍👎

Abstract
We introduce BallotRank, a ranked preference aggregation method derived from a modified PageRank algorithm. It is a Condorcet-consistent method without damping, and empirical examination of nearly 2,000 ranked choice elections and over 20,000 internet polls confirms that BallotRank always identifies the Condorcet winner at conventional values of the damping parameter. We also prove that the method satisfies many of the same social choice criteria as other well-known Condorcet completion methods, but it has the advantage of being a natural social welfare function that provides a full ranking of the candidates.

Why we are recommending this paper?
Due to your Interest in Ranking

BallotRank offers a novel ranked preference aggregation method, relevant to your interest in ranking and search. The Condorcet-consistent approach provides a solid foundation for building intelligent ranking systems.

DS@GT at TREC TOT 2025: Bridging Vague Recollection with Fusion Retrieval and Learned Reranking

Georgia Institute of Technology

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

A city in Japan famous for its temples, shrines, and traditional Japanese culture. (ML: 0.87)👍👎
A series of limestone karsts and islands in Vietnam's Gulf of Tonkin, known for their dramatic scenery and diverse wildlife. (ML: 0.83)👍👎
The ancient city of Petra, carved into the sandstone cliffs of Jordan The majestic architecture of Angkor Wat in Cambodia The vibrant street art scene in Rio de Janeiro, Brazil The stunning natural beauty of Ha Long Bay in Vietnam The rich history and culture of Kyoto, Japan A city carved into the sandstone cliffs of Jordan that was once the capital of the Nabataean Kingdom. (ML: 0.82)👍👎
The colorful murals and graffiti that cover buildings and walls throughout Rio de Janeiro, Brazil. (ML: 0.81)👍👎
A massive temple complex in Cambodia built in the 12th century as a symbol of Khmer power and artistry. (ML: 0.81)👍👎

Abstract
We develop a two-stage retrieval system that combines multiple complementary retrieval methods with a learned reranker and LLM-based reranking, to address the TREC Tip-of-the-Tongue (ToT) task. In the first stage, we employ hybrid retrieval that merges LLM-based retrieval, sparse (BM25), and dense (BGE-M3) retrieval methods. We also introduce topic-aware multi-index dense retrieval that partitions the Wikipedia corpus into 24 topical domains. In the second stage, we evaluate both a trained LambdaMART reranker and LLM-based reranking. To support model training, we generate 5000 synthetic ToT queries using LLMs. Our best system achieves recall of 0.66 and NDCG@1000 of 0.41 on the test set by combining hybrid retrieval with Gemini-2.5-flash reranking, demonstrating the effectiveness of fusion retrieval.

Why we are recommending this paper?
Due to your Interest in Information Retrieval

This paper tackles the challenges of information retrieval, specifically vague recollection, using a fusion retrieval system. The integration of LLMs and learned reranking aligns well with your interest in search and personalization.

Generalization and Completeness of Stochastic Local Search Algorithms

Universidad Complutense de Madrid

Rate paper: 👍 👎 ♥ Save

AI Insights

Turing-completeness: A computation model is said to be Turing-complete if it can simulate any given Turing Machine for any given input. (ML: 0.94)👍👎
Turing Machine: A Turing Machine is defined as (Q,Σ,Γ, δ, q 0, B, F), where Q is the set of states; Σ is the input alphabet; Γ is the tape alphabet; δ is the transition function; q 0∈Q is the initial state; B∈Γ is the blank symbol; and F⊆Q is the set of final states (accepting states). (ML: 0.92)👍👎
This means that the final solution of the GA provides the output of that Turing machine for that input. (ML: 0.87)👍👎
It consists in, given a finite set W of pairs(a, b) of finite strings, find out if, for some finite sequence of pairs (where there may be several occurrences of the same pair), the string obtained by consecutively reading the first components of the pairs and the string read on the second ones coincide. (ML: 0.87)👍👎
The Turing-completeness of genetic algorithms (GAs) is proven by constructing a GA capable of simulating any given Turing Machine for any given input. (ML: 0.86)👍👎
The proof shows that GAs are Turing-complete even when all their components are defined in remarkably simple ways, such as the fitness function checking for substring inclusion. (ML: 0.86)👍👎
If PCP were decidable, so would be the halting problem (which is undecidable). (ML: 0.85)👍👎
The construction of the GA involves designing a genetic algorithm that tries to solve the Modified Post Correspondence Problem (MPCP) for the input T by sequentially finding the next pair until MPCP is solved. (ML: 0.85)👍👎
The halting problem for Turing machines is reduced to MPCP, and MPCP is reduced to PCP. (ML: 0.83)👍👎
The proof of the Turing-completeness of GAs shows that they are capable of simulating any given Turing Machine for any given input, making them Turing-complete. (ML: 0.78)👍👎
Post Correspondence Problem (PCP): PCP consists in finding a solution to MPCP. (ML: 0.78)👍👎
The GA will only stop when the tile for the closing pair has been placed correctly, and it uses a blacklist of tiles via Extrato filter the ones already discarded in the current step of the MPCP to guarantee regular convergence. (ML: 0.74)👍👎
Modified Post Correspondence Problem (MPCP): MPCP adds the constraint that the first pair of the sequence is fixed. (ML: 0.71)👍👎
The idea is to mimic the way genetic programming builds structured individuals in order to design an individual that arranges the pairs from T into incremental partial solutions of MPCP until a complete solution is found. (ML: 0.71)👍👎
This result has significant implications for the analysis and prediction of GAs, as it means that their behavior cannot be easily predicted or analyzed using traditional methods. (ML: 0.70)👍👎

Abstract
We generalize Stochastic Local Search (SLS) heuristics into a unique formal model. This model has two key components: a common structure designed to be as large as possible and a parametric structure intended to be as small as possible. Each heuristic is obtained by instantiating the parametric part in a different way. Particular instances for Genetic Algorithms (GA), Ant Colony Optimization (ACO), and Particle Swarm Optimization (PSO) are presented. Then, we use our model to prove the Turing-completeness of SLS algorithms in general. The proof uses our framework to construct a GA able to simulate any Turing machine. This Turing-completeness implies that determining any non-trivial property concerning the relationship between the inputs and the computed outputs is undecidable for GA and, by extension, for the general set of SLS methods (although not necessarily for each particular method). Similar proofs are more informally presented for PSO and ACO.

Why we are recommending this paper?
Due to your Interest in Search

Benchmarking Deep Learning Models for Raman Spectroscopy Across Open-Source Datasets

Purdue University

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

The authors acknowledge that their study is limited by its reliance on a small number of datasets. (ML: 0.99)👍👎
Domain shift: a phenomenon where the distribution of data in the training set differs from that of the testing set. (ML: 0.98)👍👎
The development of DSCF highlights the need for large-scale and diverse training data. (ML: 0.97)👍👎
Recent works have proposed unsupervised domain adaptation frameworks, but their effectiveness beyond the originally reported datasets are yet to be independently evaluated. (ML: 0.95)👍👎
The results of this benchmarking experiment have shown that classifying test samples that are in-distribution to the training dataset is significantly easier than test samples suffering from distribution shift due to changes in instruments and acquisition conditions, and additional contaminants. (ML: 0.94)👍👎
Foundation model: a pre-trained model that can be fine-tuned for specific tasks, often using transfer learning. (ML: 0.92)👍👎
SANet demonstrated the best overall performance across the datasets. (ML: 0.84)👍👎
The study benchmarks only five architectures and relies on minimal spectral pre-processing. (ML: 0.77)👍👎
Existing open-source Raman datasets are often restricted in size, chemical diversity or experimental variability. (ML: 0.67)👍👎
Creating large, curated experimental Raman spectral datasets that span multiple instruments, materials and measurement settings is key to developing a Raman-specific foundation model. (ML: 0.61)👍👎
Raman spectroscopy: a technique used to analyze the vibrational modes of molecules. (ML: 0.52)👍👎

Abstract
Deep learning classifiers for Raman spectroscopy are increasingly reported to outperform classical chemometric approaches. However their evaluations are often conducted in isolation or compared against traditional machine learning methods or trivially adapted vision-based architectures that were not originally proposed for Raman spectroscopy. As a result, direct comparisons between existing deep learning models developed specifically for Raman spectral analysis on shared open-source datasets remain scarce. To the best of our knowledge, this study presents one of the first systematic benchmarks comparing three or more published Raman-specific deep learning classifiers across multiple open-source Raman datasets. We evaluate five representative deep learning architectures under a unified training and hyperparameter tuning protocol across three open-source Raman datasets selected to support standard evaluation, fine-tuning, and explicit distribution-shift testing. We report classification accuracies and macro-averaged F1 scores to provide a fair and reproducible comparison of deep learning models for Raman spectra based classification.

Why we are recommending this paper?
Due to your Interest in Deep Learning

Deep Learning Approaches to Quantum Error Mitigation

Quantinuum Ltd

Rate paper: 👍 👎 ♥ Save

AI Insights

L1 Relative Change (L1RC): A measure of the difference between two probability distributions. (ML: 0.98)👍👎
Signal-to-Noise Ratio (SNR): The ratio of the signal power to the noise power in a system. (ML: 0.93)👍👎
However, on Real Pauli data the advantage clearly shifts toward the ML-based models, which outperform all baselines in both median L1 relative change and fraction of improved circuits. (ML: 0.93)👍👎
Deep learning models can learn corrections directly from data gathered during circuit runs, more easily capturing correlations. (ML: 0.88)👍👎
The best performing models are comparable to the best baseline methods on Simulated data (both Pauli and Random). (ML: 0.87)👍👎
It is defined as the L1 norm of the difference between the two distributions. (ML: 0.87)👍👎
The learned mapping from P noisy and circuit features to P ideal captures a richer structure that goes beyond coarse depolarization or measurement-error mitigation. (ML: 0.81)👍👎
The PERCEIVER model consistently achieves as good or greater median performance than the baseline mitigation techniques for Pauli circuits. (ML: 0.80)👍👎
The deep learning approaches can generalize across noise regimes, device generations, and circuit families without relying on a predefined noise model. (ML: 0.79)👍👎
The deep learning approaches can generalize across noise regimes, device generations, and circuit families without relying on a predefined noise model. (ML: 0.79)👍👎
The baseline methods retain value as lightweight, interpretable mitigation techniques, particularly for structured, low-depth circuits. (ML: 0.61)👍👎

Abstract
We present a systematic investigation of deep learning methods applied to quantum error mitigation of noisy output probability distributions from measured quantum circuits. We compare different architectures, from fully connected neural networks to transformers, and we test different design/training modalities, identifying sequence-to-sequence, attention-based models as the most effective on our datasets. These models consistently produce mitigated distributions that are closer to the ideal outputs when tested on both simulated and real device data obtained from IBM superconducting quantum processing units (QPU) up to five qubits. Across several different circuit depths, our approach outperforms other baseline error mitigation techniques. We perform a series of ablation studies to examine: how different input features (circuit, device properties, noisy output statistics) affect performance; cross-dataset generalization across circuit families; and transfer learning to a different IBM QPU. We observe that generalization performance across similar devices with the same architecture works effectively, without needing to fully retrain models.

Why we are recommending this paper?
Due to your Interest in Deep Learning

A Class of Subadditive Information Measures and their Applications

The Chinese University of Hong Kong

Rate paper: 👍 👎 ♥ Save

AI Insights

This allows for the use of various mathematical tools such as differentiation and concavity to establish the desired inequalities. (ML: 0.93)👍👎
This is done by considering the likelihood ratio function t: Y → [0,∞), defined by ty = qY(y)/rY(y). (ML: 0.89)👍👎
t(y): Y → [0,∞): the likelihood ratio function defined by ty = qY(y)/rY(y). (ML: 0.89)👍👎
The proof of Lemma 7 involves a more general approach, considering arbitrary distributions (q∗Z, r∗Z) on Z and minimizing the subadditivity gap function over all pairs (qY, rY) on any finite alphabet Y. (ML: 0.86)👍👎
This is done by rewriting the inequality in terms of a new function φ(x) = G−1(G(x) + G(c)), which is concave if and only if a certain condition involving the second derivatives of G is satisfied. (ML: 0.86)👍👎
The proof of Lemma 5 and Lemma 6 establishes that the subadditivity gap function can be minimized by considering binary alphabets for Y. (ML: 0.86)👍👎
The proof of Lemma 7 shows that the subadditivity gap function can be minimized over all pairs (qY, rY) on any finite alphabet Y. (ML: 0.85)👍👎
The proof of Lemma 7 involves minimizing the subadditivity gap function over all pairs (qY, rY) on any finite alphabet Y. (ML: 0.85)👍👎
G: [0, ∞) → [0, ∞): a strictly increasing and differentiable function. (ML: 0.84)👍👎
The proof of Lemma 5 and Lemma 6 involves showing that the subadditivity gap function can be minimized by considering binary alphabets for Y. (ML: 0.83)👍👎
The proof of Lemma 5 and Lemma 6 relies heavily on the properties of G, specifically its strict increasingness and differentiability. (ML: 0.78)👍👎
The subadditivity gap function depends on the pair (qY, rY) only through expectations of the form P yrY(y)f(ty). (ML: 0.76)👍👎
J(rY, t y): the subadditivity gap function depending on the pair (qY, rY) only through expectations of the form P yrY(y)f(ty). (ML: 0.74)👍👎
φ(x): = G−1(G(x) + G(c)): a new function introduced to simplify the inequality in Lemma 5 and Lemma 6. (ML: 0.71)👍👎

Abstract
We introduce a two-parameter family of discrepancy measures, termed \emph{$(G,f)$-divergences}, obtained by applying a non-decreasing function $G$ to an $f$-divergence $D_f$. Building on Csiszár's formulation of mutual $f$-information, we define a corresponding $(G,f)$-information measure $ I_{G,f}(X;Y)$. A central theme of the paper is subadditivity over product distributions and product channels. We develop reduction principles showing that, for broad classes of $G$, it suffices to verify divergence subadditivity on binary alphabets. Specializing to the functions $G(x)\in\{x,\log(1+x),-\log(1-x)\}$, we derive tractable sufficient conditions on $f$ that guarantee subadditivity, covering many standard $f$-divergences. Finally, we present applications to finite-blocklength converses for channel coding, bounds in binary hypothesis testing, and an extension of the Shannon--Gallager--Berlekamp sphere-packing exponent framework to subadditive $(G,f)$-divergences.

Why we are recommending this paper?
Due to your Interest in Information Retrieval

💬 Help Shape Our Pricing

We're exploring pricing options to make this project sustainable. Take 3 minutes to share what you'd be willing to pay (if anything). Your input guides our future investment.

Share Your Feedback

Help us improve your experience!

This project is on its early stages your feedback can be pivotal on the future of the project. Let us know what you think about this week's papers and suggestions!

Give Feedback