Hi!

Your personalized paper recommendations for 01 to 05 December, 2025.
Information Retrieval
Alibaba Group
Abstract
Achievement. We introduce LORE, a systematic framework for Large Generative Model-based relevance in e-commerce search. Deployed and iterated over three years, LORE achieves a cumulative +27% improvement in online GoodRate metrics. This report shares the valuable experience gained throughout its development lifecycle, spanning data, features, training, evaluation, and deployment. Insight. While existing works apply Chain-of-Thought (CoT) to enhance relevance, they often hit a performance ceiling. We argue this stems from treating relevance as a monolithic task, lacking principled deconstruction. Our key insight is that relevance comprises distinct capabilities: knowledge and reasoning, multi-modal matching, and rule adherence. We contend that a qualitative-driven decomposition is essential for breaking through current performance bottlenecks. Contributions. LORE provides a complete blueprint for the LLM relevance lifecycle. Key contributions include: (1) A two-stage training paradigm combining progressive CoT synthesis via SFT with human preference alignment via RL. (2) A comprehensive benchmark, RAIR, designed to evaluate these core capabilities. (3) A query frequency-stratified deployment strategy that efficiently transfers offline LLM capabilities to the online system. LORE serves as both a practical solution and a methodological reference for other vertical domains.
AI Summary
  • The framework selects features to endow the model with necessary perception and reasoning skills spanning both visual and textual modalities. [3]
  • The framework incorporates Stock Keeping Unit (SKU) information to capture fine-grained, purchasable variations such as different colors, sizes, or package contents. [3]
  • The framework employs a scientific sampling strategy that adequately covers diverse e-commerce data distributions, thereby mitigating potential oversampling or undersampling bias. [3]
  • The framework uses a robust data cleaning pipeline to systematically reduce noise and enhance overall dataset quality. [2]
  • The framework uses a two-stage discrimination framework, where the model first analyzes the query's intent and attribute requirements, and then extracts relevant item attributes to render a judgment based on established rules. [1]
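A minimal sketch of the two-stage discrimination flow summarized above: analyze the query's intent and attribute requirements first, then extract item attributes and render a judgment. The prompt wording, function names, and the generic llm callable are illustrative assumptions, not the paper's implementation.

    # Illustrative two-stage relevance discrimination; prompts and names are
    # assumptions, not LORE's actual implementation.
    from typing import Callable

    INTENT_PROMPT = (
        "Analyze the e-commerce query below. List its core intent and its "
        "attribute requirements (brand, model, color, size, quantity).\n"
        "Query: {query}"
    )

    JUDGE_PROMPT = (
        "Query analysis:\n{analysis}\n\n"
        "Item title and SKU information:\n{item}\n\n"
        "Extract the item attributes relevant to the analysis, then answer with "
        "one label, Relevant or Irrelevant, followed by a one-sentence reason."
    )

    def two_stage_relevance(llm: Callable[[str], str], query: str, item: str) -> str:
        analysis = llm(INTENT_PROMPT.format(query=query))               # stage 1: intent analysis
        return llm(JUDGE_PROMPT.format(analysis=analysis, item=item))   # stage 2: rule-based judgment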
S&P Global
Abstract
Accurate question answering over real spreadsheets remains difficult due to multirow headers, merged cells, and unit annotations that disrupt naive chunking, while rigid SQL views fail on files lacking consistent schemas. We present SQuARE, a hybrid retrieval framework with sheet-level, complexity-aware routing. It computes a continuous score based on header depth and merge density, then routes queries either through structure-preserving chunk retrieval or SQL over an automatically constructed relational representation. A lightweight agent supervises retrieval, refinement, or combination of results across both paths when confidence is low. This design maintains header hierarchies, time labels, and units, ensuring that returned values are faithful to the original cells and straightforward to verify. Evaluated on multi-header corporate balance sheets, a heavily merged World Bank workbook, and diverse public datasets, SQuARE consistently surpasses single-strategy baselines and ChatGPT-4o on both retrieval precision and end-to-end answer accuracy while keeping latency predictable. By decoupling retrieval from model choice, the system is compatible with emerging tabular foundation models and offers a practical bridge toward a more robust table understanding.
AI Summary
  • The paper presents a retrieval-augmented generation (RAG) framework for tabular question answering. [2]
  • The proposed RAG framework uses a combination of embedding models and SQL reasoning to improve the accuracy of tabular QA systems. [1]
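As a rough sketch of the complexity-aware routing described in the abstract, the snippet below scores a sheet by header depth and merge density and routes it to one of the two retrieval paths. The use of openpyxl, the feature normalization, the equal weighting, and the threshold are all assumptions for illustration, not SQuARE's actual scoring function.

    # Sketch of sheet-level, complexity-aware routing; weights and threshold are
    # illustrative choices, not the paper's.
    from openpyxl import load_workbook

    def sheet_complexity(path: str, sheet: str, header_rows_guess: int = 3) -> float:
        ws = load_workbook(path)[sheet]
        header_depth = min(header_rows_guess / 5.0, 1.0)         # crude header-depth proxy
        n_cells = max(ws.max_row * ws.max_column, 1)
        merge_density = len(ws.merged_cells.ranges) / n_cells
        return 0.5 * header_depth + 0.5 * min(merge_density * 100.0, 1.0)

    def route(score: float, threshold: float = 0.5) -> str:
        # High complexity -> structure-preserving chunk retrieval;
        # low complexity -> SQL over an automatically built relational view.
        return "chunk_retrieval" if score >= threshold else "sql_over_relational_view"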
Deep Learning
National Technical University
Abstract
The computational demands of modern Deep Neural Networks (DNNs) are immense and constantly growing. While training costs usually capture public attention, inference demands also contribute significant computational, energy, and environmental footprints. Sparsity stands out as a critical mechanism for drastically reducing these resource demands. However, its potential remains largely untapped and is not yet fully incorporated in production AI systems. To bridge this gap, this work provides the necessary knowledge and insights for performance engineers keen to get involved in deep learning inference optimization. In particular, in this work we: a) discuss the various forms of sparsity that can be utilized in DNN inference, b) explain how the original dense computations translate to sparse kernels, c) provide an extensive bibliographic review of the state-of-the-art in the implementation of these kernels for CPUs and GPUs, d) discuss the availability of sparse datasets in support of sparsity-related research and development, e) explore the current software tools and frameworks that provide robust sparsity support, and f) present evaluation results of different implementations of the key SpMM and SDDMM kernels on CPU and GPU platforms. Ultimately, this paper aims to serve as a resource for performance engineers seeking to develop and deploy highly efficient sparse deep learning models in production.
AI Summary
  • The text discusses various aspects of deep learning, including model architecture, training, optimization, and inference. [3]
  • Model Training: The process that makes a DNN learn to perform a specific task, much like a student learns from practice and correction. [3]
  • Batch Training: Instead of feeding individual data points one by one, models are trained on small groups of samples called batches. [3]
  • Training often requires many epochs to fully learn the data’s patterns. [3]
  • The text concludes that deep learning involves various steps from model architecture to inference, and optimization is crucial for efficient deployment of DNNs. [3]
  • The text mentions several deep learning frameworks such as PyTorch, TensorFlow, JAX, and Hugging Face Hub. [3]
  • Much as a person improves at a recognition task through practice, a model must be trained and optimized before it performs well in real-world situations. [3]
  • Epochs: A single pass through the entire dataset is called an epoch. [2]
  • The text does not provide a clear explanation of the differences between various model representations such as ONNX, TorchScript, TensorFlow SavedModel / GraphDef, etc. [1]
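For readers new to the two kernels evaluated in the paper, the snippet below shows what SpMM and SDDMM compute, using scipy.sparse purely as a reference implementation; it is not an optimized CPU/GPU kernel, and the shapes and density are arbitrary.

    # Reference-level sketch of the SpMM and SDDMM kernels discussed above.
    import numpy as np
    import scipy.sparse as sp

    rng = np.random.default_rng(0)
    A = sp.random(512, 256, density=0.05, format="csr", random_state=0)  # sparse matrix
    B = rng.standard_normal((256, 64))                                   # dense matrix

    # SpMM: sparse matrix times dense matrix -> dense output.
    C = A @ B

    # SDDMM: sampled dense-dense matmul -- compute (X @ Y) only at the nonzero
    # positions of a sparsity pattern (here, the pattern of A).
    X = rng.standard_normal((512, 32))
    Y = rng.standard_normal((32, 256))
    S = A.tocoo()
    vals = np.einsum("ij,ij->i", X[S.row], Y[:, S.col].T)   # dot products at nonzeros only
    D = sp.coo_matrix((vals, (S.row, S.col)), shape=S.shape)

    print(C.shape, D.nnz)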
EPFL
Abstract
In this work, we investigate the potential of weights to serve as effective representations, focusing on neural fields. Our key insight is that constraining the optimization space through a pre-trained base model and low-rank adaptation (LoRA) can induce structure in weight space. Across reconstruction, generation, and analysis tasks on 2D and 3D data, we find that multiplicative LoRA weights achieve high representation quality while exhibiting distinctiveness and semantic structure. When used with latent diffusion models, multiplicative LoRA weights enable higher-quality generation than existing weight-space methods.
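A minimal sketch of a multiplicative low-rank adaptation of a frozen linear layer, in the spirit of the multiplicative LoRA weights the abstract describes. The exact parameterization (elementwise modulation, the identity offset, the initialization) is an assumption here, not the paper's definition.

    # Sketch of multiplicative LoRA on a frozen base weight; the precise form is
    # an assumption, not the paper's formulation.
    import torch
    import torch.nn as nn

    class MultiplicativeLoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, rank: int = 4):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)                      # frozen pre-trained base model
            out_f, in_f = base.weight.shape
            self.A = nn.Parameter(torch.zeros(out_f, rank))  # zero init => W == W0 at start
            self.B = nn.Parameter(torch.randn(rank, in_f) * 0.01)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Effective weight W = W0 * (1 + A @ B), applied elementwise.
            w = self.base.weight * (1.0 + self.A @ self.B)
            return nn.functional.linear(x, w, self.base.bias)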
Search
Google
Abstract
Online learning is the cornerstone of applications like recommendation and advertising systems, where models continuously adapt to shifting data distributions. Model training for such systems is remarkably expensive, a cost that multiplies during hyperparameter search. We introduce a two-stage paradigm to reduce this cost: (1) efficiently identifying the most promising configurations, and then (2) training only these selected candidates to their full potential. Our core insight is that focusing on accurate identification in the first stage, rather than achieving peak performance, allows for aggressive cost-saving measures. We develop novel data reduction and prediction strategies that specifically overcome the challenges of sequential, non-stationary data not addressed by conventional hyperparameter optimization. We validate our framework's effectiveness through a dual evaluation: first on the Criteo 1TB dataset, the largest suitable public benchmark, and second on an industrial advertising system operating at a scale two orders of magnitude larger. Our methods reduce the total hyperparameter search cost by up to 10× on the public benchmark and deliver significant, validated efficiency gains in the industrial setting.
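The two-stage paradigm in the abstract reduces, at its core, to screening every configuration cheaply and spending the full training budget only on the survivors. The sketch below captures that control flow; cheap_eval, full_train, and the keep fraction are placeholders, and the paper's actual data reduction and prediction strategies are not reproduced here.

    # Schematic two-stage hyperparameter search: cheap screening, then full
    # training of only the most promising candidates.
    from typing import Callable, Dict, List, Tuple

    def two_stage_search(
        configs: List[Dict],
        cheap_eval: Callable[[Dict], float],   # e.g. train on a reduced data slice
        full_train: Callable[[Dict], float],   # full-cost training run
        keep_fraction: float = 0.1,            # arbitrary choice, not the paper's
    ) -> Tuple[Dict, float]:
        # Stage 1: rank configurations by their cheap proxy score.
        screened = sorted(configs, key=cheap_eval, reverse=True)
        survivors = screened[: max(1, int(len(configs) * keep_fraction))]
        # Stage 2: spend the full training budget only on the survivors.
        results = [(cfg, full_train(cfg)) for cfg in survivors]
        return max(results, key=lambda r: r[1])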
Personalization
The University of Quebec
Abstract
As large language models (LLMs) become increasingly capable of generating persuasive content, understanding their effectiveness across different advertising strategies becomes critical. This paper presents a two-part investigation examining LLM-generated advertising through complementary lenses: (1) personality-based and (2) psychological persuasion principles. In our first study (n=400), we tested whether LLMs could generate personalized advertisements tailored to specific personality traits (openness and neuroticism) and how their performance compared to human experts. Results showed that LLM-generated ads achieved statistical parity with human-written ads (51.1% vs. 48.9%, p > 0.05), with no significant performance differences for matched personalities. Building on these insights, our second study (n=800) shifted focus from individual personalization to universal persuasion, testing LLM performance across four foundational psychological principles: authority, consensus, cognition, and scarcity. AI-generated ads significantly outperformed human-created content, achieving a 59.1% preference rate (vs. 40.9%, p < 0.001), with the strongest performance in authority (63.0%) and consensus (62.5%) appeals. Qualitative analysis revealed AI's advantage stems from crafting more sophisticated, aspirational messages and achieving superior visual-narrative coherence. Critically, this quality advantage proved robust: even after applying a 21.2 percentage point detection penalty when participants correctly identified AI-origin, AI ads still outperformed human ads, and 29.4% of participants chose AI content despite knowing its origin. These findings demonstrate LLMs' evolution from parity in personalization to superiority in persuasive storytelling, with significant implications for advertising practice given LLMs' near-zero marginal cost and time requirements compared to human experts.
AI Summary
  • AI-generated advertisements achieved a dominant preference rate compared to human-created advertisements. [3]
  • The performance of AI-generated content varied depending on the persuasion strategy employed, with strong results in Authority and Consensus conditions. [3]
  • Identifying an advertisement as AI-generated influenced user preference, resulting in a bias against known AI content. [3]
  • The results suggest that AI-generated content can be a viable alternative to traditional advertising methods, particularly in certain persuasion strategies. [3]
  • LLM-generated ads can be competitive with human-written ads in terms of user engagement and purchase intent. [2]
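As a quick reader-side sanity check (not the authors' analysis), the numbers in the abstract are consistent with the reported significance level if we treat the 59.1% preference rate as roughly 473 of 800 binary choices and test against a 50/50 null:

    # Sanity check only; assumes one binary choice per participant, which may
    # not match the study's exact design.
    from scipy.stats import binomtest

    n = 800
    k = round(0.591 * n)                     # ~473 choices favoring AI-generated ads
    res = binomtest(k, n, p=0.5, alternative="two-sided")
    print(res.pvalue)                        # well below 0.001, consistent with the abstract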
University of West Florida
Abstract
AIVisor, an agentic retrieval-augmented LLM for student advising, was used to examine how personalization affects system performance across multiple evaluation dimensions. Using twelve authentic advising questions intentionally designed to stress lexical precision, we compared ten personalized and non-personalized system configurations and analyzed outcomes with a Linear Mixed-Effects Model across lexical (BLEU, ROUGE-L), semantic (METEOR, BERTScore), and grounding (RAGAS) metrics. Results showed a consistent trade-off: personalization reliably improved reasoning quality and grounding, yet introduced a significant negative interaction on semantic similarity, driven not by poorer answers but by the limits of current metrics, which penalize meaningful personalized deviations from generic reference texts. This reveals a structural flaw in prevailing LLM evaluation methods, which are ill-suited for assessing user-specific responses. The fully integrated personalized configuration produced the highest overall gains, suggesting that personalization can enhance system effectiveness when evaluated with appropriate multidimensional metrics. Overall, the study demonstrates that personalization produces metric-dependent shifts rather than uniform improvements and provides a methodological foundation for more transparent and robust personalization in agentic AI.
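The analysis described above can be sketched with statsmodels' mixed-effects API; the synthetic data layout, column names, and the choice of questions as the random-effects grouping are assumptions for illustration, not the study's exact model specification.

    # Sketch of a Linear Mixed-Effects Model over per-metric scores; data layout
    # and grouping are assumptions, not the study's specification.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Synthetic long-format data: one row per (question, configuration, metric) score.
    rng = np.random.default_rng(0)
    rows = []
    for q in range(12):                      # twelve advising questions, as in the study
        for personalized in (0, 1):
            for metric in ("BLEU", "ROUGE_L", "METEOR", "BERTScore", "RAGAS"):
                rows.append({"question": q, "personalized": personalized,
                             "metric": metric, "score": rng.uniform(0, 1)})
    df = pd.DataFrame(rows)

    model = smf.mixedlm(
        "score ~ personalized * metric",     # fixed effects: personalization x metric type
        data=df,
        groups=df["question"],               # random intercept per advising question
    )
    print(model.fit().summary())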
Ranking
Carnegie Mellon
Abstract
The application of large language models (LLMs) in recommendation systems has recently gained traction. Traditional recommendation systems often lack explainability and suffer from issues such as popularity bias. Previous research has also indicated that LLMs, when used as standalone predictors, fail to achieve accuracy comparable to traditional models. To address these challenges, we propose using an LLM as an explainable re-ranker, a hybrid approach that combines traditional recommendation models with LLMs to enhance both accuracy and interpretability. We constructed a dataset to train the re-ranker LLM and evaluated the alignment between the generated dataset and human expectations. Leveraging a two-stage training process, our model significantly improved NDCG, a key ranking metric. Moreover, the re-ranker outperformed a zero-shot baseline in ranking accuracy and interpretability. These results highlight the potential of integrating traditional recommendation models with LLMs to address limitations in existing systems and pave the way for more explainable and fair recommendation frameworks.
AI Summary
  • The study introduces an approach to integrating Large Language Models (LLMs) into recommendation systems, addressing challenges such as explainability and popularity bias. [3]
  • The model combines traditional recommendation models with an LLM-based re-ranker, improving ranking accuracy and interpretability. [3]
  • After two-stage training, the model outperformed zero-shot baselines and demonstrated significant improvements in metrics like NDCG. [3]
  • Recommender System: A system that suggests items or services to users based on their preferences and behavior. [3]
  • NDCG (Normalized Discounted Cumulative Gain): A metric used to evaluate the performance of recommender systems, taking into account the relevance and ranking of recommended items. [3]
  • The approach has the potential to address challenges such as explainability and popularity bias in recommender systems. [3]
  • Human evaluations confirmed the alignment of LLM-generated datasets with human expectations and demonstrated the trained model's ability to provide more compelling explanations. [2]
  • The study demonstrates the effectiveness of integrating LLMs into recommendation systems, improving both accuracy and interpretability. [1]
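For reference, here is a small worked example of NDCG, the ranking metric highlighted in the abstract and the summary above; the relevance labels and orderings are made up purely to show the computation.

    # Worked NDCG example with made-up relevance labels.
    import numpy as np

    def dcg(rels):
        rels = np.asarray(rels, dtype=float)
        return float(np.sum(rels / np.log2(np.arange(2, len(rels) + 2))))

    def ndcg(ranked_rels):
        ideal = dcg(sorted(ranked_rels, reverse=True))
        return dcg(ranked_rels) / ideal if ideal > 0 else 0.0

    baseline_order = [1, 0, 2, 0, 3]    # relevance of items as the base model ranked them
    reranked_order = [3, 2, 1, 0, 0]    # same items after re-ranking
    print(ndcg(baseline_order), ndcg(reranked_order))   # re-ranking raises NDCG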
Mercor
Abstract
We introduce the first version of the AI Consumer Index (ACE), a benchmark for assessing whether frontier AI models can perform high-value consumer tasks. ACE contains a hidden heldout set of 400 test cases, split across four consumer activities: shopping, food, gaming, and DIY. We are also open sourcing 80 cases as a devset with a CC-BY license. For the ACE leaderboard we evaluated 10 frontier models (with websearch turned on) using a novel grading methodology that dynamically checks whether relevant parts of the response are grounded in the retrieved web sources. GPT 5 (Thinking = High) is the top-performing model, scoring 56.1%, followed by o3 Pro (Thinking = On) (55.2%) and GPT 5.1 (Thinking = High) (55.1%). Models differ across domains, and in Shopping the top model scores under 50%. For some requests (such as giving the correct price or providing working links), models are highly prone to hallucination. Overall, ACE shows a substantial gap between the performance of even the best models and consumers' AI needs.
AI Summary
  • The ACE benchmark is a comprehensive evaluation of conversational AI models, covering various domains such as DIY, food, gaming, and shopping. [3]
  • The benchmark consists of multiple workflows for each domain, with specific instructions and criteria for the models to follow. [3]
  • Gemini 2.5 Flash (On), Gemini 3 Pro (High), and o3 (On) also demonstrate strong performance across various domains. [3]
  • The benchmark highlights the strengths and weaknesses of each model, providing valuable insights for developers and researchers to improve their conversational AI systems. [3]
  • ACE-v1-heldout: A subset of the ACE benchmark used for evaluation, consisting of 100 cases per domain. [3]
  • Bootstrapped confidence intervals: A statistical method used to estimate the uncertainty of mean scores by resampling with replacement from the original dataset. [3]
  • Domain: A specific category or area of expertise within the ACE benchmark, such as DIY, food, gaming, or shopping. [3]
  • Model: A conversational AI system being evaluated on the ACE benchmark, including models like Gemini 2.5 Flash (On), GPT-5 (High), and o3 (On). [3]
  • The ACE benchmark provides a comprehensive evaluation of conversational AI models across various domains. [3]
  • The results show that GPT-5 (High) and GPT-5.1 (High) perform exceptionally well in most domains, achieving high mean scores and confidence intervals. [2]
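The bootstrapped confidence intervals mentioned in the summary can be sketched as follows: resample the per-case scores with replacement and take percentiles of the resampled means. The resample count, the percentile method, and the synthetic scores are generic choices, not the benchmark's exact procedure.

    # Generic percentile-bootstrap confidence interval for a mean score.
    import numpy as np

    def bootstrap_ci(scores, n_resamples: int = 10_000, alpha: float = 0.05, seed: int = 0):
        rng = np.random.default_rng(seed)
        scores = np.asarray(scores, dtype=float)
        idx = rng.integers(0, len(scores), size=(n_resamples, len(scores)))
        means = scores[idx].mean(axis=1)                 # mean of each resample
        return np.quantile(means, [alpha / 2, 1 - alpha / 2])

    case_scores = np.random.default_rng(1).uniform(0, 1, size=100)  # stand-in per-case scores
    print(case_scores.mean(), bootstrap_ci(case_scores))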