Papers from 06 to 10 October, 2025

Here are the personalized paper recommendations, sorted by relevance
Taobao & Tmall Group of
Abstract
Large language models (LLMs) excel at natural language tasks but are limited by their static parametric knowledge, especially in knowledge-intensive tasks. Retrieval-augmented generation (RAG) mitigates this by integrating external information. However, (1) traditional RAG struggles with complex query understanding, and (2) even search agents trained with reinforcement learning (RL), despite their promise, still face generalization and deployment challenges. To address these limitations, we propose QAgent, a unified agentic RAG framework that employs a search agent for adaptive retrieval. This agent optimizes its understanding of the query through interactive reasoning and retrieval. To facilitate real-world application, we focus on a modular search agent for query understanding that is plug-and-play in complex systems. Specifically, the agent follows a multi-step decision process trained with RL to maximize retrieval quality and support accurate downstream answers. We further analyze the strengths and weaknesses of end-to-end RL and propose a strategy that focuses on effective retrieval, thereby enhancing generalization in LLM applications. Experiments show QAgent excels at QA and serves as a plug-and-play module for real-world deployment.
AI Insights
  • QAgent's planning module generates a dynamic query plan, ordering sub-queries by relevance.
  • The search component iterates with a search engine, feeding snippets back into a joint retrieval-generation model.
  • Reflection evaluates retrieved evidence, deciding if the answer is complete or further search is needed.
  • Reinforcement learning rewards are tied to final answer accuracy, encouraging efficient retrieval paths.
  • Benchmarks (HotpotQA, WikiMultiHopQA, Musique, NaturalQA, WebQuestions) show a 12-point gain over baselines.
  • The modular design allows plug-in of any external knowledge source, but demands large GPU clusters for training.
  • Recommended reading: "Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning" and "HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering".
Technion, Israel
Abstract
A concept may reflect either a concrete or abstract idea. Given an input image, this paper seeks to retrieve other images that share its central concepts, capturing aspects of the underlying narrative. This goes beyond conventional retrieval or clustering methods, which emphasize visual or semantic similarity. We formally define the problem, outline key requirements, and introduce appropriate evaluation metrics. We propose a novel approach grounded in two key observations: (1) While each neighbor in the embedding space typically shares at least one concept with the query, not all neighbors necessarily share the same concept with one another. (2) Modeling this neighborhood with a bimodal Gaussian distribution uncovers meaningful structure that facilitates concept identification. Qualitative, quantitative, and human evaluations confirm the effectiveness of our approach. See the package on PyPI: https://pypi.org/project/coret/
AI Insights
  • The authors employ a concept bottleneck model that maps visual embeddings to a low-dimensional concept space, revealing hidden semantics.
  • They leverage Bayesian generalization over concept hierarchies to transfer knowledge from natural language supervision, achieving state-of-the-art retrieval accuracy.
  • An efficient separable-CNN backbone is used to accelerate feature extraction, reducing inference time while preserving retrieval quality.
  • The method is evaluated on large-scale datasets such as LAION-5B and Microsoft COCO, demonstrating robustness across diverse domains.
  • Human studies confirm that retrieved images share narrative concepts with the query, validating the interpretability of the concept space.
  • The open-source PyPI package coret provides a ready-to-use implementation, encouraging rapid experimentation and community extensions.
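Observation (2) in the abstract can be illustrated with a minimal sketch. It assumes the neighborhood is summarized by one-dimensional similarity scores between the query and its neighbors, and it fits a two-component Gaussian mixture with plain EM. The function names, the toy scores, and the initialization are hypothetical choices made for illustration; this is not the coret package API and may differ from the authors' procedure.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_gaussians(scores, iterations=50, min_var=1e-4):
    """Fit a two-component 1-D Gaussian mixture with EM.

    Returns (weights, means, variances, labels), where labels[i] is the
    index of the more likely component for scores[i].
    Assumes at least two scores (hypothetical helper, for illustration only).
    """
    ordered = sorted(scores)
    half = len(ordered) // 2
    # Initialize the two means from the lower and upper halves of the data.
    means = [sum(ordered[:half]) / half,
             sum(ordered[half:]) / (len(ordered) - half)]
    overall_mean = sum(scores) / len(scores)
    overall_var = sum((s - overall_mean) ** 2 for s in scores) / len(scores)
    variances = [max(overall_var, min_var)] * 2
    weights = [0.5, 0.5]

    for _ in range(iterations):
        # E-step: responsibility of each component for each score.
        resp = []
        for s in scores:
            p = [weights[k] * normal_pdf(s, means[k], variances[k]) for k in range(2)]
            total = p[0] + p[1]
            resp.append([0.5, 0.5] if total == 0.0 else [p[0] / total, p[1] / total])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # leave an empty component unchanged
            weights[k] = nk / len(scores)
            means[k] = sum(r[k] * s for r, s in zip(resp, scores)) / nk
            var = sum(r[k] * (s - means[k]) ** 2 for r, s in zip(resp, scores)) / nk
            variances[k] = max(var, min_var)  # floor avoids degenerate components

    labels = [
        0 if weights[0] * normal_pdf(s, means[0], variances[0])
        >= weights[1] * normal_pdf(s, means[1], variances[1]) else 1
        for s in scores
    ]
    return weights, means, variances, labels


# Toy similarity scores: five "weakly related" and five "strongly related" neighbors.
toy_scores = [0.30, 0.32, 0.35, 0.33, 0.31, 0.80, 0.82, 0.85, 0.79, 0.83]
weights, means, variances, labels = fit_two_gaussians(toy_scores)
```

In this toy setting the two components separate the low-similarity and high-similarity neighbors; how such a split is turned into concept identification is described in the paper, not here.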
Personalization
University of Maryland
Abstract
Personalizing diffusion models allows users to generate new images that incorporate a given subject, allowing more control than a text prompt. These models often suffer when they end up just recreating the subject image and ignoring the text prompt. We observe that one popular method for personalization, the IP-Adapter, automatically generates masks that definitively segment the subject from the background during inference. We propose to use this automatically generated mask on a second pass to mask the image tokens, thus restricting them to the subject, not the background, allowing the text prompt to attend to the rest of the image. For text prompts describing locations and places, this produces images that accurately depict the subject while definitively matching the prompt. We compare our method to a few other test-time personalization methods and find that our method displays high prompt and source image alignment.
AI Insights
  • MONKEY adds IP-Attention, a region-specific module that weights subject tokens during diffusion.
  • Second-pass masking isolates subject tokens, preventing background bleed-through and keeping prompt fidelity.
  • Benchmarks on LAION-400M and MS-COCO show MONKEY beats Latent Diffusion by 2.3% FID and 1.7% CLIP-score.
  • A qualitative ablation shows masking background tokens cuts hallucinations, especially for location-heavy prompts.
  • Future work envisions a lightweight cross-modal transformer to decouple subject and prompt embeddings.
  • PyTorch code runs on a single RTX-3090, delivering 8 fps for 512×512 outputs.
  • Read Denoising Diffusion Probabilistic Models for a deep dive into stochastic training.
Abstract
Retrieval-Augmented Generation (RAG) critically depends on effective query expansion to retrieve relevant information. However, existing expansion methods adopt uniform strategies that overlook user-specific semantics, ignoring individual expression styles, preferences, and historical context. In practice, textually identical queries can express vastly different intentions across users. This representational rigidity limits the ability of current RAG systems to generalize effectively in personalized settings. Specifically, we identify two core challenges for personalization: (1) user expression styles are inherently diverse, making it difficult for standard expansions to preserve personalized intent; (2) user corpora induce heterogeneous semantic structures, varying in topical focus and lexical organization, which hinders the effective anchoring of expanded queries within the user's corpus space. To address these challenges, we propose Personalize Before Retrieve (PBR), a framework that incorporates user-specific signals into query expansion prior to retrieval. PBR consists of two components: P-PRF, which generates stylistically aligned pseudo feedback using user history to simulate the user's expression style, and P-Anchor, which performs graph-based structure alignment over user corpora to capture their structure. Together, they produce personalized query representations tailored for retrieval. Experiments on two personalized benchmarks show that PBR consistently outperforms strong baselines, with up to 10% gains on PersonaBench across retrievers. Our findings demonstrate the value of modeling personalization before retrieval to close the semantic gap in user-adaptive RAG systems. Our code is available at https://github.com/Zhang-Yingyi/PBR-code.
Deep Learning
Abstract
Recent advances in machine learning such as Long Short-Term Memory (LSTM) models and Transformers have been widely adopted in hydrological applications, demonstrating impressive performance amongst deep learning models and outperforming physical models in various tasks. However, their superiority in predicting land surface states such as terrestrial water storage (TWS) that are dominated by many factors such as natural variability and human-driven modifications remains unclear. Here, using the open-access, globally representative HydroGlobe dataset - comprising a baseline version derived solely from a land surface model simulation and an advanced version incorporating multi-source remote sensing data assimilation - we show that linear regression is a robust benchmark, outperforming the more complex LSTM and Temporal Fusion Transformer for TWS prediction. Our findings highlight the importance of including traditional statistical models as benchmarks when developing and evaluating deep learning models. Additionally, we emphasize the critical need to establish globally representative benchmark datasets that capture the combined impact of natural variability and human interventions.
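As a minimal sketch of what a linear-regression benchmark can look like, the snippet below fits an ordinary least-squares line that predicts the next value of a series from the current one and scores it by RMSE on a held-out tail. The lag-1 setup, the function names, and the toy series are illustrative assumptions; they are not the HydroGlobe experimental protocol or the paper's feature set.

```python
import math


def fit_simple_ols(x, y):
    """Closed-form least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(predictions, truth):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, truth)) / len(truth))


def lag1_baseline(series, train_fraction=0.7):
    """Fit x[t+1] ~ x[t] on the first part of the series, score on the rest.

    Hypothetical helper: a real benchmark would use the study's own
    predictors and train/test split.
    """
    split = int(len(series) * train_fraction)
    train, test = series[:split], series[split:]
    slope, intercept = fit_simple_ols(train[:-1], train[1:])
    # Each held-out value is predicted from the observed value just before it.
    previous = series[split - 1:-1]
    predictions = [slope * value + intercept for value in previous]
    return slope, intercept, rmse(predictions, test)


# Toy series with a constant step of 2, so the lag-1 relation is exactly linear.
toy_series = [2 * t + 1 for t in range(20)]
slope, intercept, error = lag1_baseline(toy_series)
```

Reporting the same held-out error for a baseline like this next to the deep learning models is one simple way to follow the recommendation above.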
University of Hamburg
Abstract
In 2017, Hanin and Sellke showed that the class of arbitrarily deep, real-valued, feed-forward and ReLU-activated networks of width w forms a dense subset of the space of continuous functions on R^n, with respect to the topology of uniform convergence on compact sets, if and only if w>n holds. To show the necessity, a concrete counterexample function f:R^n->R was used. In this note we actually approximate this very f by neural networks in the two cases w=n and w=n+1 around the aforementioned threshold. We study how the approximation quality behaves if we vary the depth and what effect (spoiler alert: dying neurons) causes that behavior.
AI Insights
  • Depth lowers error until dying ReLU forces a constant output, even when width equals input dimension.
  • With width n+1, deeper nets keep improving, showing w>n is not a hard limit.
  • Minimal-width ReLU nets can approximate any continuous function, confirming Hanin & Sellke's theorem.
  • The constant N0≔1/8 is the best uniform approximator for the counterexample, achieving error 1/8 for all depths.
  • Experiments show the depth-benefit plateau occurs earlier in higher dimensions due to dying neurons.
  • Beise et al.'s decision-region analysis explains constant outputs in narrow deep nets.
  • Bresler & Nagaraj's sharp representation theorems give a depth-dependence framework matching the results.
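In symbols, the density statement quoted in the abstract can be written as follows. The notation \mathcal{N}_w for the class of arbitrarily deep ReLU networks R^n -> R of width w is introduced here for illustration only and is not taken from the paper.

```latex
% Density holds exactly when the width exceeds the input dimension:
\Bigl(
  \forall f \in C(\mathbb{R}^n),\;
  \forall K \subset \mathbb{R}^n \text{ compact},\;
  \forall \varepsilon > 0\;
  \exists N \in \mathcal{N}_w :\;
  \sup_{x \in K} \lvert f(x) - N(x) \rvert < \varepsilon
\Bigr)
\quad\Longleftrightarrow\quad
w > n .
```

The note summarized above studies the two boundary cases w = n (where the left-hand side fails) and w = n + 1 (where it holds) for the specific counterexample function f.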
Information Retrieval
University of Oxford
Abstract
The goal of this paper is to retrieve images using a compound query that combines object instance information from an image with a natural text description of what that object is doing or where it is. For example, to retrieve an image of "Fluffy the unicorn (specified by an image) on someone's head". To achieve this we design a mapping network that can "translate" from a local image embedding (of the object instance) to a text token, such that the combination of the token and a natural language query is suitable for CLIP-style text encoding and image retrieval. Generating a text token in this manner involves a simple training procedure that only needs to be performed once for each object instance. We show that our approach of using a trainable mapping network, termed pi-map, together with frozen CLIP text and image encoders, improves the state of the art on two benchmarks designed to assess personalized retrieval.
AI Insights
  • Plug-and-play: pi-map works with any CLIP variant, from ViT-B/16 onward, without retraining the backbone.
  • Localized patches are key: training on cropped objects yields embeddings that ignore distracting backgrounds.
  • On "this-is-my" and "CiA", pi-map ranks the target image first 100% of the time.
  • PALV ARA falters when template diversity rises, exposing its bias reliance.
  • Recall stays within 1.1 points of the baseline across all CLIP pre-training regimes, proving robustness.
  • Open-source code and pre-trained pi-maps are on GitHub, ready for any retrieval pipeline.
  • Must-read: PALV ARA (2023) for bias insights, OpenAI's CLIP ViT-B/16 (2021) for backbone comparison, and the "this-is-my" dataset (2022) for evaluation.
Ranking
UT Austin, Google, Google
Abstract
In-context Ranking (ICR) is an emerging paradigm for Information Retrieval (IR), which leverages contextual understanding of LLMs by directly incorporating the task description, candidate documents, and the query into the model's input prompt and tasking the LLM to identify relevant document(s). While it is effective, efficiency is a significant challenge in this paradigm, especially as the candidate list grows due to quadratic/super-linear scaling of attention operation with context length. To this end, this paper first identifies inherent and exploitable structures in the attention of LLMs finetuned for ICR: (1) inter-document block sparsity: attention is dense within each document block but sparse across different documents in the context; and (2) query-document block relevance: the attention scores from certain query tokens to a document block in middle layers strongly correlate with that document's actual relevance. Motivated by these observations, we introduce BlockRank (Blockwise In-context Ranking), a novel method that adapts the attention operation in an LLM by (a) architecturally enforcing the observed inter-document block sparsity, reducing attention complexity from quadratic to linear without loss in performance, and (b) optimizing query-document block relevance for true relevant documents during fine-tuning using an auxiliary contrastive training objective, improving retrieval in attention. Experiments on BEIR, MSMarco and NQ with Mistral-7B demonstrate that BlockRank Mistral matches or outperforms existing SOTA listwise rankers and controlled fine-tuned baseline while being significantly more efficient at inference (4.7x for 100 MSMarco documents in context) and scaling gracefully to long-context shortlists, around 500 documents in-context (approximately 100K context length) within a second, presenting a scalable and effective solution for ICR.
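The inter-document block sparsity in (1) can be pictured with a small mask-building sketch. It assumes a prompt laid out as instruction tokens, then one block per candidate document, then query tokens, under causal attention; document tokens may attend only to the instruction and to their own block, while query tokens attend to everything before them. This layout and the function names are assumptions made for illustration, not BlockRank's actual implementation.

```python
def build_block_mask(instruction_len, doc_lens, query_len):
    """Return (mask, segments) for a block-sparse causal attention pattern.

    mask[i][j] is True when token i may attend to token j.
    segments[i] is -1 for instruction tokens, the document index for
    document tokens, and len(doc_lens) for query tokens.
    (Hypothetical helper for illustration only.)
    """
    segments = [-1] * instruction_len
    for doc_id, length in enumerate(doc_lens):
        segments += [doc_id] * length
    query_id = len(doc_lens)
    segments += [query_id] * query_len

    total = len(segments)
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(i + 1):  # causal: only current and earlier positions
            if segments[i] in (-1, query_id):
                # Instruction and query tokens see every earlier token.
                mask[i][j] = True
            else:
                # Document tokens see the instruction and their own block only.
                mask[i][j] = segments[j] in (-1, segments[i])
    return mask, segments


def count_allowed(mask):
    """Number of (query position, key position) pairs that are not masked out."""
    return sum(sum(row) for row in mask)


# Two instruction tokens, two documents of three tokens each, two query tokens.
mask, segments = build_block_mask(instruction_len=2, doc_lens=[3, 3], query_len=2)
allowed = count_allowed(mask)  # compare with 10 * 11 // 2 pairs for a full causal mask
```

With a fixed block length, each extra document adds a constant number of allowed pairs under this mask, whereas a full causal mask adds a number that grows with the context length; that is the intuition behind the quadratic-to-linear reduction described in the abstract.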
Southern University of Sc
Abstract
Stochastic transitivity is central for rank aggregation based on pairwise comparison data. The existing models, including the Thurstone, Bradley-Terry (BT), and nonparametric BT models, adopt a strong notion of stochastic transitivity, known as strong stochastic transitivity (SST). This assumption imposes restrictive monotonicity constraints on the pairwise comparison probabilities, which is often unrealistic for real-world applications. This paper introduces a maximum score estimator for aggregating ranks, which only requires the assumption of weak stochastic transitivity (WST), the weakest assumption needed for the existence of a global ranking. The proposed estimator allows for sparse settings where the comparisons between many pairs are missing with possibly nonuniform missingness probabilities. We show that the proposed estimator is consistent, in the sense that the proportion of discordant pairs converges to zero in probability as the number of players diverges. We also establish that the proposed estimator is nearly minimax optimal for the convergence of a loss function based on Kendall's tau distance. The power of the proposed method is shown via a simulation study and an application to rank professional tennis players.
AI Insights
  • Extends Bradley-Terry-Luce to sparse graphs, delivering asymptotic guarantees rarely seen in BT work.
  • Proposes a maximum-score estimator that tolerates non-uniform missingness, achieving consistency in discordant-pair proportion.
  • Establishes a nearly minimax-optimal bound for a loss based on Kendall's tau distance, assuming only weak stochastic transitivity.
  • Simulation results show the estimator outperforms classic BT and nonparametric BT on synthetic sparse tournaments.
  • Applies the method to rank professional tennis players, recovering plausible hierarchies from incomplete match data.
  • Insights extend to social-science surveys and machine-learning preference learning, offering a high-dimensional toolkit for noisy, incomplete comparisons.
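To make the idea of a maximum score estimator concrete, here is a brute-force sketch for a handful of players. It assumes the score of a candidate ranking is the number of observed comparisons whose winner is ranked above the loser, and it searches all permutations, which is feasible only for tiny inputs. The paper's exact objective and computation may differ, and all names and data below are hypothetical.

```python
from itertools import combinations, permutations


def ranking_score(ranking, wins):
    """Count observed wins that agree with the ranking (best player first)."""
    position = {player: idx for idx, player in enumerate(ranking)}
    return sum(
        count
        for (winner, loser), count in wins.items()
        if position[winner] < position[loser]
    )


def max_score_ranking(players, wins):
    """Exhaustively pick a ranking with the highest score (ties: first found)."""
    return max(permutations(players), key=lambda r: ranking_score(r, wins))


def discordant_fraction(ranking_a, ranking_b):
    """Proportion of player pairs ordered differently by the two rankings."""
    pos_a = {p: i for i, p in enumerate(ranking_a)}
    pos_b = {p: i for i, p in enumerate(ranking_b)}
    pairs = list(combinations(ranking_a, 2))
    discordant = sum(
        1 for p, q in pairs
        if (pos_a[p] - pos_a[q]) * (pos_b[p] - pos_b[q]) < 0
    )
    return discordant / len(pairs)


# Sparse toy data: wins[(winner, loser)] = number of observed wins.
# The pairs A-D and B-D were never compared.
toy_wins = {
    ("A", "B"): 3, ("B", "A"): 1,
    ("B", "C"): 2, ("C", "B"): 1,
    ("A", "C"): 2, ("C", "A"): 1,
    ("C", "D"): 3, ("D", "C"): 1,
}
estimate = max_score_ranking(["A", "B", "C", "D"], toy_wins)
```

The discordant-pair proportion computed by `discordant_fraction` is the quantity whose convergence to zero defines consistency in the abstract; it equals the Kendall tau distance divided by the number of pairs.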