UT Austin, UCLA, Google
Abstract
Modern IR systems are increasingly tasked with answering complex,
multi-faceted queries that require deep reasoning rather than simple keyword or
semantic matching. While LLM-based IR has shown great promise, the prevailing
retrieve-then-rerank paradigm inherits the limitations of embedding-based
retrieval; parametric generative approaches are difficult to update with new
information; and long-context methods that place the entire corpus in context
are computationally infeasible for large document collections. To address these
challenges, we introduce LATTICE, a hierarchical retrieval framework that
enables an LLM to reason over and navigate large corpora with logarithmic
search complexity by imposing a semantic tree structure on the corpus. Our
approach consists of two stages: (1) an offline phase that organizes the corpus
into a semantic hierarchy via either a bottom-up agglomerative strategy or a
top-down divisive strategy using multi-level summaries, and (2) an online
traversal phase where a search LLM navigates this tree. A central challenge in
such LLM-guided search is that the model's relevance judgments are noisy,
context-dependent, and unaware of the hierarchy, making cross-branch and
cross-level comparisons difficult. To overcome this, we propose a traversal
algorithm that estimates calibrated latent relevance scores from local LLM
outputs and aggregates them into a global path relevance metric. Our
training-free framework achieves state-of-the-art zero-shot performance on the
reasoning-intensive BRIGHT benchmark, demonstrating up to 9% improvement in
Recall@100 and 5% in nDCG@10 over the next best zero-shot baseline.
Furthermore, compared to the fine-tuned SOTA method DIVER-v2, LATTICE attains
comparable results on BRIGHT subsets that use a static corpus for evaluation.
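
A minimal Python sketch of the kind of LLM-guided traversal the abstract describes follows: tree nodes carry multi-level summaries, a (noisy) LLM scores children locally, local scores are calibrated, and a path-level relevance is aggregated so branches and levels can be compared. The Node layout, the llm_relevance stub, the calibrate() map, the beam size, and the mean-over-path aggregation are assumptions made for illustration, not the paper's exact algorithm.

    # Illustrative sketch only; the scoring, calibration, and aggregation here
    # are assumptions, not LATTICE's published algorithm.
    from dataclasses import dataclass, field
    from typing import List, Optional, Tuple

    @dataclass
    class Node:
        summary: str                          # multi-level summary of this subtree
        children: List["Node"] = field(default_factory=list)
        doc_id: Optional[str] = None          # set only on leaf (document) nodes

    def llm_relevance(query: str, summary: str) -> float:
        """Stand-in for a noisy LLM relevance judgment; here a token-overlap proxy."""
        q, s = set(query.lower().split()), set(summary.lower().split())
        return len(q & s) / max(len(q), 1)

    def calibrate(raw: float) -> float:
        """Placeholder for mapping raw LLM outputs onto a calibrated latent scale."""
        return raw

    def search(root: Node, query: str, beam: int = 8, k: int = 100) -> List[str]:
        """Best-first traversal: expand the `beam` highest path-relevance nodes per step."""
        # Frontier entries: (path_relevance, node, sum_of_calibrated_scores, depth).
        frontier: List[Tuple[float, Node, float, int]] = [(1.0, root, 0.0, 0)]
        ranked: List[Tuple[float, str]] = []
        while frontier and len(ranked) < k:
            frontier.sort(key=lambda e: e[0], reverse=True)
            expand, frontier = frontier[:beam], frontier[beam:]
            for path_rel, node, path_sum, depth in expand:
                if node.doc_id is not None:       # leaf: emit a candidate document
                    ranked.append((path_rel, node.doc_id))
                    continue
                for child in node.children:
                    local = calibrate(llm_relevance(query, child.summary))
                    new_sum = path_sum + local
                    # Global path relevance: mean calibrated score over the scored
                    # nodes on the root-to-child path (the root itself is unscored).
                    frontier.append((new_sum / (depth + 1), child, new_sum, depth + 1))
        ranked.sort(key=lambda e: e[0], reverse=True)
        return [doc_id for _, doc_id in ranked[:k]]

With a fixed beam and a balanced tree, each query expands only O(beam * depth) nodes, which is the logarithmic search complexity the abstract refers to.
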
Shanghai Jiao Tong University
Abstract
Accurately modeling query-item relevance drives e-commerce ranking, yet
long-tail, knowledge-heavy, and fast-evolving queries exceed parametric LLM
coverage. External context (reviews, attribute encyclopedias, UGC) can help but
is noisy, and single-pass latency and cost constraints rule out any clean-then-summarize
step. The model must, per query, judge relevance and decide whether to use,
partially use, or ignore the context. DyKnow-RAG is a dynamic noisy-RAG
framework built on Group Relative Policy Optimization (GRPO). It trains two rollout
groups (no external context vs a single retrieved chunk) and applies
posterior-driven inter-group advantage scaling that adaptively reweights their
contributions by the per-query correctness gap. This teaches the model when to trust
retrieval and when to fall back on parametric knowledge, without process labels,
value networks, or extra inference passes, preserving single-pass, single-chunk
deployment under production latency. Training combines: (1) supervised
initialization with a structured rationale that explicitly records the
context-usage decision; (2) an RL pool prioritized by SFT uncertainty to focus training
where context choice is most consequential; and (3) an optional lightweight DPO
warm start to stabilize with-context calibration. Under a unified
retrieval/index setup and a fixed latency budget, DyKnow-RAG outperforms SFT, DPO, and
vanilla GRPO in offline tests, and delivers consistent lifts on GSB, Query
Goodrate, and Item Goodrate in Taobao A/B testing. It is deployed in Taobao's
production relevance system, serving live traffic. To our knowledge, it is
among the first single-pass RAG solutions for e-commerce relevance, turning
noisy external signals into reliable gains without added online complexity.
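
A rough Python sketch of the two-rollout-group GRPO setup with posterior-driven inter-group advantage scaling follows, under stated assumptions rather than the paper's exact formulation: correctness rewards are normalized within each group (no context vs. a single retrieved chunk), and each group's advantages are then reweighted by the per-query correctness gap between the groups. The sigmoid weighting, the temperature tau, and the rescaling factor are illustrative choices.

    # Illustrative sketch only; the gap-based weighting below is an assumption,
    # not DyKnow-RAG's published formulation.
    import numpy as np

    def group_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
        """GRPO-style advantage: reward normalized within its own rollout group."""
        return (rewards - rewards.mean()) / (rewards.std() + eps)

    def scaled_advantages(r_noctx: np.ndarray, r_ctx: np.ndarray, tau: float = 1.0):
        """Reweight the two groups' advantages by the per-query correctness gap.

        r_noctx / r_ctx hold 0-1 correctness rewards for rollouts on the same
        query, without and with the retrieved chunk respectively. When the
        with-context group is more often correct, its advantages are up-weighted
        (and vice versa), so the policy learns when to trust retrieval and when
        to fall back on parametric knowledge.
        """
        gap = float(r_ctx.mean() - r_noctx.mean())     # posterior correctness gap
        w_ctx = 1.0 / (1.0 + np.exp(-gap / tau))       # in (0, 1); > 0.5 favors context
        w_noctx = 1.0 - w_ctx
        a_noctx = 2.0 * w_noctx * group_advantages(r_noctx)
        a_ctx = 2.0 * w_ctx * group_advantages(r_ctx)  # factor 2 keeps overall scale
        return a_noctx, a_ctx

    # Toy usage: the with-context rollouts are mostly correct for this query, so
    # the with-context group's advantages dominate the policy update.
    a_no, a_with = scaled_advantages(np.array([0.0, 1.0, 0.0, 0.0]),
                                     np.array([1.0, 1.0, 1.0, 0.0]))

Because the weighting is computed from the rollouts already needed for GRPO, this kind of scaling adds no process labels, value network, or extra inference passes, consistent with the single-pass deployment constraint described above.
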