University of Illinois
Abstract
Supervised Fine-Tuning (SFT) on domain-specific datasets is a common approach
to adapt Large Language Models (LLMs) to specialized tasks but is often
believed to degrade their general capabilities. In this work, we revisit this
trade-off and present both empirical and theoretical insights. First, we show
that SFT does not always hurt: using a smaller learning rate can substantially
mitigate general performance degradation while preserving comparable
target-domain performance. We then provide a theoretical analysis that explains
these phenomena and further motivates a new method, Token-Adaptive Loss
Reweighting (TALR). Building on this, and recognizing that smaller learning
rates alone do not fully eliminate general-performance degradation in all
cases, we evaluate a range of strategies for reducing general capability loss,
including L2 regularization, LoRA, model averaging, FLOW, and our proposed
TALR. Experimental results demonstrate that while no method completely
eliminates the trade-off, TALR consistently outperforms these baselines in
balancing domain-specific gains and general capabilities. Finally, we distill
our findings into practical guidelines for adapting LLMs to new domains: (i)
use a small learning rate to achieve a favorable trade-off, and (ii) when a
stronger balance is desired, adopt TALR as an effective strategy.
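The abstract does not spell out TALR's exact reweighting rule, so the sketch below only illustrates the general idea: compute per-token cross-entropy losses and turn them into adaptive weights before aggregating, so that already-learned tokens drive smaller updates. The function name `talr_style_loss`, the softmax-over-losses weighting, and the `temperature` parameter are assumptions for illustration, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def talr_style_loss(logits, labels, temperature=1.0, ignore_index=-100):
    """Illustrative token-adaptive loss reweighting (not the paper's exact rule).

    Per-token cross-entropy losses are converted into weights so that tokens
    the model already predicts confidently (low loss) contribute less, which
    is one plausible way to soften aggressive updates during SFT.
    """
    # Standard causal-LM shift: predict token t from positions before t.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()

    per_token_loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        reduction="none",
        ignore_index=ignore_index,
    )

    mask = (shift_labels.view(-1) != ignore_index).float()

    # Hypothetical weighting: softmax over detached per-token losses, so harder
    # tokens receive relatively more weight than already-learned ones.
    weights = torch.softmax(per_token_loss.detach() / temperature, dim=0) * mask
    weights = weights / (weights.sum() + 1e-8)

    return (weights * per_token_loss).sum()
```

In a standard SFT loop this would simply replace the usual mean cross-entropy over target tokens; everything else in training stays unchanged.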
AI Insights
- CoT‑adapter lets pre‑trained LLMs generate step‑by‑step reasoning without heavy fine‑tuning.
- Chain‑of‑Thought (CoT) boosts few‑shot performance by explicitly modeling intermediate reasoning.
- Prompt engineering remains critical: the right cue can unlock CoT behavior in any model.
- Computational overhead rises with CoT, yet the gains in interpretability often outweigh the cost.
- The paper’s TALR method shares a similar spirit—reweighting tokens to preserve generality.
- Read “Deep Learning” and “Natural Language Processing (almost) from Scratch”, plus key papers like “Chain‑of‑Thought Prompt Engineering for Conversational AI” and “Reasoning Augmentation of Pre‑trained Models with Chain‑of‑Thought Adapters”.
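To make the prompt-engineering point above concrete, here is a minimal, self-contained sketch of a chain-of-thought cue; the wording of the cue and the example question are illustrative and not taken from the cited papers.

```python
def build_cot_prompt(question: str) -> str:
    """Append an explicit reasoning cue so the model emits intermediate steps."""
    return (
        f"Question: {question}\n"
        "Let's think step by step, then state the final answer on its own line."
    )

if __name__ == "__main__":
    prompt = build_cot_prompt(
        "A train travels 60 km in 45 minutes. What is its average speed in km/h?"
    )
    print(prompt)  # Send this string to whichever LLM inference stack you use.
```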
Lehigh University
Abstract
The exponential growth of scientific literature poses unprecedented
challenges for researchers attempting to synthesize knowledge across rapidly
evolving fields. We present Agentic AutoSurvey, a multi-agent
framework for automated survey generation that addresses fundamental
limitations in existing approaches. Our system employs four specialized agents
(Paper Search Specialist, Topic Mining & Clustering, Academic Survey Writer,
and Quality Evaluator) working in concert to generate comprehensive literature
surveys with superior synthesis quality. Through experiments on six
representative LLM research topics from COLM 2024 categories, we demonstrate
that our multi-agent approach achieves significant improvements over existing
baselines, scoring 8.18/10 compared to AutoSurvey's 4.77/10. The multi-agent
architecture processes 75–443 papers per topic (847 total across six topics)
while targeting high citation coverage (often ≥80% on 75–100-paper sets;
lower on very large sets such as RLHF) through specialized agent orchestration.
Our 12-dimension evaluation captures organization, synthesis integration, and
critical analysis beyond basic metrics. These findings demonstrate that
multi-agent architectures represent a meaningful advancement for automated
literature survey generation in rapidly evolving scientific domains.
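As a rough illustration of how the four agents described above could be orchestrated, the skeleton below passes a shared survey state sequentially through search, topic mining and clustering, writing, and evaluation stages. The data structure, function names, and stub bodies are assumptions; the abstract does not specify the system's actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class SurveyState:
    """Shared state handed from one agent to the next (illustrative only)."""
    topic: str
    papers: List[dict] = field(default_factory=list)
    clusters: Dict[str, List[dict]] = field(default_factory=dict)
    draft: str = ""
    scores: Dict[str, float] = field(default_factory=dict)

def paper_search_specialist(state: SurveyState) -> SurveyState:
    # Placeholder: query scholarly sources and fill state.papers.
    return state

def topic_mining_and_clustering(state: SurveyState) -> SurveyState:
    # Placeholder: group state.papers into thematic clusters.
    return state

def academic_survey_writer(state: SurveyState) -> SurveyState:
    # Placeholder: synthesize a structured survey draft from the clusters.
    return state

def quality_evaluator(state: SurveyState) -> SurveyState:
    # Placeholder: score the draft along multiple evaluation dimensions.
    return state

PIPELINE: List[Callable[[SurveyState], SurveyState]] = [
    paper_search_specialist,
    topic_mining_and_clustering,
    academic_survey_writer,
    quality_evaluator,
]

def run_survey_pipeline(topic: str) -> SurveyState:
    state = SurveyState(topic=topic)
    for agent in PIPELINE:
        state = agent(state)
    return state
```

A sequential hand-off is only one possible design; the quality-evaluation stage could equally feed back into the writer for iterative revision.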