Xiaoduo AI
Abstract
The rapid evolution of specialized large language models (LLMs) has
transitioned from simple domain adaptation to sophisticated native
architectures, marking a paradigm shift in AI development. This survey
systematically examines this progression across healthcare, finance, legal, and
technical domains. Beyond the widespread adoption of specialized LLMs, we trace
the technical breakthroughs now being applied to recent LLM agents: the
emergence of domain-native designs that go beyond fine-tuning, a growing
emphasis on parameter efficiency through sparse computation and quantization,
and the increasing integration of multimodal capabilities. Our analysis reveals
how these innovations address fundamental limitations of general-purpose LLMs
in professional applications, with specialized models consistently
demonstrating performance gains on domain-specific benchmarks. The survey
further highlights the implications for the e-commerce field and identifies the
gaps that remain to be filled there.
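To make the parameter-efficiency techniques mentioned above concrete, the sketch below shows symmetric per-tensor int8 post-training quantization of a single weight matrix. It is a minimal illustration only: the function names (`quantize_int8`, `dequantize_int8`) and the toy matrix are assumptions for demonstration, not code from any model discussed in this survey.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 weight matrix from the int8 codes."""
    return q.astype(np.float32) * scale

# Toy example: a random matrix standing in for one layer's weights.
rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

print("bytes fp32:", w.nbytes, "| bytes int8:", q.nbytes)          # ~4x smaller
print("mean abs reconstruction error:", float(np.abs(w - w_hat).mean()))
```

The 4x memory reduction at a small reconstruction error is the basic trade-off that motivates quantization as one route to parameter efficiency; production schemes (per-channel scales, 4-bit formats, quantization-aware training) refine this idea.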
Abstract
Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is
represented, integrated, and applied in scientific research, yet their progress
is shaped by the complex nature of scientific data. This survey presents a
comprehensive, data-centric synthesis that reframes the development of Sci-LLMs
as a co-evolution between models and their underlying data substrate. We
formulate a unified taxonomy of scientific data and a hierarchical model of
scientific knowledge, emphasizing the multimodal, cross-scale, and
domain-specific challenges that differentiate scientific corpora from general
natural language processing datasets. We systematically review recent Sci-LLMs,
from general-purpose foundations to specialized models across diverse
scientific disciplines, alongside an extensive analysis of over 270
pre-/post-training datasets, showing why Sci-LLMs pose distinct demands --
heterogeneous, multi-scale, uncertainty-laden corpora that require
representations preserving domain invariance and enabling cross-modal
reasoning. On evaluation, we examine over 190 benchmark datasets and trace a
shift from static exams toward process- and discovery-oriented assessments with
advanced evaluation protocols. These data-centric analyses highlight persistent
issues in scientific data development and discuss emerging solutions involving
semi-automated annotation pipelines and expert validation. Finally, we outline
a paradigm shift toward closed-loop systems where autonomous agents based on
Sci-LLMs actively experiment, validate, and contribute to a living, evolving
knowledge base. Collectively, this work provides a roadmap for building
trustworthy, continually evolving artificial intelligence (AI) systems that
function as true partners in accelerating scientific discovery.
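As a rough illustration of the semi-automated annotation pipelines with expert validation mentioned above, the following minimal sketch routes model-labeled records below a confidence threshold to a human reviewer. All names here (`Record`, `annotate`, the toy model and expert stand-ins, the 0.9 threshold) are hypothetical choices for the example, not an interface defined by the surveyed systems.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Record:
    text: str
    label: Optional[str] = None
    confidence: float = 0.0
    expert_checked: bool = False

def annotate(records: List[Record],
             model_label: Callable[[str], Tuple[str, float]],
             expert_review: Callable[[Record], str],
             threshold: float = 0.9) -> List[Record]:
    """Model proposes labels; low-confidence items go to an expert for validation."""
    for r in records:
        r.label, r.confidence = model_label(r.text)
        if r.confidence < threshold:
            r.label = expert_review(r)   # human-in-the-loop correction
            r.expert_checked = True
    return records

# Toy stand-ins for a labeling model and an expert reviewer.
def toy_model(text: str) -> Tuple[str, float]:
    return ("chemistry", 0.95) if "molar" in text else ("physics", 0.6)

def toy_expert(record: Record) -> str:
    return "biology"  # a real expert would inspect the record here

data = [Record("molar mass of water"), Record("protein folding dynamics")]
for r in annotate(data, toy_model, toy_expert):
    print(r)
```

The design choice illustrated is the confidence-gated split: automated labeling handles the bulk of the corpus, while expert effort is concentrated on the uncertain tail, which is where most quality issues in scientific datasets tend to arise.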