Abstract
Recent advances in large language models (LLMs) have significantly improved
the performance of dialog systems, yet current approaches often fail to provide
accurate topic guidance because they cannot discern user confusion about
related concepts. To address this, we introduce the Ask-Good-Question (AGQ)
framework, which features an improved Concept-Enhanced Item Response Theory
(CEIRT) model to better identify users' knowledge levels. Our contributions
include applying the CEIRT model along with LLMs to directly generate guiding
questions based on the inspiring text, greatly improving information retrieval
efficiency during the question-and-answer process. Compared with other
baseline methods, our approach performs better, significantly enhancing users'
information retrieval experience.
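
The abstract does not specify how the CEIRT model is formulated. As a rough,
hypothetical illustration of the kind of knowledge-level estimation an
IRT-style component performs, the sketch below implements the standard
two-parameter logistic (2PL) Item Response Theory model with a simple
grid-search maximum-likelihood ability estimate; the function names, parameter
values, and example history are assumptions for illustration, not details from
the AGQ paper.

# Hypothetical sketch: standard 2PL Item Response Theory, the model family
# that a Concept-Enhanced IRT (CEIRT) component would plausibly extend.
# Names and parameters are illustrative, not taken from the AGQ paper.
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL probability that a user with ability `theta` answers an item
    with discrimination `a` and difficulty `b` correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(responses, grid=None):
    """Grid-search maximum-likelihood estimate of user ability from
    (a, b, correct) triples, one per answered question."""
    if grid is None:
        grid = [i / 10.0 for i in range(-40, 41)]  # theta in [-4, 4]
    def log_lik(theta):
        ll = 0.0
        for a, b, correct in responses:
            p = p_correct(theta, a, b)
            ll += math.log(p if correct else 1.0 - p)
        return ll
    return max(grid, key=log_lik)

# Example: two easy questions answered correctly, one hard question missed.
history = [(1.2, -0.5, True), (0.9, 0.0, True), (1.5, 1.8, False)]
print(f"estimated knowledge level: {estimate_ability(history):.1f}")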
Abstract
In recent years, as the depth and breadth of large language models'
capabilities have rapidly expanded, corresponding evaluation benchmarks have
emerged in increasing numbers. As quantitative tools for assessing model
performance, benchmarks are not only a core means of measuring model
capabilities but also a key element in guiding the direction of model
development and promoting technological innovation. We present the first
systematic review of the current status and development of large language
model benchmarks, categorizing 283 representative benchmarks into three
categories: general capability, domain-specific, and target-specific. General
capability
benchmarks cover aspects such as core linguistics, knowledge, and reasoning;
domain-specific benchmarks focus on fields like natural sciences, humanities
and social sciences, and engineering technology; and target-specific
benchmarks address risks, reliability, agents, and related targets. We point
out that current benchmarks suffer from problems such as inflated scores
caused by data contamination, unfair evaluation due to cultural and linguistic
biases, and a lack of evaluation of process credibility and dynamic
environments, and we provide a reference design paradigm for future benchmark
innovation.