Abstract
Self-recognition is a crucial metacognitive capability for AI systems,
relevant not only for psychological analysis but also for safety, particularly
in evaluative scenarios. Motivated by contradictory interpretations of whether
models possess self-recognition (Panickssery et al., 2024; Davidson et al.,
2024), we introduce a systematic evaluation framework that can be easily
applied and updated. Specifically, we measure how well 10 contemporary larger
language models (LLMs) can identify their own generated text versus text from
other models through two tasks: binary self-recognition and exact model
prediction. Different from prior claims, our results reveal a consistent
failure in self-recognition. Only 4 out of 10 models predict themselves as
generators, and the performance is rarely above random chance. Additionally,
models exhibit a strong bias toward predicting GPT and Claude families. We also
provide the first evaluation of model awareness of their own and others'
existence, as well as the reasoning behind their choices in self-recognition.
We find that the model demonstrates some knowledge of its own existence and
other models, but their reasoning reveals a hierarchical bias. They appear to
assume that GPT, Claude, and occasionally Gemini are the top-tier models, often
associating high-quality text with them. We conclude by discussing the
implications of our findings on AI safety and future directions to develop
appropriate AI self-awareness.
Stanford Institute for HU
Abstract
Designing wise AI policy is a grand challenge for society. To design such
policy, policymakers should place a premium on rigorous evidence and scientific
consensus. While several mechanisms exist for evidence generation, and nascent
mechanisms tackle evidence synthesis, we identify a complete void on consensus
formation. In this position paper, we argue NeurIPS should actively catalyze
scientific consensus on AI policy. Beyond identifying the current deficit in
consensus formation mechanisms, we argue that NeurIPS is the best option due
its strengths and the paucity of compelling alternatives. To make progress, we
recommend initial pilots for NeurIPS by distilling lessons from the IPCC's
leadership to build scientific consensus on climate policy. We dispel
predictable counters that AI researchers disagree too much to achieve consensus
and that policy engagement is not the business of NeurIPS. NeurIPS leads AI on
many fronts, and it should champion scientific consensus to create higher
quality AI policy.
AI Insights - Authors propose a unified crossβdomain framework for safety, fairness, and robustness metrics.
- They stress transparent model documentation to trace societal impact from data to deployment.
- A pilot NeurIPS consensus workshop, modeled after the IPCCβs review cycle, is recommended.
- Cybench is highlighted as a tool for quantifying cybersecurity risks in large language models.
- The paper urges studies on openβfoundationβmodel societal effects, citing recent impact research.
- Authors flag the lack of a concrete implementation roadmap as a key weakness.
- They caution that new metrics alone may not solve governance issues, calling for deeper normative work.