Abstract
The fundamental units of internal representations in large language models
(LLMs) remain undefined, limiting further understanding of their mechanisms.
Neurons or features are often regarded as such units, yet neurons suffer from
polysemy, while features raise concerns about unreliable reconstruction and
instability. To address this issue, we propose the Atoms Theory, which defines
such units as atoms. We introduce the atomic inner product (AIP) to correct
representation shifting, formally define atoms, and prove the conditions under
which atoms satisfy the Restricted Isometry Property (RIP), ensuring stable
sparse representations over the atom set and linking the theory to compressed
sensing. Under stronger
conditions, we further establish the uniqueness and exact $\ell_1$
recoverability of the sparse representations, and provide guarantees that
single-layer sparse autoencoders (SAEs) with threshold activations can reliably
identify the atoms. To validate the Atoms Theory, we train threshold-activated
SAEs on Gemma2-2B, Gemma2-9B, and Llama3.1-8B, achieving 99.9% sparse
reconstruction across layers on average; more than 99.8% of atoms satisfy the
uniqueness condition, compared with 0.5% of neurons and 68.2% of features,
showing that atoms capture the intrinsic representations of LLMs more faithfully.
Scaling experiments further reveal the link between SAE size and recovery
capacity. Overall, this work systematically introduces and validates the Atoms
Theory of LLMs, providing a theoretical framework for understanding internal
representations and a foundation for mechanistic interpretability. Code is
available at https://github.com/ChenhuiHu/towards_atoms.
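
For concreteness, the compressed-sensing notions invoked above are standard: a dictionary $A$ satisfies the RIP of order $k$ with constant $\delta_k$ if $(1-\delta_k)\|z\|_2^2 \le \|Az\|_2^2 \le (1+\delta_k)\|z\|_2^2$ for every $k$-sparse $z$, and for sufficiently small restricted isometry constants the $k$-sparse code is the unique solution of $\min_z \|z\|_1$ subject to $Az = x$. The sketch below illustrates a single-layer sparse autoencoder with a hard-threshold activation of the kind the abstract refers to; the dimensions, threshold value, and usage are illustrative assumptions, not the paper's exact training recipe.

```python
# Minimal sketch (assumption-laden) of a single-layer SAE with a threshold
# activation. Sizes, the threshold value, and the usage code are illustrative,
# not the paper's exact configuration.
import torch
import torch.nn as nn


class ThresholdSAE(nn.Module):
    def __init__(self, d_model: int, n_atoms: int, threshold: float = 0.1):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_atoms)
        self.decoder = nn.Linear(n_atoms, d_model)
        self.threshold = threshold

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre = self.encoder(x)
        # Hard threshold: keep an activation only if it exceeds the threshold,
        # otherwise zero it out, producing a sparse code over the atom set.
        return torch.where(pre > self.threshold, pre, torch.zeros_like(pre))

    def forward(self, x: torch.Tensor):
        z = self.encode(x)
        return self.decoder(z), z


# Usage: reconstruct a batch of stand-in residual-stream activations and
# report reconstruction error and the fraction of active atoms.
d_model, n_atoms = 2304, 16384          # hypothetical sizes
sae = ThresholdSAE(d_model, n_atoms)
x = torch.randn(8, d_model)             # placeholder for LLM activations
x_hat, z = sae(x)
print(f"MSE: {(x - x_hat).pow(2).mean().item():.4f}, "
      f"active fraction: {(z != 0).float().mean().item():.4f}")
```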
Abstract
Recent work by Chatzi et al. and Ravfogel et al. has developed, for the first
time, a method for generating counterfactuals of probabilistic Large Language
Models. Such counterfactuals tell us what would - or might - have been the
output of an LLM if some factual prompt ${\bf x}$ had been ${\bf x}^*$ instead.
The ability to generate such counterfactuals is an important and necessary step
towards explaining, evaluating, and comparing the behavior of LLMs. I argue,
however, that the existing method rests on an ambiguous interpretation of LLMs:
it neither interprets LLMs literally, since the method assumes that one can
change the implementation of an LLM's sampling process without changing the LLM
itself, nor interprets them as intended, since it explicitly represents a
nondeterministic LLM as a deterministic causal model. Here I present a much
simpler method for generating
counterfactuals that is based on an LLM's intended interpretation by
representing it as a nondeterministic causal model instead. The advantage of my
simpler method is that it is directly applicable to any black-box LLM without
modification, as it is agnostic to implementation details. The advantage of
the existing method, on the other hand, is that it directly implements the
generation of a specific type of counterfactual that is useful for certain
purposes but not for others. I clarify how both methods relate by offering a
theoretical foundation for reasoning about counterfactuals in LLMs based on
their intended semantics, thereby laying the groundwork for novel
application-specific methods for generating counterfactuals.
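
To make the contrast concrete, the sketch below treats the LLM as a black box exposing only a mapping from a prompt to an output distribution. The interface `get_distribution` and the Gumbel-max coupling used for the deterministic-causal-model variant are assumptions introduced here for illustration; they stand in for, and do not reproduce, the constructions in the papers discussed above.

```python
# Hedged sketch of the two readings of LLM counterfactuals discussed above.
# The black-box interface (prompt -> list of (output, probability) pairs) and
# the Gumbel-max coupling are illustrative assumptions, not the papers' code.
import math
import random
from typing import Callable, Dict, List, Tuple

Distribution = List[Tuple[str, float]]  # candidate output, probability (> 0)


def might_counterfactuals(
    get_distribution: Callable[[str], Distribution],
    counterfactual_prompt: str,
    n_samples: int = 5,
) -> List[str]:
    """Nondeterministic reading: what *might* the LLM have output had the
    prompt been different? Simply re-sample the black box under the
    counterfactual prompt; no access to its sampling implementation is needed."""
    outputs, probs = zip(*get_distribution(counterfactual_prompt))
    return random.choices(outputs, weights=probs, k=n_samples)


def coupled_counterfactual(
    get_distribution: Callable[[str], Distribution],
    factual_prompt: str,
    counterfactual_prompt: str,
    rng: random.Random,
) -> str:
    """Deterministic-causal-model reading: fix the exogenous sampling noise
    (here, one shared Gumbel variable per candidate output) so the
    counterfactual world yields a single determinate output."""
    noise: Dict[str, float] = {}

    def gumbel(output: str) -> float:
        if output not in noise:
            noise[output] = -math.log(-math.log(rng.random()))
        return noise[output]

    # Gumbel-max sampling in the factual world fixes the noise terms...
    factual = get_distribution(factual_prompt)
    _factual_output = max(factual, key=lambda op: math.log(op[1]) + gumbel(op[0]))[0]
    # ...and reusing the same noise in the counterfactual world couples the two.
    counterfactual = get_distribution(counterfactual_prompt)
    return max(counterfactual, key=lambda op: math.log(op[1]) + gumbel(op[0]))[0]
```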