From Tokens to Words:
On the Inner Lexicon of LLMs

The Hebrew University of Jerusalem
ICLR 2025

Inner Lexicon Hypothesis

LLMs de-tokenize sub-word tokens into words by:

  • Propagating information from preceding tokens via the attention mechanism
  • Retrieving the word's identity at the final token through FFN updates
  • Culminating in the replacement of the last sub-word token with a unified word representation

Abstract

Natural language is composed of words, but modern large language models (LLMs) process sub-words as input. A natural question raised by this discrepancy is whether LLMs encode words internally, and if so how. We present evidence that LLMs engage in an intrinsic detokenization process, where sub-word sequences are combined into coherent whole-word representations at their last token. Our experiments show that this process primarily takes place within the early and middle layers of the model. We further demonstrate its robustness to arbitrary splits (e.g., "cats" to "ca" and "ts"), typos, and, importantly, to out-of-vocabulary words: when feeding the last-token internal representations of such words to the model as input, it can "understand" them as the complete word despite never having seen such representations as input during training. Our findings suggest that LLMs maintain a latent vocabulary beyond the tokenizer's scope. These insights provide a practical, finetuning-free application for expanding the vocabulary of pre-trained models. By enabling the addition of new vocabulary words, we reduce input length and inference iterations, which reduces both space and model latency, with little to no loss in model accuracy.

Motivation

Words are the fundamental building blocks of language—they carry meaning and, when combined, convey complex ideas. Humans effortlessly recognize words—even in the presence of typos or alternative spellings—by drawing on a rich mental lexicon that unifies these variant forms.

Large language models, however, process text as sub-word tokens: while common words may receive a single token, less frequent words, multi-word expressions, and typos are often split into multiple tokens that do not align with natural morphology.

Diagram illustrating how words are split into sub-word tokens
Fig. 1: Tokenization often splits words (e.g., 'interpretable' or 'merci') into several smaller units - tokens.

Interestingly, although the model may encounter the same word in different tokenized forms—such as a single token (cats) or split into several tokens (c a t s)—it ultimately consolidates these variations into a unified internal representation. This detokenization process mirrors our own mental lexicon, where diverse presentations of a word converge to convey the same meaning.

Illustration showing LLM mapping multiple token sequences to a single 'word' meaning
Fig. 2: The model unifies various sub-word sequences (e.g., 'cats', 'c a t s') into the same underlying representation, effectively “detokenizing” them into one word.

This work investigates how LLMs still manage to recognize and reconstruct these fragmented inputs, a process we refer to as detokenization.

Understanding this internal detokenization reveals how models maintain an implicit lexicon beyond the raw tokenizer vocabulary. Uncovering these mechanisms has practical implications: we can reduce token counts, speed up inference, and improve performance for languages disproportionately split by tokenizers—all without sacrificing comprehension of words or expressions.

A motivating observation: LLMs can tell words from non-words

Can LLMs decide when a sequence of sub-word tokens forms a real word rather than gibberish?

To explore this, we build a balanced dataset using 10,000 multi-token English words from the Gutenberg corpus (Gerlach & Font-Clos, 2018) paired with nonwords generated by shuffling tokens from real words. The nonwords are created by reordering tokens while preserving their typical positions—for example, an ing token always appears at the end and an un token at the beginning—thus mitigating distributional and positional biases that naturally arise in word formation and may otherwise introduce artifacts.
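One plausible way to implement this position-preserving shuffle is sketched below in Python; the `tokenize` helper and the pooling of tokens by position are our illustrative assumptions rather than the paper's exact procedure.

import random
from collections import defaultdict

def build_nonwords(words, tokenize, n_nonwords):
    """Create nonwords by recombining sub-word tokens of real multi-token words,
    keeping each token in a position it typically occupies (first, middle, last).
    `tokenize` maps a word to its list of sub-word tokens."""
    pools = defaultdict(list)   # position -> tokens observed in that position
    lengths = []
    for w in words:
        toks = tokenize(w)      # assumed to return >= 2 tokens per word
        lengths.append(len(toks))
        for i, t in enumerate(toks):
            pos = "first" if i == 0 else ("last" if i == len(toks) - 1 else "mid")
            pools[pos].append(t)

    real = {tuple(tokenize(w)) for w in words}
    nonwords = []
    while len(nonwords) < n_nonwords:
        k = random.choice(lengths)                       # match the real length distribution
        toks = [random.choice(pools["first"])]
        toks += [random.choice(pools["mid"]) for _ in range(k - 2)]
        toks.append(random.choice(pools["last"]))
        if tuple(toks) not in real:                      # discard accidental real words
            nonwords.append(toks)
    return nonwords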

Illustration of the dataset creation process
Fig. 3: Nonwords generated by shuffling tokens while retaining their natural positions.

We now apply a 5-nearest neighbors classifier to the hidden representations of the last tokens across different layers, testing the model's ability to distinguish words from nonwords on unseen examples. The classifier’s accuracy increases rapidly in the early layers, peaking at 89% around layer 13 when using the last token, while using the penultimate token yields only 61% accuracy—nearly at chance level.
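A minimal sketch of this probe, assuming the last-token (or penultimate-token) hidden states have already been collected per layer (e.g., with `output_hidden_states=True` in Hugging Face transformers); the variable names are ours.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_probe_per_layer(hidden_states, labels, k=5):
    """hidden_states: (n_examples, n_layers, d_model) array of token representations;
    labels: 1 for real words, 0 for nonwords. Returns held-out accuracy per layer."""
    accuracies = []
    for layer in range(hidden_states.shape[1]):
        X = hidden_states[:, layer, :]
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, labels, test_size=0.2, random_state=0, stratify=labels)
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        accuracies.append(clf.score(X_te, y_te))
    return np.array(accuracies)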

Visualization of classification accuracy across layers
Fig. 4: Classification accuracy peaks at 89% for the last token versus 61% for the penultimate token.

These findings indicate that LLMs construct internal representations in the early to middle layers that signal whether a token sequence constitutes a coherent word. This observation motivates our next investigation into how word identity is extracted from LLM hidden states.

Extracting word identity from LLM hidden states & how detokenization happens

So far, an abstract notion of “this is a word” has emerged in the model’s hidden states. But what if the final token actually becomes the word—and when does this merging occur?

To answer this, we examine three groups of token sequences using WikiText-103 (Merity et al., 2017): (i) single-token words that we artificially split into 2–5 sub-word tokens, (ii) single-token words split due to typos, and (iii) naturally occurring multi-token (out-of-vocabulary) words.

Examples of artificially split words, typo-induced splits, and natural multi-token words
Fig. 5: The three groups of token sequences under investigation.

We then determine whether the final token “equals” the full word by probing the hidden representations. For in-vocabulary words artificially split into multiple tokens, we apply logit lens (nostalgebraist, 2020) to project the final token’s hidden state onto the input embedding space. For multi-token words, we use the Patchscopes technique (Ghandeharioun et al., 2024) to prompt the model to repeat the original word.
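The logit-lens check for artificially split in-vocabulary words can be sketched as follows: compare the final token's hidden state at each layer against the input embedding matrix and test whether the nearest embedding is the original word's token. The model name, the use of cosine similarity, and the assumption that the original word is a single vocabulary token are ours; the Patchscopes prompting used for multi-token words is not shown.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"   # assumed; any causal LM with this API works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")
model.eval()

@torch.no_grad()
def layers_recovering_word(input_ids, target_word):
    """input_ids: 1 x seq_len tensor whose final positions hold the sub-word pieces
    of an artificially split word (the split itself is assumed to be constructed
    elsewhere). Returns the layers at which the final token's hidden state is
    nearest, in input-embedding space, to the original single-token word."""
    target_id = tok(target_word, add_special_tokens=False)["input_ids"][0]
    out = model(input_ids=input_ids.to(model.device), output_hidden_states=True)
    E = model.get_input_embeddings().weight              # (vocab_size, d_model)
    hits = []
    for layer, h in enumerate(out.hidden_states):        # embedding output + every layer
        last = h[0, -1].to(E)                            # final token's hidden state
        sims = torch.nn.functional.cosine_similarity(E, last.unsqueeze(0), dim=-1)
        if sims.argmax().item() == target_id:
            hits.append(layer)
    return hits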

Schematic of detokenization methodology using logit lens and Patchscopes
Fig. 6: Overview of our methodology for interpreting the residual stream.

Our experiments show that the final token begins to merge into the full word's representation in the early to middle layers. For artificially split words, retrieval accuracy using the final token increases after layer 4 and peaks above 80% around layer 15, with 93.2% of words correctly recovered in at least one layer; the corresponding at-least-one-layer rates are 66% for typo-induced splits and 78% for naturally occurring multi-token words.

Retrieval accuracy for final vs. penultimate tokens across layers
Fig. 7: Retrieval of the original word from the last token of the residual stream across the model's layers.

To understand how detokenization happens, we analyze the transformer’s components. First, we focus on the feedforward network (FFN) layers, which act as implicit key–value memories (Geva et al., 2022; Meng et al., 2022b). By applying the logit lens to FFN outputs in our typos experiment, we find that FFN update vectors match the input representation of the word in roughly 5% of the layers and, cumulatively, achieve a 70% match rate—often preceding full recovery in the residual stream. Ablation experiments on Llama2-7B reveal that canceling these critical FFN updates drops retrieval from 85% to 18% and reduces performance on a downstream country–capital prediction task from 88% to 41%.
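A sketch of the FFN inspection, reusing `model` and `tok` from the previous snippet: we hook each layer's MLP output for the final token and project it onto the input embedding space. The `model.model.layers[i].mlp` path follows the Llama architecture in transformers; the ablation itself (zeroing the critical updates) is not shown.

import torch

@torch.no_grad()
def ffn_update_predictions(input_ids, top_k=1):
    """Return, for each layer, the top-k vocabulary tokens whose input embeddings
    are closest to that layer's FFN (MLP) update for the final token."""
    ffn_outputs, hooks = [], []
    for layer in model.model.layers:                     # Llama-style module layout
        hooks.append(layer.mlp.register_forward_hook(
            lambda _m, _inp, out: ffn_outputs.append(out[0, -1].detach())))
    try:
        model(input_ids=input_ids.to(model.device))
    finally:
        for h in hooks:
            h.remove()

    E = model.get_input_embeddings().weight              # (vocab_size, d_model)
    per_layer = []
    for update in ffn_outputs:
        sims = torch.nn.functional.cosine_similarity(E, update.to(E).unsqueeze(0), dim=-1)
        per_layer.append([tok.decode(int(i)) for i in sims.topk(top_k).indices])
    return per_layer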

FFN update analysis via logit lens
Fig. 8: FFN update vectors sporadically align with word representations prior to full convergence in the residual stream, indicating an underlying retrieval mechanism.
Impact of FFN ablation on downstream task performance
Fig. 9: Ablation of key FFN updates reduces downstream task performance from 88% to 41%.

Second, we examine token aggregation by measuring attention weights. For multi-token words, the final token exhibits a sharp attention peak toward its preceding tokens in the first two layers—which then declines by up to 90%—while single-token words show a considerably lower initial attention that later increases. This indicates that early layers aggregate information from prefix tokens, setting the stage for the FFN layers to refine the final token’s representation.
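The attention measurement can be approximated as below, again reusing `model`, and assuming the model is run with an attention implementation that returns weights (e.g., `attn_implementation="eager"`). `word_start` and `word_end` are word-span positions we supply; the exact aggregation in the paper may differ.

import torch

@torch.no_grad()
def attention_to_prefix(input_ids, word_start, word_end):
    """For a word spanning positions word_start..word_end (inclusive), return, per
    layer, the attention mass that the word's final token assigns to the word's
    preceding tokens, averaged over heads."""
    out = model(input_ids=input_ids.to(model.device), output_attentions=True)
    scores = []
    for attn in out.attentions:                 # one (batch, heads, seq, seq) tensor per layer
        row = attn[0, :, word_end, :]           # the final token's query row, all heads
        scores.append(row[:, word_start:word_end].sum(dim=-1).mean().item())
    return scores                               # one value per layer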

Attention from final token to prefix tokens in multi-token words
Fig. 10: Early layers (layers 2–3) show a significant attention peak from the final token to its prefix tokens in multi-token words.

In summary, our analysis reveals a two-stage detokenization process: first, early layers aggregate information from prefix tokens via high attention, and then FFN layers refine the final token's hidden state to retrieve the full word representation from an implicit key–value memory. These findings, obtained with techniques such as the logit lens (nostalgebraist, 2020) and Patchscopes (Ghandeharioun et al., 2024), provide key insights into how LLMs construct word-level representations from sub-word tokens.

Expanding LLM vocabulary without finetuning

Our analysis shows that LLMs naturally fuse multi-token words into a single-token representation. This insight raises a key question: can we leverage these fused representations to process fewer tokens—thereby reducing computation—without sacrificing generation quality?

Motivated by this idea, we propose a post-hoc vocabulary expansion method that detects words the model effectively detokenizes to a single vector. By extracting this representative hidden state, we can treat it as a new token embedding in both the input and output spaces, ultimately reducing the total number of tokens processed during inference.

Our framework follows a 3-step process:

  1. Representation Extraction: For each multi-token word w (from WikiText-103; Merity et al., 2017), we feed w into the model and apply Patchscopes (Ghandeharioun et al., 2024) with a specific prompt to its final token across all layers. We then identify the earliest layer ℓ at which this hidden state decodes to the complete word.
  2. Linear Mapping: We learn two linear transformations, Tℓ,E and Tℓ,U, via orthogonal Procrustes (Schönemann, 1966), which map the hidden states at layer ℓ onto the model's input embedding matrix E and output unembedding matrix U, respectively. These mappings are learned solely from the original vocabulary (a minimal sketch follows this list).
  3. Representation Refinement: Finally, we refine the new token embeddings by initializing two refinement matrices, WE and WU, and computing the final embeddings as e = ê + WE ê and u = û + WU û. The refinement matrices are optimized in a short continued pretraining run while keeping all other model parameters frozen. If a word is never successfully detokenized (i.e. a representative hidden state is not found), it is excluded from the expanded vocabulary.
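A minimal sketch of step 2, assuming the layer-ℓ hidden states for the original single-token vocabulary (`H_vocab`) have already been collected; `scipy.linalg.orthogonal_procrustes` solves the constrained least-squares problem, and the same call with the unembedding matrix U in place of E yields Tℓ,U. The refinement matrices of step 3 are trained separately and omitted here.

import numpy as np
from scipy.linalg import orthogonal_procrustes

def fit_vocab_mapping(H_vocab, E):
    """Learn an orthogonal map T minimizing ||H_vocab @ T - E||_F, where H_vocab[i]
    is the layer-l hidden state of vocabulary token i and E[i] is its input
    embedding; both are (vocab_size, d_model) arrays."""
    T, _scale = orthogonal_procrustes(H_vocab, E)
    return T                                    # (d_model, d_model)

def map_new_word(h_word, T):
    """Map the detokenized hidden state of a new multi-token word into embedding
    space, giving the pre-refinement embedding e_hat (step 3 adds e = e_hat + WE e_hat)."""
    return h_word @ T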
Stages of the vocabulary expansion process
Fig. 11: A 3-step method for expanding the vocabulary without modifying the model's core parameters.

Harnessing these newly derived token embeddings offers significant computational benefits. Our experiments demonstrate that using the expanded vocabulary reduces the average sequence length by 10.5% on WikiText-103, 13.5% on PubMed, and 14.5% on Arabic Wiki40B—all while maintaining overall generation performance.

Table summarizing token reduction results
Table 1: Token reduction rates achieved with vocabulary expansion—10.5% for WikiText-103, 13.5% for PubMed, and 14.5% for Arabic Wiki40B.

In summary, our vocabulary expansion method leverages detokenized word representations to enable processing of fewer tokens during both input encoding and generation, thereby reducing computational cost and inference latency. This approach is especially beneficial for languages and domains where token sequences are inherently longer, achieving efficiency gains with minimal to no impact on generation quality.

Related work

Patchscopes: A Framework for Inspecting Hidden Representations

Ghandeharioun et al., 2024

This work introduces a unified method for visualizing hidden token representations in language models. Its techniques are pivotal in analyzing detokenization and have directly informed our approach to isolating representative word-level embeddings.

Patchscopes Project Preview

The Remarkable Robustness of LLMs: Stages of Inference

Lad, Gurnee, & Tegmark, 2024

This work investigates the robustness of large language models by deleting and swapping adjacent layers. The authors identify four universal inference stages—detokenization, feature engineering, prediction ensembling, and residual sharpening. In particular, their findings on early-stage detokenization, where local token information is aggregated into coherent representations, resonate with our study on implicit word-level detokenization.

Stages of Inference Visualization

Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs

Feucht et al., 2024

This study investigates how selectively erasing tokens leaves discernible traces in hidden representations, revealing implicit vocabulary items within LLMs. These insights underscore the latent structures that our work leverages for effective detokenization.

Token Erasure Visualization

Transformer Feed-Forward Layers Are Key-Value Memories

Geva et al., 2021

This work demonstrates that feed-forward layers in transformers operate as key-value memories, storing and retrieving factual and linguistic information. Its findings provide a critical theoretical foundation for our investigation into how word-level representations emerge from subword tokens.

FFN Key-Value Memories Visualization

Poster

From Tokens to Words Poster

BibTeX

@misc{kaplan2025tokenswordsinnerlexicon,
      title={From Tokens to Words: On the Inner Lexicon of LLMs}, 
      author={Guy Kaplan and Matanel Oren and Yuval Reif and Roy Schwartz},
      year={2025},
      eprint={2410.05864},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.05864}, 
}