Follow the Flow: On Information Flow Across Textual Tokens in Text-to-Image Models

Hebrew University of Jerusalem & Technion – Israel Institute of Technology
Preprint

*Indicates Equal Contribution

Abstract

Text-to-image generation models suffer from alignment problems, where generated images fail to accurately capture the objects and relations in the text prompt. Prior work has focused on improving alignment by refining the diffusion process, ignoring the role of the text encoder, which guides the diffusion. In this work, we investigate how semantic information is distributed across token representations in text-to-image prompts, analyzing it at two levels: (1) in-item representation—whether individual tokens represent their lexical item (i.e., a word or expression conveying a single concept), and (2) cross-item interaction—whether information flows between tokens of different lexical items. We use patching techniques to uncover encoding patterns, and find that information is usually concentrated in only one or two of the item's tokens; for example, in the item "San Francisco's Golden Gate Bridge", the token "Gate" sufficiently captures the entire expression while the other tokens could effectively be discarded. Lexical items also tend to remain isolated; for instance, in the prompt "a green dog", the token "dog" encodes no visual information about "green". However, in some cases, items do influence each other's representation, often leading to misinterpretations—e.g., in the prompt "a pool by a table", the token "pool" represents a "pool table" after contextualization. Our findings highlight the critical role of token-level encoding in image generation, and demonstrate that simple interventions at the encoding stage can substantially improve alignment and generation quality.

Key Findings

  • Concentrated Information: In most cases, a lexical item’s meaning is concentrated in just one or two tokens.
  • Redundancy Reduction: Removing redundant tokens improves image generation accuracy, lowering errors by 21%.
  • Semantic Leakage: In 11% of cases, tokens unintentionally leak information between unrelated items, potentially leading to misinterpretations.
  • Effective Patching: Replacing leaked token representations with their uncontextualized versions mitigates semantic leakage by 85%.
  • Efficient Token Identification: A simple k-NN classifier predicts redundant tokens with 92% precision, enabling practical improvements in T2I pipelines.
  • Catastrophic Negligence: In about 7% of cases, a lexical item is accurately encoded but fails to appear in the final image, highlighting a disconnect between the encoder and the decoder.
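
The k-NN idea in the findings above can be illustrated with a toy sketch. The page does not say which features the classifier uses, so the 2-D feature vectors, the labels, the value of `k`, and the function name `knn_predict` below are all hypothetical. This is a generic majority-vote k-NN over Euclidean distance, not the authors' classifier.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], str]  # (feature vector, label)


def knn_predict(train: List[Example], query: Sequence[float], k: int = 3) -> str:
    """Label `query` by majority vote among its k nearest training examples."""
    if k < 1 or k > len(train):
        raise ValueError("k must be between 1 and the number of training examples")
    # Sort training examples by Euclidean distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical per-token features; real features would come from the text encoder.
train = [
    ([0.10, 0.20], "redundant"),
    ([0.20, 0.10], "redundant"),
    ([0.15, 0.15], "redundant"),
    ([0.90, 0.80], "representative"),
    ([0.80, 0.90], "representative"),
    ([0.85, 0.85], "representative"),
]

print(knn_predict(train, [0.12, 0.18]))
print(knn_predict(train, [0.88, 0.82]))
```

With an odd `k` and two labels the vote cannot tie, which keeps the prediction deterministic in this toy setting.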

Motivation

Despite the impressive advances in text-to-image models, several key issues still persist:

  • Missing Attribute Binding: In prompts such as “a pink colored giraffe,” the model may fail to accurately apply the pink attribute to the giraffe, generating a normal giraffe instead.
  • Semantic Leakage: For example, a prompt like “a person wearing a cone hat is eating” can mistakenly merge the concept of the hat with the act of eating—generating an ice-cream cone hat.
  • Catastrophic Neglect: Sometimes, as with “a tuba made of flower petals,” the primary object (the tuba) may be completely omitted, leaving only flower petals in the resulting image.

Traditionally, most approaches have addressed these challenges from the diffusion model's perspective—by tweaking the cross-attention mechanism or optimizing image latents. While these methods can mitigate certain errors, they do not solve all cases.

This observation raises a fundamental question: Could these issues originate earlier in the pipeline, within the text encoder itself?

Examples illustrating semantic leakage, missing attribute binding, and catastrophic neglect.
Left: “a tuba made of flower petals” (the tuba is neglected).
Middle: “a pink colored giraffe” (attribute binding is missing).
Right: “a person wearing a cone hat is eating” (semantic leakage from eating to the cone hat).

Tracing Token Representations in the Text Encoder

We analyze how information is distributed within each lexical item by generating images from individual token representations.

Visualization of individual token representations within a lexical item.
Analysis of in-item information flow: Only select tokens capture the full semantic load of the lexical item.

Our experiments show that for most words or expressions, only one or two tokens effectively carry the full meaning.

Highlighted representative tokens among all token representations.
Typically, one or two tokens are representative while others are redundant.
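
One way to set up such single-token probes is sketched below. It assumes the encoder output is a sequence of per-token vectors and that every position except the probed one is overwritten with a neutral filler vector. The function name `single_token_probes`, the filler choice, and the use of plain lists in place of tensors are illustrative assumptions, not the authors' implementation. Each probe would then be passed to the diffusion model in place of the full prompt encoding.

```python
from typing import List, Sequence

Vector = List[float]


def single_token_probes(encoded: Sequence[Vector], filler: Vector) -> List[List[Vector]]:
    """Build one probe sequence per token position.

    In probe i, only position i keeps its original representation; every other
    position is replaced by `filler`, so an image generated from the probe can
    only draw on that single token.
    """
    probes = []
    for keep in range(len(encoded)):
        probe = [list(vec) if i == keep else list(filler) for i, vec in enumerate(encoded)]
        probes.append(probe)
    return probes


# Toy 2-dimensional "encoder output" for a three-token lexical item.
encoded = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
pad = [0.0, 0.0]  # hypothetical filler vector, e.g. a padding embedding

probes = single_token_probes(encoded, pad)
for position, probe in enumerate(probes):
    print(position, probe)
```

Comparing the image generated from each probe against the image from the full encoding indicates which tokens are representative on their own.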

To further validate this observation, we remove the unrepresentative (redundant) tokens and regenerate the image. The results confirm that omitting these tokens not only reproduces the original image but can also improve aspects like attribute binding.

Redundant removal qualitative results
Redundant token removal: The top row shows images generated solely from the representative (bolded) tokens after masking out uninformative tokens, while the bottom row shows the original outputs. On the left, removal has little effect on the overall generation; on the right, it significantly improves the visual alignment with the prompt.
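
The masking step can be sketched as follows. The sketch assumes that "removing" a token means overwriting its encoded vector with a neutral filler while leaving the sequence length unchanged. The helper name `mask_redundant`, the filler vector, and the toy token split are hypothetical, and real tokenization of the example item may differ.

```python
from typing import Iterable, List, Sequence

Vector = List[float]


def mask_redundant(encoded: Sequence[Vector], redundant: Iterable[int], filler: Vector) -> List[Vector]:
    """Return a copy of `encoded` with the redundant positions replaced by `filler`."""
    drop = set(redundant)
    out_of_range = [i for i in drop if not 0 <= i < len(encoded)]
    if out_of_range:
        raise IndexError(f"positions {sorted(out_of_range)} are outside the sequence")
    # Representative positions keep their vectors; redundant ones are masked.
    return [list(filler) if i in drop else list(vec) for i, vec in enumerate(encoded)]


# Toy encoding of a three-token item; "Gate" is treated as the representative token.
tokens = ["Golden", "Gate", "Bridge"]
encoded = [[0.1, 0.2], [0.9, 0.8], [0.3, 0.1]]
pad = [0.0, 0.0]  # hypothetical filler vector

masked = mask_redundant(encoded, redundant=[0, 2], filler=pad)
for token, vec in zip(tokens, masked):
    print(token, vec)
```

Keeping the sequence length fixed means the masked encoding can be fed to the image generator without any change to its expected input shape.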

Information Flow Between Lexical Items

Beyond individual tokens, our work examines how different lexical items interact during encoding. In 89% of cases, items remain independent. In the remaining 11%, unintended interactions occur: for example, the word “bats” may inadvertently adopt features of “baseball,” leading the model to generate a baseball bat instead of the animal.

Examples of information flow between items. Top: Images generated from a lexical item encoded alongside another item that alters its representation. Bottom: Images generated from the uncontextualized representation of the same lexical item. The first three images (from the left) demonstrate correct information flow, while the last image (far right) demonstrates incorrect information flow.

We propose a simple yet effective method: by re-encoding the suspected leaked token in isolation and patching it into the prompt’s representation, we reduce semantic leakage errors by 85%. This approach demonstrates that addressing the issue at the token level—in the text encoder—can significantly improve overall generation quality.

Method for removing semantic leakage by replacing the leaked concept representation.
Removing semantic leakage by replacing the contextually leaked concept representation. (1) Regular generation produces an image showing a crosswalk to the right of a bus station. (2) Generation from the prompt “standing zebra,” without any context, results in the correct interpretation of the zebra as an animal. (3) Using the original prompt but substituting the leaked concept with its uncontextualized representation yields the desired image.

In practice, this patching step significantly reduces misinterpretations. By restoring each item’s “clean” encoding, we ensure that contextual cues do not unintentionally override the core meaning of an entity.
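
A minimal sketch of the patching step follows. It assumes both the full prompt and the isolated item have already been encoded into per-token vectors and that the item covers the same number of tokens in each encoding. The function name `patch_span`, the half-open span convention, and the toy values are illustrative assumptions rather than the authors' code.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Span = Tuple[int, int]  # half-open [start, end) token positions


def patch_span(
    contextual: Sequence[Vector],
    isolated: Sequence[Vector],
    ctx_span: Span,
    iso_span: Span,
) -> List[Vector]:
    """Copy `contextual`, replacing `ctx_span` with the vectors at `iso_span` of `isolated`."""
    ctx_start, ctx_end = ctx_span
    iso_start, iso_end = iso_span
    if ctx_end - ctx_start != iso_end - iso_start:
        raise ValueError("spans must cover the same number of tokens")
    patched = [list(vec) for vec in contextual]
    # Overwrite the leaked item's positions with its uncontextualized vectors.
    patched[ctx_start:ctx_end] = [list(vec) for vec in isolated[iso_start:iso_end]]
    return patched


# Toy encodings: a five-token prompt whose last two tokens form the leaked item,
# and the same two-token item encoded on its own.
contextual = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0], [5.0, 5.0]]
isolated = [[-4.0, -4.0], [-5.0, -5.0]]

patched = patch_span(contextual, isolated, ctx_span=(3, 5), iso_span=(0, 2))
print(patched)
```

Only the suspected item is overwritten, so the rest of the prompt keeps its contextualized encoding.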

Catastrophic Negligence

In about 7% of cases, the text encoder captures the intended concept but the final image omits it completely, a phenomenon we term catastrophic negligence. Our analysis, corroborated by tools like Patchscopes, indicates that the failure lies in the diffusion model rather than in the encoder, likely due to insufficient training examples or inherent decoding challenges.

Examples of catastrophic negligence where key concepts are omitted.
Despite accurate encoding, some items are entirely missing from the images generated from the relevant tokens.

Supplementary: In-Item Token Distribution

Our supplementary analysis further illustrates how semantic weight is unevenly distributed among tokens. In many cases, one or two sub-tokens carry most of the meaning, while the others contribute little. This insight opens up avenues for more efficient token-level interventions in T2I systems.

Animated view of token-level concept representation.
An animated demonstration of how only a few tokens carry the primary semantic load.

BibTeX


@misc{kaplan2025followflowinformationflow,
      title={Follow the Flow: On Information Flow Across Textual Tokens in Text-to-Image Models}, 
      author={Guy Kaplan and Michael Toker and Yuval Reif and Yonatan Belinkov and Roy Schwartz},
      year={2025},
      eprint={2504.01137},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.01137}, 
}