What causes semantic leakage in T2I models, and how can it be handled?
Abstract
Text-to-Image (T2I) models often suffer from issues such as semantic leakage, incorrect feature binding, and omissions of key concepts in the generated image. This work studies these phenomena by looking into the role of information flow between textual token representations. To this end, we generate images by applying the diffusion component to a subset of the contextual token representations in a given prompt and observe several interesting phenomena. First, in many cases, a word or multiword expression is fully represented by one or two tokens, while other tokens are redundant. For example, in "San Francisco's Golden Gate Bridge", the token "gate" alone captures the full expression.
We demonstrate the redundancy of these tokens by removing them after textual encoding and generating an image from the resulting representation. Surprisingly, we find that this process not only maintains image generation performance, but also reduces errors by 21% compared to standard generation. We then show that information can also flow between different expressions in a sentence, which often leads to semantic leakage. Based on this observation, we propose a simple, training-free method to mitigate semantic leakage: replacing the leaked item's representation after textual encoding with its uncontextualized representation. Remarkably, this simple approach reduces semantic leakage by 85%. Overall, our work provides a comprehensive analysis of information flow across textual tokens in T2I models, offering both novel insights and practical benefits.
Key Findings
Concentrated Information: In most cases, a lexical item’s meaning is concentrated in just one or two tokens.
Semantic Leakage: In 11% of cases, tokens unintentionally leak information between unrelated items, potentially leading to misinterpretations.
Effective Patching: Replacing leaked token representations with their uncontextualized versions mitigates semantic leakage by 85%.
Efficient Token Identification: A simple k-NN classifier predicts redundant tokens with 92% precision, enabling practical improvements in T2I pipelines (see the sketch after this list).
Catastrophic Negligence: In about 7% of cases, a lexical item is accurately encoded but fails to appear in the final image, highlighting a disconnect between the encoder and the decoder.
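The k-NN classifier from the "Efficient Token Identification" finding can be sketched as follows. This is a minimal illustration under our own assumptions rather than the paper's exact setup: tokens are represented by their contextual embeddings from the text encoder, labels mark redundant tokens, and the file names, feature choice, and value of k are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score

# Hypothetical inputs: per-token contextual embeddings from the text encoder
# and 0/1 labels marking redundant tokens (both file names are placeholders).
token_embeddings = np.load("token_embeddings.npy")   # shape (N, d)
is_redundant = np.load("redundant_labels.npy")       # shape (N,)

X_train, X_test, y_train, y_test = train_test_split(
    token_embeddings, is_redundant, test_size=0.2, random_state=0
)

# Fit a plain k-NN classifier and measure precision on held-out tokens.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
print(f"precision: {precision_score(y_test, pred):.2f}")
```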
Motivation
Despite the impressive advances in text-to-image models, several key issues still persist:
Missing Attribute Binding: In prompts such as “a pink colored giraffe,” the model may fail to accurately apply the pink attribute to the giraffe, generating a normal giraffe instead.
Semantic Leakage: For example, a prompt like “a person wearing a cone hat is eating” can mistakenly merge the concept of the hat with the act of eating—generating an ice-cream cone hat.
Catastrophic Neglect: Sometimes, as with “a tuba made of flower petals,” the primary object (the tuba) may be completely omitted, leaving only flower petals in the resulting image.
Traditionally, most approaches have addressed these challenges from the diffusion model's perspective—by tweaking the cross-attention mechanism or optimizing image latents. While these methods can mitigate certain errors, they do not solve all cases.
This observation raises a fundamental question: Could these issues originate earlier in the pipeline, within the text encoder itself?
Left: “a tuba made of flower petals” (the tuba is neglected).
Middle: “a pink colored giraffe” (attribute binding is missing).
Right: "a person wearing a cone hat is eating" (semantic leakage from eating to the cone hat).
Tracing Token Representations in the Text Encoder
We analyze how information is distributed within each lexical item by generating images from individual token representations.
Analysis of in-item information flow: Only select tokens capture the full semantic load of the lexical item.
Our experiments show that for most words and expressions, only one or two tokens are representative and effectively carry the full meaning, while the remaining tokens are redundant.
To further validate this observation, we remove the unrepresentative (redundant) tokens and regenerate the image. The results confirm that omitting these tokens not only reproduces the original image but can also improve aspects like attribute binding.
Redundant token removal: The top row shows images generated solely from the representative (bolded) tokens after masking out uninformative tokens, while the bottom row shows the original outputs. On the left, removal has little effect on the overall generation; on the right, it significantly improves the visual alignment with the prompt.
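Below is a minimal sketch of this redundant-token intervention, assuming a diffusers Stable Diffusion pipeline (not necessarily the exact model used in the paper). The redundant token positions are hard-coded and hypothetical; in practice they would be identified automatically, e.g., with a classifier like the one sketched above. Redundant positions are overwritten with pad-token representations before the edited encoding is passed to the diffusion component.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "San Francisco's Golden Gate Bridge"
# Hypothetical positions of the non-representative sub-tokens
# (everything except the representative token, e.g. "gate").
redundant_positions = [1, 2, 3, 4, 6]

def encode(text):
    """Contextual token representations from the pipeline's text encoder."""
    ids = pipe.tokenizer(
        text, padding="max_length",
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt"
    ).input_ids.to("cuda")
    with torch.no_grad():
        return pipe.text_encoder(ids)[0]

prompt_embeds = encode(prompt)   # contextual encoding of the full prompt
pad_embeds = encode("")          # encoding of an empty prompt: pad-token representations

# Overwrite the redundant positions with pad representations, keeping only
# the representative tokens' contextual embeddings.
for pos in redundant_positions:
    prompt_embeds[:, pos] = pad_embeds[:, pos]

# Generate directly from the edited representation.
image = pipe(prompt_embeds=prompt_embeds).images[0]
image.save("golden_gate_masked.png")
```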
Information Flow Between Lexical Items
Beyond individual tokens, our work examines how different lexical items interact during encoding. In 89% of cases, items remain independent; in the remaining 11%, unintended interactions occur. For example, the word "bats" may inadvertently adopt features of "baseball," leading the model to generate a baseball bat instead of the animal.
Examples of information flow between items.
Top: Images generated from a lexical item encoded alongside another item that alters its representation.
Bottom: Images generated from the uncontextualized representation of the same lexical item.
The first three images (from the left) demonstrate correct information flow, while the last image (far right) demonstrates incorrect information flow.
We propose a simple yet effective method: by re-encoding the suspected leaked token in isolation and patching it into the prompt’s representation, we reduce semantic leakage errors by 85%. This approach demonstrates that addressing the issue at the token level—in the text encoder—can significantly improve overall generation quality.
Removing semantic leakage by replacing the contextually leaked concept representation.
(1) Regular generation produces an image showing a crosswalk to the right of a bus station.
(2) Generation from the prompt “standing zebra,” without any context, results in the correct interpretation of the zebra as an animal.
(3) Using the original prompt but substituting the leaked concept with its uncontextualized representation yields the desired image.
In practice, this patching step significantly reduces misinterpretations. By restoring each item’s “clean” encoding, we ensure that contextual cues do not unintentionally override the core meaning of an entity.
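A minimal sketch of this patching step, under the same assumptions as the snippet above (a diffusers Stable Diffusion pipeline; the prompt is adapted from the zebra example, and all token positions are hypothetical): the leaked item is re-encoded in isolation, and its clean representation is copied over the corresponding positions in the full prompt's encoding.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a standing zebra to the right of a bus station"  # context leaks into "zebra"
leaked_item = "standing zebra"                              # item to re-encode in isolation
prompt_positions = [2, 3]     # hypothetical positions of the item's tokens in the full prompt
isolated_positions = [1, 2]   # their positions when the item is encoded alone

def encode(text):
    """Contextual token representations from the pipeline's text encoder."""
    ids = pipe.tokenizer(
        text, padding="max_length",
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt"
    ).input_ids.to("cuda")
    with torch.no_grad():
        return pipe.text_encoder(ids)[0]

ctx_embeds = encode(prompt)        # full-prompt encoding, where leakage occurs
iso_embeds = encode(leaked_item)   # uncontextualized ("clean") encoding of the item

# Patch: overwrite the leaked item's contextual representation with its
# uncontextualized one; every other token representation stays untouched.
for p_ctx, p_iso in zip(prompt_positions, isolated_positions):
    ctx_embeds[:, p_ctx] = iso_embeds[:, p_iso]

image = pipe(prompt_embeds=ctx_embeds).images[0]
image.save("zebra_patched.png")
```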
Catastrophic Negligence
In a small percentage of cases (around 7%), even though the text encoder captures the intended concept, the final image omits it completely—a phenomenon we term catastrophic negligence. Our analysis, corroborated by tools like Patchscopes, shows that while the text encoder reliably captures the intended concepts, the diffusion model occasionally fails to generate them—likely due to insufficient training examples or inherent decoding challenges.
Despite accurate encoding of the relevant tokens, some items are entirely missing from the generated image.
Supplementary: In-Item Token Distribution
Our supplementary analysis further illustrates how semantic weight is unevenly distributed among tokens. In many cases, one or two sub-tokens carry most of the meaning, while the others contribute little. This insight opens up avenues for more efficient token-level interventions in T2I systems.
An animated demonstration of how only a few tokens carry the primary semantic load.
Related Work
This influential work introduces an approach that enhances image generation by optimizing attention maps to focus on neglected entities. Its method for mitigating semantic leakage and improving attribute binding has inspired subsequent research.
This work offers a detailed analysis of how padding tokens impact token representations in text-to-image models. Its insights have been crucial for our approach, inspiring the use of pad embeddings in our intervention method to effectively identify and remove redundant tokens.
Patchscopes provides a unified approach for visualizing and interpreting hidden token representations in language models. Its techniques support analyses of semantic leakage and help guide token-level interventions.
This influential study investigates how subword tokens are fused into coherent word-level representations—a process known as implicit detokenization. Its findings laid the groundwork for our analysis of information concentration across tokens, informing our methods for isolating representative tokens in T2I pipelines.
This work critically analyzes how text-to-image models map words to visual concepts, exposing challenges such as semantic leakage and misbinding of attributes. Its findings have driven further research into token-level information flow.
BibTeX
@misc{kaplan2025followflowinformationflow,
title={Follow the Flow: On Information Flow Across Textual Tokens in Text-to-Image Models},
author={Guy Kaplan and Michael Toker and Yuval Reif and Yonatan Belinkov and Roy Schwartz},
year={2025},
eprint={2504.01137},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.01137},
}