What causes semantic leakage in T2I models, and how can it be handled?
Abstract
Text-to-Image (T2I) models often suffer from issues such as semantic leakage, incorrect feature binding, and omissions of key concepts in the generated image. This work studies these phenomena by looking into the role of information flow between textual token representations. To this end, we generate images by applying the diffusion component to a subset of the contextual token representations in a given prompt and observe several interesting phenomena. First, in many cases, a word or multiword expression is fully represented by one or two tokens, while other tokens are redundant. For example, in "San Francisco's Golden Gate Bridge", the token "gate" alone captures the full expression.
We demonstrate the redundancy of these tokens by removing them after textual encoding and generating an image from the resulting representation. Surprisingly, we find that this process not only maintains image generation performance, but also reduces errors by 21% compared to standard generation. We then show that information can also flow between different expressions in a sentence, which often leads to semantic leakage. Based on this observation, we propose a simple, training-free method to mitigate semantic leakage: replacing the leaked item's representation after textual encoding with its uncontextualized representation. Remarkably, this simple approach reduces semantic leakage by 85%. Overall, our work provides a comprehensive analysis of information flow across textual tokens in T2I models, offering both novel insights and practical benefits.
Key Findings
Concentrated Information: In most cases, a lexical item’s meaning is concentrated in just one or two tokens.
Semantic Leakage: In 11% of cases, tokens unintentionally leak information between unrelated items, potentially leading to misinterpretations.
Effective Patching: Replacing leaked token representations with their uncontextualized versions mitigates semantic leakage by 85%.
Efficient Token Identification: A simple k-NN classifier predicts redundant tokens with 92% precision, enabling practical improvements in T2I pipelines (see the sketch after this list).
Catastrophic Negligence: In about 7% of cases, a lexical item is accurately encoded but fails to appear in the final image, highlighting a disconnect between the encoder and the decoder.
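The k-NN classifier from the "Efficient Token Identification" finding can be sketched as follows. This is a minimal illustration under our own assumptions rather than the paper's exact setup: tokens are represented by their contextual embeddings from the text encoder, labels mark redundant tokens, and the file names, feature choice, and value of k are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score

# Hypothetical inputs: per-token contextual embeddings from the text encoder
# and 0/1 labels marking redundant tokens (both file names are placeholders).
token_embeddings = np.load("token_embeddings.npy")   # shape (N, d)
is_redundant = np.load("redundant_labels.npy")       # shape (N,)

X_train, X_test, y_train, y_test = train_test_split(
    token_embeddings, is_redundant, test_size=0.2, random_state=0
)

# Fit a plain k-NN classifier and measure precision on held-out tokens.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
print(f"precision: {precision_score(y_test, pred):.2f}")
```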
Motivation
Despite the impressive advances in text-to-image models, several key issues still persist:
Missing Attribute Binding: In prompts such as “a pink colored giraffe,” the model may fail to accurately apply the pink attribute to the giraffe, generating a normal giraffe instead.
Semantic Leakage: For example, a prompt like “a person wearing a cone hat is eating” can mistakenly merge the concept of the hat with the act of eating—generating an ice-cream cone hat.
Catastrophic Neglect: Sometimes, as with “a tuba made of flower petals,” the primary object (the tuba) may be completely omitted, leaving only flower petals in the resulting image.
Traditionally, most approaches have addressed these challenges from the diffusion model's perspective—by tweaking the cross-attention mechanism or optimizing image latents. While these methods can mitigate certain errors, they do not solve all cases.
This observation raises a fundamental question: Could these issues originate earlier in the pipeline, within the text encoder itself?
Left: “a tuba made of flower petals” (the tuba is neglected).
Middle: “a pink colored giraffe” (attribute binding is missing).
Right: "a person wearing a cone hat is eating" (semantic leakage from eating to the cone hat).
Tracing Token Representations in the Text Encoder
We analyze how information is distributed within each lexical item by generating images from individual token representations.
Analysis of in-item information flow: Only select tokens capture the full semantic load of the lexical item.
Our experiments show that for most words and expressions, only one or two tokens are representative and effectively carry the full meaning, while the remaining tokens are redundant.
To further validate this observation, we remove the unrepresentative (redundant) tokens and regenerate the image. The results confirm that omitting these tokens not only reproduces the original image but can also improve aspects like attribute binding.
Redundant token removal: The top row shows images generated solely from the representative (bolded) tokens after masking out uninformative tokens, while the bottom row shows the original outputs. On the left, removal has little effect on the overall generation; on the right, it significantly improves the visual alignment with the prompt.
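Below is a minimal sketch of this redundant-token intervention, assuming a diffusers Stable Diffusion pipeline (not necessarily the exact model used in the paper). The redundant token positions are hard-coded and hypothetical; in practice they would be identified automatically, e.g., with a classifier like the one sketched above. Redundant positions are overwritten with pad-token representations before the edited encoding is passed to the diffusion component.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "San Francisco's Golden Gate Bridge"
# Hypothetical positions of the non-representative sub-tokens
# (everything except the representative token, e.g. "gate").
redundant_positions = [1, 2, 3, 4, 6]

def encode(text):
    """Contextual token representations from the pipeline's text encoder."""
    ids = pipe.tokenizer(
        text, padding="max_length",
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt"
    ).input_ids.to("cuda")
    with torch.no_grad():
        return pipe.text_encoder(ids)[0]

prompt_embeds = encode(prompt)   # contextual encoding of the full prompt
pad_embeds = encode("")          # encoding of an empty prompt: pad-token representations

# Overwrite the redundant positions with pad representations, keeping only
# the representative tokens' contextual embeddings.
for pos in redundant_positions:
    prompt_embeds[:, pos] = pad_embeds[:, pos]

# Generate directly from the edited representation.
image = pipe(prompt_embeds=prompt_embeds).images[0]
image.save("golden_gate_masked.png")
```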
Information Flow Between Lexical Items
Beyond individual tokens, our work examines how different lexical items interact during encoding. In 89% of cases, items remain independent; in the remaining 11%, unintended interactions occur. For example, the word "bats" may inadvertently adopt features of "baseball," leading the model to generate a baseball bat instead of the animal.
Examples of information flow between items.
Top: Images generated from a lexical item encoded alongside another item that alters its representation.
Bottom: Images generated from the uncontextualized representation of the same lexical item.
The first three images (from the left) demonstrate correct information flow, while the last image (far right) demonstrates incorrect information flow.
We propose a simple yet effective method: by re-encoding the suspected leaked token in isolation and patching it into the prompt’s representation, we reduce semantic leakage errors by 85%. This approach demonstrates that addressing the issue at the token level—in the text encoder—can significantly improve overall generation quality.
Removing semantic leakage by replacing the contextually leaked concept representation.
(1) Regular generation produces an image showing a crosswalk to the right of a bus station.
(2) Generation from the prompt “standing zebra,” without any context, results in the correct interpretation of the zebra as an animal.
(3) Using the original prompt but substituting the leaked concept with its uncontextualized representation yields the desired image.
In practice, this patching step significantly reduces misinterpretations. By restoring each item’s “clean” encoding, we ensure that contextual cues do not unintentionally override the core meaning of an entity.
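A minimal sketch of this patching step, under the same assumptions as the snippet above (a diffusers Stable Diffusion pipeline; the prompt is adapted from the zebra example, and all token positions are hypothetical): the leaked item is re-encoded in isolation, and its clean representation is copied over the corresponding positions in the full prompt's encoding.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a standing zebra to the right of a bus station"  # context leaks into "zebra"
leaked_item = "standing zebra"                              # item to re-encode in isolation
prompt_positions = [2, 3]     # hypothetical positions of the item's tokens in the full prompt
isolated_positions = [1, 2]   # their positions when the item is encoded alone

def encode(text):
    """Contextual token representations from the pipeline's text encoder."""
    ids = pipe.tokenizer(
        text, padding="max_length",
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt"
    ).input_ids.to("cuda")
    with torch.no_grad():
        return pipe.text_encoder(ids)[0]

ctx_embeds = encode(prompt)        # full-prompt encoding, where leakage occurs
iso_embeds = encode(leaked_item)   # uncontextualized ("clean") encoding of the item

# Patch: overwrite the leaked item's contextual representation with its
# uncontextualized one; every other token representation stays untouched.
for p_ctx, p_iso in zip(prompt_positions, isolated_positions):
    ctx_embeds[:, p_ctx] = iso_embeds[:, p_iso]

image = pipe(prompt_embeds=ctx_embeds).images[0]
image.save("zebra_patched.png")
```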
Catastrophic Negligence
In a small percentage of cases (around 7%), even though the text encoder captures the intended concept, the final image omits it completely—a phenomenon we term catastrophic negligence. Our analysis, corroborated by tools like Patchscopes, shows that while the text encoder reliably captures the intended concepts, the diffusion model occasionally fails to generate them—likely due to insufficient training examples or inherent decoding challenges.
Despite accurate encoding of the relevant tokens, some items are entirely missing from the generated image.
Supplementary: In-Item Token Distribution
Our supplementary analysis further illustrates how semantic weight is unevenly distributed among tokens. In many cases, one or two sub-tokens carry most of the meaning, while the others contribute little. This insight opens up avenues for more efficient token-level interventions in T2I systems.
An animated demonstration of how only a few tokens carry the primary semantic load.
Related Work
This influential work introduces an approach that enhances image generation by optimizing attention maps to focus on neglected entities. Its method for mitigating semantic leakage and improving attribute binding has inspired subsequent research.
This work offers a detailed analysis of how padding tokens impact token representations in text-to-image models. Its insights have been crucial for our approach, inspiring the use of pad embeddings in our intervention method to effectively identify and remove redundant tokens.
Patchscopes provides a unified approach for visualizing and interpreting hidden token representations in language models. Its techniques support analyses of semantic leakage and help guide token-level interventions.
This influential study investigates how subword tokens are fused into coherent word-level representations—a process known as implicit detokenization. Its findings laid the groundwork for our analysis of information concentration across tokens, informing our methods for isolating representative tokens in T2I pipelines.
This work critically analyzes how text-to-image models map words to visual concepts, exposing challenges such as semantic leakage and misbinding of attributes. Its findings have driven further research into token-level information flow.
BibTeX
@misc{kaplan2025followflowinformationflow,
title={Follow the Flow: On Information Flow Across Textual Tokens in Text-to-Image Models},
author={Guy Kaplan and Michael Toker and Yuval Reif and Yonatan Belinkov and Roy Schwartz},
year={2025},
eprint={2504.01137},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.01137},
}