Abstract
Although text-to-image models excel at generating realistic images from text, they struggle with long-form text due to the token limit of the pretrained text encoder. In this paper, we propose the Padding Is Enough (PIE) text encoder, which is trained to represent a long-form text context with a single embedding. The embedding is learned with a knowledge distillation technique: the outputs of the PIE text encoder and the CLIP text encoder are fed into a diffusion model, and the resulting diffusion-model outputs are aligned. Specifically, we do not train the diffusion model but only the text encoder, thereby preserving its extensive pretrained knowledge. Furthermore, the PIE text encoder can be used to extend text prompts in task-specific large pretrained diffusion models, enhancing the expressiveness of these models while reducing costs. We demonstrate that our model serves well as a text extension without damaging the extensive knowledge of the pretrained diffusion models.
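The distillation setup described above can be sketched in miniature. This is a hedged toy illustration, not the paper's implementation: the "diffusion model" is a frozen fixed linear map, the CLIP teacher is a fixed embedding of the truncated prompt, and the PIE student is a trainable linear encoder over long-form text features; all names (`D`, `teacher_emb`, `W`, `x_long`) are assumptions for illustration. The key property it demonstrates is the one stated in the abstract: the loss aligns the two encoders' outputs *after* they pass through the diffusion model, and gradients update only the student text encoder while the diffusion model stays frozen.

```python
import random

random.seed(0)
DIM = 8  # toy embedding dimension (assumption)

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

# Frozen stand-in for the pretrained diffusion model: a fixed linear map.
# It is never updated, mirroring "we do not train the diffusion model".
D = [[random.gauss(0, 0.3) for _ in range(DIM)] for _ in range(DIM)]

# Teacher: a fixed CLIP-style embedding of the (token-limited) prompt.
teacher_emb = [random.gauss(0, 1) for _ in range(DIM)]

# Student "PIE" encoder: a trainable linear map over long-form text features.
x_long = [random.gauss(0, 1) for _ in range(DIM)]
W = [[random.gauss(0, 0.1) for _ in range(DIM)] for _ in range(DIM)]

def loss(W):
    # Distillation loss: MSE between the diffusion-model outputs
    # produced from the student and teacher embeddings.
    s = matvec(D, matvec(W, x_long))
    t = matvec(D, teacher_emb)
    return sum((a - b) ** 2 for a, b in zip(s, t)) / DIM

def train_step(W, lr=0.05):
    # Analytic gradient of the MSE, backpropagated through the frozen
    # map D; only W (the text encoder) is updated.
    s = matvec(D, matvec(W, x_long))
    t = matvec(D, teacher_emb)
    err = [2 * (a - b) / DIM for a, b in zip(s, t)]
    # dL/dW[i][j] = (D^T err)[i] * x_long[j]
    g = [sum(D[k][i] * err[k] for k in range(DIM)) for i in range(DIM)]
    for i in range(DIM):
        for j in range(DIM):
            W[i][j] -= lr * g[i] * x_long[j]

before = loss(W)
for _ in range(200):
    train_step(W)
after = loss(W)  # should be well below `before`
```

Because only `W` receives gradient updates, the frozen map `D` plays the role of the pretrained diffusion model whose knowledge is preserved while the student learns to produce embeddings that are interchangeable with the teacher's from the diffusion model's point of view.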
| Original language | English |
|---|---|
| Pages (from-to) | 79-87 |
| Number of pages | 9 |
| Journal | Proceedings of the IEEE International Conference on Big Data and Smart Computing, BIGCOMP |
| Issue number | 2025 |
| DOIs | |
| State | Published - 2025 |
| Event | 2025 IEEE International Conference on Big Data and Smart Computing, BigComp 2025 - Kota Kinabalu, Malaysia. Duration: 9 Feb 2025 → 12 Feb 2025 |
Keywords
- Diffusion Models
- Text Semantic Compression
- Text-to-image Model