PIE Text Encoder: Padding Is Enough to generate Image in diffusion based Text-to-Image Model

Jong Hyun Han, Soo Hyun Lee, Jong Youl Park

Research output: Contribution to journal › Conference article › peer-review

Abstract

Although text-to-image models excel at creating realistic images from text, they struggle with long-form text due to the token limit of the pretrained text encoder. In this paper, we propose the Padding Is Enough (PIE) text encoder, which is trained to represent long-form text contexts with a single embedding. The long-form embedding is learned through knowledge distillation: the outputs of the PIE text encoder and the CLIP text encoder are each fed into a diffusion model, and the two outputs are aligned. Specifically, we do not train the diffusion model, only the text encoder, thereby preserving its extensive pretrained knowledge. Furthermore, the PIE text encoder can be used to extend text prompts in task-specific large pretrained diffusion models, enhancing the expressiveness of large pretrained models while reducing training costs. We demonstrate that our model serves effectively as a text-prompt extension without damaging the extensive knowledge of the pretrained diffusion models.
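The distillation idea described above can be illustrated with a toy sketch: a trainable student embedding (standing in for the PIE encoder's output) is pulled toward a frozen teacher embedding (standing in for the CLIP encoder's output) by gradient descent on an alignment loss, while the teacher receives no updates. Everything here, including the dimensionality, the bias-only "encoder", and the MSE loss, is an illustrative assumption, not the paper's actual architecture or objective.

```python
import random

random.seed(0)
DIM = 4  # illustrative embedding size, not the paper's

# Frozen teacher embedding (stand-in for the CLIP text-encoder output
# that conditions the frozen diffusion model).
teacher = [0.5, -1.0, 0.25, 2.0]

# Student parameters (stand-in for the trainable PIE encoder output),
# initialized randomly.
student = [random.uniform(-1, 1) for _ in range(DIM)]

def mse(a, b):
    """Mean-squared alignment loss between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Gradient descent on the student only; the teacher stays fixed,
# mirroring how the diffusion model and CLIP encoder are not trained.
lr = 0.1
for _ in range(200):
    grad = [2 * (s - t) / DIM for s, t in zip(student, teacher)]
    student = [s - lr * g for s, g in zip(student, grad)]

print(round(mse(student, teacher), 6))  # loss shrinks toward 0
```

The key design point mirrored here is that gradients flow only into the student's parameters, so the pretrained components are left untouched.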

Original language: English
Pages (from-to): 79-87
Number of pages: 9
Journal: Proceedings of the IEEE International Conference on Big Data and Smart Computing, BIGCOMP
Issue number: 2025
DOIs
State: Published - 2025
Event: 2025 IEEE International Conference on Big Data and Smart Computing, BigComp 2025 - Kota Kinabalu, Malaysia
Duration: 9 Feb 2025 - 12 Feb 2025

Keywords

  • Diffusion Models
  • Text Semantic Compression
  • Text-to-image Model

