GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation

Can Qin1,†  Ning Yu2  Chen Xing2  Shu Zhang2  Zeyuan Chen2 
Stefano Ermon3  Yun Fu1  Caiming Xiong2  Ran Xu2 
1Northeastern University, Boston, MA 
2Salesforce AI Research, Palo Alto, CA
3Stanford University, Palo Alto, CA
Work done while Can Qin was an intern at Salesforce AI Research. Primary Contact: 
The proposed GlueNet provides an adaptable portal for the Stable Diffusion model to accept multi-modal inputs, such as text or audio, (a) and (b), or hybrid text-audio signals, (c), for X-to-image generation.


Text-to-image (T2I) models based on diffusion processes have achieved remarkable success in controllable image generation from user-provided captions. However, the tight coupling between the text encoder and the image decoder in current T2I models makes replacing or upgrading either component challenging. Such changes often require massive fine-tuning or even training from scratch at prohibitive expense. To address this problem, we propose GlueGen, which applies a newly proposed GlueNet model to align features from single- or multi-modal encoders with the latent space of an existing T2I model. The approach introduces a new training objective that leverages parallel corpora to align the representation spaces of different encoders. Empirical results show that GlueNet can be trained efficiently and enables capabilities beyond previous state-of-the-art models: 1) multilingual language models such as XLM-Roberta can be aligned with existing T2I models, allowing high-quality image generation from captions beyond English; 2) GlueNet can align multi-modal encoders such as AudioCLIP with the Stable Diffusion model, enabling sound-to-image generation; 3) it can also upgrade the current text encoder of the latent diffusion model for challenging case generation. By aligning various feature representations, GlueNet allows flexible and efficient integration of new functionality into existing T2I models and sheds light on X-to-image (X2I) generation.
With the GlueNet model of the GlueGen framework, a pre-trained image generator (i.e., the UNet) can be bridged to off-the-shelf single- or multi-modal encoders to expand its functionality, e.g., multilingual or sound-to-image generation, within a limited budget. GlueNet is trained offline and requires neither back-propagation through the UNet nor image-text pairs for training, making GlueGen flexible and efficient to apply.
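The offline alignment described above can be sketched as follows. A small translator network (standing in for GlueNet) maps features from a new encoder into the feature space the frozen UNet already consumes, trained with a simple reconstruction loss on parallel pairs of the same caption encoded by both encoders. The architecture, dimensions (1024-d source, 77 x 768 target as in CLIP), and plain MSE objective are illustrative assumptions, not the paper's exact design; the random tensors stand in for real frozen-encoder outputs.

```python
import torch
import torch.nn as nn

class GlueNetSketch(nn.Module):
    """Illustrative translator: new-encoder features -> CLIP-like feature space."""
    def __init__(self, src_dim=1024, tgt_dim=768, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(src_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, tgt_dim),
        )

    def forward(self, x):  # x: (batch, seq_len, src_dim)
        return self.net(x)

def train_step(glue, opt, src_feats, tgt_feats):
    """One offline alignment step on a parallel pair: the same caption
    encoded by the new encoder (src) and by the original CLIP text
    encoder (tgt). No UNet and no images are involved."""
    opt.zero_grad()
    loss = nn.functional.mse_loss(glue(src_feats), tgt_feats)
    loss.backward()
    opt.step()
    return loss.item()

# Stand-in features; in practice these come from the frozen encoders.
torch.manual_seed(0)
src = torch.randn(4, 77, 1024)   # e.g. XLM-Roberta token features (assumed dim)
tgt = torch.randn(4, 77, 768)    # e.g. CLIP text-encoder features

glue = GlueNetSketch()
opt = torch.optim.Adam(glue.parameters(), lr=1e-3)
losses = [train_step(glue, opt, src, tgt) for _ in range(50)]
```

Because only the lightweight translator is optimized, training stays cheap; the UNet and both encoders remain frozen throughout.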

Sound and Sound/text-to-image Generation
(AudioCLIP + GlueNet + Stable Diffusion)

Beyond text signals, the proposed GlueNet also achieves sound-to-image generation, i.e., (a) and (b), and image generation from mixed sound-text signals, i.e., (c), by aligning the AudioCLIP audio encoder with the CLIP text encoder.
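One way to read the sound-text-mix setting, (c), is sketched below: a trained audio GlueNet first maps the AudioCLIP embedding into the CLIP text-feature space, and the aligned audio features are then blended with text features to form the conditioning tensor for the frozen Stable Diffusion UNet. The simple linear blend, the `hybrid_condition` helper, and the placeholder `nn.Linear` translator are all assumptions for illustration; the paper's exact fusion scheme may differ.

```python
import torch
import torch.nn as nn

def hybrid_condition(text_feats, audio_feats, glue_audio, alpha=0.5):
    """Blend GlueNet-aligned audio features with text features into the
    (batch, 77, 768) conditioning tensor the frozen UNet consumes.
    `alpha` weights the audio signal against the text signal."""
    audio_aligned = glue_audio(audio_feats)      # -> (batch, 77, 768)
    return alpha * audio_aligned + (1 - alpha) * text_feats

# Stand-ins: a placeholder translator and random encoder outputs.
glue_audio = nn.Linear(1024, 768)                # placeholder for a trained GlueNet
text_feats = torch.randn(1, 77, 768)             # CLIP text-encoder features
audio_feats = torch.randn(1, 77, 1024)           # AudioCLIP-derived features (assumed shape)
cond = hybrid_condition(text_feats, audio_feats, glue_audio)
print(cond.shape)  # torch.Size([1, 77, 768])
```

Setting `alpha=1.0` would recover pure sound-to-image conditioning as in (a) and (b), while `alpha=0.0` falls back to ordinary text conditioning.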

Monolingual Text-to-image Generation
(T5-3B + GlueNet + Latent Diffusion)

Monolingual text-to-image generation at resolution 256 $\times$ 256 with guidance weight 7.5 and 200 DDIM steps.

Multilingual Text-to-image Generation
(XLM-Roberta-L + GlueNet + Stable Diffusion)

Multilingual generation results at resolution 512 $\times$ 512 from XLM-Roberta + GlueNet + the SDM decoder (sd-v1-4) with the same caption, ``afternoon garden oil painting painted by impressionists''. With different GlueNets and a multilingual text encoder, the SDM decoder can support languages including Japanese, Italian, Chinese, French, and Spanish. The guidance weight is set to 7.5 and PLMS sampling uses 50 steps.

Hybrid multilingual generation at resolution 512 $\times$ 512. The input caption mixes three languages: Chinese, Japanese, and English. The caption of (a) is ``colorful, a cat painted by Picasso, sit on a table, is eating food'' and the caption of (b) is ``a white, sedan, crash into a building''. With our GlueNet inserted in front of the decoder, XLM-Roberta can guide the SDM decoder to generate reasonable results where the original SDM fails.


@article{qin2023gluegen,
        title={GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation},
        author={Qin, Can and Yu, Ning and Xing, Chen and Zhang, Shu and Chen, Zeyuan and Ermon, Stefano and Fu, Yun and Xiong, Caiming and Xu, Ran},
        journal={arXiv preprint arXiv:2303.10056},
        year={2023}
}