PMGAN: pretrained model-based generative adversarial network for text-to-image generation

Yue Yu; Yue Yang; Jingshuo Xing

doi:10.1007/s00371-024-03326-1

Yue Yu*, Yue Yang, Jingshuo Xing

*Corresponding author for this work

School of Computer Science and Technology

Beijing Institute of Technology

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Text-to-image generation is a challenging task. Although diffusion models can generate high-quality images of complex scenes, they sometimes suffer from a lack of realism. Additionally, there is often large variation among images generated from different texts with the same semantics. Furthermore, the generation of details is sometimes insufficient. Generative adversarial networks can generate realistic images that are consistent with the text descriptions, and they can generate content-consistent images. In this paper, we argue that generating images that are more consistent with the text descriptions is more important than generating higher-quality images. Therefore, this paper proposes the pretrained model-based generative adversarial network (PMGAN). PMGAN uses multiple pretrained models in both the generator and the discriminator. Specifically, in the generator, the deep attentional multimodal similarity model text encoder extracts word and sentence embeddings from the input text, and the contrastive language-image pre-training (CLIP) text encoder extracts initial image features from the input text. In the discriminator, a pretrained CLIP image encoder extracts image features from the input image. The CLIP encoder can map text and images into a common semantic space, which is beneficial for generating high-quality images. Experimental results show that, compared with state-of-the-art methods, PMGAN achieves better inception score and Fréchet inception distance and can produce higher-quality images while maintaining greater consistency with the text descriptions.
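
The two evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. The function names and the toy inputs are hypothetical, the class probabilities and feature vectors are supplied directly instead of coming from a pretrained Inception network, and the Fréchet distance is computed under a simplifying diagonal-covariance assumption, so it only approximates the full Fréchet inception distance.

```python
import math


def inception_score(cond_probs):
    """Inception score: exp of the mean KL(p(y|x) || p(y)) over images.

    cond_probs is a list of rows, each a class distribution p(y|x) for one
    generated image (assumed here to be given; normally they are predicted
    by a pretrained Inception classifier).
    """
    n = len(cond_probs)
    num_classes = len(cond_probs[0])
    # Marginal p(y): the average of the per-image class distributions.
    marginal = [sum(row[k] for row in cond_probs) / n for k in range(num_classes)]
    # Terms with p(y|x) = 0 contribute nothing to the KL divergence.
    mean_kl = sum(
        p * math.log(p / marginal[k])
        for row in cond_probs
        for k, p in enumerate(row)
        if p > 0
    ) / n
    return math.exp(mean_kl)


def _mean_and_std(features):
    """Per-dimension mean and population standard deviation."""
    n = len(features)
    dims = len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in features) / n)
        for d in range(dims)
    ]
    return means, stds


def frechet_distance_diagonal(real_features, fake_features):
    """Squared Fréchet distance between two Gaussians with DIAGONAL covariance.

    With diagonal covariances the trace term reduces to the sum of squared
    differences of standard deviations. The full metric uses full covariance
    matrices and a matrix square root, which this sketch deliberately omits.
    """
    mu_r, sd_r = _mean_and_std(real_features)
    mu_f, sd_f = _mean_and_std(fake_features)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_f))
    cov_term = sum((a - b) ** 2 for a, b in zip(sd_r, sd_f))
    return mean_term + cov_term
```

A higher inception score indicates confident and varied class predictions, while a lower Fréchet distance indicates that generated-image features are statistically closer to real-image features.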

Original language	English
Journal	Visual Computer
DOIs	https://doi.org/10.1007/s00371-024-03326-1
Publication status	Accepted/In press - 2024

Keywords

Feature extraction
Generative adversarial network
Pretrained model
Text-to-image generation

Cite this

@article{805d02e79ee2462f9121aa482a5b54a6,

title = "{PMGAN}: pretrained model-based generative adversarial network for text-to-image generation",

abstract = "Text-to-image generation is a challenging task. Although diffusion models can generate high-quality images of complex scenes, they sometimes suffer from a lack of realism. Additionally, there is often large variation among images generated from different texts with the same semantics. Furthermore, the generation of details is sometimes insufficient. Generative adversarial networks can generate realistic images that are consistent with the text descriptions, and they can generate content-consistent images. In this paper, we argue that generating images that are more consistent with the text descriptions is more important than generating higher-quality images. Therefore, this paper proposes the pretrained model-based generative adversarial network (PMGAN). PMGAN uses multiple pretrained models in both the generator and the discriminator. Specifically, in the generator, the deep attentional multimodal similarity model text encoder extracts word and sentence embeddings from the input text, and the contrastive language-image pre-training (CLIP) text encoder extracts initial image features from the input text. In the discriminator, a pretrained CLIP image encoder extracts image features from the input image. The CLIP encoder can map text and images into a common semantic space, which is beneficial for generating high-quality images. Experimental results show that, compared with state-of-the-art methods, PMGAN achieves better inception score and Fr{\'e}chet inception distance and can produce higher-quality images while maintaining greater consistency with the text descriptions.",

keywords = "Feature extraction, Generative adversarial network, Pretrained model, Text-to-image generation",

author = "Yue Yu and Yue Yang and Jingshuo Xing",

note = "Publisher Copyright: {\textcopyright} The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024.",

year = "2024",

doi = "10.1007/s00371-024-03326-1",

language = "English",

journal = "Visual Computer",

issn = "0178-2789",

publisher = "Springer Verlag",

}

TY - JOUR

T1 - PMGAN: pretrained model-based generative adversarial network for text-to-image generation

AU - Yu, Yue

AU - Yang, Yue

AU - Xing, Jingshuo

N1 - Publisher Copyright: © The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024.

PY - 2024

Y1 - 2024

AB - Text-to-image generation is a challenging task. Although diffusion models can generate high-quality images of complex scenes, they sometimes suffer from a lack of realism. Additionally, there is often large variation among images generated from different texts with the same semantics. Furthermore, the generation of details is sometimes insufficient. Generative adversarial networks can generate realistic images that are consistent with the text descriptions, and they can generate content-consistent images. In this paper, we argue that generating images that are more consistent with the text descriptions is more important than generating higher-quality images. Therefore, this paper proposes the pretrained model-based generative adversarial network (PMGAN). PMGAN uses multiple pretrained models in both the generator and the discriminator. Specifically, in the generator, the deep attentional multimodal similarity model text encoder extracts word and sentence embeddings from the input text, and the contrastive language-image pre-training (CLIP) text encoder extracts initial image features from the input text. In the discriminator, a pretrained CLIP image encoder extracts image features from the input image. The CLIP encoder can map text and images into a common semantic space, which is beneficial for generating high-quality images. Experimental results show that, compared with state-of-the-art methods, PMGAN achieves better inception score and Fréchet inception distance and can produce higher-quality images while maintaining greater consistency with the text descriptions.

KW - Feature extraction

KW - Generative adversarial network

KW - Pretrained model

KW - Text-to-image generation

UR - http://www.scopus.com/inward/record.url?scp=85188959220&partnerID=8YFLogxK

U2 - 10.1007/s00371-024-03326-1

DO - 10.1007/s00371-024-03326-1

M3 - Article

AN - SCOPUS:85188959220

SN - 0178-2789

JO - Visual Computer

JF - Visual Computer

ER -
