Exploiting Diffusion Model as Prompt Generator for Object Localization

Yuqi Jiang, Qiankun Liu, Yichen Li, Hao Jia, Ying Fu*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Recently, diffusion models have shown unprecedented power in text-to-image generation. The intermediate features in well-trained text-to-image diffusion models have been shown to contain basic semantic and layout information of the synthesized image. Based on these findings, we present a Diffusion-model-based Prompt generator for Object Localization, named DPOL. By providing proper text guidance to DPOL, the corresponding object in an image can be localized in two steps: (1) Prompt generation. Conditioned on the text guidance, the image is first inverted into its corresponding latent code and then reconstructed by the diffusion model. The attention maps produced by the diffusion model are used as the location prompt, which contains coarse position information for the objects of interest; (2) Location refinement. The Segment Anything Model (SAM) is used to obtain a more accurate position based on the location prompt, which is transformed into a format (specifically, a box) that is compatible with SAM. Extensive experiments show that DPOL achieves performance comparable to existing open-vocabulary localization methods, even though DPOL requires neither training nor fine-tuning.
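The hand-off between the two steps, turning a coarse attention map into a box prompt, can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names `attention_to_box` and `scale_box`, the min-max normalization, the relative threshold of 0.5, and the assumption that the attention map has a lower resolution than the image are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Tuple

# (x_min, y_min, x_max, y_max)
Box = Tuple[int, int, int, int]


def attention_to_box(attn: List[List[float]],
                     rel_threshold: float = 0.5) -> Optional[Box]:
    """Return the tightest box around high-attention cells, in map coordinates.

    Hypothetical helper: the normalization and the fixed relative threshold
    are assumptions, not the procedure described in the paper.
    """
    values = [v for row in attn for v in row]
    if not values:
        return None
    lo, hi = min(values), max(values)
    if hi == lo:
        return None  # a flat map carries no location signal
    cutoff = lo + rel_threshold * (hi - lo)

    xs: List[int] = []
    ys: List[int] = []
    for y, row in enumerate(attn):
        for x, v in enumerate(row):
            if v >= cutoff:
                xs.append(x)
                ys.append(y)
    # Inclusive cell indices of the selected region.
    return (min(xs), min(ys), max(xs), max(ys))


def scale_box(box: Box,
              map_size: Tuple[int, int],
              image_size: Tuple[int, int]) -> Box:
    """Map a box of inclusive cell indices to pixel coordinates.

    Sizes are (width, height). Each map cell is assumed to cover an
    equally sized patch of the image.
    """
    sx = image_size[0] / map_size[0]
    sy = image_size[1] / map_size[1]
    x0, y0, x1, y1 = box
    # The +1 extends the box to the far edge of the last selected cell.
    return (round(x0 * sx), round(y0 * sy),
            round((x1 + 1) * sx), round((y1 + 1) * sy))


if __name__ == "__main__":
    toy_map = [
        [0.0, 0.1, 0.1, 0.0],
        [0.0, 0.1, 0.9, 0.8],
        [0.1, 0.2, 0.7, 1.0],
        [0.0, 0.0, 0.1, 0.2],
    ]
    coarse = attention_to_box(toy_map)
    if coarse is not None:
        print(scale_box(coarse, map_size=(4, 4), image_size=(512, 512)))
```

In a pipeline of this kind, the scaled box would then be passed to SAM as a box prompt for refinement. A single global threshold is the simplest choice; it will merge multiple instances of the same object into one box, which a real system would likely need to handle separately.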

Original language: English
Title of host publication: Digital Multimedia Communications - 20th International Forum on Digital TV and Wireless Multimedia Communications, IFTC 2023, Revised Selected Papers
Editors: Guangtao Zhai, Jun Zhou, Hua Yang, Long Ye, Ping An, Xiaokang Yang
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 284-296
Number of pages: 13
ISBN (Print): 9789819736256
DOIs
Publication status: Published - 2024
Event: 20th International Forum on Digital TV and Wireless Multimedia Communications, IFTC 2023 - Beijing, China
Duration: 21 Dec 2023 - 22 Dec 2023

Publication series

Name: Communications in Computer and Information Science
Volume: 2067 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 20th International Forum on Digital TV and Wireless Multimedia Communications, IFTC 2023
Country/Territory: China
City: Beijing
Period: 21/12/23 - 22/12/23

Keywords

  • Diffusion models
  • Object localization
  • Prompt generation
