TY - GEN
T1 - Exploiting Diffusion Model as Prompt Generator for Object Localization
AU - Jiang, Yuqi
AU - Liu, Qiankun
AU - Li, Yichen
AU - Jia, Hao
AU - Fu, Ying
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024.
PY - 2024
Y1 - 2024
N2 - Recently, diffusion models have shown unprecedented power in text-to-image generation. The intermediate features of well-trained text-to-image diffusion models have been proven to contain basic semantic and layout information about the synthesized image. Based on these findings, we present a Diffusion-model-based Prompt generator for Object Localization, named DPOL. Given proper text guidance, DPOL localizes the corresponding object in an image in two steps: (1) Prompt generation. Conditioned on the text guidance, the image is first inverted into its corresponding latent code and then reconstructed by the diffusion model. The attention maps produced by the diffusion model are used as the location prompt, which contains coarse position information about the objects of interest; (2) Location refinement. The Segment Anything Model (SAM) is used to obtain a more accurate position based on the location prompt, which is first transformed into a SAM-compatible format (specifically, a bounding box). Extensive experiments show that DPOL achieves performance comparable with existing open-vocabulary localization methods, even though it requires neither training nor fine-tuning.
AB - Recently, diffusion models have shown unprecedented power in text-to-image generation. The intermediate features of well-trained text-to-image diffusion models have been proven to contain basic semantic and layout information about the synthesized image. Based on these findings, we present a Diffusion-model-based Prompt generator for Object Localization, named DPOL. Given proper text guidance, DPOL localizes the corresponding object in an image in two steps: (1) Prompt generation. Conditioned on the text guidance, the image is first inverted into its corresponding latent code and then reconstructed by the diffusion model. The attention maps produced by the diffusion model are used as the location prompt, which contains coarse position information about the objects of interest; (2) Location refinement. The Segment Anything Model (SAM) is used to obtain a more accurate position based on the location prompt, which is first transformed into a SAM-compatible format (specifically, a bounding box). Extensive experiments show that DPOL achieves performance comparable with existing open-vocabulary localization methods, even though it requires neither training nor fine-tuning.
KW - Diffusion models
KW - Object localization
KW - Prompt generation
UR - http://www.scopus.com/inward/record.url?scp=85200471260&partnerID=8YFLogxK
U2 - 10.1007/978-981-97-3626-3_21
DO - 10.1007/978-981-97-3626-3_21
M3 - Conference contribution
AN - SCOPUS:85200471260
SN - 9789819736256
T3 - Communications in Computer and Information Science
SP - 284
EP - 296
BT - Digital Multimedia Communications - 20th International Forum on Digital TV and Wireless Multimedia Communications, IFTC 2023, Revised Selected Papers
A2 - Zhai, Guangtao
A2 - Zhou, Jun
A2 - Yang, Hua
A2 - Ye, Long
A2 - An, Ping
A2 - Yang, Xiaokang
PB - Springer Science and Business Media Deutschland GmbH
T2 - 20th International Forum on Digital TV and Wireless Multimedia Communications, IFTC 2023
Y2 - 21 December 2023 through 22 December 2023
ER -