Image-free multi-label image recognition via LLM-powered hierarchical prompt tuning

  • Shuo Yang*
  • Zirui Shang
  • Yongqi Wang
  • Derong Deng
  • Hongwei Chen
  • Xinxiao Wu
  • Qiyuan Cheng
  • *Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

This paper proposes a novel image-free framework for multi-label image recognition, i.e., one that requires no training images. It uses the knowledge of a pre-trained Large Language Model (LLM) to learn prompts that adapt a pre-trained Vision-Language Model (VLM), such as Contrastive Language–Image Pre-training (CLIP), to multi-label classification. By asking the LLM well-designed questions, we acquire comprehensive knowledge about the characteristics and contexts of objects, which provides valuable text descriptions for learning prompts. We then propose a hierarchical prompt learning method that takes multi-label dependency into account: a subset of category-specific prompt tokens is shared when the corresponding objects exhibit similar attributes or are likely to co-occur. Because CLIP aligns visual and linguistic semantics well, the hierarchical prompts learned from text descriptions can be applied to classify images at inference time. Our framework presents a new way to explore the synergies between multiple pre-trained models for novel category recognition. Extensive experiments on three public datasets, i.e., Microsoft Common Objects in Context (MS-COCO), Visual Object Classes 2007 (VOC2007), and National University of Singapore Web Image Database (NUS-WIDE), demonstrate that our method achieves better results than state-of-the-art methods.
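
The inference idea in the abstract (class prompts that share some tokens, scored against an image embedding) can be illustrated with a toy sketch. This is not the authors' implementation: the vectors are made up, mean pooling stands in for CLIP's text encoder, and the names `build_prompt`, `predict_labels`, and the 0.5 threshold are all hypothetical choices made for illustration only.

```python
import math

def mean_pool(tokens):
    """Average equal-length token vectors (a stand-in for a real text encoder)."""
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def build_prompt(shared_tokens, specific_tokens):
    """Hypothetical hierarchical prompt: group-shared tokens + class-specific tokens."""
    return shared_tokens + specific_tokens

def predict_labels(image_emb, class_prompts, threshold=0.5):
    """Keep every class whose prompt embedding is similar enough to the image."""
    scores = {
        name: cosine(image_emb, mean_pool(tokens))
        for name, tokens in class_prompts.items()
    }
    return {name: s for name, s in scores.items() if s >= threshold}

# Assumption: "cat" and "dog" share one token because they have similar attributes.
shared_animal = [[1.0, 0.0, 0.0, 0.0]]
shared_vehicle = [[0.0, 0.0, 0.0, 1.0]]

class_prompts = {
    "cat": build_prompt(shared_animal, [[0.0, 1.0, 0.0, 0.0]]),
    "dog": build_prompt(shared_animal, [[0.0, 0.0, 1.0, 0.0]]),
    "car": build_prompt(shared_vehicle, [[0.0, 0.0, 0.0, 1.0]]),
}

# Made-up image embedding that lies close to the "cat" prompt.
image_emb = [1.0, 1.0, 0.2, 0.0]
predicted = predict_labels(image_emb, class_prompts)
```

In this toy setup the shared token pulls the "dog" score up alongside "cat", which loosely mirrors how shared prompt tokens could encode label dependency; a real system would learn the tokens from LLM-generated text and use CLIP's encoders for both modalities.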

Original language: English
Article number: 112986
Journal: Pattern Recognition
Volume: 174
DOIs
Publication status: Published - Jun 2026
Externally published: Yes

Keywords

  • Hierarchical prompt tuning
  • Image-free
  • LLM
  • Multi-label image recognition
