TY - JOUR
T1 - Image-free multi-label image recognition via LLM-powered hierarchical prompt tuning
AU - Yang, Shuo
AU - Shang, Zirui
AU - Wang, Yongqi
AU - Deng, Derong
AU - Chen, Hongwei
AU - Wu, Xinxiao
AU - Cheng, Qiyuan
N1 - Publisher Copyright:
© 2025 Elsevier Ltd
PY - 2026/6
Y1 - 2026/6
N2 - This paper proposes a novel framework for multi-label image recognition without any training images, namely an image-free framework, which uses the knowledge of a pre-trained Large Language Model (LLM) to learn prompts that adapt a pre-trained Vision-Language Model (VLM), such as Contrastive Language–Image Pre-training (CLIP), to multi-label classification. By asking the LLM well-designed questions, we acquire comprehensive knowledge about the characteristics and contexts of objects, which provides valuable text descriptions for learning prompts. We then propose a hierarchical prompt learning method that takes multi-label dependency into consideration, wherein a subset of category-specific prompt tokens is shared when the corresponding objects exhibit similar attributes or are more likely to co-occur. Benefiting from the remarkable alignment between the visual and linguistic semantics of CLIP, the hierarchical prompts learned from text descriptions are applied to classify images during inference. Our framework presents a new way to explore the synergies between multiple pre-trained models for novel category recognition. Extensive experiments on three public datasets, i.e., Microsoft Common Objects in Context (MS-COCO), Visual Object Classes 2007 (VOC2007), and the National University of Singapore Web Image Database (NUS-WIDE), demonstrate that our method achieves better results than state-of-the-art methods.
AB - This paper proposes a novel framework for multi-label image recognition without any training images, namely an image-free framework, which uses the knowledge of a pre-trained Large Language Model (LLM) to learn prompts that adapt a pre-trained Vision-Language Model (VLM), such as Contrastive Language–Image Pre-training (CLIP), to multi-label classification. By asking the LLM well-designed questions, we acquire comprehensive knowledge about the characteristics and contexts of objects, which provides valuable text descriptions for learning prompts. We then propose a hierarchical prompt learning method that takes multi-label dependency into consideration, wherein a subset of category-specific prompt tokens is shared when the corresponding objects exhibit similar attributes or are more likely to co-occur. Benefiting from the remarkable alignment between the visual and linguistic semantics of CLIP, the hierarchical prompts learned from text descriptions are applied to classify images during inference. Our framework presents a new way to explore the synergies between multiple pre-trained models for novel category recognition. Extensive experiments on three public datasets, i.e., Microsoft Common Objects in Context (MS-COCO), Visual Object Classes 2007 (VOC2007), and the National University of Singapore Web Image Database (NUS-WIDE), demonstrate that our method achieves better results than state-of-the-art methods.
KW - Hierarchical prompt tuning
KW - Image-free
KW - LLM
KW - Multi-label image recognition
UR - https://www.scopus.com/pages/publications/105027328868
U2 - 10.1016/j.patcog.2025.112986
DO - 10.1016/j.patcog.2025.112986
M3 - Article
AN - SCOPUS:105027328868
SN - 0031-3203
VL - 174
JO - Pattern Recognition
JF - Pattern Recognition
M1 - 112986
ER -