TY - JOUR
T1 - DM-PCL
T2 - Text-Driven Dual-Modal Prototype Consistency Learning for Weakly-Supervised Few-Shot Part Segmentation
AU - Han, Mengya
AU - Luo, Yong
AU - Hu, Han
AU - Wang, Zengmao
AU - Zhang, Lefei
AU - Du, Bo
AU - Duan, Ling-Yu
AU - Tao, Dacheng
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025.
PY - 2025/11
Y1 - 2025/11
N2 - Few-shot part segmentation is essential for fine-grained visual understanding, but it remains challenging in the absence of pixel-level annotations. This motivates us to introduce a more practical task setting, weakly-supervised few-shot part segmentation, where only part-level textual labels (e.g., textual part descriptions) are provided for support images. This setting is quite challenging due to the semantic-visual gap and the lack of pixel-level supervision. To address this challenge, we propose text-driven dual-modal prototype consistency learning (DM-PCL), which predicts pseudo masks for both support and query images using part-level textual labels and learns consistent part prototypes across diverse images and modalities to facilitate accurate part segmentation. Specifically, DM-PCL introduces: (i) a pseudo mask generation (PMG) module, which generates pseudo masks by comparing image features with textual part prototypes derived from part-level textual labels; (ii) a text-driven spatial interaction (TSI) module that enriches visual features with semantic knowledge to enhance part perception; and (iii) a dual-modal prototype consistency learning (DPCL) module that enforces consistency between part prototypes across different images and modalities. Final segmentation is performed by comparing query features with both visual and textual part prototypes via a dual-modal cooperative segmentation strategy. Extensive experiments on benchmark datasets demonstrate that our method significantly outperforms existing approaches, achieving state-of-the-art performance in weakly-supervised few-shot part segmentation.
KW - Dual-modal cooperative segmentation
KW - Dual-modal prototype consistency learning
KW - Few-shot part segmentation
KW - Text-driven spatial interaction
KW - Weakly-supervised
UR - https://www.scopus.com/pages/publications/105012626311
DO - 10.1007/s11263-025-02545-w
M3 - Article
AN - SCOPUS:105012626311
SN - 0920-5691
VL - 133
SP - 7553
EP - 7569
JO - International Journal of Computer Vision
JF - International Journal of Computer Vision
IS - 11
ER -