Abstract
Co-segmentation identifies and segments common objects across an image set, mimicking the human visual system. Existing co-segmentation methods rely solely on mining the finite visual consensus within an image set and lack semantic transfer capability, which limits their generalization to unseen categories. To bridge this gap, we introduce a self-supervised zero-shot co-segmentation framework (SZCo) that transforms the implicit visual consensus into an explicit textual signal for zero-shot semantic transfer. This mechanism overcomes the constraints of limited visual patterns by leveraging the extensibility of textual concepts. Specifically, we first infer the common textual representation of the image set using the contrastive language-image pre-training (CLIP) framework and compute a correlation map between each feature map and the corresponding text embedding. We then propose an iterative region filter (IRF) to iteratively align the common object region, and introduce three region-text alignment learning methods: text-based, region-based, and globality-based. Moreover, we introduce an asynchronous pseudo-label update scheme and leverage the Segment Anything Model (SAM), a segmentation foundation model, for further refinement. Experimental results on ten datasets demonstrate that our approach outperforms existing state-of-the-art methods.
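As a minimal sketch of the correlation-map step described in the abstract, the snippet below computes a cosine-similarity map between a dense visual feature map and a text embedding. The function name, tensor shapes, and the use of PyTorch are illustrative assumptions; the abstract does not specify how SZCo extracts dense features from CLIP.

```python
import torch
import torch.nn.functional as F

def correlation_map(feature_map: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity correlation map between a dense image feature map
    and a text embedding (hypothetical shapes; the paper's exact dense
    feature extraction from CLIP is not given in the abstract).

    feature_map: (D, H, W) dense visual features, e.g. from a CLIP backbone.
    text_emb:    (D,) embedding of the inferred common textual concept.
    Returns:     (H, W) map; higher values indicate stronger region-text correlation.
    """
    d, h, w = feature_map.shape
    feats = F.normalize(feature_map.flatten(1), dim=0)  # (D, H*W), unit norm per location
    text = F.normalize(text_emb, dim=0)                 # (D,), unit norm
    corr = text @ feats                                 # (H*W,) cosine similarities
    return corr.view(h, w)

# Toy usage with random tensors standing in for CLIP features.
fmap = torch.randn(512, 14, 14)  # e.g. a ViT-B/16 patch grid
temb = torch.randn(512)
cmap = correlation_map(fmap, temb)
print(cmap.shape)  # torch.Size([14, 14])
```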
| Original language | English |
|---|---|
| Article number | 113308 |
| Journal | Pattern Recognition |
| Volume | 177 |
| DOIs | |
| Publication status | Published - Sept 2026 |
Keywords
- CLIP
- Co-segmentation
- Common semantic
- Region-text alignment
- Self-supervised learning