TY - JOUR
T1 - Cross-Image Pixel Contrasting for Semantic Segmentation
AU - Zhou, Tianfei
AU - Wang, Wenguan
N1 - Publisher Copyright:
IEEE
PY - 2024
Y1 - 2024
N2 - This work studies the problem of image semantic segmentation. Current approaches focus mainly on mining “local” context, i.e., dependencies between pixels within individual images, via specially designed context aggregation modules (e.g., dilated convolution, neural attention) or structure-aware optimization objectives (e.g., IoU-like loss). However, they ignore the “global” context of the training data, i.e., the rich semantic relations between pixels across different images. Inspired by recent advances in unsupervised contrastive representation learning, we propose a pixel-wise contrastive algorithm, dubbed PiCo, for semantic segmentation in the fully supervised setting. The core idea is to enforce pixel embeddings belonging to the same semantic class to be more similar than embeddings from different classes. This raises a pixel-wise metric learning paradigm for semantic segmentation by explicitly exploring the structures of labeled pixels, which had rarely been studied before. Our training algorithm is compatible with modern segmentation solutions and adds no extra overhead during testing. We experimentally show that, with well-known segmentation models (i.e., DeepLabV3, HRNet, OCRNet, SegFormer, Segmenter, MaskFormer) and backbones (i.e., MobileNet, ResNet, HRNet, MiT, ViT), our algorithm brings consistent performance improvements across diverse datasets (i.e., Cityscapes, ADE20K, PASCAL-Context, COCO-Stuff, CamVid). We expect that this work will encourage our community to rethink the current de facto training paradigm in semantic segmentation. Our code is available at https://github.com/tfzhou/ContrastiveSeg.
KW - Contrastive Learning
KW - Cross-Image Context
KW - Image segmentation
KW - Measurement
KW - Metric Learning
KW - Self-supervised learning
KW - Semantic Segmentation
KW - Semantics
KW - Task analysis
KW - Training
UR - http://www.scopus.com/inward/record.url?scp=85186089686&partnerID=8YFLogxK
U2 - 10.1109/TPAMI.2024.3367952
DO - 10.1109/TPAMI.2024.3367952
M3 - Article
AN - SCOPUS:85186089686
SN - 0162-8828
SP - 1
EP - 15
JO - IEEE Transactions on Pattern Analysis and Machine Intelligence
JF - IEEE Transactions on Pattern Analysis and Machine Intelligence
ER -