Consistency Regularization Based on Masked Image Modeling for Semisupervised Remote Sensing Semantic Segmentation

Miaoxin Cai; He Chen; Tong Zhang; Yin Zhuang; Liang Chen

doi:10.1109/JSTARS.2024.3435509

Consistency Regularization Based on Masked Image Modeling for Semisupervised Remote Sensing Semantic Segmentation

Miaoxin Cai, He Chen^*, Tong Zhang, Yin Zhuang, Liang Chen^*

^*Corresponding author for this work

School of Information and Electronics

Beijing Institute of Technology

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Semisupervised semantic segmentation aims to effectively leverage both unlabeled and scare labeled images, reducing the reliance on labor-intensive pixel-level labeling for extensive training processes. The leading semisupervised learning method, consistency regularization, employs weak and strong data augmentations to diversify input representations. Ultimately the model is compelled to maintain consistent predictions across different input views, thus boosting the model's generalization. However, previous methods suffered from limited input representation space introduced by linear transformations such as cutmix. To address such issue, a consistency regularization based on masked image modeling (MIM) called MIMSeg is proposed to achieve accurate segmentation with limited labeled images. First, MIM pixel-wise perception with ViT encoder-decoder lays the foundation for expanding the data representation space. Second, collaborating with weak data augmentations, two MIM-related strong data augmentations are developed to generate more challenging input views for consistent predictions. Precisely, weak data augmentations are employed to replicate input views from various perspectives while a controllable generative strong data augmentation called masked image reconstruction (MIR) is crafted to simulate multiple imaging diversity while preserving the original semantic information intact. In addition, a more severe strong data augmentation masked context perturbation (MCP) is designed to further generate more challenging input views and alleviate semantic deficiency via masked category prediction. Leveraging the MIM perception and two MIM-related strong data augmentations, the model is compelled to achieve consistency predictions across diverse input views from weak data augmentations, MIR and MCP. These components result in the generation of more stable pixel-level pseudo-labels and facilitate collaborative training between unlabeled and labeled images. Extensive experiments have shown that MIMSeg can achieve state-of-the-art performance in pixel-level prediction with very limited sample annotations.

Original language	English
Pages (from-to)	17442-17460
Number of pages	19
Journal	IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Volume	17
DOIs	https://doi.org/10.1109/JSTARS.2024.3435509
Publication status	Published - 2024

Keywords

Consistency regularization
masked image modeling (MIM)
semisupervised semantic segmentation

Access to Document

10.1109/JSTARS.2024.3435509

Cite this

Cai, M., Chen, H., Zhang, T., Zhuang, Y., & Chen, L. (2024). Consistency Regularization Based on Masked Image Modeling for Semisupervised Remote Sensing Semantic Segmentation. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 17, 17442-17460. https://doi.org/10.1109/JSTARS.2024.3435509

@article{0e16bef45b7646679ef45ce52aef2b77,

title = "Consistency Regularization Based on Masked Image Modeling for Semisupervised Remote Sensing Semantic Segmentation",

abstract = "Semisupervised semantic segmentation aims to effectively leverage both unlabeled and scare labeled images, reducing the reliance on labor-intensive pixel-level labeling for extensive training processes. The leading semisupervised learning method, consistency regularization, employs weak and strong data augmentations to diversify input representations. Ultimately the model is compelled to maintain consistent predictions across different input views, thus boosting the model's generalization. However, previous methods suffered from limited input representation space introduced by linear transformations such as cutmix. To address such issue, a consistency regularization based on masked image modeling (MIM) called MIMSeg is proposed to achieve accurate segmentation with limited labeled images. First, MIM pixel-wise perception with ViT encoder-decoder lays the foundation for expanding the data representation space. Second, collaborating with weak data augmentations, two MIM-related strong data augmentations are developed to generate more challenging input views for consistent predictions. Precisely, weak data augmentations are employed to replicate input views from various perspectives while a controllable generative strong data augmentation called masked image reconstruction (MIR) is crafted to simulate multiple imaging diversity while preserving the original semantic information intact. In addition, a more severe strong data augmentation masked context perturbation (MCP) is designed to further generate more challenging input views and alleviate semantic deficiency via masked category prediction. Leveraging the MIM perception and two MIM-related strong data augmentations, the model is compelled to achieve consistency predictions across diverse input views from weak data augmentations, MIR and MCP. These components result in the generation of more stable pixel-level pseudo-labels and facilitate collaborative training between unlabeled and labeled images. Extensive experiments have shown that MIMSeg can achieve state-of-the-art performance in pixel-level prediction with very limited sample annotations.",

keywords = "Consistency regularization, masked image modeling (MIM), semisupervised semantic segmentation",

author = "Miaoxin Cai and He Chen and Tong Zhang and Yin Zhuang and Liang Chen",

note = "Publisher Copyright: {\textcopyright} 2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.",

year = "2024",

doi = "10.1109/JSTARS.2024.3435509",

language = "English",

volume = "17",

pages = "17442--17460",

journal = "IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing",

issn = "1939-1404",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Consistency Regularization Based on Masked Image Modeling for Semisupervised Remote Sensing Semantic Segmentation

AU - Cai, Miaoxin

AU - Chen, He

AU - Zhang, Tong

AU - Zhuang, Yin

AU - Chen, Liang

PY - 2024

Y1 - 2024

N2 - Semisupervised semantic segmentation aims to effectively leverage both unlabeled and scare labeled images, reducing the reliance on labor-intensive pixel-level labeling for extensive training processes. The leading semisupervised learning method, consistency regularization, employs weak and strong data augmentations to diversify input representations. Ultimately the model is compelled to maintain consistent predictions across different input views, thus boosting the model's generalization. However, previous methods suffered from limited input representation space introduced by linear transformations such as cutmix. To address such issue, a consistency regularization based on masked image modeling (MIM) called MIMSeg is proposed to achieve accurate segmentation with limited labeled images. First, MIM pixel-wise perception with ViT encoder-decoder lays the foundation for expanding the data representation space. Second, collaborating with weak data augmentations, two MIM-related strong data augmentations are developed to generate more challenging input views for consistent predictions. Precisely, weak data augmentations are employed to replicate input views from various perspectives while a controllable generative strong data augmentation called masked image reconstruction (MIR) is crafted to simulate multiple imaging diversity while preserving the original semantic information intact. In addition, a more severe strong data augmentation masked context perturbation (MCP) is designed to further generate more challenging input views and alleviate semantic deficiency via masked category prediction. Leveraging the MIM perception and two MIM-related strong data augmentations, the model is compelled to achieve consistency predictions across diverse input views from weak data augmentations, MIR and MCP. These components result in the generation of more stable pixel-level pseudo-labels and facilitate collaborative training between unlabeled and labeled images. Extensive experiments have shown that MIMSeg can achieve state-of-the-art performance in pixel-level prediction with very limited sample annotations.

AB - Semisupervised semantic segmentation aims to effectively leverage both unlabeled and scare labeled images, reducing the reliance on labor-intensive pixel-level labeling for extensive training processes. The leading semisupervised learning method, consistency regularization, employs weak and strong data augmentations to diversify input representations. Ultimately the model is compelled to maintain consistent predictions across different input views, thus boosting the model's generalization. However, previous methods suffered from limited input representation space introduced by linear transformations such as cutmix. To address such issue, a consistency regularization based on masked image modeling (MIM) called MIMSeg is proposed to achieve accurate segmentation with limited labeled images. First, MIM pixel-wise perception with ViT encoder-decoder lays the foundation for expanding the data representation space. Second, collaborating with weak data augmentations, two MIM-related strong data augmentations are developed to generate more challenging input views for consistent predictions. Precisely, weak data augmentations are employed to replicate input views from various perspectives while a controllable generative strong data augmentation called masked image reconstruction (MIR) is crafted to simulate multiple imaging diversity while preserving the original semantic information intact. In addition, a more severe strong data augmentation masked context perturbation (MCP) is designed to further generate more challenging input views and alleviate semantic deficiency via masked category prediction. Leveraging the MIM perception and two MIM-related strong data augmentations, the model is compelled to achieve consistency predictions across diverse input views from weak data augmentations, MIR and MCP. These components result in the generation of more stable pixel-level pseudo-labels and facilitate collaborative training between unlabeled and labeled images. Extensive experiments have shown that MIMSeg can achieve state-of-the-art performance in pixel-level prediction with very limited sample annotations.

KW - Consistency regularization

KW - masked image modeling (MIM)

KW - semisupervised semantic segmentation

UR - http://www.scopus.com/inward/record.url?scp=85200803682&partnerID=8YFLogxK

U2 - 10.1109/JSTARS.2024.3435509

DO - 10.1109/JSTARS.2024.3435509

M3 - Article

AN - SCOPUS:85200803682

SN - 1939-1404

VL - 17

SP - 17442

EP - 17460

JO - IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

JF - IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

ER -

Consistency Regularization Based on Masked Image Modeling for Semisupervised Remote Sensing Semantic Segmentation

Abstract

Keywords

Access to Document

Other files and links

Fingerprint

Cite this