TY - GEN
T1 - CNN-Transformer Collaborative Learning for Remote Sensing Semantic Segmentation
AU - Guo, Huazhe
AU - Dong, Shan
AU - Li, Shangyu
AU - Zhuang, Yin
AU - Chen, He
AU - Qi, Baogui
N1 - Publisher Copyright: © 2026 SPIE.
PY - 2026/3/4
Y1 - 2026/3/4
N2 - Land cover classification is an important task in remote sensing image analysis. However, it remains challenging because land cover has complex spatial patterns and large variations in object scale and texture. Current methods for land cover classification primarily rely on convolutional neural networks (CNNs) and self-attention mechanisms (Transformers) to extract features. CNN-based methods have strong inductive biases and are good at extracting local features and texture information. However, their limited receptive fields and the lack of long-range dependencies can cause inaccurate boundaries and blurred edges. Transformer-based methods use global self-attention to capture long-range semantic relationships, but their lack of local inductive bias may lead to missing details. Some CNN-Transformer feature-fusion methods have been proposed to combine global and local information and achieve better segmentation, but these combinations introduce more parameters and increase computational cost. To overcome these limitations, we propose a CNN-Transformer Collaborative Learning Model (CTCLM). The model enables two-way knowledge transfer and joint learning between the CNN and Transformer branches through bidirectional distillation. This can avoid the limitation of traditional one-way distillation from CNN to Transformer that depends on a fixed teacher model. During inference, only the Transformer branch is used for prediction, improving efficiency by eliminating the need for a dual-branch CNN-Transformer framework. In addition, CTCLM uses a parallel multi-scale convolutional encoder to strengthen multi-scale feature extraction and a collaborative learning mechanism to align features in a shared representation space. Experiments on the Potsdam and LoveDA datasets show that CTCLM achieves higher segmentation accuracy and clearer boundaries than existing methods, demonstrating its effectiveness in combining global context with local detail.
AB - Land cover classification is an important task in remote sensing image analysis. However, it remains challenging because land cover has complex spatial patterns and large variations in object scale and texture. Current methods for land cover classification primarily rely on convolutional neural networks (CNNs) and self-attention mechanisms (Transformers) to extract features. CNN-based methods have strong inductive biases and are good at extracting local features and texture information. However, their limited receptive fields and the lack of long-range dependencies can cause inaccurate boundaries and blurred edges. Transformer-based methods use global self-attention to capture long-range semantic relationships, but their lack of local inductive bias may lead to missing details. Some CNN-Transformer feature-fusion methods have been proposed to combine global and local information and achieve better segmentation, but these combinations introduce more parameters and increase computational cost. To overcome these limitations, we propose a CNN-Transformer Collaborative Learning Model (CTCLM). The model enables two-way knowledge transfer and joint learning between the CNN and Transformer branches through bidirectional distillation. This can avoid the limitation of traditional one-way distillation from CNN to Transformer that depends on a fixed teacher model. During inference, only the Transformer branch is used for prediction, improving efficiency by eliminating the need for a dual-branch CNN-Transformer framework. In addition, CTCLM uses a parallel multi-scale convolutional encoder to strengthen multi-scale feature extraction and a collaborative learning mechanism to align features in a shared representation space. Experiments on the Potsdam and LoveDA datasets show that CTCLM achieves higher segmentation accuracy and clearer boundaries than existing methods, demonstrating its effectiveness in combining global context with local detail.
KW - collaborative learning
KW - remote sensing
KW - semantic segmentation
UR - https://www.scopus.com/pages/publications/105034109079
U2 - 10.1117/12.3096292
DO - 10.1117/12.3096292
M3 - Conference contribution
AN - SCOPUS:105034109079
T3 - Proceedings of SPIE - The International Society for Optical Engineering
BT - Seventeenth International Conference on Graphics and Image Processing, ICGIP 2025
A2 - Xiao, Liang
PB - SPIE
T2 - 17th International Conference on Graphics and Image Processing, ICGIP 2025
Y2 - 7 November 2025 through 9 November 2025
ER -