
CNN-Transformer Collaborative Learning for Remote Sensing Semantic Segmentation

  • Huazhe Guo*
  • Shan Dong
  • Shangyu Li
  • Yin Zhuang
  • He Chen
  • Baogui Qi
  • *Corresponding author for this work
  • Beijing Institute of Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Land cover classification is an important task in remote sensing image analysis. However, it remains challenging because land cover has complex spatial patterns and large variations in object scale and texture. Current methods for land cover classification primarily rely on convolutional neural networks (CNNs) and self-attention mechanisms (Transformers) to extract features. CNN-based methods have strong inductive biases and are good at extracting local features and texture information. However, their limited receptive fields and inability to model long-range dependencies can cause inaccurate boundaries and blurred edges. Transformer-based methods use global self-attention to capture long-range semantic relationships, but their lack of local inductive bias may lead to missing details. Some CNN-Transformer feature-fusion methods have been proposed to combine global and local information and achieve better segmentation, but these combinations introduce more parameters and increase computational cost. To overcome these limitations, we propose a CNN-Transformer Collaborative Learning Model (CTCLM). The model enables two-way knowledge transfer and joint learning between the CNN and Transformer branches through bidirectional distillation. This avoids a limitation of traditional one-way distillation from CNN to Transformer, which depends on a fixed teacher model. During inference, only the Transformer branch is used for prediction, which improves efficiency because the dual-branch CNN-Transformer framework is no longer needed. In addition, CTCLM uses a parallel multi-scale convolutional encoder to strengthen multi-scale feature extraction and a collaborative learning mechanism to align features in a shared representation space. Experiments on the Potsdam and LoveDA datasets show that CTCLM achieves higher segmentation accuracy and clearer boundaries than existing methods, demonstrating its effectiveness in combining global context with local detail.
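
The abstract does not give the loss used for bidirectional distillation, so the following is only a minimal sketch of the general idea, assuming a mutual-learning style objective: each branch is trained with a supervised cross-entropy term plus a KL-divergence term that pulls it toward the other branch's softened per-pixel prediction. The function names, the `temperature` and `alpha` parameters, and the choice of KL divergence are all illustrative assumptions, not the authors' formulation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one pixel's class logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) for two discrete distributions."""
    return sum(t * math.log((t + eps) / (p + eps)) for t, p in zip(target, pred))


def cross_entropy(probs, label, eps=1e-12):
    """Negative log-likelihood of the ground-truth class."""
    return -math.log(probs[label] + eps)


def bidirectional_distillation_losses(cnn_logits, trans_logits, labels,
                                      temperature=2.0, alpha=0.5):
    """Return (cnn_loss, transformer_loss), averaged over pixels.

    Hypothetical objective (an assumption, not taken from the paper):
    each branch gets a supervised term plus `alpha` times a KL term
    toward the peer branch's softened prediction. In a real training
    loop the peer distribution would be treated as a constant
    (no gradient) when updating each branch.
    """
    cnn_loss = 0.0
    trans_loss = 0.0
    for c, t, y in zip(cnn_logits, trans_logits, labels):
        p_cnn = softmax(c, temperature)
        p_trans = softmax(t, temperature)
        # CNN branch learns from the labels and from the Transformer.
        cnn_loss += cross_entropy(softmax(c), y) + alpha * kl_divergence(p_trans, p_cnn)
        # Transformer branch learns from the labels and from the CNN.
        trans_loss += cross_entropy(softmax(t), y) + alpha * kl_divergence(p_cnn, p_trans)
    n = len(labels)
    return cnn_loss / n, trans_loss / n


# Two pixels, three classes: toy logits from each branch.
cnn = [[2.0, 0.5, -1.0], [0.1, 1.2, 0.3]]
trans = [[0.2, 1.5, 0.0], [1.0, 0.0, 0.5]]
cnn_loss, trans_loss = bidirectional_distillation_losses(cnn, trans, [0, 1])
```

Because neither branch is a fixed teacher in this sketch, both losses change as either branch improves. At inference time only the Transformer logits would be needed, matching the single-branch prediction described in the abstract.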

Original language: English
Title of host publication: Seventeenth International Conference on Graphics and Image Processing, ICGIP 2025
Editors: Liang Xiao
Publisher: SPIE
ISBN (Electronic): 9798902322078
Publication status: Published - 4 Mar 2026
Externally published: Yes
Event: 17th International Conference on Graphics and Image Processing, ICGIP 2025 - Nanjing, China
Duration: 7 Nov 2025 → 9 Nov 2025

Publication series

Name: Proceedings of SPIE - The International Society for Optical Engineering
Volume: 14124
ISSN (Print): 0277-786X
ISSN (Electronic): 1996-756X

Conference

Conference: 17th International Conference on Graphics and Image Processing, ICGIP 2025
Country/Territory: China
City: Nanjing
Period: 7/11/25 → 9/11/25

Keywords

  • collaborative learning
  • remote sensing
  • semantic segmentation
