Abstract
Objective Convolutional neural networks (CNNs) have received significant attention in remote sensing scene classification owing to their powerful feature induction and learning capabilities; however, their local induction mechanism hinders the acquisition of global dependencies and limits model performance. Vision Transformers (ViTs) have gained considerable popularity in various visual tasks, including remote sensing image processing. The core of ViTs lies in the self-attention mechanism, which establishes global dependencies and alleviates the limitations of CNN-based algorithms. However, this mechanism incurs high computational costs: computing the interactions between key-value pairs requires attending over all spatial locations, which leads to a heavy computational burden and a large memory footprint. Furthermore, the self-attention mechanism focuses on modeling global information while ignoring local detailed features. To solve these problems, this work proposes a global-local feature coupling network for remote sensing scene classification.

Method The overall architecture of the proposed global-local feature coupling network consists of multiple convolutional layers and dual-channel coupling modules, each of which comprises a ViT branch based on dual-grained attention and a branch based on depth-wise separable convolution. Feature fusion is achieved with the proposed adaptive coupling module, which effectively combines global and local features and thereby enhances the model's capability to understand remote sensing scene images. On the one hand, a dual-grained attention is proposed to dynamically perceive data content and allocate computation flexibly, alleviating the heavy computational burden caused by the self-attention mechanism. This dual-grained attention enables each query to focus on a small subset of key-value pairs that are semantically most relevant.
Less relevant key-value pairs are first filtered out at a coarse-grained region level so that the most relevant key-value pairs can be identified and global attention can be achieved efficiently. This step is accomplished by constructing a regional correlation graph and pruning it to retain only the top-k regions with the highest correlation, so that each region attends only to its top-k most relevant regions. Once the attention regions are determined, fine-grained key/value tokens are gathered from them to perform token-to-token attention, thereby realizing a dynamic, query-aware sparse attention: for each query, irrelevant key-value pairs are filtered out at the coarse-grained region level, and fine-grained token-to-token attention is then applied within the set of retained candidate regions. On the other hand, an adaptive coupling module is used to combine the CNN and ViT branches and integrate global and local features. This module consists of two coupling operations, spatial coupling and channel coupling, which take the outputs of the ViT and depth-wise separable convolution branches as input and adaptively reweight the features along the global and local feature dimensions. In this way, the global and local information of the scene image can be aggregated within the same module, achieving a comprehensive fusion.

Result Experiments are conducted on the UC Merced land-use dataset (UCM), the aerial image dataset (AID), and the Northwestern Polytechnical University remote sensing image scene classification dataset (NWPU-RESISC45). The proposed method is compared with state-of-the-art CNN-based and ViT-based methods to demonstrate its superiority. Across the different training ratios, the proposed method achieves the best classification results, with accuracies of 99.71% ± 0.20%, 94.75% ± 0.09%, 97.05% ± 0.12%, 92.11% ± 0.20%, and 94.10% ± 0.17%.
Ablation experiments are also performed on the three datasets to demonstrate the positive effect of the two proposed modules on the experimental results: the dual-grained attention and the adaptive coupling module reduce the model's computational burden and improve its classification performance.

Conclusion A novel global-local feature coupling network is proposed for remote sensing scene classification. First, a dynamic dual-grained attention that exploits sparsity to save computation while involving only GPU-friendly dense matrix multiplications is proposed to address the computational cost of the conventional attention mechanism in ViTs. Furthermore, an adaptive coupling module is designed to mix and exchange information between the two branches and comprehensively integrate global and local detailed features, significantly enhancing the representational capability of the extracted features. Extensive experiments on the three datasets demonstrate the effectiveness of the global-local feature coupling network.
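The coarse-to-fine routing behind the dual-grained attention can be sketched as follows. This is a minimal NumPy illustration of the two-stage idea described in the Method section (region-level pruning of a correlation graph, then token-to-token attention within the retained regions); the function name, pooling choice, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_grained_attention(q, k, v, top_k=2):
    """Coarse-to-fine sparse attention sketch (illustrative, not the paper's code).

    q, k, v: arrays of shape (R, T, D) -- R regions, T tokens per region, D channels.
    """
    R, T, D = q.shape
    # Coarse stage: region-level descriptors via average pooling over tokens.
    qr = q.mean(axis=1)                               # (R, D)
    kr = k.mean(axis=1)                               # (R, D)
    affinity = qr @ kr.T                              # (R, R) regional correlation graph
    # Prune the graph: each region keeps only its top-k most relevant regions.
    idx = np.argsort(-affinity, axis=1)[:, :top_k]    # (R, top_k)
    out = np.empty_like(q)
    for r in range(R):
        # Fine stage: gather key/value tokens from the retained regions only.
        kg = k[idx[r]].reshape(-1, D)                 # (top_k * T, D)
        vg = v[idx[r]].reshape(-1, D)
        # Token-to-token attention restricted to the candidate set.
        attn = softmax(q[r] @ kg.T / np.sqrt(D))      # (T, top_k * T)
        out[r] = attn @ vg
    return out
```

Because each query attends to only `top_k * T` tokens instead of all `R * T`, the per-query cost drops by roughly a factor of `R / top_k`, while every step remains a dense matrix multiplication.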
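The adaptive coupling module can likewise be sketched as a gated fusion of the two branch outputs. The sketch below assumes one channel gate (from globally pooled statistics) and one spatial gate (from a per-position projection) that together reweight the global and local features; the gate parameterization and the convex-combination form are assumptions for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_coupling(f_global, f_local, w_c, w_s):
    """Fuse ViT (global) and depth-wise conv (local) features (illustrative sketch).

    f_global, f_local: (C, H, W) feature maps from the two branches.
    w_c: (C, 2C) channel-gate weights; w_s: (1, 2C) spatial-gate weights
    (both stand in for learned parameters and are hypothetical here).
    """
    C, H, W = f_global.shape
    concat = np.concatenate([f_global, f_local], axis=0)    # (2C, H, W)
    # Channel coupling: one gate per channel from globally pooled statistics.
    pooled = concat.mean(axis=(1, 2))                       # (2C,)
    g_c = sigmoid(w_c @ pooled)[:, None, None]              # (C, 1, 1)
    # Spatial coupling: one gate per spatial position.
    g_s = sigmoid(np.einsum('oc,chw->ohw', w_s, concat))    # (1, H, W)
    # Adaptive reweighting: convex combination of the two branches.
    g = g_c * g_s                                           # (C, H, W) via broadcasting
    return g * f_global + (1.0 - g) * f_local
```

With zero-initialized gate weights both gates output 0.5, so the module starts near a fixed blend of the branches and learns to shift weight toward the more informative one per channel and per position.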
| Translated title of the contribution | Global-local feature coupling network for remote sensing scene classification |
| --- | --- |
| Original language | Chinese (Traditional) |
| Pages (from-to) | 1003-1016 |
| Number of pages | 14 |
| Journal | Journal of Image and Graphics |
| Volume | 30 |
| Issue number | 4 |
| DOIs | |
| Publication status | Published - Apr 2025 |
| Externally published | Yes |