Abstract
Transformers possess a broad perceptual scope, while convolutional neural networks (CNNs) excel at capturing local information. In this paper, the authors propose the Multi-Scale Feature Aggregation Network (MSFA-Net) for single-image dehazing, which combines the advantages of Transformers and CNNs. MSFA-Net is based on an encoder–decoder structure and contains four main innovations. First, the authors improve the original Swin Transformer to make it more effective for dehazing tasks, naming the result the Spatial Information Aggregation Transformer (SIAT); SIAT blocks are placed in both the encoder and the decoder for feature extraction. Second, the authors propose an upsampling module called Efficient Spatial Resolution Recovery (ESRR), placed in the decoder; compared with commonly used transposed convolutions, the ESRR module has a lower computational cost. Third, because the haze distribution is usually uneven and the information carried by each channel differs, the authors introduce the Dynamic Multi-Attention (DMA) module, placed between the encoder and the decoder, to provide pixel-wise and channel-wise weights for the input features. Fourth, as network depth increases, the spatial structural information from the high-resolution layers tends to degrade; to address this, the authors propose the Multi-Scale Feature Fusion (MSFF) module, placed in both the encoder and the decoder, to recover the missing spatial structural information. Extensive experimental results show that the proposed dehazing network achieves state-of-the-art performance at a relatively low computational cost.
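The abstract states only that the DMA module produces pixel-wise and channel-wise weights; the sketch below is an illustrative NumPy toy of that general idea (a squeeze-and-excitation-style channel gate combined with a 1x1-projection-style spatial gate), not the paper's actual DMA design. The function name and both weight arguments are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dynamic_multi_attention_sketch(feat, channel_fc, pixel_proj):
    """Toy sketch: gate a feature map with channel-wise and pixel-wise weights.

    feat: (C, H, W) input feature map.
    channel_fc: (C, C) matrix standing in for the small learned layer a
        channel-attention branch would use.
    pixel_proj: (C,) vector standing in for a learned 1x1 convolution that
        collapses channels into one spatial attention map.
    """
    # Channel branch: global average pooling -> projection -> sigmoid
    # yields one weight per channel ("information from each channel differs").
    pooled = feat.mean(axis=(1, 2))              # (C,)
    channel_w = sigmoid(channel_fc @ pooled)     # (C,)

    # Pixel branch: 1x1-conv-like projection -> sigmoid yields one weight
    # per spatial position ("haze distribution is usually uneven").
    pixel_w = sigmoid(np.tensordot(pixel_proj, feat, axes=([0], [0])))  # (H, W)

    # Apply both gates to the input features.
    return feat * channel_w[:, None, None] * pixel_w[None, :, :]

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 8))
out = dynamic_multi_attention_sketch(
    feat, rng.standard_normal((4, 4)), rng.standard_normal(4))
print(out.shape)
```

Because both gates pass through a sigmoid, every weight lies in (0, 1), so the module can only attenuate features, position by position and channel by channel.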
Original language | English |
---|---|
Pages (from-to) | 2943-2961 |
Number of pages | 19 |
Journal | IET Image Processing |
Volume | 18 |
Issue | 11 |
DOI | |
Publication status | Published - 18 Sep 2024 |