Self-Enhanced Mixed Attention Network for Three-Modal Images Few-Shot Semantic Segmentation

Kechen Song; Yiming Zhang; Yanqi Bao; Ying Zhao; Yunhui Yan

doi:10.3390/s23146612

Corresponding author: Yanqi Bao

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)

Abstract

Image segmentation is an important computer vision technique used in a wide range of tasks. In extreme cases, however, insufficient illumination can severely degrade model performance, so a growing number of fully supervised methods take multi-modal images as input. Large, densely annotated datasets are difficult to obtain, whereas few-shot methods can still achieve satisfactory results with only a few pixel-annotated samples. We therefore propose a few-shot semantic segmentation method for Visible-Depth-Thermal (three-modal) images. It exploits both the homogeneous information shared by the three modalities and the complementary information across them to improve few-shot segmentation performance. We constructed a novel indoor dataset, VDT-2048-5ⁱ, for the three-modal few-shot semantic segmentation task. We also propose a Self-Enhanced Mixed Attention Network (SEMANet), which consists of a Self-Enhanced (SE) module and a Mixed Attention (MA) module. The SE module amplifies the differences between different kinds of features and strengthens the weak connections of foreground features. The MA module fuses the features of the three modalities into a better feature representation. Compared with the most advanced previous methods, our model improves mIoU by 3.8% and 3.3% in the 1-shot and 5-shot settings, respectively, achieving state-of-the-art performance. In future work, we will address failure cases by obtaining more discriminative and robust feature representations, and explore achieving high performance with fewer parameters and lower computational cost.
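The abstract reports its gains in mIoU (mean intersection over union). As a rough illustration of that metric, the sketch below assumes the definition commonly used in few-shot segmentation: intersection and union pixel counts are accumulated per class over all test episodes, and the per-class IoUs are then averaged. The function name `mean_iou` and the episode tuple layout are hypothetical, and this record does not confirm the exact evaluation protocol used in the paper.

```python
from collections import defaultdict


def mean_iou(episodes):
    """Mean IoU over classes for binary foreground masks.

    `episodes` is an iterable of (class_id, pred_mask, target_mask), where
    each mask is a 2-D nested list of 0/1 values. This layout is an
    assumption made for illustration only.
    """
    inter = defaultdict(int)  # per-class count of pixels in both masks
    union = defaultdict(int)  # per-class count of pixels in either mask
    for cls, pred, target in episodes:
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                inter[cls] += int(bool(p) and bool(t))
                union[cls] += int(bool(p) or bool(t))
    # Skip classes that have no foreground pixels in either mask.
    ious = [inter[c] / union[c] for c in union if union[c] > 0]
    return sum(ious) / len(ious) if ious else 0.0


episodes = [
    ("class_a", [[1, 1], [0, 0]], [[1, 0], [0, 0]]),  # IoU = 1 / 2
    ("class_b", [[0, 1], [0, 1]], [[0, 1], [0, 1]]),  # IoU = 2 / 2
]
print(mean_iou(episodes))
```

Under this definition, a reported improvement such as 3.8% would correspond to an increase in the class-averaged IoU in percentage points, assuming the paper follows the same convention.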

Original language	English
Article number	6612
Journal	Sensors
Volume	23
Issue number	14
DOIs	https://doi.org/10.3390/s23146612
Publication status	Published - Jul 2023
Externally published	Yes

Keywords

few-shot semantic segmentation
multi-modal images
three-modal registration

Cite this

Song, K., Zhang, Y., Bao, Y., Zhao, Y., & Yan, Y. (2023). Self-Enhanced Mixed Attention Network for Three-Modal Images Few-Shot Semantic Segmentation. Sensors, 23(14), Article 6612. https://doi.org/10.3390/s23146612

@article{43dea3968b114abda69ea8f29222aa55,
title = "Self-Enhanced Mixed Attention Network for Three-Modal Images Few-Shot Semantic Segmentation",
abstract = "As an important computer vision technique, image segmentation has been widely used in various tasks. However, in some extreme cases, the insufficient illumination would result in a great impact on the performance of the model. So more and more fully supervised methods use multi-modal images as their input. The dense annotated large datasets are difficult to obtain, but the few-shot methods still can have satisfactory results with few pixel-annotated samples. Therefore, we propose the Visible-Depth-Thermal (three-modal) images few-shot semantic segmentation method. It utilizes the homogeneous information of three-modal images and the complementary information of different modal images, which can improve the performance of few-shot segmentation tasks. We constructed a novel indoor dataset VDT-2048-5i for the three-modal images few-shot semantic segmentation task. We also proposed a Self-Enhanced Mixed Attention Network (SEMANet), which consists of a Self-Enhanced module (SE) and a Mixed Attention module (MA). The SE module amplifies the difference between the different kinds of features and strengthens the weak connection for the foreground features. The MA module fuses the three-modal feature to obtain a better feature. Compared with the most advanced methods before, our model improves mIoU by 3.8% and 3.3% in 1-shot and 5-shot settings, respectively, which achieves state-of-the-art performance. In the future, we will solve failure cases by obtaining more discriminative and robust feature representations, and explore achieving high performance with fewer parameters and computational costs.",
keywords = "few-shot semantic segmentation, multi-modal images, three-modal registration",
author = "Kechen Song and Yiming Zhang and Yanqi Bao and Ying Zhao and Yunhui Yan",
note = "Publisher Copyright: {\textcopyright} 2023 by the authors.",
year = "2023",
month = jul,
doi = "10.3390/s23146612",
language = "English",
volume = "23",
journal = "Sensors",
issn = "1424-8220",
publisher = "Multidisciplinary Digital Publishing Institute (MDPI)",
number = "14",
}

TY - JOUR
T1 - Self-Enhanced Mixed Attention Network for Three-Modal Images Few-Shot Semantic Segmentation
AU - Song, Kechen
AU - Zhang, Yiming
AU - Bao, Yanqi
AU - Zhao, Ying
AU - Yan, Yunhui
PY - 2023/7
Y1 - 2023/7
N2 - As an important computer vision technique, image segmentation has been widely used in various tasks. However, in some extreme cases, the insufficient illumination would result in a great impact on the performance of the model. So more and more fully supervised methods use multi-modal images as their input. The dense annotated large datasets are difficult to obtain, but the few-shot methods still can have satisfactory results with few pixel-annotated samples. Therefore, we propose the Visible-Depth-Thermal (three-modal) images few-shot semantic segmentation method. It utilizes the homogeneous information of three-modal images and the complementary information of different modal images, which can improve the performance of few-shot segmentation tasks. We constructed a novel indoor dataset VDT-2048-5i for the three-modal images few-shot semantic segmentation task. We also proposed a Self-Enhanced Mixed Attention Network (SEMANet), which consists of a Self-Enhanced module (SE) and a Mixed Attention module (MA). The SE module amplifies the difference between the different kinds of features and strengthens the weak connection for the foreground features. The MA module fuses the three-modal feature to obtain a better feature. Compared with the most advanced methods before, our model improves mIoU by 3.8% and 3.3% in 1-shot and 5-shot settings, respectively, which achieves state-of-the-art performance. In the future, we will solve failure cases by obtaining more discriminative and robust feature representations, and explore achieving high performance with fewer parameters and computational costs.

KW - few-shot semantic segmentation
KW - multi-modal images
KW - three-modal registration
UR - http://www.scopus.com/inward/record.url?scp=85165987993&partnerID=8YFLogxK
U2 - 10.3390/s23146612
DO - 10.3390/s23146612
M3 - Article
C2 - 37514905
AN - SCOPUS:85165987993
SN - 1424-8220
VL - 23
JO - Sensors
JF - Sensors
IS - 14
M1 - 6612
ER -
