TY - GEN
T1 - GTMS
T2 - 18th European Conference on Computer Vision, ECCV 2024
AU - Lyu, Haoxin
AU - Zhong, Tianxiong
AU - Zhao, Sanyuan
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
PY - 2025
Y1 - 2025
N2 - Referring image segmentation (RIS) aims to segment the object of interest described by a given natural language expression. Because fully supervised methods require expensive pixel-wise labeling, mask-free solutions supervised by low-cost labels are highly desirable. However, existing mask-free RIS methods suffer from complicated architectures or make insufficient use of structural and semantic information, resulting in unsatisfactory performance. In this paper, we propose GTMS, a gradient-driven, tree-guided, mask-free RIS method that utilizes both structural and semantic information while using only a bounding box as the supervision signal. Specifically, we first represent the structural information of the input image as a tree. Meanwhile, we utilize gradient information to explore the regions semantically related to the text feature. Finally, the structural and semantic information are used to refine the output of the segmentation model into pseudo labels, which in turn are used to optimize the model. To verify the effectiveness of our method, we conduct experiments on three benchmarks: RefCOCO, RefCOCO+, and RefCOCOg. Our method achieves state-of-the-art performance among mask-free RIS methods and even outperforms many fully supervised RIS methods. Specifically, GTMS attains 66.54%, 69.98% and 63.41% IoU on RefCOCO Val-Test, TestA and TestB, respectively. Our code will be available at https://github.com/eternalld/GTMS.
KW - Referring Image Segmentation
KW - Weak supervision
UR - http://www.scopus.com/inward/record.url?scp=85211336669&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-72848-8_17
DO - 10.1007/978-3-031-72848-8_17
M3 - Conference contribution
AN - SCOPUS:85211336669
SN - 9783031728471
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 288
EP - 304
BT - Computer Vision – ECCV 2024 - 18th European Conference, Proceedings
A2 - Leonardis, Aleš
A2 - Ricci, Elisa
A2 - Roth, Stefan
A2 - Russakovsky, Olga
A2 - Sattler, Torsten
A2 - Varol, Gül
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 29 September 2024 through 4 October 2024
ER -