TY - JOUR
T1 - Multi-Scale interaction and enhancement network for referring camouflaged objects image segmentation
AU - Sun, Qiyang
AU - Zhang, Xin
AU - Wang, Xia
AU - Xu, Shiwei
AU - Li, Yuyang
N1 - Publisher Copyright: © 2025 The Author(s)
PY - 2025/12/3
Y1 - 2025/12/3
N2 - Camouflaged Objects Detection (COD) aims to identify objects that blend seamlessly into their surrounding environments. Existing methods treat COD as a binary segmentation problem built on Salient Object Detection techniques, which separate objects from the background. Although these methods have been widely applied, their inability to identify object categories limits their range of applications. Moreover, multi-target selection and localization rely heavily on expert-driven post-processing, resulting in poor interactivity. To address these limitations, we reformulate COD as a Referring Image Segmentation (RIS) problem, enabling precise localization and segmentation of target objects specified through natural language instructions. Accordingly, this paper proposes MSIENet, a novel RIS framework for the COD task that integrates a language encoder, an image encoder, and a multi-modal fusion module. The framework bridges the modality gap between visual and linguistic features through a cross-attention-based fusion and alignment module. MSIENet also contains two key components, a multi-scale edge enhancement module and a texture enhancement module, which aggregate and refine texture details and boundary information to facilitate the generation of high-quality segmentation masks. We also collect a language-image camouflaged dataset, Ref-ACOD, establishing a rigorous evaluation benchmark for RIS-based COD. Experiments demonstrate that MSIENet surpasses SOTA RIS methods on COD tasks, with MIoU and OIoU increasing by 8% and 14.5%, respectively, over LAVT. All datasets are available at http://github.com/samsunq/Ref-ACOD.git
AB - Camouflaged Objects Detection (COD) aims to identify objects that blend seamlessly into their surrounding environments. Existing methods treat COD as a binary segmentation problem built on Salient Object Detection techniques, which separate objects from the background. Although these methods have been widely applied, their inability to identify object categories limits their range of applications. Moreover, multi-target selection and localization rely heavily on expert-driven post-processing, resulting in poor interactivity. To address these limitations, we reformulate COD as a Referring Image Segmentation (RIS) problem, enabling precise localization and segmentation of target objects specified through natural language instructions. Accordingly, this paper proposes MSIENet, a novel RIS framework for the COD task that integrates a language encoder, an image encoder, and a multi-modal fusion module. The framework bridges the modality gap between visual and linguistic features through a cross-attention-based fusion and alignment module. MSIENet also contains two key components, a multi-scale edge enhancement module and a texture enhancement module, which aggregate and refine texture details and boundary information to facilitate the generation of high-quality segmentation masks. We also collect a language-image camouflaged dataset, Ref-ACOD, establishing a rigorous evaluation benchmark for RIS-based COD. Experiments demonstrate that MSIENet surpasses SOTA RIS methods on COD tasks, with MIoU and OIoU increasing by 8% and 14.5%, respectively, over LAVT. All datasets are available at http://github.com/samsunq/Ref-ACOD.git
KW - Camouflaged objects detection
KW - Cross-attention
KW - Multi-scale features
KW - Multi-task learning
KW - Referring image segmentation
UR - https://www.scopus.com/pages/publications/105021872679
U2 - 10.1016/j.knosys.2025.114883
DO - 10.1016/j.knosys.2025.114883
M3 - Article
AN - SCOPUS:105021872679
SN - 0950-7051
VL - 331
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
M1 - 114883
ER -