Structural Transformer with Region Strip Attention for Video Object Segmentation

Qingfeng Guan, Hao Fang, Chenchen Han, Zhicheng Wang, Ruiheng Zhang, Yitian Zhang*, Xiankai Lu

*Corresponding author of this work

Research output: Contribution to journal › Article › peer-review

Abstract

Memory-based methods in semi-supervised video object segmentation (VOS) achieve competitive performance by computing feature similarity between the current frame and memory frames. However, this operation faces two challenges: 1) occlusion caused by object interaction, and 2) interference from similar objects or clutter in the background. In this work, we propose a Structural Transformer with Region Strip Attention (STRSA) approach to address these challenges. Specifically, we build a Structural Transformer (ST) architecture that decomposes the feature similarity between the long-term memory frames and the current frame into two aspects: a time–space part and an object significance part. This allows us to investigate the spatio-temporal relationships among pixels and capture the salient features of the objects, so that the differences between pixels and the specificity of objects are fully exploited. In addition, we leverage the object location information from the long-term memory masks, together with strip pooling, to design a Region Strip Attention (RSA) module that boosts attention on the foreground regions and suppresses background clutter. Extensive experiments on the DAVIS, YouTube-VOS, and MOSE benchmarks show that our method achieves satisfactory results and outperforms the retrained baseline model.
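To make the RSA idea concrete, below is a minimal PyTorch sketch of a strip-pooling attention block gated by a mask-derived foreground prior. This is an illustrative reconstruction under stated assumptions, not the paper's actual implementation; the module name `RegionStripAttention`, the `mem_mask` input, and the 1-D refinement convolutions are all hypothetical choices.

```python
import torch
import torch.nn as nn

class RegionStripAttention(nn.Module):
    """Hypothetical sketch of a Region Strip Attention (RSA) block.

    Combines strip pooling (average pooling along each spatial axis)
    with a coarse foreground prior derived from long-term memory masks,
    producing an attention gate that emphasizes foreground strips and
    suppresses background clutter. Names are illustrative only.
    """

    def __init__(self, channels: int):
        super().__init__()
        # 1-D convs refine the vertically / horizontally pooled strips.
        self.conv_h = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv_w = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feat: torch.Tensor, mem_mask: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) current-frame features
        # mem_mask: (B, 1, H, W) soft foreground prior from memory masks
        # Weight features by the prior before pooling, so strip
        # statistics are dominated by object regions (an assumption).
        weighted = feat * (1.0 + mem_mask)

        # Strip pooling: average over W -> (B, C, H); over H -> (B, C, W).
        strip_h = self.conv_h(weighted.mean(dim=3)).unsqueeze(3)  # (B, C, H, 1)
        strip_w = self.conv_w(weighted.mean(dim=2)).unsqueeze(2)  # (B, C, 1, W)

        # Broadcast-add the two strips into a full-resolution map,
        # then squash it into a sigmoid attention gate.
        attn = torch.sigmoid(self.fuse(strip_h + strip_w))        # (B, C, H, W)
        return feat * attn + feat  # residual gating keeps original features


# Usage example with arbitrary feature sizes:
rsa = RegionStripAttention(channels=256)
feat = torch.randn(2, 256, 30, 54)
mask = torch.rand(2, 1, 30, 54)
out = rsa(feat, mask)  # (2, 256, 30, 54)
```

The residual gating (`feat * attn + feat`) is one common design choice that lets the attention act as a soft re-weighting without ever zeroing out the backbone features entirely.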

Original language: English
Article number: 128076
Journal: Neurocomputing
Volume: 596
DOI
Publication status: Published - 1 Sep 2024
