TY - JOUR
T1 - Structural Transformer with Region Strip Attention for Video Object Segmentation
AU - Guan, Qingfeng
AU - Fang, Hao
AU - Han, Chenchen
AU - Wang, Zhicheng
AU - Zhang, Ruiheng
AU - Zhang, Yitian
AU - Lu, Xiankai
N1 - Publisher Copyright:
© 2024 Elsevier B.V.
PY - 2024/9/1
Y1 - 2024/9/1
N2 - Memory-based methods in semi-supervised video object segmentation (VOS) achieve competitive performance by computing feature similarity between the current frame and memory frames. However, this operation faces two challenges: 1) occlusions caused by object interactions, and 2) interference from similar objects or clutter in the background. In this work, we propose a Structural Transformer with Region Strip Attention (STRSA) approach to address these challenges. Specifically, we build a Structural Transformer (ST) architecture that decomposes the feature similarity between the long-term memory frames and the current frame into two aspects: a time–space part and an object-significance part. This allows us to investigate the spatio-temporal relationships among pixels and capture the salient features of objects, so that both the differences between pixels and the specificity of each object are fully exploited. In addition, we leverage the object location information in the long-term memory masks together with strip pooling to design a Region Strip Attention (RSA) module, which strengthens attention on foreground regions and suppresses background clutter. Extensive experiments on the DAVIS, YouTube-VOS, and MOSE benchmarks demonstrate that our method achieves satisfactory results and outperforms the retrained baseline model.
AB - Memory-based methods in semi-supervised video object segmentation (VOS) achieve competitive performance by computing feature similarity between the current frame and memory frames. However, this operation faces two challenges: 1) occlusions caused by object interactions, and 2) interference from similar objects or clutter in the background. In this work, we propose a Structural Transformer with Region Strip Attention (STRSA) approach to address these challenges. Specifically, we build a Structural Transformer (ST) architecture that decomposes the feature similarity between the long-term memory frames and the current frame into two aspects: a time–space part and an object-significance part. This allows us to investigate the spatio-temporal relationships among pixels and capture the salient features of objects, so that both the differences between pixels and the specificity of each object are fully exploited. In addition, we leverage the object location information in the long-term memory masks together with strip pooling to design a Region Strip Attention (RSA) module, which strengthens attention on foreground regions and suppresses background clutter. Extensive experiments on the DAVIS, YouTube-VOS, and MOSE benchmarks demonstrate that our method achieves satisfactory results and outperforms the retrained baseline model.
KW - Region Strip Attention
KW - Structural Transformer
KW - Video Object Segmentation
UR - http://www.scopus.com/inward/record.url?scp=85197025756&partnerID=8YFLogxK
U2 - 10.1016/j.neucom.2024.128076
DO - 10.1016/j.neucom.2024.128076
M3 - Article
AN - SCOPUS:85197025756
SN - 0925-2312
VL - 596
JO - Neurocomputing
JF - Neurocomputing
M1 - 128076
ER -