Local-Global Context Aware Transformer for Language-Guided Video Segmentation

Chen Liang; Wenguan Wang; Tianfei Zhou; Jiaxu Miao; Yawei Luo; Yi Yang

doi:10.1109/TPAMI.2023.3262578

Corresponding author for this work: Yi Yang

Research output: Contribution to journal › Article › peer-review

40 Citations (Scopus)

Abstract

We explore the task of language-guided video segmentation (LVS). Previous algorithms mostly adopt 3D CNNs to learn video representation, struggling to capture long-term context and easily suffering from visual-linguistic misalignment. In light of this, we present Locater (local-global context aware Transformer), which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner. The memory is designed to involve two components - one for persistently preserving global video content, and one for dynamically gathering local temporal context and segmentation history. Based on the memorized local-global context and the particular content of each frame, Locater holistically and flexibly comprehends the expression as an adaptive query vector for each frame. The vector is used to query the corresponding frame for mask generation. The memory also allows Locater to process videos with linear time complexity and constant size memory, while Transformer-style self-attention computation scales quadratically with sequence length. To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S$^+$, which is built upon A2D-S dataset but poses increased challenges in disambiguating among similar objects. Experiments on three LVS datasets and our A2D-S$^+$ show that Locater outperforms previous state-of-the-arts. Further, we won the 1st place in the Referring Video Object Segmentation Track of the 3rd Large-scale Video Object Segmentation Challenge, where Locater served as the foundation for the winning solution.

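Original language	English
Pages (from-to)	10055-10069
Number of pages	15
Journal	IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume	45
Issue number	8
DOI	https://doi.org/10.1109/TPAMI.2023.3262578
Publication status	Published - 1 Aug 2023
Externally published	Yes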

Cite this

@article{d6d61a45efb9427f81d5f7592a623770,
title = "Local-Global Context Aware Transformer for Language-Guided Video Segmentation",
abstract = "We explore the task of language-guided video segmentation (LVS). Previous algorithms mostly adopt 3D CNNs to learn video representation, struggling to capture long-term context and easily suffering from visual-linguistic misalignment. In light of this, we present Locater (local-global context aware Transformer), which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner. The memory is designed to involve two components - one for persistently preserving global video content, and one for dynamically gathering local temporal context and segmentation history. Based on the memorized local-global context and the particular content of each frame, Locater holistically and flexibly comprehends the expression as an adaptive query vector for each frame. The vector is used to query the corresponding frame for mask generation. The memory also allows Locater to process videos with linear time complexity and constant size memory, while Transformer-style self-attention computation scales quadratically with sequence length. To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S$^+$, which is built upon A2D-S dataset but poses increased challenges in disambiguating among similar objects. Experiments on three LVS datasets and our A2D-S$^+$ show that Locater outperforms previous state-of-the-arts. Further, we won the 1st place in the Referring Video Object Segmentation Track of the 3rd Large-scale Video Object Segmentation Challenge, where Locater served as the foundation for the winning solution.",
keywords = "Language-guided video segmentation, memory network, multi-modal transformer",
author = "Chen Liang and Wenguan Wang and Tianfei Zhou and Jiaxu Miao and Yawei Luo and Yi Yang",
note = "Publisher Copyright: {\textcopyright} 1979-2012 IEEE.",
year = "2023",
month = aug,
day = "1",
doi = "10.1109/TPAMI.2023.3262578",
language = "English",
volume = "45",
pages = "10055--10069",
journal = "IEEE Transactions on Pattern Analysis and Machine Intelligence",
issn = "0162-8828",
publisher = "IEEE Computer Society",
number = "8",
}

TY - JOUR

T1 - Local-Global Context Aware Transformer for Language-Guided Video Segmentation

AU - Liang, Chen

AU - Wang, Wenguan

AU - Zhou, Tianfei

AU - Miao, Jiaxu

AU - Luo, Yawei

AU - Yang, Yi

PY - 2023/8/1

Y1 - 2023/8/1

AB - We explore the task of language-guided video segmentation (LVS). Previous algorithms mostly adopt 3D CNNs to learn video representation, struggling to capture long-term context and easily suffering from visual-linguistic misalignment. In light of this, we present Locater (local-global context aware Transformer), which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner. The memory is designed to involve two components - one for persistently preserving global video content, and one for dynamically gathering local temporal context and segmentation history. Based on the memorized local-global context and the particular content of each frame, Locater holistically and flexibly comprehends the expression as an adaptive query vector for each frame. The vector is used to query the corresponding frame for mask generation. The memory also allows Locater to process videos with linear time complexity and constant size memory, while Transformer-style self-attention computation scales quadratically with sequence length. To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S$^+$, which is built upon A2D-S dataset but poses increased challenges in disambiguating among similar objects. Experiments on three LVS datasets and our A2D-S$^+$ show that Locater outperforms previous state-of-the-arts. Further, we won the 1st place in the Referring Video Object Segmentation Track of the 3rd Large-scale Video Object Segmentation Challenge, where Locater served as the foundation for the winning solution.

KW - Language-guided video segmentation

KW - memory network

KW - multi-modal transformer

UR - http://www.scopus.com/inward/record.url?scp=85151514596&partnerID=8YFLogxK

U2 - 10.1109/TPAMI.2023.3262578

DO - 10.1109/TPAMI.2023.3262578

M3 - Article

C2 - 37819831

AN - SCOPUS:85151514596

SN - 0162-8828

VL - 45

SP - 10055

EP - 10069

JO - IEEE Transactions on Pattern Analysis and Machine Intelligence

JF - IEEE Transactions on Pattern Analysis and Machine Intelligence

IS - 8

ER -