Dynamic Graph Modeling for Weakly-Supervised Temporal Action Localization

Haichao Shi; Xiao Yu Zhang; Changsheng Li; Lixing Gong; Yong Li; Yongjun Bao

doi:10.1145/3503161.3548077

Haichao Shi, Xiao Yu Zhang^*, Changsheng Li, Lixing Gong, Yong Li, Yongjun Bao

^*Corresponding author for this work

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

16 Citations (Scopus)

Abstract

Weakly supervised action localization is a challenging task that aims to localize action instances in untrimmed videos given only video-level supervision. Existing methods mostly distinguish action from background via attentive feature fusion with RGB and optical flow modalities. Unfortunately, this strategy fails to retain the distinct characteristics of each modality, leading to inaccurate localization under hard-to-discriminate cases such as action-context interference and in-action stationary period. As an action is typically comprised of multiple stages, an intuitive solution is to model the relation between the finer-grained action segments to obtain a more detailed analysis. In this paper, we propose a dynamic graph-based method, namely DGCNN, to explore the two-stream relation between action segments. To be specific, segments within a video which are likely to be actions are dynamically selected to construct an action graph. For each graph, a triplet adjacency matrix is devised to explore the temporal and contextual correlations between the pseudo action segments, which consists of three components, i.e., mutual importance, feature similarity, and high-level contextual similarity. The two-stream dynamic pseudo graphs, along with the pseudo background segments, are used to derive more detailed video representation. For action localization, a non-local based temporal refinement module is proposed to fully leverage the temporal consistency between consecutive segments. Experimental results on three datasets, i.e., THUMOS14, ActivityNet v1.2 and v1.3, demonstrate that our method is superior to the state-of-the-arts.

Original language	English
Title of host publication	MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia
Publisher	Association for Computing Machinery, Inc
Pages	3820-3828
Number of pages	9
ISBN (Electronic)	9781450392037
DOI	https://doi.org/10.1145/3503161.3548077
Publication status	Published - 10 Oct 2022
Event	30th ACM International Conference on Multimedia, MM 2022 - Lisboa, Portugal; Duration: 10 Oct 2022 → 14 Oct 2022

Publication series

Name	MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia

Conference

Conference	30th ACM International Conference on Multimedia, MM 2022
Country/Territory	Portugal
City	Lisboa
Period	10/10/22 → 14/10/22

Access to Document

10.1145/3503161.3548077

Other files and links

Link to publication in Scopus

Cite this

Shi, H., Zhang, X. Y., Li, C., Gong, L., Li, Y., & Bao, Y. (2022). Dynamic Graph Modeling for Weakly-Supervised Temporal Action Localization. In MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia (pp. 3820-3828). (MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia). Association for Computing Machinery, Inc. https://doi.org/10.1145/3503161.3548077

@inproceedings{a79c1be96bae44cfb101e62a752b2682,

title = "Dynamic Graph Modeling for Weakly-Supervised Temporal Action Localization",

abstract = "Weakly supervised action localization is a challenging task that aims to localize action instances in untrimmed videos given only video-level supervision. Existing methods mostly distinguish action from background via attentive feature fusion with RGB and optical flow modalities. Unfortunately, this strategy fails to retain the distinct characteristics of each modality, leading to inaccurate localization under hard-to-discriminate cases such as action-context interference and in-action stationary period. As an action is typically comprised of multiple stages, an intuitive solution is to model the relation between the finer-grained action segments to obtain a more detailed analysis. In this paper, we propose a dynamic graph-based method, namely DGCNN, to explore the two-stream relation between action segments. To be specific, segments within a video which are likely to be actions are dynamically selected to construct an action graph. For each graph, a triplet adjacency matrix is devised to explore the temporal and contextual correlations between the pseudo action segments, which consists of three components, i.e., mutual importance, feature similarity, and high-level contextual similarity. The two-stream dynamic pseudo graphs, along with the pseudo background segments, are used to derive more detailed video representation. For action localization, a non-local based temporal refinement module is proposed to fully leverage the temporal consistency between consecutive segments. Experimental results on three datasets, i.e., THUMOS14, ActivityNet v1.2 and v1.3, demonstrate that our method is superior to the state-of-the-arts.",

keywords = "dynamic graph modeling, pseudo action generation, temporal action localization, weakly supervised learning",

author = "Haichao Shi and Zhang, {Xiao Yu} and Changsheng Li and Lixing Gong and Yong Li and Yongjun Bao",

note = "Publisher Copyright: {\textcopyright} 2022 Owner/Author.; 30th ACM International Conference on Multimedia, MM 2022 ; Conference date: 10-10-2022 Through 14-10-2022",

year = "2022",

month = oct,

day = "10",

doi = "10.1145/3503161.3548077",

language = "English",

series = "MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia",

publisher = "Association for Computing Machinery, Inc",

pages = "3820--3828",

booktitle = "MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia",

}

Shi, H, Zhang, XY, Li, C, Gong, L, Li, Y & Bao, Y 2022, Dynamic Graph Modeling for Weakly-Supervised Temporal Action Localization. in MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia. MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia, Association for Computing Machinery, Inc, pp. 3820-3828, 30th ACM International Conference on Multimedia, MM 2022, Lisboa, Portugal, 10/10/22. https://doi.org/10.1145/3503161.3548077

Dynamic Graph Modeling for Weakly-Supervised Temporal Action Localization. / Shi, Haichao; Zhang, Xiao Yu; Li, Changsheng et al.
MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia. Association for Computing Machinery, Inc, 2022. p. 3820-3828 (MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia).

TY - GEN

T1 - Dynamic Graph Modeling for Weakly-Supervised Temporal Action Localization

AU - Shi, Haichao

AU - Zhang, Xiao Yu

AU - Li, Changsheng

AU - Gong, Lixing

AU - Li, Yong

AU - Bao, Yongjun

PY - 2022/10/10

Y1 - 2022/10/10

N2 - Weakly supervised action localization is a challenging task that aims to localize action instances in untrimmed videos given only video-level supervision. Existing methods mostly distinguish action from background via attentive feature fusion with RGB and optical flow modalities. Unfortunately, this strategy fails to retain the distinct characteristics of each modality, leading to inaccurate localization under hard-to-discriminate cases such as action-context interference and in-action stationary period. As an action is typically comprised of multiple stages, an intuitive solution is to model the relation between the finer-grained action segments to obtain a more detailed analysis. In this paper, we propose a dynamic graph-based method, namely DGCNN, to explore the two-stream relation between action segments. To be specific, segments within a video which are likely to be actions are dynamically selected to construct an action graph. For each graph, a triplet adjacency matrix is devised to explore the temporal and contextual correlations between the pseudo action segments, which consists of three components, i.e., mutual importance, feature similarity, and high-level contextual similarity. The two-stream dynamic pseudo graphs, along with the pseudo background segments, are used to derive more detailed video representation. For action localization, a non-local based temporal refinement module is proposed to fully leverage the temporal consistency between consecutive segments. Experimental results on three datasets, i.e., THUMOS14, ActivityNet v1.2 and v1.3, demonstrate that our method is superior to the state-of-the-arts.

AB - Weakly supervised action localization is a challenging task that aims to localize action instances in untrimmed videos given only video-level supervision. Existing methods mostly distinguish action from background via attentive feature fusion with RGB and optical flow modalities. Unfortunately, this strategy fails to retain the distinct characteristics of each modality, leading to inaccurate localization under hard-to-discriminate cases such as action-context interference and in-action stationary period. As an action is typically comprised of multiple stages, an intuitive solution is to model the relation between the finer-grained action segments to obtain a more detailed analysis. In this paper, we propose a dynamic graph-based method, namely DGCNN, to explore the two-stream relation between action segments. To be specific, segments within a video which are likely to be actions are dynamically selected to construct an action graph. For each graph, a triplet adjacency matrix is devised to explore the temporal and contextual correlations between the pseudo action segments, which consists of three components, i.e., mutual importance, feature similarity, and high-level contextual similarity. The two-stream dynamic pseudo graphs, along with the pseudo background segments, are used to derive more detailed video representation. For action localization, a non-local based temporal refinement module is proposed to fully leverage the temporal consistency between consecutive segments. Experimental results on three datasets, i.e., THUMOS14, ActivityNet v1.2 and v1.3, demonstrate that our method is superior to the state-of-the-arts.

KW - dynamic graph modeling

KW - pseudo action generation

KW - temporal action localization

KW - weakly supervised learning

UR - http://www.scopus.com/inward/record.url?scp=85151161620&partnerID=8YFLogxK

U2 - 10.1145/3503161.3548077

DO - 10.1145/3503161.3548077

M3 - Conference contribution

AN - SCOPUS:85151161620

T3 - MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia

SP - 3820

EP - 3828

BT - MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia

PB - Association for Computing Machinery, Inc

T2 - 30th ACM International Conference on Multimedia, MM 2022

Y2 - 10 October 2022 through 14 October 2022

ER -

Shi H, Zhang XY, Li C, Gong L, Li Y, Bao Y. Dynamic Graph Modeling for Weakly-Supervised Temporal Action Localization. In MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia. Association for Computing Machinery, Inc. 2022. p. 3820-3828. (MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia). doi: 10.1145/3503161.3548077
