TY - GEN
T1 - Dynamic Graph Modeling for Weakly-Supervised Temporal Action Localization
AU - Shi, Haichao
AU - Zhang, Xiao Yu
AU - Li, Changsheng
AU - Gong, Lixing
AU - Li, Yong
AU - Bao, Yongjun
N1 - Publisher Copyright:
© 2022 Owner/Author.
PY - 2022/10/10
Y1 - 2022/10/10
N2 - Weakly supervised action localization is a challenging task that aims to localize action instances in untrimmed videos given only video-level supervision. Existing methods mostly distinguish actions from background via attentive feature fusion of the RGB and optical flow modalities. Unfortunately, this strategy fails to retain the distinct characteristics of each modality, leading to inaccurate localization in hard-to-discriminate cases such as action-context interference and in-action stationary periods. As an action typically comprises multiple stages, an intuitive solution is to model the relations between finer-grained action segments to obtain a more detailed analysis. In this paper, we propose a dynamic graph-based method, namely DGCNN, to explore the two-stream relations between action segments. Specifically, segments within a video that are likely to be actions are dynamically selected to construct an action graph. For each graph, a triplet adjacency matrix, consisting of three components, i.e., mutual importance, feature similarity, and high-level contextual similarity, is devised to explore the temporal and contextual correlations between the pseudo action segments. The two-stream dynamic pseudo graphs, along with the pseudo background segments, are used to derive a more detailed video representation. For action localization, a non-local-based temporal refinement module is proposed to fully leverage the temporal consistency between consecutive segments. Experimental results on three datasets, i.e., THUMOS14, ActivityNet v1.2, and ActivityNet v1.3, demonstrate that our method is superior to state-of-the-art methods.
AB - Weakly supervised action localization is a challenging task that aims to localize action instances in untrimmed videos given only video-level supervision. Existing methods mostly distinguish actions from background via attentive feature fusion of the RGB and optical flow modalities. Unfortunately, this strategy fails to retain the distinct characteristics of each modality, leading to inaccurate localization in hard-to-discriminate cases such as action-context interference and in-action stationary periods. As an action typically comprises multiple stages, an intuitive solution is to model the relations between finer-grained action segments to obtain a more detailed analysis. In this paper, we propose a dynamic graph-based method, namely DGCNN, to explore the two-stream relations between action segments. Specifically, segments within a video that are likely to be actions are dynamically selected to construct an action graph. For each graph, a triplet adjacency matrix, consisting of three components, i.e., mutual importance, feature similarity, and high-level contextual similarity, is devised to explore the temporal and contextual correlations between the pseudo action segments. The two-stream dynamic pseudo graphs, along with the pseudo background segments, are used to derive a more detailed video representation. For action localization, a non-local-based temporal refinement module is proposed to fully leverage the temporal consistency between consecutive segments. Experimental results on three datasets, i.e., THUMOS14, ActivityNet v1.2, and ActivityNet v1.3, demonstrate that our method is superior to state-of-the-art methods.
KW - dynamic graph modeling
KW - pseudo action generation
KW - temporal action localization
KW - weakly supervised learning
UR - http://www.scopus.com/inward/record.url?scp=85151161620&partnerID=8YFLogxK
U2 - 10.1145/3503161.3548077
DO - 10.1145/3503161.3548077
M3 - Conference contribution
AN - SCOPUS:85151161620
T3 - MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia
SP - 3820
EP - 3828
BT - MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
T2 - 30th ACM International Conference on Multimedia, MM 2022
Y2 - 10 October 2022 through 14 October 2022
ER -