Rethinking the Video Sampling and Reasoning Strategies for Temporal Sentence Grounding

Jiahao Zhu, Daizong Liu^*, Pan Zhou^*, Xing Di, Yu Cheng, Song Yang, Wenzheng Xu, Zichuan Xu, Yao Wan, Lichao Sun, Zeyu Xiong

^*Corresponding author for this work

School of Computer Science

Research output: Contribution to conference › Paper › Peer-reviewed

6 Citations (Scopus)

Abstract

Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query. All existing works first utilize a sparse sampling strategy to extract a fixed number of video frames and then conduct multimodal interactions with the query sentence for reasoning. However, we argue that these methods have overlooked two indispensable issues: 1) Boundary-bias: The annotated target segment generally refers to two specific frames as the corresponding start and end timestamps. The video downsampling process may lose these two frames and take the adjacent irrelevant frames as new boundaries. 2) Reasoning-bias: Such incorrect new boundary frames also lead to reasoning bias during frame-query interaction, reducing the generalization ability of the model. To alleviate the above limitations, in this paper, we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames to enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationship among these frames and generate soft labels on boundaries for more accurate frame-query reasoning. Such a mechanism is also able to supplement the absent consecutive visual semantics to the sampled sparse frames for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.
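
As a rough illustration of the boundary-bias issue the abstract describes, the following Python sketch uniformly samples a fixed number of frame indices and snaps each annotated boundary to the nearest sampled frame. The function names, the 1000-frame video, the 8-frame budget, and the annotated boundaries (200 and 500) are all hypothetical values chosen for illustration; this is not the paper's SSRN implementation or its actual sampling code.

```python
def uniform_sample_indices(num_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly over [0, num_frames - 1]."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples == 1:
        return [0]
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def nearest_sampled(index: int, sampled: list[int]) -> int:
    """Return the sampled frame index closest to the given original index."""
    return min(sampled, key=lambda s: abs(s - index))


# Hypothetical video and annotation (assumed values, not from the paper).
sampled = uniform_sample_indices(num_frames=1000, num_samples=8)
annotated_start, annotated_end = 200, 500

# After sparse sampling, the annotated frames are gone; the closest
# surviving frames become the effective boundaries.
new_start = nearest_sampled(annotated_start, sampled)
new_end = nearest_sampled(annotated_end, sampled)

print("sampled indices:", sampled)
print("annotated boundaries:", (annotated_start, annotated_end))
print("boundaries after sampling:", (new_start, new_end))
print("shift in frames:", (new_start - annotated_start, new_end - annotated_end))
```

With these assumed numbers, neither annotated frame survives sampling, and the effective boundaries move to frames 143 and 571, which is the kind of shift the abstract refers to as boundary-bias.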

Original language	English
Pages	590-600
Number of pages	11
Publication status	Published - 2022
Event	2022 Findings of the Association for Computational Linguistics: EMNLP 2022 - Abu Dhabi, United Arab Emirates. Duration: 7 Dec 2022 → 11 Dec 2022

Conference

Conference	2022 Findings of the Association for Computational Linguistics: EMNLP 2022
Country/Territory	United Arab Emirates
City	Abu Dhabi
Period	7/12/22 → 11/12/22

Other files and links

Link to publication in Scopus

Cite this

Zhu, J., Liu, D., Zhou, P., Di, X., Cheng, Y., Yang, S., Xu, W., Xu, Z., Wan, Y., Sun, L., & Xiong, Z. (2022). Rethinking the Video Sampling and Reasoning Strategies for Temporal Sentence Grounding. 590-600. Paper presented at 2022 Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates.

@conference{45990cb37c6e4c0a95bfc3e10e5a2168,

title = "Rethinking the Video Sampling and Reasoning Strategies for Temporal Sentence Grounding",

booktitle = "2022 Findings of the Association for Computational Linguistics: EMNLP 2022",

abstract = "Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query. All existing works first utilize a sparse sampling strategy to extract a fixed number of video frames and then conduct multimodal interactions with query sentence for reasoning. However, we argue that these methods have overlooked two indispensable issues: 1) Boundary-bias: The annotated target segment generally refers to two specific frames as corresponding start and end timestamps. The video downsampling process may lose these two frames and take the adjacent irrelevant frames as new boundaries. 2) Reasoning-bias: Such incorrect new boundary frames also lead to the reasoning bias during frame-query interaction, reducing the generalization ability of model. To alleviate above limitations, in this paper, we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames to enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationship among these frames and generate soft labels on boundaries for more accurate frame-query reasoning. Such mechanism is also able to supplement the absent consecutive visual semantics to the sampled sparse frames for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.",

author = "Jiahao Zhu and Daizong Liu and Pan Zhou and Xing Di and Yu Cheng and Song Yang and Wenzheng Xu and Zichuan Xu and Yao Wan and Lichao Sun and Zeyu Xiong",

note = "Publisher Copyright: {\textcopyright} 2022 Association for Computational Linguistics.; 2022 Findings of the Association for Computational Linguistics: EMNLP 2022 ; Conference date: 07-12-2022 Through 11-12-2022",

year = "2022",

language = "English",

pages = "590--600",

}

TY - CONF

T1 - Rethinking the Video Sampling and Reasoning Strategies for Temporal Sentence Grounding

AU - Zhu, Jiahao

AU - Liu, Daizong

AU - Zhou, Pan

AU - Di, Xing

AU - Cheng, Yu

AU - Yang, Song

AU - Xu, Wenzheng

AU - Xu, Zichuan

AU - Wan, Yao

AU - Sun, Lichao

AU - Xiong, Zeyu

PY - 2022

Y1 - 2022

N2 - Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query. All existing works first utilize a sparse sampling strategy to extract a fixed number of video frames and then conduct multimodal interactions with query sentence for reasoning. However, we argue that these methods have overlooked two indispensable issues: 1) Boundary-bias: The annotated target segment generally refers to two specific frames as corresponding start and end timestamps. The video downsampling process may lose these two frames and take the adjacent irrelevant frames as new boundaries. 2) Reasoning-bias: Such incorrect new boundary frames also lead to the reasoning bias during frame-query interaction, reducing the generalization ability of model. To alleviate above limitations, in this paper, we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames to enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationship among these frames and generate soft labels on boundaries for more accurate frame-query reasoning. Such mechanism is also able to supplement the absent consecutive visual semantics to the sampled sparse frames for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.

AB - Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query. All existing works first utilize a sparse sampling strategy to extract a fixed number of video frames and then conduct multimodal interactions with query sentence for reasoning. However, we argue that these methods have overlooked two indispensable issues: 1) Boundary-bias: The annotated target segment generally refers to two specific frames as corresponding start and end timestamps. The video downsampling process may lose these two frames and take the adjacent irrelevant frames as new boundaries. 2) Reasoning-bias: Such incorrect new boundary frames also lead to the reasoning bias during frame-query interaction, reducing the generalization ability of model. To alleviate above limitations, in this paper, we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames to enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationship among these frames and generate soft labels on boundaries for more accurate frame-query reasoning. Such mechanism is also able to supplement the absent consecutive visual semantics to the sampled sparse frames for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.

UR - http://www.scopus.com/inward/record.url?scp=85148988979&partnerID=8YFLogxK

M3 - Paper

AN - SCOPUS:85148988979

SP - 590

EP - 600

T2 - 2022 Findings of the Association for Computational Linguistics: EMNLP 2022

Y2 - 7 December 2022 through 11 December 2022

ER -
