TY - GEN
T1 - Synthesizing Counterfactual Samples for Overcoming Moment Biases in Temporal Video Grounding
AU - Zhai, Mingliang
AU - Li, Chuanhao
AU - Jing, Chenchen
AU - Wu, Yuwei
N1 - Publisher Copyright:
© 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2022
Y1 - 2022
N2 - Moment bias is a critical issue in temporal video grounding (TVG), where models often exploit superficial correlations between language queries and moment locations as shortcuts to predict temporal boundaries. In this paper, we propose a model-agnostic method for synthesizing counterfactual samples that overcomes moment biases by endowing TVG models with sensitivity to linguistic and visual variations. Models with this sensitivity fully utilize linguistic information and focus on important video clips rather than fixed patterns, and are therefore not dominated by moment biases. Specifically, we synthesize counterfactual samples by masking important words in queries or deleting important frames in videos when training TVG models. During training, we penalize the model if it makes similar predictions on counterfactual and original samples, encouraging it to perceive linguistic and visual variations. Experimental results on two datasets (i.e., Charades-CD and ActivityNet-CD) demonstrate the effectiveness of our method.
AB - Moment bias is a critical issue in temporal video grounding (TVG), where models often exploit superficial correlations between language queries and moment locations as shortcuts to predict temporal boundaries. In this paper, we propose a model-agnostic method for synthesizing counterfactual samples that overcomes moment biases by endowing TVG models with sensitivity to linguistic and visual variations. Models with this sensitivity fully utilize linguistic information and focus on important video clips rather than fixed patterns, and are therefore not dominated by moment biases. Specifically, we synthesize counterfactual samples by masking important words in queries or deleting important frames in videos when training TVG models. During training, we penalize the model if it makes similar predictions on counterfactual and original samples, encouraging it to perceive linguistic and visual variations. Experimental results on two datasets (i.e., Charades-CD and ActivityNet-CD) demonstrate the effectiveness of our method.
KW - Counterfactual samples
KW - Moment biases
KW - Temporal video grounding
UR - http://www.scopus.com/inward/record.url?scp=85142697764&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-18907-4_34
DO - 10.1007/978-3-031-18907-4_34
M3 - Conference contribution
AN - SCOPUS:85142697764
SN - 9783031189067
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 436
EP - 448
BT - Pattern Recognition and Computer Vision - 5th Chinese Conference, PRCV 2022, Proceedings
A2 - Yu, Shiqi
A2 - Zhang, Jianguo
A2 - Zhang, Zhaoxiang
A2 - Tan, Tieniu
A2 - Yuen, Pong C.
A2 - Guo, Yike
A2 - Han, Junwei
A2 - Lai, Jianhuang
PB - Springer Science and Business Media Deutschland GmbH
T2 - 5th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2022
Y2 - 4 November 2022 through 7 November 2022
ER -