TY - GEN
T1 - Counterfactual Inference for Visual Relationship Detection in Videos
AU - Ji, Xiaofeng
AU - Chen, Jin
AU - Wu, Xinxiao
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Visual relationship detection in videos is a challenging task since it requires not only detecting static relationships but also inferring dynamic relationships. Recent progress has been made by enriching visual representations through appearance and motion fusion or spatial and temporal reasoning, but without exploring the intrinsic causality between representations and predictions. In this paper, we propose a novel counterfactual inference method for video relationship detection, which infers the causal effects of appearance, motion, and language features on the predictions of static and dynamic relationships. Specifically, we start by building a causal graph to represent the causality between features and relationship categories, then construct counterfactual scenes by intervening on the features to infer their effects on prediction, and finally incorporate the inferred effects into relationship categorization by adaptively learning the weights of appearance, motion, and language. Extensive experiments on two benchmark datasets demonstrate the effectiveness of our method.
AB - Visual relationship detection in videos is a challenging task since it requires not only detecting static relationships but also inferring dynamic relationships. Recent progress has been made by enriching visual representations through appearance and motion fusion or spatial and temporal reasoning, but without exploring the intrinsic causality between representations and predictions. In this paper, we propose a novel counterfactual inference method for video relationship detection, which infers the causal effects of appearance, motion, and language features on the predictions of static and dynamic relationships. Specifically, we start by building a causal graph to represent the causality between features and relationship categories, then construct counterfactual scenes by intervening on the features to infer their effects on prediction, and finally incorporate the inferred effects into relationship categorization by adaptively learning the weights of appearance, motion, and language. Extensive experiments on two benchmark datasets demonstrate the effectiveness of our method.
KW - Counterfactual Inference
KW - Video Relationship Detection
KW - Video Understanding
UR - http://www.scopus.com/inward/record.url?scp=85171154518&partnerID=8YFLogxK
U2 - 10.1109/ICME55011.2023.00036
DO - 10.1109/ICME55011.2023.00036
M3 - Conference contribution
AN - SCOPUS:85171154518
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
SP - 162
EP - 167
BT - Proceedings - 2023 IEEE International Conference on Multimedia and Expo, ICME 2023
PB - IEEE Computer Society
T2 - 2023 IEEE International Conference on Multimedia and Expo, ICME 2023
Y2 - 10 July 2023 through 14 July 2023
ER -