Probability Distribution Based Frame-supervised Language-driven Action Localization

Shuo Yang; Zirui Shang; Xinxiao Wu

doi:10.1145/3581783.3612512

Shuo Yang, Zirui Shang, Xinxiao Wu^*

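^*Corresponding author for this work

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

2 Citations (Scopus)

Abstract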

Frame-supervised language-driven action localization aims to localize the boundaries of the action in an untrimmed video that corresponds to an input natural language query, given only a single annotated frame within the target action during training. The task is challenging because complete and accurate annotations of action boundaries are absent, which hinders visual-language alignment and action boundary prediction. To address this challenge, we propose a novel method that introduces distribution functions to model both the probability of a frame being an action frame and the probability of it being a boundary frame. Specifically, we assign each video frame a probability of being an action frame based on the estimated shape parameters of the distribution function; this probability serves as a foreground pseudo-label that guides cross-modal feature learning. Moreover, we model the probabilities of the start frame and the end frame of the target action using different distribution functions, and then estimate the probability of each action candidate being a positive candidate based on its start and end boundaries, which facilitates predicting action boundaries by exploring more positive terms in training. Experiments on two benchmark datasets demonstrate that our method outperforms existing methods, achieving a gain of more than 10% in R1@0.5 on the challenging TACoS dataset. These results emphasize the significance of generating pseudo labels with appropriate probabilities via distribution functions to address the challenge of frame-supervised language-driven action localization.
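
The idea of turning a single annotated frame into soft pseudo-labels can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which distribution functions are used or how their shape parameters are estimated. The sketch assumes Gaussian curves with hand-picked widths and boundary centers, and it assumes the start and end probabilities combine by a simple product. All function names and parameters are hypothetical.

```python
import math


def gaussian_foreground_labels(num_frames, annotated_idx, sigma):
    """Soft foreground pseudo-labels (assumed Gaussian shape).

    The annotated frame gets label 1.0; labels decay with distance from it.
    sigma is a hand-picked width here, not an estimated shape parameter.
    """
    return [
        math.exp(-((t - annotated_idx) ** 2) / (2.0 * sigma ** 2))
        for t in range(num_frames)
    ]


def boundary_probs(num_frames, center, sigma):
    """Normalized probability of each frame being a boundary (assumed Gaussian)."""
    weights = [
        math.exp(-((t - center) ** 2) / (2.0 * sigma ** 2))
        for t in range(num_frames)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def candidate_positive_prob(start_probs, end_probs, start, end):
    """Score a candidate [start, end] as the product of its boundary probabilities.

    Treating the two boundaries as independent is an assumption of this sketch.
    """
    if start > end:
        return 0.0
    return start_probs[start] * end_probs[end]


# Hypothetical 20-frame video with the single annotated frame at index 10.
num_frames = 20
annotated = 10
fg = gaussian_foreground_labels(num_frames, annotated, sigma=3.0)
# The boundary centers (4 frames either side) are illustrative guesses.
start_p = boundary_probs(num_frames, center=annotated - 4, sigma=2.0)
end_p = boundary_probs(num_frames, center=annotated + 4, sigma=2.0)
tight = candidate_positive_prob(start_p, end_p, 6, 14)
loose = candidate_positive_prob(start_p, end_p, 0, 19)
print(tight > loose)
```

Under these assumptions, a candidate whose boundaries sit near the assumed start and end centers scores higher than one spanning the whole video, so several plausible candidates can receive non-zero positive weight instead of a single hard label.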

Original language	English
Title of host publication	MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
Publisher	Association for Computing Machinery, Inc
Pages	5164-5173
Number of pages	10
ISBN (Electronic)	9798400701085
DOI	https://doi.org/10.1145/3581783.3612512
Publication status	Published - 26 Oct 2023
Event	31st ACM International Conference on Multimedia, MM 2023 - Ottawa, Canada. Duration: 29 Oct 2023 → 3 Nov 2023

Publication series

Name	MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia

Conference

Conference	31st ACM International Conference on Multimedia, MM 2023
Country/Territory	Canada
City	Ottawa
Period	29/10/23 → 3/11/23

Cite this

Yang, S., Shang, Z., & Wu, X. (2023). Probability Distribution Based Frame-supervised Language-driven Action Localization. In MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia (pp. 5164-5173). (MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia). Association for Computing Machinery, Inc. https://doi.org/10.1145/3581783.3612512

@inproceedings{f8a6515a4c274c58a1d53fe2e469f862,

title = "Probability Distribution Based Frame-supervised Language-driven Action Localization",

keywords = "distribution, frame-supervised, language-driven action localization, video moment retrieval",

author = "Shuo Yang and Zirui Shang and Xinxiao Wu",

note = "Publisher Copyright: {\textcopyright} 2023 ACM.; 31st ACM International Conference on Multimedia, MM 2023 ; Conference date: 29-10-2023 Through 03-11-2023",

year = "2023",

month = oct,

day = "26",

doi = "10.1145/3581783.3612512",

language = "English",

series = "MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia",

publisher = "Association for Computing Machinery, Inc",

pages = "5164--5173",

booktitle = "MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia",

}

Yang, S, Shang, Z & Wu, X 2023, Probability Distribution Based Frame-supervised Language-driven Action Localization. In MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia. MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia, Association for Computing Machinery, Inc, pp. 5164-5173, 31st ACM International Conference on Multimedia, MM 2023, Ottawa, Canada, 29/10/23. https://doi.org/10.1145/3581783.3612512

Probability Distribution Based Frame-supervised Language-driven Action Localization. / Yang, Shuo; Shang, Zirui; Wu, Xinxiao.
MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia. Association for Computing Machinery, Inc, 2023. pp. 5164-5173 (MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia).

TY - GEN

T1 - Probability Distribution Based Frame-supervised Language-driven Action Localization

AU - Yang, Shuo

AU - Shang, Zirui

AU - Wu, Xinxiao

PY - 2023/10/26

Y1 - 2023/10/26

KW - distribution

KW - frame-supervised

KW - language-driven action localization

KW - video moment retrieval

UR - http://www.scopus.com/inward/record.url?scp=85179557043&partnerID=8YFLogxK

U2 - 10.1145/3581783.3612512

DO - 10.1145/3581783.3612512

M3 - Conference contribution

AN - SCOPUS:85179557043

T3 - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia

SP - 5164

EP - 5173

BT - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia

PB - Association for Computing Machinery, Inc

T2 - 31st ACM International Conference on Multimedia, MM 2023

Y2 - 29 October 2023 through 3 November 2023

ER -
