An Inverse Partial Optimal Transport Framework for Music-guided Trailer Generation

Yutong Wang; Sidan Zhu; Hongteng Xu; Dixin Luo

doi:10.1145/3664647.3680751

Yutong Wang, Sidan Zhu, Hongteng Xu, Dixin Luo^*

^*Corresponding author for this work

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

Abstract

Trailer generation is a challenging video clipping task that aims to select highlighting shots from long videos like movies and re-organize them in an attractive way. In this study, we propose an inverse partial optimal transport (IPOT) framework to achieve music-guided movie trailer generation. In particular, we formulate the trailer generation task as selecting and sorting key movie shots based on audio shots, which involves matching the latent representations across visual and acoustic modalities. We learn a multi-modal latent representation model in the proposed IPOT framework to achieve this aim. In this framework, a two-tower encoder derives the latent representations of movie and music shots, respectively, and an attention-assisted Sinkhorn matching network parameterizes the grounding distance between the shots' latent representations and the distribution of the movie shots. Taking the correspondence between the movie shots and its trailer music shots as the observed optimal transport plan defined on the grounding distances, we learn the model by solving an inverse partial optimal transport problem, leading to a bi-level optimization strategy. We collect real-world movies and their trailers to construct a dataset with abundant label information called CMTD and, accordingly, train and evaluate various automatic trailer generators. Compared with state-of-the-art methods, our IPOT method consistently shows superiority in subjective visual effects and objective quantitative measurements. The code is available at https://github.com/Dixin-Lab/Automatic-Movie-Trailer-Generator.
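The matching step described in the abstract can be illustrated with a small forward partial optimal transport problem. The following is a minimal NumPy sketch, not the authors' implementation (their code is in the repository linked above): it solves an entropic partial transport problem with plain Sinkhorn iterations and the standard dummy-point extension, whereas the paper learns the cost itself by solving the inverse problem. The function names `sinkhorn` and `partial_sinkhorn`, the hand-made cost matrix, the uniform shot masses, and the values of `reg` and `n_iters` are all assumptions made for illustration.

```python
import numpy as np


def sinkhorn(cost, a, b, reg=0.1, n_iters=1000):
    """Entropic OT between histograms a and b via plain Sinkhorn scaling."""
    kernel = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)   # rescale to match column marginals
        u = a / (kernel @ v)     # rescale to match row marginals
    return u[:, None] * kernel * v[None, :]


def partial_sinkhorn(cost, a, b, mass, reg=0.1, n_iters=1000):
    """Transport only `mass` units by adding one dummy row and one dummy column.

    Unmatched mass is sent to the dummy points at zero cost; the
    dummy-to-dummy cell gets a large cost so it stays (almost) empty.
    """
    n, m = cost.shape
    ext = np.zeros((n + 1, m + 1))
    ext[:n, :m] = cost
    ext[n, m] = 10.0 * cost.max() + 1.0
    a_ext = np.append(a, b.sum() - mass)
    b_ext = np.append(b, a.sum() - mass)
    plan = sinkhorn(ext, a_ext, b_ext, reg, n_iters)
    return plan[:n, :m]          # drop the dummy row and column


# Toy setting (hypothetical): 6 movie shots, 3 music shots.
# Low cost means the movie shot fits the music shot well.
n_movie, n_music = 6, 3
cost = np.ones((n_movie, n_music))
cost[4, 0] = cost[1, 1] = cost[3, 2] = 0.0

a = np.full(n_movie, 1.0 / n_movie)   # mass of each movie shot
b = np.full(n_music, 1.0 / n_movie)   # each music shot needs one movie shot
plan = partial_sinkhorn(cost, a, b, mass=b.sum())

# For each music shot (in trailer order), pick the best-matching movie shot.
selected = plan.argmax(axis=0)
print(selected)
```

In this toy setting only half of the movie-shot mass is transported, so three of the six movie shots are selected, and reading the selection in music-shot order gives the ordering of the trailer. Movie shots 4, 1 and 3 are picked because they are the only zero-cost matches.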

Original language	English
Title of host publication	MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
Publisher	Association for Computing Machinery, Inc
Pages	9739-9748
Number of pages	10
ISBN (Electronic)	9798400706868
DOI	https://doi.org/10.1145/3664647.3680751
Publication status	Published - 28 Oct 2024
Event	32nd ACM International Conference on Multimedia, MM 2024 - Melbourne, Australia; Duration: 28 Oct 2024 → 1 Nov 2024

Publication series

Name	MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia

Conference

Conference	32nd ACM International Conference on Multimedia, MM 2024
Country/Territory	Australia
City	Melbourne
Period	28/10/24 → 1/11/24

Cite this

Wang, Y., Zhu, S., Xu, H., & Luo, D. (2024). An Inverse Partial Optimal Transport Framework for Music-guided Trailer Generation. In MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia (pp. 9739-9748). (MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia). Association for Computing Machinery, Inc. https://doi.org/10.1145/3664647.3680751

@inproceedings{2da9a292450c403da9951555df19f9f0,

title = "An Inverse Partial Optimal Transport Framework for Music-guided Trailer Generation",

abstract = "Trailer generation is a challenging video clipping task that aims to select highlighting shots from long videos like movies and re-organize them in an attractive way. In this study, we propose an inverse partial optimal transport (IPOT) framework to achieve music-guided movie trailer generation. In particular, we formulate the trailer generation task as selecting and sorting key movie shots based on audio shots, which involves matching the latent representations across visual and acoustic modalities. We learn a multi-modal latent representation model in the proposed IPOT framework to achieve this aim. In this framework, a two-tower encoder derives the latent representations of movie and music shots, respectively, and an attention-assisted Sinkhorn matching network parameterizes the grounding distance between the shots' latent representations and the distribution of the movie shots. Taking the correspondence between the movie shots and its trailer music shots as the observed optimal transport plan defined on the grounding distances, we learn the model by solving an inverse partial optimal transport problem, leading to a bi-level optimization strategy. We collect real-world movies and their trailers to construct a dataset with abundant label information called CMTD and, accordingly, train and evaluate various automatic trailer generators. Compared with state-of-the-art methods, our IPOT method consistently shows superiority in subjective visual effects and objective quantitative measurements. The code is available at https://github.com/Dixin-Lab/Automatic-Movie-Trailer-Generator.",

keywords = "inverse optimal transport, movie-trailer dataset, trailer generation, video clipping",

author = "Yutong Wang and Sidan Zhu and Hongteng Xu and Dixin Luo",

note = "Publisher Copyright: {\textcopyright} 2024 ACM.; 32nd ACM International Conference on Multimedia, MM 2024 ; Conference date: 28-10-2024 Through 01-11-2024",

year = "2024",

month = oct,

day = "28",

doi = "10.1145/3664647.3680751",

language = "English",

series = "MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia",

publisher = "Association for Computing Machinery, Inc",

pages = "9739--9748",

booktitle = "MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia",

}

Wang, Y, Zhu, S, Xu, H & Luo, D 2024, An Inverse Partial Optimal Transport Framework for Music-guided Trailer Generation. in MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia. MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia, Association for Computing Machinery, Inc, pp. 9739-9748, 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, Australia, 28/10/24. https://doi.org/10.1145/3664647.3680751

An Inverse Partial Optimal Transport Framework for Music-guided Trailer Generation. / Wang, Yutong; Zhu, Sidan; Xu, Hongteng et al.
MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia. Association for Computing Machinery, Inc, 2024. p. 9739-9748 (MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia).

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

TY - GEN

T1 - An Inverse Partial Optimal Transport Framework for Music-guided Trailer Generation

AU - Wang, Yutong

AU - Zhu, Sidan

AU - Xu, Hongteng

AU - Luo, Dixin

PY - 2024/10/28

Y1 - 2024/10/28

N2 - Trailer generation is a challenging video clipping task that aims to select highlighting shots from long videos like movies and re-organize them in an attractive way. In this study, we propose an inverse partial optimal transport (IPOT) framework to achieve music-guided movie trailer generation. In particular, we formulate the trailer generation task as selecting and sorting key movie shots based on audio shots, which involves matching the latent representations across visual and acoustic modalities. We learn a multi-modal latent representation model in the proposed IPOT framework to achieve this aim. In this framework, a two-tower encoder derives the latent representations of movie and music shots, respectively, and an attention-assisted Sinkhorn matching network parameterizes the grounding distance between the shots' latent representations and the distribution of the movie shots. Taking the correspondence between the movie shots and its trailer music shots as the observed optimal transport plan defined on the grounding distances, we learn the model by solving an inverse partial optimal transport problem, leading to a bi-level optimization strategy. We collect real-world movies and their trailers to construct a dataset with abundant label information called CMTD and, accordingly, train and evaluate various automatic trailer generators. Compared with state-of-the-art methods, our IPOT method consistently shows superiority in subjective visual effects and objective quantitative measurements. The code is available at https://github.com/Dixin-Lab/Automatic-Movie-Trailer-Generator.

AB - Trailer generation is a challenging video clipping task that aims to select highlighting shots from long videos like movies and re-organize them in an attractive way. In this study, we propose an inverse partial optimal transport (IPOT) framework to achieve music-guided movie trailer generation. In particular, we formulate the trailer generation task as selecting and sorting key movie shots based on audio shots, which involves matching the latent representations across visual and acoustic modalities. We learn a multi-modal latent representation model in the proposed IPOT framework to achieve this aim. In this framework, a two-tower encoder derives the latent representations of movie and music shots, respectively, and an attention-assisted Sinkhorn matching network parameterizes the grounding distance between the shots' latent representations and the distribution of the movie shots. Taking the correspondence between the movie shots and its trailer music shots as the observed optimal transport plan defined on the grounding distances, we learn the model by solving an inverse partial optimal transport problem, leading to a bi-level optimization strategy. We collect real-world movies and their trailers to construct a dataset with abundant label information called CMTD and, accordingly, train and evaluate various automatic trailer generators. Compared with state-of-the-art methods, our IPOT method consistently shows superiority in subjective visual effects and objective quantitative measurements. The code is available at https://github.com/Dixin-Lab/Automatic-Movie-Trailer-Generator.

KW - inverse optimal transport

KW - movie-trailer dataset

KW - trailer generation

KW - video clipping

UR - http://www.scopus.com/inward/record.url?scp=85209799181&partnerID=8YFLogxK

U2 - 10.1145/3664647.3680751

DO - 10.1145/3664647.3680751

M3 - Conference contribution

AN - SCOPUS:85209799181

T3 - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia

SP - 9739

EP - 9748

BT - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia

PB - Association for Computing Machinery, Inc

T2 - 32nd ACM International Conference on Multimedia, MM 2024

Y2 - 28 October 2024 through 1 November 2024

ER -
