TY - GEN
T1 - Weakly-Supervised Movie Trailer Generation Driven by Multi-Modal Semantic Consistency
AU - Zhu, Sidan
AU - Wang, Yutong
AU - Xu, Hongteng
AU - Luo, Dixin
N1 - Publisher Copyright:
© 2025 International Joint Conferences on Artificial Intelligence. All rights reserved.
PY - 2025
Y1 - 2025
N2 - As an essential movie promotional tool, trailers are designed to capture the audience's interest through the skillful editing of key movie shots. Although some attempts have been made for automatic trailer generation, existing methods often rely on predefined rules or manual fine-grained annotations and fail to fully leverage the multi-modal information of movies, resulting in unsatisfactory trailer generation results. In this study, we introduce a weakly-supervised trailer generation method driven by multi-modal semantic consistency. Specifically, we design a multi-modal trailer generation framework that selects and sorts key movie shots based on input music and movie metadata (e.g., category tags and plot keywords) and adds narration to the generated trailer based on movie subtitles. We utilize two pseudo-scores derived from the proposed framework as labels and thus train the model under a weakly-supervised learning paradigm, ensuring trailerness consistency for key shot selection and emotion consistency for key shot sorting, respectively. As a result, we can learn the proposed model solely based on movie-trailer pairs without any fine-grained annotations. Both objective experimental results and subjective user studies demonstrate the superior performance of our method over previous works. The code is available at https://github.com/Dixin-Lab/MMSC.
AB - As an essential movie promotional tool, trailers are designed to capture the audience's interest through the skillful editing of key movie shots. Although some attempts have been made for automatic trailer generation, existing methods often rely on predefined rules or manual fine-grained annotations and fail to fully leverage the multi-modal information of movies, resulting in unsatisfactory trailer generation results. In this study, we introduce a weakly-supervised trailer generation method driven by multi-modal semantic consistency. Specifically, we design a multi-modal trailer generation framework that selects and sorts key movie shots based on input music and movie metadata (e.g., category tags and plot keywords) and adds narration to the generated trailer based on movie subtitles. We utilize two pseudo-scores derived from the proposed framework as labels and thus train the model under a weakly-supervised learning paradigm, ensuring trailerness consistency for key shot selection and emotion consistency for key shot sorting, respectively. As a result, we can learn the proposed model solely based on movie-trailer pairs without any fine-grained annotations. Both objective experimental results and subjective user studies demonstrate the superior performance of our method over previous works. The code is available at https://github.com/Dixin-Lab/MMSC.
UR - https://www.scopus.com/pages/publications/105021820999
U2 - 10.24963/ijcai.2025/1137
DO - 10.24963/ijcai.2025/1137
M3 - Conference contribution
AN - SCOPUS:105021820999
T3 - IJCAI International Joint Conference on Artificial Intelligence
SP - 10234
EP - 10242
BT - Proceedings of the 34th International Joint Conference on Artificial Intelligence, IJCAI 2025
A2 - Kwok, James
PB - International Joint Conferences on Artificial Intelligence
T2 - 34th International Joint Conference on Artificial Intelligence, IJCAI 2025
Y2 - 16 August 2025 through 22 August 2025
ER -