TY - JOUR
T1 - Let storytelling tell vivid stories: A multi-modal-agent-based unified storytelling framework
AU - Zhang, Rongsheng
AU - Tang, Jiji
AU - Zang, Chuanqi
AU - Pei, Mingtao
AU - Liang, Wei
AU - Zhao, Zeng
AU - Zhao, Zhou
N1 - Publisher Copyright:
© 2025 Elsevier B.V.
PY - 2025/3/14
Y1 - 2025/3/14
N2 - Storytelling via image sequences demands compelling, vivid narratives that adhere to the visual content while maintaining engaging plot divergence. Previous works have incrementally refined the alignment of multiple modalities, yet often produced simplistic storylines for image sequences. In this study, we introduce the LLaMS framework, designed to generate multi-modal, human-level stories characterized by expressiveness and consistency. Our approach uses a multi-modal agent framework to elevate the expression of factual content and applies a textual reasoning architecture for expressive story generation and prediction. Furthermore, we introduce the Story-Adapter module, tailored for long image sequence illustration, which focuses on maintaining long-term story consistency rather than short-term object consistency. Extensive experiments with human evaluation validate the performance of the proposed LLaMS: it achieves state-of-the-art storytelling results, with an 86% win rate on correlation and a 100% win rate on consistency against prior state-of-the-art methods. Additionally, ablation experiments confirm the efficacy of the proposed multi-modal agent framework and Story-Adapter module. Our code is accessible at https://anonymous.4open.science/status/LLams-FF83.
KW - Agent framework
KW - Large language model
KW - Long story consistency
KW - Multi-modal storytelling
UR - http://www.scopus.com/inward/record.url?scp=85214332565&partnerID=8YFLogxK
U2 - 10.1016/j.neucom.2024.129316
DO - 10.1016/j.neucom.2024.129316
M3 - Article
AN - SCOPUS:85214332565
SN - 0925-2312
VL - 622
JO - Neurocomputing
JF - Neurocomputing
M1 - 129316
ER -