Let storytelling tell vivid stories: A multi-modal-agent-based unified storytelling framework

Rongsheng Zhang*, Jiji Tang, Chuanqi Zang, Mingtao Pei, Wei Liang, Zeng Zhao, Zhou Zhao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Storytelling via image sequences demands the creation of compelling and vivid narratives that adhere to the visual content while maintaining engaging plot divergence. Previous works have incrementally refined the alignment of multiple modalities, yet often produce simplistic storylines for image sequences. In this study, we introduce the LLaMS framework, designed to generate multi-modal, human-level stories characterized by expressiveness and consistency. Our approach employs a multi-modal agent framework to enrich the expression of factual content and a textual reasoning architecture for expressive story generation and prediction. Furthermore, we introduce the Story-Adapter module, tailored for long image-sequence illustration, which focuses on maintaining prolonged story consistency rather than short-term object consistency. Extensive experiments with human evaluation validate the superior performance of the proposed LLaMS. Evaluations demonstrate that LLaMS achieves state-of-the-art storytelling performance, with an 86% correlation win rate and a 100% consistency win rate compared to prior state-of-the-art methods. Additionally, ablation experiments confirm the efficacy of the proposed multi-modal agent framework and Story-Adapter module. Our code is accessible at https://anonymous.4open.science/status/LLams-FF83.
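The abstract describes a pipeline of three stages: a multi-modal agent step that extracts factual content from each image, a textual reasoning step that turns those facts into an expressive story and a predicted continuation, and a Story-Adapter step that illustrates long sequences with story-level consistency. The sketch below is only an illustration of that described flow; every function name, prompt, and return type is a hypothetical stand-in, not the authors' implementation or API.

```python
# Illustrative sketch only: the function names below are hypothetical stand-ins
# for the components described in the abstract, not the authors' released code.
from dataclasses import dataclass
from typing import List


@dataclass
class StoryOutput:
    sentences: List[str]      # one expressive sentence per input image
    continuation: str         # predicted plot continuation beyond the sequence
    illustrations: List[str]  # handles to generated illustrations


def describe_images(image_paths: List[str]) -> List[str]:
    """Multi-modal agent step (hypothetical): gather factual content per image,
    e.g. objects, actions, and scene, before any narrative is written."""
    return [f"factual description of {p}" for p in image_paths]


def reason_story(facts: List[str]) -> List[str]:
    """Textual reasoning step (hypothetical): an LLM turns per-image facts into
    a coherent, expressive narrative, one sentence per image."""
    return [f"Expressive sentence grounded in: {f}" for f in facts]


def predict_continuation(story: List[str]) -> str:
    """Story prediction step (hypothetical): extend the plot beyond the given images."""
    return "Predicted next plot beat based on the story so far."


def illustrate(story: List[str]) -> List[str]:
    """Story-Adapter step (hypothetical): condition image generation on the whole
    story, aiming for long-range narrative consistency rather than only
    short-term object consistency."""
    return [f"illustration_{i}.png" for i, _ in enumerate(story)]


def run_pipeline(image_paths: List[str]) -> StoryOutput:
    facts = describe_images(image_paths)
    story = reason_story(facts)
    return StoryOutput(
        sentences=story,
        continuation=predict_continuation(story),
        illustrations=illustrate(story),
    )


if __name__ == "__main__":
    result = run_pipeline(["img_0.jpg", "img_1.jpg", "img_2.jpg"])
    print(result.sentences, result.continuation, result.illustrations, sep="\n")
```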

Original language: English
Article number: 129316
Journal: Neurocomputing
Volume: 622
DOI
Publication status: Published - 14 Mar 2025
