Let storytelling tell vivid stories: A multi-modal-agent-based unified storytelling framework

Rongsheng Zhang*, Jiji Tang, Chuanqi Zang, Mingtao Pei, Wei Liang, Zeng Zhao, Zhou Zhao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Storytelling via image sequences demands the creation of compelling and vivid narratives that adhere to the visual content while maintaining engaging plot divergence. Previous works have incrementally refined the alignment of multiple modalities, yet often produce simplistic storylines for image sequences. In this study, we introduce the LLaMS framework, designed to generate multi-modal, human-level stories that are both expressive and consistent. Our approach uses a multi-modal agent framework to elevate the expression of factual content and a textual reasoning architecture for expressive story generation and prediction. Furthermore, we introduce the Story-Adapter module, tailored for illustrating long image sequences, which focuses on maintaining long-term story consistency rather than short-term object consistency. Extensive human evaluations validate the superior performance of the proposed LLaMS: it achieves state-of-the-art storytelling performance, with an 86% win rate for correlation and a 100% win rate for consistency compared to prior state-of-the-art methods. Additionally, ablation experiments confirm the efficacy of the proposed multi-modal agent framework and Story-Adapter module. Our code is accessible at https://anonymous.4open.science/status/LLams-FF83.
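The abstract describes a three-stage pipeline: a multi-modal agent grounds the narrative in the visual facts of each image, a textual reasoning component composes and extends the expressive storyline, and the Story-Adapter illustrates the resulting long sequence with story-level consistency. The following is a minimal, hypothetical Python sketch of that data flow; the class and function names (VisionAgent, StoryWriter, StoryAdapter, tell_story) are placeholders of our own and are not taken from the LLaMS codebase.

```python
# Hypothetical sketch of the pipeline described in the abstract.
# All names and return values are illustrative stand-ins, not the authors' API.
from dataclasses import dataclass
from typing import List


@dataclass
class VisionAgent:
    """Stands in for the multi-modal agent that extracts factual content
    (objects, actions, scene descriptions) from each input image."""

    def describe(self, image_path: str) -> str:
        # A real implementation would query a vision-language model here.
        return f"factual description of {image_path}"


@dataclass
class StoryWriter:
    """Stands in for the textual reasoning component that turns per-image
    facts into an expressive storyline and predicts what happens next."""

    def compose(self, facts: List[str]) -> List[str]:
        # A real implementation would prompt a large language model.
        return [f"story beat grounded in: {fact}" for fact in facts]

    def predict_next(self, story_so_far: List[str]) -> str:
        return "predicted continuation of the plot"


@dataclass
class StoryAdapter:
    """Stands in for the Story-Adapter module, which illustrates a long
    image sequence while conditioning on the whole story so that long-term
    narrative consistency is preserved, not just frame-to-frame objects."""

    def illustrate(self, story: List[str]) -> List[str]:
        # A real implementation would drive an image generator conditioned
        # on the full narrative rather than only the previous frame.
        return [f"illustration for: {beat}" for beat in story]


def tell_story(image_paths: List[str]) -> dict:
    agent, writer, adapter = VisionAgent(), StoryWriter(), StoryAdapter()
    facts = [agent.describe(path) for path in image_paths]  # factual grounding
    story = writer.compose(facts)                           # expressive narration
    story.append(writer.predict_next(story))                # plot continuation
    images = adapter.illustrate(story)                      # consistent illustration
    return {"story": story, "illustrations": images}


if __name__ == "__main__":
    result = tell_story(["frame_1.png", "frame_2.png"])
    for beat, image in zip(result["story"], result["illustrations"]):
        print(beat, "->", image)
```

The sketch only shows how the stages hand data to one another under the assumptions above; the paper itself specifies the actual models and prompting used at each stage.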

Original language: English
Article number: 129316
Journal: Neurocomputing
Volume: 622
DOIs: https://doi.org/10.1016/j.neucom.2024.129316
Publication status: Published - 14 Mar 2025

Keywords

  • Agent framework
  • Large language model
  • Long story consistency
  • Multi-modal storytelling


Cite this

Zhang, R., Tang, J., Zang, C., Pei, M., Liang, W., Zhao, Z., & Zhao, Z. (2025). Let storytelling tell vivid stories: A multi-modal-agent-based unified storytelling framework. Neurocomputing, 622, Article 129316. https://doi.org/10.1016/j.neucom.2024.129316