Let storytelling tell vivid stories: A multi-modal-agent-based unified storytelling framework

Rongsheng Zhang*, Jiji Tang, Chuanqi Zang, Mingtao Pei, Wei Liang, Zeng Zhao, Zhou Zhao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Storytelling via image sequences demands the creation of compelling and vivid narratives that adhere to the visual content while maintaining engaging plot divergence. Previous works have incrementally refined the alignment of multiple modalities, yet often produce simplistic storylines for image sequences. In this study, we introduce the LLaMS framework, designed to generate multi-modal, human-level stories that are both expressive and consistent. Our approach uses a multi-modal agent framework to elevate the expression of factual content and a textual reasoning architecture for expressive story generation and prediction. Furthermore, we introduce the Story-Adapter module, tailored for illustrating long image sequences, which focuses on maintaining long-term story consistency rather than short-term object consistency. Extensive human evaluations validate the superior performance of the proposed LLaMS: it achieves state-of-the-art storytelling performance, with an 86% win rate for correlation and a 100% win rate for consistency compared to prior state-of-the-art methods. Additionally, ablation experiments confirm the efficacy of the proposed multi-modal agent framework and Story-Adapter module. Our code is accessible at https://anonymous.4open.science/status/LLams-FF83.
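The abstract describes a three-stage pipeline: a multi-modal agent grounds the narrative in the visual facts of each image, a textual reasoning component composes and extends the expressive storyline, and the Story-Adapter illustrates the resulting long sequence with story-level consistency. The following is a minimal, hypothetical Python sketch of that data flow; the class and function names (VisionAgent, StoryWriter, StoryAdapter, tell_story) are placeholders of our own and are not taken from the LLaMS codebase.

```python
# Hypothetical sketch of the pipeline described in the abstract.
# All names and return values are illustrative stand-ins, not the authors' API.
from dataclasses import dataclass
from typing import List


@dataclass
class VisionAgent:
    """Stands in for the multi-modal agent that extracts factual content
    (objects, actions, scene descriptions) from each input image."""

    def describe(self, image_path: str) -> str:
        # A real implementation would query a vision-language model here.
        return f"factual description of {image_path}"


@dataclass
class StoryWriter:
    """Stands in for the textual reasoning component that turns per-image
    facts into an expressive storyline and predicts what happens next."""

    def compose(self, facts: List[str]) -> List[str]:
        # A real implementation would prompt a large language model.
        return [f"story beat grounded in: {fact}" for fact in facts]

    def predict_next(self, story_so_far: List[str]) -> str:
        return "predicted continuation of the plot"


@dataclass
class StoryAdapter:
    """Stands in for the Story-Adapter module, which illustrates a long
    image sequence while conditioning on the whole story so that long-term
    narrative consistency is preserved, not just frame-to-frame objects."""

    def illustrate(self, story: List[str]) -> List[str]:
        # A real implementation would drive an image generator conditioned
        # on the full narrative rather than only the previous frame.
        return [f"illustration for: {beat}" for beat in story]


def tell_story(image_paths: List[str]) -> dict:
    agent, writer, adapter = VisionAgent(), StoryWriter(), StoryAdapter()
    facts = [agent.describe(path) for path in image_paths]  # factual grounding
    story = writer.compose(facts)                           # expressive narration
    story.append(writer.predict_next(story))                # plot continuation
    images = adapter.illustrate(story)                      # consistent illustration
    return {"story": story, "illustrations": images}


if __name__ == "__main__":
    result = tell_story(["frame_1.png", "frame_2.png"])
    for beat, image in zip(result["story"], result["illustrations"]):
        print(beat, "->", image)
```

The sketch only shows how the stages hand data to one another under the assumptions above; the paper itself specifies the actual models and prompting used at each stage.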

Original language: English
Article number: 129316
Journal: Neurocomputing
Volume: 622
DOIs: https://doi.org/10.1016/j.neucom.2024.129316
Publication status: Published - 14 Mar 2025

Keywords

  • Agent framework
  • Large language model
  • Long story consistency
  • Multi-modal storytelling


Cite this

Zhang, R., Tang, J., Zang, C., Pei, M., Liang, W., Zhao, Z., & Zhao, Z. (2025). Let storytelling tell vivid stories: A multi-modal-agent-based unified storytelling framework. Neurocomputing, 622, Article 129316. https://doi.org/10.1016/j.neucom.2024.129316