MODE: a multimodal open-domain dialogue dataset with explanation

Hang Yin, Pinren Lu, Ziang Li, Bin Sun, Kan Li*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

The need for high-quality data has been a key issue hindering research on dialogue tasks. Recent studies try to build datasets through manual annotation, web crawling, and so on. However, man-made data is expensive, and data collected from the internet often includes generic responses, meaningless statements, and even toxic information. With the development of LLMs (large language models), generating data through LLMs has broad application potential. For open-domain multimodal dialogue tasks, there are still three drawbacks: 1) there is currently a lack of a unified and effective framework for collecting high-quality multimodal dialogue data; 2) the output of LLMs in multimodal dialogue generation lacks scene explanation, affecting human understanding; 3) previous work has not quantitatively examined the impact of data quality on model performance. To improve data quality and reduce expenditure in the data collection process, we propose the Multimodal Data Construction Framework (MDCF). MDCF uses a modal conversion module and designs proper prompts for the LLM to generate well-formed, high-quality content. It also provides an explanation for each multimodal dialogue, helping to understand conversation scenarios and facilitating subsequent manual quality inspection. Based on this, we release a Multimodal Open-domain Dialogue dataset with Explanation (MODE). We mainly compare against open-domain datasets such as Image-Chat. Both human evaluation and experiments show that high-quality datasets enable models to achieve greater understanding and generation capabilities.
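The abstract describes MDCF only at a high level: a modal conversion step turns each image into text, a designed prompt asks an LLM for a dialogue plus a scene explanation, and the explanation is stored alongside the dialogue to support later manual quality inspection. The sketch below illustrates that pipeline shape; every function, model, and prompt here is an illustrative assumption, not the authors' actual implementation.

```python
from dataclasses import dataclass

@dataclass
class ModeSample:
    image_path: str
    caption: str         # output of the modal conversion module
    dialogue: list[str]  # generated open-domain dialogue turns
    explanation: str     # scene explanation kept for human inspection

# Hypothetical prompt in the spirit of "designs proper prompts to the LLM";
# the real template is not given in the abstract.
PROMPT_TEMPLATE = (
    "Image description: {caption}\n"
    "Write a short open-domain dialogue (4-6 turns) grounded in this scene, "
    "then explain the conversation scenario in one paragraph.\n"
    "Format:\nDIALOGUE:\n...\nEXPLANATION:\n..."
)

def build_sample(image_path: str, caption_model, llm) -> ModeSample:
    """Run one image through the assumed MDCF stages.

    `caption_model` and `llm` are placeholder callables (image -> text and
    prompt -> text); the paper does not specify which models are used.
    """
    # 1) Modal conversion: image -> textual description.
    caption = caption_model(image_path)
    # 2) Prompted generation: the LLM produces dialogue + explanation.
    raw = llm(PROMPT_TEMPLATE.format(caption=caption))
    # 3) Split the structured output so the explanation can be reviewed
    #    separately during manual quality inspection.
    dialogue_part, explanation = raw.split("EXPLANATION:", maxsplit=1)
    turns = [t.strip()
             for t in dialogue_part.replace("DIALOGUE:", "").splitlines()
             if t.strip()]
    return ModeSample(image_path, caption, turns, explanation.strip())
```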

Original language: English
Journal: Applied Intelligence
DOIs
Publication status: Accepted/In press - 2024

Keywords

  • AIGC
  • Explainability
  • Multimodal data construction
  • Open-domain dialogue
