Fusion Classification Method Based on Audiovisual Information Processing

Peiju Chen, Xuan Zhang, Huijun Zhao, Huiliang Cao, Xuemei Chen*, Xiaochen Liu*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Multimodal target classification is particularly valuable when external interference is present. Traditional single-modal classification systems rely on a single data representation and are sensitive to environmental conditions, which makes it difficult for them to meet the robustness requirements of target classification under external disturbances. To address these shortcomings, this paper proposes a target classification algorithm based on audiovisual fusion. The contributions of this work are as follows. (1) To address the lack of correlation between audio signals and image signals, we introduce a method that converts audio signals into spectrograms and fuses them with target images. The spectrogram retains the effective information in the audio in a stable form, and it resolves the difficulty of fusing one-dimensional time-series audio signals with two-dimensional discrete image signals. (2) We propose a convolutional extraction and modal fusion network framework that incorporates an attention mechanism module during fusion to keep the fused data stable and robust for audiovisual target classification. The method was validated on both a custom dataset and the YouTube-8M dataset. On the custom dataset, the proposed method improves accuracy by 2.9%, 2.4%, 1.2%, and 0.9% relative to the other multimodal fusion target classification methods compared. These results demonstrate the effectiveness of the proposed multimodal fusion recognition approach and support its theoretical rationale.
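The first contribution, turning a one-dimensional audio signal into a two-dimensional spectrogram so that it can be combined with an image, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`log_spectrogram`, `resize_nearest`, `fuse_audio_image`), the frame and hop lengths, the nearest-neighbour resizing, and the choice to fuse by appending the spectrogram as a fourth image channel are all assumptions made for illustration. The abstract does not specify these details, and the attention module of the proposed network is not modelled here.

```python
import numpy as np


def log_spectrogram(signal, frame_len=256, hop=128):
    """Log-magnitude STFT of a 1-D signal.

    Returns an array of shape (frame_len // 2 + 1, n_frames).
    frame_len and hop are illustrative values, not taken from the paper.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, n_bins)
    return np.log1p(magnitude).T                     # (n_bins, n_frames)


def resize_nearest(array_2d, out_h, out_w):
    """Nearest-neighbour resize of a 2-D array (assumed resizing scheme)."""
    rows = np.arange(out_h) * array_2d.shape[0] // out_h
    cols = np.arange(out_w) * array_2d.shape[1] // out_w
    return array_2d[rows][:, cols]


def fuse_audio_image(image, signal):
    """Append a normalised spectrogram to an (H, W, 3) image as a 4th channel.

    Channel stacking is a hypothetical fusion choice used only to show how
    the two modalities end up with matching 2-D shapes.
    """
    spec = log_spectrogram(signal)
    spec = resize_nearest(spec, image.shape[0], image.shape[1])
    spec = (spec - spec.min()) / (spec.max() - spec.min() + 1e-8)
    return np.concatenate([image, spec[..., None]], axis=-1)


# Synthetic inputs: a 1 kHz tone sampled at 8 kHz and a random RGB image.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
audio = np.sin(2 * np.pi * 1000 * t)
image = np.random.default_rng(0).random((64, 64, 3))

spec = log_spectrogram(audio)
fused = fuse_audio_image(image, audio)
print(spec.shape, fused.shape)
```

Once both modalities share a two-dimensional grid, a convolutional network can process them jointly, which is the property the abstract attributes to the spectrogram representation.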

Original language: English
Article number: 4104
Journal: Applied Sciences (Switzerland)
Volume: 15
Issue number: 8
DOIs
Publication status: Published - Apr 2025
Externally published: Yes

Keywords

  • audiovisual fusion
  • modal fusion network framework
  • multimodal classification
  • sound spectrum
