Abstract
In the presence of external interference, multimodal target classification plays a crucial role. Traditional single-modal classification systems are limited by the singularity of data representation and their sensitivity to environmental conditions, making it challenging to meet the robustness requirements for target classification under external disturbances. This paper addresses the inadequacies of single-modal target classification by proposing a target classification algorithm based on audiovisual fusion. The innovative contributions of this work are as follows. (1) To resolve the issue of the lack of correlation between audio signals and image signals, we introduce a method that converts audio signals into spectrograms and fuses them with target images. The advantage of this method is that the spectrogram can fully utilize the effective information in the audio, ensuring stability, while also effectively addressing the challenge of fusing one-dimensional time series audio signals with two-dimensional discrete image signals. (2) We propose a convolutional extraction and modal fusion network framework that incorporates an attention mechanism module during the fusion process, ensuring the stability and robustness of the fused data for audiovisual target classification. Validation was conducted on both a custom dataset and the YouTube-8M dataset. The experimental results indicate that the proposed method demonstrates improvements in accuracy of 2.9%, 2.4%, 1.2%, and 0.9% compared to other multimodal fusion target classification methods on the custom dataset. This demonstrates the effectiveness of the proposed multimodal fusion recognition approach and fully validates the theoretical rationale behind our method.
Original language | English |
---|---|
Article number | 4104 |
Journal | Applied Sciences (Switzerland) |
Volume | 15 |
Issue number | 8 |
DOIs | |
Publication status | Published - Apr 2025 |
Externally published | Yes |
Keywords
- audiovisual fusion
- modal fusion network framework
- multimodal classification
- sound spectrum