TY - JOUR
T1 - Improved mini-batch multiple augmentation for low-resource spoken word recognition
AU - Kivaisi, Alexander Rogath
AU - Zhao, Qingjie
N1 - Publisher Copyright: © 2024 Elsevier Ltd
PY - 2024/10/15
Y1 - 2024/10/15
N2 - Data augmentation techniques are useful for dealing with limited data in machine learning tasks. Recently, spectrogram data augmentation techniques have been investigated for voice conversion and sound classification tasks and have produced improved results. However, applying multiple data augmentation techniques within a mini-batch has been observed to degrade performance. While applying multiple augmentation methods sequentially has shown performance gains on image data, transferring this approach to spectrogram data leads to a loss of acoustic information. Hence, an alternative approach is needed to utilize multiple augmentation methods effectively in the speech domain. This study addressed these challenges, within the mini-batch, for spoken word recognition in low-resource settings. First, we investigated the effect of individual data augmentation techniques. Second, we investigated the effect of combining multiple data augmentation techniques. Finally, we proposed a new approach that uses an alternate mechanism to utilize multiple spectrogram augmentation techniques more effectively. The results of our experiment show that the proposed approach (new pattern) significantly outperforms the sequential approach (traditional pattern) across datasets of different scales, including low-resource settings. In addition, the proposed approach achieves an actual speedup of approximately 2x over the sequential approach. A combination of frequency-warping and time length control augmentation methods was found to be stable and robust in performance across all datasets evaluated.
KW - Low-resource languages
KW - Mini-batch
KW - Multiple augmentation
KW - Spectrogram
KW - Spoken word recognition
UR - http://www.scopus.com/inward/record.url?scp=85193004605&partnerID=8YFLogxK
U2 - 10.1016/j.eswa.2024.124157
DO - 10.1016/j.eswa.2024.124157
M3 - Article
AN - SCOPUS:85193004605
SN - 0957-4174
VL - 252
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 124157
ER -