TY - JOUR
T1 - Dynamic Sampling-Based Meta-Learning Using Multilingual Acoustic Data for Under-Resourced Speech Recognition
AU - Hsieh, I-Ting
AU - Wu, Chung-Hsien
AU - Zhao, Zhe-Hong
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
AB - Under-resourced automatic speech recognition (ASR) has become an active field of research and has experienced significant progress during the past decade. However, the performance of under-resourced ASR trained with existing methods is still far inferior to that of high-resourced ASR for practical applications. In this paper, speech data from languages that share the most phonemes with the under-resourced language are selected as supplementary resources for meta-training based on the Model-Agnostic Meta-Learning (MAML) strategy. Besides supplementary language selection, this paper proposes a dynamic sampling method, in place of the original random sampling method, to select the support and query sets for each task in MAML and thereby improve meta-training performance. In this study, Taiwanese is selected as the under-resourced language, and the speech corpora of five languages, namely Mandarin, English, Japanese, Cantonese, and Thai, are chosen as supplementary training data for acoustic model training. The proposed dynamic sampling approach uses phonemes, pronunciation, and speech recognition models as the basis for determining the proportion of each supplementary language when selecting helpful utterances for MAML. For evaluation, using the selected utterances from each supplementary language for meta-training, we obtained a Word Error Rate of 20.24% and a Syllable Error Rate of 8.35% for Taiwanese ASR, outperforming both the baseline model trained using only the Taiwanese corpus (26.18% and 13.99%) and other existing methods.
UR - http://www.scopus.com/inward/record.url?scp=85200211525&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85200211525&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2024.3437755
DO - 10.1109/ACCESS.2024.3437755
M3 - Article
AN - SCOPUS:85200211525
SN - 2169-3536
VL - 12
SP - 106070
EP - 106083
JO - IEEE Access
JF - IEEE Access
ER -