Electrolaryngeal speech (EL speech) is typically produced with an electrolarynx, a device that generates excitation signals to substitute for human vocal fold vibrations. Because these excitation signals cannot perfectly characterize the sound sources generated by vocal folds, the naturalness and intelligibility of EL speech are inevitably worse than those of natural speech (NL speech). To improve speech naturalness, statistical models, such as Gaussian mixture models and deep-learning-based models, have been employed for EL speech voice conversion (ELVC), a task that aims to convert EL speech into NL speech through an ELVC model. To implement a frame-wise ELVC system, accurate feature alignment is crucial for model training. However, the abnormal acoustic characteristics of EL speech cause misalignments that limit ELVC performance. To address this issue, we propose a novel ELVC system based on sequence-to-sequence (seq2seq) modeling with text-to-speech (TTS) pretraining. The seq2seq model incorporates an attention mechanism to jointly perform representation learning and alignment, while TTS pretraining enables efficient training with limited data. Experimental results show that the proposed ELVC system yields notable improvements over a well-known frame-wise ELVC system in both standardized objective evaluation metrics and subjective listening tests.
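To illustrate how an attention mechanism replaces the explicit frame-wise alignment that a conventional ELVC system requires, the following minimal NumPy sketch (an illustration under assumed shapes and names, not the proposed system's implementation) computes a scaled dot-product attention matrix that serves as a soft alignment between decoder steps and encoder frames:

```python
import numpy as np

# Illustrative sketch only: shows how seq2seq attention yields a soft
# alignment between decoder steps and encoder frames. All dimensions
# and variable names below are arbitrary assumptions for the demo.

rng = np.random.default_rng(0)

T_enc, T_dec, d = 8, 5, 16                     # encoder frames, decoder steps, hidden size
enc_states = rng.standard_normal((T_enc, d))   # stand-in for encoded EL-speech features
dec_states = rng.standard_normal((T_dec, d))   # stand-in for decoder states

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Scaled dot-product scores: each decoder step scores every encoder frame.
scores = dec_states @ enc_states.T / np.sqrt(d)   # shape (T_dec, T_enc)

# Softmax over encoder frames gives a soft alignment matrix:
# each row is a distribution over input frames, learned jointly with
# the representations rather than fixed by a separate alignment step.
alignment = softmax(scores, axis=-1)

# Context vectors: attention-weighted sums of encoder states.
context = alignment @ enc_states                  # shape (T_dec, d)

print(alignment.shape)
```

Each row of `alignment` sums to one, so the model can attend smoothly across misaligned or temporally distorted input frames instead of committing to a hard one-to-one frame correspondence.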