TY - GEN
T1 - SPLGAN-TTS
T2 - 31st International Conference on Multimedia Modeling, MMM 2025
AU - Chang, Ding-Chi
AU - Li, Shiou-Chi
AU - Huang, Jen-Wei
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
PY - 2025
Y1 - 2025
N2 - Autoregressive models have proven effective in speech synthesis; however, their numerous parameters and slow inference limit their applicability. Although non-autoregressive models can resolve these issues, their speech synthesis quality is unsatisfactory. This study employed a tree-based structure to enhance the learning of semantic and prosody information using a lightweight model. A Variational Autoencoder (VAE) is used for the generator architecture, and a novel normalizing-flow module is used to enhance the complexity of the VAE-generated distribution. We also developed a multi-length speech discriminator to reduce computational overhead, along with multiple auxiliary losses to assist in model training. The proposed model is smaller than existing state-of-the-art models, and its synthesis is faster, particularly on longer texts. Although the proposed model is roughly 30% smaller than FastSpeech2 [1], its mean opinion score surpasses that of FastSpeech2 and other models.
AB - Autoregressive models have proven effective in speech synthesis; however, their numerous parameters and slow inference limit their applicability. Although non-autoregressive models can resolve these issues, their speech synthesis quality is unsatisfactory. This study employed a tree-based structure to enhance the learning of semantic and prosody information using a lightweight model. A Variational Autoencoder (VAE) is used for the generator architecture, and a novel normalizing-flow module is used to enhance the complexity of the VAE-generated distribution. We also developed a multi-length speech discriminator to reduce computational overhead, along with multiple auxiliary losses to assist in model training. The proposed model is smaller than existing state-of-the-art models, and its synthesis is faster, particularly on longer texts. Although the proposed model is roughly 30% smaller than FastSpeech2 [1], its mean opinion score surpasses that of FastSpeech2 and other models.
UR - http://www.scopus.com/inward/record.url?scp=85216026869&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85216026869&partnerID=8YFLogxK
U2 - 10.1007/978-981-96-2071-5_5
DO - 10.1007/978-981-96-2071-5_5
M3 - Conference contribution
AN - SCOPUS:85216026869
SN - 9789819620708
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 58
EP - 70
BT - MultiMedia Modeling - 31st International Conference on Multimedia Modeling, MMM 2025, Proceedings
A2 - Ide, Ichiro
A2 - Kompatsiaris, Ioannis
A2 - Xu, Changsheng
A2 - Yanai, Keiji
A2 - Chu, Wei-Ta
A2 - Nitta, Naoko
A2 - Riegler, Michael
A2 - Yamasaki, Toshihiko
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 8 January 2025 through 10 January 2025
ER -