SPLGAN-TTS: Learning Semantic and Prosody to Enhance the Text-to-Speech Quality of Lightweight GAN Models

Ding Chi Chang, Shiou Chi Li, Jen Wei Huang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Autoregressive models have proven effective in speech synthesis; however, their large parameter counts and slow inference limit their applicability. Non-autoregressive models can resolve these issues, but their synthesis quality is unsatisfactory. This study employed a tree-based structure to enhance the learning of semantic and prosodic information using a lightweight model. A variational autoencoder (VAE) is used for the generator architecture, and a novel normalizing-flow module is used to increase the complexity of the VAE-generated distribution. We also developed a speech discriminator with a multi-length architecture to reduce computational overhead, as well as multiple auxiliary losses to assist in model training. The proposed model is smaller than existing state-of-the-art models and synthesizes speech faster, particularly when applied to longer texts. Although the proposed model is roughly 30% smaller than FastSpeech2 [1], its mean opinion score surpasses that of FastSpeech2 as well as those of other models.

Original language: English
Title of host publication: MultiMedia Modeling - 31st International Conference on Multimedia Modeling, MMM 2025, Proceedings
Editors: Ichiro Ide, Ioannis Kompatsiaris, Changsheng Xu, Keiji Yanai, Wei-Ta Chu, Naoko Nitta, Michael Riegler, Toshihiko Yamasaki
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 58-70
Number of pages: 13
ISBN (Print): 9789819620708
Publication status: Published - 2025
Event: 31st International Conference on Multimedia Modeling, MMM 2025 - Nara, Japan
Duration: 2025 Jan 8 → 2025 Jan 10

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 15523 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 31st International Conference on Multimedia Modeling, MMM 2025
Country/Territory: Japan
City: Nara
Period: 25-01-08 → 25-01-10

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • General Computer Science
