Improving Mandarin Prosody Generation Using Alternative Smoothing Techniques

Yi Chin Huang, Chung-Hsien Wu, Si Ting Weng

Research output: Contribution to journalArticle

2 Citations (Scopus)

Abstract

Prosody plays a vital role for conveying both communicative meanings and specific speaking styles in speech communication. In recent years, Hidden Markov Model (HMM)-based synthesis system (HTS) has been developed in triumph, which can synthesize stable and smooth speech. However, the prosody of the synthesized speech suffers from the over-smoothing problem. Thus, a better prosodic model is required to improve the natural variability of the synthesized speech. This study exploits a hybrid method to alleviate this problem by combining the statistical and the template-based unit selection methods. First, a two-level clustering approach is proposed to obtain representative prosodic patterns (denoted by codewords) of the hierarchical prosodic structure modeled by a modified Fujisaki model. The prosodic codewords are then used to represent the prosody of each sentence in the parallel corpus consisting of the real speech corpus and the synthesized counterpart obtained from the HTS. The synthesized speech utterance is then used as the query for retrieving the prosodic codewords of the utterances in the synthesized corpus. The retrieved synthesized prosodic codewords are mapped to the prosodic codewords of the real speech based on linear mapping rules obtained from the parallel corpus. The prosodic codeword language models for prosodic word and prosodic phrase are employed respectively to choose the optimal codeword sequence of the real speech. Finally, the most likely sequence of prosodic codewords can be obtained based on the NURBS-based continuity measure for synthesizing speech with natural prosody. The experimental results of subjective and objective tests demonstrate that the proposed prosodic model substantially improves naturalness of the intonation of the synthesized speech compared to that of the HMM-based method.

Original languageEnglish
Pages (from-to)1897-1907
Number of pages11
JournalIEEE/ACM Transactions on Audio Speech and Language Processing
Volume24
Issue number11
DOIs
Publication statusPublished - 2016 Nov 1

Fingerprint

Prosody
Smoothing Techniques
smoothing
Alternatives
Hidden Markov models
Markov Model
Speech
Speech communication
Model-based
NURBS
sentences
Conveying
Language Model
Hierarchical Structure
Hybrid Method
continuity
Cluster Analysis
Smoothing
speaking
Template

All Science Journal Classification (ASJC) codes

  • Signal Processing
  • Media Technology
  • Instrumentation
  • Acoustics and Ultrasonics
  • Linguistics and Language
  • Electrical and Electronic Engineering
  • Speech and Hearing

Cite this

@article{ee5cb85d4ee84ed49e0f61c7fe4f0417,
title = "Improving Mandarin Prosody Generation Using Alternative Smoothing Techniques",
abstract = "Prosody plays a vital role for conveying both communicative meanings and specific speaking styles in speech communication. In recent years, Hidden Markov Model (HMM)-based synthesis system (HTS) has been developed in triumph, which can synthesize stable and smooth speech. However, the prosody of the synthesized speech suffers from the over-smoothing problem. Thus, a better prosodic model is required to improve the natural variability of the synthesized speech. This study exploits a hybrid method to alleviate this problem by combining the statistical and the template-based unit selection methods. First, a two-level clustering approach is proposed to obtain representative prosodic patterns (denoted by codewords) of the hierarchical prosodic structure modeled by a modified Fujisaki model. The prosodic codewords are then used to represent the prosody of each sentence in the parallel corpus consisting of the real speech corpus and the synthesized counterpart obtained from the HTS. The synthesized speech utterance is then used as the query for retrieving the prosodic codewords of the utterances in the synthesized corpus. The retrieved synthesized prosodic codewords are mapped to the prosodic codewords of the real speech based on linear mapping rules obtained from the parallel corpus. The prosodic codeword language models for prosodic word and prosodic phrase are employed respectively to choose the optimal codeword sequence of the real speech. Finally, the most likely sequence of prosodic codewords can be obtained based on the NURBS-based continuity measure for synthesizing speech with natural prosody. The experimental results of subjective and objective tests demonstrate that the proposed prosodic model substantially improves naturalness of the intonation of the synthesized speech compared to that of the HMM-based method.",
author = "Huang, {Yi Chin} and Chung-Hsien Wu and Weng, {Si Ting}",
year = "2016",
month = "11",
day = "1",
doi = "10.1109/TASLP.2016.2588727",
language = "English",
volume = "24",
pages = "1897--1907",
journal = "IEEE/ACM Transactions on Speech and Language Processing",
issn = "2329-9290",
publisher = "IEEE Advancing Technology for Humanity",
number = "11",

}

Improving Mandarin Prosody Generation Using Alternative Smoothing Techniques. / Huang, Yi Chin; Wu, Chung-Hsien; Weng, Si Ting.

In: IEEE/ACM Transactions on Audio Speech and Language Processing, Vol. 24, No. 11, 01.11.2016, p. 1897-1907.

Research output: Contribution to journalArticle

TY - JOUR

T1 - Improving Mandarin Prosody Generation Using Alternative Smoothing Techniques

AU - Huang, Yi Chin

AU - Wu, Chung-Hsien

AU - Weng, Si Ting

PY - 2016/11/1

Y1 - 2016/11/1

N2 - Prosody plays a vital role for conveying both communicative meanings and specific speaking styles in speech communication. In recent years, Hidden Markov Model (HMM)-based synthesis system (HTS) has been developed in triumph, which can synthesize stable and smooth speech. However, the prosody of the synthesized speech suffers from the over-smoothing problem. Thus, a better prosodic model is required to improve the natural variability of the synthesized speech. This study exploits a hybrid method to alleviate this problem by combining the statistical and the template-based unit selection methods. First, a two-level clustering approach is proposed to obtain representative prosodic patterns (denoted by codewords) of the hierarchical prosodic structure modeled by a modified Fujisaki model. The prosodic codewords are then used to represent the prosody of each sentence in the parallel corpus consisting of the real speech corpus and the synthesized counterpart obtained from the HTS. The synthesized speech utterance is then used as the query for retrieving the prosodic codewords of the utterances in the synthesized corpus. The retrieved synthesized prosodic codewords are mapped to the prosodic codewords of the real speech based on linear mapping rules obtained from the parallel corpus. The prosodic codeword language models for prosodic word and prosodic phrase are employed respectively to choose the optimal codeword sequence of the real speech. Finally, the most likely sequence of prosodic codewords can be obtained based on the NURBS-based continuity measure for synthesizing speech with natural prosody. The experimental results of subjective and objective tests demonstrate that the proposed prosodic model substantially improves naturalness of the intonation of the synthesized speech compared to that of the HMM-based method.

AB - Prosody plays a vital role for conveying both communicative meanings and specific speaking styles in speech communication. In recent years, Hidden Markov Model (HMM)-based synthesis system (HTS) has been developed in triumph, which can synthesize stable and smooth speech. However, the prosody of the synthesized speech suffers from the over-smoothing problem. Thus, a better prosodic model is required to improve the natural variability of the synthesized speech. This study exploits a hybrid method to alleviate this problem by combining the statistical and the template-based unit selection methods. First, a two-level clustering approach is proposed to obtain representative prosodic patterns (denoted by codewords) of the hierarchical prosodic structure modeled by a modified Fujisaki model. The prosodic codewords are then used to represent the prosody of each sentence in the parallel corpus consisting of the real speech corpus and the synthesized counterpart obtained from the HTS. The synthesized speech utterance is then used as the query for retrieving the prosodic codewords of the utterances in the synthesized corpus. The retrieved synthesized prosodic codewords are mapped to the prosodic codewords of the real speech based on linear mapping rules obtained from the parallel corpus. The prosodic codeword language models for prosodic word and prosodic phrase are employed respectively to choose the optimal codeword sequence of the real speech. Finally, the most likely sequence of prosodic codewords can be obtained based on the NURBS-based continuity measure for synthesizing speech with natural prosody. The experimental results of subjective and objective tests demonstrate that the proposed prosodic model substantially improves naturalness of the intonation of the synthesized speech compared to that of the HMM-based method.

UR - http://www.scopus.com/inward/record.url?scp=84982292118&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84982292118&partnerID=8YFLogxK

U2 - 10.1109/TASLP.2016.2588727

DO - 10.1109/TASLP.2016.2588727

M3 - Article

VL - 24

SP - 1897

EP - 1907

JO - IEEE/ACM Transactions on Speech and Language Processing

JF - IEEE/ACM Transactions on Speech and Language Processing

SN - 2329-9290

IS - 11

ER -