Improving Mandarin Prosody Generation Using Alternative Smoothing Techniques

Yi Chin Huang, Chung Hsien Wu, Si Ting Weng

Research output: Article

2 Citations (Scopus)

Abstract

Prosody plays a vital role in conveying both communicative meaning and specific speaking styles in speech communication. In recent years, the Hidden Markov Model (HMM)-based speech synthesis system (HTS) has been developed successfully and can synthesize stable, smooth speech. However, the prosody of the synthesized speech suffers from over-smoothing, so a better prosodic model is required to improve the natural variability of the synthesized speech. This study proposes a hybrid method that alleviates this problem by combining statistical and template-based unit selection methods. First, a two-level clustering approach is proposed to obtain representative prosodic patterns (denoted as codewords) of the hierarchical prosodic structure, which is modeled by a modified Fujisaki model. The prosodic codewords are then used to represent the prosody of each sentence in a parallel corpus consisting of a real speech corpus and its synthesized counterpart obtained from the HTS. A synthesized speech utterance is then used as the query for retrieving the prosodic codewords of utterances in the synthesized corpus. The retrieved synthesized prosodic codewords are mapped to the prosodic codewords of the real speech using linear mapping rules obtained from the parallel corpus. Prosodic codeword language models for the prosodic word and the prosodic phrase, respectively, are then employed to choose the optimal codeword sequence of the real speech. Finally, the most likely sequence of prosodic codewords is obtained using a continuity measure based on non-uniform rational B-splines (NURBS), so that speech can be synthesized with natural prosody. Subjective and objective tests demonstrate that the proposed prosodic model substantially improves the naturalness of the intonation of the synthesized speech compared with the HMM-based method.
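
The sequence-selection step in the abstract (a codeword language model combined with a continuity measure) can be illustrated with a generic Viterbi search. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `best_codeword_sequence`, the bigram form of the language model, the `join_cost` callback standing in for the NURBS-based continuity measure, and the `weight` parameter are all hypothetical.

```python
def best_codeword_sequence(candidates, bigram_logprob, join_cost, weight=1.0):
    """Pick one codeword per prosodic unit with a Viterbi search.

    candidates[t]      -- list of codeword ids allowed at unit t (assumed input)
    bigram_logprob(p,c) -- log P(c | p) from a hypothetical bigram codeword LM
    join_cost(p, c)    -- non-negative discontinuity penalty (stand-in for a
                          continuity measure; the real measure is not shown here)
    """
    # score[c]: best total score of any path that ends in codeword c so far
    score = {c: 0.0 for c in candidates[0]}
    backpointers = []
    for t in range(1, len(candidates)):
        new_score, pointer = {}, {}
        for cur in candidates[t]:
            def total(prev):
                # language-model reward minus weighted discontinuity penalty
                return score[prev] + bigram_logprob(prev, cur) - weight * join_cost(prev, cur)
            best_prev = max(score, key=total)
            new_score[cur] = total(best_prev)
            pointer[cur] = best_prev
        score = new_score
        backpointers.append(pointer)
    # Trace the best path back from the highest-scoring final codeword
    last = max(score, key=score.get)
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path, score[last]
```

Raising `weight` makes the search favor smooth transitions between adjacent codewords over language-model likelihood; how the two terms are actually balanced in the paper is not stated in the abstract.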

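Original language: English
Pages (from - to): 1897-1907
Number of pages: 11
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume: 24
Issue number: 11
DOI: 10.1109/TASLP.2016.2588727
Publication status: Published - Nov 2016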

Fingerprint

Prosody
Smoothing Techniques
smoothing
Alternatives
Hidden Markov models
Markov Model
Speech
Speech communication
Model-based
NURBS
sentences
Conveying
Language Model
Hierarchical Structure
Hybrid Method
continuity
Cluster Analysis
Smoothing
speaking
Template

All Science Journal Classification (ASJC) codes

  • Computer Science (miscellaneous)
  • Acoustics and Ultrasonics
  • Computational Mathematics
  • Electrical and Electronic Engineering

Cite this

@article{ee5cb85d4ee84ed49e0f61c7fe4f0417,
title = "Improving Mandarin Prosody Generation Using Alternative Smoothing Techniques",
author = "Huang, {Yi Chin} and Wu, {Chung Hsien} and Weng, {Si Ting}",
year = "2016",
month = "11",
doi = "10.1109/TASLP.2016.2588727",
language = "English",
volume = "24",
pages = "1897--1907",
journal = "IEEE/ACM Transactions on Audio, Speech, and Language Processing",
issn = "2329-9290",
publisher = "IEEE",
number = "11",

}

TY - JOUR

T1 - Improving Mandarin Prosody Generation Using Alternative Smoothing Techniques

AU - Huang, Yi Chin

AU - Wu, Chung Hsien

AU - Weng, Si Ting

PY - 2016/11

Y1 - 2016/11

UR - http://www.scopus.com/inward/record.url?scp=84982292118&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84982292118&partnerID=8YFLogxK

U2 - 10.1109/TASLP.2016.2588727

DO - 10.1109/TASLP.2016.2588727

M3 - Article

AN - SCOPUS:84982292118

VL - 24

SP - 1897

EP - 1907

JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing

JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing

SN - 2329-9290

IS - 11

ER -