Personalized Spontaneous Speech Synthesis Using a Small-Sized Unsegmented Semispontaneous Speech

Yi Chin Huang, Chung-Hsien Wu, Yan You Chen, Ming Ge Shie, Jhing Fa Wang

Research output: Article

1 Citation (Scopus)

Abstract

A systematic approach is proposed for synthesizing personalized spontaneous speech using a small-sized unsegmented speech corpus of the target speaker. First, an automatic segmentation algorithm is employed to segment and label a collected semispontaneous speech corpus of the target speaker. Then, a pretrained average voice model is adapted to the voice model of the target speaker by using the segmented data. A postfilter based on the modulation spectrum is adopted to further improve the speaker similarity of the synthesized speech and to alleviate its over-smoothing problem. For generating spontaneous speech, a smoothing method applied at the prosodic word level is proposed to improve speech fluency. In an objective evaluation of spontaneous speech segmentation, the segmentation accuracy of the proposed method is superior to that of Viterbi-based forced alignment. The results of a subjective listening test also show that the proposed method can improve the spontaneity and speaker similarity of the synthesized speech compared with the maximum likelihood linear regression based speaker adaptation method.

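Original language: English
Pages (from-to): 1048-1060
Number of pages: 13
Journal: IEEE/ACM Transactions on Audio Speech and Language Processing
Volume: 25
Issue number: 5
DOI: 10.1109/TASLP.2017.2679603
Publication status: Published - 1 May 2017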

Fingerprint

Speech Synthesis
Speech synthesis
synthesis
Segmentation
smoothing
Target
Speaker Adaptation
spontaneity
Speech
Smoothing Methods
Linear regression
Maximum Likelihood
Smoothing
regression analysis
Linear Models
Alignment
Modulation
Maximum likelihood
alignment
Labels

All Science Journal Classification (ASJC) codes

  • Signal Processing
  • Media Technology
  • Instrumentation
  • Acoustics and Ultrasonics
  • Linguistics and Language
  • Electrical and Electronic Engineering
  • Speech and Hearing

Cite this

@article{bbc11b8c5e5f46ab918abc0356048d9c,
title = "Personalized Spontaneous Speech Synthesis Using a Small-Sized Unsegmented Semispontaneous Speech",
abstract = "A systematic approach is proposed to synthesizing personalized spontaneous speech using a small-sized unsegmented speech corpus of the target speaker. First, an automatic segmentation algorithm is employed to segment and label a collected semispontaneous speech corpus of the target speaker. Then, a pretrained average voice model is adapted to the voice model of the target speaker by using the segmented data. A postfilter based on modulation spectrum is adopted to further improve the speaker similarity of the synthesized speech as well as alleviate the over-smoothing problem of the synthesized speech. For generating spontaneous speech, a smoothing method applied at the prosodic word level is proposed to improve speech fluency. For objective evaluation on spontaneous speech segmentation, the segmentation accuracy of the proposed method is superior to that of Viterbi-based forced alignment. The results of subjective listening test also show that the proposed method can improve the spontaneity and speaker similarity of the synthesized speech compared to the maximum likelihood linear regression based speaker adaptation method.",
author = "Huang, {Yi Chin} and Chung-Hsien Wu and Chen, {Yan You} and Shie, {Ming Ge} and Wang, {Jhing Fa}",
year = "2017",
month = "5",
day = "1",
doi = "10.1109/TASLP.2017.2679603",
language = "English",
volume = "25",
pages = "1048--1060",
journal = "IEEE/ACM Transactions on Audio Speech and Language Processing",
issn = "2329-9290",
publisher = "IEEE",
number = "5",

}

Personalized Spontaneous Speech Synthesis Using a Small-Sized Unsegmented Semispontaneous Speech. / Huang, Yi Chin; Wu, Chung-Hsien; Chen, Yan You; Shie, Ming Ge; Wang, Jhing Fa.

In: IEEE/ACM Transactions on Audio Speech and Language Processing, Vol. 25, No. 5, 01.05.2017, p. 1048-1060.

Research output: Article

TY - JOUR

T1 - Personalized Spontaneous Speech Synthesis Using a Small-Sized Unsegmented Semispontaneous Speech

AU - Huang, Yi Chin

AU - Wu, Chung-Hsien

AU - Chen, Yan You

AU - Shie, Ming Ge

AU - Wang, Jhing Fa

PY - 2017/5/1

Y1 - 2017/5/1

AB - A systematic approach is proposed to synthesizing personalized spontaneous speech using a small-sized unsegmented speech corpus of the target speaker. First, an automatic segmentation algorithm is employed to segment and label a collected semispontaneous speech corpus of the target speaker. Then, a pretrained average voice model is adapted to the voice model of the target speaker by using the segmented data. A postfilter based on modulation spectrum is adopted to further improve the speaker similarity of the synthesized speech as well as alleviate the over-smoothing problem of the synthesized speech. For generating spontaneous speech, a smoothing method applied at the prosodic word level is proposed to improve speech fluency. For objective evaluation on spontaneous speech segmentation, the segmentation accuracy of the proposed method is superior to that of Viterbi-based forced alignment. The results of subjective listening test also show that the proposed method can improve the spontaneity and speaker similarity of the synthesized speech compared to the maximum likelihood linear regression based speaker adaptation method.

UR - http://www.scopus.com/inward/record.url?scp=85018839213&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85018839213&partnerID=8YFLogxK

U2 - 10.1109/TASLP.2017.2679603

DO - 10.1109/TASLP.2017.2679603

M3 - Article

AN - SCOPUS:85018839213

VL - 25

SP - 1048

EP - 1060

JO - IEEE/ACM Transactions on Audio Speech and Language Processing

JF - IEEE/ACM Transactions on Audio Speech and Language Processing

SN - 2329-9290

IS - 5

ER -