TY - JOUR
T1 - Hierarchical prosody conversion using regression-based clustering for emotional speech synthesis
AU - Wu, Chung Hsien
AU - Hsia, Chi Chun
AU - Lee, Chung Han
AU - Lin, Mai Chun
N1 - Funding Information:
Manuscript received October 27, 2008; revised September 23, 2009. First published October 20, 2009; current version published July 14, 2010. This work was supported by the National Science Council, Taiwan, under Contract NSC96-2221-E-006-154-MY3. The STRAIGHT analysis/synthesis program was supported by Dr. Kawahara. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Simon King.
PY - 2010
Y1 - 2010
N2 - This paper presents an approach to hierarchical prosody conversion for emotional speech synthesis. The pitch contour of the source speech is decomposed into a hierarchical prosodic structure consisting of sentence, prosodic word, and subsyllable levels. The pitch contour at the higher level is encoded by discrete Legendre polynomial coefficients. The residual, the difference between the source pitch contour and the pitch contour decoded from the discrete Legendre polynomial coefficients, is then used for pitch modeling at the lower level. For prosody conversion, Gaussian mixture models (GMMs) are used for sentence- and prosodic word-level conversion. At the subsyllable level, the pitch feature vectors are clustered via a proposed regression-based clustering method to generate the prosody conversion functions for selection. Linguistic and symbolic prosody features of the source speech are adopted to select the most suitable function using a classification and regression tree for prosody conversion. Three small parallel emotional speech databases, for happy, angry, and sad emotions respectively, were designed and collected for training and evaluation. Objective and subjective evaluations were conducted, and comparison with the GMM-based prosody conversion method showed improved performance using the hierarchical prosodic structure and the proposed regression-based clustering method.
AB - This paper presents an approach to hierarchical prosody conversion for emotional speech synthesis. The pitch contour of the source speech is decomposed into a hierarchical prosodic structure consisting of sentence, prosodic word, and subsyllable levels. The pitch contour at the higher level is encoded by discrete Legendre polynomial coefficients. The residual, the difference between the source pitch contour and the pitch contour decoded from the discrete Legendre polynomial coefficients, is then used for pitch modeling at the lower level. For prosody conversion, Gaussian mixture models (GMMs) are used for sentence- and prosodic word-level conversion. At the subsyllable level, the pitch feature vectors are clustered via a proposed regression-based clustering method to generate the prosody conversion functions for selection. Linguistic and symbolic prosody features of the source speech are adopted to select the most suitable function using a classification and regression tree for prosody conversion. Three small parallel emotional speech databases, for happy, angry, and sad emotions respectively, were designed and collected for training and evaluation. Objective and subjective evaluations were conducted, and comparison with the GMM-based prosody conversion method showed improved performance using the hierarchical prosodic structure and the proposed regression-based clustering method.
UR - http://www.scopus.com/inward/record.url?scp=77955722263&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77955722263&partnerID=8YFLogxK
U2 - 10.1109/TASL.2009.2034771
DO - 10.1109/TASL.2009.2034771
M3 - Article
AN - SCOPUS:77955722263
SN - 1558-7916
VL - 18
SP - 1394
EP - 1405
JO - IEEE Transactions on Audio, Speech and Language Processing
JF - IEEE Transactions on Audio, Speech and Language Processing
IS - 6
M1 - 5289985
ER -