TY - JOUR
T1 - Exploiting prosody hierarchy and dynamic features for pitch modeling and generation in HMM-based speech synthesis
AU - Hsia, Chi Chun
AU - Wu, Chung Hsien
AU - Wu, Jung Yun
N1 - Funding Information:
Manuscript received November 19, 2008; revised October 15, 2009. Date of publication April 05, 2010; date of current version September 01, 2010. This work was supported by the National Science Council of the Republic of China, Taiwan, under Contract NSC96-2221-E-006-155-MY3. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Gaël Richard.
PY - 2010
Y1 - 2010
AB - This paper proposes a method for modeling and generating pitch in hidden Markov model (HMM)-based Mandarin speech synthesis by exploiting prosody hierarchy and dynamic pitch features. The prosodic structure of a sentence is represented by a prosody hierarchy, which is constructed from prosodic breaks predicted by a supervised classification and regression tree (S-CART). The S-CART is trained by maximizing the proportional reduction of entropy in order to minimize errors in prosodic break prediction. The pitch contour of a spoken sentence is estimated using the STRAIGHT algorithm and decomposed, based on the predicted prosodic structure, into static prosodic features at the prosodic word, syllable, and frame layers. Dynamic features at each layer are estimated to preserve the temporal correlation between adjacent units. A hierarchical prosody model for generating the pitch contour is constructed using an unsupervised CART (U-CART), with minimum description length (MDL) adopted in U-CART training. Objective and subjective evaluations with statistical hypothesis testing were conducted, and the results were compared with those of HMM-based pitch modeling. The comparison confirms the improved performance of the proposed method.
UR - http://www.scopus.com/inward/record.url?scp=77956285048&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77956285048&partnerID=8YFLogxK
U2 - 10.1109/TASL.2010.2040791
DO - 10.1109/TASL.2010.2040791
M3 - Article
AN - SCOPUS:77956285048
VL - 18
SP - 1994
EP - 2003
JO - IEEE Transactions on Audio, Speech, and Language Processing
JF - IEEE Transactions on Audio, Speech, and Language Processing
SN - 1558-7916
IS - 8
M1 - 5443736
ER -