TY - JOUR
T1 - Personalized spectral and prosody conversion using frame-based codeword distribution and adaptive CRF
AU - Huang, Yi-Chin
AU - Wu, Chung-Hsien
AU - Chao, Yu-Ting
PY - 2013
Y1 - 2013
N2 - This study proposes a voice conversion-based approach to personalized text-to-speech (TTS) synthesis. The conversion functions, trained using a small parallel corpus with source and target speech data, can impose the voice characteristics of a target speaker on an existing synthesizer. Frame alignment between a pair of sentences in the parallel corpus is generally used for training voice conversion functions. However, with incorrect alignment, the resultant conversion functions may generate unacceptable conversion results. Traditional frame alignment using minimal spectral distance between the frame-based feature vectors of the source and the target phone sequences can be imprecise because the voice properties of the source and target phones inherently differ. In the proposed method, feature vectors of the parallel corpus are transformed into codewords in an eigenspace. A more precise frame alignment can be obtained by integrating the codeword occurrence distributions into distance estimation. In addition to the spectral property, a prosodic word/phrase boundary prediction model was constructed using an adaptive conditional random field (CRF) to generate personalized prosodic information. Objective and subjective tests were conducted to evaluate the performance of the proposed approach. The experimental results showed that the proposed voice conversion method, based on distribution-based alignment and prosodic word boundary detection, can improve the speech quality and speaker similarity of the converted speech. Compared to other methods, the evaluation results verified the improved performance of the proposed method.
UR - http://www.scopus.com/inward/record.url?scp=84867950508&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84867950508&partnerID=8YFLogxK
U2 - 10.1109/TASL.2012.2213247
DO - 10.1109/TASL.2012.2213247
M3 - Article
AN - SCOPUS:84867950508
VL - 21
SP - 51
EP - 62
JO - IEEE Transactions on Audio, Speech, and Language Processing
JF - IEEE Transactions on Audio, Speech, and Language Processing
SN - 1558-7916
IS - 1
M1 - 6269060
ER -