TY - GEN
T1 - Natural speech synthesis based on hybrid approach with candidate expansion and verification
AU - Wu, Chung-Hsien
AU - Huang, Yi Chin
AU - Lin, Shih Lun
AU - Chen, Chia Ping
PY - 2014/1/1
Y1 - 2014/1/1
N2 - A hybrid Mandarin speech synthesis system combining concatenation-based and model-based methodology is investigated in this research. To effectively exploit a small-size corpus, the candidate sets for unit selection are expanded via clusters based on articulatory features (AF), which are estimated as the outputs of an artificial neural network. This is followed by a filtering operation incorporating residual compensation, to remove unsuitable units. Given an input text, an optimal unit sequence is decided by the minimization of a total cost, which depends on the spectral features, contextual articulatory features, formants, and pitch values. Furthermore, prosodic word verification is integrated to check the smoothness of the output speech. The units failing to pass the prosodic word verification are replaced by model-based synthesized units for better speech quality. Objective and subjective evaluations have been conducted. Comparisons among the proposed method, the HMM-based method, and the conventional hybrid method clearly show that candidate set expansion based on articulatory features leads to more units suitable for selection, and the verification process is effective in improving the naturalness of the output speech.
AB - A hybrid Mandarin speech synthesis system combining concatenation-based and model-based methodology is investigated in this research. To effectively exploit a small-size corpus, the candidate sets for unit selection are expanded via clusters based on articulatory features (AF), which are estimated as the outputs of an artificial neural network. This is followed by a filtering operation incorporating residual compensation, to remove unsuitable units. Given an input text, an optimal unit sequence is decided by the minimization of a total cost, which depends on the spectral features, contextual articulatory features, formants, and pitch values. Furthermore, prosodic word verification is integrated to check the smoothness of the output speech. The units failing to pass the prosodic word verification are replaced by model-based synthesized units for better speech quality. Objective and subjective evaluations have been conducted. Comparisons among the proposed method, the HMM-based method, and the conventional hybrid method clearly show that candidate set expansion based on articulatory features leads to more units suitable for selection, and the verification process is effective in improving the naturalness of the output speech.
UR - http://www.scopus.com/inward/record.url?scp=84905276701&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84905276701&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2014.6853596
DO - 10.1109/ICASSP.2014.6853596
M3 - Conference contribution
AN - SCOPUS:84905276701
SN - 9781479928927
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 250
EP - 254
BT - 2014 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2014
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2014 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2014
Y2 - 4 May 2014 through 9 May 2014
ER -