TY - JOUR
T1 - Polyglot speech synthesis based on cross-lingual frame selection using auditory and articulatory features
AU - Chen, Chia Ping
AU - Huang, Yi Chin
AU - Wu, Chung Hsien
AU - Lee, Kuan De
N1 - Publisher Copyright:
© 2014 IEEE.
PY - 2014/10/1
Y1 - 2014/10/1
N2 - In this paper, an approach for polyglot speech synthesis based on cross-lingual frame selection is proposed. This method requires only monolingual speech data from different speakers in different languages to build a polyglot synthesis system, thus reducing the burden of data collection. Essentially, a set of artificial utterances in the second language is constructed for a target speaker through the proposed cross-lingual frame-selection process, and this data set is used to adapt a second-language synthesis model to the speaker. In the cross-lingual frame-selection process, we propose to use auditory and articulatory features to improve the quality of the synthesized polyglot speech. For evaluation, a Mandarin-English polyglot system is implemented in which the target speaker speaks only Mandarin. The results show that decent performance regarding voice identity and speech quality can be achieved with the proposed method.
AB - In this paper, an approach for polyglot speech synthesis based on cross-lingual frame selection is proposed. This method requires only monolingual speech data from different speakers in different languages to build a polyglot synthesis system, thus reducing the burden of data collection. Essentially, a set of artificial utterances in the second language is constructed for a target speaker through the proposed cross-lingual frame-selection process, and this data set is used to adapt a second-language synthesis model to the speaker. In the cross-lingual frame-selection process, we propose to use auditory and articulatory features to improve the quality of the synthesized polyglot speech. For evaluation, a Mandarin-English polyglot system is implemented in which the target speaker speaks only Mandarin. The results show that decent performance regarding voice identity and speech quality can be achieved with the proposed method.
UR - http://www.scopus.com/inward/record.url?scp=84911408131&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84911408131&partnerID=8YFLogxK
U2 - 10.1109/TASLP.2014.2339738
DO - 10.1109/TASLP.2014.2339738
M3 - Article
AN - SCOPUS:84911408131
SN - 1558-7916
VL - 22
SP - 1558
EP - 1570
JO - IEEE Transactions on Audio, Speech, and Language Processing
JF - IEEE Transactions on Audio, Speech, and Language Processing
IS - 10
M1 - 2339738
ER -