TY - GEN
T1 - Enhancing Expressiveness of Synthesized Speech in Human-Robot Interaction
T2 - 10th International Conference on Control, Automation and Robotics, ICCAR 2024
AU - Kuo, Yao Chieh
AU - Tsai, Pei Hsuan
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - The utilization of speech synthesis systems, particularly text-to-speech (TTS), in human-robot interaction poses a persistent challenge: users frequently perceive synthesized speech as lacking natural prosody variation, which can impede the effectiveness of such interactions. In response, this paper investigates voice conversion-based methods that aim to make synthesized speech more expressive by extracting prosody features from users. A comprehensive investigation is carried out, covering the implementation of several established and widely used techniques. To ensure the relevance and practicality of the study, content derived from hospital patient education materials is employed as the primary source material. The experiments reveal that existing voice conversion (VC) and emotional voice conversion (EVC) methods do not substantially enhance the naturalness of synthesized speech. Moreover, the quantitative metrics commonly used in voice conversion, taken individually, do not necessarily reflect the subjective experience of synthesized speech. These findings advance the fields of speech synthesis and human-robot interaction, offering valuable insights for future research and development.
AB - The utilization of speech synthesis systems, particularly text-to-speech (TTS), in human-robot interaction poses a persistent challenge: users frequently perceive synthesized speech as lacking natural prosody variation, which can impede the effectiveness of such interactions. In response, this paper investigates voice conversion-based methods that aim to make synthesized speech more expressive by extracting prosody features from users. A comprehensive investigation is carried out, covering the implementation of several established and widely used techniques. To ensure the relevance and practicality of the study, content derived from hospital patient education materials is employed as the primary source material. The experiments reveal that existing voice conversion (VC) and emotional voice conversion (EVC) methods do not substantially enhance the naturalness of synthesized speech. Moreover, the quantitative metrics commonly used in voice conversion, taken individually, do not necessarily reflect the subjective experience of synthesized speech. These findings advance the fields of speech synthesis and human-robot interaction, offering valuable insights for future research and development.
UR - http://www.scopus.com/inward/record.url?scp=85198222406&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85198222406&partnerID=8YFLogxK
U2 - 10.1109/ICCAR61844.2024.10569697
DO - 10.1109/ICCAR61844.2024.10569697
M3 - Conference contribution
AN - SCOPUS:85198222406
T3 - 2024 10th International Conference on Control, Automation and Robotics, ICCAR 2024
SP - 1
EP - 4
BT - 2024 10th International Conference on Control, Automation and Robotics, ICCAR 2024
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 27 April 2024 through 29 April 2024
ER -