TY - GEN
T1 - Time Alignment using Lip Images for Frame-based Electrolaryngeal Voice Conversion
AU - Liou, Yi-Syuan
AU - Huang, Wen-Chin
AU - Yen, Ming-Chi
AU - Tsai, Shu-Wei
AU - Peng, Yu-Huai
AU - Toda, Tomoki
AU - Tsao, Yu
AU - Wang, Hsin-Min
N1 - Funding Information:
This work was partly supported by JSPS KAKENHI Grant Number 21J20920 and JST CREST Grant Number JPMJCR19A3, Japan. This work was also partly supported by MOST-Taiwan Grant 107-2221-E-001-008-MY3. In addition, this study was approved by a local Institutional Review Board (TMU-JIRB 202005100). Informed consent was obtained from all participants prior to the experiment.
Publisher Copyright:
© 2021 APSIPA.
PY - 2021
Y1 - 2021
AB - Voice conversion (VC) is an effective approach to electrolaryngeal (EL) speech enhancement, a task that aims to improve the quality of the artificial voice produced by an electrolarynx device. In frame-based VC methods, time alignment must be performed prior to model training, and the dynamic time warping (DTW) algorithm is widely adopted to compute the best time alignment between each utterance pair. The validity of this approach rests on the assumption that the same phonemes spoken by the source and target speakers have similar features, so that frames can be mapped by measuring a predefined distance between source and target speech frames. However, the special characteristics of EL speech can break this assumption, resulting in a sub-optimal DTW alignment. In this work, we propose using lip images for time alignment, based on the assumption that the lip movements of laryngectomees remain normal compared with those of healthy people. We investigate two naive lip representations and distance metrics, and experimental results demonstrate that the proposed method significantly outperforms audio-only alignment in both objective and subjective evaluations.
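As a rough illustration of the alignment idea summarized in the abstract (not the authors' implementation, whose lip representations and distance metrics differ), the following Python sketch computes a DTW path from a frame-distance matrix built over hypothetical flattened grayscale lip crops; the Euclidean metric and the (T, D) frame layout are assumptions.

    import numpy as np

    def dtw_path(cost):
        # Standard dynamic time warping over a precomputed frame-distance matrix.
        n, m = cost.shape
        acc = np.full((n + 1, m + 1), np.inf)
        acc[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                acc[i, j] = cost[i - 1, j - 1] + min(
                    acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
                )
        # Backtrack from (n, m) to recover the optimal alignment path.
        i, j, path = n, m, []
        while i > 0 and j > 0:
            path.append((i - 1, j - 1))
            step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
            if step == 0:
                i, j = i - 1, j - 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
        return path[::-1]

    def align_with_lips(src_lips, tgt_lips):
        # src_lips, tgt_lips: (T, D) arrays of flattened grayscale lip crops
        # (a hypothetical representation); distance is Euclidean per frame pair.
        cost = np.linalg.norm(src_lips[:, None, :] - tgt_lips[None, :, :], axis=-1)
        return dtw_path(cost)

Given the returned path, source and target acoustic frames can be paired index-by-index to build the parallel training data for a frame-based conversion model, replacing the audio-feature distance that standard DTW would use.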
UR - http://www.scopus.com/inward/record.url?scp=85126685148&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85126685148&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85126685148
T3 - 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2021 - Proceedings
SP - 1234
EP - 1238
BT - 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2021 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2021
Y2 - 14 December 2021 through 17 December 2021
ER -