Time Alignment using Lip Images for Frame-based Electrolaryngeal Voice Conversion

Yi Syuan Liou, Wen Chin Huang, Ming Chi Yen, Shu Wei Tsai, Yu Huai Peng, Tomoki Toda, Yu Tsao, Hsin Min Wang

Research output: Chapter in Book/Report/Conference proceedingConference contribution

1 Citation (Scopus)

Abstract

Voice conversion (VC) is an effective approach to electrolaryngeal (EL) speech enhancement, a task that aims to improve the quality of the artificial voice from an electrolarynx device. In frame-based VC methods, time alignment needs to be performed prior to model training, and the dynamic time warping (DTW) algorithm is widely adopted to compute the best time alignment between each utterance pair. The validity is based on the assumption that the same phonemes of the speakers have similar features and can be mapped by measuring a pre-defined distance between speech frames of the source and the target. However, the special characteristics of the EL speech can break the assumption, resulting in a sub-optimal DTW alignment. In this work, we propose to use lip images for time alignment, as we assume that the lip movements of laryngectomee remain normal compared to healthy people. We investigate two naive lip representations and distance metrics, and experimental results demonstrate that the proposed method can significantly outperform the audio-only alignment in terms of objective and subjective evaluations.

Original languageEnglish
Title of host publication2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2021 - Proceedings
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages1234-1238
Number of pages5
ISBN (Electronic)9789881476890
Publication statusPublished - 2021
Event2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2021 - Tokyo, Japan
Duration: 2021 Dec 142021 Dec 17

Publication series

Name2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2021 - Proceedings

Conference

Conference2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2021
Country/TerritoryJapan
CityTokyo
Period21-12-1421-12-17

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Computer Vision and Pattern Recognition
  • Signal Processing
  • Instrumentation

Fingerprint

Dive into the research topics of 'Time Alignment using Lip Images for Frame-based Electrolaryngeal Voice Conversion'. Together they form a unique fingerprint.

Cite this