Voice conversion methods are widely used to convert speech uttered by a source speaker so that it sounds like a target speaker. Frame-based voice conversion, however, generally suffers from incorrect alignment when only spectral distance is used, and therefore generates improper conversion results. In a parallel phone sequence, aligning frames by the minimum spectral distance between the frame-based feature vectors of the source and target phone sequences is theoretically unsound, since the spectral properties of the source and target phones are inherently different. Nevertheless, if the feature vectors of the phone sequence are transformed into codewords in an eigenspace, the eigen-codeword occurrence distribution curves of the source and target phone sequences are likely to be similar. By integrating the codeword occurrence distribution into the distance estimation, a more precise frame alignment based on dynamic time warping can be obtained. With this more precise alignment, voice conversion functions can be properly constructed. Objective and subjective evaluations were conducted, and comparisons with spectral-distance-based alignment confirm the improved performance of the proposed method.
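The alignment idea can be illustrated with a minimal sketch of dynamic time warping whose local cost mixes a spectral distance with a codeword-occurrence term. The abstract does not give the actual distance formula, so everything specific here is an assumption made for illustration: the function name `dtw_align`, the weight `alpha`, the use of Euclidean distance, and the representation of the occurrence distribution as one scalar per frame (`src_occ`, `tgt_occ`).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_align(src, tgt, src_occ, tgt_occ, alpha=0.5):
    """Align two frame sequences with dynamic time warping.

    src, tgt         : lists of spectral feature vectors, one per frame
    src_occ, tgt_occ : one scalar per frame standing in for the codeword
                       occurrence distribution value (hypothetical encoding)
    alpha            : weight of the spectral term (assumed, not from the paper)

    Returns (total_cost, path), where path is a list of (i, j) frame pairs.
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    # cost[i][j] = cheapest cost of aligning the first i source frames
    # with the first j target frames
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            spectral = euclidean(src[i - 1], tgt[j - 1])
            occurrence = abs(src_occ[i - 1] - tgt_occ[j - 1])
            # Assumed combination: a simple weighted sum of the two terms
            local = alpha * spectral + (1.0 - alpha) * occurrence
            cost[i][j] = local + min(
                cost[i - 1][j - 1],  # advance both sequences
                cost[i - 1][j],      # advance source only
                cost[i][j - 1],      # advance target only
            )

    # Backtrack from the end to recover the warping path
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


if __name__ == "__main__":
    source = [[0.0], [1.0], [2.0]]
    target = [[0.0], [1.0], [1.0], [2.0]]
    total, path = dtw_align(source, target, [0.1, 0.5, 0.9], [0.1, 0.5, 0.5, 0.9])
    print(total, path)
```

Setting `alpha=1.0` reduces this sketch to plain spectral-distance alignment, which is the baseline the abstract compares against; lower values let the occurrence term pull the path toward frames whose codeword statistics agree even when their spectra differ.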
|Publication status||Published - 2010|
|Event||7th ISCA Tutorial and Research Workshop on Speech Synthesis, SSW 2010 - Kyoto, Japan|
Duration: 22 Sep 2010 → 24 Sep 2010