Chinese-English phone set construction for code-switching ASR using acoustic and DNN-extracted articulatory features

Chung Hsien Wu, Han Ping Shen, Yan Ting Yang

Research output: Article

6 citations (Scopus)

Abstract

This study proposes a data-driven approach to phone set construction for code-switching automatic speech recognition (ASR). Acoustic and context-dependent cross-lingual articulatory features (AFs) are incorporated into the estimation of the distance between triphone units for constructing a Chinese-English phone set. The acoustic features of each triphone in the training corpus are extracted for constructing an acoustic triphone HMM. Furthermore, the articulatory features of the "last/first" state of the corresponding preceding/succeeding triphone in the training corpus are used to construct an AF-based GMM. The AFs, extracted using a deep neural network (DNN), are used for code-switching articulation modeling to alleviate the data sparseness problem due to the diverse context-dependent phone combinations in intra-sentential code-switching. The triphones are then clustered to obtain a Chinese-English phone set based on the acoustic HMMs and the AF-based GMMs using a hierarchical triphone clustering algorithm. Experimental results on code-switching ASR show that the proposed method for phone set construction outperformed other traditional methods.
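The clustering step described above can be illustrated with a minimal sketch. The following is an assumption-laden illustration, not the paper's actual algorithm: it assumes pairwise triphone distances (e.g., interpolating an acoustic-HMM distance and an AF-GMM distance) are already computed, and uses greedy single-linkage agglomerative merging until a target phone-set size is reached. The function names, the 0.5 interpolation weight, and the single-linkage choice are all illustrative.

```python
def combined_distance(acoustic_d, af_d, w=0.5):
    """Interpolate an acoustic-HMM distance and an AF-GMM distance.
    The weight w is an assumed hyperparameter, not from the paper."""
    return w * acoustic_d + (1.0 - w) * af_d

def cluster_triphones(units, dist, target_size):
    """Greedy agglomerative clustering: repeatedly merge the closest
    pair of clusters until only `target_size` clusters (the shared
    Chinese-English phone set) remain.

    units: list of triphone labels
    dist:  symmetric dict-of-dicts of pairwise distances
    """
    clusters = [{u} for u in units]
    while len(clusters) > target_size:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-linkage: minimum distance between any two members
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters
```

In practice the merge criterion and stopping point would be derived from the HMM/GMM models themselves; this sketch only shows the hierarchical-merging skeleton.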

Original language: English
Pages (from - to): 858-862
Number of pages: 5
Journal: IEEE Transactions on Audio, Speech and Language Processing
Volume: 22
Issue number: 4
DOI: 10.1109/TASLP.2014.2310353
Publication status: Published - 1 April 2014

Fingerprint

Speech recognition
Acoustics
Education
Clustering algorithms
Deep neural networks

All Science Journal Classification (ASJC) codes

  • Acoustics and Ultrasonics
  • Electrical and Electronic Engineering

Cite this

@article{9a7cbd7844e042d1ae03312d4cb1637c,
title = "Chinese-English phone set construction for code-switching ASR using acoustic and DNN-extracted articulatory features",
abstract = "This study proposes a data-driven approach to phone set construction for code-switching automatic speech recognition (ASR). Acoustic and context-dependent cross-lingual articulatory features (AFs) are incorporated into the estimation of the distance between triphone units for constructing a Chinese-English phone set. The acoustic features of each triphone in the training corpus are extracted for constructing an acoustic triphone HMM. Furthermore, the articulatory features of the {"}last/first{"} state of the corresponding preceding/succeeding triphone in the training corpus are used to construct an AF-based GMM. The AFs, extracted using a deep neural network (DNN), are used for code-switching articulation modeling to alleviate the data sparseness problem due to the diverse context-dependent phone combinations in intra-sentential code-switching. The triphones are then clustered to obtain a Chinese-English phone set based on the acoustic HMMs and the AF-based GMMs using a hierarchical triphone clustering algorithm. Experimental results on code-switching ASR show that the proposed method for phone set construction outperformed other traditional methods.",
author = "Wu, {Chung Hsien} and Shen, {Han Ping} and Yang, {Yan Ting}",
year = "2014",
month = "4",
day = "1",
doi = "10.1109/TASLP.2014.2310353",
language = "English",
volume = "22",
pages = "858--862",
journal = "IEEE Transactions on Audio, Speech and Language Processing",
issn = "1558-7916",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "4",

}

TY - JOUR

T1 - Chinese-English phone set construction for code-switching ASR using acoustic and DNN-extracted articulatory features

AU - Wu, Chung Hsien

AU - Shen, Han Ping

AU - Yang, Yan Ting

PY - 2014/4/1

Y1 - 2014/4/1

N2 - This study proposes a data-driven approach to phone set construction for code-switching automatic speech recognition (ASR). Acoustic and context-dependent cross-lingual articulatory features (AFs) are incorporated into the estimation of the distance between triphone units for constructing a Chinese-English phone set. The acoustic features of each triphone in the training corpus are extracted for constructing an acoustic triphone HMM. Furthermore, the articulatory features of the "last/first" state of the corresponding preceding/succeeding triphone in the training corpus are used to construct an AF-based GMM. The AFs, extracted using a deep neural network (DNN), are used for code-switching articulation modeling to alleviate the data sparseness problem due to the diverse context-dependent phone combinations in intra-sentential code-switching. The triphones are then clustered to obtain a Chinese-English phone set based on the acoustic HMMs and the AF-based GMMs using a hierarchical triphone clustering algorithm. Experimental results on code-switching ASR show that the proposed method for phone set construction outperformed other traditional methods.

AB - This study proposes a data-driven approach to phone set construction for code-switching automatic speech recognition (ASR). Acoustic and context-dependent cross-lingual articulatory features (AFs) are incorporated into the estimation of the distance between triphone units for constructing a Chinese-English phone set. The acoustic features of each triphone in the training corpus are extracted for constructing an acoustic triphone HMM. Furthermore, the articulatory features of the "last/first" state of the corresponding preceding/succeeding triphone in the training corpus are used to construct an AF-based GMM. The AFs, extracted using a deep neural network (DNN), are used for code-switching articulation modeling to alleviate the data sparseness problem due to the diverse context-dependent phone combinations in intra-sentential code-switching. The triphones are then clustered to obtain a Chinese-English phone set based on the acoustic HMMs and the AF-based GMMs using a hierarchical triphone clustering algorithm. Experimental results on code-switching ASR show that the proposed method for phone set construction outperformed other traditional methods.

UR - http://www.scopus.com/inward/record.url?scp=84898062710&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84898062710&partnerID=8YFLogxK

U2 - 10.1109/TASLP.2014.2310353

DO - 10.1109/TASLP.2014.2310353

M3 - Article

AN - SCOPUS:84898062710

VL - 22

SP - 858

EP - 862

JO - IEEE Transactions on Audio, Speech and Language Processing

JF - IEEE Transactions on Audio, Speech and Language Processing

SN - 1558-7916

IS - 4

ER -