TY - JOUR
T1 - Subsyllable-based discriminative segmental Bayesian network for Mandarin speech keyword spotting
AU - Wu, C. H.
PY - 1997
Y1 - 1997
N2 - A continuous Mandarin speech keyword spotting system based on context-dependent subsyllables is presented. In this vocabulary-independent system, users can define their own keywords and most frequently occurring non-keywords without retraining the system. A set of 176 monosyllables and 483 balanced words or sentences is used to extract the context-dependent subsyllables (i.e. initials or finals in Mandarin speech) for training. Each subsyllable is represented by a proposed discriminative segmental Bayesian network (DSBN). In the training process, the generalised probabilistic descent (GPD) algorithm is used for discriminative training. The most frequently occurring non-keywords are divided into keyword predecessors and successors. Non-keyword garbage models for keyword predecessors, keyword successors and extraneous speech are separately constructed. In the recognition process, a final part preprocessor is used to screen out unreasonable hypotheses in order to reduce the recognition time. Using a test set of 750 conversational speech utterances from 20 speakers (ten males and ten females), a word spotting rate of 92.0% was obtained for a user-defined 20-keyword vocabulary when the vocabulary word was embedded in unconstrained extraneous speech.
AB - A continuous Mandarin speech keyword spotting system based on context-dependent subsyllables is presented. In this vocabulary-independent system, users can define their own keywords and most frequently occurring non-keywords without retraining the system. A set of 176 monosyllables and 483 balanced words or sentences is used to extract the context-dependent subsyllables (i.e. initials or finals in Mandarin speech) for training. Each subsyllable is represented by a proposed discriminative segmental Bayesian network (DSBN). In the training process, the generalised probabilistic descent (GPD) algorithm is used for discriminative training. The most frequently occurring non-keywords are divided into keyword predecessors and successors. Non-keyword garbage models for keyword predecessors, keyword successors and extraneous speech are separately constructed. In the recognition process, a final part preprocessor is used to screen out unreasonable hypotheses in order to reduce the recognition time. Using a test set of 750 conversational speech utterances from 20 speakers (ten males and ten females), a word spotting rate of 92.0% was obtained for a user-defined 20-keyword vocabulary when the vocabulary word was embedded in unconstrained extraneous speech.
UR - http://www.scopus.com/inward/record.url?scp=0031124566&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0031124566&partnerID=8YFLogxK
U2 - 10.1049/ip-vis:19971095
DO - 10.1049/ip-vis:19971095
M3 - Article
AN - SCOPUS:0031124566
SN - 1350-245X
VL - 144
SP - 65
EP - 71
JO - IEE Proceedings: Vision, Image and Signal Processing
JF - IEE Proceedings: Vision, Image and Signal Processing
IS - 2
ER -