TY - GEN
T1 - Speech emotion recognition using convolutional neural network with audio word-based embedding
AU - Huang, Kun-Yi
AU - Wu, Chung-Hsien
AU - Hong, Qian-Bei
AU - Su, Ming-Hsiang
AU - Zeng, Yuan-Rong
PY - 2018/7/2
Y1 - 2018/7/2
N2 - A complete emotional expression typically follows a complex temporal course in natural conversation. Related research on utterance-level, segment-level, and multi-level processing lacks an understanding of the underlying relations in emotional speech. In this work, a convolutional neural network (CNN) with audio word-based embedding is proposed for emotion modeling. Vector quantization is first applied to convert the low-level features of each speech frame into audio words using the k-means algorithm. Word2vec is then adopted to convert an input speech utterance into the corresponding audio word vector sequence. Finally, the audio word vector sequences of the emotion-annotated training speech data are used to construct the CNN-based emotion model. The NCKU-ES database, containing seven emotion categories (happiness, boredom, anger, anxiety, sadness, surprise, and disgust), was collected, and five-fold cross-validation was used to evaluate the performance of the proposed CNN-based method for speech emotion recognition. Experimental results show that the proposed method achieved an emotion recognition accuracy of 82.34%, an improvement of 8.7% over the Long Short-Term Memory (LSTM)-based method, which faces the challenge of long input sequences. Compared with raw features, the audio word-based embedding achieved an improvement of 3.4% for speech emotion recognition.
AB - A complete emotional expression typically follows a complex temporal course in natural conversation. Related research on utterance-level, segment-level, and multi-level processing lacks an understanding of the underlying relations in emotional speech. In this work, a convolutional neural network (CNN) with audio word-based embedding is proposed for emotion modeling. Vector quantization is first applied to convert the low-level features of each speech frame into audio words using the k-means algorithm. Word2vec is then adopted to convert an input speech utterance into the corresponding audio word vector sequence. Finally, the audio word vector sequences of the emotion-annotated training speech data are used to construct the CNN-based emotion model. The NCKU-ES database, containing seven emotion categories (happiness, boredom, anger, anxiety, sadness, surprise, and disgust), was collected, and five-fold cross-validation was used to evaluate the performance of the proposed CNN-based method for speech emotion recognition. Experimental results show that the proposed method achieved an emotion recognition accuracy of 82.34%, an improvement of 8.7% over the Long Short-Term Memory (LSTM)-based method, which faces the challenge of long input sequences. Compared with raw features, the audio word-based embedding achieved an improvement of 3.4% for speech emotion recognition.
UR - http://www.scopus.com/inward/record.url?scp=85065872405&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85065872405&partnerID=8YFLogxK
U2 - 10.1109/ISCSLP.2018.8706610
DO - 10.1109/ISCSLP.2018.8706610
M3 - Conference contribution
AN - SCOPUS:85065872405
T3 - 2018 11th International Symposium on Chinese Spoken Language Processing, ISCSLP 2018 - Proceedings
SP - 265
EP - 269
BT - 2018 11th International Symposium on Chinese Spoken Language Processing, ISCSLP 2018 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 11th International Symposium on Chinese Spoken Language Processing, ISCSLP 2018
Y2 - 26 November 2018 through 29 November 2018
ER -