Speech emotion recognition using convolutional neural network with audio word-based embedding

Kun Yi Huang, Chung-Hsien Wu, Qian Bei Hong, Ming Hsiang Su, Yuan Rong Zeng

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

A complete emotional expression typically follows a complex temporal course in a natural conversation. Related research on utterance-level, segment-level and multi-level processing lacks an understanding of the underlying relations in emotional speech. In this work, a convolutional neural network (CNN) with audio word-based embedding is proposed for emotion modeling. Vector quantization is first applied to convert the low-level features of each speech frame into audio words using the k-means algorithm. Word2vec is then adopted to convert an input speech utterance into the corresponding audio word vector sequence. Finally, the audio word vector sequences of the emotion-annotated training speech data are used to construct the CNN-based emotion model. The NCKU-ES database, containing seven emotion categories (happiness, boredom, anger, anxiety, sadness, surprise and disgust), was collected, and five-fold cross validation was used to evaluate the performance of the proposed CNN-based method for speech emotion recognition. Experimental results show that the proposed method achieved an emotion recognition accuracy of 82.34%, an improvement of 8.7% over a Long Short-Term Memory (LSTM)-based method, which faced the challenging issue of long input sequences. Compared with raw features, the audio word-based embedding achieved an improvement of 3.4% for speech emotion recognition.
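The front end described in the abstract (frame features → k-means audio words → embedding lookup feeding the CNN) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature dimension, codebook size, embedding size, and random toy data are all assumptions, and a plain embedding table stands in for vectors that the paper trains with word2vec.

```python
import numpy as np

def kmeans(frames, k, iters=20, seed=0):
    """Toy k-means: cluster frame-level features into k audio words."""
    rng = np.random.default_rng(seed)
    centroids = frames[rng.choice(len(frames), size=k, replace=False)]
    for _ in range(iters):
        # assign each frame to its nearest centroid (its audio word)
        dists = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = frames[labels == j].mean(axis=0)
    return centroids

def to_audio_words(utterance, centroids):
    """Vector-quantize an utterance (frames x features) into audio-word IDs."""
    dists = np.linalg.norm(utterance[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)

# toy data: 500 training frames of 13-dim features (e.g. MFCCs), 16 audio words
rng = np.random.default_rng(1)
frames = rng.normal(size=(500, 13))
centroids = kmeans(frames, k=16)

utterance = rng.normal(size=(80, 13))      # one 80-frame utterance
word_ids = to_audio_words(utterance, centroids)

# stand-in for trained word2vec vectors: one 32-dim embedding per audio word
embeddings = rng.normal(size=(16, 32))
vector_seq = embeddings[word_ids]          # CNN input: 80 x 32 vector sequence
```

The point of the quantization step is that the CNN then operates on a fixed vocabulary of audio words rather than raw frame features, which the abstract reports is worth 3.4% accuracy.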

Original language: English
Title of host publication: 2018 11th International Symposium on Chinese Spoken Language Processing, ISCSLP 2018 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 265-269
Number of pages: 5
ISBN (Electronic): 9781538656273
DOI: 10.1109/ISCSLP.2018.8706610
Publication status: Published - 2018 Jul 2
Event: 11th International Symposium on Chinese Spoken Language Processing, ISCSLP 2018 - Taipei, Taiwan
Duration: 2018 Nov 26 - 2018 Nov 29

Publication series

Name: 2018 11th International Symposium on Chinese Spoken Language Processing, ISCSLP 2018 - Proceedings

Conference

Conference: 11th International Symposium on Chinese Spoken Language Processing, ISCSLP 2018
Country: Taiwan
City: Taipei
Period: 18-11-26 - 18-11-29


All Science Journal Classification (ASJC) codes

  • Linguistics and Language
  • Language and Linguistics

Cite this

Huang, K. Y., Wu, C-H., Hong, Q. B., Su, M. H., & Zeng, Y. R. (2018). Speech emotion recognition using convolutional neural network with audio word-based embedding. In 2018 11th International Symposium on Chinese Spoken Language Processing, ISCSLP 2018 - Proceedings (pp. 265-269). [8706610] (2018 11th International Symposium on Chinese Spoken Language Processing, ISCSLP 2018 - Proceedings). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ISCSLP.2018.8706610