Speech emotion recognition using autoencoder bottleneck features and LSTM

Kun Yi Huang, Chung-Hsien Wu, Tsung Hsien Yang, Ming Hsiang Su, Jia Hui Chou

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

1 Citation (Scopus)

Abstract

A complete emotional expression follows a complex temporal course over a conversation. Related research that processes speech at the utterance or segment level does not account for subtle differences in characteristics or for historical information. Because the Deep Scattering Spectrum (DSS) captures more detailed energy distributions in the frequency domain than Low-Level Descriptors (LLDs), this work combines LLDs and DSS as the speech features. An autoencoder neural network is then applied to extract bottleneck features for dimensionality reduction. Finally, a long short-term memory (LSTM) network is employed to characterize the temporal variation of speech emotion for emotion recognition. The MHMC emotion database was collected and used for performance evaluation. Experimental results show that the proposed method, using bottleneck features from the combination of LLDs and DSS, achieved an emotion recognition accuracy of 98.1%, outperforming systems that use LLDs or DSS individually.
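The data flow described in the abstract (frame-level LLD and DSS features, an autoencoder bottleneck, then an LSTM over the frame sequence) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: all dimensions, the number of emotion classes, the tanh encoder, the single-layer LSTM, and the function names are assumptions chosen for illustration, and the weights are random rather than trained. In practice the encoder would first be trained together with a decoder to reconstruct its input, and the LSTM would be trained on labelled emotion data.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode(frames, w_enc, b_enc):
    """Encoder half of an autoencoder: map each frame to a bottleneck code."""
    return np.tanh(frames @ w_enc + b_enc)

def lstm_last_state(seq, w, u, b):
    """Run a single-layer LSTM over seq (T, d); return the final hidden state."""
    hidden = u.shape[0]
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x_t in seq:
        z = x_t @ w + h @ u + b              # pre-activations, shape (4 * hidden,)
        i, f, o, g = np.split(z, 4)          # input, forget, output gates; candidate
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        c = f * c + i * g                    # update cell state
        h = o * np.tanh(c)                   # update hidden state
    return h

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical sizes; none of these values come from the paper.
T, LLD_DIM, DSS_DIM = 50, 30, 60
BOTTLENECK, HIDDEN, N_EMOTIONS = 16, 32, 4

# Stand-ins for real per-frame LLD and DSS features of one utterance.
lld = rng.standard_normal((T, LLD_DIM))
dss = rng.standard_normal((T, DSS_DIM))
frames = np.concatenate([lld, dss], axis=1)  # combined features, shape (T, 90)

# Random (untrained) parameters, for shape illustration only.
w_enc = rng.standard_normal((LLD_DIM + DSS_DIM, BOTTLENECK)) * 0.1
b_enc = np.zeros(BOTTLENECK)
w = rng.standard_normal((BOTTLENECK, 4 * HIDDEN)) * 0.1
u = rng.standard_normal((HIDDEN, 4 * HIDDEN)) * 0.1
b = np.zeros(4 * HIDDEN)
w_out = rng.standard_normal((HIDDEN, N_EMOTIONS)) * 0.1

codes = encode(frames, w_enc, b_enc)         # bottleneck features, shape (T, 16)
h_final = lstm_last_state(codes, w, u, b)    # summary of the temporal course
probs = softmax(h_final @ w_out)             # distribution over emotion classes
```

The bottleneck step reduces each 90-dimensional frame to 16 dimensions before the recurrent layer, so the LSTM sees a shorter feature vector per time step while still processing the full temporal sequence.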

Original language: English
Title of host publication: 2016 International Conference on Orange Technologies, ICOT 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1-4
Number of pages: 4
ISBN (Electronic): 9781538648315
DOIs: https://doi.org/10.1109/ICOT.2016.8278965
Publication status: Published - 2018 Feb 1
Event: 2016 International Conference on Orange Technologies, ICOT 2016 - Melbourne, Australia
Duration: 2016 Dec 18 to 2016 Dec 20

Publication series

Name: 2016 International Conference on Orange Technologies, ICOT 2016
Volume: 2018-January

Other

Other: 2016 International Conference on Orange Technologies, ICOT 2016
Country: Australia
City: Melbourne
Period: 16-12-18 to 16-12-20

Fingerprint

Long-Term Memory
Speech recognition
Short-Term Memory
Emotions
Scattering
Databases
Neural networks
Recognition (Psychology)
Long short-term memory
Processing
Research

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Computer Vision and Pattern Recognition
  • Behavioral Neuroscience
  • Cognitive Neuroscience

Cite this

Huang, K. Y., Wu, C-H., Yang, T. H., Su, M. H., & Chou, J. H. (2018). Speech emotion recognition using autoencoder bottleneck features and LSTM. In 2016 International Conference on Orange Technologies, ICOT 2016 (pp. 1-4). [8278965] (2016 International Conference on Orange Technologies, ICOT 2016; Vol. 2018-January). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICOT.2016.8278965
Huang, Kun Yi ; Wu, Chung-Hsien ; Yang, Tsung Hsien ; Su, Ming Hsiang ; Chou, Jia Hui. / Speech emotion recognition using autoencoder bottleneck features and LSTM. 2016 International Conference on Orange Technologies, ICOT 2016. Institute of Electrical and Electronics Engineers Inc., 2018. pp. 1-4 (2016 International Conference on Orange Technologies, ICOT 2016).
@inproceedings{9d8c59d5fabc4580a9afb98cf07eee7c,
title = "Speech emotion recognition using autoencoder bottleneck features and LSTM",
abstract = "A complete emotional expression contains a complex temporal course in a conversation. Related research on utterance and segment-level processing lacks considering subtle differences in characteristics and historical information. In this work, as Deep Scattering Spectrum (DSS) can obtain more detailed energy distributions in frequency domain than the Low Level Descriptors (LLDs), this work combines LLDs and DSS as the speech features. Autoencoder neural network is then applied to extract the bottleneck features for dimensionality reduction. Finally, the long-short term memory (LSTM) is employed to characterize temporal variation of speech emotion for emotion recognition. For evaluation, the MHMC emotion database was collected and used for performance evaluation. Experimental results show that the proposed method using the bottleneck features from the combination of the LLDs and DSS achieved an emotion recognition accuracy of 98.1{\%}, outperforming the systems using LLDs or DSS individually.",
author = "Huang, {Kun Yi} and Chung-Hsien Wu and Yang, {Tsung Hsien} and Su, {Ming Hsiang} and Chou, {Jia Hui}",
year = "2018",
month = "2",
day = "1",
doi = "10.1109/ICOT.2016.8278965",
language = "English",
series = "2016 International Conference on Orange Technologies, ICOT 2016",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "1--4",
booktitle = "2016 International Conference on Orange Technologies, ICOT 2016",
address = "United States",

}

Huang, KY, Wu, C-H, Yang, TH, Su, MH & Chou, JH 2018, Speech emotion recognition using autoencoder bottleneck features and LSTM. in 2016 International Conference on Orange Technologies, ICOT 2016., 8278965, 2016 International Conference on Orange Technologies, ICOT 2016, vol. 2018-January, Institute of Electrical and Electronics Engineers Inc., pp. 1-4, 2016 International Conference on Orange Technologies, ICOT 2016, Melbourne, Australia, 16-12-18. https://doi.org/10.1109/ICOT.2016.8278965

Speech emotion recognition using autoencoder bottleneck features and LSTM. / Huang, Kun Yi; Wu, Chung-Hsien; Yang, Tsung Hsien; Su, Ming Hsiang; Chou, Jia Hui.

2016 International Conference on Orange Technologies, ICOT 2016. Institute of Electrical and Electronics Engineers Inc., 2018. p. 1-4 8278965 (2016 International Conference on Orange Technologies, ICOT 2016; Vol. 2018-January).

TY - GEN

T1 - Speech emotion recognition using autoencoder bottleneck features and LSTM

AU - Huang, Kun Yi

AU - Wu, Chung-Hsien

AU - Yang, Tsung Hsien

AU - Su, Ming Hsiang

AU - Chou, Jia Hui

PY - 2018/2/1

Y1 - 2018/2/1

AB - A complete emotional expression contains a complex temporal course in a conversation. Related research on utterance and segment-level processing lacks considering subtle differences in characteristics and historical information. In this work, as Deep Scattering Spectrum (DSS) can obtain more detailed energy distributions in frequency domain than the Low Level Descriptors (LLDs), this work combines LLDs and DSS as the speech features. Autoencoder neural network is then applied to extract the bottleneck features for dimensionality reduction. Finally, the long-short term memory (LSTM) is employed to characterize temporal variation of speech emotion for emotion recognition. For evaluation, the MHMC emotion database was collected and used for performance evaluation. Experimental results show that the proposed method using the bottleneck features from the combination of the LLDs and DSS achieved an emotion recognition accuracy of 98.1%, outperforming the systems using LLDs or DSS individually.

UR - http://www.scopus.com/inward/record.url?scp=85050507118&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85050507118&partnerID=8YFLogxK

U2 - 10.1109/ICOT.2016.8278965

DO - 10.1109/ICOT.2016.8278965

M3 - Conference contribution

T3 - 2016 International Conference on Orange Technologies, ICOT 2016

SP - 1

EP - 4

BT - 2016 International Conference on Orange Technologies, ICOT 2016

PB - Institute of Electrical and Electronics Engineers Inc.

ER -

Huang KY, Wu C-H, Yang TH, Su MH, Chou JH. Speech emotion recognition using autoencoder bottleneck features and LSTM. In 2016 International Conference on Orange Technologies, ICOT 2016. Institute of Electrical and Electronics Engineers Inc. 2018. p. 1-4. 8278965. (2016 International Conference on Orange Technologies, ICOT 2016). https://doi.org/10.1109/ICOT.2016.8278965