Speech Emotion Recognition using Decomposed Speech via Multi-task Learning

Jia Hao Hsu, Chung Hsien Wu, Yu Hung Wei

Research output: Conference article › peer-reviewed

1 citation (Scopus)


In speech emotion recognition, most recent studies use powerful models to obtain robust features without considering disentangled speech components, which contain diverse emotion-rich information helpful for speech emotion recognition. In this study, an autoencoder is used as the speech decomposition model to obtain disentangled components, including content, timbre, pitch, and rhythm features, which are regarded as emotion-rich features for speech emotion recognition. A multi-task training mechanism is then used to jointly train the tasks of speech emotion recognition, speaker recognition, speech recognition, and spectral reconstruction, exploiting commonalities and differences across tasks. The proposed model achieved an accuracy of 77.50% on the four-class emotion recognition task of IEMOCAP. Experiments showed that the proposed methods can effectively improve speech emotion recognition performance, outperforming the SOTA approach.
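The pipeline the abstract describes can be sketched as follows: per-component encoders produce disentangled embeddings, an emotion head consumes their concatenation, and the training objective is a weighted sum of per-task losses. This is a minimal illustrative sketch, not the paper's implementation: the dimensions, the random linear encoders, the task-loss placeholders, and the loss weights are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumed, not from the paper): 80-dim mel frames,
# a 16-dim embedding per decomposed component, 4 emotion classes.
N_FRAMES, N_MELS, EMB, N_EMOTIONS = 100, 80, 16, 4


def encode(spec, weight):
    """Stand-in component encoder: linear projection + mean pooling."""
    return np.tanh(spec @ weight).mean(axis=0)


# One stand-in encoder per disentangled component.
components = ["content", "timbre", "pitch", "rhythm"]
encoders = {c: rng.standard_normal((N_MELS, EMB)) * 0.1 for c in components}

spec = rng.standard_normal((N_FRAMES, N_MELS))     # fake mel spectrogram
embs = {c: encode(spec, w) for c, w in encoders.items()}

# Emotion head consumes the concatenated emotion-rich components.
z = np.concatenate([embs[c] for c in components])  # shape (4 * EMB,)
W_emo = rng.standard_normal((z.size, N_EMOTIONS)) * 0.1
logits = z @ W_emo
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Multi-task objective: weighted sum of per-task losses. The speaker-ID,
# ASR, and reconstruction losses are placeholder scalars here; the weights
# are assumed, not taken from the paper.
losses = {"emotion": -np.log(probs[0]), "speaker": 0.7, "asr": 1.2, "recon": 0.4}
weights = {"emotion": 1.0, "speaker": 0.5, "asr": 0.5, "recon": 0.5}
total_loss = sum(weights[t] * losses[t] for t in losses)
```

Sharing the decomposed embeddings across the four heads is what lets the auxiliary tasks (speaker, ASR, reconstruction) regularize the emotion task, the "commonalities and differences across tasks" the abstract refers to.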

Pages (from-to): 4553-4557
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication status: Published - 2023
Event: 24th International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration: 20 Aug 2023 → 24 Aug 2023

All Science Journal Classification (ASJC) codes

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

