TY - GEN
T1 - Attentively-Coupled Long Short-Term Memory for Audio-Visual Emotion Recognition
AU - Hsu, Jia Hao
AU - Wu, Chung Hsien
N1 - Publisher Copyright:
© 2020 APSIPA.
PY - 2020/12/7
Y1 - 2020/12/7
N2 - An increasing number of studies have addressed emotion recognition using multiple modalities. Among existing audio-visual emotion recognition methods, few have focused on modeling emotional fluctuations in the signals. Moreover, how to fuse multimodal signals, such as audio-visual signals, remains a challenging issue. In this paper, segments of audio-visual signals are extracted and used as the recognition unit to characterize emotional fluctuation. An Attentively-Coupled Long Short-Term Memory (ACLSTM) model is proposed to combine the audio-based and visual-based LSTMs to improve emotion recognition performance. In the Attentively-Coupled LSTM, the Coupled LSTM serves as the fusion model, and a neural tensor network (NTN) is employed for attention estimation to obtain the segment-based emotion consistency between audio and visual segments. Compared with previous approaches, the experimental results showed that the proposed method achieved the best result of 70.1% in multi-modal emotion recognition on the BAUM-I dataset.
AB - An increasing number of studies have addressed emotion recognition using multiple modalities. Among existing audio-visual emotion recognition methods, few have focused on modeling emotional fluctuations in the signals. Moreover, how to fuse multimodal signals, such as audio-visual signals, remains a challenging issue. In this paper, segments of audio-visual signals are extracted and used as the recognition unit to characterize emotional fluctuation. An Attentively-Coupled Long Short-Term Memory (ACLSTM) model is proposed to combine the audio-based and visual-based LSTMs to improve emotion recognition performance. In the Attentively-Coupled LSTM, the Coupled LSTM serves as the fusion model, and a neural tensor network (NTN) is employed for attention estimation to obtain the segment-based emotion consistency between audio and visual segments. Compared with previous approaches, the experimental results showed that the proposed method achieved the best result of 70.1% in multi-modal emotion recognition on the BAUM-I dataset.
UR - http://www.scopus.com/inward/record.url?scp=85100918587&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85100918587&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85100918587
T3 - 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2020 - Proceedings
SP - 1048
EP - 1053
BT - 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2020 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2020
Y2 - 7 December 2020 through 10 December 2020
ER -