With the continuous evolution of human-computer interaction products, many smart products, such as smart speakers, home robots, and self-driving cars, can support our daily needs. Adding recognition of the user's emotion to the interaction with these products makes them more humane and increases the flexibility of the interaction. Studies on emotion recognition have been growing; however, among existing audio-visual emotion recognition systems, only a few focus on segment-based recognition of emotion expression, in contrast to utterance-based emotion recognition. Segment-based emotion expression reveals the fluctuations and finer details of how emotion is expressed. This thesis uses segments as the recognition unit to capture the facial expressions and audio signals of the speaker, analyzes the different characteristics of the facial and audio signals, and considers the pre- and post-dependence of the segmented signals. In the segmentation process, an important segment that has a great influence on the expression of the whole utterance is first identified, and this segment is given higher attention in the overall recognition to improve the recognition accuracy of each segment. Different from single-modal emotion recognition, a multi-modal emotion recognition architecture considers data from different modalities. This thesis focuses on improving the fusion mechanism for segment-based emotion recognition by using an attentively-coupled long short-term memory model. With the attention mechanism in each fusion operation, the coupling unit can simultaneously consider the mutual influence of the two modal signal features when updating the cell, and it assigns a degree of attention to each sequential segment for emotion recognition. Long short-term memory is adopted to control the flow of information and learn the long- and short-term dependence of the signal. The model produces an emotion prediction sequence over the segments and is expected to recognize emotion from both the facial and audio expressions of the speaker. In the experiments, the proposed audio-visual emotion recognition system achieved an accuracy of 70.1%, outperforming existing traditional audio-visual emotion recognition systems. The experimental results showed that the proposed attentively-coupled long short-term memory model achieves good results in multi-modal emotion recognition and in emotion recognition using segment-based attention.
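The abstract describes a coupled LSTM in which each modality's update is conditioned on the other modality's state, with attention over the segment sequence. The following is a minimal illustrative sketch of that idea, not the thesis implementation: the PyTorch module, layer sizes, and the specific way the hidden states are exchanged and attended are assumptions made for illustration.

```python
# Minimal sketch (assumed design, not the thesis code): an attentively-coupled
# LSTM that fuses segment-level audio and visual features.
import torch
import torch.nn as nn


class AttentivelyCoupledLSTM(nn.Module):
    """Two LSTM cells (audio, visual) whose inputs are augmented with the other
    modality's previous hidden state; a softmax attention over the segment
    sequence weights the fused states before emotion classification."""

    def __init__(self, audio_dim, visual_dim, hidden_dim, num_emotions):
        super().__init__()
        # Each cell sees its own features plus the other modality's hidden state.
        self.audio_cell = nn.LSTMCell(audio_dim + hidden_dim, hidden_dim)
        self.visual_cell = nn.LSTMCell(visual_dim + hidden_dim, hidden_dim)
        self.attn = nn.Linear(2 * hidden_dim, 1)           # per-segment attention score
        self.classifier = nn.Linear(2 * hidden_dim, num_emotions)

    def forward(self, audio_segs, visual_segs):
        # audio_segs:  (batch, num_segments, audio_dim)
        # visual_segs: (batch, num_segments, visual_dim)
        batch, num_segs, _ = audio_segs.shape
        hidden = self.audio_cell.hidden_size
        h_a = audio_segs.new_zeros(batch, hidden)
        c_a = audio_segs.new_zeros(batch, hidden)
        h_v = visual_segs.new_zeros(batch, hidden)
        c_v = visual_segs.new_zeros(batch, hidden)

        fused_states = []
        for t in range(num_segs):
            # Coupled update: each modality conditions on the other's hidden state.
            h_a, c_a = self.audio_cell(
                torch.cat([audio_segs[:, t], h_v], dim=-1), (h_a, c_a))
            h_v, c_v = self.visual_cell(
                torch.cat([visual_segs[:, t], h_a], dim=-1), (h_v, c_v))
            fused_states.append(torch.cat([h_a, h_v], dim=-1))

        fused = torch.stack(fused_states, dim=1)            # (batch, num_segments, 2*hidden)
        weights = torch.softmax(self.attn(fused).squeeze(-1), dim=1)

        # Per-segment emotion predictions, plus an attention-pooled utterance prediction.
        segment_logits = self.classifier(fused)
        utterance_logits = self.classifier((weights.unsqueeze(-1) * fused).sum(dim=1))
        return segment_logits, utterance_logits, weights
```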
Date of Award | 2019
Original language | English
Supervisor | Chung-Hsien Wu (Supervisor)
Attentively-Coupled Long Short-Term Memory for Audio-Visual Emotion Recognition
嘉昊, 徐. (Author). 2019
Student thesis: Doctoral Thesis