A Study on Data Fusion Strategy for Audio-Visual Emotion Recognition

  • 林 仁俊

Student thesis: Doctoral Thesis


Recent years have seen increased attention given to automatic audio-visual emotion recognition. To increase recognition accuracy, the data fusion strategy, that is, how to effectively integrate the audio and visual cues, has become the major research issue. The fusion operations reported for audio-visual emotion recognition can be classified into three major categories: feature-level fusion, decision-level fusion, and model-level fusion. These strategies have different characteristics and distinct advantages and disadvantages. Based on an analysis of the characteristics of current data fusion strategies, this dissertation first presents a hybrid fusion method that integrates the advantages of strategies with different characteristics to increase recognition performance. The method, named Error Weighted Semi-Coupled Hidden Markov Model (EWSC-HMM), combines the model-level fusion method Semi-Coupled Hidden Markov Model (SC-HMM) with the decision-level fusion method Error Weighted Classifier Combination (EWC) to obtain the optimal emotion recognition result based on audio-visual bimodal fusion. The state-based bimodal alignment strategy in SC-HMM is proposed to align the temporal relationship between the audio and visual streams. The Bayesian classifier weighting scheme EWC is then adopted to explore the contributions of the SC-HMM-based classifiers for different audio-visual feature pairs in making the final emotion recognition decision. For performance evaluation, two databases are considered: the posed MHMC database and the spontaneous SEMAINE database. Experimental results show that the proposed method not only outperforms other fusion-based bimodal emotion recognition methods for posed expressions but also provides acceptable results for spontaneous expressions.

A complete emotional expression typically follows a complex temporal course in face-to-face natural conversation. This dissertation therefore further explores the temporal evolution of an emotional expression for audio-visual emotion recognition. Previous psychological research showed that a complete emotional expression can be characterized by three sequential temporal phases, onset (application), apex (release), and offset (relaxation), when considering the manner and intensity of expression. However, in natural conversation a complete emotional expression is conveyed by more than one utterance, and, in more detail, each utterance may contain several temporal phases of emotional expression. Accordingly, this dissertation further presents a novel data fusion method based on a temporal course modeling scheme, named Two-Level Hierarchical Alignment-Based Semi-Coupled Hidden Markov Model (2H-SC-HMM), to handle the complex temporal structure of an emotional expression and to account for the temporal relationship between the audio and visual streams, thereby increasing the performance of audio-visual emotion recognition in a conversational utterance. Finally, the experimental results demonstrate that the proposed 2H-SC-HMM substantially improves the performance of audio-visual emotion recognition.
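
As a rough illustration of the decision-level fusion idea mentioned in the abstract, the following minimal Python sketch combines the class posteriors of several classifiers using weights derived from their validation error rates. It is a simplified stand-in, not the EWC scheme of the dissertation: the weighting rule (normalized accuracy), the emotion labels, the function names `error_weights` and `fuse_decisions`, and all numbers are assumptions made up for this example.

```python
def error_weights(error_rates):
    """Turn per-classifier validation error rates into normalized weights.

    Assumption for this sketch: weight is proportional to accuracy (1 - error).
    The dissertation's EWC uses a Bayesian weighting scheme instead.
    """
    accuracies = [1.0 - e for e in error_rates]
    total = sum(accuracies)
    return [a / total for a in accuracies]


def fuse_decisions(posteriors, weights):
    """Return the weighted sum of per-classifier class posteriors."""
    n_classes = len(posteriors[0])
    fused = [0.0] * n_classes
    for probs, w in zip(posteriors, weights):
        for c, p in enumerate(probs):
            fused[c] += w * p
    return fused


# Hypothetical emotion labels and classifier outputs (illustrative only).
EMOTIONS = ["happy", "angry", "sad", "neutral"]

# One posterior vector per classifier, e.g. one per audio-visual feature pair.
posteriors = [
    [0.50, 0.30, 0.10, 0.10],
    [0.20, 0.60, 0.10, 0.10],
    [0.40, 0.35, 0.15, 0.10],
]
error_rates = [0.20, 0.40, 0.30]  # hypothetical validation error rates

weights = error_weights(error_rates)
fused = fuse_decisions(posteriors, weights)

# Pick the emotion with the highest fused score.
predicted = EMOTIONS[max(range(len(fused)), key=fused.__getitem__)]
print(predicted, [round(p, 3) for p in fused])
```

In this sketch a classifier with a lower validation error contributes more to the final decision, which is the general intuition behind error-weighted combination; the temporal alignment performed by SC-HMM is not modeled here.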
Date of Award: 2014 Apr 28
Original language: English
Supervisor: Chung-Hsien Wu (Supervisor)
