Because people express emotions in diverse ways in real-world scenarios, recognizing human emotions is difficult unless sufficient and varied data are collected for model training. Emotion recognition from noisy data is a further challenge. This work proposes a fusion strategy to alleviate the problems of noisy and sparse data in bimodal emotion recognition. Toward robust bimodal emotion recognition, a Semi-Coupled Hidden Markov Model (SC-HMM) based on a state-based bimodal alignment strategy is proposed to align the temporal relation between the states of two component HMMs, one for the audio stream and one for the visual stream. With this strategy, the SC-HMM can reduce over-fitting and better capture the statistical dependency between the states of the audio and visual HMMs under sparse data conditions, and it is also better able to accommodate noisy conditions. Experiments show promising results for the proposed approach.