TY - JOUR
T1 - Toward semantic indexing and retrieval using hierarchical audio models
AU - Chu, Wei Ta
AU - Cheng, Wen Huang
AU - Hsu, Jane Yung Jen
AU - Wu, Ja Ling
PY - 2005/12/1
Y1 - 2005/12/1
N2 - Semantic-level content analysis is a crucial issue in achieving efficient content retrieval and management. We propose a hierarchical approach that models the statistical characteristics of audio events over a time series to accomplish semantic context detection. Two stages, audio event and semantic context modeling, are devised to bridge the semantic gap between physical audio features and semantic concepts. In this work, hidden Markov models (HMMs) are used to model four representative audio events, i.e., gunshot, explosion, engine, and car-braking, in action movies. At the semantic-context level, Gaussian mixture models (GMMs) and ergodic HMMs are investigated to fuse the characteristics and correlations between various audio events. They provide cues for detecting gunplay and car-chasing scenes, two semantic contexts we focus on in this work. The promising experimental results demonstrate the effectiveness of the proposed approach and exhibit that the proposed framework provides a foundation in semantic indexing and retrieval. Moreover, the two fusion schemes are compared, and the relations between audio event and semantic context are studied.
AB - Semantic-level content analysis is a crucial issue in achieving efficient content retrieval and management. We propose a hierarchical approach that models the statistical characteristics of audio events over a time series to accomplish semantic context detection. Two stages, audio event and semantic context modeling, are devised to bridge the semantic gap between physical audio features and semantic concepts. In this work, hidden Markov models (HMMs) are used to model four representative audio events, i.e., gunshot, explosion, engine, and car-braking, in action movies. At the semantic-context level, Gaussian mixture models (GMMs) and ergodic HMMs are investigated to fuse the characteristics and correlations between various audio events. They provide cues for detecting gunplay and car-chasing scenes, two semantic contexts we focus on in this work. The promising experimental results demonstrate the effectiveness of the proposed approach and exhibit that the proposed framework provides a foundation in semantic indexing and retrieval. Moreover, the two fusion schemes are compared, and the relations between audio event and semantic context are studied.
UR - http://www.scopus.com/inward/record.url?scp=33845633986&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33845633986&partnerID=8YFLogxK
U2 - 10.1007/s00530-005-0183-6
DO - 10.1007/s00530-005-0183-6
M3 - Article
AN - SCOPUS:33845633986
VL - 10
SP - 570
EP - 583
JO - Multimedia Systems
JF - Multimedia Systems
SN - 0942-4962
IS - 6
ER -