Sound Event Recognition Using Auditory-Receptive-Field Binary Pattern and Hierarchical-Diving Deep Belief Network

Chien-Yao Wang, Jia-Ching Wang, Andri Santoso, Chin-Chin Chiang, Chung-Hsien Wu

Research output: Contribution to journal › Article

3 Citations (Scopus)

Abstract

Automatic sound event recognition (SER) has recently attracted renewed interest. Although a practical SER system has many useful applications in everyday life, SER is challenging owing to the variations among sounds and noises in real-world environments. This work presents a novel feature extraction and classification method for sound event recognition. An audio-visual descriptor, called the auditory-receptive-field binary pattern (ARFBP), is designed based on the spectrogram image feature (SIF), cepstral features, and the human auditory receptive field model. The extracted features are then fed into a classifier to perform event classification. The proposed classifier, called the hierarchical-diving deep belief network (HDDBN), is a deep neural network (DNN) that hierarchically learns discriminative characteristics, from physical feature representations up to abstract concepts. The performance of the proposed system was verified in several experiments under various conditions. On the RWCP sound scene dataset, the proposed system achieved a recognition rate of 99.27% on real-world sound data in 105 categories. The system is also robust under noisy conditions, achieving a 95.06% recognition rate at a 0 dB signal-to-noise ratio (SNR). On the TUT sound event dataset, the proposed system achieved error rates of 0.81 and 0.73 for sound event detection in home and residential-area scenes, respectively. The experimental results reveal that the proposed system outperformed the other systems in this field.
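The ARFBP descriptor itself is defined in the paper; as a rough illustration of the general idea — binary patterns computed over a spectrogram image — here is a minimal NumPy sketch using a plain 8-neighbour local binary pattern. This is a simplified stand-in, not the authors' auditory-receptive-field variant: the `spectrogram`, `local_binary_pattern`, and `lbp_histogram` helpers and the 440 Hz toy signal are all illustrative assumptions.

```python
import numpy as np

def spectrogram(signal, win=256, hop=128):
    """Magnitude spectrogram via a short-time FFT with a Hann window."""
    window = np.hanning(win)
    frames = [signal[i:i + win] * window
              for i in range(0, len(signal) - win + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (freq, time)

def local_binary_pattern(img):
    """8-bit LBP code for each interior pixel of a 2-D array."""
    h, w = img.shape
    center = img[1:-1, 1:-1]
    # neighbour offsets, clockwise from top-left
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(center, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        code |= (neighbour >= center).astype(np.uint8) << bit
    return code

def lbp_histogram(signal):
    """256-bin normalized histogram of LBP codes on the log-spectrogram."""
    spec = np.log1p(spectrogram(signal))
    codes = local_binary_pattern(spec)
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()

# Toy example: a 440 Hz tone sampled at 8 kHz.
t = np.arange(8000) / 8000.0
feat = lbp_histogram(np.sin(2 * np.pi * 440 * t))
```

The resulting fixed-length histogram is the kind of texture descriptor a classifier such as the paper's HDDBN could consume; the actual ARFBP additionally incorporates cepstral features and an auditory receptive field model.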

Original language: English
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
DOI: 10.1109/TASLP.2017.2738443
Publication status: Accepted/In press - 2017 Aug 10

Fingerprint

Belief networks
Bayesian networks
Receptive field
Diving
Acoustic waves
Acoustics
Binary
Event detection
Classifiers
Residential areas
Spectrograms
Sound
Signal-to-noise ratio
Neural networks

All Science Journal Classification (ASJC) codes

  • Signal Processing
  • Media Technology
  • Instrumentation
  • Acoustics and Ultrasonics
  • Linguistics and Language
  • Electrical and Electronic Engineering
  • Speech and Hearing

Cite this

@article{7984b6fb2402432ba58a39fea5383f0d,
title = "Sound Event Recognition Using Auditory-Receptive-Field Binary Pattern and Hierarchical-Diving Deep Belief Network",
abstract = "Automatic sound event recognition (SER) has recently attracted renewed interest. Although practical SER system has many useful applications in everyday life, SER is challenging owing to the variations among sounds and noises in the real-world environment. This work presents a novel feature extraction and classification method to solve the problem of sound event recognition. An audio-visual descriptor, called the auditory-receptive-field binary pattern (ARFBP), is designed based on the spectrogram image feature (SIF), the cepstral features, and the human auditory receptive field model. The extracted features are then fed into a classifier to perform event classification. The proposed classifier, called the hierarchical-diving deep belief network (HDDBN), is a deep neural network (DNN) system that hierarchically learns the discriminative characteristics from physical feature representation to the abstract concept. The performance of our proposed system was verified using several experiments under various conditions. Using the RWCP sound scene dataset, the proposed system achieved a recognition rate of 99.27\% for real-world sound data in 105 categories. Under noisy conditions, the developed system is very robust, with which it achieved 95.06\% recognition rate with 0 dB signal-to-noise ratio (SNR). Using the TUT sound event dataset, the proposed system achieves an error of 0.81 and 0.73 in sound event detection in home and residential area scenes. The experimental results reveal that the proposed system outperformed the other systems in this field.",
author = "Wang, {Chien-Yao} and Wang, {Jia-Ching} and Andri Santoso and Chiang, {Chin-Chin} and Chung-Hsien Wu",
year = "2017",
month = "8",
day = "10",
doi = "10.1109/TASLP.2017.2738443",
language = "English",
journal = "IEEE/ACM Transactions on Audio, Speech, and Language Processing",
issn = "2329-9290",
publisher = "IEEE Advancing Technology for Humanity",

}

Sound Event Recognition Using Auditory-Receptive-Field Binary Pattern and Hierarchical-Diving Deep Belief Network. / Wang, Chien-Yao; Wang, Jia-Ching; Santoso, Andri; Chiang, Chin-Chin; Wu, Chung-Hsien.

In: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 10.08.2017.

Research output: Contribution to journal › Article

TY - JOUR

T1 - Sound Event Recognition Using Auditory-Receptive-Field Binary Pattern and Hierarchical-Diving Deep Belief Network

AU - Wang, Chien-Yao

AU - Wang, Jia-Ching

AU - Santoso, Andri

AU - Chiang, Chin-Chin

AU - Wu, Chung-Hsien

PY - 2017/8/10

Y1 - 2017/8/10

N2 - Automatic sound event recognition (SER) has recently attracted renewed interest. Although practical SER system has many useful applications in everyday life, SER is challenging owing to the variations among sounds and noises in the real-world environment. This work presents a novel feature extraction and classification method to solve the problem of sound event recognition. An audio-visual descriptor, called the auditory-receptive-field binary pattern (ARFBP), is designed based on the spectrogram image feature (SIF), the cepstral features, and the human auditory receptive field model. The extracted features are then fed into a classifier to perform event classification. The proposed classifier, called the hierarchical-diving deep belief network (HDDBN), is a deep neural network (DNN) system that hierarchically learns the discriminative characteristics from physical feature representation to the abstract concept. The performance of our proposed system was verified using several experiments under various conditions. Using the RWCP sound scene dataset, the proposed system achieved a recognition rate of 99.27\% for real-world sound data in 105 categories. Under noisy conditions, the developed system is very robust, with which it achieved 95.06\% recognition rate with 0 dB signal-to-noise ratio (SNR). Using the TUT sound event dataset, the proposed system achieves an error of 0.81 and 0.73 in sound event detection in home and residential area scenes. The experimental results reveal that the proposed system outperformed the other systems in this field.

AB - Automatic sound event recognition (SER) has recently attracted renewed interest. Although practical SER system has many useful applications in everyday life, SER is challenging owing to the variations among sounds and noises in the real-world environment. This work presents a novel feature extraction and classification method to solve the problem of sound event recognition. An audio-visual descriptor, called the auditory-receptive-field binary pattern (ARFBP), is designed based on the spectrogram image feature (SIF), the cepstral features, and the human auditory receptive field model. The extracted features are then fed into a classifier to perform event classification. The proposed classifier, called the hierarchical-diving deep belief network (HDDBN), is a deep neural network (DNN) system that hierarchically learns the discriminative characteristics from physical feature representation to the abstract concept. The performance of our proposed system was verified using several experiments under various conditions. Using the RWCP sound scene dataset, the proposed system achieved a recognition rate of 99.27\% for real-world sound data in 105 categories. Under noisy conditions, the developed system is very robust, with which it achieved 95.06\% recognition rate with 0 dB signal-to-noise ratio (SNR). Using the TUT sound event dataset, the proposed system achieves an error of 0.81 and 0.73 in sound event detection in home and residential area scenes. The experimental results reveal that the proposed system outperformed the other systems in this field.

UR - http://www.scopus.com/inward/record.url?scp=85042851663&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85042851663&partnerID=8YFLogxK

U2 - 10.1109/TASLP.2017.2738443

DO - 10.1109/TASLP.2017.2738443

M3 - Article

JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing

JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing

SN - 2329-9290

ER -