Mask-based Speech Enhancement Considering Speech Quality and Acoustic Confidence for Noisy Speech Recognition

  • 林 允文

Student thesis: Doctoral Thesis

Abstract

In recent years, the number of network-connected devices has risen rapidly, and many of them interact with people through automatic speech recognition (ASR). Voice operation has gradually been accepted by the public, but background noise makes ASR difficult, so it is important to improve recognition of noisy speech through speech enhancement. Although simply using mean square error (MSE) as the loss function can effectively enhance speech quality, there is still a gap between speech enhancement and speech recognition. Therefore, the main contribution of this thesis is to generate a mask that takes into account both speech quality and acoustic confidence for speech enhancement, in order to reduce the word error rate (WER) of noisy speech recognition. First, we extract speaker, phone, and noise features, and then use these features together with the noisy speech as inputs to produce enhanced speech with better speech quality. In addition, this study uses the phone confidence from Kaldi, the phone judgment trained with clean speech, the MSE, and the STOI and PESQ losses as the loss function to train the mask generation model. Compared with the baseline model, the proposed model successfully improves speech quality and reduces the WER in speech recognition. In the experiments, we used TIMIT as the speech data and NOISEX-92 as the noise data, and mixed speech and noise at signal-to-noise ratios (SNRs) of -10, -5, 0, 5, and 10 dB. Compared with the baseline model, multiplying the MSE, the phone judgment loss, and the STOI and PESQ losses not only improved STOI by 2.14% and PESQ by 7.22%, but also achieved the lowest WER of 21.59%, compared with 33.72% for the baseline model and 29.08% for noisy speech without enhancement. The experiments show that the proposed method greatly improves speech recognition results on noisy speech.
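The abstract states that speech and noise were mixed at SNRs of -10, -5, 0, 5, and 10 dB but does not describe the mixing procedure. The following is a minimal sketch of one common approach, assuming equal-length signals and an SNR defined on average power over the whole utterance. The function names `mix_at_snr` and `measured_snr_db` are hypothetical, and a synthetic sine tone and Gaussian noise stand in for TIMIT and NOISEX-92 audio.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech + scaled noise has the requested SNR (dB).

    Assumption: SNR is computed from average power over the full signal.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("signals must have non-zero power")
    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_noise)), solved for gain.
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


def measured_snr_db(speech, noise):
    """Return the power-based SNR in dB between two equal-length signals."""
    speech_energy = sum(s * s for s in speech)
    noise_energy = sum(n * n for n in noise)
    return 10 * math.log10(speech_energy / noise_energy)


# Synthetic stand-ins (assumptions): a 440 Hz tone as "speech", white noise as "noise".
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

# Mix at the five SNR levels listed in the abstract and verify each one.
measured = {}
for target in (-10, -5, 0, 5, 10):
    mixture, scaled = mix_at_snr(speech, noise, target)
    measured[target] = measured_snr_db(speech, scaled)
```

Here the gain is applied to the noise and the speech is left untouched, so the clean reference stays identical across SNR levels. Whether the thesis scaled the noise, the speech, or both is not stated.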
Date of Award: 2020
Original language: English
Supervisor: Chung-Hsien Wu
