With the advancement of voice-processing technologies, voice-interactive software and products have become increasingly popular. For multi-speaker dialogue audio, speaker change point detection is needed as a pre-processing step before further analysis. Most previous research on speaker change point detection relies on acoustic features alone. The method proposed in this thesis supplements the speaker information from the perspective of articulatory features: differences in pronunciation characteristics are exploited to improve the accuracy of speaker change point detection and achieve a complementary effect.

In this thesis, a convolutional neural network is used to train a speaker embedding model that extracts, from the acoustic features of speech, a vector representing speaker characteristics. In addition, an articulatory feature (AF) model is trained, and a multilayer perceptron network is used to extract an AF embedding from the speech features. Finally, using these two vectors, another multilayer perceptron network is trained as the speaker change detection model, which helps determine the exact position of a change point.

Two speech databases were utilized in this thesis. The first was the VoxCeleb2 database, a corpus widely used in the field of speaker identification. The second was the LibriSpeech corpus, widely used in the fields of speech and speaker recognition. Three models were trained: the speaker embedding model, the AF embedding model, and the speaker change detection model. The speaker embedding model was trained and its parameter settings evaluated on the VoxCeleb2 database; the AF embedding model was trained and evaluated on the LibriSpeech database; and the speaker change detection model was trained on a dialogue corpus composed from the VoxCeleb2 database.

In the speaker change detection task, the experimental results showed that the proposed method reduced the false alarm rate by 1.94% and increased accuracy by 1.1%, precision by 2.04%, and F1 score by 0.16%. These results indicate that the proposed method is superior to the traditional method and can be applied to products that require speaker change detection.
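The fusion step described above can be sketched as follows. This is a minimal, hypothetical illustration, not the thesis's actual implementation: the embedding dimensions, layer sizes, and randomly initialized weights are all assumptions standing in for the trained CNN speaker embedding model, the AF embedding model, and the trained multilayer perceptron change detector. It only shows how the two embedding vectors for a pair of adjacent analysis windows might be concatenated and scored by an MLP.

```python
import numpy as np

# Hypothetical dimensions -- the abstract does not specify them.
SPK_DIM = 256   # assumed speaker embedding size
AF_DIM = 64     # assumed articulatory-feature embedding size
HID_DIM = 128   # assumed hidden-layer size of the change-detection MLP

rng = np.random.default_rng(0)

def mlp_change_score(spk_left, af_left, spk_right, af_right, W1, b1, W2, b2):
    """Toy multilayer perceptron scoring whether a speaker change occurs
    between a left and a right analysis window. Each window is represented
    by the concatenation of its speaker embedding and its AF embedding."""
    x = np.concatenate([spk_left, af_left, spk_right, af_right])
    h = np.maximum(0.0, W1 @ x + b1)        # ReLU hidden layer
    logit = W2 @ h + b2                     # scalar change score
    return 1.0 / (1.0 + np.exp(-logit))     # sigmoid -> P(speaker change)

# Randomly initialized weights stand in for a trained model.
in_dim = 2 * (SPK_DIM + AF_DIM)
W1 = rng.standard_normal((HID_DIM, in_dim)) * 0.01
b1 = np.zeros(HID_DIM)
W2 = rng.standard_normal(HID_DIM) * 0.01
b2 = 0.0

# Random vectors stand in for embeddings extracted from two adjacent windows.
p = mlp_change_score(rng.standard_normal(SPK_DIM), rng.standard_normal(AF_DIM),
                     rng.standard_normal(SPK_DIM), rng.standard_normal(AF_DIM),
                     W1, b1, W2, b2)
print(f"P(speaker change) = {p:.3f}")
```

In practice, a probability like this would be computed for every candidate boundary and thresholded (or peak-picked) to locate the exact change points.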
Date of Award | 2019
Original language | English
Supervisor | Chung-Hsien Wu (Supervisor)
Speaker Change Detection using Speaker and Articulatory Feature Embeddings
?青, 黃. (Author). 2019
Student thesis: Doctoral Thesis