Multi-mode speech recognition method based on deep neural network
A multi-mode speech recognition method based on a deep neural network, applied in speech recognition, speech analysis, instruments, and related fields. It addresses the poor recognition performance of Chinese continuous speech recognition, with the effect of reducing the error rate of acoustic models in word and sentence recognition.
Examples
Embodiment 1
[0063] Once the Kaldi experimental platform, the database, and the automatic speech recognition system have been set up, the acoustic model is trained as follows:
[0064] 1. GMM-HMM model training
[0065] The main idea in training the GMM-HMM acoustic model is to use a hidden Markov model (HMM) to model the temporal characteristics of the speech signal, and then to compute the emission probability of each model state with a Gaussian mixture model (GMM).
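As a rough illustration of the emission probability mentioned above, the following sketch evaluates the log-likelihood of one feature vector under a GMM attached to a single HMM state. It assumes diagonal covariances, and the function name and argument layout are hypothetical rather than taken from the patent or from Kaldi.

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """Log emission probability log p(x | state) for a diagonal-covariance GMM.

    x:         feature vector (list of floats)
    weights:   mixture weights, one per component (assumed to sum to 1)
    means:     list of mean vectors, one per component
    variances: list of per-dimension variance vectors, one per component
    """
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        # Log density of one diagonal Gaussian, summed over dimensions.
        log_gauss = 0.0
        for xi, mi, vi in zip(x, mu, var):
            log_gauss += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        component_logs.append(math.log(w) + log_gauss)
    # Log-sum-exp over components for numerical stability.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))
```

Working in the log domain avoids underflow when the per-frame likelihoods are later accumulated along an HMM state path.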
[0066] As shown in Figure 3, the input time-series features are first normalized, that is, CMVN (Cepstral Mean and Variance Normalization) is applied to reduce the differences caused by individual speaker characteristics. A monophone Gaussian model is then trained on the normalized features, and the sentences in the training data are force-aligned using the monophone model and the Viterbi algorithm to obtain phoneme segmentation information. Finally, use the obtained segmenta...
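The CMVN step described above can be sketched as follows. This is a minimal illustration that assumes statistics are computed per utterance over a list of equal-length feature frames; the statistics could equally be pooled per speaker. The function name and the `eps` floor are assumptions introduced for the example.

```python
import math

def cmvn(frames, eps=1e-10):
    """Normalize each feature dimension to zero mean and unit variance.

    frames: list of feature vectors (one list of floats per frame).
    eps:    floor on the standard deviation to avoid division by zero
            (an assumption of this sketch).
    """
    n = len(frames)
    dim = len(frames[0])
    # Per-dimension mean over all frames of the utterance.
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Per-dimension population standard deviation.
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return [
        [(f[d] - means[d]) / max(stds[d], eps) for d in range(dim)]
        for f in frames
    ]
```

For example, `cmvn([[1.0, 10.0], [3.0, 30.0]])` maps both dimensions onto the same scale, so that differences in overall level between the two dimensions no longer dominate.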
Embodiment 2
[0115] In the DNN-HMM shown in Figure 4, a single deep neural network estimates the posterior probability p(q_t = s | x_t) for every state s ∈ [1, S]. Under the traditional GMM framework, by contrast, multiple different GMMs are needed to model the different states. In addition, the input to the deep neural network is no longer the feature of a single audio frame but the combined feature of several consecutive frames, so that information from adjacent frames captures the temporal structure of speech and can be used effectively.
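The multi-frame input described above is commonly built by splicing each frame with its neighbors. The sketch below shows one way to do this. The context width of 5 frames on each side and the edge handling (repeating the first or last frame) are assumptions, since the text does not state them.

```python
def splice_frames(frames, left=5, right=5):
    """Concatenate each frame with its left and right context frames.

    frames: list of feature vectors (one list of floats per frame).
    left, right: number of context frames on each side (assumed values).
    Frames beyond the utterance boundary are replaced by the nearest
    valid frame.
    """
    n = len(frames)
    spliced = []
    for t in range(n):
        window = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp to valid range
            window.extend(frames[idx])
        spliced.append(window)
    return spliced
```

With one-dimensional frames `[[1], [2], [3]]` and `left=1, right=1`, each output vector has three entries: the previous, current, and next frame values.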
[0116] Information fusion is divided into three levels: data fusion, feature fusion, and decision fusion, as shown in Figure 5.
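To make the difference between two of these levels concrete, the sketch below contrasts feature fusion and decision fusion for an audio stream and a visual stream. It is only an illustration under stated assumptions: the text does not specify how the fusion is carried out, so concatenation, the weighted average, and the `audio_weight` parameter are all hypothetical choices.

```python
def feature_fusion(audio_feat, visual_feat):
    """Feature-level fusion: join the per-frame feature vectors of both
    modalities into one vector that a single model consumes."""
    return list(audio_feat) + list(visual_feat)

def decision_fusion(audio_post, visual_post, audio_weight=0.7):
    """Decision-level fusion: combine the class posteriors produced by two
    separate models with a weighted average, then renormalize.

    audio_weight is an assumed tuning parameter in [0, 1].
    """
    fused = [
        audio_weight * a + (1.0 - audio_weight) * v
        for a, v in zip(audio_post, visual_post)
    ]
    total = sum(fused)
    return [p / total for p in fused]
```

Feature fusion lets one model learn cross-modal interactions, whereas decision fusion keeps the per-modality models independent and only merges their outputs.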
[0117] As shown in Figure 6, the color images collected in the experiment have a resolution of 1920×1080 and are sampled at 30 frames per second. The depth images have a resolution of 512×424 and are also sampled at 30 frames per second. Each image frame is time-stamped so that annotations can be aligned.
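One way the time stamps might be used is to pair each frame of one stream with the closest frame of another. The sketch below assumes timestamps are given in seconds as sorted, non-empty lists; the function name is hypothetical and the nearest-neighbor rule is an assumption, not a procedure stated in the text.

```python
from bisect import bisect_left

def nearest_frame_indices(reference_ts, other_ts):
    """For each timestamp in reference_ts, return the index of the closest
    timestamp in other_ts.

    other_ts must be sorted in ascending order and non-empty.
    """
    matches = []
    for t in reference_ts:
        pos = bisect_left(other_ts, t)
        # Only the neighbors on either side of the insertion point can be closest.
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(other_ts)]
        matches.append(min(candidates, key=lambda i: abs(other_ts[i] - t)))
    return matches
```

Because both streams run at 30 frames per second, matched frames would normally differ by well under one frame period (about 33 ms).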
[0118] Such as im...