Multi-mode speech recognition method based on deep neural network

Applies deep neural network and speech recognition technology to speech recognition and speech analysis. It addresses the poor recognition of Chinese continuous speech, with the effect of reducing the acoustic model's error rate in word and sentence recognition.

Inactive Publication Date: 2019-08-09

TIANJIN UNIV

17 Cites, 19 Cited by

Problems solved by technology

[0004] The invention provides a multi-modal speech recognition method based on a deep neural network. To address the poor recognition of Chinese continuous speech in noisy environments, the invention uses visual information to supplement the speech information through multi-modal feature fusion, builds the acoustic model with a DNN-HMM (deep neural network-hidden Markov model), and runs decoding experiments on a Chinese corpus recorded in the laboratory. This reduces the acoustic model's recognition error rate for words and sentences, as described below:

Examples

Embodiment 1

[0063] Once the Kaldi experimental platform, the database, and the automatic speech recognition system have been set up, the acoustic model is trained as follows:

[0064] 1. GMM-HMM model training

[0065] The main idea in training the GMM-HMM acoustic model is to use the hidden Markov model to capture the temporal characteristics of the speech signal, and then to compute the emission probability of each model state with a Gaussian mixture model.
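The emission-probability step can be sketched as follows: a minimal illustration of how a diagonal-covariance Gaussian mixture scores one feature frame for a single HMM state. The function name `gmm_log_likelihood` and the diagonal-covariance choice are assumptions made for illustration; the patent text does not specify them, and this is not Kaldi's implementation.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log emission probability log p(x | state) under a diagonal-covariance GMM.

    x:         (D,)   one feature frame
    weights:   (M,)   mixture weights, summing to 1
    means:     (M, D) component means
    variances: (M, D) per-dimension variances (diagonal covariances)
    """
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)

    diff = x - means                                   # (M, D)
    # Log density of each Gaussian component at x.
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_exp = -0.5 * np.sum(diff ** 2 / variances, axis=1)
    log_comp = np.log(weights) + log_norm + log_exp    # (M,)

    # Log-sum-exp over components for numerical stability.
    peak = np.max(log_comp)
    return peak + np.log(np.sum(np.exp(log_comp - peak)))
```

In a GMM-HMM, each state would hold its own set of weights, means, and variances, and a score like this would be evaluated per state and per frame during alignment and decoding.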

[0066] As shown in Figure 3, the input time-series features are first normalized with CMVN (cepstral mean and variance normalization) to reduce the differences caused by individual speaker characteristics. A monophone Gaussian model is then trained on the normalized features, and the sentences in the training data are force-aligned using this monophone model and the Viterbi algorithm to obtain phoneme segmentation information. Finally, use the obtained segmenta...
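As a rough illustration of the CMVN step, the sketch below normalizes each feature dimension to zero mean and unit variance. It assumes statistics computed over a single utterance; the visible text does not say whether the statistics are per utterance or per speaker. The function name `cmvn` and the `eps` guard are illustrative, not taken from Kaldi.

```python
import numpy as np

def cmvn(features, eps=1e-10):
    """Cepstral mean and variance normalization over one utterance.

    features: (T, D) array with T frames of D-dimensional features.
    Returns an array of the same shape with zero mean and unit
    variance in every feature dimension.
    """
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    # eps (an assumed guard value) avoids division by zero for constant dimensions.
    return (features - mean) / (std + eps)
```

For per-speaker normalization, the same statistics would be accumulated over all utterances of a speaker before being applied.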

Embodiment 2

[0115] In the DNN-HMM shown in Figure 4, a single DNN estimates the posterior probability p(q_t = s | x_t) for every state s ∈ [1, S], whereas the traditional GMM framework needs multiple different GMMs to model the different states. In addition, the input to the deep neural network is no longer the feature of a single audio frame but the combined features of several consecutive frames, so that the network can make effective use of the temporal information carried by adjacent frames.
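The multi-frame input described above can be sketched as a frame-splicing step. The context width of 5 frames on each side and the edge-padding strategy are assumptions chosen for illustration; the visible text does not state either, and `splice_frames` is a hypothetical helper name.

```python
import numpy as np

def splice_frames(features, context=5):
    """Stack each frame with `context` neighbours on both sides.

    features: (T, D) array of per-frame features.
    Returns a (T, (2 * context + 1) * D) array. The first and last
    frames are repeated at the edges so every frame has full context.
    """
    features = np.asarray(features, dtype=float)
    num_frames = features.shape[0]
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    # Slice i holds, for every frame t, the frame at offset (i - context).
    shifted = [padded[i:i + num_frames] for i in range(2 * context + 1)]
    return np.concatenate(shifted, axis=1)
```

With 13-dimensional features and `context=5`, each DNN input vector would have 11 × 13 = 143 dimensions.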

[0116] Information fusion is divided into three levels: data fusion, feature fusion, and decision fusion, as shown in Figure 5.

[0117] As shown in Figure 6, the color images collected in the experiment have a resolution of 1920×1080 and a sampling rate of 30 frames per second. The depth images have a resolution of 512×424, also at 30 frames per second. Each image frame is time-stamped so that it can be aligned with the annotations.
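One plausible use of the per-frame time stamps is to pair each audio analysis frame with the image frame closest to it in time. The sketch below shows that nearest-timestamp lookup; the 10 ms audio frame shift in the usage note and the helper name `align_video_to_audio` are assumptions, not details given in the text.

```python
import numpy as np

def align_video_to_audio(video_times, audio_times):
    """For each audio frame time, return the index of the nearest video frame.

    video_times: (V,) sorted timestamps of the image frames, V >= 2.
    audio_times: (A,) timestamps of the audio analysis frames.
    """
    video_times = np.asarray(video_times, dtype=float)
    audio_times = np.asarray(audio_times, dtype=float)
    # Position where each audio time would be inserted in the video timeline.
    right = np.searchsorted(video_times, audio_times)
    right = np.clip(right, 1, len(video_times) - 1)
    left = right - 1
    # Pick whichever neighbour is closer in time (ties go to the earlier frame).
    right_is_closer = (video_times[right] - audio_times) < (audio_times - video_times[left])
    return np.where(right_is_closer, right, left)
```

For example, with image frames at 30 frames per second and an assumed audio frame shift of 10 ms, `align_video_to_audio(np.arange(30) / 30.0, np.arange(100) * 0.01)` yields one image index per audio frame, so that each image feature is repeated for roughly three audio frames.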

[0118] Such as im...

Abstract

The invention discloses a multi-modal speech recognition method based on a deep neural network. The method comprises the following steps: building a sentence-level corpus text based on Chinese phonemes, and recording multi-modal data comprising color images, depth images, depth data, and audio information; acquiring the lip images and the audio signal of a speaker during pronunciation, windowing and framing the lip images, applying DCT and PCA dimension reduction to the images, and selecting image features of an appropriate dimension to splice with the MFCC features of the audio, forming a new multi-modal audio feature; and building a Chinese automatic speech recognition system, using a deep neural network-hidden Markov model to build the acoustic model, taking the spliced multi-modal speech features as input, and performing training and test decoding. According to the invention, the recognition error rate of the acoustic model for words and sentences is reduced.
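The visual feature pipeline in the abstract (DCT, PCA dimension reduction, then splicing with MFCC features) can be sketched roughly as below. This is an illustration under stated assumptions, not the patented implementation: the 8×8 low-frequency block, the helper names, and the requirement that both streams already have the same number of frames are all assumptions.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of shape (n, n)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] /= np.sqrt(2.0)
    return mat

def dct2(image):
    """2-D DCT-II of a grayscale image of shape (H, W)."""
    height, width = image.shape
    return dct_matrix(height) @ image @ dct_matrix(width).T

def lip_dct_features(images, block=8):
    """Keep the top-left block x block low-frequency coefficients per image."""
    return np.stack([dct2(img)[:block, :block].ravel() for img in images])

def pca_reduce(x, n_components):
    """Project the rows of x onto the top principal components."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def fuse_features(mfcc, visual):
    """Feature-level fusion: concatenate audio and visual features per frame."""
    if len(mfcc) != len(visual):
        raise ValueError("audio and visual streams must have the same number of frames")
    return np.hstack([mfcc, visual])
```

As a usage sketch, 20 lip images of 32×32 pixels reduced to 10 PCA dimensions and spliced with 13-dimensional MFCC frames would give a fused feature matrix of shape (20, 23). In practice the PCA projection would be estimated on the training set and then reused for test data.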

Description

Technical field

[0001] The invention relates to the fields of speech recognition, acoustic modeling and deep learning, and in particular to a multi-modal speech recognition method based on a deep neural network.

Background

[0002] Speech is the most natural way for humans to interact with computers, and this unique advantage has made speech recognition a popular research field. Applications such as in-car voice navigation and voice assistants on smartphones have shown its considerable practical value and future prospects.

[0003] However, compared with human perceptual and auditory abilities, speech recognition technology still has many shortcomings in recognition accuracy and in the robustness of its overall performance, particularly with respect to noise interference.

Contents of the invention

[0004] The invention provides a multi-modal speech recognition method based on a deep neural network. Aiming at the problem of poor recognition effect of Chines...

Application Information

IPC(8): G10L15/22, G10L15/25, G10L15/06, G10L25/30

CPC: G10L15/063, G10L15/22, G10L15/25, G10L25/30

Inventor: 喻梅, 程旻, 余童, 高洁, 刘志强, 徐天一, 于瑞国, 李雪威, 胡晓凯

Owner: TIANJIN UNIV
