Audio-visual speech recognition method and device based on multiple modal fusion, equipment and storage medium

A speech recognition and multi-modal technology, applied in speech recognition, character and pattern recognition, speech analysis, etc., that addresses problems such as incomplete information extraction from video frames, improving multi-modal fusion and recognition accuracy.

Active Publication Date: 2021-06-15
XI AN JIAOTONG UNIV
4 Cites, 6 Cited by

AI Technical Summary

Problems solved by technology

[0005] The purpose of the present invention is to overcome the shortcomings of the above-mentioned prior art by providing an audio-visual speech recognition method, device, equipment and storage medium based on multiple modality fusion, solving the prior art's problems of incomplete information extraction from video frames and of feature fusion.

Examples

Embodiment

[0103] As shown in Figure 5 and Figure 6, the dataset used in this experiment is the public LRS2 dataset, which consists of more than 37,000 sentences from BBC TV, each no longer than 100 characters. The dataset mainly includes two types of files: video files and the corresponding text files. Speech recognition on this data is very difficult because of the varying light intensity in the videos, as well as the different speaking rates and accents of the speakers. To verify the effectiveness of the proposed method, in this embodiment two kinds of noise (NOISA-A noise and NOISA-B noise) are added to the dataset at different signal-to-noise ratios, where the signal-to-noise ratio SNR can be expressed as:

[0104] $$\mathrm{SNR} = 10 \log_{10} \frac{\sum_t s^2(t)}{\sum_t n^2(t)}$$

[0105] Here, $\sum_t s^2(t)$ represents the energy of the clean speech without noise, and $\sum_t n^2(t)$ represents the noise energy.

[0106] To synthesize mixed speech at a specific signal-to-noise ratio, the present embodiment needs ...
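The source text is truncated here, so the embodiment's exact mixing procedure is not available. As a minimal sketch of one common approach, assuming SNR is measured in decibels as the ratio of clean speech energy to noise energy, the noise can be scaled by a gain factor before it is added to the speech. The function names `snr_db` and `mix_at_snr` and the toy signals are hypothetical and chosen only for illustration.

```python
import math


def snr_db(speech, noise):
    """Return the SNR in dB: 10 * log10(sum s^2(t) / sum n^2(t))."""
    speech_energy = sum(s * s for s in speech)
    noise_energy = sum(n * n for n in noise)
    if speech_energy == 0 or noise_energy == 0:
        raise ValueError("speech and noise must both have non-zero energy")
    return 10.0 * math.log10(speech_energy / noise_energy)


def mix_at_snr(speech, noise, target_snr_db):
    """Scale the noise so the mixture has the target SNR, then add it.

    Assumption: this is a generic recipe, not the procedure from the
    truncated paragraph. Returns (mixed_signal, scaled_noise).
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_energy = sum(s * s for s in speech)
    noise_energy = sum(n * n for n in noise)
    if speech_energy == 0 or noise_energy == 0:
        raise ValueError("speech and noise must both have non-zero energy")
    # Solve 10*log10(Es / (gain^2 * En)) = target for the gain.
    gain = math.sqrt(speech_energy / (noise_energy * 10.0 ** (target_snr_db / 10.0)))
    scaled_noise = [gain * n for n in noise]
    mixed = [s + n for s, n in zip(speech, scaled_noise)]
    return mixed, scaled_noise


# Toy signals (hypothetical): a sine wave as "speech", a deterministic
# pseudo-noise sequence as "noise".
speech = [math.sin(0.1 * t) for t in range(1000)]
noise = [((t * 7919) % 13 - 6) / 6.0 for t in range(1000)]

mixed, scaled_noise = mix_at_snr(speech, noise, target_snr_db=5.0)
achieved = snr_db(speech, scaled_noise)
```

Because the gain is solved directly from the energy ratio, `achieved` matches the requested SNR up to floating-point error, regardless of the original noise level.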

Abstract

The invention discloses an audio-visual speech recognition method and device based on multiple modal fusion, equipment and a storage medium. Compared with a common RNN, the Skip RNN used in the audio-visual speech recognition sub-network mitigates problems such as slow inference, vanishing gradients and difficulty in capturing long-term dependencies. The adopted TCN addresses incomplete feature extraction from video frames, the adopted multi-modal fusion attention mechanism improves multi-modal fusion, and multi-modal fusion is used to improve recognition accuracy.

Description

【Technical field】

[0001] The invention belongs to the field of speech recognition, and relates to an audio-visual speech recognition method, device, equipment and storage medium based on multiple modality fusion.

【Background technique】

[0002] Speech recognition is a basic problem in artificial intelligence, natural language processing and signal processing, and it has developed greatly during the deep learning boom of the past decade. The performance of speech recognition has improved substantially, but under noise interference the speech signal fluctuates strongly and the performance of speech recognition algorithms is unsatisfactory. How to improve the performance of speech recognition systems in noisy environments has become a hot issue in the field of natural language processing.

[0003] The goal of visual lip language recognition technology and auditory speech recognition technology is to predict the text information corresp...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G10L15/183, G10L19/00, G10L25/18, G06N3/08, G06N3/04, G06K9/62
CPC: G10L15/183, G10L25/18, G06N3/049, G06N3/08, G06N3/047, G06F18/2415, G06F18/253
Inventor: 王志, 郭加伟, 余凡, 赵欣伟
Owner: XI AN JIAOTONG UNIV