Audio-visual speech recognition method and device based on multiple modal fusion, equipment and storage medium

A speech recognition and multi-modal fusion technique, applied in speech recognition, character and pattern recognition, speech analysis, and related fields. It addresses problems such as incomplete information extraction from video frames, and it improves multi-modal fusion and recognition accuracy.

Active Publication Date: 2021-06-15

XI AN JIAOTONG UNIV

4 Cites, 6 Cited by

Problems solved by technology

[0005] The purpose of the present invention is to overcome the shortcomings of the prior art described above by providing an audio-visual speech recognition method, device, equipment and storage medium based on multiple modality fusion. The invention addresses two problems in the prior art: incomplete information extraction from video frames, and feature fusion.

Embodiment

[0103] As shown in Figure 5 and Figure 6, the dataset used in this experiment is the public LRS2 dataset, which consists of more than 37,000 sentences from BBC TV; no sentence exceeds 100 characters. The dataset mainly includes two types of files: video files and the corresponding text files. Speech recognition on this data is difficult because of the varying light intensity in the videos and the different speaking rates and accents of the speakers. To verify the effectiveness of the proposed method, this embodiment adds two kinds of noise (NOISA-A noise and NOISA-B noise) to the dataset at different signal-to-noise ratios, where the signal-to-noise ratio (SNR) can be expressed as:

[0104] $$\mathrm{SNR} = 10 \log_{10} \frac{\sum_t s^2(t)}{\sum_t n^2(t)}$$

[0105] Here, $\sum_t s^2(t)$ represents the energy of the clean speech without noise, and $\sum_t n^2(t)$ represents the noise energy.

[0106] In order to synthesize mixed speech at a specific signal-to-noise ratio, the present embodiment needs ...
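The source text is cut off at this point, so the patent's own mixing procedure is not shown here. As a hedged illustration only, one common way to reach a target SNR is to rescale the noise before adding it to the clean speech. The sketch below assumes the SNR is measured in decibels as ten times the base-10 logarithm of the ratio of speech energy to noise energy; the function names `energy`, `snr_db` and `mix_at_snr`, and the toy signals, are hypothetical and not taken from the patent.

```python
import math

def energy(signal):
    """Sum of squared samples, i.e. sum_t x^2(t)."""
    return sum(x * x for x in signal)

def snr_db(speech, noise):
    """SNR in decibels (assumed form): 10 * log10(speech energy / noise energy)."""
    return 10.0 * math.log10(energy(speech) / energy(noise))

def mix_at_snr(speech, noise, target_snr_db):
    """Scale the noise so that speech + scaled noise has the requested SNR."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    # Solve 10 * log10(E_s / (alpha^2 * E_n)) = target_snr_db for alpha.
    alpha = math.sqrt(
        energy(speech) / (energy(noise) * 10.0 ** (target_snr_db / 10.0))
    )
    scaled_noise = [alpha * n for n in noise]
    mixed = [s + n for s, n in zip(speech, scaled_noise)]
    return mixed, scaled_noise

# Toy signals standing in for real audio (hypothetical data).
speech = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
noise = [((37 * t) % 11 - 5) / 5.0 for t in range(100)]

mixed, scaled_noise = mix_at_snr(speech, noise, target_snr_db=5.0)
print(round(snr_db(speech, scaled_noise), 6))
```

Scaling the noise rather than the speech keeps the clean signal's level fixed across SNR conditions, which makes the conditions easier to compare; the patent may do this differently.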

Abstract

The invention discloses an audio-visual speech recognition method and device based on multiple modal fusion, as well as equipment and a storage medium. Compared with an ordinary RNN, the Skip RNN used in the audio-visual speech recognition sub-network alleviates slow inference, vanishing gradients, and difficulty in capturing long-term dependencies. The adopted TCN addresses incomplete feature extraction from video frames, and the adopted multi-modal fusion attention mechanism improves the fusion of the modalities, which in turn improves recognition accuracy.
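The abstract does not spell out how the multi-modal fusion attention mechanism is computed. As a rough, generic illustration of attention-weighted fusion (not the patent's specific mechanism), the sketch below scores an audio feature vector and a video feature vector against a query with scaled dot-product attention and returns their weighted sum. The names `attention_fuse`, `audio_feat`, `video_feat` and `query`, and all numeric values, are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attention_fuse(query, modality_features):
    """Weight each modality by scaled dot-product attention and sum them."""
    scale = math.sqrt(len(query))
    scores = [dot(query, f) / scale for f in modality_features]
    weights = softmax(scores)
    dim = len(modality_features[0])
    fused = [
        sum(w * f[i] for w, f in zip(weights, modality_features))
        for i in range(dim)
    ]
    return fused, weights

# Hypothetical per-modality features for one time step.
audio_feat = [0.9, 0.1, 0.0, 0.3]
video_feat = [0.2, 0.8, 0.5, 0.1]
query = [1.0, 0.0, 0.0, 0.0]

fused, weights = attention_fuse(query, [audio_feat, video_feat])
print(weights, fused)
```

In this toy setup the query aligns more closely with the audio features, so the audio modality receives the larger weight. In a trained model the query and features would come from learned layers, and under heavy acoustic noise the weights would be expected to shift toward the video stream.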

Description

【Technical field】

[0001] The invention belongs to the field of speech recognition and relates to an audio-visual speech recognition method, device, equipment and storage medium based on multiple modality fusion.

【Background art】

[0002] Speech recognition is a fundamental problem in artificial intelligence, natural language processing and signal processing, and it has advanced greatly during the deep learning boom of the past decade. Although recognition performance has improved substantially, speech signals fluctuate strongly under noise interference, and the performance of speech recognition algorithms in that setting remains unsatisfactory. How to improve the performance of speech recognition systems in noisy environments has become a hot issue in the field of natural language processing.

[0003] The goal of visual lip-reading technology and auditory speech recognition technology is to predict the text information corresp...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G10L15/183, G10L19/00, G10L25/18, G06N3/08, G06N3/04, G06K9/62

CPC: G10L15/183, G10L25/18, G06N3/049, G06N3/08, G06N3/047, G06F18/2415, G06F18/253

Inventor: 王志, 郭加伟, 余凡, 赵欣伟

Owner: XI AN JIAOTONG UNIV
