Audio-visual speech recognition method, device, equipment and storage medium based on multimodal fusion
A speech recognition and multimodal technology, applied in the fields of speech recognition, character and pattern recognition, and speech analysis. It addresses problems such as incomplete information extraction from video frames, and achieves better multimodal fusion and higher recognition accuracy.
Embodiment
[0103] As shown in Figures 5 and 6, the dataset used in this experiment is the public LRS2 dataset, which consists of more than 37,000 sentences from BBC television, each no longer than 100 characters. The dataset mainly contains two types of files: video files and the corresponding text files. Speech recognition on this data is difficult because lighting conditions vary across the videos and speakers differ in speaking rate and accent. To verify the effectiveness of the proposed method, this embodiment adds two kinds of noise (NOISA-A noise and NOISA-B noise) to the dataset at different signal-to-noise ratios, where the signal-to-noise ratio (SNR) can be expressed as:
[0104] $$\mathrm{SNR} = 10 \log_{10} \frac{\sum_t s^2(t)}{\sum_t n^2(t)}$$
[0105] where $\sum_t s^2(t)$ is the energy of the clean, noise-free speech and $\sum_t n^2(t)$ is the noise energy.
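The SNR definition can be illustrated with a minimal sketch. It assumes the conventional decibel form, that is, ten times the base-10 logarithm of the ratio of speech energy to noise energy, and it assumes both signals are given as equal-rate sample sequences. The function name `snr_db` is hypothetical and does not come from the patent.

```python
import math


def snr_db(speech, noise):
    """Return the signal-to-noise ratio in decibels.

    `speech` and `noise` are sequences of samples. The decibel form
    10 * log10(speech energy / noise energy) is an assumption based on
    the conventional definition of SNR.
    """
    speech_energy = sum(s * s for s in speech)  # sum over t of s^2(t)
    noise_energy = sum(n * n for n in noise)    # sum over t of n^2(t)
    if speech_energy == 0 or noise_energy == 0:
        raise ValueError("SNR is undefined when either energy is zero")
    return 10.0 * math.log10(speech_energy / noise_energy)
```

For example, a noise signal whose amplitude is one tenth of the speech amplitude has one hundredth of the energy, which corresponds to an SNR of about 20 dB.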
[0106] To synthesize mixed speech at a specific signal-to-noise ratio, the present embodiment needs ...
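One common way to synthesize a mixture at a chosen SNR, which is not necessarily the procedure used in this embodiment, is to rescale the noise so that the ratio of speech energy to scaled noise energy matches the target, and then add it to the clean speech. The sketch below assumes the decibel form of SNR and equal-length sample sequences; the function name `mix_at_snr` and its parameters are hypothetical.

```python
import math


def mix_at_snr(speech, noise, target_snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the target SNR (dB).

    This is an illustrative assumption about how such a mixture can be
    built, not the patent's stated procedure.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_energy = sum(s * s for s in speech)
    noise_energy = sum(n * n for n in noise)
    if speech_energy == 0 or noise_energy == 0:
        raise ValueError("both signals must have non-zero energy")
    # Solve 10*log10(speech_energy / (scale^2 * noise_energy)) = target_snr_db
    # for the noise scaling factor.
    scale = math.sqrt(speech_energy / (noise_energy * 10 ** (target_snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]
```

Subtracting the clean speech from the returned mixture recovers the scaled noise, so the achieved SNR can be checked directly against the target.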


