Speech recognition method and device, electronic equipment and storage medium

By performing speech quality analysis on the speech segments to be recognized and dynamically selecting image frames to participate, the problem of low real-time performance of speech recognition in complex noisy environments is solved, and efficient speech recognition in complex environments is achieved.

CN115547323B · Active · Publication Date: 2026-06-19 · BEIKE TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: BEIKE TECH CO LTD
Filing Date: 2022-08-31
Publication Date: 2026-06-19

Application Information

Patent Timeline

31 Aug 2022: Application
19 Jun 2026: Publication, CN115547323B

IPC: G10L15/22; G10L25/60

AI Tagging

Application Domain

Speech recognition

AI Technical Summary

Technical Problem

Existing technologies have low real-time performance when performing speech recognition in complex noisy environments, mainly due to the excessive time consumption caused by the need for complex multimodal feature extraction.

Method used

By performing a first speech quality analysis on the speech segment to be recognized, it is dynamically determined whether the speech frame to be recognized needs to be combined with the image frame to be recognized for speech activity detection. The image frame is only acquired when needed for feature fusion, which reduces the amount of image frame processing and improves the real-time performance of speech recognition.

Benefits of technology

While ensuring recognition accuracy, the real-time performance and efficiency of speech recognition are significantly improved by reducing the amount of image frame processing.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention provides a speech recognition method, apparatus, electronic device, and storage medium. The speech recognition method includes: performing a first speech quality analysis on a speech segment to be recognized to obtain a first speech quality analysis result; if the first speech quality analysis result indicates that the speech frame to be recognized requires the cooperation of an image frame to be recognized for speech activity detection, acquiring the image frame to be recognized; based on the speech frame and the image frame to be recognized, performing speech activity detection on the speech frame in the speech segment to be recognized to determine a target audio time node of the speech segment to be recognized, and determining a target speech segment based on the target audio time node; and performing speech recognition on the speech segment to be recognized based on the target speech segment. This invention improves the real-time performance of speech recognition while ensuring accuracy.

Description

Technical Field

[0001] This invention relates to the field of speech recognition technology, and in particular to a speech recognition method, apparatus, electronic device, and storage medium.

Background Technology

[0002] Speech recognition is a crucial part of the field of voice interaction, playing an important role in scenarios such as smart homes and intelligent customer service.

[0003] According to relevant technologies, effective speech recognition in complex noisy environments often requires the use of video (combining audio and image) with multimodal features. The need for complex network models to extract image features leads to increased time consumption during speech recognition, impacting its real-time performance.

Summary of the Invention

[0004] This invention provides a speech recognition method, apparatus, electronic device, and storage medium to address the shortcomings of low real-time performance in existing speech recognition technologies and improve the real-time performance of speech recognition.

[0005] This invention provides a speech recognition method, the method comprising: performing a first speech quality analysis on a speech segment to be recognized to obtain a first speech quality analysis result, wherein the first speech quality analysis result is an analysis result of whether each speech frame to be recognized in the speech segment to be recognized needs to cooperate with an image frame to be recognized for speech activity detection, and the image frame to be recognized corresponds to the speech frame to be recognized; if the first speech quality analysis result indicates that the speech frame to be recognized needs to cooperate with the image frame to be recognized for speech activity detection, acquiring the image frame to be recognized; based on the speech frame to be recognized and the image frame to be recognized, performing speech activity detection on the speech frames to be recognized in the speech segment to be recognized to determine a target audio time node of the speech segment to be recognized, and determining a target speech segment of the speech segment to be recognized based on the target audio time node; and performing speech recognition on the speech segment to be recognized based on the target speech segment.

[0006] According to a speech recognition method provided by the present invention, the step of detecting speech activity in the speech frame to be recognized in the speech segment to be recognized based on the speech frame to be recognized and the image frame to be recognized includes: detecting speech activity in the speech frame to be recognized in the speech segment to be recognized by using a pre-trained speech activity detection model based on the speech frame to be recognized and the image frame to be recognized.

[0007] According to a speech recognition method provided by the present invention, the step of performing a first speech quality analysis on a speech segment to be recognized to obtain a first speech quality analysis result includes: determining each speech frame to be recognized based on the speech segment to be recognized; performing a signal-to-noise ratio (SNR) analysis on the speech frames to be recognized to obtain the SNR analysis result of the speech frames to be recognized; performing frequency domain decomposition on the speech segment to be recognized to obtain multiple speech sub-bands to be recognized, and inputting the multiple speech sub-bands to be recognized into corresponding first pre-trained speech quality analysis models to obtain multiple first sub-band analysis results, wherein each first sub-band analysis result is the analysis result of whether each speech frame to be recognized in the speech segment to be recognized needs to cooperate with the image frame to be recognized for speech activity detection in the corresponding frequency domain range; fusing the multiple first sub-band analysis results to obtain a first fused sub-band analysis result, wherein the first fused sub-band analysis result is the analysis result of whether each speech frame to be recognized in the speech segment to be recognized needs to cooperate with the image frame to be recognized for speech activity detection in the entire frequency domain range; and obtaining the first speech quality analysis result based on the SNR analysis result of the speech frames to be recognized and the first fused sub-band analysis result.

[0008] According to a speech recognition method provided by the present invention, the step of performing speech recognition on the speech segment to be recognized based on the target speech segment includes: performing a second speech quality analysis on the target speech segment to obtain a second speech quality analysis result, wherein the second speech quality analysis result is an analysis result of whether each target speech frame in the target speech segment needs to cooperate with sub-images for speech recognition, and the sub-images are images corresponding to different human portrait actions in the target image frame corresponding to the target speech frame; and performing speech recognition on the speech segment to be recognized based on the second speech quality analysis result.

[0009] According to a speech recognition method provided by the present invention, the step of performing speech recognition on the speech segment to be recognized based on the second speech quality analysis result includes: if the second speech quality analysis result indicates that the target speech frame requires the cooperation of a sub-image for speech recognition, then acquiring the sub-image; and performing speech recognition on the speech segment to be recognized by a pre-trained speech recognition model based on the target speech frame and the sub-image.

[0010] According to a speech recognition method provided by the present invention, the step of performing speech recognition on the speech segment to be recognized based on the second speech quality analysis result includes: if the second speech quality analysis result indicates that the target speech frame does not require the cooperation of sub-images for speech recognition, then performing speech recognition on the speech segment to be recognized based on the target speech frame using a pre-trained speech recognition model.

[0011] According to a speech recognition method provided by the present invention, the step of performing a second speech quality analysis on the target speech segment to obtain a second speech quality analysis result includes: determining each target speech frame of the target speech segment based on the target speech segment; performing signal-to-noise ratio (SNR) analysis on the target speech frames to obtain SNR analysis results for the target speech frames; performing frequency domain decomposition on the target speech segment to obtain multiple target speech sub-frequency bands, and inputting the multiple target speech sub-frequency bands into corresponding second pre-trained speech quality analysis models to obtain multiple second sub-frequency band analysis results, wherein the second sub-frequency band analysis results are analysis results of whether each target speech frame in the target speech segment needs sub-images to cooperate in speech recognition within the corresponding frequency domain range; fusing the multiple second sub-frequency band analysis results to obtain a second fused sub-frequency band analysis result, wherein the second fused sub-frequency band analysis result is analysis results of whether each target speech frame in the target speech segment needs sub-images to cooperate in speech recognition within the entire frequency domain range; and obtaining the second speech quality analysis result based on the SNR analysis results of the target speech frames and the second fused sub-frequency band analysis results.

[0012] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the speech recognition method as described above.

[0013] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the speech recognition method as described above.

[0014] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the speech recognition method as described above.

Attached Figure Description

[0015] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0016] Figure 1 is a flowchart illustrating the speech recognition method provided by the present invention;

[0017] Figure 2 is a flowchart illustrating the process of determining a target speech segment provided by the present invention;

[0018] Figure 3 is a schematic diagram of the process of obtaining the first speech quality analysis result by performing a first speech quality analysis on the speech segment to be recognized, provided by the present invention;

[0019] Figure 4 is a schematic diagram of an application scenario for obtaining the first speech quality analysis result provided by the present invention;

[0020] Figure 5 is a schematic diagram of the process of performing speech recognition on the speech segment to be recognized provided by the present invention;

[0021] Figure 6 is a schematic diagram of the process of obtaining the second speech quality analysis result by performing a second speech quality analysis on a target speech segment, as provided by the present invention;

[0022] Figure 7 is a schematic diagram of an application scenario for obtaining the second speech quality analysis result provided by the present invention;

[0023] Figure 8 is a schematic diagram of the structure of the speech recognition device provided by the present invention;

[0024] Figure 9 is a schematic diagram of the structure of the electronic device provided by the present invention.

Detailed Implementation

[0025] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0026] According to relevant technologies, in quiet environments, audio-based speech recognition technology can achieve near-human accuracy. However, in complex, noisy environments, relying solely on audio information for speech recognition of audio segments is easily affected by background noise, making it difficult to achieve good recognition results. By acquiring facial movements from images corresponding to the audio segments to be recognized and performing speech recognition based on multimodal speech recognition technology that combines audio and images, the impact of environmental noise on speech recognition performance can be effectively avoided. However, since complex network models are required for image feature extraction, speech recognition based on existing multimodal speech recognition models will consume more time, affecting the real-time rate of the speech recognition process.

[0027] The speech recognition method provided by this invention performs a first speech quality analysis on the speech segment to be recognized. Based on the result of the first speech quality analysis, it determines whether the speech frame to be recognized in the speech segment needs to be combined with the corresponding image frame to be recognized to jointly determine the target speech segment, and completes the speech recognition of the speech segment to be recognized based on the target speech segment. The target speech segment can be understood as the speech segment in the speech segment to be recognized after removing non-speech segments; in other words, the target speech segment is the speech segment in the speech segment to be recognized that contains a valid speech signal.

[0028] In this invention, the speech recognition process is illustrated using a speech recognition model consisting of a Voice Activity Detection (VAD) module and an Automatic Speech Recognition (ASR) acoustic module as an example. The VAD module is responsible for monitoring valid audio segments in the audio. The ASR acoustic module receives the valid speech segments (corresponding to the target speech segment) detected by the VAD module and performs speech recognition on the speech segment to be recognized based on the target speech segment. It is understood that the speech recognition method provided by this invention is not limited to application in the aforementioned speech recognition model.

[0029] Figure 1 is a flowchart illustrating the speech recognition method provided by the present invention.

[0030] In an exemplary embodiment of the present invention, as can be seen from Figure 1, the speech recognition method may include steps 110 to 140, and each step will be described below.

[0031] In step 110, a first speech quality analysis is performed on the speech segment to be recognized to obtain a first speech quality analysis result. The first speech quality analysis result is the analysis result of whether each speech frame to be recognized in the speech segment to be recognized needs to cooperate with the image frame to be recognized for speech activity detection. The image frame to be recognized corresponds to the speech frame to be recognized.

[0032] In one embodiment, feature extraction can be performed on the speech segment to be recognized to obtain its speech features. Based on these features, a first speech quality analysis is then performed to obtain a first speech quality analysis result. This first speech quality analysis result is an analysis of whether corresponding image frames are needed for speech activity detection in each speech frame of the speech segment to avoid environmental noise interference. The correspondence between the image frame and the speech frame means that both frames target the same person, and the sound emitted by the action of the person in the image frame is semantically and temporally identical to the sound in the speech frame.

[0033] In this embodiment, to avoid the decrease in speech recognition speed caused by image processing of each corresponding image frame for a speech frame to be recognized, the first speech quality analysis result can be used to determine whether the speech frame to be recognized needs the image information of the corresponding image frame to be recognized for joint speech activity detection. In this embodiment, by reducing the number of image frames to be recognized that need to be processed, the computational load is reduced, thereby improving the real-time rate of speech recognition.

[0034] In step 120, if the first speech quality analysis result indicates that the speech frame to be recognized requires the cooperation of the image frame to be recognized for speech activity detection, the image frame to be recognized is acquired.

[0035] In one embodiment, if the first speech quality analysis result indicates that the speech frame to be recognized requires the cooperation of the image frame to be recognized for speech activity detection, it means that the environmental noise of the current speech frame to be recognized is relatively high. In order to improve the accuracy of recognition, it is necessary to combine it with the corresponding image frame to be recognized for speech activity detection. At this time, it is necessary to acquire the image frame to be recognized.
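Steps 110 and 120 can be illustrated with a minimal sketch in which image frames are fetched only for the speech frames flagged by the first speech quality analysis. The function name `acquire_needed_image_frames`, the per-frame boolean flags, and the `fetch_image` callback are hypothetical names introduced for illustration; the patent does not specify an interface.

```python
from typing import Callable, Dict, Sequence


def acquire_needed_image_frames(
    needs_image: Sequence[bool],
    fetch_image: Callable[[int], object],
) -> Dict[int, object]:
    """Fetch an image frame only for speech frames flagged as needing one.

    needs_image[i] is assumed to be the first speech quality analysis
    result for speech frame i (True means "needs the image frame").
    """
    return {i: fetch_image(i) for i, flag in enumerate(needs_image) if flag}


# Toy usage: a stand-in fetcher that records which frames were requested.
fetched_log = []


def fake_fetch(index: int) -> str:
    fetched_log.append(index)
    return f"image-{index}"


flags = [False, True, True, False, False]
images = acquire_needed_image_frames(flags, fake_fetch)
```

In this toy run only two of the five image frames are requested, which is the source of the reduced image processing described above.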

[0036] In step 130, based on the speech frame to be identified and the image frame to be identified, speech activity detection is performed on the speech frame to be identified in the speech segment to be identified, so as to determine the target audio time node of the speech segment to be identified, and the target speech segment of the speech segment to be identified is determined based on the target audio time node.

[0037] In an exemplary embodiment of the present invention, speech activity detection of the speech frame to be identified in the speech segment to be identified, based on the speech frame to be identified and the image frame to be identified, can be achieved in the following manner:

[0038] Based on the speech frame and image frame to be identified, a pre-trained speech activity detection model is used to detect speech activity in the speech frame within the speech segment to be identified. In one example, the VAD (Voice Activity Detection) speech features of the speech frame to be identified and the human motion features (e.g., mouth movements) in the image frame to be identified can be fused. Based on the fused features, the target audio time node of the speech segment to be identified is determined using the pre-trained speech activity detection model. The target audio time node can be understood as the time point of the target speech segment within the speech segment to be identified.

[0039] In another example, the VAD speech features of the speech frame to be recognized and other action features of the human figure in the image frame to be recognized (such as facial expression features, human figure action features, etc.) can be fused, and the target audio time node of the speech segment to be recognized can be determined by a pre-trained speech activity detection model based on the fused features. It should be noted that this embodiment does not specifically limit the other action features of the human figure in the image frame to be recognized.

[0040] The extraction of VAD speech features from the speech frame to be recognized and the extraction of other motion features of the human figure in the image frame to be recognized can be achieved using corresponding neural network models. During the extraction process, the speech signal (corresponding to the speech frame to be recognized) and the image signal (corresponding to the image frame to be recognized) within a specified time unit in the time domain are respectively converted into fixed-length digital vectors. These digital vectors can be understood as the VAD speech features of the corresponding speech frame to be recognized and the other motion features of the human figure in the image frame to be recognized. In the process of fusing the feature vectors of the audio and image (corresponding to the feature fusion of the VAD speech features of the speech frame to be recognized and the other motion features of the human figure in the image frame to be recognized), the Factorized Bilinear Pooling scheme can be used. The fused features generated based on this technique can better reflect the complex relationships between different signals.
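As an illustration of the fusion step, the sketch below follows the commonly described factorized bilinear pooling formulation: both feature vectors are projected, multiplied element-wise, sum-pooled over windows of size k, then power- and L2-normalized. The patent names only the scheme, so the projection matrices `U` and `V` (random here, learned in practice), the feature dimensions, and the normalization steps are assumptions.

```python
import numpy as np


def factorized_bilinear_pooling(x, y, U, V, k):
    """Fuse two feature vectors into one vector of length (U.shape[1] // k).

    x: audio feature, shape (m,);  U: shape (m, k * o)
    y: image feature, shape (n,);  V: shape (n, k * o)
    """
    joint = (U.T @ x) * (V.T @ y)         # element-wise product, shape (k * o,)
    z = joint.reshape(-1, k).sum(axis=1)  # sum pooling over windows of size k
    z = np.sign(z) * np.sqrt(np.abs(z))   # power normalization
    norm = np.linalg.norm(z)
    return z / norm if norm > 0 else z    # L2 normalization


# Toy usage with hypothetical dimensions: 64-d audio, 128-d image, 16-d output.
rng = np.random.default_rng(0)
k, o = 5, 16
audio_feat = rng.standard_normal(64)
image_feat = rng.standard_normal(128)
U = rng.standard_normal((64, k * o))   # stand-in for a learned projection
V = rng.standard_normal((128, k * o))  # stand-in for a learned projection
fused = factorized_bilinear_pooling(audio_feat, image_feat, U, V, k)
```

The fused vector would then be passed to the speech activity detection model in place of the audio-only features.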

[0041] In step 140, speech recognition is performed on the speech segment to be recognized based on the target speech segment.

[0042] In the process of determining the target speech segment, the results of the first speech quality analysis can be used to dynamically determine which image frames to be recognized need to be processed, thereby reducing the number of image frames to be recognized that need to be processed, reducing the amount of computation, and thus improving the real-time rate of speech recognition.

[0043] The speech recognition method provided by this invention reduces the computational load in determining the target speech segment by performing a first speech quality analysis on the speech segment to be recognized and determining whether the speech frame to be recognized in the speech segment needs to be combined with the corresponding image frame to be recognized based on the first speech quality analysis result. Furthermore, it completes the speech recognition of the speech segment to be recognized based on the determined target speech segment, thereby improving the real-time performance of speech recognition while ensuring accuracy.

[0044] Figure 2 is a schematic diagram of the process for determining a target speech segment provided by the present invention.

[0045] In one embodiment, as can be seen from Figure 2, determining the target speech segment may include steps 210 to 290, which will be described in detail below. It is understandable that the entity responsible for determining the target speech segment can be considered the VAD module.

[0046] In step 210, feature extraction is performed on the speech segment to be recognized to obtain the speech features of the speech segment to be recognized.

[0047] In step 220, a first speech quality analysis is performed on the speech segment to be identified based on speech features.

[0048] In one embodiment, feature extraction can be performed on the speech segment to be recognized to obtain speech features of the speech segment, and a first speech quality analysis can be performed based on the speech features of the speech segment to be recognized to obtain a first speech quality analysis result. The speech features of the speech segment to be recognized can be extracted using a neural network model. In application, performing the first speech quality analysis based on the extracted speech features rather than the entire speech segment to be recognized can improve the accuracy and efficiency of the analysis.

[0049] In step 230, based on the first speech quality analysis result, it is determined whether each speech frame to be identified in the speech segment to be identified needs to cooperate with the image frame to be identified for speech activity detection.

[0050] In step 240, if not, the VAD speech features of the speech frame to be identified are obtained.

[0051] In step 250, speech activity is detected using a VAD model based on the VAD speech features of the speech frame to be identified.

[0052] In one embodiment, if the first speech quality analysis result indicates that the speech frame to be identified does not require the cooperation of the image frame to be identified for speech activity detection, it means that the environmental noise of the current speech frame to be identified is relatively low. Therefore, performing speech activity detection based on the speech frame to be identified will not affect the accuracy of speech activity detection. In one example, the VAD speech features of the speech frame to be identified can be directly extracted, and the target audio time node of the speech segment to be identified can be determined through a VAD model based on the VAD speech features of the speech frame to be identified. In this embodiment, while ensuring the accuracy of speech activity detection, reducing the number of image frames to be identified that need to be processed can reduce the computational load, thereby improving the real-time rate of speech recognition.

[0053] In step 260, if so, the image frame to be recognized corresponding to the speech frame to be recognized is obtained.

[0054] In step 270, the VAD speech features of the speech frame to be identified and the image features of the image frame to be identified are extracted.

[0055] In step 280, the VAD speech features and image features are fused, and speech activity is detected using the VAD model based on the fused features.

[0056] In one embodiment, if the first speech quality analysis result indicates that the speech frame to be identified requires the cooperation of the image frame to be identified for speech activity detection, it means that the environmental noise of the current speech frame to be identified is relatively large. In order to improve the accuracy of recognition, it is necessary to combine the corresponding image frame to be identified for speech activity detection.

[0057] In one example, the VAD speech features of the speech frame to be identified and the image features of the image frame to be identified can be fused, and the target audio time node of the speech segment to be identified can be determined by a speech activity detection model based on the fused features.

[0058] In step 290, the target audio time node of the speech segment to be identified is determined, and the target speech segment of the speech segment to be identified is determined based on the target audio time node.

[0059] In this embodiment, since the first speech quality analysis result can dynamically determine which image frames to be recognized need to be processed during the process of determining the target speech segment, the number of image frames to be recognized that need to be processed is reduced, the amount of computation is reduced, and the real-time rate of speech recognition is improved.
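Steps 230 to 290 can be sketched as a per-frame routing between an audio-only detector and a fused audio-image detector, followed by conversion of the per-frame decisions into start and end times. All names here (`detect_speech_frames`, `speech_time_nodes`, the toy detectors, the fixed frame duration) are hypothetical; the patent's detectors are pre-trained models, not the threshold stand-ins used below.

```python
from typing import Callable, List, Sequence, Tuple


def detect_speech_frames(frames, needs_image, audio_vad, fused_vad, fetch_image):
    """Route each frame to the audio-only or the fused detector."""
    decisions = []
    for i, frame in enumerate(frames):
        if needs_image[i]:
            # Noisy frame: fetch the matching image frame and use both.
            decisions.append(fused_vad(frame, fetch_image(i)))
        else:
            # Clean frame: audio features alone are assumed sufficient.
            decisions.append(audio_vad(frame))
    return decisions


def speech_time_nodes(
    is_speech: Sequence[bool], frame_seconds: float
) -> List[Tuple[float, float]]:
    """Convert per-frame speech flags into (start, end) times in seconds."""
    nodes, start = [], None
    for i, flag in enumerate(is_speech):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            nodes.append((start * frame_seconds, i * frame_seconds))
            start = None
    if start is not None:
        nodes.append((start * frame_seconds, len(is_speech) * frame_seconds))
    return nodes


# Toy usage: frames are scalar "energies"; images are string labels.
frames = [0.0, 0.9, 0.2, 0.8]
needs_image = [False, False, True, False]
decisions = detect_speech_frames(
    frames,
    needs_image,
    audio_vad=lambda f: f > 0.5,
    fused_vad=lambda f, img: img == "mouth-open",
    fetch_image=lambda i: "mouth-open",
)
nodes = speech_time_nodes(decisions, frame_seconds=0.5)
```

Here the third frame has low audio energy but is still marked as speech because the image cue is used, so the target speech segment spans 0.5 s to 2.0 s.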

[0060] Figure 3 is a schematic diagram of the process provided by the present invention for performing a first speech quality analysis on a speech segment to be recognized to obtain a first speech quality analysis result.

[0061] In an exemplary embodiment of the present invention, as can be seen from Figure 3, performing a first speech quality analysis on the speech segment to be recognized to obtain the first speech quality analysis result may include steps 310 to 350, and each step will be described below.

[0062] In step 310, based on the speech segment to be identified, each speech frame to be identified of the speech segment to be identified is determined.

[0063] In step 320, a signal-to-noise ratio (SNR) analysis is performed on the speech frame to be recognized to obtain the SNR analysis results of the speech frame to be recognized.

[0064] In one embodiment, by performing signal-to-noise ratio (SNR) analysis on each speech frame to be identified, the SNR analysis results of the speech frame to be identified can be obtained, and the strength of the environmental noise of the speech signal corresponding to the speech frame to be identified can be analyzed.
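One simple way to obtain a per-frame SNR figure is sketched below. The patent does not state how the SNR is computed, so the approach here is an assumption: the noise power is taken as known (for example, estimated from a noise-only stretch), and each frame's power is treated as signal plus noise. The function name `frame_snr_db` and the frame length are hypothetical.

```python
import numpy as np


def frame_snr_db(signal, frame_len, noise_power):
    """Estimate an SNR in dB for each non-overlapping frame of the signal."""
    n_frames = len(signal) // frame_len
    frames = np.asarray(signal[: n_frames * frame_len], dtype=float)
    frames = frames.reshape(n_frames, frame_len)
    frame_power = np.mean(frames ** 2, axis=1)
    # Frame power is treated as signal + noise; clamp to keep the log finite.
    signal_power = np.maximum(frame_power - noise_power, 1e-12)
    return 10.0 * np.log10(signal_power / noise_power)


# Toy usage: 0.1 s at 16 kHz, noise throughout, a 200 Hz tone in the second half.
sample_rate = 16000
t = np.arange(1600) / sample_rate
rng = np.random.default_rng(0)
signal = 0.1 * rng.standard_normal(1600)       # noise with power near 0.01
signal[800:] += np.sin(2 * np.pi * 200 * t[800:])
snr = frame_snr_db(signal, frame_len=400, noise_power=0.01)
```

Frames with a low estimated SNR are the natural candidates for being flagged as needing the image frame.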

[0065] In step 330, the speech segment to be recognized is decomposed in the frequency domain to obtain multiple speech sub-bands to be recognized. These multiple speech sub-bands are then input into their respective first pre-trained speech quality analysis models to obtain multiple first sub-band analysis results. The first sub-band analysis result indicates whether each speech frame in the speech segment to be recognized requires the cooperation of an image frame to be recognized for speech activity detection within its corresponding frequency domain.

[0066] From a frequency domain perspective, for the speech segment to be recognized, the amount of speech information contained in different frequency ranges gradually decreases from low to high frequencies. To more fully analyze the speech information concentrated in the low-frequency region and make more accurate detections of speech quality, the speech segment to be recognized can be decomposed in the frequency domain to obtain multiple speech sub-frequency bands to be recognized, and the audio quality of each sub-frequency band can be analyzed. It should be noted that the division boundaries from low frequency to mid frequency and then to high frequency can be adjusted according to actual conditions, and are not specifically limited in this embodiment. In one example, the low-frequency range can refer to 30-150 Hz, the mid-frequency range can refer to 150 Hz-5 kHz, and the high-frequency range can refer to 5-16 kHz.

[0067] In one embodiment, the speech segment to be recognized can be decomposed in the frequency domain to generate sub-signal sequences of different sub-frequency bands of the speech to be recognized. Each sub-signal sequence of the speech sub-frequency band can retain information within a specific frequency domain range. To obtain the speech quality of each sub-frequency band, each sub-frequency band can be input into a corresponding first pre-trained speech quality analysis model to obtain multiple first sub-frequency band analysis results. The first sub-frequency band analysis results can characterize whether each speech frame in the speech segment to be recognized requires the cooperation of an image frame to be recognized for speech activity detection within the corresponding frequency domain range.
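The frequency domain decomposition can be sketched with an FFT mask per band. This is only one possible realization (the patent does not name a decomposition method); the function name `split_into_subbands` is hypothetical, the band edges are taken from the example ranges above, and a 32 kHz sample rate is assumed so that the 16 kHz upper edge is representable.

```python
import numpy as np


def split_into_subbands(signal, sample_rate, band_edges_hz):
    """Return one time-domain sub-signal per (low, high) band in Hz."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    subbands = []
    for low, high in band_edges_hz:
        mask = (freqs >= low) & (freqs < high)  # keep only bins in this band
        subbands.append(np.fft.irfft(spectrum * mask, n=len(signal)))
    return subbands


# Toy usage: one second containing a tone in each of the three example bands.
sample_rate = 32000
t = np.arange(sample_rate) / sample_rate
signal = (
    np.sin(2 * np.pi * 100 * t)
    + np.sin(2 * np.pi * 1000 * t)
    + np.sin(2 * np.pi * 8000 * t)
)
bands = [(30, 150), (150, 5000), (5000, 16000)]
subbands = split_into_subbands(signal, sample_rate, bands)
powers = [float(np.mean(b ** 2)) for b in subbands]
```

Each sub-signal keeps the original length, so it can be framed in the same way as the full segment before being passed to the model for its band.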

[0068] In one example, the first pre-trained speech quality analysis model can be a convolutional neural network (CNN) model or a recurrent neural network (RNN) model. During training, the first pre-trained speech quality analysis model can be trained using first training samples. The first training samples are labeled with whether a speech frame requires a corresponding image frame for speech activity detection.

[0069] In one example, if the frequency domain range of the speech sub-band to be identified is the low-frequency range, the analysis result of the first sub-band can be obtained based on the first pre-trained speech quality analysis model corresponding to the low-frequency range. The analysis result of the first sub-band can characterize whether each speech frame to be identified in the speech segment needs the cooperation of the image frame to be identified for speech activity detection in the low-frequency domain. In another example, if the frequency domain range of the speech sub-band to be identified is the high-frequency range, the analysis result of the first sub-band can be obtained based on the first pre-trained speech quality analysis model corresponding to the high-frequency range. The analysis result of the first sub-band can characterize whether each speech frame to be identified in the speech segment needs the cooperation of the image frame to be identified for speech activity detection in the high-frequency domain.

[0070] In step 340, the analysis results of multiple first sub-frequency bands are fused to obtain the first fused sub-frequency band analysis result. The first fused sub-frequency band analysis result is an analysis result indicating whether each speech frame in the speech segment to be identified requires the cooperation of the image frame to be identified for speech activity detection across the entire frequency domain.

[0071] In one embodiment, the fusion of multiple first sub-frequency band analysis results can be achieved by means of voting, weighted summation, etc. The weight allocation can be adjusted according to the actual situation, and is not specifically limited in this embodiment.
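
The two fusion options named above, voting and weighted summation, might look as follows. This is a sketch under assumptions: each sub-frequency band result is taken to be either a per-frame yes/no decision or a per-frame score in [0, 1], and the weights shown are arbitrary illustrative values.

```python
import numpy as np

def fuse_by_vote(band_decisions):
    """Majority vote across bands; rows are bands, columns are speech frames."""
    decisions = np.asarray(band_decisions, dtype=bool)
    return decisions.sum(axis=0) * 2 > decisions.shape[0]

def fuse_by_weighted_sum(band_scores, weights):
    """Weighted sum across bands of per-frame scores; weights are normalized to sum to 1."""
    scores = np.asarray(band_scores, dtype=float)
    w = np.asarray(weights, dtype=float)
    return (w / w.sum()) @ scores

# Three bands, four frames: 1 means "image frame needed" in that band.
votes = fuse_by_vote([[1, 0, 1, 0],
                      [1, 1, 0, 0],
                      [0, 1, 1, 0]])

# Three bands, two frames; the low band gets the largest (assumed) weight.
fused_scores = fuse_by_weighted_sum([[0.9, 0.2],
                                     [0.6, 0.4],
                                     [0.1, 0.8]], weights=[0.5, 0.3, 0.2])
```

Giving the low-frequency band a larger weight is one way to reflect the observation that more speech information is concentrated there; the actual allocation is left open by the embodiment.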

[0072] In step 350, the first speech quality analysis result is obtained based on the signal-to-noise ratio analysis result of the speech frame to be identified and the first fused sub-frequency band analysis result.

[0073] In one embodiment, a weighted summation method can be used to obtain a first speech quality analysis result based on the signal-to-noise ratio analysis result of the speech frame to be identified and the analysis result of the first fused sub-frequency band. The weight allocation can be adjusted according to actual conditions and is not specifically limited in this embodiment. Through this embodiment, by comprehensively considering different aspects of the quality information of the speech frame to be identified, the semantic information contained in the low-frequency region can be more fully utilized, thereby obtaining a more accurate first speech quality analysis result.
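
As an illustration of the weighted summation in step 350, the sketch below combines a per-frame SNR value with the fused sub-frequency band score. The mapping from SNR to a score, the 0 dB and 20 dB limits, the weight of 0.4, and the decision threshold of 0.5 are all assumptions made for the example and are not taken from the embodiment.

```python
import numpy as np

def snr_to_need_score(snr_db, low_db=0.0, high_db=20.0):
    """Map SNR in dB to [0, 1]; lower SNR gives a higher 'image frame needed' score."""
    snr = np.asarray(snr_db, dtype=float)
    return np.clip((high_db - snr) / (high_db - low_db), 0.0, 1.0)

def first_quality_result(snr_db, fused_band_score, snr_weight=0.4, threshold=0.5):
    """Weighted sum of the SNR-based score and the fused sub-band score, then threshold."""
    band = np.asarray(fused_band_score, dtype=float)
    need = snr_weight * snr_to_need_score(snr_db) + (1.0 - snr_weight) * band
    return need >= threshold

# Three frames: clean, noisy, and moderately noisy.
needs_image = first_quality_result(snr_db=[25.0, 5.0, 10.0],
                                   fused_band_score=[0.1, 0.8, 0.3])
```

Only frames flagged `True` would then trigger acquisition of the corresponding image frame.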

[0074] In one embodiment, a first speech quality analysis module can be added to the VAD module, and a first speech quality analysis result can be obtained based on the first speech quality analysis module. The process by which the first speech quality analysis module obtains the first speech quality analysis result is further explained below with reference to Figure 4.

[0075] Figure 4 is a schematic diagram of an application scenario for obtaining the first speech quality analysis result provided by the present invention.

[0076] In one embodiment, the speech segment to be identified can be input to a first speech quality analysis module. During application, the first speech quality analysis result can be determined through both signal-to-noise ratio analysis and sub-band audio quality analysis.

[0077] As can be seen from Figure 4, obtaining the first speech quality analysis result may include steps 410 to 480, and each step will be described below.

[0078] In step 410, the speech segment to be recognized is input.

[0079] In step 420, signal-to-noise ratio (SNR) analysis is performed on each speech frame to be identified in the speech segment to be identified, and the SNR analysis results of the speech frames to be identified are obtained.

[0080] The signal-to-noise ratio (SNR) analysis result of the speech frame to be identified can characterize the strength of the environmental noise in the speech signal corresponding to that speech frame.
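
The embodiment does not state how the per-frame SNR is computed. One simple possibility, shown here purely as an assumption, is an energy-based estimate in which the noise power is known in advance (for example, measured on frames without speech); `frame_signal` and `frame_snr_db` are hypothetical helper names.

```python
import numpy as np

def frame_signal(signal, frame_len):
    """Cut a 1-D signal into non-overlapping frames, dropping any incomplete tail."""
    n_frames = len(signal) // frame_len
    return np.reshape(signal[: n_frames * frame_len], (n_frames, frame_len))

def frame_snr_db(signal, frame_len, noise_power, eps=1e-12):
    """Energy-based SNR per frame, assuming noise_power was estimated beforehand."""
    frames = frame_signal(np.asarray(signal, dtype=float), frame_len)
    power = np.mean(frames ** 2, axis=1)
    # Subtract the assumed noise power to approximate the speech power.
    speech_power = np.maximum(power - noise_power, eps)
    return 10.0 * np.log10(speech_power / noise_power)

# Two toy frames of four samples each: a loud frame and a frame close to the noise floor.
toy = np.concatenate([np.full(4, np.sqrt(1.01)), np.full(4, np.sqrt(0.02))])
snr_per_frame = frame_snr_db(toy, frame_len=4, noise_power=0.01)
```

A low value indicates strong environmental noise relative to the speech in that frame.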

[0081] In step 430, the speech segment to be recognized is decomposed in the frequency domain to obtain multiple speech sub-bands to be recognized.

[0082] In one embodiment, the speech segment to be identified can be decomposed in the frequency domain in order from low frequency to high frequency to obtain speech sub-band 1, speech sub-band 2, ... speech sub-band n to be identified.

[0083] In step 440, the speech sub-band 1 to be identified is input into the first pre-trained speech quality analysis model 1.

[0084] In step 450, the speech sub-band 2 to be identified is input into the first pre-trained speech quality analysis model 2.

[0085] In step 460, the speech sub-band n to be identified is input into the first pre-trained speech quality analysis model n.

[0086] In one embodiment, each speech sub-band to be identified can be input into a corresponding first pre-trained speech quality analysis model to obtain the corresponding first sub-band analysis result. The first sub-band analysis result can be used to characterize whether each speech frame to be identified in the speech segment to be identified requires the cooperation of an image frame to be identified for speech activity detection within the corresponding frequency domain.

[0087] It should be noted that the first pre-trained speech quality analysis model 1, the first pre-trained speech quality analysis model 2, ..., the first pre-trained speech quality analysis model n can be obtained through pre-training.

[0088] In step 470, the analysis results of multiple speech sub-bands to be identified are fused to obtain the first fused sub-band analysis result.

[0089] The analysis results of the speech sub-bands to be identified can be understood as the first sub-frequency band analysis results. In application, methods such as voting and weighted summation can be used to fuse the analysis results. The weight allocation can be adjusted according to the actual situation, and is not specifically limited in this embodiment. The first fused sub-frequency band analysis result can characterize whether each speech frame to be identified in the speech segment to be identified needs the cooperation of the image frame to be identified for speech activity detection across the entire frequency domain.

[0090] In step 480, the signal-to-noise ratio analysis result of the speech frame to be identified and the first fused sub-frequency band analysis result are fused to obtain the first speech quality analysis result.

[0091] In one embodiment, a weighted summation method can be used to obtain a first speech quality analysis result based on the signal-to-noise ratio (SNR) analysis result of the speech frame to be identified and the first fused sub-frequency band analysis result. The weight allocation can be adjusted according to actual conditions and is not specifically limited in this embodiment. Since the first speech quality analysis result includes both SNR analysis and sub-frequency band audio quality analysis results, the accuracy and comprehensiveness of the first speech quality analysis result are improved, ensuring the accuracy of speech activity detection, and thus obtaining a more accurate target speech segment.
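
Steps 410 to 480 could be wired together roughly as in the sketch below. Every component is passed in as a callable so that the flow itself is visible; the function name, parameter names, and the toy stand-ins in the usage lines are hypothetical and do not represent the trained models of the embodiment.

```python
def first_speech_quality_analysis(segment, snr_fn, split_fn, band_models,
                                  fuse_bands_fn, fuse_final_fn):
    """Run the flow of steps 410 to 480 with pluggable components."""
    snr_result = snr_fn(segment)                     # step 420: SNR analysis
    subbands = split_fn(segment)                     # step 430: frequency decomposition
    if len(subbands) != len(band_models):
        raise ValueError("one model is expected per sub-band")
    # Steps 440 to 460: each sub-band goes to its own pre-trained model.
    band_results = [model(band) for model, band in zip(band_models, subbands)]
    fused_band_result = fuse_bands_fn(band_results)  # step 470: fuse band results
    return fuse_final_fn(snr_result, fused_band_result)  # step 480: final fusion

# Toy stand-ins (assumptions): scores in [0, 1], higher means "image frame needed".
result = first_speech_quality_analysis(
    segment=[0.1, 0.2, 0.3, 0.4],
    snr_fn=lambda seg: 0.75,
    split_fn=lambda seg: [seg, seg],
    band_models=[lambda band: 1.0, lambda band: 0.0],
    fuse_bands_fn=lambda results: sum(results) / len(results),
    fuse_final_fn=lambda snr, fused: 0.5 * snr + 0.5 * fused >= 0.5,
)
```

Keeping one model per sub-band, as in steps 440 to 460, means the list of models must have the same length as the list of sub-bands, which the sketch checks explicitly.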

[0092] In an exemplary embodiment of the present invention, performing speech recognition on the speech segment to be recognized based on the target speech segment may include the following steps:

[0093] A second speech quality analysis is performed on the target speech segment to obtain the second speech quality analysis result. The second speech quality analysis result is the analysis result of whether each target speech frame in the target speech segment needs the cooperation of sub-images for speech recognition. The sub-image is the image corresponding to different human portrait actions in the target image frame corresponding to the target speech frame.

[0094] Based on the results of the second speech quality analysis, speech recognition is performed on the speech segment to be recognized.

[0095] In one embodiment, feature extraction can be performed on the target speech segment to obtain its speech features, and a second speech quality analysis can be performed based on these features to obtain a second speech quality analysis result. This second speech quality analysis result determines whether the target speech frame in the target speech segment needs to be combined with sub-images for speech recognition to improve accuracy and speed. Sub-images are images corresponding to different human facial movements in the target image frame corresponding to the target speech frame. For example, sub-images could be images of mouth movements, facial expressions, or body movements in the target image frame. In this embodiment, by using appropriate sub-images in conjunction with the target speech frame for speech recognition, recognition accuracy can be improved.

[0096] In one embodiment, speech recognition of the speech segment to be recognized based on the second speech quality analysis result can be achieved in the following way:

[0097] If the second speech quality analysis result indicates that the target speech frame requires sub-images for speech recognition, then the sub-images are acquired. Based on the target speech frame and the sub-images, a pre-trained speech recognition model is used to perform speech recognition on the speech segment to be recognized. The pre-trained speech recognition model can be obtained through pre-training.

[0098] In one example, the sub-image corresponding to the target speech frame can be determined based on the second speech quality analysis result. Further, the ASR speech features of the target speech frame and the image features of the sub-image can be extracted separately, and the ASR speech features of the target speech frame and the image features of the sub-image are fused. Based on the fused features, a speech recognition model is used to perform speech recognition on the speech segment to be recognized. In this embodiment, by dynamically determining which image features (corresponding sub-image features) to use in conjunction with the target speech frame for speech recognition, the accuracy of recognition can be improved. Furthermore, during speech recognition, selecting the corresponding sub-image instead of the entire target image frame to accompany the target speech frame reduces the computational load of feature processing on the sub-image, thereby improving the real-time rate of speech recognition.

[0099] The extraction of ASR speech features from the target speech frame and the extraction of image features from the sub-images can be achieved using the corresponding neural network models. Furthermore, the Factorized Bilinear Pooling scheme described earlier can be used for feature fusion.
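
For orientation, a factorized bilinear pooling fusion is commonly described as projecting both feature vectors, multiplying them elementwise, sum-pooling groups of `k` values, and normalizing. The sketch below follows that common description and is an assumption about the scheme referred to here; the projection matrices `U` and `V` would normally be learned, and all names and dimensions are illustrative.

```python
import numpy as np

def factorized_bilinear_pooling(audio_feat, image_feat, U, V, k):
    """Fuse two feature vectors; U is (d_audio, o*k), V is (d_image, o*k); returns (o,)."""
    joint = (audio_feat @ U) * (image_feat @ V)          # elementwise product, shape (o*k,)
    pooled = joint.reshape(-1, k).sum(axis=1)            # sum-pool each group of k values
    pooled = np.sign(pooled) * np.sqrt(np.abs(pooled))   # power normalization
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled         # L2 normalization

# Toy example with fixed (untrained) projections: 2-D audio, 3-D image, output size 2, k = 2.
fused = factorized_bilinear_pooling(
    audio_feat=np.array([1.0, 2.0]),
    image_feat=np.array([1.0, 1.0, 1.0]),
    U=np.ones((2, 4)),
    V=np.ones((3, 4)),
    k=2,
)
```

The output dimension is the projection width divided by `k`, so the fused vector stays compact even when the two input features are large.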

[0100] In yet another embodiment, speech recognition of the speech segment to be recognized based on the second speech quality analysis result can be achieved in the following way:

[0101] If the second speech quality analysis result indicates that the target speech frame does not require sub-images for speech recognition, then speech recognition of the speech segment to be recognized is performed based on the target speech frame using a pre-trained speech recognition model. The pre-trained speech recognition model can be obtained through pre-training.

[0102] Figure 5 is a schematic diagram of the speech recognition process for the speech segment to be recognized provided by the present invention.

[0103] In one embodiment, as can be seen from Figure 5, performing speech recognition on the speech segment to be recognized may include steps 510 to 580, and each step will be described below.

[0104] It can be understood that speech recognition on the speech segment to be recognized is performed by the ASR acoustic module.

[0105] In step 510, feature extraction is performed on the target speech segment to obtain the speech features of the target speech segment.

[0106] In step 520, a second speech quality analysis is performed on the target speech segment based on speech features.

[0107] In one embodiment, feature extraction can be performed on the target speech segment to obtain its speech features, and a second speech quality analysis can be performed based on these features to obtain the second speech quality analysis result. The speech features of the target speech segment can be extracted using a neural network model. In application, performing the second speech quality analysis based on the extracted speech features, rather than the entire target speech segment, can improve the accuracy and speed of the analysis.

[0108] It should be noted that the second speech quality analysis result is an analysis of whether each target speech frame in the target speech segment needs to be combined with sub-images for speech recognition to improve the accuracy and speed of speech recognition. Here, the sub-image can be an image corresponding to different human facial movements in the target image frame corresponding to the target speech frame.

[0109] In step 530, based on the second speech quality analysis result, it is determined whether the target speech frame needs to be accompanied by sub-images for speech recognition.

[0110] In step 540, if not, the ASR speech features of the target speech frame are obtained.

[0111] In step 550, speech recognition is performed through the ASR model based on the ASR speech features of the target speech frame.

[0112] In one example, the VAD (voice activity detection) features of the target speech frame can be directly extracted, and speech recognition can be performed using an ASR (automatic speech recognition) model based on these features. In this embodiment, once it is determined that the current target speech frame does not require sub-images for speech recognition, speech recognition is performed directly on the VAD features of the target speech frame, which improves the speed of speech recognition while maintaining accuracy.

[0113] In step 560, if so, the corresponding sub-image is obtained.

[0114] In step 570, ASR speech features of the target speech frame and image features of the sub-image are extracted.

[0115] In step 580, the ASR speech features and image features are fused, and speech recognition is performed using the ASR model based on the fused features.

[0116] In one embodiment, ASR speech features of the target speech frame and image features of sub-images (e.g., mouth movement features, facial expression features, or body movement features in the target image frame) can be fused, and speech recognition can be performed using an ASR model based on the fused features. It should be noted that there can be one or more sub-images corresponding to the target speech frame, which can be adjusted according to actual conditions. In this embodiment, using appropriate sub-images in conjunction with the target speech frame for speech recognition can improve recognition accuracy.
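
The branch in steps 530 to 580 can be sketched as a single function in which the sub-image is fetched only when the second speech quality analysis result asks for it. All callables here are hypothetical placeholders for the feature extractors, fusion step, and ASR model; the toy lambdas in the usage lines exist only to make the control flow observable.

```python
def recognize_frame(speech_frame, needs_sub_image, get_sub_image,
                    audio_features, image_features, fuse, asr_model):
    """Audio-only path (steps 540-550) or audio-visual path (steps 560-580)."""
    speech_feats = audio_features(speech_frame)
    if not needs_sub_image:                      # step 530 answered "no"
        return asr_model(speech_feats)           # steps 540-550
    sub_image = get_sub_image()                  # step 560: acquired only when needed
    fused = fuse(speech_feats, image_features(sub_image))   # steps 570-580
    return asr_model(fused)

calls = []

def fetch_mouth_region():
    """Hypothetical sub-image source; records each call so laziness can be checked."""
    calls.append("fetch")
    return [0.2, 0.4]

components = dict(
    get_sub_image=fetch_mouth_region,
    audio_features=lambda frame: list(frame),
    image_features=lambda image: list(image),
    fuse=lambda a, v: a + v,                 # toy fusion: concatenate the feature lists
    asr_model=lambda feats: len(feats),      # toy "model": reports the feature count
)
audio_only = recognize_frame([0.1, 0.3, 0.5], False, **components)
audio_visual = recognize_frame([0.1, 0.3, 0.5], True, **components)
```

Because `get_sub_image` is called only on the audio-visual path, frames that do not need visual support incur no image acquisition or image feature extraction cost.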

[0117] Figure 6 is a schematic diagram of the process of performing a second speech quality analysis on a target speech segment to obtain the second speech quality analysis result, as provided by the present invention.

[0118] In an exemplary embodiment of the present invention, as can be seen from Figure 6, performing a second speech quality analysis on the target speech segment to obtain the second speech quality analysis result may include steps 610 to 650, and each step will be described below.

[0119] In step 610, each target speech frame of the target speech segment is determined based on the target speech segment.

[0120] In step 620, a signal-to-noise ratio (SNR) analysis is performed on the target speech frame to obtain the SNR analysis results of the target speech frame.

[0121] The signal-to-noise ratio (SNR) analysis results of the target speech frame can characterize the strength of the environmental noise of the speech signal corresponding to the target speech frame.

[0122] In step 630, the target speech segment is decomposed in the frequency domain to obtain multiple target speech sub-frequency bands. These multiple target speech sub-frequency bands are then input into their respective second pre-trained speech quality analysis models to obtain multiple second sub-frequency band analysis results. The second sub-frequency band analysis results determine whether each target speech frame in the target speech segment requires sub-images for speech recognition within its corresponding frequency domain.

[0123] In one example, the second pre-trained speech quality analysis model can be a convolutional neural network (CNN) model or a recurrent neural network (RNN) model. During training, the second pre-trained speech quality analysis model can be trained using second training samples. These second training samples are labeled with whether a speech frame requires sub-images for speech recognition, where sub-images are images corresponding to different human actions in the image frame corresponding to the speech frame. It can be understood that during training, if it is determined that a speech frame requires sub-images for speech recognition, the second pre-trained speech quality analysis model can also output information on the specific sub-images required by that speech frame.

[0124] In one embodiment, the target speech segment can be decomposed in the frequency domain to generate sub-signal sequences of different target speech sub-bands. Each sub-signal sequence of the target speech sub-band can retain information within a specific frequency domain range. To obtain the speech quality of each target speech sub-band, each target speech sub-band can be input into a corresponding second pre-trained speech quality analysis model to obtain multiple second sub-band analysis results. The second sub-band analysis results can characterize whether each target speech frame in the target speech segment requires sub-images for speech recognition within its corresponding frequency domain range.

[0125] In step 640, the analysis results of multiple second sub-frequency bands are fused to obtain the second fused sub-frequency band analysis result. The second fused sub-frequency band analysis result is the analysis result of whether each target speech frame in the target speech segment requires sub-images for speech recognition across the entire frequency domain.

[0126] In one embodiment, the fusion of multiple second sub-frequency band analysis results can be achieved by means of voting, weighted summation, etc. The weight allocation can be adjusted according to the actual situation, and is not specifically limited in this embodiment.

[0127] In step 650, a second speech quality analysis result is obtained based on the signal-to-noise ratio analysis result of the target speech frame and the second fused sub-frequency band analysis result.

[0128] In one embodiment, a weighted summation method can be used to obtain a second speech quality analysis result based on the signal-to-noise ratio analysis result of the target speech frame and the analysis result of the second fused sub-frequency band. The weight allocation can be adjusted according to actual conditions and is not specifically limited in this embodiment. Through this embodiment, by comprehensively considering different aspects of the quality information of the target speech frame, the semantic information contained in the low-frequency region can be more fully utilized, thereby obtaining a more accurate and higher-quality second speech quality analysis result.

[0129] In one embodiment, a second speech quality analysis module can be added to the ASR acoustic module, and the second speech quality analysis result can be obtained based on the second speech quality analysis module. The process by which the second speech quality analysis module obtains the second speech quality analysis result is further explained below with reference to Figure 7.

[0130] Figure 7 is a schematic diagram of an application scenario for obtaining the second speech quality analysis result provided by the present invention.

[0131] In one embodiment, the target speech segment can be input to a second speech quality analysis module. During application, the second speech quality analysis result can be determined through both signal-to-noise ratio analysis and sub-band audio quality analysis.

[0132] As can be seen from Figure 7, obtaining the second speech quality analysis result may include steps 710 to 780, and each step will be described below.

[0133] In step 710, the target speech segment is input.

[0134] In step 720, signal-to-noise ratio (SNR) analysis is performed on each target speech frame in the target speech segment to obtain the SNR analysis results of the target speech frames.

[0135] The signal-to-noise ratio (SNR) analysis result of the target speech frame can characterize the intensity of environmental noise in the speech signal corresponding to that target speech frame.

[0136] In step 730, the target speech segment is decomposed in the frequency domain to obtain multiple target speech sub-bands.

[0137] In one embodiment, the target speech segment can be decomposed in the frequency domain in order from low frequency to high frequency to obtain target speech sub-band 1, target speech sub-band 2, ... target speech sub-band n respectively.

[0138] In step 740, the target speech sub-band 1 is input into the second pre-trained speech quality analysis model 1.

[0139] In step 750, the target speech sub-band 2 is input into the second pre-trained speech quality analysis model 2.

[0140] In step 760, the target speech sub-band n is input into the second pre-trained speech quality analysis model n.

[0141] In one embodiment, each target speech sub-band can be input into a corresponding second pre-trained speech quality analysis model to obtain the corresponding second sub-band analysis result. The second sub-band analysis result can be used to characterize whether each target speech frame in the target speech segment requires sub-images for speech recognition within the corresponding frequency domain, thereby providing a matching basis for matching the target speech frame with the corresponding sub-image.

[0142] The second pre-trained speech quality analysis model 1, the second pre-trained speech quality analysis model 2, ..., the second pre-trained speech quality analysis model n can be obtained through pre-training.

[0143] It should be noted that the training samples of the first pre-trained speech quality analysis model and the second pre-trained speech quality analysis model mentioned above are different. This ensures that the first speech quality analysis result is used to characterize whether the speech frame to be identified needs the cooperation of the image frame to be identified for speech activity detection, and the second speech quality analysis result is used to characterize whether the target speech frame needs the cooperation of the sub-image for speech recognition.

[0144] In step 770, the analysis results of multiple target speech sub-bands are fused to obtain the second fused sub-band analysis result.

[0145] The analysis results of the target speech sub-band can be understood as the analysis results of the second sub-band. In the application process, the analysis results can be fused by means of voting, weighted summation, etc. The weight allocation can be adjusted according to the actual situation, and is not specifically limited in this embodiment.

[0146] In step 780, the signal-to-noise ratio analysis results of the target speech frame and the second fused sub-frequency band analysis results are fused to obtain the second speech quality analysis result.

[0147] In one embodiment, a second speech quality analysis result can be obtained by weighted summation based on the signal-to-noise ratio (SNR) analysis result of the target speech frame and the second fused sub-frequency band analysis result. The weight allocation can be adjusted according to actual conditions and is not specifically limited in this embodiment. Since the second speech quality analysis result includes both SNR analysis and sub-frequency band audio quality analysis results, its accuracy and comprehensiveness can be improved. This allows for matching corresponding sub-images to the target speech frame for joint speech recognition, thereby improving the accuracy of speech recognition.

[0148] Based on the same concept, the present invention also provides a speech recognition device.

[0149] The speech recognition device provided by the present invention is described below. The speech recognition device described below and the speech recognition method described above may be cross-referenced.

[0150] Figure 8 is a schematic diagram of the structure of the speech recognition device provided by the present invention.

[0151] In an exemplary embodiment of the present invention, as shown in Figure 8, the speech recognition device may include an analysis module 810, an acquisition module 820, a processing module 830, and a recognition module 840. Each module will be described in detail below.

[0152] The analysis module 810 can be configured to perform a first speech quality analysis on the speech segment to be recognized and obtain a first speech quality analysis result. The first speech quality analysis result is the analysis result of whether each speech frame to be recognized in the speech segment to be recognized needs to cooperate with the image frame to be recognized for speech activity detection. The image frame to be recognized corresponds to the speech frame to be recognized.

[0153] The acquisition module 820 can be configured to acquire the image frame to be recognized when the first speech quality analysis result indicates that the speech frame to be recognized needs to be cooperated with the image frame to be recognized for speech activity detection.

[0154] The processing module 830 can be configured to perform speech activity detection on the speech frame to be identified in the speech segment to be identified based on the speech frame to be identified and the image frame to be identified, so as to determine the target audio time node of the speech segment to be identified, and determine the target speech segment based on the target audio time node.

[0155] The recognition module 840 can be configured to perform speech recognition on the speech segment to be recognized based on the target speech segment.
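
The cooperation of modules 810 to 840 might be modeled as in the following sketch. For brevity it makes the image decision once per segment rather than per speech frame, which is a simplification of the description above; the class name, field names, and toy callables are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class SpeechRecognitionDevice:
    analysis_module: Callable[[Any], bool]                  # 810: image frame needed?
    acquisition_module: Callable[[], Any]                   # 820: fetch image frames
    processing_module: Callable[[Any, Optional[Any]], Any]  # 830: VAD, target segment
    recognition_module: Callable[[Any], str]                # 840: speech recognition

    def run(self, speech_segment: Any) -> str:
        needs_image = self.analysis_module(speech_segment)
        # Image frames are acquired only when the analysis result requires them.
        image_frames = self.acquisition_module() if needs_image else None
        target_segment = self.processing_module(speech_segment, image_frames)
        return self.recognition_module(target_segment)

# Toy wiring (assumption): a segment is a dict carrying audio and a "noisy" flag.
device = SpeechRecognitionDevice(
    analysis_module=lambda seg: seg["noisy"],
    acquisition_module=lambda: "image-frames",
    processing_module=lambda seg, img: (seg["audio"], img),
    recognition_module=lambda target: f"text from {target[0]} with {target[1]}",
)
clean_text = device.run({"audio": "a", "noisy": False})
noisy_text = device.run({"audio": "b", "noisy": True})
```

In the clean case the acquisition module is never invoked, which mirrors the goal of reducing image frame processing.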

[0156] In an exemplary embodiment of the present invention, the processing module 830 may perform speech activity detection on the speech frame to be identified in the speech segment to be identified based on the speech frame to be identified and the image frame to be identified in the speech segment to be identified in the following manner: based on the speech frame to be identified and the image frame to be identified, perform speech activity detection on the speech frame to be identified in the speech segment to be identified by a pre-trained speech activity detection model.
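
Once per-frame speech activity decisions are available, the target audio time nodes can be derived from them. The sketch below assumes that the time nodes are the start and end of the detected speech and that frames have a fixed duration in milliseconds; both points are assumptions, and `target_time_nodes` is an illustrative name.

```python
def target_time_nodes(vad_flags, frame_ms):
    """Return (start_ms, end_ms) spanning the first to last active frame, or None."""
    active = [index for index, flag in enumerate(vad_flags) if flag]
    if not active:
        return None
    # The end node is the end of the last active frame, hence the +1.
    return active[0] * frame_ms, (active[-1] + 1) * frame_ms

# Seven 20 ms frames; speech is detected in frames 2, 3, and 5.
nodes = target_time_nodes([0, 0, 1, 1, 0, 1, 0], frame_ms=20)
# nodes == (40, 120)
```

The target speech segment would then be the portion of the speech segment to be recognized that lies between the two returned time nodes.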

[0157] In an exemplary embodiment of the present invention, the analysis module 810 may perform a first speech quality analysis on the speech segment to be recognized in the following manner to obtain a first speech quality analysis result: based on the speech segment to be recognized, determine each speech frame to be recognized of the speech segment to be recognized; perform signal-to-noise ratio (SNR) analysis on the speech frames to be recognized to obtain the SNR analysis result of the speech frames to be recognized; perform frequency domain decomposition on the speech segment to be recognized to obtain multiple speech sub-bands to be recognized, and input the multiple speech sub-bands to be recognized into the corresponding first pre-trained speech quality analysis models to obtain multiple first sub-frequency band analysis results, wherein the first sub-frequency band analysis result is the analysis result of whether each speech frame to be identified in the speech segment to be identified needs to cooperate with the image frame to be identified for speech activity detection in the corresponding frequency domain range; multiple first sub-frequency band analysis results are fused to obtain the first fused sub-frequency band analysis result, wherein the first fused sub-frequency band analysis result is the analysis result of whether each speech frame to be identified in the speech segment to be identified needs to cooperate with the image frame to be identified for speech activity detection in the entire frequency domain range; based on the signal-to-noise ratio analysis result of the speech frame to be identified and the first fused sub-frequency band analysis result, the first speech quality analysis result is obtained.

[0158] In an exemplary embodiment of the present invention, the recognition module 840 may perform speech recognition on the speech segment to be recognized based on the target speech segment in the following manner: perform a second speech quality analysis on the target speech segment to obtain a second speech quality analysis result, wherein the second speech quality analysis result is the analysis result of whether each target speech frame in the target speech segment needs to cooperate with sub-images for speech recognition, and the sub-images are the images corresponding to different human portrait actions in the target image frame corresponding to the target speech frame; and perform speech recognition on the speech segment to be recognized based on the second speech quality analysis result.

[0159] In an exemplary embodiment of the present invention, the recognition module 840 may perform speech recognition on the speech segment to be recognized based on the second speech quality analysis result in the following manner: if the second speech quality analysis result is that the target speech frame requires the cooperation of a sub-image for speech recognition, then the sub-image is obtained; based on the target speech frame and the sub-image, speech recognition is performed on the speech segment to be recognized through a pre-trained speech recognition model.

[0160] In an exemplary embodiment of the present invention, the recognition module 840 may perform speech recognition on the speech segment to be recognized based on the second speech quality analysis result in the following manner: if the second speech quality analysis result is that the target speech frame does not require the cooperation of sub-images for speech recognition, then speech recognition is performed on the speech segment to be recognized based on the target speech frame through a pre-trained speech recognition model.

[0161] In an exemplary embodiment of the present invention, the recognition module 840 may perform a second speech quality analysis on the target speech segment in the following manner to obtain a second speech quality analysis result: based on the target speech segment, determine each target speech frame of the target speech segment; perform signal-to-noise ratio (SNR) analysis on the target speech frames to obtain the SNR analysis result of the target speech frames; perform frequency domain decomposition on the target speech segment to obtain multiple target speech sub-frequency bands, and input the multiple target speech sub-frequency bands into the corresponding second pre-trained speech quality analysis models to obtain multiple second sub-frequency band analysis results, wherein the second sub-frequency band analysis result is the analysis result of whether each target speech frame in the target speech segment needs sub-images to cooperate in speech recognition within the corresponding frequency domain range; fuse the multiple second sub-frequency band analysis results to obtain a second fused sub-frequency band analysis result, wherein the second fused sub-frequency band analysis result is the analysis result of whether each target speech frame in the target speech segment needs sub-images to cooperate in speech recognition within the entire frequency domain range; and obtain the second speech quality analysis result based on the SNR analysis result of the target speech frames and the second fused sub-frequency band analysis result.

[0162] Figure 9 is a schematic diagram of an example physical structure of an electronic device. As shown in Figure 9, the electronic device may include a processor 910, a communications interface 920, a memory 930, and a communication bus 940, wherein the processor 910, the communications interface 920, and the memory 930 communicate with each other via the communication bus 940. The processor 910 can call logical instructions in the memory 930 to execute the speech recognition method described above.

[0163] Furthermore, the logical instructions in the aforementioned memory 930 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, essentially, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0164] In another aspect, the present invention also provides a computer program product, the computer program product including a computer program stored on a non-transitory computer-readable storage medium, and when the computer program is executed by a processor, the computer can execute the speech recognition method provided by the methods described above.

[0165] In another aspect, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the speech recognition method provided by the methods described above.

[0166] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0167] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0168] It is further understood that although the operations are described in a specific order in the accompanying drawings in the embodiments of the present invention, this should not be construed as requiring these operations to be performed in the specific order or serial order shown, or requiring all the operations shown to obtain the desired result. In certain environments, multitasking and parallel processing may be advantageous.
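As a non-limiting illustration of such parallel processing, the per-sub-band analyses are independent of one another and could be run concurrently. The sketch below uses Python's standard `concurrent.futures.ThreadPoolExecutor`; the function name `run_band_models_concurrently` and the model interface (a callable from a list of floats to a list of floats) are hypothetical and are not taken from the embodiments.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def run_band_models_concurrently(
    band_inputs: Sequence[List[float]],
    band_models: Sequence[Callable[[List[float]], List[float]]],
) -> List[List[float]]:
    """Apply model b to band_inputs[b]; results keep the sub-band order."""
    with ThreadPoolExecutor(max_workers=max(1, len(band_models))) as pool:
        # Submit every sub-band at once, then collect results in submission order.
        futures = [pool.submit(m, x) for m, x in zip(band_models, band_inputs)]
        return [f.result() for f in futures]
```

Whether threads, processes, or a batched model call is the better fit depends on the model runtime; this sketch only shows that the order of the results does not depend on the order in which the analyses finish.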

[0169] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A speech recognition method, characterized in that the method includes: performing a first speech quality analysis on the speech segment to be recognized to obtain a first speech quality analysis result, wherein the first speech quality analysis result is the analysis result of whether each speech frame to be recognized in the speech segment to be recognized needs the cooperation of an image frame to be recognized for speech activity detection, and the image frames to be recognized correspond to the speech frames to be recognized; if the first speech quality analysis result indicates that the speech frame to be recognized requires the cooperation of the image frame to be recognized for speech activity detection, acquiring the image frame to be recognized; based on the speech frame to be recognized and the image frame to be recognized, performing speech activity detection on the speech frame to be recognized in the speech segment to be recognized, so as to determine the target audio time node of the speech segment to be recognized, and determining the target speech segment of the speech segment to be recognized based on the target audio time node; and performing speech recognition on the speech segment to be recognized based on the target speech segment, wherein the speech recognition is performed on the speech segment to be recognized based on the speech quality of the target speech segment.

2. The speech recognition method according to claim 1, characterized in that performing speech activity detection on the speech frame to be recognized in the speech segment to be recognized, based on the speech frame to be recognized and the image frame to be recognized, includes: based on the speech frame to be recognized and the image frame to be recognized, performing speech activity detection on the speech frame to be recognized in the speech segment to be recognized through a pre-trained speech activity detection model.

3. The speech recognition method according to claim 1, characterized in that performing the first speech quality analysis on the speech segment to be recognized to obtain the first speech quality analysis result includes: based on the speech segment to be recognized, determining each speech frame to be recognized of the speech segment to be recognized; performing signal-to-noise ratio (SNR) analysis on the speech frames to be recognized to obtain the SNR analysis result of the speech frames to be recognized; performing frequency domain decomposition on the speech segment to be recognized to obtain multiple speech sub-frequency bands to be recognized, and inputting the multiple speech sub-frequency bands to be recognized into the corresponding first pre-trained speech quality analysis models to obtain multiple first sub-frequency band analysis results, wherein the first sub-frequency band analysis result is the analysis result of whether each speech frame to be recognized in the speech segment to be recognized needs the cooperation of the image frame to be recognized for speech activity detection within the corresponding frequency domain range; fusing the multiple first sub-frequency band analysis results to obtain a first fused sub-frequency band analysis result, wherein the first fused sub-frequency band analysis result is the analysis result of whether each speech frame to be recognized in the speech segment to be recognized needs the cooperation of the image frame to be recognized for speech activity detection within the entire frequency domain range; and obtaining the first speech quality analysis result based on the SNR analysis result of the speech frames to be recognized and the first fused sub-frequency band analysis result.

4. The speech recognition method according to claim 1, characterized in that performing speech recognition on the speech segment to be recognized based on the target speech segment includes: performing a second speech quality analysis on the target speech segment to obtain a second speech quality analysis result, wherein the second speech quality analysis result is the analysis result of whether each target speech frame in the target speech segment needs the cooperation of a sub-image for speech recognition, and the sub-image is the image corresponding to different human portrait actions in the target image frame corresponding to the target speech frame; and performing speech recognition on the speech segment to be recognized based on the second speech quality analysis result.

5. The speech recognition method according to claim 4, characterized in that performing speech recognition on the speech segment to be recognized based on the second speech quality analysis result includes: if the second speech quality analysis result indicates that the target speech frame requires the cooperation of a sub-image for speech recognition, acquiring the sub-image; and based on the target speech frame and the sub-image, performing speech recognition on the speech segment to be recognized through a pre-trained speech recognition model.

6. The speech recognition method according to claim 4, characterized in that performing speech recognition on the speech segment to be recognized based on the second speech quality analysis result includes: if the second speech quality analysis result indicates that the target speech frame does not require the cooperation of sub-images for speech recognition, performing speech recognition on the speech segment to be recognized, based on the target speech frame, through a pre-trained speech recognition model.

7. The speech recognition method according to claim 4, characterized in that performing the second speech quality analysis on the target speech segment to obtain the second speech quality analysis result includes: based on the target speech segment, determining each target speech frame of the target speech segment; performing signal-to-noise ratio (SNR) analysis on the target speech frames to obtain the SNR analysis result of the target speech frames; performing frequency domain decomposition on the target speech segment to obtain multiple target speech sub-frequency bands, and inputting the multiple target speech sub-frequency bands into the corresponding second pre-trained speech quality analysis models to obtain multiple second sub-frequency band analysis results, wherein the second sub-frequency band analysis result is the analysis result of whether each target speech frame in the target speech segment needs the cooperation of sub-images for speech recognition within the corresponding frequency domain range; fusing the multiple second sub-frequency band analysis results to obtain a second fused sub-frequency band analysis result, wherein the second fused sub-frequency band analysis result is the analysis result of whether each target speech frame in the target speech segment needs the cooperation of sub-images for speech recognition within the entire frequency domain range; and obtaining the second speech quality analysis result based on the SNR analysis result of the target speech frames and the second fused sub-frequency band analysis result.

8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the speech recognition method as described in any one of claims 1 to 7.

9. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the speech recognition method as described in any one of claims 1 to 7.

10. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the speech recognition method as described in any one of claims 1 to 7.