Speech recognition text enhancement system fused with multi-modal semantic invariance

A speech recognition text enhancement system, applied in speech recognition, speech analysis, and biological neural network models, that addresses problems such as inapplicability to Chinese-English mixed speech, high error rates, and the need for large amounts of training data, so as to improve performance and reduce data dependence.

Active Publication Date: 2021-08-17

INST OF AUTOMATION CHINESE ACAD OF SCI

Cites: 18; Cited by: 5

AI Technical Summary

Problems solved by technology

Because of the strong modeling capacity of neural networks, existing end-to-end speech recognition text enhancement systems overfit very easily, require large amounts of training data, and do not exploit the semantic similarity between the acoustic modality and the text modality.

They cannot be applied to Chinese-English mixed speech recognition, model training is difficult, and the error rate is high.

Embodiment

[0062] As shown in Figure 1, a speech recognition text enhancement system fused with multi-modal semantic invariance includes:

[0063] An acoustic feature extraction module, an acoustic downsampling module, an encoder, and a decoder. The acoustic feature extraction module frames the speech data, dividing it into short-time audio frames of fixed length, and extracts fbank acoustic features from those frames. The acoustic features are input to the acoustic downsampling module, which downsamples them to obtain an acoustic representation. The speech data is also input to an existing speech recognition module to obtain input text data, and the input text data is input to the encoder to obtain an input text encoded representation. The acoustic representation and the input text encoded representation are input to the decoder and fused to obtain a decoded representation, which is passed to a softmax function to obtain the target with the highest probability.
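The framing, downsampling, and softmax steps in paragraph [0063] can be illustrated with a minimal sketch. This is not the patented implementation: the function names, the frame length and hop, and the use of group averaging for downsampling are all assumptions made for illustration (the excerpt does not state how the downsampling module is built), and fbank extraction, the encoder, and the decoder are omitted.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a sample list into fixed-length frames taken every `hop` samples.

    frame_len and hop are assumed parameters; the excerpt only says the
    frames have a fixed length.
    """
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

def downsample(features, factor):
    """Reduce the frame rate by averaging each group of `factor` consecutive
    feature vectors. Averaging is an assumed stand-in for the patent's
    acoustic downsampling module."""
    out = []
    for i in range(0, len(features) - factor + 1, factor):
        group = features[i:i + factor]
        out.append([sum(col) / factor for col in zip(*group)])
    return out

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def most_probable(logits):
    """Index of the target with the highest softmax probability."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)
```

For example, `frame_signal(list(range(10)), 4, 2)` yields four overlapping frames of four samples each, and `most_probable` applied to one decoder output vector returns the index of the highest-probability target.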

...

Abstract

The invention provides a speech recognition text enhancement system fused with multi-modal semantic invariance. The system comprises an acoustic feature extraction module, an acoustic downsampling module, an encoder, and a decoder fused with multi-modal semantic invariance. The method comprises the following steps: the acoustic feature extraction module frames the speech data, dividing it into short-time audio frames of fixed length, extracts acoustic features from the short-time audio frames, and inputs the acoustic features into the acoustic downsampling module for downsampling to obtain an acoustic representation; the speech data is input into an existing speech recognition module to obtain input text data, and the input text data is input into the encoder to obtain an input text encoded representation; and the acoustic representation and the input text encoded representation are input into the decoder for fusion, with a similarity constraint applied between the representation of the acoustic modality and the representation of the text modality, to obtain a decoded representation. By incorporating a cross-modal semantic invariance constraint loss, the method reduces the model's dependence on data, improves the model's performance, and is suitable for Chinese-English mixed speech recognition.
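One way to picture the similarity constraint described in the abstract is as an extra loss term that penalizes disagreement between the acoustic and text representations. The sketch below is a hypothetical illustration only: the abstract does not give the form of the constraint, so the mean pooling, the cosine-based penalty, the `weight` value, and all function names are assumptions.

```python
import math

def mean_pool(vectors):
    """Average a sequence of vectors into one vector, so that acoustic and
    text sequences of different lengths can be compared (assumed pooling)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal dimension."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def invariance_loss(acoustic_repr, text_repr):
    """Assumed penalty: 0 when the pooled representations point in the same
    direction, growing as they diverge."""
    return 1.0 - cosine_similarity(mean_pool(acoustic_repr), mean_pool(text_repr))

def total_loss(recognition_loss, acoustic_repr, text_repr, weight=0.1):
    """Add the weighted constraint to the main training loss.
    `weight` is a hypothetical hyperparameter."""
    return recognition_loss + weight * invariance_loss(acoustic_repr, text_repr)
```

Pooling each sequence first is a design choice made here so that frame-level acoustic vectors and token-level text vectors can be compared without aligning them; the patent may constrain the two modalities differently.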

Description

Technical field

[0001] This application relates to the field of text enhancement for Chinese-English mixed speech recognition, and in particular to a speech recognition text enhancement system fused with multi-modal semantic invariance.

Background

[0002] Chinese-English mixing refers to switching languages while speaking. It mainly takes two forms: inter-sentence switching and intra-sentence switching. This phenomenon poses a huge challenge to speech recognition technology. The main problems are accents caused by speakers' non-standard pronunciation, a larger and more complex set of modeling units, coarticulation across languages, difficulties in data collection, and difficulties in data labeling. With the development of deep learning technology, especially end-to-end models, monolingual speech recognition technology has been greatly improved. However, the end-to-end model can only use speech-text ...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G10L15/00, G10L15/02, G10L15/04, G10L15/16, G10L15/26, G06N3/04

CPC: G10L15/005, G10L15/02, G10L15/04, G10L15/16, G06N3/045, G10L15/32, G10L15/063, G10L15/1815

Inventor: 陶建华, 张帅, 易江燕

Owner: INST OF AUTOMATION CHINESE ACAD OF SCI
