Speech recognition text enhancement system fused with multi-modal semantic invariance

A speech recognition text enhancement system, applied in speech recognition, speech analysis, and biological neural network models. It addresses three problems of existing systems: inapplicability to Chinese-English mixed speech, high error rates, and the need for large amounts of training data. The stated effects are improved model performance and reduced data dependence.

Active Publication Date: 2021-08-17
INST OF AUTOMATION CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

Because neural networks have such powerful modeling ability, existing end-to-end speech recognition text enhancement systems overfit very easily, require a large amount of training data, and do not exploit the semantic similarity between the acoustic modality and the text modality. As a result, they cannot be applied to Chinese-English mixed speech recognition: model training is difficult and the error rate is high.



Examples


Embodiment

[0062] As shown in Figure 1, a speech recognition text enhancement system fused with multi-modal semantic invariance includes:

[0063] An acoustic feature extraction module, an acoustic downsampling module, an encoder, and a decoder. The acoustic feature extraction module performs framing on the speech data, dividing it into short-time audio frames of fixed length, and extracts fbank acoustic features from those frames. The acoustic features are input to the acoustic downsampling module for downsampling to obtain an acoustic representation. The speech data is also input to an existing speech recognition module to obtain input text data, and the input text data is input to the encoder to obtain an input text encoded representation. The acoustic representation and the input text encoded representation are input to the decoder for fusion to obtain a decoded representation, and the decoded representation is input to a softmax function to obtain the target with the highest probability.
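The framing, downsampling, and softmax steps in paragraph [0063] can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the averaging-based downsampling, and the frame and hop lengths in the usage note are assumptions made for illustration, and fbank extraction and the encoder/decoder networks are omitted.

```python
import math

def frame_signal(samples, frame_len, hop_len):
    """Split a 1-D list of samples into fixed-length short-time frames.

    Frames overlap when hop_len < frame_len; a trailing partial frame is dropped.
    """
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames

def downsample(features, factor):
    """Reduce time resolution by averaging each group of `factor` feature vectors.

    Assumption: the source does not say how downsampling is done; averaging
    is used here only as a simple stand-in.
    """
    out = []
    for i in range(0, len(features) - factor + 1, factor):
        group = features[i:i + factor]
        dim = len(group[0])
        out.append([sum(vec[d] for vec in group) / factor for d in range(dim)])
    return out

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def most_probable(logits):
    """Return the index of the target with the highest softmax probability."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)
```

As a usage example under assumed settings, `frame_signal(samples, 400, 160)` would correspond to 25 ms frames with a 10 ms hop for 16 kHz audio; neither value is stated in the source.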

...


Abstract

The invention provides a speech recognition text enhancement system fused with multi-modal semantic invariance. The system comprises an acoustic feature extraction module, an acoustic downsampling module, an encoder, and a decoder fused with multi-modal semantic invariance. It operates as follows. The acoustic feature extraction module performs framing on the speech data, dividing it into short-time audio frames of fixed length, extracts acoustic features from those frames, and inputs the acoustic features to the acoustic downsampling module for downsampling to obtain an acoustic representation. The speech data is input to an existing speech recognition module to obtain input text data, and the input text data is input to the encoder to obtain an input text encoded representation. The acoustic representation and the input text encoded representation are input to the decoder for fusion, and a similarity constraint is applied between the representation of the acoustic modality and the representation of the text modality to obtain a decoded representation. By incorporating the cross-modal semantic invariance constraint loss, the method reduces the model's dependence on data, improves model performance, and is suitable for Chinese-English mixed speech recognition.
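One way to picture the similarity constraint between the acoustic and text representations is as an auxiliary loss added to the recognition loss. The sketch below is an assumption throughout: the abstract does not give the form of the constraint, so the mean pooling, the cosine-distance loss, the `weight` parameter, and all function names are hypothetical choices for illustration only.

```python
import math

def mean_pool(sequence):
    """Average a sequence of equal-length vectors into a single vector."""
    dim = len(sequence[0])
    return [sum(vec[d] for vec in sequence) / len(sequence) for d in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (small epsilon avoids division by zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)

def invariance_loss(acoustic_seq, text_seq):
    """Hypothetical cross-modal constraint: 1 - cosine similarity of pooled representations.

    Near 0 when the two modalities point the same way, 1 when they are orthogonal.
    """
    return 1.0 - cosine_similarity(mean_pool(acoustic_seq), mean_pool(text_seq))

def total_loss(recognition_loss, constraint_loss, weight=0.1):
    """Combine the recognition loss with the weighted constraint loss.

    `weight` is an assumed hyperparameter, not a value from the source.
    """
    return recognition_loss + weight * constraint_loss
```

Pooling before comparison is used here because the acoustic and text sequences generally differ in length; the source may align the two modalities differently.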

Description

Technical field

[0001] This application relates to the field of text enhancement for Chinese-English mixed speech recognition, and in particular to a speech recognition text enhancement system fused with multi-modal semantic invariance.

Background

[0002] Chinese-English mixing refers to switching languages while speaking. It takes two main forms: inter-sentential switching and intra-sentential switching. This phenomenon poses a major challenge to speech recognition technology. The main problems are accents caused by speakers' non-standard pronunciation, a larger and more complex set of modeling units, coarticulation across languages, difficulty in collecting data, and difficulty in labeling data. With the development of deep learning, especially end-to-end models, monolingual speech recognition has improved greatly. However, the end-to-end model can only use speech-text ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G10L15/00, G10L15/02, G10L15/04, G10L15/16, G10L15/26, G06N3/04
CPC: G10L15/005, G10L15/02, G10L15/04, G10L15/16, G06N3/045, G10L15/32, G10L15/063, G10L15/1815
Inventor: 陶建华, 张帅, 易江燕
Owner: INST OF AUTOMATION CHINESE ACAD OF SCI