
Encoder and local generative attention mechanism-based end-to-end speech recognition system adopting same

An attention and encoder technology, applied in speech recognition, speech analysis, and instruments, that addresses problems such as increased training time, wasted memory, and an increased attention-weight error rate, achieving a good recognition rate, fewer multiplication operations, and reduced computational complexity.

Status: Inactive | Publication Date: 2021-09-17
Assignee: NORTHWESTERN POLYTECHNICAL UNIV (+1)
  • Summary
  • Abstract
  • Description
  • Claims
  • Application Information

AI Technical Summary

Problems solved by technology

[0004] However, in speech recognition, directly replacing SA with DSA leads to three problems.
First, the length of the attention weights predicted by DSA is fixed: if we apply DSA directly to ASR, every recording must be padded to the length of the longest recording in the training corpus, which increases training time and wastes memory.
Second, feature sequences in ASR are much longer than those in language modeling, and directly predicting attention weights over such long sequences significantly increases the error rate.
Third, like SA, DSA still lacks the ability to extract fine-grained feature patterns.
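
For context, a minimal PyTorch sketch of dense synthesizer attention (DSA), assuming the standard two-layer feed-forward formulation; the class name, dimensions, and ReLU activation are illustrative assumptions. It makes the first problem concrete: the width of the weight projection is fixed when the layer is built, so every utterance must be padded to n_max.

```python
import torch
import torch.nn as nn

class DenseSynthesizerAttention(nn.Module):
    """Sketch of DSA: attention weights are *generated* from the input
    by a feed-forward network instead of being computed as query-key
    dot products. n_max is fixed at construction time, which is why
    inputs must be padded to the longest utterance in the corpus."""

    def __init__(self, d_model: int, n_max: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_model)
        self.w2 = nn.Linear(d_model, n_max)    # one weight per padded frame
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_max, d_model), already padded to n_max
        logits = self.w2(torch.relu(self.w1(x)))   # (batch, n_max, n_max)
        attn = torch.softmax(logits, dim=-1)
        return attn @ self.value(x)                # (batch, n_max, d_model)
```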

Method used



Examples


Embodiment

[0070] (1) Model structure:

[0071] The SA-Transformer baseline model is an improved Transformer speech recognition model consisting of an encoder and a decoder. The encoder comprises a convolutional front end and 12 identical encoder sub-blocks, each containing an SA layer, a convolutional layer, and a feed-forward fully connected layer. The convolutional front end stacks two 3×3 convolutional layers, with the stride in both the time and frequency dimensions set to 2 to downsample the input features. The decoder consists of a word embedding layer and 6 identical decoder sub-blocks. In addition to the feed-forward fully connected layer, each decoder sub-block contains two multi-head SA layers: one over the embedded representation of the label sequence and one over the encoder output.
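
As a rough illustration, a minimal PyTorch sketch of the convolutional front end described above; only the two stacked 3×3 kernels and the stride of 2 in both dimensions come from the text, while the channel width, padding, and ReLU activations are assumptions. Each stride-2 layer halves the time and frequency axes, so the encoder sees a sequence four times shorter than the input.

```python
import torch
import torch.nn as nn

class ConvFrontEnd(nn.Module):
    """Two stacked 3x3 convolutions, stride 2 in time and frequency,
    downsampling the input 4x along each axis before the encoder
    sub-blocks. Channel count and activation are assumed."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, freq) filterbank features
        x = self.conv(feats.unsqueeze(1))          # (batch, d_model, T/4, F/4)
        b, c, t, f = x.shape
        # flatten channels and frequency into one model dimension per frame
        return x.permute(0, 2, 1, 3).reshape(b, t, c * f)
```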

[0072] The LDSA-Transformer has the same decoder as the baseline model; only the self-attention mechanism in the SA-Transformer encoder is replaced with LDSA. The...
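
A minimal sketch, assuming one plausible implementation of the local DSA described in the abstract: each frame generates weights only for a window of frames centred on itself, so the weight projection has a fixed, utterance-independent width and no padding to a global maximum is needed. The window size and the unfold-based neighbour gathering are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalDSA(nn.Module):
    """Local DSA sketch: each frame generates weights only for the w
    frames centred on it, so the weight projection has a fixed output
    size w and the attention cost drops from O(T^2 d) to O(T w d)."""

    def __init__(self, d_model: int, window: int = 15):
        super().__init__()
        assert window % 2 == 1, "odd window so it centres on the frame"
        self.window = window
        self.w1 = nn.Linear(d_model, d_model)
        self.w2 = nn.Linear(d_model, window)   # one weight per window slot
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, d_model); T can vary per utterance
        attn = torch.softmax(self.w2(torch.relu(self.w1(x))), dim=-1)  # (b, T, w)
        v = self.value(x).transpose(1, 2)                              # (b, d, T)
        pad = self.window // 2
        # gather the w neighbours of every frame: (b, d, T, w)
        v = F.pad(v, (pad, pad)).unfold(dimension=2, size=self.window, step=1)
        return torch.einsum('btw,bdtw->btd', attn, v)
```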



Abstract

The invention relates to an encoder and a local generative attention mechanism-based end-to-end speech recognition system adopting the same, and belongs to the field of end-to-end speech recognition. A low-complexity generative attention calculation replaces the dot-product attention mechanism, reducing computational complexity while improving recognition accuracy, and a DSA-based speech recognition model is proposed to reduce computational complexity. A local DSA (LDSA) is further provided, which limits the attention range of the DSA to several frames around the current speech frame. The method combines LDSA with SA so that the model can extract local and global information at the same time. Experimental results on the AISHELL-1 Mandarin speech recognition corpus show that the proposed LDSA-Transformer achieves a character error rate of 6.49%. Compared with the SA-Transformer, the LDSA-Transformer achieves higher accuracy with lower computational complexity. With roughly the same number of parameters and computational complexity as the SA-Transformer, the proposed combined attention method achieves noticeably better accuracy.
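
To make the complexity claim concrete, a back-of-the-envelope count of the multiplications in the attention core of one layer; the sequence length, model dimension, and window size below are illustrative assumptions, not values from the patent, and the linear projections (which cost O(T·d²) in both models) are omitted.

```python
# Rough multiply counts for the attention core of one layer.
# T = frames after downsampling, d = model dim, w = local window (all assumed).
T, d, w = 500, 256, 15

sa_core = 2 * T * T * d    # QK^T scores + weighted sum over all T frames
ldsa_core = 2 * T * w * d  # weight generation over w slots + local weighted sum

print(f"SA core:   {sa_core:,} multiplications")    # 128,000,000
print(f"LDSA core: {ldsa_core:,} multiplications")  #   3,840,000
```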

Description

Technical field

[0001] The invention belongs to the field of speech recognition, and in particular relates to an encoder and an end-to-end speech recognition system based on a local generative attention mechanism that uses the encoder.

Background technique

[0002] Speech recognition (Automatic Speech Recognition, ASR) refers to the conversion of speech signals into text and is a key link in speech interaction technology. In recent years, end-to-end (E2E) ASR has been widely studied due to its simple model structure and straightforward training process. Among existing end-to-end methods, Connectionist Temporal Classification (CTC) and the Recurrent Neural Network Transducer (RNN-T) have large parameter counts and low recognition accuracy, and in offline recognition they have gradually been replaced by the attention-based encoder-decoder (AED) model. In the attention-based en...

Claims


Application Information

Patent Type & Authority: Application (China)
IPC (IPC8): G10L15/02; G10L15/06; G10L15/16; G10L15/183; G10L15/26; G10L19/00
CPC: G10L15/02; G10L15/063; G10L15/16; G10L15/183; G10L15/26; G10L19/0018
Inventors: 张晓雷, 徐梦龙, 姚嘉迪
Owner: NORTHWESTERN POLYTECHNICAL UNIV