Voice recognition method and system based on triggered non-autoregressive model

A speech recognition technology based on a non-autoregressive model, applied in speech recognition, speech analysis, neural learning methods, etc.; it solves problems such as decoding efficiency being impaired and hard to accelerate, and achieves the effects of improving decoding speed, improving accuracy, and avoiding timing dependence.

Active Publication Date: 2020-12-04
中科极限元(杭州)智能科技股份有限公司

AI Technical Summary

Problems solved by technology

Autoregressive decoding relies on tokens generated in the past. This timing dependence seriously affects decoding efficiency and is difficult to accelerate through GPU parallel computing, so autoregressive models face certain limitations when deployed in scenarios with high real-time requirements.



Examples


Embodiment 1

[0056] A streaming end-to-end speech recognition model and its training method. The model, based on a self-attention transformer network, includes an acoustic encoder based on the self-attention mechanism and a decoder based on the self-attention mechanism, as shown in Figures 1-4, and the training includes the following steps:

[0057] Step 1, acquiring the voice training data and the corresponding text annotation training data, and extracting a series of features of the voice training data to form a voice feature sequence;

[0058] The goal of speech recognition is to convert continuous speech signals into text sequences. During recognition, the time-domain waveform signal is windowed and framed, a discrete Fourier transform is applied, and the coefficients of specific frequency components are extracted to form a feature vector. This series of feature vectors constitutes the speech feature sequence, and the speech features are Mel Frequency Cepstral Coefficients (MFCC) or Mel Filter...
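As a rough illustration of this feature-extraction step, here is a minimal Python sketch assuming torchaudio and 16 kHz mono input; the window length, frame shift, and filter counts are common defaults, not values specified in the patent:

```python
# Minimal feature-extraction sketch (assumes torchaudio, 16 kHz mono input).
# Window/hop sizes below are illustrative defaults, not values from the patent.
import torch
import torchaudio

def extract_features(wav_path: str, feature_type: str = "fbank") -> torch.Tensor:
    waveform, sample_rate = torchaudio.load(wav_path)
    if feature_type == "mfcc":
        transform = torchaudio.transforms.MFCC(sample_rate=sample_rate, n_mfcc=13)
        feats = transform(waveform)                         # (1, n_mfcc, n_frames)
    else:
        transform = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate,
            n_fft=400, hop_length=160,                      # 25 ms window, 10 ms shift
            n_mels=80)
        feats = transform(waveform).clamp(min=1e-10).log()  # log Mel filter bank (FBANK)
    return feats.squeeze(0).transpose(0, 1)                 # (n_frames, n_feats)
```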

Embodiment 2

[0084] As shown in Figure 5, a decoding method for a streaming end-to-end speech recognition model.

[0085] Decoding step 1, read the voice file from the file path and submit it to the processor;

[0086] The processor can be a smartphone, cloud server or other embedded device.

[0087] Decoding step 2, extracting features from the input speech to obtain a speech feature sequence;

[0088] The speech features are Mel Frequency Cepstral Coefficients (MFCC) or Mel Filter Bank Coefficients (FBANK), and the feature processing method is consistent with the training process.

[0089] Decoding step 3, the speech feature sequence is sequentially passed through the convolution downsampling module and the encoder to calculate the encoding state sequence;
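A hedged sketch of this step, assuming PyTorch; the two stride-2 convolutions (4x temporal downsampling) and all layer sizes are illustrative assumptions, not dimensions taken from the patent:

```python
# Sketch of decoding step 3: convolution downsampling, then a self-attention
# encoder produces the encoding state sequence. Sizes are assumed, not from the patent.
import torch
import torch.nn as nn

class ConvDownsample(nn.Module):
    def __init__(self, feat_dim: int = 80, d_model: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # After two stride-2 convs, the feature axis shrinks by roughly 4x.
        self.out = nn.Linear(d_model * ((feat_dim + 3) // 4), d_model)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, n_frames, feat_dim)
        x = self.conv(feats.unsqueeze(1))              # (batch, d_model, T/4, F/4)
        b, c, t, f = x.shape
        return self.out(x.permute(0, 2, 1, 3).reshape(b, t, c * f))

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=6)

feats = torch.randn(1, 200, 80)                        # dummy speech feature sequence
enc_states = encoder(ConvDownsample()(feats))          # encoding state sequence
```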

[0090] Decoding step 4, pass the encoding state sequence through the linear transformation of the CTC part and calculate the probability distribution of the tokens, further obtaining the probability that each position of the codi...
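A sketch of the CTC linear transformation and peak ("trigger") extraction, assuming PyTorch and the enc_states tensor from the previous snippet; vocab_size and the blank index are assumed values. Peaks are taken here as non-blank frames on the greedy CTC path that differ from the previous frame; their count fixes the output length for the non-autoregressive decoder:

```python
# Sketch of decoding step 4 (assumed shapes; continues the previous snippet).
import torch
import torch.nn as nn

vocab_size, blank_id = 4000, 0
ctc_head = nn.Linear(256, vocab_size)              # linear transformation of the CTC part

log_probs = ctc_head(enc_states).log_softmax(-1)   # (batch, T, vocab) token distribution
best = log_probs.argmax(-1)                        # greedy CTC path

# Peaks: frames whose best token is not blank and differs from the previous frame.
path = best[0]
is_peak = (path != blank_id) & torch.cat(
    [torch.tensor([True]), path[1:] != path[:-1]])
peak_positions = is_peak.nonzero(as_tuple=True)[0]
print(len(peak_positions), "triggered positions")
```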



Abstract

The invention discloses a voice recognition method and system based on a triggered non-autoregressive model. The method comprises the following steps: S11, extracting an acoustic feature sequence; S12, generating a convolution downsampling sequence; S13, generating an acoustic coding state sequence; S14, calculating the probability distribution of the predicted tokens and the connectionist temporal classification (CTC) loss; S15, calculating the positions and number of the CTC peaks; S16, calculating the cross-entropy loss with an acoustic decoder; S17, calculating a gradient according to the joint loss of the CTC loss and the cross-entropy loss, and performing back propagation; and S18, repeating steps S12 to S17 until training is complete. The system comprises an acoustic feature sequence extraction module, a convolution downsampling module, an acoustic encoder, a connectionist temporal classification module, an acoustic decoder and a joint loss calculation module which are connected in sequence. The connectionist temporal classification module comprises a linear transformation module, a CTC loss calculation module and a peak extraction module.
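A minimal sketch of the joint-loss training step (S14, S16, S17), assuming PyTorch; the interpolation weight lambda_ctc, the tensor shapes, and the helper names are illustrative assumptions, not values from the patent:

```python
# Sketch of one training step with joint CTC + cross-entropy loss (assumed shapes).
import torch
import torch.nn as nn
import torch.nn.functional as F

ctc_loss_fn = nn.CTCLoss(blank=0, zero_infinity=True)
lambda_ctc = 0.3                                   # assumed interpolation weight

def training_step(enc_states, dec_logits, targets, enc_lens, tgt_lens,
                  optimizer, ctc_head):
    # S14: token probability distribution and CTC loss.
    # CTC reads only the first tgt_lens[i] tokens of each target row.
    log_probs = ctc_head(enc_states).log_softmax(-1).transpose(0, 1)  # (T, B, V)
    loss_ctc = ctc_loss_fn(log_probs, targets, enc_lens, tgt_lens)
    # S16: cross-entropy loss of the acoustic decoder (padding marked with -1).
    loss_ce = F.cross_entropy(dec_logits.reshape(-1, dec_logits.size(-1)),
                              targets.reshape(-1), ignore_index=-1)
    # S17: joint loss, gradient, back propagation.
    loss = lambda_ctc * loss_ctc + (1 - lambda_ctc) * loss_ce
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```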

Description

Technical field

[0001] The invention relates to the technical field of electronic signal processing, and in particular to a speech recognition method and system based on a triggered non-autoregressive model.

Background technique

[0002] As the entrance to human-computer interaction, speech recognition is an important research direction in the field of artificial intelligence. End-to-end speech recognition discards the pronunciation dictionary, language model and decoding network that hybrid speech recognition models rely on, and realizes the direct conversion of audio feature sequences into text sequences. As a representative sequence-to-sequence model, the Speech-Transformer has strong sequence modeling capabilities. The model takes the entire utterance as input and encodes the input speech into a high-level feature representation through the encoder; the decoder starts from the start symbol and, conditioned on the encoder output, gradually predicts the corresponding text ...
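For contrast with the triggered non-autoregressive approach, here is a sketch of the step-by-step autoregressive decoding loop the background describes, assuming a PyTorch decoder with the hypothetical signature shown; this serial dependence across output positions is what the invention avoids:

```python
# Greedy autoregressive decoding sketch; `decoder` is a hypothetical callable
# mapping (tokens, enc_states) -> (1, len, vocab) logits.
import torch

def autoregressive_decode(decoder, enc_states, sos_id=1, eos_id=2, max_len=100):
    tokens = torch.tensor([[sos_id]])
    for _ in range(max_len):
        logits = decoder(tokens, enc_states)
        next_token = logits[:, -1].argmax(-1, keepdim=True)   # depends on all past tokens
        tokens = torch.cat([tokens, next_token], dim=1)
        if next_token.item() == eos_id:
            break
    return tokens[0, 1:]                                      # drop the start symbol
```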


Application Information

IPC(8): G10L15/26; G10L15/02; G10L15/06; G10L25/27; G06N3/04; G06N3/08
CPC: G10L15/26; G10L15/02; G10L15/063; G10L25/27; G06N3/084; G06N3/044; G06N3/045
Inventor: 田正坤, 温正棋
Owner: 中科极限元(杭州)智能科技股份有限公司