Voice recognition method and system based on triggered non-autoregressive model

A speech recognition technology based on a non-autoregressive model, applied in speech recognition, speech analysis, neural learning methods, etc.; it solves problems such as decoding efficiency being impaired and hard to accelerate, and achieves the effects of improving decoding speed, improving accuracy, and avoiding timing dependence.

Active Publication Date: 2020-12-04
中科极限元(杭州)智能科技股份有限公司

AI Technical Summary

Problems solved by technology

Autoregressive decoding relies on tokens generated in the past. This timing dependence seriously affects decoding efficiency and is difficult to accelerate through GPU parallel computing, so autoregressive models face certain limitations when deployed in scenarios with high real-time requirements.



Examples


Embodiment 1

[0056] A streaming end-to-end speech recognition model and its training method. The model, based on a self-attention transformer network, includes an acoustic encoder based on the self-attention mechanism and a decoder based on the self-attention mechanism, as shown in Figures 1-4, and the training includes the following steps:

[0057] Step 1, acquiring the voice training data and the corresponding text annotation training data, and extracting a series of features of the voice training data to form a voice feature sequence;

[0058] The goal of speech recognition is to convert continuous speech signals into text sequences. During recognition, the time-domain waveform signal is windowed and framed, a discrete Fourier transform is applied, and the coefficients of specific frequency components are extracted to form a feature vector. This series of feature vectors constitutes the speech feature sequence, and the speech features are Mel Frequency Cepstral Coefficients (MFCC) or Mel Filter...
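As a rough illustration of this feature-extraction step, here is a minimal Python sketch assuming torchaudio and 16 kHz mono input; the window length, frame shift, and filter counts are common defaults, not values specified in the patent:

```python
# Minimal feature-extraction sketch (assumes torchaudio, 16 kHz mono input).
# Window/hop sizes below are illustrative defaults, not values from the patent.
import torch
import torchaudio

def extract_features(wav_path: str, feature_type: str = "fbank") -> torch.Tensor:
    waveform, sample_rate = torchaudio.load(wav_path)
    if feature_type == "mfcc":
        transform = torchaudio.transforms.MFCC(sample_rate=sample_rate, n_mfcc=13)
        feats = transform(waveform)                         # (1, n_mfcc, n_frames)
    else:
        transform = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate,
            n_fft=400, hop_length=160,                      # 25 ms window, 10 ms shift
            n_mels=80)
        feats = transform(waveform).clamp(min=1e-10).log()  # log Mel filter bank (FBANK)
    return feats.squeeze(0).transpose(0, 1)                 # (n_frames, n_feats)
```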

Embodiment 2

[0084] As shown in Figure 5, a decoding method for a streaming end-to-end speech recognition model.

[0085] Decoding step 1, read the voice file from the file path and submit it to the processor;

[0086] The processor can be a smartphone, cloud server or other embedded device.

[0087] Decoding step 2, extracting features from the input speech to obtain a speech feature sequence;

[0088] The speech features are Mel Frequency Cepstral Coefficients (MFCC) or Mel Filter Bank Coefficients (FBANK), and the feature processing method is consistent with the training process.

[0089] Decoding step 3, the speech feature sequence is sequentially passed through the convolution downsampling module and the encoder to calculate the encoding state sequence;
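A hedged sketch of this step, assuming PyTorch; the two stride-2 convolutions (4x temporal downsampling) and all layer sizes are illustrative assumptions, not dimensions taken from the patent:

```python
# Sketch of decoding step 3: convolution downsampling, then a self-attention
# encoder produces the encoding state sequence. Sizes are assumed, not from the patent.
import torch
import torch.nn as nn

class ConvDownsample(nn.Module):
    def __init__(self, feat_dim: int = 80, d_model: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # After two stride-2 convs, the feature axis shrinks by roughly 4x.
        self.out = nn.Linear(d_model * ((feat_dim + 3) // 4), d_model)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, n_frames, feat_dim)
        x = self.conv(feats.unsqueeze(1))              # (batch, d_model, T/4, F/4)
        b, c, t, f = x.shape
        return self.out(x.permute(0, 2, 1, 3).reshape(b, t, c * f))

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=6)

feats = torch.randn(1, 200, 80)                        # dummy speech feature sequence
enc_states = encoder(ConvDownsample()(feats))          # encoding state sequence
```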

[0090] Decoding step 4, pass the encoding state sequence through the linear transformation of the CTC part and calculate the probability distribution of the tokens, further obtaining the probability that each position of the codi...
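A sketch of the CTC linear transformation and peak ("trigger") extraction, assuming PyTorch and the enc_states tensor from the previous snippet; vocab_size and the blank index are assumed values. Peaks are taken here as non-blank frames on the greedy CTC path that differ from the previous frame; their count fixes the output length for the non-autoregressive decoder:

```python
# Sketch of decoding step 4 (assumed shapes; continues the previous snippet).
import torch
import torch.nn as nn

vocab_size, blank_id = 4000, 0
ctc_head = nn.Linear(256, vocab_size)              # linear transformation of the CTC part

log_probs = ctc_head(enc_states).log_softmax(-1)   # (batch, T, vocab) token distribution
best = log_probs.argmax(-1)                        # greedy CTC path

# Peaks: frames whose best token is not blank and differs from the previous frame.
path = best[0]
is_peak = (path != blank_id) & torch.cat(
    [torch.tensor([True]), path[1:] != path[:-1]])
peak_positions = is_peak.nonzero(as_tuple=True)[0]
print(len(peak_positions), "triggered positions")
```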



Abstract

The invention discloses a voice recognition method and system based on a triggered non-autoregressive model. The method comprises the following steps: S11, extracting an acoustic feature sequence; S12, generating a convolution downsampling sequence; S13, generating an acoustic coding state sequence; S14, calculating the probability distribution of the predicted tokens and the connectionist temporal classification (CTC) loss; S15, calculating the positions and number of the CTC peaks; S16, calculating the cross-entropy loss with an acoustic decoder; S17, calculating a gradient according to the joint loss of the CTC loss and the cross-entropy loss, and performing back propagation; and S18, repeating steps S12 to S17 until training is complete. The system comprises an acoustic feature sequence extraction module, a convolution downsampling module, an acoustic encoder, a connectionist temporal classification module, an acoustic decoder and a joint loss calculation module which are connected in sequence. The connectionist temporal classification module comprises a linear transformation module, a CTC loss calculation module and a peak extraction module.
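A minimal sketch of the joint-loss training step (S14, S16, S17), assuming PyTorch; the interpolation weight lambda_ctc, the tensor shapes, and the helper names are illustrative assumptions, not values from the patent:

```python
# Sketch of one training step with joint CTC + cross-entropy loss (assumed shapes).
import torch
import torch.nn as nn
import torch.nn.functional as F

ctc_loss_fn = nn.CTCLoss(blank=0, zero_infinity=True)
lambda_ctc = 0.3                                   # assumed interpolation weight

def training_step(enc_states, dec_logits, targets, enc_lens, tgt_lens,
                  optimizer, ctc_head):
    # S14: token probability distribution and CTC loss.
    # CTC reads only the first tgt_lens[i] tokens of each target row.
    log_probs = ctc_head(enc_states).log_softmax(-1).transpose(0, 1)  # (T, B, V)
    loss_ctc = ctc_loss_fn(log_probs, targets, enc_lens, tgt_lens)
    # S16: cross-entropy loss of the acoustic decoder (padding marked with -1).
    loss_ce = F.cross_entropy(dec_logits.reshape(-1, dec_logits.size(-1)),
                              targets.reshape(-1), ignore_index=-1)
    # S17: joint loss, gradient, back propagation.
    loss = lambda_ctc * loss_ctc + (1 - lambda_ctc) * loss_ce
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```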

Description

Technical field

[0001] The invention relates to the technical field of electronic signal processing, and in particular to a speech recognition method and system based on a triggered non-autoregressive model.

Background technique

[0002] As the entrance to human-computer interaction, speech recognition is an important research direction in the field of artificial intelligence. End-to-end speech recognition discards the pronunciation dictionary, language model and decoding network that hybrid speech recognition models rely on, and realizes the direct conversion of audio feature sequences into text sequences. As a representative sequence-to-sequence model, the Speech-Transformer has strong sequence modeling capabilities. The model takes the entire utterance as input and encodes the input speech into a high-level feature representation through the encoder; the decoder starts from the start symbol and, conditioned on the encoder output, gradually predicts the corresponding text ...
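For contrast with the triggered non-autoregressive approach, here is a sketch of the step-by-step autoregressive decoding loop the background describes, assuming a PyTorch decoder with the hypothetical signature shown; this serial dependence across output positions is what the invention avoids:

```python
# Greedy autoregressive decoding sketch; `decoder` is a hypothetical callable
# mapping (tokens, enc_states) -> (1, len, vocab) logits.
import torch

def autoregressive_decode(decoder, enc_states, sos_id=1, eos_id=2, max_len=100):
    tokens = torch.tensor([[sos_id]])
    for _ in range(max_len):
        logits = decoder(tokens, enc_states)
        next_token = logits[:, -1].argmax(-1, keepdim=True)   # depends on all past tokens
        tokens = torch.cat([tokens, next_token], dim=1)
        if next_token.item() == eos_id:
            break
    return tokens[0, 1:]                                      # drop the start symbol
```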


Application Information

IPC(8): G10L15/26; G10L15/02; G10L15/06; G10L25/27; G06N3/04; G06N3/08
CPC: G10L15/26; G10L15/02; G10L15/063; G10L25/27; G06N3/084; G06N3/044; G06N3/045
Inventor: 田正坤, 温正棋
Owner: 中科极限元(杭州)智能科技股份有限公司