Streaming end-to-end speech recognition method and device, and electronic equipment

A streaming speech recognition technology, applied in speech recognition, speech analysis, instruments, etc.; it solves the problem of low activation-point positioning accuracy and achieves high accuracy, low mismatch, and improved recognition accuracy.

Pending Publication Date: 2021-11-02
ALIBABA GRP HLDG LTD

AI Technical Summary

Problems solved by technology

It can be seen that the MoCHA scheme also suffers from relatively low positioning accuracy of activation points.



Examples


Embodiment 1

[0083] The first embodiment provides a streaming end-to-end speech recognition method; see Figure 3. The method includes:

[0084] S301: receive the voice stream in units of frames, extract acoustic features, and encode them;

[0085] When performing streaming speech recognition, acoustic features can be extracted from the voice stream in units of frames and encoded frame by frame, with the encoder outputting an encoding result for each frame. Moreover, since the input voice stream is continuous, the encoding operation can run continuously. For example, assuming a frame length of 60 ms, every 60 ms of the received voice stream is processed for feature extraction and encoding. Here, the encoding process converts the received acoustic features into a new, more discriminative high-level representation, which generally takes the form of a vector. Accordingly, the encoder may specifically be a multilayer neural network, s...
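For illustration, the following is a minimal sketch of the frame-wise encoding described in paragraph [0085], assuming PyTorch and hypothetical dimensions (80-dimensional acoustic features per 60 ms frame, a 2-layer LSTM encoder). The patent does not specify the network architecture; the LSTM here is only a stand-in for "a multilayer neural network".

```python
# Sketch only: architecture and dimensions are assumptions, not the patent's.
import torch
import torch.nn as nn

FEATURE_DIM = 80   # assumed acoustic feature dimension per 60 ms frame
HIDDEN_DIM = 256   # assumed encoder output dimension

class StreamingEncoder(nn.Module):
    """Maps each frame's acoustic features to a higher-level vector
    representation, one frame at a time, as described in [0085]."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(FEATURE_DIM, HIDDEN_DIM, num_layers=2, batch_first=True)
        self.state = None  # carried across frames so encoding runs continuously

    def encode_frame(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (1, 1, FEATURE_DIM) -- the features of one frame
        out, self.state = self.rnn(frame_features, self.state)
        return out  # (1, 1, HIDDEN_DIM): the per-frame encoding result

encoder = StreamingEncoder()
for _ in range(10):  # stand-in for frames arriving from the voice stream
    frame = torch.randn(1, 1, FEATURE_DIM)
    encoded = encoder.encode_frame(frame)
```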

Embodiment 2

[0107] The second embodiment provides a training method for the prediction model; see Figure 4. The method may specifically include:

[0108] S401: obtain a training sample set, the training set comprising a plurality of data blocks and associated label information, wherein each data block comprises the encoding results of multiple frames of a speech stream, and each piece of label information comprises the number of activation points in the corresponding block that require decoded output;

[0109] S402: input the training sample set into the model for training, to obtain the prediction model.

[0110] In a specific implementation, the training set may include cases where the multiple frames corresponding to the same modeling unit (for example, the same text unit) of a voice stream are divided into different blocks. In this way, the model is trained on, and can accurately predict, situations where one modeling unit is split across blocks, so that the same situation can be handled correctly when it is encountered during testing.
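As an illustration of S401-S402, the sketch below builds a toy training set of blocks and activation-count labels and runs a few training steps. The block size, the architecture, and treating the count as a classification target are all assumptions; the patent text does not specify them.

```python
# Sketch only: a hypothetical training loop for the activation-count predictor.
import torch
import torch.nn as nn

BLOCK_FRAMES = 8     # assumed number of encoded frames per block
HIDDEN_DIM = 256     # matches the assumed encoder output dimension above
MAX_ACTIVATIONS = 5  # assumed upper bound on activation points per block

# S401: a toy training set -- blocks of frame encodings plus count labels.
blocks = torch.randn(100, BLOCK_FRAMES, HIDDEN_DIM)     # stand-in block data
labels = torch.randint(0, MAX_ACTIVATIONS + 1, (100,))  # counts per block

# Predicts the number of activation points in a block as a classification
# over the values 0..MAX_ACTIVATIONS.
predictor = nn.Sequential(
    nn.Flatten(),
    nn.Linear(BLOCK_FRAMES * HIDDEN_DIM, 128),
    nn.ReLU(),
    nn.Linear(128, MAX_ACTIVATIONS + 1),
)

# S402: feed the training set to the model being trained.
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(3):
    logits = predictor(blocks)
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```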

Embodiment 3

[0112] The third embodiment is directed to the scenario in which the foregoing method is deployed in a cloud service system, and provides a method of providing a speech recognition service from the perspective of the cloud server; see Figure 5. The method may include:

[0113] S501: after the cloud service system receives a call request from an application system, receive the voice stream provided by the application system;

[0114] S502: extract acoustic features from the received voice stream in units of frames, and encode them;

[0115] S503: perform block processing on the encoded frames, and predict the number of activation points in the same block that require decoded output;

[0116] S504: determine the positions of the activation points requiring decoded output according to the prediction result, so that the decoder decodes at those positions to obtain the speech recognition result;

[0117] S505: return the speech recognition result to the application system.
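The following sketch ties steps S501-S505 together in plain Python. The helpers encode(), predict_activation_count(), locate_activations(), and decode_at() are hypothetical stubs standing in for the encoder of Embodiment 1, the prediction model of Embodiment 2, and the decoder; a real service would wire these to the actual trained components.

```python
# Sketch only: all helper functions below are hypothetical stand-ins.
BLOCK_FRAMES = 8  # assumed block size, as in the earlier sketches

def encode(frame):                     # stand-in for the frame encoder (S502)
    return frame

def predict_activation_count(block):   # stand-in for the block predictor (S503)
    return 1

def locate_activations(block, n):      # stand-in for position determination (S504)
    return list(range(n))

def decode_at(block, pos):             # stand-in for decoding at an activation point
    return f"token@{pos}"

def handle_call_request(voice_stream):
    """S501: invoked when the cloud service receives an application's request."""
    result, buffered = [], []
    for frame in voice_stream:                 # frames of the provided stream
        buffered.append(encode(frame))         # S502: extract features + encode
        if len(buffered) == BLOCK_FRAMES:      # S503: group encoded frames
            block, buffered = buffered, []
            n = predict_activation_count(block)
            for pos in locate_activations(block, n):  # S504: positions to decode
                result.append(decode_at(block, pos))
    return result                              # S505: return to the application

print(handle_call_request(range(16)))
```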



Abstract

The embodiment of the invention discloses a streaming end-to-end speech recognition method and device, and electronic equipment. The streaming end-to-end speech recognition method comprises the following steps: performing acoustic feature extraction on a received speech stream in units of frames, and encoding; performing block processing on the encoded frames, and predicting the number of activation points in the same block that need to be decoded and output; and determining the positions of the activation points needing decoded output according to the prediction result, so that a decoder decodes at those positions and outputs the recognition result. According to the embodiment of the invention, the robustness of a streaming end-to-end speech recognition system to noise can be improved, and the system performance and accuracy are increased.
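As a sketch of the abstract's core step (choosing decoding positions from a predicted count), the snippet below assumes each frame in a block carries an activation score, e.g., from an attention-style mechanism, and simply decodes at the n highest-scoring frames. The scoring mechanism is an assumption; the abstract only states that positions are determined from the prediction result.

```python
# Sketch only: per-frame activation scores are a hypothetical input.
import torch

def activation_positions(frame_probs: torch.Tensor, n: int) -> list[int]:
    """frame_probs: (BLOCK_FRAMES,) activation scores for one block.
    Returns the n frame indices, in time order, at which to decode."""
    top = torch.topk(frame_probs, k=n).indices
    return sorted(top.tolist())

probs = torch.tensor([0.1, 0.8, 0.2, 0.7, 0.05, 0.9, 0.3, 0.1])
print(activation_positions(probs, n=3))  # prints [1, 3, 5]
```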

Description

Technical field

[0001] The present application relates to streaming end-to-end speech recognition technology, and particularly relates to a streaming end-to-end speech recognition method, apparatus, and electronic device.

Background technique

[0002] Speech recognition technology allows a machine to convert speech signals into corresponding text or commands through a process of identification and understanding. Among such technologies, end-to-end speech recognition has received increasing attention from academia and industry. Compared with conventional hybrid systems, end-to-end speech recognition jointly optimizes the acoustic model and language model in a single model, which not only greatly reduces the complexity of training the system but also achieves significant performance improvements. However, the majority of end-to-end speech recognition systems remain offline: they cannot perform speech recognition in real time (streaming). That is, recognition can only be performed and results output after the user finishes speaking a sentence r...


Application Information

Patent Type & Authority: Application (China)
IPC(8): G10L15/20, G10L15/02
CPC: G10L15/20, G10L15/02, G10L15/16, G10L15/30, G10L15/063
Inventor: 张仕良, 高志付
Owner: ALIBABA GRP HLDG LTD