A Customizable Chinese-English Mixed Speech Recognition End-to-End System

A hybrid speech recognition end-to-end system technology, applied in speech recognition, speech analysis, and natural language data processing. It addresses performance degradation, the complexity of statistical language models, and the difficulty of training end-to-end models effectively, while reducing dependence on training data and improving recognition accuracy.

Active Publication Date: 2022-03-25

INST OF AUTOMATION CHINESE ACAD OF SCI



Problems solved by technology

[0004] (1) Pipeline method: Because the acoustic model, pronunciation model, and language model are trained and modeled separately, errors accumulate. The error of the acoustic model propagates to the subsequent pronunciation model and language model, degrading performance. In addition, because the statistical language model is complex, the constructed decoding graph is very large, which makes it unsuitable for on-device applications such as mobile phones and smart speakers.

[0005] (2) Existing end-to-end models require a large amount of training data, but mixed Chinese-English data is extremely difficult to obtain, so the end-to-end model cannot be trained effectively.

In addition, end-to-end models trained on domain-specific Chinese-English mixed data cannot effectively recognize English in other domains.


Examples


Embodiment 1

[0034] The first aspect of the present invention discloses a customizable Chinese-English mixed speech recognition end-to-end system. Figure 1 is a structural diagram of a customizable Chinese-English mixed speech recognition end-to-end system according to an embodiment of the present invention. As shown in Figure 1, the system 100 includes:

[0035] An acoustic encoder 101, an English vocabulary encoder 102, and a decoder 103.

[0036] The acoustic encoder 101 extracts acoustic features from the speech waveform to obtain an acoustic feature sequence, performs convolution and re-encoding operations on that sequence to obtain a down-sampled and re-encoded feature sequence, and then feeds the down-sampled and re-encoded feature sequence into the multi-head self-attention module of the acoustic encoder, which is based on the multi-head self-attention mechanism, to obtain a high-dimensional representation sequence of the acoustic features;
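The data flow in [0036] (convolutional down-sampling followed by multi-head self-attention) can be illustrated with a minimal sketch in plain Python. This is not the patent's implementation: the function names, the fixed smoothing kernel, the stride of 2, the 8-dimensional toy features, and the identity query/key/value projections are all assumptions made for illustration; a real encoder would use learned convolution and projection weights.

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def conv_downsample(frames, kernel, stride=2):
    # Strided 1-D convolution over time. The same (assumed, fixed) kernel is
    # applied to every feature dimension; a stride of 2 roughly halves the length.
    k = len(kernel)
    dim = len(frames[0])
    out = []
    for start in range(0, len(frames) - k + 1, stride):
        window = frames[start:start + k]
        out.append([sum(kernel[i] * window[i][d] for i in range(k))
                    for d in range(dim)])
    return out

def self_attention_head(seq):
    # Scaled dot-product self-attention with identity projections
    # (queries = keys = values = seq), an assumption to keep the sketch short.
    scale = math.sqrt(len(seq[0]))
    out = []
    for query in seq:
        weights = softmax([sum(a * b for a, b in zip(query, key)) / scale
                           for key in seq])
        out.append([sum(w * value[d] for w, value in zip(weights, seq))
                    for d in range(len(query))])
    return out

def multi_head_self_attention(seq, num_heads):
    # Split the feature dimension into equal slices, attend within each slice,
    # then concatenate the per-head outputs for every time step.
    head_dim = len(seq[0]) // num_heads
    heads = []
    for h in range(num_heads):
        sub = [vec[h * head_dim:(h + 1) * head_dim] for vec in seq]
        heads.append(self_attention_head(sub))
    return [sum((heads[h][t] for h in range(num_heads)), [])
            for t in range(len(seq))]

random.seed(0)
features = [[random.random() for _ in range(8)] for _ in range(20)]  # 20 toy frames
down = conv_downsample(features, kernel=[0.25, 0.5, 0.25], stride=2)
encoded = multi_head_self_attention(down, num_heads=2)
print(len(features), len(down), len(encoded), len(encoded[0]))
```

The sequence gets shorter in the convolution step, while the attention step keeps the sequence length and feature width unchanged.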

[0037] I...

Embodiment 2

[0062] As shown in Figure 1, the system 100 includes:

[0063] An acoustic encoder 101, an English vocabulary encoder 102, and a decoder 103.

[0064] The acoustic encoder 101 extracts acoustic features from the speech waveform to obtain an acoustic feature sequence, performs convolution and re-encoding operations on that sequence to obtain a down-sampled and re-encoded feature sequence, and then feeds the down-sampled and re-encoded feature sequence into the multi-head self-attention module of the acoustic encoder, which is based on the multi-head self-attention mechanism, to obtain a high-dimensional representation sequence of the acoustic features;

[0065] In some embodiments, the acoustic features of the speech waveform are extracted as follows: each frame is 25 milliseconds long, adjacent frames overlap by 10 milliseconds, and after framing, 80-dimensional fbank features are extracted as the acoustic features;...
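The framing arithmetic in [0065] can be checked with a short sketch. It assumes a 16 kHz sample rate (the excerpt does not state one) and reads the text literally, so a 25 ms frame with a 10 ms overlap gives a 15 ms hop. Many speech toolkits instead use a 10 ms frame shift, and the translated text may intend that; the `overlap_ms` parameter can be adjusted accordingly. The function name `frame_signal` is hypothetical, and the fbank computation itself is not shown.

```python
def frame_signal(samples, sample_rate=16000, frame_ms=25, overlap_ms=10):
    # Convert durations to sample counts. The 16 kHz default is an assumption.
    frame_len = sample_rate * frame_ms // 1000
    hop_len = sample_rate * (frame_ms - overlap_ms) // 1000
    if hop_len <= 0:
        raise ValueError("overlap must be shorter than the frame length")
    frames = []
    start = 0
    # Keep only complete frames; trailing samples that do not fill a frame are dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames

waveform = [0.0] * 16000          # one second of silence at 16 kHz
frames = frame_signal(waveform)
print(len(frames), len(frames[0]))  # prints: 66 400
```

Each of these frames would then be passed to a filterbank front end to produce one 80-dimensional fbank vector per frame.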

Embodiment 3

[0086] The invention discloses an electronic device including a memory and a processor. The memory stores a computer program, and when the processor executes the computer program, it implements the steps of the customizable Chinese-English mixed speech recognition end-to-end method according to any one of the first aspects of the present disclosure.

[0087] Figure 3 is a structural diagram of an electronic device according to an embodiment of the present invention. As shown in Figure 3, the electronic device includes a processor, a memory, a communication interface, a display screen, and an input device connected through a system bus. The processor of the electronic device provides computing and control capabilities. The memory of the electronic device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and computer programs. The internal memory provides an environment for the operatio...


Abstract

The invention provides a customizable Chinese-English mixed speech recognition end-to-end system comprising an acoustic encoder, an English vocabulary encoder, a decoder, and a softmax function. The model is an end-to-end model with an acoustic encoder, English vocabulary encoder, and decoder structure, and all three components use attention-based modeling internally. The model is customized by encoding in advance the English words or phrases to be customized, converting the discrete words into hidden-layer representations of the model to form a list of vectors to be retrieved. During recognition, the decoder performs attention calculations over both the high-dimensional representation of the acoustic features and the final representation sequence of the English vocabulary. The invention enables model customization for English proper nouns in different fields, achieves accurate recognition of mixed Chinese-English expressions, and reduces the model's dependence on training data.
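The customization mechanism described in the abstract (pre-encode a list of English words, then let the decoder attend over both the acoustic representation and the encoded word list) can be sketched as follows. Everything here is an illustrative assumption: `encode_vocabulary` is a toy character-bucketing stand-in for the English vocabulary encoder, the 4-dimensional vectors are arbitrary, and concatenating the two attention contexts is only one possible way to combine them, since the excerpt does not say how the patent fuses them.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, memory):
    # Scaled dot-product attention of one query vector over a list of memory vectors.
    scale = math.sqrt(len(query))
    scores = [sum(q * m for q, m in zip(query, vec)) / scale for vec in memory]
    weights = softmax(scores)
    context = [sum(w * vec[d] for w, vec in zip(weights, memory))
               for d in range(len(query))]
    return context, weights

def encode_vocabulary(words, dim=4):
    # Hypothetical stand-in for the English vocabulary encoder: maps each word
    # to a fixed-size vector ahead of time, forming the list to be retrieved.
    vectors = []
    for word in words:
        vec = [0.0] * dim
        for i, ch in enumerate(word.lower()):
            vec[i % dim] += ord(ch) / 128.0
        vectors.append([v / len(word) for v in vec])
    return vectors

def decoder_step(state, acoustic_memory, vocab_memory):
    # One decoding step: attend over both memories, then concatenate the
    # two context vectors (an assumed fusion choice).
    acoustic_ctx, _ = attend(state, acoustic_memory)
    vocab_ctx, vocab_weights = attend(state, vocab_memory)
    return acoustic_ctx + vocab_ctx, vocab_weights

custom_words = ["Transformer", "Kaldi", "fbank"]   # words to customize
vocab_memory = encode_vocabulary(custom_words)      # computed once, before recognition
acoustic_memory = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
state = [0.5, 0.1, 0.0, 0.2]
fused, vocab_weights = decoder_step(state, acoustic_memory, vocab_memory)
print(len(fused), [round(w, 3) for w in vocab_weights])
```

Because the word list is encoded separately from the acoustic path, swapping in a different `custom_words` list changes what the decoder can retrieve without retraining anything in this sketch.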

Description

Technical field

[0001] The invention belongs to the field of speech recognition, and in particular relates to a customizable Chinese-English mixed speech recognition end-to-end system.

Background technique

[0002] Existing Chinese-English mixed speech recognition mainly follows two technical routes. 1. The pipeline method: the acoustic model, pronunciation model, and language model are trained and modeled separately, and the three models are then integrated into a unified decoding graph; recognition is a search over the decoding graph. 2. The end-to-end method: the acoustic model, pronunciation model, and language model are modeled and optimized jointly, with no need to build a decoding graph, which keeps training and decoding simple.

[0003] Disadvantages of existing technology:

[0004] (1) Pipeline method: Because the acoustic model, pronunciation model, and language model are trained and modeled separately, errors will accumulate, and the error of ...


Application Information


Patent Type & Authority: Patents (China)

IPC(8): G10L15/00, G10L15/02, G10L15/06, G10L15/183, G06F40/126, G06F40/237, G06F40/284

CPC: G10L15/005, G10L15/02, G10L15/183, G10L15/063, G06F40/126, G06F40/237, G06F40/284

Inventor: 陶建华, 张帅, 易江燕

Owner INST OF AUTOMATION CHINESE ACAD OF SCI
