Generating method and device for training corpus, equipment and storage medium

A training corpus generation technology, applied in the field of data processing, that addresses the heavy resource consumption and long iteration cycle of speech recognition models, thereby saving resources, shortening the iteration cycle, and improving the speech recognition effect.

Active Publication Date: 2019-06-28
BEIJING BAIDU NETCOM SCI & TECH CO LTD

AI Technical Summary

Problems solved by technology

[0003] In the course of realizing the present invention, the inventors found that, in the prior art, the training corpus for speech recognition comes mainly from manually labeled random audio, which leads to two main problems: the iteration cycle of the speech recognition...



Examples


Embodiment 1

[0042] Figure 1 is a flowchart of a method for generating training corpus provided in Embodiment 1 of the present invention. This embodiment is applicable to generating training corpus for speech recognition. The method can be executed by the device for generating training corpus provided in the embodiments of the present invention; the device can be implemented in software and/or hardware and can generally be integrated into equipment for generating training corpus. Such equipment includes, but is not limited to, computers. As shown in Figure 1, the method of this embodiment specifically includes:

[0043] Step 101. From the user behavior logs associated with the target application, mine multiple pieces of corpus data to be labeled. Each piece of corpus data includes: a first behavior log containing the user's voice and the corresponding speech recognition result, and, associated in time with the first behavior log, ...
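The mining in Step 101 can be pictured with a minimal sketch. It assumes, as the abstract states, that the second behavior log belongs to the same user and is associated in time with the first. The `BehaviorLog` fields, the `mine_candidates` name, the log type labels, and the 60-second window are all hypothetical choices made for illustration and are not taken from the patent text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class BehaviorLog:
    user_id: str
    timestamp: float                       # seconds; any monotonic clock works
    log_type: str                          # hypothetical label, e.g. "voice_search"
    audio_id: Optional[str] = None         # set only when the log carries user voice
    recognized_text: Optional[str] = None  # speech recognition result, if any


def mine_candidates(
    logs: List[BehaviorLog], max_gap_seconds: float = 60.0
) -> List[Tuple[BehaviorLog, BehaviorLog]]:
    """Pair each voice log with the same user's next log inside the time window."""
    # Group logs per user, in chronological order.
    by_user: Dict[str, List[BehaviorLog]] = {}
    for log in sorted(logs, key=lambda item: item.timestamp):
        by_user.setdefault(log.user_id, []).append(log)

    pairs: List[Tuple[BehaviorLog, BehaviorLog]] = []
    for user_logs in by_user.values():
        for first, second in zip(user_logs, user_logs[1:]):
            has_voice = first.audio_id is not None and first.recognized_text is not None
            close_in_time = second.timestamp - first.timestamp <= max_gap_seconds
            if has_voice and close_in_time:
                pairs.append((first, second))
    return pairs


example_logs = [
    BehaviorLog("u1", 0.0, "voice_search", "a1", "coffee shop"),
    BehaviorLog("u1", 10.0, "route_start"),
    BehaviorLog("u2", 5.0, "voice_search", "a2", "airport"),
    BehaviorLog("u2", 500.0, "route_start"),
]
example_pairs = mine_candidates(example_logs)
```

Only the pair for user `u1` survives in this example, because the second log of `u2` falls outside the assumed window. Each surviving pair is a piece of corpus data still waiting to be labeled as positive or negative feedback.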

Embodiment 2

[0056] Figure 2a is a flowchart of a method for generating training corpus provided in Embodiment 2 of the present invention. This embodiment can be combined with any of the optional solutions in one or more of the above embodiments. In this embodiment, determining the user's voice and the corresponding speech recognition result in each piece of corpus data as positive feedback corpus or negative feedback corpus, according to the association between the first behavior log and the second behavior log in that corpus data, may include: obtaining, according to the log type of the first behavior log, the user's expected behavior corresponding to the first behavior log; and, when the expected behavior matches the second behavior log, determining the user's voice and the corresponding speech recognition result in the corpus data as positive feedback corpus.
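The positive-feedback rule of this embodiment can be sketched as a lookup followed by a comparison. The log type names, the `EXPECTED_BEHAVIOR` table, and the `label_positive` function below are hypothetical; the patent text available here only says that the expected behavior is derived from the log type of the first behavior log and then matched against the second behavior log.

```python
from typing import Dict, Optional

# Hypothetical mapping from the first log's type to the behavior the user is
# expected to perform next if the recognition result was correct.
EXPECTED_BEHAVIOR: Dict[str, str] = {
    "voice_location_search": "select_location",
    "voice_route_search": "start_navigation",
}


def label_positive(first_log_type: str, second_log_type: str) -> Optional[str]:
    """Return "positive" when the second log matches the expected behavior."""
    expected = EXPECTED_BEHAVIOR.get(first_log_type)
    if expected is not None and expected == second_log_type:
        return "positive"
    return None  # no decision; other rules may still label this pair


matched = label_positive("voice_route_search", "start_navigation")
unmatched = label_positive("voice_route_search", "close_app")
```

Returning `None` instead of a negative label is a deliberate choice in this sketch: a missing match is not described in the text above as evidence of a recognition error.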

[0057] Correspondingly, as shown in Figure 2a, the method of this embodiment includes: ...

Embodiment 3

[0074] Figure 3a is a flowchart of a method for generating training corpus provided in Embodiment 3 of the present invention. This embodiment can be combined with any of the optional solutions in one or more of the above embodiments. In this embodiment, determining the user's voice and the corresponding speech recognition result in each piece of corpus data as positive feedback corpus or negative feedback corpus, according to the association between the first behavior log and the second behavior log in that corpus data, may include: if the user behavior corresponding to the second behavior log is determined to be a correction of the first behavior log within a set time period, determining the user's voice and the corresponding speech recognition result in the corpus data as negative feedback corpus.
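A rough sketch of the negative-feedback rule follows. The text above does not define how a correction is detected, so the heuristic here is an assumption: a second input counts as a correction if it arrives within the set time period and is similar to, but not identical with, the first recognition result. The function names, the 30-second window, and the 0.5 similarity threshold are hypothetical, and `difflib.SequenceMatcher` from the Python standard library is swapped in as a simple stand-in for whatever similarity measure a real system would use.

```python
from difflib import SequenceMatcher
from typing import Optional


def is_correction(
    first_text: str,
    first_time: float,
    second_text: str,
    second_time: float,
    window_seconds: float = 30.0,   # assumed "set time period"
    min_similarity: float = 0.5,    # assumed threshold, not from the patent
) -> bool:
    """Heuristic: a near-duplicate re-entry shortly after the first input."""
    within_window = 0.0 <= second_time - first_time <= window_seconds
    if not within_window or first_text == second_text:
        return False
    similarity = SequenceMatcher(None, first_text, second_text).ratio()
    return similarity >= min_similarity


def label_negative(
    first_text: str, first_time: float, second_text: str, second_time: float
) -> Optional[str]:
    """Return "negative" when the second input looks like a correction."""
    if is_correction(first_text, first_time, second_text, second_time):
        return "negative"
    return None


corrected = label_negative("navigate to main street", 0.0, "navigate to maine street", 5.0)
too_late = label_negative("navigate to main street", 0.0, "navigate to maine street", 100.0)
```

In this sketch an unrelated follow-up query scores low on similarity and is not treated as a correction, which keeps ordinary consecutive searches out of the negative feedback corpus.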

[0075] And, after determining the user's speech in the corpus data and the corresponding speech recognition result as the nega...


Abstract

The invention discloses a method, device, equipment, and storage medium for generating training corpus. The method comprises the following steps: mining multiple pieces of corpus data to be labeled from the user behavior logs associated with a target application, wherein each piece of corpus data comprises a first behavior log containing a user's voice and the corresponding speech recognition result, and a second behavior log that is associated in time with the first behavior log and belongs to the same user; and determining the user's voice and the corresponding speech recognition result in each piece of corpus data as positive feedback corpus or negative feedback corpus according to the association between the first behavior log and the second behavior log in that corpus data. Based on user behavior, the embodiments of the invention can automatically and purposefully mine positive feedback corpus and negative feedback corpus for speech recognition and provide them for subsequent training of a speech recognition model. This effectively improves the speech recognition effect, greatly shortens the iteration cycle of the speech recognition model, and saves substantial resources.

Description

Technical field

[0001] Embodiments of the present invention relate to data processing technologies, and in particular to a method, device, equipment, and storage medium for generating training corpus.

Background

[0002] At present, optimizing the speech recognition model of a map application mainly requires the following three steps: randomly extracting tens of thousands of hours of audio and the corresponding scene information; spending a great deal of money and time on manual labeling to produce training corpus; and training the speech recognition model and tuning it.

[0003] In the course of realizing the present invention, the inventors found that, in the prior art, the training corpus for speech recognition comes mainly from manually labeled random audio, which leads to two main problems: because of the manual labeling, the iteration cycle of the speech recognition model is too long and the resource consumption is considerable; due to the randomly extracted audio, ...


Application Information

IPC(8): G10L15/06
CPC: G10L15/26, G10L15/063, G10L2015/0635, G10L15/22, G10L2015/225, G10L2015/221, G10L25/63
Inventor: 丁世强, 黄际洲, 蒋忠伟, 马文韬
Owner BEIJING BAIDU NETCOM SCI & TECH CO LTD