A data screening method and device

A technology for screening speech data and text data, applied in the field of data processing, that addresses the problem that the effect of acoustic models and language models cannot be guaranteed.

Active Publication Date: 2021-12-07
IFLYTEK CO LTD

AI Technical Summary

Problems solved by technology

[0003] In practice, a large amount of sample data is needed to train the acoustic model and the language model. However, in the data labeling stage of existing acoustic and language models, a number of samples are selected at random for labeling and then used for model training. It is not known whether these randomly selected samples are the ones the model actually needs to learn from, so the effect of the acoustic model and the language model cannot be guaranteed.



Examples


Embodiment 1

[0065] This embodiment introduces a voice data screening method. Beforehand, a training data set composed of a large amount of voice data can be built. With this method, the voice data that the acoustic model actually needs to learn can be screened out of the training data set and used to train the acoustic model. In this way, using the limited data resources (i.e., low resources) that are screened out, the acoustic model can learn acoustic features as comprehensively as possible, which improves both the training speed and the predictive performance of the acoustic model.
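The idea of training on a limited amount of screened voice data can be illustrated with a minimal sketch. It is not the screening procedure of this embodiment, whose steps are only partly shown here. It assumes, purely for illustration, that each utterance already has a hypothetical acoustic-model confidence score and that low confidence is a usable proxy for "data the model still needs to learn". The names `Utterance`, `select_under_budget`, and `budget_s` are invented for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    utt_id: str
    duration_s: float   # length of the recording in seconds
    confidence: float   # hypothetical acoustic-model confidence in [0, 1]


def select_under_budget(pool: List[Utterance], budget_s: float) -> List[Utterance]:
    """Greedily pick the least-confident utterances that fit a duration budget.

    Assumption: low confidence stands in for "not yet learned"; the
    patent text shown here does not specify this criterion.
    """
    selected: List[Utterance] = []
    used = 0.0
    # Visit utterances from least to most confident.
    for utt in sorted(pool, key=lambda u: u.confidence):
        if used + utt.duration_s <= budget_s:
            selected.append(utt)
            used += utt.duration_s
    return selected


# Toy pool with made-up durations and confidence scores.
pool = [
    Utterance("u1", 4.0, 0.92),
    Utterance("u2", 6.0, 0.35),
    Utterance("u3", 5.0, 0.50),
    Utterance("u4", 3.0, 0.70),
]
chosen = select_under_budget(pool, budget_s=10.0)
print([u.utt_id for u in chosen])
```

The greedy pass skips an utterance that would exceed the budget and keeps looking for shorter ones, so the selected set stays within the duration limit while favoring the samples the model is least sure about.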

[0066] Referring to Figure 1, which is a schematic flowchart of the voice data screening method provided in this embodiment, the method includes the following steps:

[0067] S101: Train an acoustic model using voice data of a first duration.

[0068] In this embodiment, in order to improve the data quality of the training data of the acoustic model, before this step S101,...

Embodiment 2

[0108] This embodiment introduces a text data screening method. Beforehand, a training data set composed of a large amount of text can be built. With this method, the text data in a specific domain (such as the medical field) that the text domain classification model actually needs to learn can be screened out of the training data set and used to train the text domain classification model. In this way, using the limited data resources (i.e., low resources) that are screened out, the text domain classification model can learn the text features of the specific domain as comprehensively as possible, which improves both the training speed of the model and its classification effect for that domain. Furthermore, the text domain classification model can be used to more accurately select the text data of the specific domain from the tra...
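As a rough illustration of screening text for a domain classifier, the sketch below picks the texts whose in-domain probability is closest to 0.5, on the assumption that these are the ones the classifier has learned least well. Both the criterion and the probabilities are hypothetical. The text shown here does not state how this embodiment scores candidates, and `most_ambiguous` is a name invented for this example.

```python
from typing import Dict, List


def most_ambiguous(domain_prob: Dict[str, float], k: int) -> List[str]:
    """Return the k texts whose in-domain probability is closest to 0.5.

    Assumption: closeness to 0.5 is used as a proxy for "the classifier
    still needs to learn this text"; this is not taken from the patent.
    """
    ranked = sorted(domain_prob, key=lambda text: abs(domain_prob[text] - 0.5))
    return ranked[:k]


# Made-up probabilities that each text belongs to the medical domain.
probs = {
    "patient presented with acute chest pain": 0.97,
    "the dosage was adjusted after the match": 0.55,
    "stock prices fell sharply on Monday": 0.03,
    "recovery time depends on the procedure": 0.42,
}
picked = most_ambiguous(probs, k=2)
print(picked)
```

Texts the classifier already places confidently inside or outside the domain are left out, so the labeling effort goes to the borderline cases.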

Embodiment 3

[0146] This embodiment introduces a data screening method and a model building method, which can be implemented using the screening methods introduced in the first and second embodiments above.

[0147] Referring to Figure 4, which is a schematic flowchart of the data screening method provided in this embodiment, the method includes:

[0148] Step S401: Based on the learning requirements for data features, use a preset screening strategy to screen the data set to be screened and obtain screened data, wherein the screened data carries data features that have not yet been learned.
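Step S401 can be pictured as a filter over the candidate data set. The following is a minimal sketch under stated assumptions, not the preset screening strategy of the application: the "strategy" is a simple confidence threshold, and low model confidence stands in for "carries unlearned data features". The function `screen_data`, its parameters, and the toy scores are hypothetical.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def screen_data(
    candidates: Iterable[T],
    confidence: Callable[[T], float],
    threshold: float = 0.6,
) -> List[T]:
    """Keep the samples the current model is unsure about.

    Assumption: a sample whose confidence is below `threshold` is
    treated as carrying features that have not been learned yet.
    """
    return [sample for sample in candidates if confidence(sample) < threshold]


# Toy data set: sample id mapped to a made-up model confidence in [0, 1].
toy_scores = {"a": 0.95, "b": 0.40, "c": 0.59, "d": 0.60}
screened = screen_data(toy_scores, confidence=toy_scores.__getitem__)
print(screened)
```

Passing the scoring function as a parameter keeps the filter independent of the model type, so the same sketch applies whether the scores come from an acoustic model or a text classifier.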

[0149] In practical applications, especially in the field of deep learning, achieving a given functional goal requires collecting a large amount of sample data related to that goal to form a data set, with the expectation that the data features in the data set can be learned comprehensively. However, in order to achieve comprehen...


Abstract

The present application discloses a data screening method and device. The method includes: based on the learning requirements for data features, using a preset screening strategy to screen the data set to be screened and obtain screened data, wherein the screened data carries data features that have not yet been learned. Because the screening strategy is established in advance according to the learning requirements for data features, the screened data carries data features that have not been learned so far. Feature learning can therefore be performed with the limited data resources that are screened out; that is, feature learning under low-resource conditions is achieved.

Description

Technical field

[0001] The present application relates to the technical field of data processing, in particular to a data screening method and device.

Background

[0002] With the continuous development of speech recognition technology, speech recognition has found practical application in many settings, such as voice input methods, conference transcription, and film and television subtitle generation. High-quality speech recognition plays a decisive role in these applications and has therefore received growing attention from researchers. In recent years, with the rapid development of deep learning, the acoustic model and the language model in current speech recognition systems are essentially all based on neural networks, and neural network models often require a large number of training samples. How to improve the performance of a speech recognition system under low-resource condition...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G10L15/06, G10L15/14, G10L15/183
CPC: G10L15/063, G10L15/14, G10L15/144, G10L15/183
Inventors: 方昕, 刘海波, 方磊
Owner: IFLYTEK CO LTD