
Corpus cleaning method, device and equipment and medium

A corpus cleaning technology, applied in the field of data science, that addresses the high labor consumption and high labor cost of manual labeling.

Active Publication Date: 2019-05-10
THE FOURTH PARADIGM BEIJING TECH CO LTD
9 Cites, 7 Cited by

AI Technical Summary

Problems solved by technology

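Both existing corpus cleaning approaches (large-scale manual screening, and training a model on a large manually labeled corpus) require a great deal of manual labeling, so the labor cost is high.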



Examples


Embodiment Construction

[0040] Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like numerals refer to like parts throughout. The embodiments are described below in order to explain the present invention by referring to the figures.

[0041] Figure 1 is a flowchart showing a corpus cleaning method according to an exemplary embodiment of the present invention. The method shown in Figure 1 can be implemented by a computer program or by a dedicated corpus cleaning device.

[0042] In step S110, the sentence vector extraction model structure is obtained.

[0043] The sentence vector extraction model structure is used to extract the sentence vector of the input question or answer sentence. In the present invention, the sentence vector extraction model structure can be taken from a part of the question-answer pair model that has been trained in advance to evaluate the matching situation b...
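As a rough illustration of step S110, the sketch below shows one way a sentence vector extractor could be taken from part of a question-answer matching model. Everything here is an assumption made for illustration: the class `QAMatchingModel`, its averaged-embedding encoder, the cosine-similarity matching head, and the helper `get_sentence_vector_extractor` are hypothetical stand-ins, not the model described in the patent.

```python
import math


class QAMatchingModel:
    """Toy stand-in for a pre-trained question-answer matching model (hypothetical)."""

    def __init__(self, embeddings):
        # token -> vector; plays the role of pre-trained weights (assumption)
        self.embeddings = embeddings
        self.dim = len(next(iter(embeddings.values())))

    def encode(self, sentence):
        """Encoder part: average the embeddings of the known tokens."""
        vectors = [self.embeddings[t] for t in sentence.lower().split()
                   if t in self.embeddings]
        if not vectors:
            return [0.0] * self.dim
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    def match_score(self, question, answer):
        """Matching head: cosine similarity of the two sentence vectors."""
        q, a = self.encode(question), self.encode(answer)
        dot = sum(qi * ai for qi, ai in zip(q, a))
        norm = math.sqrt(sum(v * v for v in q)) * math.sqrt(sum(v * v for v in a))
        return dot / norm if norm else 0.0


def get_sentence_vector_extractor(model):
    """Reuse only the encoder part of the matching model as the extractor."""
    return model.encode


# Example: the extractor maps a question or an answer sentence to a vector.
model = QAMatchingModel({"hello": [1.0, 0.0], "world": [0.0, 1.0]})
extract = get_sentence_vector_extractor(model)
vector = extract("hello world")
```

The point of the sketch is only the reuse: the matching head is discarded, and the encoder that was trained together with it supplies the sentence vectors.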



Abstract

The invention provides a corpus cleaning method, device, equipment and medium. The method comprises: obtaining a sentence vector extraction model structure, which is taken from a part of a pre-trained question-answer pair model used to evaluate how well a question matches an answer, and which extracts the sentence vector of an input question or answer; extracting at least a part of the corpora from all corpora to be cleaned as question-answer pairs; obtaining labeling results for that part of the corpora; training a classification model on a training set composed of that part of the corpora and its labeling results, where the classification model evaluates whether a corpus is suitable for use as a question-answer pair based on the sentence vectors that the sentence vector extraction model structure extracts from the question and the answer of the input corpus; and using the trained classification model to screen, from the unlabeled corpora among all corpora, the corpora suitable for use as question-answer pairs. In this way, a large number of high-quality corpora can be obtained with a small amount of manual labeling.
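The steps in the abstract can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the patented implementation: `sentence_vector` is a hashed bag-of-words stand-in for the sentence vector extraction model structure, the classification model is a small hand-written logistic regression, `label_fn` stands in for manual annotation, and all function names, `DIM`, and the default parameters are hypothetical.

```python
import hashlib
import math
import random

DIM = 16  # toy sentence-vector size (assumption)


def sentence_vector(sentence):
    """Hypothetical stand-in for the sentence vector extraction model structure:
    a normalized, hashed bag-of-words vector."""
    vec = [0.0] * DIM
    for token in sentence.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def pair_features(question, answer):
    """Features for one corpus entry, built from its two sentence vectors."""
    q, a = sentence_vector(question), sentence_vector(answer)
    return q + a + [qi * ai for qi, ai in zip(q, a)]


def predict_proba(model, features):
    """Probability that an entry is suitable as a question-answer pair."""
    weights, bias = model
    z = bias + sum(w * x for w, x in zip(weights, features))
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_classifier(feature_rows, labels, epochs=200, lr=0.5):
    """Logistic regression trained with plain stochastic gradient descent."""
    model = ([0.0] * len(feature_rows[0]), 0.0)
    for _ in range(epochs):
        for x, y in zip(feature_rows, labels):
            error = predict_proba(model, x) - y
            weights, bias = model
            model = ([w - lr * error * xi for w, xi in zip(weights, x)],
                     bias - lr * error)
    return model


def clean_corpus(corpus, label_fn, sample_size, threshold=0.5, seed=0):
    """Label a sample, train on it, then screen the unlabeled remainder."""
    rng = random.Random(seed)
    sampled = set(rng.sample(range(len(corpus)), sample_size))
    train_rows = [pair_features(*corpus[i]) for i in sorted(sampled)]
    train_labels = [label_fn(*corpus[i]) for i in sorted(sampled)]  # 1 = suitable, 0 = not
    model = train_classifier(train_rows, train_labels)
    kept = [pair for i, pair in enumerate(corpus)
            if i not in sampled
            and predict_proba(model, pair_features(*pair)) >= threshold]
    return kept, sampled
```

Calling `clean_corpus(corpus, label_fn, sample_size=k)` on a list of `(question, answer)` tuples returns the unlabeled entries the classifier accepts, together with the indices that were sent for labeling. Only `k` entries need annotation, which is the labor saving the abstract describes.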

Description

Technical field

[0001] The present invention generally relates to the technical field of data science and, more specifically, to a corpus cleaning method, device, equipment and medium.

Background

[0002] In the field of artificial intelligence interaction, dialogue remains an important mode of interaction. Dialogue-based artificial intelligence interaction depends on the construction of high-quality question-answer pairs. Selecting, from a large corpus, the entries suitable for use as question-answer pairs (that is, corpus cleaning) is the key to constructing high-quality question-answer pairs.

[0003] Existing corpus cleaning schemes fall mainly into two types. One obtains the target corpus through a large amount of manual screening; the other obtains a large-scale labeled corpus through manual labeling and then uses the labeled corpus for model training to use the train...


Application Information

IPC(8): G06F16/332, G06F16/35, G06N3/04, G06N3/08
Inventor: 王靖淞, 邢少敏
Owner: THE FOURTH PARADIGM BEIJING TECH CO LTD