Corpus data processing method and device, computer readable medium and electronic equipment

A corpus data processing method, applicable to corpus data processing devices, computer-readable media, and electronic equipment. It addresses the problems of excessive recalled data, noisy corpus data, and time-consuming, labor-intensive manual inspection, thereby reducing manual effort and improving accuracy.

Active Publication Date: 2019-04-09
TENCENT TECH (SHENZHEN) CO LTD

AI Technical Summary

Problems solved by technology

The corpus mining schemes proposed in the related art recall too much data and produce noisy corpus data. This makes manual inspection time-consuming and labor-intensive, and it also degrades the accuracy of the deep learning model.

Examples

Embodiment approach 1

[0096] In an embodiment of the present invention, if the template corresponding to a piece of corpus data is a subset of an existing domain template, the template is considered similar to the domain template, and that piece of corpus data is treated as corpus of the domain.

[0097] For example, suppose the existing domain template is "[player]'s height". The corpus data "I want to know Yao Ming's height" corresponds to the template "I want to know [player]'s height". Since "[player]'s height" is a subset of "I want to know [player]'s height", the template corresponding to the corpus data is considered similar to the existing domain template, and the corpus data is treated as corpus of the domain.
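The containment check in this embodiment can be sketched as follows. This is a minimal illustration, not the patent's implementation: the whitespace tokenization, the function names, and the choice to accept containment in either direction (paragraph [0096] and the example in [0097] phrase the direction differently) are all assumptions.

```python
def tokenize(template: str) -> list[str]:
    """Split on whitespace (an assumption); a slot such as "[player]'s" stays one token."""
    return template.split()


def contains(longer: list[str], shorter: list[str]) -> bool:
    """Return True if `shorter` occurs as a contiguous run of tokens inside `longer`."""
    n, m = len(longer), len(shorter)
    return any(longer[i:i + m] == shorter for i in range(n - m + 1))


def is_similar_by_containment(corpus_template: str, domain_template: str) -> bool:
    """Treat two templates as similar if either one contains the other (assumed rule)."""
    a, b = tokenize(corpus_template), tokenize(domain_template)
    return contains(a, b) or contains(b, a)


# Hypothetical usage with the templates from the example above.
print(is_similar_by_containment("I want to know [player]'s height", "[player]'s height"))
```

Matching whole tokens rather than raw substrings avoids accidental hits inside a longer word, at the cost of depending on how the text is segmented.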

Embodiment approach 2

[0099] In an embodiment of the present invention, if the edit distance between the template corresponding to the corpus data and an existing domain template is less than or equal to a distance threshold (for example, 2), the template corresponding to the corpus data is considered similar to the existing domain template, and the corpus data is treated as corpus of the domain.

[0100] For example, the template corresponding to the corpus data "Yao Ming's real height" is "[player]'s real height". This template and the existing domain template "[player]'s height" do not have an inclusion relationship, but the edit distance between the two templates is 2, which is within the set threshold. The template corresponding to the corpus data is therefore considered similar to the existing domain template, and the corpus data is treated as corpus of the domain.
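The edit-distance check can be sketched as below. This is only an illustration under stated assumptions: it uses the standard Levenshtein distance over whitespace-separated tokens, and the function names are hypothetical. The text does not say which unit the distance is counted in. With this English rendering and token-level counting the two example templates are 1 apart rather than 2, so the value of 2 in the example presumably reflects a different unit or the original-language strings; either way the pair falls within the threshold of 2.

```python
from collections.abc import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of `b`
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def is_similar_by_edit_distance(corpus_template: str, domain_template: str,
                                threshold: int = 2) -> bool:
    """Compare templates token by token (tokenization is an assumption)."""
    return edit_distance(corpus_template.split(), domain_template.split()) <= threshold


# Hypothetical usage with the templates from the example above.
print(is_similar_by_edit_distance("[player]'s real height", "[player]'s height"))
```

Because `edit_distance` accepts any sequence, passing the raw strings instead of token lists gives a character-level distance, which may be closer to what the original text intends.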

[0101] In other embodiments of the present invention, the similarity between the templates can also be calculated through algorithms such as cosine similarity a...
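As an illustration of the cosine-similarity option, the sketch below compares two templates as bag-of-token count vectors. The paragraph above is truncated in this copy and does not say how templates are vectorized, so the whitespace tokenization, the count-vector representation, and the function name are all assumptions.

```python
import math
from collections import Counter


def cosine_similarity(template_a: str, template_b: str) -> float:
    """Cosine of the angle between token-count vectors of two templates."""
    va, vb = Counter(template_a.split()), Counter(template_b.split())
    # Only tokens present in both templates contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0  # define similarity with an empty template as 0


# Hypothetical usage: a score near 1 means the templates share most of their tokens.
print(cosine_similarity("[player]'s real height", "[player]'s height"))
```

A threshold on this score would play the same role as the distance threshold in the edit-distance embodiment; no particular threshold value is given in the text.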

Abstract

The embodiment of the invention provides a corpus data processing method and device, a computer readable medium, and electronic equipment. The corpus data processing method comprises: obtaining to-be-processed corpus data in a target field; generating a first corpus template corresponding to the to-be-processed corpus data according to the entity name contained in the to-be-processed corpus data; calculating the similarity between the first corpus template and a second corpus template that already exists in the target field; and filtering the to-be-processed corpus data according to that similarity to obtain processed corpus data. According to the technical scheme provided by the embodiment of the invention, the to-be-processed corpus data is filtered by comparing its mined corpus template with the existing corpus templates in the target field, so that corpus data poorly correlated with the target field is filtered out and relatively accurate corpus data in the target field is obtained.
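The steps listed in the abstract can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the entity dictionary, the slot notation, the function names, and the default substring-containment similarity are all hypothetical, and a real system would need an entity recognizer rather than plain string replacement.

```python
from collections.abc import Callable, Iterable


def to_template(sentence: str, entities: dict[str, str]) -> str:
    """Replace each known entity name with its slot label, longest names first."""
    for name in sorted(entities, key=len, reverse=True):
        sentence = sentence.replace(name, f"[{entities[name]}]")
    return sentence


def contains_either(a: str, b: str) -> bool:
    """Assumed default similarity: one template is a substring of the other."""
    return a in b or b in a


def filter_corpus(sentences: Iterable[str],
                  entities: dict[str, str],
                  domain_templates: Iterable[str],
                  is_similar: Callable[[str, str], bool] = contains_either) -> list[str]:
    """Keep sentences whose template is similar to at least one existing domain template."""
    templates = list(domain_templates)
    kept = []
    for sentence in sentences:
        first_template = to_template(sentence, entities)
        if any(is_similar(first_template, second) for second in templates):
            kept.append(sentence)
    return kept


# Hypothetical usage: one entity, one existing domain template.
entities = {"Yao Ming": "player"}
candidates = ["I want to know Yao Ming's height", "What is the weather today"]
print(filter_corpus(candidates, entities, ["[player]'s height"]))
```

Passing the similarity test in as a parameter keeps the filtering loop independent of which measure (containment, edit distance, or another) a given embodiment uses.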

Description

Technical field

[0001] The present invention relates to the field of computer and communication technologies, and in particular to a corpus data processing method and device, a computer readable medium, and electronic equipment.

Background technique

[0002] In intelligent question answering scenarios, acquiring and expanding domain corpus is an important part of domain construction. A sufficient corpus of high quality and diversity can be used to train more accurate deep learning models, which in turn classify user questions more accurately. Conversely, if a field has too little relevant corpus, the deep learning model learns fewer features related to that field and has difficulty distinguishing it from corpus in other fields. Corpus mining therefore has a decisive influence on the performance of deep learning models. However, the corpus mining solution proposed in the related technology has the problem of more recalled data and gr...

Application Information

IPC(8): G06F16/33, G06F17/27
CPC: G06F40/295
Inventor: 周辉阳, 饶孟良, 曹云波
Owner: TENCENT TECH (SHENZHEN) CO LTD