Corpus text processing method and device and electronic equipment

A corpus text processing method, applied in special data processing applications, unstructured text data retrieval, and text database clustering/classification. It addresses problems such as the limited ability of existing models to represent corpus text and the resulting low accuracy of intent category labels.

Active Publication Date: 2020-12-29
NETEASE (HANGZHOU) NETWORK CO LTD

AI Technical Summary

Problems solved by technology

[0003] Existing approaches mainly use a clustering algorithm together with metric learning to label corpus texts with intent categories. The metric learning model is a traditional sequence model, which has limited ability to represent corpus texts, so the accuracy of the intent category labels is low.



Examples

Embodiment 1

[0036] This embodiment of the present invention provides a method for processing corpus text. As shown in Figure 2, the method includes the following steps:

[0037] Step S202: input the corpus text set to be processed into the language model to obtain a feature vector for each corpus text in the set. The feature vector represents the semantic information of the corpus text, and the language model is obtained by training on the original training samples.

[0038] Here, the language model is the BERT (Bidirectional Encoder Representations from Transformers) language model. Before the corpus text set is clustered, the corpus texts in it must be vectorized. Specifically, the corpus text set to be processed is input into the BERT language model, which applies vector mapping to the text set to obtain the feature vector...
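The interface of step S202 (a set of texts in, one fixed-length feature vector per text out) can be sketched as follows. This is only an illustration: `toy_encode`, `vectorize_corpus`, and `DIM` are hypothetical names, and the hashing encoder is a stand-in for the BERT language model, so the vectors it produces carry no real semantic information.

```python
import hashlib
import math
from typing import List

DIM = 16  # hypothetical vector size chosen for the sketch


def toy_encode(text: str, dim: int = DIM) -> List[float]:
    """Stand-in for the language model (assumption, not BERT).

    Hashes each whitespace token into one of `dim` buckets, counts the
    hits, and L2-normalises the result. Only the interface matches.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def vectorize_corpus(corpus: List[str]) -> List[List[float]]:
    """Outline of step S202: one feature vector per corpus text."""
    return [toy_encode(text) for text in corpus]


corpus = [
    "book a flight to Beijing",
    "book a flight to Shanghai",
    "what is the weather today",
]
vectors = vectorize_corpus(corpus)
print(len(vectors), len(vectors[0]))
```

In a real pipeline the body of `toy_encode` would be replaced by a call to the trained language model; the rest of the flow only depends on receiving one vector per text.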


Abstract

The invention provides a corpus text processing method and device and electronic equipment. The method comprises the steps of: inputting a corpus text set to be processed into a language model to obtain feature vectors of the corpus texts; clustering the corpus text set based on a clustering algorithm and the feature vectors to obtain corpus classification information; correcting the intent category annotation of target corpus texts in the corpus classification information to obtain corrected target corpus texts; and adding the corrected target corpus texts to the original training samples to train the language model and obtain an optimized language model. Because the corpus text set is clustered using the language model and the clustering algorithm, and the corrected intent category annotations are used to retrain the language model, the language model can be iteratively optimized during use. This improves the generalization ability of the language model and the clustering algorithm, and improves the accuracy of the intent category annotations assigned to corpus texts.
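The cluster, correct, and augment loop described in the abstract can be sketched roughly as below. Several things here are assumptions made for illustration: the hand-written 2D vectors stand in for language model feature vectors, plain k-means stands in for whatever clustering algorithm is used, and the majority-vote rule in `correct_labels` stands in for the correction step, which the abstract does not specify. All function and variable names are hypothetical.

```python
from collections import Counter
from typing import Dict, List

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors: List[Vector], k: int, iterations: int = 10) -> List[int]:
    """Plain k-means; the first k vectors seed the centroids (deterministic)."""
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment


def correct_labels(labels: List[str], assignment: List[int]) -> List[str]:
    """Assumed correction rule: use each cluster's majority intent label."""
    majority: Dict[int, str] = {}
    for c in set(assignment):
        votes = Counter(l for l, a in zip(labels, assignment) if a == c)
        majority[c] = votes.most_common(1)[0][0]
    return [majority[a] for a in assignment]


texts = ["book a flight", "weather today", "reserve a plane ticket",
         "will it rain", "buy an air ticket", "is it sunny"]
# Hand-made stand-ins for language model feature vectors.
vectors = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2],
           [0.2, 0.8], [0.85, 0.15], [0.15, 0.85]]
# The fifth text carries a deliberately wrong intent label.
labels = ["book_flight", "ask_weather", "book_flight",
          "ask_weather", "ask_weather", "ask_weather"]

assignment = kmeans(vectors, k=2)
corrected = correct_labels(labels, assignment)

# Target corpus texts: those whose annotation was corrected.
target = [(t, new) for t, old, new in zip(texts, labels, corrected) if old != new]

original_training = [("cancel my order", "cancel_order")]
augmented_training = original_training + target
print(target)  # [('buy an air ticket', 'book_flight')]
```

The sketch stops at building `augmented_training`; retraining the language model on it, and repeating the loop with the optimized model, is the part the abstract describes as iterative optimization and is not shown here.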

Description

Technical field

[0001] The present invention relates to the technical field of natural language processing, and in particular to a method, device and electronic equipment for processing corpus texts.

Background technique

[0002] With the rapid development of computers, the number of digitized texts keeps increasing, and the growth of the Internet has accelerated this expansion. In this context, clustering technology can be used to simplify the representation of text and reorganize information in order to speed up retrieval, or to integrate and push personalized information, as in currently popular apps (mobile applications) such as Toutiao and Zhihu. However, in most scenarios chatbots still require customized question-answer pairs, that is, pairings of intents and answers. This mode is very common in task-based conversations, such as booking airline tickets. But in open domain...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/35, G06F16/36, G06K9/62
CPC: G06F16/35, G06F16/36, G06F18/2193, G06F18/2411
Inventor: 浦嘉澍, 毛晓曦, 范长杰, 胡志鹏
Owner: NETEASE (HANGZHOU) NETWORK CO LTD