Open-domain Chinese text named entity recognition method based on semi-supervised learning

A named entity recognition and semi-supervised learning technology, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses nested named entities, the difficulty of guaranteeing complete database data, and time-consuming, labor-intensive rule making, and achieves the effect of improving efficiency.

Active Publication Date: 2018-11-06
NANJING UNIV
Cites: 5; Cited by: 23

AI Technical Summary

Problems solved by technology

However, the disadvantage of this method is that it is difficult to ensure the completeness of the database data and the rule-making process is time-consuming an



Embodiment Construction

[0062] To aid understanding of the technical content of the present invention, specific embodiments are described below with reference to the accompanying drawings. First, word segmentation is performed on the training data; the distributed word-vector representation of each word in the training text is then obtained from the word vector space constructed with the word2vec tool. Using the word vectors in the training set and the existing entity type label of each word vector, the KNN classifier and the CRF tagger are trained to generate a KNN-CRF prediction model for named entity categories. In the model prediction stage, an empty reliable result set is introduced, and each new prediction result is added to it. When the size of the reliable result set reaches the threshold, the previous KNN and CRF models are discarded and the results in the reliable result set are added to the training s...
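The prediction-stage loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a plain KNN majority vote stands in for the combined KNN-CRF model, "retraining" is reduced to replacing the stored examples, and the names `knn_predict`, `self_train`, `threshold`, and `k` are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, vec, k=3):
    """Return the majority label among the k training vectors nearest to vec.

    Assumption: Euclidean distance over word vectors; the source does not
    state which distance the KNN classifier uses.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], vec))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def self_train(train, unlabeled, threshold=2, k=3):
    """Predict labels one by one, collecting them in a reliable result set.

    When the reliable set reaches `threshold`, its contents are merged into
    the training set and the set is emptied; for this KNN stand-in, that merge
    plays the role of discarding the old model and training a new one.
    """
    train = list(train)
    reliable = []          # starts empty, as in the description
    retrain_count = 0
    for vec in unlabeled:
        reliable.append((vec, knn_predict(train, vec, k)))
        if len(reliable) >= threshold:
            train.extend(reliable)
            reliable = []
            retrain_count += 1
    return train, reliable, retrain_count


# Toy 2-D "word vectors" with hypothetical entity labels.
seed = [
    ((0.0, 0.0), "PER"), ((0.1, 0.0), "PER"), ((0.0, 0.1), "PER"),
    ((1.0, 1.0), "LOC"), ((0.9, 1.0), "LOC"), ((1.0, 0.9), "LOC"),
]
new_vectors = [(0.05, 0.05), (0.95, 0.95), (0.2, 0.1)]
grown, pending, rounds = self_train(seed, new_vectors, threshold=2)
```

With a threshold of 2, the first two predictions trigger one merge into the training set, and the third prediction stays in the reliable set until the threshold is reached again.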


Abstract

The invention discloses an open-domain Chinese text named entity recognition method based on semi-supervised learning. The method comprises two stages: model training and model prediction. In the model training stage, the training set text is preprocessed by word segmentation; then, using the word vector space constructed by the word2vec tool, the distributed word-vector representation of each word in the training text is obtained; and the word vectors in the training set, together with the existing entity type tag of each word vector, are used to train a KNN (K-Nearest Neighbor) classifier and a CRF (Conditional Random Field) tagger, generating a KNN-CRF prediction model for named entity types. In the model prediction stage, an empty reliable result set is introduced, and each new prediction result is added to it; when the number of results in the reliable result set reaches a threshold value, the previous KNN and CRF models are discarded, the results in the reliable result set are added to the training set, and the KNN classifier and the CRF tagging model are trained again; and the above steps are repeated until a condition is met.
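The training-stage data preparation (segmented words mapped to word vectors and paired with their entity type tags) might look like the sketch below. Everything here is an illustrative assumption: the small `EMBEDDINGS` table stands in for a real word2vec vector space, the zero-vector fallback for unknown words and the B-/I-/O tag scheme are not stated in the source, and `build_training_pairs` is a hypothetical helper.

```python
# Hypothetical toy embedding table standing in for a word2vec vector space.
EMBEDDINGS = {
    "南京": (0.9, 0.1),
    "大学": (0.8, 0.3),
    "位于": (0.1, 0.9),
}

# Assumed fallback vector for words missing from the embedding table.
UNKNOWN = (0.0, 0.0)


def build_training_pairs(tokens, tags):
    """Pair each segmented word's vector with its entity type tag.

    `tokens` is the output of a word segmenter (assumed to be done already)
    and `tags` holds one entity type tag per token.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return [(EMBEDDINGS.get(tok, UNKNOWN), tag) for tok, tag in zip(tokens, tags)]


# Already-segmented sentence with assumed tags.
pairs = build_training_pairs(
    ["南京", "大学", "位于", "江苏"],
    ["B-ORG", "I-ORG", "O", "B-LOC"],
)
```

The resulting list of (vector, tag) pairs is the kind of input a KNN classifier or a sequence tagger could be trained on; how the two models are combined is not specified in the text shown here.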

Description

Technical field

[0001] The invention relates to a named entity recognition method, in particular an open-domain named entity recognition method based on semi-supervised learning.

Background technique

[0002] With the rapid development of information technology, society has entered a period of data explosion: individuals, businesses, and governments generate massive amounts of data every moment. Extracting the valuable information contained in these data is therefore very important. Named entity recognition is an application-driven discipline that addresses this task by using computer technology to extract valuable information and knowledge from text data. The traditional approach to named entity recognition relies on keyword retrieval and related rules, such as matching against keywords in a database and matching fixed sentence patterns to extract target data. However, the disadvantage of this method is that it is difficult to ensure the integrity of the database dat...


Application Information

IPC(8): G06F17/27, G06F17/30
CPC: G06F40/284, G06F40/295
Inventor: 吴骏, 陈鹏飞, 唐思雨, 孙伟, 王崇骏
Owner: NANJING UNIV