Entity recognition method based on semi-supervised learning and clustering

A rail transit entity recognition technology, applied in neural network learning methods, text database clustering/classification, and biological neural network models. It addresses problems such as the need to label a large amount of data, and achieves the effect of improving entity extraction speed and accuracy, shortening processing time, and increasing the query rate.

Pending Publication Date: 2021-07-30
XIAN UNIV OF TECH

AI Technical Summary

Problems solved by technology

[0004] The purpose of the present invention is to provide a rail transit entity recognition method based on semi-supervised learning and clustering. It addresses two problems: existing entity recognition methods for rail transit specifications require a large amount of labeled data, and when experts build ontology databases, the limited number of fine-grained entity classification and labeling samples leads to low accuracy of entity recognition results.

Examples


Embodiment

[0075] The present invention provides a named entity recognition method for rail transit specifications based on semi-supervised learning and clustering; the overall framework is shown in Figure 1. First, experts build an ontology database for the rail transit field and manually label part of the data. The labeled entities are then vectorized with the word2vec and BERT pre-trained models. Next, a hierarchical clustering method clusters the entity word vectors, the clusters are checked against the entity categories defined by the experts, and the entity categories are finalized. The training data is then preprocessed and trained again: the generated word vectors are input into the BiLSTM-CRF algorithm to train the named entity recognition model, and the Softmax function is applied to the extracted entity features to iteratively train and optimize the model. Finally, the deep learning model is set up as a server to test the effect of the entity recognition model: the test data set is input into the model to output the entity cat...
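The clustering and proofreading step can be illustrated with a minimal sketch. It assumes toy 3-dimensional vectors in place of real word2vec or BERT embeddings, average-linkage agglomerative clustering with cosine distance (the source does not state the linkage or the distance), and two hypothetical expert categories, "Facility" and "Equipment". It is not the patent's implementation, only one way the idea could look.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def agglomerative_clusters(vectors, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of indices."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None  # (average distance, index a, index b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dists = [cosine_distance(vectors[i], vectors[j])
                         for i in clusters[a] for j in clusters[b]]
                avg = sum(dists) / len(dists)
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters

# Hypothetical labeled entities: (toy vector, expert-defined category).
entities = {
    "station":   ([0.9, 0.1, 0.0], "Facility"),
    "platform":  ([0.8, 0.2, 0.1], "Facility"),
    "elevator":  ([0.1, 0.9, 0.1], "Equipment"),
    "escalator": ([0.2, 0.8, 0.0], "Equipment"),
}
names = list(entities)
vectors = [entities[name][0] for name in names]
clusters = agglomerative_clusters(vectors, n_clusters=2)

# Proofreading: a cluster that mixes expert categories is flagged for review.
for members in clusters:
    categories = {entities[names[i]][1] for i in members}
    status = "consistent" if len(categories) == 1 else "needs expert review"
    print([names[i] for i in members], sorted(categories), status)
```

A cluster whose members carry more than one expert category is the kind of disagreement that the proofreading step would send back to the experts.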

Example

[0130] Entity labeling of the rail transit specification corpus proceeds in the following steps:

[0131] Step 11.3.1: Taking the subway design specification clause "9.1.6 Stations should be equipped with barrier-free facilities" as an example, the training set is vectorized by the BERT model. Each word in "Stations should be equipped with barrier-free facilities" is mapped to a 768-dimensional vector, which serves as that word's initialization vector; the result is then used as the input of the deep learning model.

[0132] Step 11.3.2: The BiLSTM-CRF deep learning algorithm is applied. A bidirectional LSTM considers both past features and future features by processing a forward input sequence and a reverse input sequence, in order to predict the semantics of words in context. For example, after "station" is input, BiLSTM predicts the probability that the next word is "should"; then "station should" is input to predict the probability that the next word is "be equipped", which is a forward inpu...
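The idea of combining a forward and a reverse pass can be shown with a small sketch. It assumes one scalar feature per token and a plain tanh recurrent cell standing in for a full LSTM cell; the weights and feature values are hypothetical. The point is only that each position ends up with one state summarizing its left context and one summarizing its right context.

```python
import math

def step(state, x, w_state=0.5, w_input=1.0):
    """One recurrent update (a tanh cell used here in place of an LSTM cell)."""
    return math.tanh(w_state * state + w_input * x)

def bidirectional_states(inputs):
    """Return (forward_state, backward_state) for every position."""
    forward, state = [], 0.0
    for x in inputs:                 # left-to-right: past features
        state = step(state, x)
        forward.append(state)
    backward, state = [], 0.0
    for x in reversed(inputs):       # right-to-left: future features
        state = step(state, x)
        backward.append(state)
    backward.reverse()               # realign with the original token order
    return list(zip(forward, backward))

# Hypothetical scalar features for "Stations should be equipped".
features = [0.2, -0.4, 0.1, 0.7]
states = bidirectional_states(features)

# Changing only the last token leaves the forward state of the first token
# untouched but changes its backward state.
changed = bidirectional_states([0.2, -0.4, 0.1, -0.9])
print(states[0][0] == changed[0][0])   # forward state is unchanged
print(states[0][1] != changed[0][1])   # backward state differs
```

In a BiLSTM-CRF tagger the two states at each position would typically be concatenated and passed on to the tagging layer.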

Abstract

The invention relates to an entity recognition method based on semi-supervised learning and clustering. The method comprises the steps of: pre-defining entity categories through the ontology library to label unstructured rail transit specification data; vectorizing the labeled data with word2vec, and then applying a hierarchical clustering algorithm to the labeled entity word vectors; jointly analyzing the entity categories and the clustering results, proofreading the entity category definitions, and finally determining the entity types of the ontology library in the field of rail transit; and finally, rearranging the data set and inputting the generated word vectors into a BiLSTM-CRF deep learning model to train a named entity recognition model, wherein a Softmax function is used to classify the tags of the recognized entity features, and the entity tag classification result is evaluated. The method can improve the speed and accuracy of entity extraction from rail transit specifications, thereby shortening the time that automatic question answering systems and semantic network labeling need to process the specifications, improving the rate at which employees in the building field can query the specifications, and improving the user experience.
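The Softmax tag classification mentioned in the abstract can be sketched as follows. The tag set (BIO tags for a hypothetical "Facility" category) and the score values are assumptions for illustration; the source does not list its tags or say how scores are produced beyond the BiLSTM-CRF model.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)                        # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical tag set; the real categories come from the ontology library.
TAGS = ["O", "B-Facility", "I-Facility"]

def classify(scores):
    """Return the most probable tag and its probability for one token."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return TAGS[best], probs[best]

# Hypothetical per-token scores for a single token.
tag, prob = classify([0.1, 2.5, 0.3])
print(tag, round(prob, 3))
```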

Description

Technical field

[0001] The invention belongs to the technical field of natural language processing in artificial intelligence, and relates to a rail transit entity recognition method based on semi-supervised learning and clustering.

Background

[0002] In recent years, artificial intelligence has become an important development direction for industry, and natural language processing is an important research direction within it. Its research results have been applied in the medical, legal, financial and other industries, greatly improving the level of intelligence in those fields. The field of rail transit also contains a large amount of text information, yet there are very few related studies in this field. In existing natural language processing research, methods for extracting information from rail transit regulations are mainly aimed at English rail transit regulations, while the r...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F40/295, G06F16/35, G06N3/04, G06N3/08
CPC: G06F40/295, G06F16/353, G06N3/08, G06N3/044
Inventors: 黑新宏, 董林靖, 朱磊
Owner: XIAN UNIV OF TECH