Chinese short text entity identification and disambiguation method based on enhanced character vector

Entity recognition and short-text processing techniques, applied in the field of natural language processing (NLP), address problems such as the difficulty of extracting useful semantic information from short texts.

Active Publication Date: 2020-03-06
TONGJI UNIV

AI Technical Summary

Problems solved by technology

However, this approach is not suitable for short texts, because the clauses on both sides are shorter than the original text, which makes it even more difficult to extract useful semantic information.



Examples

Embodiment

[0117] The main steps of the first part of entity recognition are:

[0118] 1.1 Input the Chinese short text "比特币吸粉无数" ("Bitcoin attracts countless fans") and obtain the character sequence ['比', '特', '币', '吸', '粉', '无', '数'], which contains 7 characters; pre-train with the Word2vec method to obtain a 300-dimensional character vector;
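As a rough illustration of step 1.1, the sketch below splits the example sentence into single characters and looks each one up in a character-embedding table. The sentence is written as 比特币吸粉无数, which is assumed to be the Chinese text behind the translated example. The table is filled with random numbers purely as a placeholder for Word2vec-trained vectors, and the names `split_characters` and `build_placeholder_table` are hypothetical.

```python
import random

EMBEDDING_DIM = 300  # dimension of the character vectors stated in step 1.1


def split_characters(text):
    """Split a Chinese short text into its sequence of single characters."""
    return list(text)


def build_placeholder_table(chars, dim=EMBEDDING_DIM, seed=0):
    """Map each distinct character to a random vector.

    Placeholder only: in the described method these vectors come from
    Word2vec pre-training, not from a random number generator.
    """
    rng = random.Random(seed)
    return {
        ch: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for ch in sorted(set(chars))
    }


text = "比特币吸粉无数"  # assumed original of "Bitcoin attracts countless fans"
chars = split_characters(text)
table = build_placeholder_table(chars)
char_vectors = [table[ch] for ch in chars]  # one 300-dimensional vector per character
```

Swapping the placeholder table for vectors loaded from a trained Word2vec model would leave the lookup in the last line unchanged.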

[0119] 1.2 Input the short Chinese text described in 1.1 into the BERT language model pre-trained on a large-scale corpus, and obtain a 768-dimensional character context vector;

[0120] 1.3 Cut the short Chinese text described in 1.1 into the bi-gram sequence ['比特', '特币', '币吸', '吸粉', '粉无', '无数'], then train with the Word2vec method to obtain 300-dimensional adjacent-character vectors.
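The bi-gram cut in step 1.3 can be sketched in a few lines. This is a minimal illustration, assuming the example sentence is 比特币吸粉无数 in the original Chinese; the function name `char_bigrams` is hypothetical, and the Word2vec training that follows the cut is not shown.

```python
def char_bigrams(text):
    """Return every pair of adjacent characters, in reading order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


text = "比特币吸粉无数"  # assumed original of "Bitcoin attracts countless fans"
bigrams = char_bigrams(text)
# A text of n characters yields n - 1 bi-grams: 6 for this 7-character example.
```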

[0121] 1.4 Import the mention dictionary library into the jieba word segmentation tool, then segment the short Chinese text described in 1.1. The resulting word sequence is: [...
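To give a feel for how a mention dictionary can steer segmentation, the sketch below uses forward maximum matching over a toy dictionary. This is a deliberately simple stand-in and not jieba's algorithm; the dictionary contents, the function name `forward_max_match`, and the Chinese form of the example sentence are all assumptions, and the output is not the word sequence given in the patent.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy left-to-right segmentation that prefers the longest dictionary word.

    Characters not covered by any dictionary word are emitted one by one.
    """
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


toy_mentions = {"比特币", "比特"}  # hypothetical mention dictionary entries
text = "比特币吸粉无数"  # assumed original of "Bitcoin attracts countless fans"
words = forward_max_match(text, toy_mentions)
```

With this toy dictionary the longer mention 比特币 is kept whole rather than being split at 比特, which is the effect a mention dictionary is meant to have on the segmenter.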


Abstract

The invention relates to a Chinese short text entity identification and disambiguation method based on an enhanced character vector. The method comprises the following steps: 1, performing entity identification on an input text by combining a mention library and the context; and 2, performing entity disambiguation on the text that has undergone entity identification, according to the semantic matching between the entity to be disambiguated and each candidate entity. Compared with the prior art, the invention realizes Chinese short text entity identification and disambiguation by feeding a neural network with an enhanced character vector, which introduces mention dictionary library information and mention position information.
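The two steps can be pictured with a toy pipeline. The sketch below is only a schematic under loud assumptions: recognition is plain dictionary lookup and disambiguation uses character-set overlap as a crude placeholder for semantic matching, whereas the described method uses neural networks for both. The knowledge base, entity ids, example sentence, and all function names are hypothetical.

```python
def recognize_mentions(text, mention_kb):
    """Step 1 stand-in: return (start, end, mention) for each dictionary mention in text."""
    found = []
    for mention in mention_kb:
        start = text.find(mention)
        while start != -1:
            found.append((start, start + len(mention), mention))
            start = text.find(mention, start + 1)
    return sorted(found)


def char_overlap(a, b):
    """Jaccard overlap of character sets; a placeholder for semantic matching."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def disambiguate(text, mention, mention_kb):
    """Step 2 stand-in: pick the candidate whose description best overlaps the text."""
    candidates = mention_kb[mention]
    best = max(candidates, key=lambda c: char_overlap(text, c["description"]))
    return best["id"]


# Hypothetical knowledge base: mention string -> candidate entities.
mention_kb = {
    "苹果": [
        {"id": "apple_fruit", "description": "一种水果"},
        {"id": "apple_inc", "description": "一家生产手机的公司"},
    ],
}

text = "苹果发布新手机"
mentions = recognize_mentions(text, mention_kb)
linked = [(m, disambiguate(text, m, mention_kb)) for _, _, m in mentions]
```

The `(start, end)` offsets returned by the recognition step correspond to the kind of mention position information that the abstract says is passed on to the disambiguation model.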

Description

Technical field

[0001] The invention relates to the field of natural language processing (NLP) and to a method for entity linking of short Chinese texts, in particular to a method for entity recognition and disambiguation of short Chinese texts based on enhanced character vectors.

Background technique

[0002] Entity Linking (EL) aims to identify potentially ambiguous mentions of entities in text and link them to a target Knowledge Base (KB). This is an essential step for many NLP tasks such as knowledge fusion, knowledge base construction, and knowledge-base question answering. EL systems usually consist of two subtasks: (1) Entity Recognition (ER): extract all potential entity references (i.e. mentions) from text fragments; (2) Entity Disambiguation (ED): map these ambiguous mentions to the correct entities in the KB.

[0003] Entity linking has been studied for many years and has achieved great progress with the help of neural networks. But most rese...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F40/295, G06N3/04
CPC: G06N3/044, G06N3/045, Y02D10/00
Inventor: 向阳, 杨力, 徐忠国
Owner: TONGJI UNIV