Entity recognition method and system considering text semantic information

This technology concerns entity recognition that takes text semantic information into account. It addresses problems in existing methods such as unnecessary computation, similar content not being fully covered by the comparison window, and insufficient use of semantic information. Its stated benefits are reduced time complexity, better entity recognition results, and higher entity recognition efficiency.

Pending Publication Date: 2022-01-25
XIDIAN UNIV

AI Technical Summary

Problems solved by technology

[0009] (2) Existing entity recognition methods compare records using a relatively fixed block size. If the selected block is too large, matching content that has little correlation adds unnecessary computation. If the block is too small, similar content cannot be fully included in the window, so similar records are missed.
[0010] (3) Existing methods must assign a different weight to each attribute in order to calculate the similarity between two records, and determining those weights requires manual involvement. They also make insufficient use of the semantic information in the text, so the recognition effect is poor, the recognition efficiency is low, and the versatility is poor, which greatly hinders the practical application and development of entity recognition.

Examples


Embodiment 1

[0101] The entity recognition method based on the inverted index and the Sentence-BERT (SBERT for short) model provided by the embodiment of the present invention comprises the following steps:

[0102] For the record sets A and B to be identified:

[0103] (1) Data reading and preprocessing:

[0104] Read the contents of record sets A and B respectively, perform preprocessing operations such as word segmentation, spelling correction, lemmatization, and stop word removal on the data contained in the records, and generate record sets A* and B* composed of individual words;
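
The preprocessing step above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `preprocess`, the regular-expression tokenizer, and the tiny `STOP_WORDS` set are all assumptions, and spelling correction and lemmatization are left out.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "with"}


def preprocess(record: str) -> list[str]:
    """Lowercase a record, split it into word tokens, and drop stop words.

    Spelling correction and lemmatization are omitted in this sketch; they
    would normally run between tokenization and stop-word removal.
    """
    tokens = re.findall(r"[a-z0-9]+", record.lower())
    return [t for t in tokens if t not in STOP_WORDS]


# Toy record set A; a_star plays the role of A* (records as lists of words).
records_a = ["The Design of an Inverted Index", "Entity Recognition with SBERT"]
a_star = [preprocess(r) for r in records_a]
```

Each record becomes a list of words, which is the form the later indexing step consumes.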

[0105] (2) Create an inverted index:

[0106] Deduplicate the word content in A* to generate a word dictionary, and use the words in the dictionary as index words to create an inverted index of the record set A;
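
As a rough sketch of this step, the inverted index can be held as a mapping from each distinct word to the ids of the records in A* that contain it. The function name `build_inverted_index` and the toy data are assumptions made for illustration; the source does not specify the storage layout.

```python
from collections import defaultdict


def build_inverted_index(tokenized_records):
    """Map each distinct word to the sorted list of record ids containing it."""
    index = defaultdict(set)
    for record_id, tokens in enumerate(tokenized_records):
        for token in tokens:
            index[token].add(record_id)  # sets deduplicate repeated words
    return {word: sorted(ids) for word, ids in index.items()}


# Toy A* (already preprocessed); the keys of the result form the word dictionary.
a_star = [
    ["entity", "recognition", "sbert"],
    ["inverted", "index", "entity"],
    ["sbert", "semantic", "similarity"],
]
inverted_index = build_inverted_index(a_star)
```

Looking up a word such as `inverted_index["entity"]` then returns every record in A that mentions it without scanning the whole set.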

[0107] (3) Load the SBERT model:

[0108] Load a pre-trained SBERT model so that it is ready for use by the method;

[0109] (4) Calculate the IDF value:

[0110] Calculate the I...

Embodiment 2

[0126] The high-efficiency entity recognition method provided by the present invention, which fully considers text semantic information, is based on the inverted index and the SBERT model. First, the pairs of records to be matched are generated quickly by means of the inverted index and the IDF values of the words in the data source, which improves recognition efficiency. Then the SBERT model fully extracts the semantic information in the text records, and cosine similarity is used to calculate the similarity between records, which improves recognition accuracy. Together these achieve efficient and accurate entity recognition.
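
The two stages described above can be sketched as follows, under several assumptions that are not taken from the source: IDF is taken to be the standard log(N / df), candidates are records sharing at least one word whose IDF reaches a hypothetical threshold `min_idf`, and the short hand-written vectors stand in for real SBERT sentence embeddings. The 0.8 similarity cutoff is likewise illustrative.

```python
import math
from collections import defaultdict


def build_index(tokenized):
    """Inverted index: word -> set of record ids in A containing it."""
    index = defaultdict(set)
    for rid, tokens in enumerate(tokenized):
        for t in tokens:
            index[t].add(rid)
    return index


def idf(word, index, n_records):
    """Standard IDF, log(N / df); assumed, the source does not give its formula."""
    df = len(index.get(word, ()))
    return math.log(n_records / df) if df else 0.0


def candidate_pairs(a_tokens, b_tokens, min_idf):
    """Pair each B record with the A records sharing a word whose IDF >= min_idf."""
    index = build_index(a_tokens)
    pairs = set()
    for b_id, tokens in enumerate(b_tokens):
        for t in set(tokens):
            if idf(t, index, len(a_tokens)) >= min_idf:
                pairs.update((a_id, b_id) for a_id in index.get(t, ()))
    return sorted(pairs)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


a_tokens = [
    ["entity", "recognition", "method"],
    ["data", "cleaning", "method"],
    ["semantic", "similarity", "method"],
]
b_tokens = [["entity", "recognition", "approach"], ["method", "overview"]]

# "method" occurs in every A record, so its IDF is 0 and it generates no pairs.
pairs = candidate_pairs(a_tokens, b_tokens, min_idf=0.5)

# Toy vectors standing in for SBERT sentence embeddings (hypothetical values).
emb_a = {0: [0.9, 0.1, 0.0]}
emb_b = {0: [0.8, 0.2, 0.1]}
matches = [(a, b) for a, b in pairs if cosine(emb_a[a], emb_b[b]) >= 0.8]
```

Filtering on IDF keeps very common words from pairing every record with every other, so the more expensive embedding comparison only runs on a small candidate set.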

[0127] The entity recognition method based on the inverted index and the SBERT model provided by the embodiment of the present invention takes two record sets A and B to be recognized as examples, and includes the following steps:

[0128] 1. Data reading and preprocessing. Read the record collection into the model, and combine the fields o...

Embodiment 3

[0137] The present invention divides the entity recognition algorithm into three main stages: the preparation stage, the processing stage, and the verification stage. The detailed processing steps of each stage are as follows.

[0138] (1) Preparation stage:

[0139] The preparation stage mainly consists of preprocessing the data and building the related indexes. First, determine whether a cache file exists; if it does, load it. Then read the original data file, perform field merging and spelling correction on the information to be processed, preload the SBERT model, and create the inverted index, including the dictionary file, the location file, and so on. Finally, write the processing results and the generated content into cache files for storage. The fields in the file set are merged for the following main reasons. One is that the algorithm can be flexibly applied to all data records mainly...
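
The cache check in the preparation stage might look roughly like the sketch below. It assumes, which the source does not state explicitly, that a cache hit lets the method skip rebuilding the index; the helper `load_or_build`, the JSON format, and the file name `prep_cache.json` are all hypothetical.

```python
import json
import os
import tempfile


def load_or_build(cache_path, build):
    """Return cached preparation results if present; otherwise build and cache them."""
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            return json.load(f)
    result = build()
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(result, f)
    return result


calls = []


def build_index():
    """Stand-in for the expensive preparation work (hypothetical toy output)."""
    calls.append(1)
    return {"dictionary": ["entity", "sbert"], "postings": {"entity": [0], "sbert": [0, 1]}}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "prep_cache.json")
    first = load_or_build(path, build_index)   # cache miss: builds and writes
    second = load_or_build(path, build_index)  # cache hit: reads the file back
```

Here the second call reads the stored file instead of invoking `build_index` again.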


Abstract

The invention belongs to the technical field of data cleaning and data integration application, and discloses an entity recognition method and system considering text semantic information. For the record sets A and B to be recognized, the entity recognition method comprises the following steps: reading and preprocessing the data; creating an inverted index of the data set; loading an SBERT model; calculating the IDF values of the words in the data set; generating the record pairs to be matched; calculating record similarity; and processing and returning a recognition result. Based on the inverted index and the SBERT model, the record pairs to be matched are generated quickly by means of the inverted index and the IDF values of the words in the data source, which improves recognition efficiency. The semantic information in the text records is fully extracted by the SBERT model, and the similarity between records is calculated with cosine similarity, which improves recognition accuracy, so that efficient and accurate entity recognition is achieved. Compared with a traditional entity recognition method, the recall of the entity recognition result on the thesis data set is improved by about 20%, and the precision is improved by about 10%.

Description

Technical field

[0001] The invention belongs to the technical field of data cleaning and data integration application, and in particular relates to an entity recognition method and system considering text semantic information.

Background technique

[0002] At present, with the rapid development of information technology and the continuous acceleration of informatization construction, enterprises and organizations have continuously improved their data acquisition and storage capabilities. A large amount of data is stored in the information systems of these enterprises and institutions. This data has great value, and realizing that value requires data cleaning to transform massive, messy data into high-quality data that is consistent and accurate.

[0003] Entity recognition, also known as duplicate record recognition, record linking, etc., is the process of identifying which records in a data set represent the same entity in the real world. Entity recogn...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F40/295, G06F40/30, G06F16/31
CPC: G06F40/295, G06F40/30, G06F16/319
Inventor: 宗威, 林松涛, 李兵
Owner: XIDIAN UNIV