An entity resolution method suitable for big data environments with anti-noise ability

An entity resolution and anti-noise technology, applied in the field of entity resolution, that addresses the reduced accuracy and recall of clustering results. It improves anti-noise ability and result quality and weakens the influence of data noise.

Inactive Publication Date: 2016-10-12
北京交通大学长三角研究院

AI Technical Summary

Problems solved by technology

When the cost is calculated, every node in a class casts a vote, and this dispersion of voting rights reduces the accuracy and recall of the clustering results.



Examples


Embodiment Construction

[0055] An entity resolution method suitable for big data environments with anti-noise ability. The method improves on the traditional correlation clustering method. By introducing the concepts of neighbor relationship and core, it is realized as a two-layer algorithm. The upper-layer algorithm uses the neighbor relationship to perform rough pre-blocking of the data that allows overlap. The lower-layer algorithm introduces the concept of cores to define precisely the degree of association between nodes and classes, so that the attribution of each node can be determined accurately and the accuracy of correlation clustering is improved. The method specifically includes the following steps:

[0056] (1) For unclassified records, first use a rough similarity function to calculate the similarity between each pair of nodes; Jaccard similarity is chosen here. For two sets S and T, Jaccard similarity is defined as shown in Formula 1:

[0057] Jaccard ( ...
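Formula 1 is truncated in this copy. For reference, the standard textbook definition of Jaccard similarity for two sets is Jaccard(S, T) = |S ∩ T| / |S ∪ T|. The following is a minimal sketch of that standard definition, not the patent's own text. The `tokens` helper, the whitespace tokenization, and the choice to return 0.0 for two empty sets are assumptions made for illustration.

```python
def jaccard(s: set, t: set) -> float:
    """Standard Jaccard similarity |S ∩ T| / |S ∪ T|.

    Assumption: two empty sets are given similarity 0.0 to avoid
    dividing by zero; the source does not say how this case is handled.
    """
    union = s | t
    if not union:
        return 0.0
    return len(s & t) / len(union)


def tokens(record: str) -> set:
    """Hypothetical helper: turn a record string into a set of lowercase words."""
    return set(record.lower().split())


a = tokens("Beijing Jiaotong University")
b = tokens("Beijing Jiaotong Univ")
print(jaccard(a, b))  # 2 shared tokens out of 4 distinct tokens -> 0.5
```

Because the function only needs set intersection and union, it is cheap enough to serve as the rough first-pass similarity that step (1) describes.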


Abstract

The invention discloses an entity resolution method that is suitable for big data environments and resistant to noise. The method improves on the traditional correlation clustering method by introducing the concepts of neighborhood and core, and is implemented as a two-layer algorithm. The upper-layer algorithm performs rough pre-partitioning of the data, allowing overlap, based on the neighborhood. The lower-layer algorithm uses the core concept to define precisely the degree of relevancy between a node and a class, so that the class a node belongs to can be judged accurately and the accuracy of correlation clustering is improved.
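The rough, overlap-allowing pre-partitioning of the upper layer might be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: the text available here does not define the neighbor relationship, so the sketch assumes two records are neighbors when their Jaccard similarity reaches a threshold, and that each record's block is the record plus its neighbors. The names `neighbor_blocks` and `threshold`, and the sample records, are hypothetical.

```python
from typing import Dict, Set


def jaccard(s: Set[str], t: Set[str]) -> float:
    """Standard Jaccard similarity; 0.0 for two empty sets (assumption)."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0


def neighbor_blocks(records: Dict[str, Set[str]],
                    threshold: float) -> Dict[str, Set[str]]:
    """Build one block per record: the record itself plus its neighbors.

    Assumption: "neighbor" means Jaccard similarity >= threshold.
    A record can appear in several blocks, so the blocks may overlap.
    """
    ids = list(records)
    blocks = {rid: {rid} for rid in ids}
    # Naive all-pairs comparison, kept simple for illustration only.
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if jaccard(records[a], records[b]) >= threshold:
                blocks[a].add(b)
                blocks[b].add(a)
    return blocks


records = {
    "r1": {"john", "smith", "acme"},
    "r2": {"john", "smith"},
    "r3": {"smith", "j", "corp"},
    "r4": {"j", "corp"},
}
blocks = neighbor_blocks(records, threshold=0.25)
for rid in sorted(blocks):
    print(rid, sorted(blocks[rid]))
```

In this toy input, `r2` and `r3` each fall into more than one block, which is the kind of overlap the abstract allows. The all-pairs loop is quadratic in the number of records; a real big data setting would need an indexing or filtering step, which the available text does not describe.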

Description

Technical field

[0001] The invention belongs to the field of data integration systems, and in particular relates to an entity resolution method that is suitable for big data environments and has anti-noise capability.

Background

[0002] With the development of information technology, the ways of acquiring data have diversified, and data quality has become the biggest challenge people face. How to obtain useful information from massive data quickly and efficiently has become a focus of research. Data integration can not only enrich the content of a single data source, but also improve the accuracy of the data. Entity resolution is an important step in data integration; its task is to find records from different data sources that refer to the same real-world entity. Existing entity resolution algorithms are usually based on feature matching, which not only requires a large computational overhead, but also heavily relies on the existence of...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
CPC: G06F16/35
Inventor: 王宁, 李杰
Owner: 北京交通大学长三角研究院