Similarity evaluation method of approximately duplicate records

A similarity evaluation technique for near-duplicate record identification under big data. It addresses the poor handling of records with missing or noisy values, overcomes inaccurate similarity calculation, avoids the cost of labeling data, and improves accuracy.

Active Publication Date: 2015-08-19
EAST CHINA NORMAL UNIVERSITY

AI Technical Summary

Problems solved by technology

However, each method is only effective for specific variable types, and it does




Embodiment Construction

[0024] The present invention is described in further detail below with reference to specific embodiments and the accompanying drawings. Except for the content specifically mentioned below, the processes, conditions, and experimental methods for implementing the invention are common knowledge in this field, and the present invention places no special limitation on them.

[0025] The definitions of the technical terms involved in the present invention are as follows:

[0026] A record is composed of a number of attributes and reflects a real-world entity. Figure 2 shows an example of a record containing a complex text type.

[0027] An attribute is a part of a record that describes an inherent property of the entity; it is also called a field.

[0028] Deduplication is the operation of finding the records in a record set that refer to the same entity.

[0029] Attribute level similarity refers to the simil...


Abstract

The present invention discloses a method for evaluating the similarity of near-duplicate records. The method comprises: step 1, partitioning the large data set to be deduplicated into a plurality of smaller data blocks; step 2, for each data block, initializing the attribute-layer and record-layer similarities; step 3, while the iteration stop condition is not satisfied, using the record-layer similarity to update the attribute-layer similarity, and using the attribute-layer similarity to update the record-layer similarity; and step 4, outputting the attribute-layer and record-layer similarities. Because similarity is propagated iteratively between the attribute layer and the record layer, the method addresses the missing values and noisy values that occur in records in practice, and evaluates record similarity more accurately. The method is unsupervised, which reduces the cost of labeling data, and its output can be flexibly integrated into conventional aggregation-based or distance-based deduplication frameworks.
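The four steps can be illustrated with a minimal Python sketch. This excerpt does not give the patent's actual update formulas, so the specifics here are assumptions made only for illustration: the blocking key (first letter of one attribute), the initial string measure (`difflib.SequenceMatcher`), the mixing weight `alpha`, and the stop condition (`tol`, `max_iter`). The function names `block_records`, `string_sim`, `value_key`, and `evaluate_block` are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block_records(records, key):
    """Step 1: split record indices into blocks (assumed key: first letter)."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        value = rec.get(key) or ""
        blocks[value[:1].lower()].append(idx)
    return list(blocks.values())


def string_sim(a, b):
    """Assumed initial attribute-level similarity between two string values."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_key(attr, a, b):
    """Order-independent key for a pair of values of the same attribute."""
    return (attr, *sorted((a, b)))


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def evaluate_block(records, block, attributes, alpha=0.5, tol=1e-6, max_iter=100):
    """Steps 2-4 for one block; returns (attribute-level, record-level) dicts."""
    pairs = list(combinations(block, 2))
    keys_of = {p: [] for p in pairs}   # record pair -> value pairs it contains
    pairs_of = defaultdict(list)       # value pair  -> record pairs containing it
    for p in pairs:
        for attr in attributes:
            a, b = records[p[0]].get(attr), records[p[1]].get(attr)
            if a is None or b is None:
                continue               # missing values contribute no value pair
            k = value_key(attr, a, b)
            keys_of[p].append(k)
            pairs_of[k].append(p)

    # Step 2: initialise the attribute level, then the record level from it.
    base = {k: string_sim(k[1], k[2]) for k in pairs_of}
    attr_sim = dict(base)
    rec_sim = {p: mean(attr_sim[k] for k in keys_of[p]) for p in pairs}

    # Step 3: alternate the two updates until the record level stops changing.
    for _ in range(max_iter):
        attr_sim = {
            k: (1 - alpha) * base[k] + alpha * mean(rec_sim[p] for p in pairs_of[k])
            for k in base
        }
        new_rec = {p: mean(attr_sim[k] for k in keys_of[p]) for p in pairs}
        delta = max((abs(new_rec[p] - rec_sim[p]) for p in pairs), default=0.0)
        rec_sim = new_rec
        if delta < tol:
            break

    # Step 4: output both levels.
    return attr_sim, rec_sim
```

In this sketch a value pair such as ("IBM", "International Business Machines") is shared by every record pair that contains it. Evidence from record pairs that agree on other attributes therefore raises its attribute-level similarity, which in turn raises the record-level similarity of pairs where the other attributes are missing. With `alpha < 1` and averaging updates, each round shrinks the change by at least a factor of `alpha`, so the loop settles.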

Description

Technical field

[0001] The invention relates to near-duplicate record identification under big data, and in particular to a method for evaluating the similarity between records.

Background

[0002] In the era of big data, integrating data from various sources is the most basic part of generating value from data, and deduplication through near-duplicate record identification is its core step. A record usually consists of multiple attribute values. Existing identification methods fall mainly into the following categories: (1) probabilistic matching methods, which use conditional independence assumptions or the generalized Expectation Maximization (EM) algorithm to infer the probability that a single record pair matches, where each observation is the value of an attribute in the record; (2) distance-based methods, which use different similarity measures to calculate the similarity at the attribute layer and by setting diffe...


Application Information

IPC(8): G06F17/30
CPC: G06F16/35
Inventors: 兰曼 (Lan Man), 赵江 (Zhao Jiang)
Owner: EAST CHINA NORMAL UNIVERSITY