Similarity evaluation method of approximately duplicate records

A similarity evaluation technique for near-duplicate records, applied to near-duplicate record identification under big data. It addresses the poor handling of missing and noisy values, overcomes inaccurate similarity calculation, avoids the cost of labeling data, and improves accuracy.

Active Publication Date: 2015-08-19

EAST CHINA NORMAL UNIVERSITY



AI Technical Summary

Problems solved by technology

Each existing method is effective only for specific variable types and performs poorly on missing or noisy values, which are especially common in data from the Internet.


Embodiment Construction

[0024] The present invention is described in further detail below with reference to specific embodiments and the accompanying drawings. Except for the content specifically mentioned below, the processes, conditions, and experimental methods for implementing the invention are general knowledge in this field, and the invention places no special limitation on them.

[0025] The definitions of the technical terms involved in the present invention are as follows:

[0026] A record is composed of attributes and reflects a real-world entity. Figure 2 shows an example of a record containing a complex text type.

[0027] An attribute is a part of a record that describes an inherent property of the entity; it can also be called a field.

[0028] Deduplication refers to the operation of finding records pointing to the same entity in a record set.

[0029] Attribute level similarity refers to the simil...


Abstract

The present invention discloses a similarity evaluation method for near-duplicate records. The method comprises: step 1, partitioning the large data set to be deduplicated into a plurality of smaller data blocks; step 2, for each data block, initializing the attribute-level and record-level similarities; step 3, while the iteration stop condition is not satisfied, using the record-level similarity to update the attribute-level similarity, and using the attribute-level similarity to update the record-level similarity; and step 4, outputting the attribute-level and record-level similarities. Because similarity is propagated iteratively between the attribute level and the record level, the method handles the missing values and noisy values that records contain in practice and evaluates record similarity more accurately. The method is unsupervised, which reduces the cost of labeling data, and its output can be flexibly integrated into conventional deduplication frameworks based on aggregation or distance.
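The four steps in the abstract can be illustrated with a minimal Python sketch. The text available here does not give the actual update formulas, so everything concrete below is an assumption made for illustration: the blocking key, the base string measure (`string_sim`, built on the standard library's `difflib`), the mixing weight `alpha`, and the update rule itself, in which the similarity of an attribute value pair is raised toward the mean similarity of the record pairs that contain it. It is not the patented method, only one way an iterative propagation of this kind could look.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def partition(records, key):
    """Step 1: split the record set into blocks sharing a blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    return list(blocks.values())


def string_sim(u, v):
    """Assumed base measure: character-level similarity in [0, 1]."""
    return SequenceMatcher(None, u.lower(), v.lower()).ratio()


def evaluate_block(block, alpha=0.5, tol=1e-4, max_iter=50):
    """Steps 2-4 for one block; returns (attr_sim, rec_sim).

    attr_sim is keyed by (attribute, frozenset of the two values),
    rec_sim by (i, j) index pairs into the block.
    """
    pairs = list(combinations(range(len(block)), 2))
    base, support, pair_keys = {}, defaultdict(list), defaultdict(list)
    for i, j in pairs:
        for a in sorted(block[i].keys() & block[j].keys()):
            u, v = block[i][a], block[j][a]
            if u is None or v is None:
                continue  # missing values contribute nothing directly
            k = (a, frozenset((u, v)))
            base[k] = string_sim(u, v)
            support[k].append((i, j))
            pair_keys[(i, j)].append(k)

    def record_level(attr_sim):
        # Record-level similarity: mean over the attributes both records have.
        return {p: (sum(attr_sim[k] for k in pair_keys[p]) / len(pair_keys[p])
                    if pair_keys[p] else 0.0)
                for p in pairs}

    attr_sim = dict(base)                 # step 2: initialize both levels
    rec_sim = record_level(attr_sim)
    for _ in range(max_iter):             # step 3: alternate the two updates
        new_attr = {}
        for k in base:
            rec_mean = sum(rec_sim[p] for p in support[k]) / len(support[k])
            mixed = alpha * base[k] + (1 - alpha) * rec_mean
            new_attr[k] = max(base[k], mixed)  # never drop below the base value
        new_rec = record_level(new_attr)
        delta = max((abs(new_rec[p] - rec_sim[p]) for p in pairs), default=0.0)
        attr_sim, rec_sim = new_attr, new_rec
        if delta < tol:                   # assumed stop condition
            break
    return attr_sim, rec_sim              # step 4: output both levels


records = [
    {"name": "Jon Smith", "city": "NYC", "zip": "10001"},
    {"name": "John Smith", "city": "New York", "zip": "10001"},
    {"name": "Mary Jones", "city": "Boston", "zip": None},
]
for block in partition(records, key=lambda r: r["name"][0]):
    attr_sim, rec_sim = evaluate_block(block)
    print(rec_sim)
```

In this toy data the values "NYC" and "New York" look dissimilar as strings, but because the two records that contain them agree on name and zip code, the record-level similarity pulls the attribute-level similarity of that value pair upward on each pass. The resulting `rec_sim` scores could then be thresholded or passed to a downstream deduplication step.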

Description

Technical field

[0001] The invention relates to near-duplicate record identification under big data, and to a method for evaluating the similarity between records.

Background

[0002] In the era of big data, integrating data from various sources is the most basic part of generating value from data, and deduplication through near-duplicate record identification is the core step. A record usually consists of multiple attribute values. Existing identification methods fall mainly into the following categories: (1) probability-matching methods, which use conditional independence assumptions or the generalized Expectation Maximization (EM) algorithm to infer the matching probability of a single record pair, where each observation is the value of an attribute in the record; (2) distance-based methods, which use different similarity measures to calculate attribute-level similarity and by setting diffe...


Application Information


IPC(8): G06F17/30

CPC: G06F16/35

Inventors: 兰曼, 赵江

Owner: EAST CHINA NORMAL UNIVERSITY
