A stochastic algorithm-based distributed entity matching method

An entity matching method based on a random algorithm, applied in computing, instruments, file access structures, etc. It addresses problems such as poor performance and the inability to fully exploit the advantages of distributed concurrency. Its effects are improved performance, reduced network transmission overhead, and guaranteed matching accuracy.

Inactive Publication Date: 2017-01-11
EAST CHINA NORMAL UNIVERSITY

AI Technical Summary

Problems solved by technology

On the other hand, when some traditional entity matching methods are transplanted to a distributed environment, they usually cannot make full use



Examples


Embodiment Construction

[0028] The present invention is described in further detail below with reference to specific embodiments and the accompanying drawings. Except where specifically noted below, the processes, conditions, and experimental methods used to implement the invention are common knowledge in this field, and the invention places no special limitation on them.

[0029] The random algorithm-based distributed entity matching method of the present invention supports matching over massive numbers of entities. The invention formulates an effective data storage strategy on an open-source distributed platform and uses efficient data indexing to support time-sensitive query processing. It designs a time-sensitive data storage strategy so that queries can locate files quickly, and it implements an index based on inverted-index technology that provides efficient file filtering for queries.

[0030] As shown in Figure 1...



Abstract

The invention provides a stochastic algorithm-based distributed entity matching method. The method comprises four steps. A data preprocessing step performs feature extraction on the original data and generates entities and their vectors. A signature generating step generates a plurality of random vectors according to the entities and their vectors, generates a signature corresponding to each random vector, applies multiple random transformations to the signatures, and transmits the entity serial numbers, the transformed signatures, and the transformation sequence numbers to distributed nodes. A matching pair generating step rearranges and groups the signatures in the distributed nodes and extracts matching pairs from the groups. A similarity calculating step obtains the similarity of each matching pair by calculating Hamming distances. The solution reduces redundant similarity calculations and effectively increases entity matching efficiency for both structured and unstructured data in a distributed environment. While accuracy is guaranteed, the processing speed is clearly higher than that of conventional, relatively advanced entity matching methods.
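The four steps in the abstract can be sketched on a single machine as follows. This is only an illustrative sketch under several assumptions that the text above does not confirm: signatures are taken to be sign bits of dot products with Gaussian random (stochastic) vectors, each "random transformation" is taken to be a random permutation of bit positions, the distributed nodes are simulated by a dictionary keyed by transformation sequence number, and candidate pairs are taken from neighbours within a small window of the sorted signatures. The names `make_signature`, `match_entities`, `num_bits`, `num_perms`, `window`, and `threshold` are hypothetical.

```python
import random
from collections import defaultdict


def make_signature(vector, random_vectors):
    """Assumed scheme: one bit per random vector, 1 if the dot product is >= 0."""
    return tuple(
        1 if sum(v * r for v, r in zip(vector, rv)) >= 0 else 0
        for rv in random_vectors
    )


def hamming(a, b):
    """Number of positions at which two equal-length signatures differ."""
    return sum(x != y for x, y in zip(a, b))


def match_entities(entities, num_bits=64, num_perms=8, window=2,
                   threshold=0.9, seed=0):
    """entities maps an entity id to its feature vector (preprocessing output)."""
    rng = random.Random(seed)
    dim = len(next(iter(entities.values())))

    # Signature generation: random vectors, then one bit per random vector.
    random_vectors = [[rng.gauss(0, 1) for _ in range(dim)]
                      for _ in range(num_bits)]
    signatures = {eid: make_signature(vec, random_vectors)
                  for eid, vec in entities.items()}

    # Random transformations (assumed: bit permutations). Each record carries
    # the transformed signature and entity id, routed by transformation number.
    perms = [rng.sample(range(num_bits), num_bits) for _ in range(num_perms)]
    nodes = defaultdict(list)  # stand-in for the distributed nodes
    for eid, sig in signatures.items():
        for t, perm in enumerate(perms):
            nodes[t].append((tuple(sig[i] for i in perm), eid))

    # Matching pair generation: sort each group and pair nearby neighbours.
    candidates = set()
    for records in nodes.values():
        records.sort()
        for i, (_, a) in enumerate(records):
            for _, b in records[i + 1:i + 1 + window]:
                candidates.add(tuple(sorted((a, b))))

    # Similarity calculation: Hamming distance on the original signatures.
    results = {}
    for a, b in candidates:
        similarity = 1 - hamming(signatures[a], signatures[b]) / num_bits
        if similarity >= threshold:
            results[(a, b)] = similarity
    return results
```

Because each candidate pair is stored once in a set, a pair that appears as neighbours under several transformations is compared only once, which is one simple way to avoid redundant similarity calculations in this sketch.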

Description

Technical field

[0001] The invention belongs to the technical field of data integration and management, and in particular relates to a random algorithm-based distributed entity matching method.

Background technique

[0002] Entity matching (also known as entity resolution, data association, or duplicate detection) aims to identify records in a target data set that describe the same entity or object, and to integrate and clean the data by screening and merging the multiple records that describe the same entity. For example, in a customer-to-customer (C2C) online marketplace, anyone can easily open an online store and list whatever they want to sell, so the same item is likely to be offered by multiple sellers at different prices, in different quality, and under different product descriptions, which confuses buyers when they choose. The purpose of entity matching is to find out which entity information corresponds to the same product ...


Application Information

IPC(8): G06F17/30
CPC: G06F16/182, G06F16/13
Inventors: 张蓉, 晁平复, 高竹
Owner: EAST CHINA NORMAL UNIVERSITY