Data similarity determination method, device and processing device

A data similarity determination method, applied in the field of data processing, that addresses the low efficiency of the LSH algorithm and achieves higher computing speed, improved performance, and high data transmission bandwidth.

Active Publication Date: 2021-10-22
HUAWEI TECH CO LTD

AI Technical Summary

Problems solved by technology

[0005] The present application provides a data similarity determination method, device and processing equipment, which can solve the problem of the low efficiency of the LSH algorithm in the related art.



Examples


Embodiment Construction

[0048] The data similarity determination method provided by the embodiments of the present invention can be applied in a stand-alone environment, that is, on a single processing device. The processing device may be a computer or a server. Taking a single processing device as an example and referring to Figure 1, the processing device may include a memory 01, a hardware processor 02, and a central processing unit (CPU) 03; the CPU 03 may also be referred to as the host of the processing device. The processing device may contain one or more hardware processors 02; Figure 1 shows only one.

[0049] The memory 01 may be a solid state drive (SSD), which generally uses flash memory as its storage medium. On an SSD, random write operations perform worse than sequential write operations and read operations, and write operations will reduce th...



Abstract

The application provides a data similarity determination method, device and processing equipment, relating to the field of data processing. The method includes: obtaining a plurality of hash tables in one-to-one correspondence with a plurality of different hash functions, where each hash table includes at least one hash bucket, each hash bucket records multiple key values, and the tuples indicated by the key values in a bucket have the same hash value; dividing the hash buckets included in the hash tables into at least one cluster, where each cluster includes multiple hash buckets whose similarity is greater than a similarity threshold; and counting, for each cluster, the number of occurrences in its hash buckets of key-value pairs belonging to different data sets, to obtain a statistical frequency for each key-value pair. The statistical frequency is positively correlated with the similarity of the tuple pair indicated by the key-value pair. The method for determining data similarity provided by the present application has high operating efficiency.
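The cluster-and-count steps in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: keys are encoded as hypothetical `(dataset_id, row_id)` tuples, bucket similarity is taken to be Jaccard similarity, clustering is a simple greedy pass against each cluster's first bucket, and a "key-value pair" is read as a pair of keys from different data sets. The abstract specifies none of these choices.

```python
from collections import Counter
from itertools import product


def jaccard(a, b):
    """Jaccard similarity of two buckets (an assumed similarity measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_buckets(buckets, threshold):
    """Greedy clustering: a bucket joins the first cluster whose first
    bucket it resembles above the threshold, else it starts a new cluster."""
    clusters = []
    for bucket in buckets:
        for cluster in clusters:
            if jaccard(bucket, cluster[0]) > threshold:
                cluster.append(bucket)
                break
        else:
            clusters.append([bucket])
    return clusters


def count_cross_pairs(clusters, left_id="A", right_id="B"):
    """Count how often each (left key, right key) pair shares a bucket."""
    counts = Counter()
    for cluster in clusters:
        for bucket in cluster:
            left = [k for k in bucket if k[0] == left_id]
            right = [k for k in bucket if k[0] == right_id]
            for pair in product(left, right):
                counts[pair] += 1
    return counts


# Two hypothetical hash tables, one per hash function: hash value -> keys.
hash_tables = [
    {101: [("A", 1), ("B", 7)], 102: [("A", 2), ("B", 8), ("B", 9)]},
    {201: [("A", 1), ("B", 7), ("B", 9)], 202: [("A", 2), ("B", 8)]},
]
buckets = [bucket for table in hash_tables for bucket in table.values()]
clusters = cluster_buckets(buckets, threshold=0.5)
counts = count_cross_pairs(clusters)
```

In this toy input, the pair `(("A", 1), ("B", 7))` shares a bucket in both tables and so receives a higher count than `(("A", 1), ("B", 9))`, which matches the abstract's statement that the frequency grows with tuple similarity. The sketch merges all clusters into one counter; handling each cluster separately, for example on different hardware processors, would be a design choice beyond what the abstract states.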

Description

Technical field

[0001] The present application relates to the field of data processing, in particular to a method, device and processing equipment for determining data similarity.

Background

[0002] A dataset usually records data in the form of a table, and each row in the table is a tuple (also called a record). Similarity join is a common data set operation, which refers to determining a tuple pair whose similarity is greater than a specified threshold from multiple data sets, and storing the tuple pair in the same row of the data set.

[0003] In related technologies, a locality sensitive hashing (LSH) algorithm is generally used to determine the similarity of tuple pairs belonging to different data sets. Specifically, the LSH algorithm can use multiple different hash functions to perform hash mapping on each tuple in each data set, and obtain the hash value of each tuple under different hash maps; The tuple pairs of the data ...
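The hash-mapping step described in the background can be illustrated with a minimal sketch. It assumes MinHash over token sets as the locality sensitive hash family, hypothetical `(dataset_id, row_id)` keys, and non-empty token sets. The source text does not name a specific hash family, so these are illustrative choices only.

```python
import hashlib
from collections import defaultdict


def stable_hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(tokens, seed):
    """MinHash value of a non-empty token set for one seeded hash function."""
    return min(stable_hash(seed, t) for t in tokens)


def build_hash_tables(datasets, num_tables):
    """Build one hash table per hash function.

    datasets: {dataset_id: {row_id: set_of_tokens}} (hypothetical layout).
    Each table maps a hash value to the keys of the tuples that share it.
    """
    tables = []
    for seed in range(num_tables):
        table = defaultdict(list)
        for ds_id, rows in datasets.items():
            for row_id, tokens in rows.items():
                table[minhash(tokens, seed)].append((ds_id, row_id))
        tables.append(dict(table))
    return tables


datasets = {
    "A": {1: {"red", "apple"}},
    "B": {7: {"red", "apple"}, 8: {"blue", "car"}},
}
tables = build_hash_tables(datasets, num_tables=4)
```

Tuples with identical token sets, such as row 1 of `A` and row 7 of `B` here, land in the same bucket in every table. Under MinHash, two tuples collide in any one table with probability equal to the Jaccard similarity of their token sets, so more similar tuples share buckets in more tables.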


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/22, G06F16/28
CPC: G06F16/2255, G06F16/284
Inventor: 傅忱忱, 薛春, 李建华, 王元钢, 郭鑫
Owner: HUAWEI TECH CO LTD