Data similarity determination method and device and processing equipment

A data similarity determination technology, applied in the field of data processing, that addresses the low efficiency of the LSH algorithm and achieves reduced computational complexity, improved performance, and improved efficiency.

Active Publication Date: 2019-10-08
HUAWEI TECH CO LTD
Cites: 14 | Cited by: 6

AI Technical Summary

Problems solved by technology

[0005] The present application provides a data similarity determination method, device, and processing equipment that can solve the problem of the low efficiency of the LSH algorithm in the related art.

Examples

Embodiment Construction

[0048] The data similarity determination method provided by the embodiment of the present invention can be applied to a stand-alone environment, that is, a single processing device. The processing device may be a computer or a server. Taking a single processing device as an example, and referring to Figure 1, the processing device may include a memory 01, a hardware processor 02, and a central processing unit (CPU) 03; the CPU 03 may also be referred to as the host of the processing device. The processing device may contain one or more hardware processors 02; Figure 1 shows only one.

[0049] The memory 01 may be a solid state drive (SSD), which generally uses flash memory as its storage medium. On an SSD, random write operations perform worse than sequential write operations and read operations, and write operations will reduce th...

Abstract

The invention provides a data similarity determination method, device, and processing equipment, and relates to the field of data processing. The method comprises: obtaining a plurality of hash tables that correspond one-to-one to a plurality of different hash functions, wherein each hash table comprises at least one hash bucket, a plurality of key values are recorded in each hash bucket, and the tuples indicated by the key values in a bucket have the same hash value; dividing the hash buckets included in the plurality of hash tables into at least one cluster, wherein each cluster comprises a plurality of hash buckets whose similarity is greater than a similarity threshold; and counting the occurrence frequency of key value pairs belonging to different data sets in the hash buckets included in each cluster to obtain a counting frequency for each key value pair, wherein the counting frequency is positively correlated with the degree of similarity of the tuple pair indicated by the key value pair. The data similarity determination method provided by the invention has relatively high operating efficiency.
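
The three steps in the abstract (build one hash table per hash function, group similar buckets into clusters, count cross-data-set key pairs per cluster) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the function names, the use of Jaccard similarity between buckets, the greedy clustering rule, and the toy hash functions are not taken from the patent, and the sketch does not attempt the efficiency optimizations the patent claims.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_hash_tables(records, hash_funcs):
    """Build one hash table per hash function: hash value -> set of keys (a bucket)."""
    tables = []
    for h in hash_funcs:
        table = defaultdict(set)
        for key, tup in records.items():
            table[h(tup)].add(key)
        tables.append(dict(table))
    return tables


def jaccard(a, b):
    """Bucket similarity (assumed measure): shared keys divided by all keys."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_buckets(buckets, threshold):
    """Greedy clustering (assumed rule): join the first cluster whose seed
    bucket is more similar than the threshold, else start a new cluster."""
    clusters = []
    for bucket in buckets:
        for cluster in clusters:
            if jaccard(bucket, cluster[0]) > threshold:
                cluster.append(bucket)
                break
        else:
            clusters.append([bucket])
    return clusters


def count_cross_pairs(clusters, dataset_of):
    """Count how often two keys from different data sets share a bucket."""
    counts = Counter()
    for cluster in clusters:
        for bucket in cluster:
            for a, b in combinations(sorted(bucket), 2):
                if dataset_of[a] != dataset_of[b]:
                    counts[(a, b)] += 1
    return counts


# Hypothetical toy data: keys r* belong to data set R, keys s* to data set S.
records = {"r1": (1, 5), "s1": (1, 5), "r2": (2, 6), "s2": (2, 7)}
dataset_of = {"r1": "R", "r2": "R", "s1": "S", "s2": "S"}
hash_funcs = [lambda t: t[0], lambda t: t[1]]  # stand-ins for real hash functions

tables = build_hash_tables(records, hash_funcs)
buckets = [bucket for table in tables for bucket in table.values()]
clusters = cluster_buckets(buckets, threshold=0.4)
counts = count_cross_pairs(clusters, dataset_of)
```

In this toy run the pair ("r1", "s1") shares a bucket under both hash functions and the pair ("r2", "s2") under only one, so the first pair receives the higher count, matching the abstract's statement that the counting frequency rises with tuple similarity.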

Description

Technical field

[0001] The present application relates to the field of data processing, and in particular to a method, device, and processing equipment for determining data similarity.

Background

[0002] A data set usually records data in the form of a table, and each row in the table is a tuple (also called a record). Similarity join is a common data set operation: it determines, from multiple data sets, the tuple pairs whose similarity is greater than a specified threshold, and stores each such tuple pair in the same row of the data set.

[0003] In the related art, a locality sensitive hashing (LSH) algorithm is generally used to determine the similarity of tuple pairs belonging to different data sets. Specifically, the LSH algorithm can use multiple different hash functions to perform hash mapping on each tuple in each data set and obtain the hash value of each tuple under the different hash mappings; The tuple pairs of the data ...
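
As a rough illustration of the LSH approach described in the background, the following Python sketch hashes every tuple of two data sets under several hash functions and counts, for each cross-data-set pair, how many functions place both tuples in the same bucket. The choice of MinHash as the hash family, the BLAKE2b-based seeding, and the names `minhash` and `lsh_candidate_pairs` are assumptions for illustration; the source text does not say which hash functions are used.

```python
import hashlib
from collections import defaultdict


def minhash(tokens, seed):
    """One seeded MinHash function: the smallest seeded hash over a tuple's fields.
    Assumes the tuple has at least one field."""
    return min(
        int.from_bytes(
            hashlib.blake2b(f"{seed}:{t}".encode(), digest_size=8).digest(), "big"
        )
        for t in tokens
    )


def lsh_candidate_pairs(dataset_r, dataset_s, num_hashes=4):
    """Return {(r_key, s_key): number of hash functions under which they collide}."""
    collisions = defaultdict(int)
    for seed in range(num_hashes):
        # One hash table per hash function: hash value -> (keys from R, keys from S).
        buckets = defaultdict(lambda: ([], []))
        for key, tup in dataset_r.items():
            buckets[minhash(tup, seed)][0].append(key)
        for key, tup in dataset_s.items():
            buckets[minhash(tup, seed)][1].append(key)
        # Every R/S pair that shares a bucket is a candidate similar pair.
        for r_keys, s_keys in buckets.values():
            for r in r_keys:
                for s in s_keys:
                    collisions[(r, s)] += 1
    return dict(collisions)


# Hypothetical toy data sets keyed by record identifier.
dataset_r = {"r1": ("alice", "paris", "1990")}
dataset_s = {"s1": ("alice", "paris", "1990"), "s2": ("bob", "tokyo", "1985")}
pairs = lsh_candidate_pairs(dataset_r, dataset_s, num_hashes=4)
```

Identical tuples collide under every hash function, while tuples that share few fields collide rarely, so the collision count serves as a proxy for similarity. The nested loop over each bucket is the kind of repeated pairwise work that makes a naive implementation slow when many buckets hold the same keys.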

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/22, G06F16/28
CPC: G06F16/2255, G06F16/284
Inventor: 傅忱忱, 薛春, 李建华, 王元钢, 郭鑫
Owner: HUAWEI TECH CO LTD