Efficient distributed locality-sensitive hashing method

A locality-sensitive hashing method using distributed technology, applied in special data processing applications, instruments, electrical digital data processing, etc., that reduces the amount of shuffling, improves query performance, and improves efficiency.

Active Publication Date: 2017-11-24
NAT UNIV OF DEFENSE TECH

AI Technical Summary

Problems solved by technology

[0007] To address the similarity search problem for massive high-dimensional data, the present invention implements an efficient distributed locality-sensitive hashing method on the Spark platform.



Examples


Embodiment Construction

[0028] For a better understanding of the technical solutions in this application, they are described clearly and in detail below with reference to the drawings and the specific embodiments of this application:

[0029] As shown in Figure 1, the present invention discloses an efficient distributed locality-sensitive hashing method, which is designed and implemented on the distributed computing framework Spark and includes the following parts:

[0030] A client is an application that defines a specific task, such as building a hash table or executing a query. The client submits the task to the master node for scheduling; the task is then sent to each computing node for parallel execution, and the client waits to receive the calculation result.

[0031] The master node communicates with the client and the worker nodes, receives the jobs submitted by clients, divides each job into a set of tasks, schedules the tasks according t...


Abstract

The invention provides a distributed locality-sensitive hashing method. The method comprises the following steps: original data is loaded from a distributed file system, the original data vector set is read, and a first resilient distributed dataset (RDD) is generated; L composite hash functions are constructed according to the number L of hash tables and the number k of hash functions specified by the user; the L hash values of each piece of data in the dataset are calculated, so that each piece of data is mapped into one hash bucket of each hash table; for each piece of data, each pair consisting of a hash table identifier and the corresponding composite hash function value is merged into a string, the string is mapped to a numeric key, the numeric key and the data identifier form a key-value pair, and the key-value pairs are saved as a second resilient distributed dataset; and the second dataset is repartitioned according to the numeric key of each piece of data, so that data with the same numeric key is stored in the same partition, which completes the construction of the hash tables. The method reduces the amount of shuffle generated during hash table construction, improves index construction efficiency, and reduces message transmission overhead during queries.
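
The construction steps in the abstract can be illustrated with a minimal single-process sketch. It is not the patented Spark implementation: plain Python lists and dictionaries stand in for the distributed datasets and partitions, and several details are assumptions that the text does not specify. The base hash family is assumed to be the common random-projection form floor((a·v + b) / w), the string-to-number mapping is assumed to be a truncated SHA-256 digest, partition assignment is assumed to be key modulo the partition count, and the function names and the `query` lookup are hypothetical.

```python
import hashlib
import math
import random
from collections import defaultdict


def make_composite_hashes(L, k, dim, w=4.0, seed=0):
    """Build L composite hash functions, each a list of k base functions (a, b).

    Assumption: Gaussian random projections with bucket width w.
    """
    rng = random.Random(seed)
    return [
        [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
         for _ in range(k)]
        for _ in range(L)
    ]


def composite_value(g, vector, w=4.0):
    """Evaluate one composite function: a tuple of k bucket indices."""
    return tuple(
        math.floor((sum(a_i * x_i for a_i, x_i in zip(a, vector)) + b) / w)
        for a, b in g
    )


def numeric_key(table_id, value):
    """Merge (table id, composite value) into a string, then map it to an integer."""
    text = f"{table_id}:" + ",".join(map(str, value))
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")  # assumed string-to-number mapping


def build_tables(dataset, hashes, num_partitions, w=4.0):
    """Group (numeric key, data id) pairs so equal keys share one partition.

    Returns partitions[p][key] -> list of data ids.
    """
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for data_id, vector in dataset:
        for table_id, g in enumerate(hashes):
            key = numeric_key(table_id, composite_value(g, vector, w))
            partitions[key % num_partitions][key].append(data_id)
    return partitions


def query(q, hashes, partitions, w=4.0):
    """Hypothetical lookup: collect ids sharing a bucket with q in any table."""
    candidates = set()
    for table_id, g in enumerate(hashes):
        key = numeric_key(table_id, composite_value(g, q, w))
        candidates.update(partitions[key % len(partitions)].get(key, []))
    return candidates


if __name__ == "__main__":
    data = [(0, [0.0, 0.0]), (1, [0.1, -0.1]), (2, [100.0, -50.0])]
    hashes = make_composite_hashes(L=3, k=2, dim=2)
    parts = build_tables(data, hashes, num_partitions=4)
    print(sorted(query([0.0, 0.0], hashes, parts)))
```

Because the table identifier is folded into the key, all L hash tables live in one keyed collection, and a query needs to contact only the partitions that own its L keys.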

Description

Technical field

[0001] The invention belongs to the field of big data mining in Internet technology, and in particular relates to a distributed implementation of a locality-sensitive hashing method that accelerates similarity search over massive high-dimensional data.

Background technique

[0002] Similarity search is an important problem in the field of multimedia information retrieval. It refers to finding the object with the highest similarity (or the smallest distance). To improve the efficiency of similarity search, indexing methods such as the KD-tree, R-tree, and SR-tree have been proposed one after another, and they perform well in low-dimensional spaces. However, as the dimensionality of the data increases, the performance of these methods declines sharply, a problem known as the "curse of dimensionality". To overcome the "curse of dimensionality", many approximate search methods have been proposed; one of the best known is Locality Sensitive Hashing (L...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/2255, G06F16/2453, G06F16/24554, G06F16/2471
Inventors: 张万新, 李东升, 徐颖
Owner: NAT UNIV OF DEFENSE TECH