
Spark-based high-dimensional sequence data similarity query method and system

A sequence data similarity query technology, applied in database indexing, electronic digital data processing, structured data retrieval, etc. It addresses the problem of high data dimensionality and achieves reduced data volume, good query accuracy, and good scalability.

Active Publication Date: 2022-02-22
ANHUI UNIVERSITY OF TECHNOLOGY +1
5 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0007] The Spark-based high-dimensional sequence data similarity query method runs in the Spark distributed cluster environment, whose memory-based computing improves the speed of program execution. Using locality-sensitive hash (LSH) functions effectively addresses the problem of excessive data dimensionality.
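To make the role of an LSH function concrete, here is a minimal sketch of one hash function. The source does not say which LSH family it uses, so the p-stable (Euclidean) form h(v) = floor((a·v + b) / w) is an assumption, and the names `make_lsh`, `w`, and `seed` are hypothetical. It is plain Python rather than Spark code.

```python
import math
import random

def make_lsh(dim, w=4.0, seed=0):
    """Return one LSH function h(v) = floor((a . v + b) / w).

    Assumption: p-stable LSH for Euclidean distance; the source
    does not name the hash family it uses.
    """
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random Gaussian direction
    b = rng.uniform(0.0, w)                        # random offset in [0, w)

    def h(v):
        projection = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((projection + b) / w)    # bucket index on the line

    return h

# A 128-dimensional object is reduced to a single integer bucket index.
h = make_lsh(dim=128, seed=42)
rng = random.Random(1)
x = [rng.random() for _ in range(128)]
near = [xi + 1e-6 for xi in x]   # a tiny perturbation of x
bucket_x, bucket_near = h(x), h(near)
```

Each object is mapped from 128 dimensions to one integer, and nearby objects land in the same or an adjacent bucket, which is why hashing can stand in for direct high-dimensional comparison.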

Embodiment Construction

[0057] The present invention is further illustrated below with reference to the accompanying drawings and specific embodiments. It should be understood that these embodiments only illustrate the invention and are not intended to limit its scope. After reading the present invention, modifications of its various equivalent forms made by those skilled in the art all fall within the scope defined by the appended claims of the present application.

[0058] The architecture of the Spark-based high-dimensional sequence data similarity query system is shown in Figure 1. It includes a Spark cluster unit and a compound hash function g_i unit. The Spark cluster unit includes the interconnected Driver Program module (the process that runs the main() function of the Application and creates the SparkContext), the Cluster Manager module, and the Worker Node module (worker node modu...
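The text above is cut off before it defines the compound hash function g_i. In standard LSH practice a compound hash concatenates k basic hash functions into one key, g(v) = (h_1(v), ..., h_k(v)), and objects with equal keys share a bucket. The sketch below assumes that construction; `make_compound_hash`, `build_table`, `k`, and `w` are hypothetical names, and the code is single-process Python, not the patent's Spark implementation.

```python
import math
import random
from collections import defaultdict

def make_compound_hash(dim, k=4, w=4.0, seed=0):
    """Build g(v) = (h_1(v), ..., h_k(v)) from k independent LSH functions.

    Assumption: each h_j is a p-stable LSH function floor((a . v + b) / w).
    """
    rng = random.Random(seed)
    params = [
        ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
        for _ in range(k)
    ]

    def g(v):
        return tuple(
            math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / w)
            for a, b in params
        )

    return g

def build_table(data, g):
    """Group object ids by compound key; each distinct key is one bucket."""
    table = defaultdict(list)
    for obj_id, v in enumerate(data):
        table[g(v)].append(obj_id)
    return table

rng = random.Random(7)
data = [[rng.random() for _ in range(16)] for _ in range(100)]
g = make_compound_hash(dim=16, k=4, seed=1)
table = build_table(data, g)
```

In a cluster setting the compound key could also serve as a partitioning key, so that a query only needs to visit the workers that hold its bucket; that mapping is a design possibility, not something the truncated text confirms.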

Abstract

The invention discloses a Spark-based high-dimensional sequence data similarity query method and system, comprising data preprocessing, index construction, and query steps. The invention uses a distributed Spark cluster to improve computing power. Building the index with locality-sensitive hash (LSH) functions addresses the difficulty of processing high-dimensional sequence data. The query is carried out on only some of the worker nodes, which greatly reduces data consumption. A collision counting mechanism effectively reduces the size of the candidate set and speeds up the similarity search. For any high-dimensional sequence data object given by the user, the invention can quickly and accurately find most of the similar data objects in a large-scale data set.
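The index construction, collision counting, and query steps named in the abstract can be sketched as follows. This is a single-process illustration under stated assumptions, not the patented procedure: it assumes p-stable LSH functions, one hash table per function, a fixed collision threshold, and a final ranking by exact Euclidean distance. All function and parameter names (`make_hashes`, `build_index`, `query`, `threshold`, `top_k`) are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

def make_hashes(dim, m, w, seed):
    """Create m LSH functions h(v) = floor((a . v + b) / w) (assumed family)."""
    rng = random.Random(seed)
    params = [
        ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
        for _ in range(m)
    ]
    return [
        (lambda v, a=a, b=b:
            math.floor((sum(x * y for x, y in zip(a, v)) + b) / w))
        for a, b in params
    ]

def build_index(data, hashes):
    """One table per hash function: bucket value -> list of object ids."""
    tables = [defaultdict(list) for _ in hashes]
    for obj_id, v in enumerate(data):
        for table, h in zip(tables, hashes):
            table[h(v)].append(obj_id)
    return tables

def query(q, data, hashes, tables, threshold, top_k):
    """Count collisions with q, keep frequent colliders, rank by distance."""
    counts = Counter()
    for table, h in zip(tables, hashes):
        counts.update(table.get(h(q), []))       # objects sharing q's bucket
    candidates = [i for i, c in counts.items() if c >= threshold]
    candidates.sort(key=lambda i: math.dist(q, data[i]))  # exact verification
    return candidates[:top_k]

rng = random.Random(3)
data = [[rng.random() for _ in range(32)] for _ in range(200)]
hashes = make_hashes(dim=32, m=20, w=4.0, seed=5)
tables = build_index(data, hashes)
result = query(data[10], data, hashes, tables, threshold=15, top_k=5)
```

Only objects that collide with the query in at least `threshold` tables are compared exactly, which is how collision counting keeps the candidate set small. In a distributed version the tables would be spread across worker nodes and the counts merged, a step this sketch omits.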

Description

Technical field

[0001] The invention relates to a high-dimensional sequence data similarity query method based on a Spark cluster, and belongs to the technical field of distributed cluster computing and big data processing.

Background technique

[0002] A similarity query over high-dimensional sequence data finds the subset of most similar objects in a given massive high-dimensional sequence data set. It has a wide range of applications. However, high-dimensional sequence data comes in large volumes, which makes similarity queries inefficient in a stand-alone environment. At the same time, the high dimensionality of the data easily leads to the curse of dimensionality: as the dimensionality increases, the contrast between data objects gradually decreases, and the performance of similarity query algorithms drops sharply. [0003] So...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/2458, G06F16/22, G06F16/2453
CPC: G06F16/2462, G06F16/2255, G06F16/2453
Inventor: 郑啸, 张震, 陈启航, 黄俊
Owner: ANHUI UNIVERSITY OF TECHNOLOGY