
Large-scale high-dimensional data approximate neighbor query system and method based on Spark

A technology for approximate neighbor queries over high-dimensional data, applied in database retrieval, database indexing, electrical digital data processing, and related fields. It addresses three problems of existing approaches: throughput and latency that cannot meet practical needs, designs that do not consider the requirements of non-spatial high-dimensional vectors, and the resulting inability to support non-spatial high-dimensional vector data. The stated effects are improved query throughput and wide applicability.

Pending Publication Date: 2022-04-12
SHANGHAI JIAO TONG UNIV
Cites: 0, Cited by: 1

AI Technical Summary

Problems solved by technology

[0012] 1) Existing systems are designed specifically around the characteristics of spatial data vectors. They do not consider the needs of non-spatial high-dimensional vectors and cannot support the large body of existing non-spatial high-dimensional vector data. More effective query methods are needed that exploit the relationships between vectors.
[0013] 2) A query typically traverses all partitions, collects the results from each partition, and generates the final result after centralized processing, which incurs a heavy workload and takes a long time.
[0014] 3) All queries are exact. Their accuracy exceeds what applications actually require, while throughput and latency cannot meet practical needs.
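
Problems 2) and 3) can be made concrete with a minimal Python sketch. It is not Spark code and is not taken from the patent: it models the scatter-gather baseline in plain lists, where every partition is scanned, the local top-k lists are gathered, and a central merge produces the exact answer. The `recall_at_k` helper shows how an approximate answer could be scored against that exact one. The names `exact_scatter_gather` and `recall_at_k` and the choice of squared Euclidean distance are assumptions made for illustration.

```python
import heapq


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_scatter_gather(partitions, query, k):
    """Baseline: scan every partition, then merge the local top-k centrally.

    `partitions` is a list of lists of (vector_id, vector) pairs.
    Returns (ids of the k nearest vectors, number of partitions scanned).
    """
    gathered = []
    for part in partitions:
        # Local top-k inside one partition (the "scatter" step).
        local = heapq.nsmallest(k, ((sq_dist(v, query), vid) for vid, v in part))
        gathered.extend(local)
    # Centralized merge of all local results (the "gather" step).
    top = heapq.nsmallest(k, gathered)
    return [vid for _, vid in top], len(partitions)


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that an approximate answer recovered."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Hypothetical toy data: three partitions of two 2-D vectors each.
partitions = [
    [(0, (0.0, 0.0)), (1, (0.1, 0.2))],
    [(2, (5.0, 5.0)), (3, (5.1, 4.9))],
    [(4, (9.0, 0.0)), (5, (8.8, 0.3))],
]
exact_ids, scanned = exact_scatter_gather(partitions, (0.0, 0.1), k=2)
```

In this toy run all three partitions are scanned even though both nearest neighbors live in the first one. An approximate method trades a recall below 1.0 for the option of skipping partitions that are unlikely to contain neighbors.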


Examples

Embodiment Construction

[0069] The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. Those skilled in the art should understand that the described embodiments are some, but not all, embodiments of the present invention. Based on the embodiments in the present application, those skilled in the art can make any appropriate modification or variation to obtain other embodiments.

[0070] In a first aspect, an embodiment of the present invention provides a Spark-based approximate neighbor query system for large-scale high-dimensional data. The system includes:

[0071] A vector acquisition module, an index building module, and a query module.

[0072] The vector acquisition module acquires the vectors to be processed by the system, that is, the data set to be processed, including the vectors to be processed converted from the unstructured data to...


Abstract

The invention provides a Spark-based system and method for approximate neighbor queries over large-scale high-dimensional data, intended mainly for executing such queries in memory. First, the vectors are clustered into partitions according to their similarity, and each cluster partition corresponds to one partition of a Spark Resilient Distributed Dataset (RDD). The data of each partition is then sampled proportionally, and the partitions are labeled. A global index is built on the master node from the sampled data, and a partition index is built on each corresponding partition. At query time, the global index is used to find the partitions that need to be searched, and the results from those partitions are aggregated and sorted to obtain the final result. The scheme thus provides a highly scalable distributed approximate neighbor query solution on Spark while achieving low latency and high throughput.
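
The pipeline in the abstract can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the patented implementation and not Spark code: lists stand in for RDD partitions, k-means with farthest-first seeding stands in for the unspecified clustering step, samples are tagged with their partition id as one reading of "labeling", and both indexes are brute-force scans. The names `cluster_partition`, `build_indexes`, `query`, `sample_rate`, and `n_samples` are all hypothetical.

```python
import heapq
import random


def sq_dist(a, b):
    """Squared Euclidean distance (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_partition(vectors, n_parts, iters=10):
    """Group (id, vector) pairs by similarity with a few k-means rounds."""
    # Farthest-first seeding keeps the sketch deterministic.
    centers = [vectors[0][1]]
    while len(centers) < n_parts:
        far = max(vectors, key=lambda p: min(sq_dist(p[1], c) for c in centers))
        centers.append(far[1])
    for _ in range(iters):
        parts = [[] for _ in range(n_parts)]
        for vid, v in vectors:
            nearest = min(range(n_parts), key=lambda c: sq_dist(v, centers[c]))
            parts[nearest].append((vid, v))
        for c, part in enumerate(parts):
            if part:  # keep the old center if a cluster went empty
                dim = len(part[0][1])
                centers[c] = tuple(sum(v[d] for _, v in part) / len(part)
                                   for d in range(dim))
    return parts


def build_indexes(parts, sample_rate=0.2, seed=0):
    """Sample each partition proportionally; tag samples with its id."""
    rng = random.Random(seed)
    global_index = []  # (sample_vector, partition_id), kept on the master node
    for pid, part in enumerate(parts):
        if part:
            n = max(1, round(len(part) * sample_rate))
            global_index += [(v, pid) for _, v in rng.sample(part, n)]
    # Stand-in partition index: the raw list, searched by brute force.
    return global_index, parts


def query(global_index, parts, q, k, n_samples=3):
    """Route q to the partitions of its nearest samples, then merge top-k."""
    nearest = heapq.nsmallest(n_samples, global_index,
                              key=lambda s: sq_dist(s[0], q))
    target_pids = {pid for _, pid in nearest}
    candidates = []
    for pid in target_pids:  # only the selected partitions are searched
        candidates += heapq.nsmallest(
            k, ((sq_dist(v, q), vid) for vid, v in parts[pid]))
    top = heapq.nsmallest(k, candidates)
    return [vid for _, vid in top], target_pids


# Hypothetical data: 90 points in three well-separated 2-D blobs.
rng = random.Random(42)
blobs = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
data = [(i, (cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1)))
        for i, (cx, cy) in enumerate(blobs * 30)]
parts = cluster_partition(data, n_parts=3)
global_index, partition_indexes = build_indexes(parts)
ids, probed = query(global_index, partition_indexes, (10.0, 10.0), k=5)
```

On this toy data the query touches a single partition rather than all three. How many samples to consult (`n_samples` here) is the knob that would trade recall against the number of partitions searched; the patent text shown here does not specify how that choice is made.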

Description

Technical field

[0001] The present invention relates to the technical field of computer data management, and more specifically, to a method and system for fast retrieval of large-scale high-dimensional data.

Background

[0002] Neighbor search is an important operation in many applications; image retrieval, recommender systems, and data mining all rely on it. With the rapid development of artificial intelligence, machine learning algorithms have made major breakthroughs in computer vision, speech recognition, natural language processing, and other application fields. Large amounts of unstructured data (pictures, voice, text) can be converted into vectors, a more efficient data representation, which has led to the generation of massive vector data. A vector nearest neighbor search algorithm must be scalable and offer high throughput and low latency in order to process massive vector data efficiently. For example, in Taobao recom...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/901, G06F16/9032
Inventor: 徐姚亨, 姚斌, 张鹏程, 唐飞龙, 沈耀, 郑文立
Owner: SHANGHAI JIAO TONG UNIV