Large-scale high-dimensional data approximate neighbor query system and method based on Spark

A high-dimensional data approximate neighbor query technology, applied in fields such as database retrieval, database indexing, and electrical digital data processing. It addresses two problems: throughput and latency that cannot meet actual needs, and the lack of support for non-spatial high-dimensional vector data. It improves query throughput and is widely applicable.

Pending Publication Date: 2022-04-12

SHANGHAI JIAO TONG UNIV

Cites: 0; Cited by: 1

Problems solved by technology

[0012] 1) Existing systems are designed specifically for the characteristics of spatial data vectors. They do not consider the needs of non-spatial high-dimensional vectors and cannot support the large body of existing non-spatial high-dimensional vector data. The relationships between vectors need to be mined in order to design more effective query methods.

[0013] 2) A query often traverses all partitions, collects the results from each partition, and produces the final result only after centralized processing, which entails a heavy workload and a long running time.

[0014] 3) All queries are exact, so the accuracy is higher than actually required, while performance such as throughput and latency cannot meet actual demand.
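The trade-off in point 3) can be made concrete by measuring how much of the exact answer an approximate query recovers. The sketch below assumes recall@k as the accuracy measure; the patent text shown here does not name a metric, and the function name recall_at_k is hypothetical.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true k nearest neighbors found by an approximate query.

    approx_ids: ids returned by the approximate query, best first.
    exact_ids:  ids returned by an exact (brute-force) query, best first.
    """
    truth = set(exact_ids[:k])
    if not truth:
        raise ValueError("exact_ids must contain at least one id and k must be positive")
    found = truth & set(approx_ids[:k])
    return len(found) / len(truth)


# Three of the four true neighbors were returned, so recall@4 is 0.75.
print(recall_at_k([1, 2, 3, 9], [1, 2, 3, 4], k=4))
```

An application that tolerates a recall below 1.0 can accept the approximate result in exchange for lower latency, which is the gap point 3) describes.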

Embodiment Construction

[0069] The technical solutions in the embodiments of the present application will be clearly and completely described below in conjunction with the drawings in the embodiments of the present application. It should be understood by those skilled in the art that the described embodiments are some, but not all, embodiments of the present invention. Based on the embodiments in the present application, those skilled in the art can make any appropriate modification or variation to obtain all other embodiments.

[0070] In a first aspect, an embodiment of the present invention proposes a Spark-based large-scale high-dimensional data approximate neighbor query system. The system includes:

[0071] A vector acquisition module, an index building module, and a query module.

[0072] The vector acquisition module acquires the vectors to be processed by the system, that is, the data set to be processed, including the vectors to be processed converted from the unstructured data to...

Abstract

The invention provides a Spark-based large-scale high-dimensional data approximate neighbor query system and method, mainly used to execute approximate neighbor queries in memory. First, the vectors are clustered into partitions according to their similarity, and each cluster partition corresponds to one partition of a Spark resilient distributed dataset (RDD). The data of each partition is then sampled proportionally, and the partitions are labeled. A global index is built on the master node from the sampled data, and a partition index is built on each corresponding partition. At query time, the global index is used to find the partitions that need to be queried, and the results from those partitions are aggregated and sorted to obtain the final result. Built on the Spark system, this technical scheme provides a highly scalable distributed approximate neighbor query scheme with low latency and high throughput.
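The flow described in the abstract can be sketched in a few lines of single-process Python. This is an illustration under stated assumptions, not the patented implementation: plain lists stand in for Spark RDD partitions, a small k-means stands in for the unspecified clustering step, both "indexes" are brute-force scans, and each sample is labeled with the id of its partition. All function names and the n_probe parameter are hypothetical.

```python
import heapq
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.dist(a, b)


def partition_by_similarity(vectors, k, iters=10, seed=0):
    """Cluster vectors into at most k partitions with a basic k-means (assumed method)."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: dist(v, centroids[c]))
            clusters[nearest].append(v)
        centroids = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
    return [members for members in clusters if members]  # drop empty partitions


def build_global_index(partitions, fraction, seed=0):
    """Sample each partition proportionally and label every sample with its partition id."""
    rng = random.Random(seed)
    index = []
    for pid, part in enumerate(partitions):
        n = max(1, round(len(part) * fraction))
        index.extend((v, pid) for v in rng.sample(part, n))
    return index


def query(q, k, partitions, global_index, n_probe):
    """Route q to n_probe partitions via the global index, then merge the local top-k lists."""
    ranked = sorted(global_index, key=lambda entry: dist(q, entry[0]))
    probe = []
    for _, pid in ranked:
        if pid not in probe:
            probe.append(pid)
        if len(probe) == n_probe:
            break
    candidates = []
    for pid in probe:  # in Spark, each local search would run on the node holding the partition
        candidates.extend(heapq.nsmallest(k, partitions[pid], key=lambda v: dist(q, v)))
    return heapq.nsmallest(k, candidates, key=lambda v: dist(q, v))


# Usage: three synthetic groups of 8-dimensional vectors, probing a single partition.
rng = random.Random(1)
data = [tuple(rng.gauss(c, 1.0) for _ in range(8)) for c in (0.0, 10.0, 20.0) for _ in range(50)]
parts = partition_by_similarity(data, k=3)
global_index = build_global_index(parts, fraction=0.2)
neighbors = query(data[0], 5, parts, global_index, n_probe=1)
```

Setting n_probe to the number of partitions searches everything and returns the exact answer, while smaller values skip partitions and trade accuracy for speed. That routing step is what avoids the traverse-all-partitions pattern listed among the problems.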

Description

Technical field

[0001] The present invention relates to the technical field of computer data management, and more specifically to a method and system for fast retrieval of large-scale high-dimensional data.

Background

[0002] Neighbor search is an important operation in many applications, such as image retrieval, recommender systems, and data mining. With the rapid development of fields related to artificial intelligence, machine learning algorithms have made major breakthroughs in computer vision, speech recognition, natural language processing, and other application areas. Large amounts of unstructured data (pictures, voice, text) can be converted into vectors, a more efficient data representation, which has led to the generation of massive vector data. A vector nearest neighbor search algorithm must be scalable, high-throughput, and low-latency in order to process massive vector data efficiently. For example, in Taobao recom...

Application Information

Patent Type & Authority Applications (China)

IPC(8): G06F16/901, G06F16/9032

Inventor 徐姚亨, 姚斌, 张鹏程, 唐飞龙, 沈耀, 郑文立

Owner SHANGHAI JIAO TONG UNIV
