Parallel indexing method supporting real-time biased query of high dimensional data

A high-dimensional data indexing technology applied in the search field. It addresses problems such as unsatisfactory real-time performance and scalability, and achieves good real-time performance.

Active Publication Date: 2013-12-18
SHENZHEN INSTITUTE OF INFORMATION TECHNOLOGY
3 Cites, 20 Cited by

AI Technical Summary

Problems solved by technology

[0011] The purpose of the embodiments of the present invention is to provide a parallel indexing method that supports real-time biased query of high-dimensional

Method used



Examples


Specific example

[0027] The specific example is as follows: a group of hash functions maps a d-dimensional vector v to a set of integer values. Each hash function in this group is determined by a pair (a, b), where a is a d-dimensional vector whose entries follow a stable distribution, as in the existing LSH algorithm. In the existing LSH algorithm, b is a real number uniformly distributed in the interval [0, r]. The specific embodiment of the present invention instead generates b according to the density of the data; for example, a normal distribution can be adopted according to the characteristics of the data, so that the lengths of the sections differ. Once a and b are given, a locality-sensitive hash based on the stable distribution can be generated by (a·v + b) / r. Since the value of b is distributed according to density rather than being constant, the data can be distributed as evenly as possible across the data buckets, so as to avoid the problem of uncer...
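The hash construction in [0027] can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the function names, the Gaussian choice for the stable distribution, the floor applied to (a·v + b) / r to obtain an integer, and the parameters of the normal distribution for b (mean r/2, standard deviation r/6, clamped to [0, r]) are all assumptions made for the example.

```python
import math
import random

def uniform_offset(r):
    """Offset sampler of the existing LSH scheme: b uniform in [0, r]."""
    return lambda rng: rng.uniform(0.0, r)

def normal_offset(r):
    """Assumed density-driven sampler: b ~ Normal(r/2, r/6), clamped to [0, r]."""
    return lambda rng: min(max(rng.gauss(r / 2, r / 6), 0.0), r)

def make_hash(d, r, rng, b_sampler):
    """Draw one (a, b) pair and return h(v) = floor((a.v + b) / r)."""
    # Gaussian entries: the normal distribution is 2-stable.
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = b_sampler(rng)

    def h(v):
        dot = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((dot + b) / r)

    return h

def make_hash_group(d, r, k, seed, b_sampler):
    """Build a group of k hash functions from one seeded generator."""
    rng = random.Random(seed)
    return [make_hash(d, r, rng, b_sampler) for _ in range(k)]

def bucket_key(hashes, v):
    """Map a vector to its bucket: the tuple of the k integer hash values."""
    return tuple(h(v) for h in hashes)
```

Passing `uniform_offset(r)` instead of `normal_offset(r)` gives the conventional scheme, so the two ways of drawing b can be compared on the same data.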

Embodiment

[0031] The flow chart of index establishment and query provided by the embodiment of the present invention is shown in Figure 2. It is divided into two parts: the offline processing part and the online processing part.

[0032] Offline processing part: features are extracted from the data attributes (using MapReduce or other methods) and taken as input. Multiple index servers then construct vector indexes according to the LSH algorithm, using hash functions that elastically divide data buckets according to data density (for example, following a normal distribution). The vector indexes of the index servers form an orthogonal relationship with one another, which supports the indexing of massive data.
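A rough sketch of the offline step, under stated assumptions: the text does not define the "orthogonal relationship" between servers, so this example reads it as each server indexing its own non-overlapping slice of the feature dimensions. The class `IndexServer`, the helpers `build_servers` and `index_all`, and all parameter values are hypothetical, and feature extraction is assumed to have already produced plain lists of floats.

```python
import math
import random
from collections import defaultdict

class IndexServer:
    """One index server: an LSH table over its own slice of dimensions."""

    def __init__(self, dims, r, k, seed):
        rng = random.Random(seed)
        self.dims = dims
        self.r = r
        # k (a, b) pairs; b is drawn from a clamped normal (an assumption).
        self.params = [
            ([rng.gauss(0.0, 1.0) for _ in dims],
             min(max(rng.gauss(r / 2, r / 6), 0.0), r))
            for _ in range(k)
        ]
        self.buckets = defaultdict(list)

    def key(self, v):
        """Bucket key computed only from this server's dimensions."""
        sub = [v[i] for i in self.dims]
        return tuple(
            math.floor((sum(x * y for x, y in zip(a, sub)) + b) / self.r)
            for a, b in self.params
        )

    def add(self, item_id, v):
        self.buckets[self.key(v)].append(item_id)

    def candidates(self, v):
        """Items that share a bucket with v on this server."""
        return list(self.buckets.get(self.key(v), []))

def build_servers(d, n_servers, r=4.0, k=2, seed=0):
    """Split d dimensions into contiguous slices, one server per slice.

    Yields at most n_servers servers when d is not evenly divisible.
    """
    size = math.ceil(d / n_servers)
    return [
        IndexServer(list(range(start, min(start + size, d))), r, k, seed + i)
        for i, start in enumerate(range(0, d, size))
    ]

def index_all(servers, data):
    """Insert every feature vector into every server's table."""
    for item_id, v in data.items():
        for server in servers:
            server.add(item_id, v)
```

In a real deployment each `IndexServer` would run on a separate machine and the insert loop would be distributed; here everything runs in one process to keep the example self-contained.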

[0033] Online processing part: when a user's query arrives, the distance change caused by the user's biased (weighted) query is first projected into the existing index structure through the directed clustering mapping method, so as to reduce the computation time of index reconstruction; if the mapping error exceeds ...



Abstract

The invention is applicable to the field of indexing technologies and provides a parallel indexing method supporting real-time biased query of high-dimensional data. The method includes: a query system extracts features of data attributes by means of MapReduce or similar methods and takes the features as input; a plurality of index servers in the query system establish parallel indexes using a hash function which flexibly divides data buckets according to data density; the distance change caused by a biased query is projected onto the index servers of the query system by a directed clustering mapping method; if the mapping error exceeds the range acceptable to users, the query system submits the biased query to the index servers, combined in parallel, for separate processing; the index servers each return filtered results according to the weight ratios given by users; all returned results are computed and combined to ensure that the query response is returned within a determined time. The method has the advantage that massive data can be handled.
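The final steps of the abstract (per-server filtering and combination by user weights) might look roughly like the sketch below. The abstract does not specify how the weight ratios are applied, so this example assumes each server covers a group of dimensions, returns partial squared distances, and the coordinator sums them with one user weight per server. The function names are hypothetical, and each server scans its data by brute force here rather than through an LSH bucket lookup.

```python
import heapq

def partial_sq_dist(q, v, dims):
    """Squared Euclidean distance restricted to one server's dimensions."""
    return sum((q[i] - v[i]) ** 2 for i in dims)

def server_results(data, q, dims, limit):
    """Hypothetical per-server step: the `limit` nearest items on `dims`.

    Returns (item_id, partial_distance) pairs.
    """
    scored = ((partial_sq_dist(q, v, dims), item_id)
              for item_id, v in data.items())
    return [(item_id, dist) for dist, item_id in heapq.nsmallest(limit, scored)]

def merge(results_per_server, weights, top_n):
    """Combine partial distances using one user weight per server."""
    total = {}
    seen = {}
    for w, results in zip(weights, results_per_server):
        for item_id, dist in results:
            total[item_id] = total.get(item_id, 0.0) + w * dist
            seen[item_id] = seen.get(item_id, 0) + 1
    # Keep only items reported by every server so the totals are comparable.
    n = len(results_per_server)
    complete = [(score, item_id) for item_id, score in total.items()
                if seen[item_id] == n]
    return [item_id for _, item_id in heapq.nsmallest(top_n, complete)]
```

A weight of 0 for a server makes its dimensions irrelevant to the ranking, which is one simple way to model a user who only cares about certain features.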

Description

Technical field

[0001] The invention belongs to the technical field of searching, and in particular relates to a parallel indexing method supporting real-time biased query of high-dimensional data.

Background art

[0002] High-dimensional data: refers to data with more than 20 attribute (feature) dimensions. Various types of transaction data, social network information, Web documents and usage data, geographic information, document word-frequency data, user rating data, multimedia data, and so on are multi-source, massive, heterogeneous (unstructured data models) and high-dimensional; that is, their dimensions (attributes) can usually reach hundreds or thousands, or even higher. As a result, the data that needs to be retrieved in various applications is increasingly complex and the data volume is expanding rapidly. Biased query: Based on their own preferences and experience in environment interaction, users only care about certain feature di...

Claims


Application Information

IPC(8): G06F17/30
Inventors: 王寅峰, 邓果丽, 许志良
Owner: SHENZHEN INSTITUTE OF INFORMATION TECHNOLOGY