Implementation method of the multidimensional index structure OBF-Index in a Hadoop environment

An index-structure technology, OBF-Index, applied in the field of cloud storage, which addresses problems such as a high false positive rate.

Active Publication Date: 2021-06-04
YUNNAN UNIV

AI Technical Summary

Problems solved by technology

In 2008, Google's daily data volume exceeded 20 PB. In 2016, Alibaba needed to process more than 100 PB of data per day and ran more than 1 million big data tasks per day. Data at this scale cannot be processed on a single machine.
However, because a Bloom filter is a probabilistic data structure, its false positive rate increases as more data is inserted.
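
The growth of the false positive rate can be illustrated with the standard approximation for a plain Bloom filter with m bits, k hash functions and n inserted items, p ≈ (1 - e^(-kn/m))^k. The sketch below is only an illustration of that textbook formula; the sizes (10,000 bits, 7 hash functions) are hypothetical and are not taken from the patent.

```python
import math


def bloom_false_positive_rate(m_bits: int, k_hashes: int, n_items: int) -> float:
    """Textbook approximation p = (1 - exp(-k*n/m))**k for a plain Bloom filter."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


# Hypothetical sizing, chosen only for illustration.
M_BITS, K_HASHES = 10_000, 7
item_counts = (500, 1000, 2000, 4000)
rates = [bloom_false_positive_rate(M_BITS, K_HASHES, n) for n in item_counts]

for n, p in zip(item_counts, rates):
    print(f"n={n:5d}  estimated false positive rate={p:.4f}")
```

With the filter size fixed, each additional insert sets more bits, so the estimated rate rises monotonically with n.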



Embodiment

[0029] To better illustrate the technical solution of the present invention, the idea behind the invention is first briefly described.

[0030] In Hadoop, data is processed quickly by running multiple Mappers and multiple Reducers in parallel. Because the data stored on HDFS is generally on the order of gigabytes, terabytes or more, it is impossible to assign all of the data to one machine when executing a task. Therefore, before executing Map, Hadoop first divides the input data into fixed-size blocks to obtain data fragments (InputSplits), and each fragment is then assigned to an independent Mapper.
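
The fixed-size splitting described in [0030] can be sketched as follows. This is a simplified illustration, not Hadoop's actual input format logic: the function name `compute_splits` and the sizes are hypothetical, and details such as record boundaries are omitted.

```python
from typing import List, Tuple


def compute_splits(file_size: int, split_size: int) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs that cover file_size bytes in fixed-size pieces."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    return [
        (offset, min(split_size, file_size - offset))
        for offset in range(0, file_size, split_size)
    ]


# Hypothetical sizes: a 300 MB input file and 128 MB splits.
MB = 1024 * 1024
splits = compute_splits(300 * MB, 128 * MB)

# Each split would be handed to its own Mapper.
for mapper_id, (offset, length) in enumerate(splits):
    print(f"mapper {mapper_id}: offset={offset}, length={length}")
```

The last split is shorter than the others whenever the file size is not a multiple of the split size.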

[0031] Figure 1 is a schematic diagram of the original MapReduce process. As shown in Figure 1, in the original MapReduce process, the Mapper receives data fragments, and the Reducer often copies and processes data from the relevant Mapper at runtime, so the resources of the Reducer node are less than that of t...


Abstract

The invention discloses a method for implementing a multi-dimensional index structure, OBF-Index, in a Hadoop environment. The method divides a data set into data fragments, creates an OBF index object for each data fragment, and serializes each object into an OBF index file for storage, thereby constructing the OBF-Index. When the data set is to be used, the set A of data to be queried is first specified. The OBF index file of each data fragment is then read and deserialized to obtain the OBF index object, which is used to check whether any data in set A exists in that data fragment. If so, the data fragment is passed to the corresponding Mapper; otherwise nothing is done. The multi-dimensional index structure OBF-Index designed by the invention supports efficient creation and query and can effectively reduce the false positive rate.
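
The build, serialize, deserialize and query flow in the abstract can be sketched as below. The internals of the OBF index object are not given in this excerpt, so the sketch substitutes a plain single-key Bloom filter (`BloomIndex`) as a stand-in; the class, the helper names `build_index_files` and `select_splits`, the byte layout and the sample records are all hypothetical.

```python
import hashlib
from typing import Iterable, List, Sequence


class BloomIndex:
    """Plain Bloom filter used as a stand-in; NOT the patent's OBF structure."""

    def __init__(self, m_bits: int = 1024, k_hashes: int = 3) -> None:
        self.m_bits = m_bits
        self.k_hashes = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, key: str) -> List[int]:
        # Derive k bit positions by salting SHA-256 with the hash index.
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{key}".encode()).digest()[:8], "big")
            % self.m_bits
            for i in range(self.k_hashes)
        ]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

    def to_bytes(self) -> bytes:
        # Hypothetical layout: 4-byte m, 4-byte k, then the bit array.
        return self.m_bits.to_bytes(4, "big") + self.k_hashes.to_bytes(4, "big") + bytes(self.bits)

    @classmethod
    def from_bytes(cls, blob: bytes) -> "BloomIndex":
        index = cls(int.from_bytes(blob[:4], "big"), int.from_bytes(blob[4:8], "big"))
        index.bits = bytearray(blob[8:])
        return index


def build_index_files(splits: Sequence[Iterable[str]]) -> List[bytes]:
    """Create one index per data fragment and serialize it (bytes stand in for files)."""
    files = []
    for records in splits:
        index = BloomIndex()
        for record in records:
            index.add(record)
        files.append(index.to_bytes())
    return files


def select_splits(index_files: Sequence[bytes], wanted: Iterable[str]) -> List[int]:
    """Return the fragments whose deserialized index reports at least one wanted key."""
    wanted = list(wanted)
    return [
        i
        for i, blob in enumerate(index_files)
        if any(BloomIndex.from_bytes(blob).might_contain(key) for key in wanted)
    ]


splits = [["alice", "bob"], ["carol", "dave"], ["erin", "frank"]]
index_files = build_index_files(splits)
selected = select_splits(index_files, {"carol"})
print(selected)  # fragments that would be passed on to a Mapper
```

A fragment that really contains a wanted key is always selected, because a Bloom filter has no false negatives; a fragment that does not contain it may occasionally be selected as well, which is the false positive cost the invention aims to reduce.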

Description

Technical field

[0001] The invention belongs to the technical field of cloud storage, and more specifically relates to a method for implementing a multi-dimensional index structure, OBF-Index, in a Hadoop environment.

Background

[0002] We are living in an era of big data. Various types of logs on the Internet (such as click logs), content posted by users (such as tweets posted on Twitter), and graph data (such as social networks) are all sources of massive data. In 2008, Google's daily data volume exceeded 20 PB. In 2016, Alibaba had to process more than 100 PB of data per day and ran more than 1 million big data tasks per day. Data at this scale cannot be processed on a single machine. In recent years, distributed computing, grid computing, and cloud computing technologies have become increasingly mature. As early as 2003 and 2004, Google published two papers presenting two new technologies, GFS (Google File System) and MapReduce, in order to d...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/22, G06F16/27
CPC: G06F16/2228, G06F16/27
Inventor: 李劲, 刘建坤, 窦奇伟, 何臻力, 周维
Owner: YUNNAN UNIV