RDD partition internal data index establishing method, point query method and join RDD point query method

A technique for indexing the internal data of RDD partitions, applied in database indexing, structured data retrieval, digital data information retrieval, etc. It addresses the poor performance of the lookup API, improves query efficiency, and prevents OOM.

Status: Inactive
Publication Date: 2020-06-19
INSPUR SUZHOU INTELLIGENT TECH CO LTD

AI Technical Summary

Problems solved by technology

[0007] To solve the above problems, the present invention provides a method for establishing an RDD partition internal data index, an RDD point query method, and a join RDD point query method. Building an index over the internal data of each RDD partition addresses the poor performance of Spark's native lookup API and improves query efficiency. It also avoids performing an actual join of the RDDs, which effectively prevents out-of-memory (OOM) errors.

Examples

Embodiment 1

[0038] This embodiment provides a method for establishing an internal data index for RDD partitions. The index constructed for the RDD is an Array whose elements are of type HashMap. The HashMaps correspond one-to-one with the partitions, and each HashMap stores the internal data index information of its partition.

[0039] Note that in this embodiment the data type of the RDD is (K, V). Before indexing, first check whether the RDD has a partitioner, and perform the subsequent steps only if it does. Having a partitioner ensures that elements with the same key are in the same partition.

[0040] As shown in Figures 1 and 2, the method for establishing the partition internal data index comprises the following steps:

[0041] S1-1: define an Array that stores the partition internal data indexes. The elements of the Array are of type HashMap, and a HashMap corres...
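
The remaining steps of Embodiment 1 are truncated in this listing, so the following is only a minimal sketch of the index structure described in [0038] and [0039], not the patented procedure. It assumes plain Scala collections as a stand-in for a partitioned RDD; the names `PartitionIndexSketch`, `partitionOf`, `partitionByKey` and `buildIndex` are hypothetical, and the hash partitioner is an assumed example of "an RDD that has a partitioner".

```scala
import scala.collection.mutable.{ArrayBuffer, HashMap}

object PartitionIndexSketch {
  // Stand-in for an RDD[(K, V)]: each inner Vector plays the role of one partition.
  type Partitions[K, V] = Vector[Vector[(K, V)]]

  // Hypothetical hash partitioner (non-negative modulo of the key's hash code).
  def partitionOf[K](key: K, numPartitions: Int): Int = {
    val m = key.hashCode % numPartitions
    if (m < 0) m + numPartitions else m
  }

  // Route every record to the partition chosen by the partitioner,
  // so that equal keys always end up in the same partition.
  def partitionByKey[K, V](data: Seq[(K, V)], numPartitions: Int): Partitions[K, V] = {
    val buckets = Vector.fill(numPartitions)(ArrayBuffer.empty[(K, V)])
    data.foreach { case (k, v) => buckets(partitionOf(k, numPartitions)) += ((k, v)) }
    buckets.map(_.toVector)
  }

  // The RDD index: an Array with one HashMap per partition.
  // Each HashMap maps a key to the positions of that key inside its partition.
  def buildIndex[K, V](parts: Partitions[K, V]): Array[HashMap[K, ArrayBuffer[Int]]] =
    parts.map { partition =>
      val index = HashMap.empty[K, ArrayBuffer[Int]]
      partition.iterator.zipWithIndex.foreach { case ((k, _), pos) =>
        index.getOrElseUpdate(k, ArrayBuffer.empty[Int]) += pos
      }
      index
    }.toArray
}
```

Storing positions in an ArrayBuffer follows the description of pos in Embodiment 2; a key that occurs several times in a partition simply gets several positions.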

Embodiment 2

[0050] Based on the index established in Embodiment 1, this embodiment provides an RDD point query method. The method uses the partition information to find the partition's index, uses that index to find the position of the data within the partition, and finally retrieves the data.

[0051] As shown in Figure 3, the method comprises the following steps:

[0052] S2-1: use the partitioner to obtain the index of the partition that holds the key being looked up;

[0053] S2-2: using that partition index, obtain the corresponding partition internal data index (i.e., a HashMap) from the RDD index;

[0054] S2-3: call the HashMap's apply method to obtain pos, the position of the key within the partition; pos is an ArrayBuffer;

[0055] S2-4: call the slice method of the partition iterator according to pos to obtain the slice data of ...
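
Steps S2-1 to S2-4 can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: partitions are modelled as plain Scala Vectors instead of a Spark RDD, the hash partitioner and the names `PointQuerySketch`, `partitionOf` and `pointQuery` are hypothetical, and because step S2-4 is truncated in the source, taking one single-element slice per stored position is an assumption. The sketch also checks `contains` before calling `apply`, since `apply` on a Scala HashMap throws when the key is absent.

```scala
import scala.collection.mutable.{ArrayBuffer, HashMap}

object PointQuerySketch {
  // Stand-ins: partitioned data and the per-partition index from Embodiment 1.
  type Partitions[K, V] = Vector[Vector[(K, V)]]
  type RddIndex[K] = Array[HashMap[K, ArrayBuffer[Int]]]

  // Hypothetical hash partitioner (non-negative modulo of the key's hash code).
  def partitionOf[K](key: K, numPartitions: Int): Int = {
    val m = key.hashCode % numPartitions
    if (m < 0) m + numPartitions else m
  }

  def pointQuery[K, V](parts: Partitions[K, V], index: RddIndex[K], key: K): Seq[V] = {
    // S2-1: the partitioner tells us which partition can hold the key.
    val pid = partitionOf(key, parts.length)
    // S2-2: fetch that partition's HashMap from the RDD index.
    val partitionIndex = index(pid)
    if (!partitionIndex.contains(key)) Seq.empty
    else {
      // S2-3: apply returns pos, the positions of the key in the partition.
      val pos = partitionIndex(key)
      // S2-4 (assumed form): slice the partition iterator at each position
      // and keep the value of the record found there.
      pos.toList.flatMap(p => parts(pid).iterator.slice(p, p + 1).map(_._2))
    }
  }
}
```

No key comparison is performed against the records of the partition; the HashMap alone decides which positions are read.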

Embodiment 3

[0059] Based on Embodiments 1 and 2, this embodiment provides a join RDD point query method. It uses the index established by the method of Embodiment 1 and the point query method of Embodiment 2 to query the result of a natural join of two RDDs.

[0060] As shown in Figure 4, the method comprises the following steps:

[0061] S3-1: apply the method of Embodiment 1 to each of the two RDDs to be joined, constructing the corresponding RDD indexes;

[0062] S3-2: apply the method of Embodiment 2 to each of the two RDDs to find the values that meet the condition;

[0063] S3-3: combine the query results of the two RDDs and return them in the form of join data.
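
As a rough illustration of steps S3-1 to S3-3, the sketch below indexes two data sets, looks the key up in each, and pairs the results without materializing a join. It assumes plain Scala collections in place of Spark RDDs; `JoinPointQuerySketch`, `IndexedData`, `build`, `lookup` and `joinLookup` are hypothetical names, and returning `(key, (v, w))` pairs is an assumed reading of "join data", chosen to mirror the shape of a key-value join result.

```scala
import scala.collection.mutable.{ArrayBuffer, HashMap}

object JoinPointQuerySketch {
  // Hypothetical container: partitioned data plus its per-partition index.
  final case class IndexedData[K, V](
      parts: Vector[Vector[(K, V)]],
      index: Array[HashMap[K, ArrayBuffer[Int]]])

  // Hypothetical hash partitioner (non-negative modulo of the key's hash code).
  def partitionOf[K](key: K, numPartitions: Int): Int = {
    val m = key.hashCode % numPartitions
    if (m < 0) m + numPartitions else m
  }

  // S3-1: partition the data by key and build one HashMap per partition.
  def build[K, V](data: Seq[(K, V)], numPartitions: Int): IndexedData[K, V] = {
    val grouped = data.groupBy { case (k, _) => partitionOf(k, numPartitions) }
    val parts = Vector.tabulate(numPartitions) { i =>
      grouped.getOrElse(i, Seq.empty[(K, V)]).toVector
    }
    val index = parts.map { partition =>
      val m = HashMap.empty[K, ArrayBuffer[Int]]
      partition.iterator.zipWithIndex.foreach { case ((k, _), pos) =>
        m.getOrElseUpdate(k, ArrayBuffer.empty[Int]) += pos
      }
      m
    }.toArray
    IndexedData(parts, index)
  }

  // S3-2: point query on one indexed data set (as in Embodiment 2).
  def lookup[K, V](d: IndexedData[K, V], key: K): Seq[V] = {
    val pid = partitionOf(key, d.parts.length)
    d.index(pid).get(key) match {
      case Some(pos) =>
        pos.toList.flatMap(p => d.parts(pid).iterator.slice(p, p + 1).map(_._2))
      case None => Seq.empty
    }
  }

  // S3-3: pair every left value with every right value for the key.
  def joinLookup[K, V, W](left: IndexedData[K, V], right: IndexedData[K, W], key: K): Seq[(K, (V, W))] =
    for { v <- lookup(left, key); w <- lookup(right, key) } yield (key, (v, w))
}
```

Only the records for the requested key are touched on either side, which is how the sketch avoids building the full joined data set.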

Abstract

The invention discloses a method for establishing an RDD partition internal data index, an RDD point query method, and a join RDD point query method. An index is established for the internal data of each RDD partition: a HashMap stores the position information of each record in the partition, and the indexes of all partitions together form the index of the entire RDD. When looking up a key, the data in the partition does not need to be traversed in full; instead, the position of the key in the partition is found directly through the HashMap, and the corresponding value is then read from that position using the slice interface of the partition Iterator. This addresses the poor performance of Spark's native lookup API and improves query efficiency. In addition, by creating indexes for the two RDDs to be joined and then running the query against those indexes, an actual join of the RDDs can be avoided, which effectively prevents OOM and improves query efficiency.

Description

Technical field

[0001] The invention relates to the field of RDD indexing, in particular to a method for establishing an RDD partition internal data index, an RDD point query method, and a join RDD point query method.

Background

[0002] With the development of big data processing, the requirements for processing speed keep rising. Traditional distributed big data platforms based on disk storage increasingly struggle with big data workloads, especially machine learning and iterative computation. In-memory computing technology emerged in response. In-memory computing keeps data in memory and does not need to frequently save intermediate results to disk during processing, thus avoiding unnecessary I/O overhead. The advantages of in-memory computing are significant. First of all, it can effectively accelerate the complex analysis and pro...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/22, G06F16/2455, G06F16/25, G06F16/28
CPC: G06F16/2228, G06F16/2456, G06F16/252, G06F16/283
Inventor: 黄伟
Owner: INSPUR SUZHOU INTELLIGENT TECH CO LTD