An improved method for Spark BroadcastHashJoin

A key-value and hash-table technique, applied to improving the Spark BroadcastHashJoin operation, that addresses two problems: low lookup efficiency during the join, and the failure to consider the prior probability of each key being searched.

Inactive Publication Date: 2017-09-15
ZHENGZHOU YUNHAI INFORMATION TECH CO LTD

AI Technical Summary

Problems solved by technology

When the existing method handles hash conflicts, it simply appends each conflicting entry, in arrival order, to the linked list for that hash value. It does not take into account the prior probability that an entry in the small table will be searched for by the large table, that is, the probability that the same field value appears in both the large table and the small table. As a result, lookups are inefficient during the subsequent join with the large table.
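The cost of ignoring prior probability can be illustrated with a small calculation. The sketch below assumes a single chain of three colliding keys and hypothetical lookup weights (1, 2 and 7 lookups out of 10); neither the weights nor the function name come from the patent.

```python
def expected_comparisons(chain_weights):
    """Expected key comparisons for a successful lookup in one chain.

    chain_weights[i] is the (hypothetical) relative frequency with which
    the entry at position i of the chain is the one being searched for.
    """
    total = sum(chain_weights)
    # Reaching the entry at position i costs i + 1 comparisons.
    return sum((i + 1) * w for i, w in enumerate(chain_weights)) / total


# Chain in arrival order: the most frequently searched key happens to be last.
arrival_order = [1, 2, 7]
# Same chain, reordered so the most frequently searched key comes first.
by_probability = sorted(arrival_order, reverse=True)

print(expected_comparisons(arrival_order))   # 2.6
print(expected_comparisons(by_probability))  # 1.4
```

Under these assumed weights, placing likely keys at the head of the chain nearly halves the average number of comparisons per lookup.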

Embodiment Construction

[0025] The core of the present invention is to provide an improved method for the Spark BroadcastHashJoin operation that improves search efficiency and, in turn, join efficiency.

[0026] To enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by persons of ordinary skill in the art on the basis of these embodiments, without creative effort, fall within the protection scope of the present invention.

[0027] Please refer to Figure 1, which is a flow chart of the improved method of a Spark BroadcastHashJoin operation provided by the p...

Abstract

The invention provides an improved method for Spark BroadcastHashJoin. The method comprises the steps of: acquiring the small table and the prior probability with which each key value in the small table is searched for by the big table; arranging the keys of the small table in descending order of prior probability to obtain a new ordered table; building a hash table from the new ordered table and broadcasting the hash table to each node; and, on each node, acquiring the content of the big table and joining it with the matching entries in the hash table. The method increases search efficiency and thereby the efficiency of the join.
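The steps in the abstract can be sketched in plain Python. This is a single-process illustration under stated assumptions, not the patent's implementation: the prior probabilities are passed in as a ready-made dictionary (the abstract does not say how they are obtained), the broadcast step is omitted, integer keys and a fixed bucket count of 4 are chosen to force collisions, and all function names are hypothetical.

```python
from collections import defaultdict


def build_ordered_hash_table(small_rows, prior, num_buckets=4):
    """Build a chained hash table whose chains list likely keys first.

    small_rows: list of (key, value) pairs from the small table.
    prior: dict mapping key -> assumed probability of being searched.
    """
    # Sort the small table by prior probability, highest first.
    ordered = sorted(small_rows, key=lambda kv: prior.get(kv[0], 0.0), reverse=True)
    # Insert in that order; appending keeps every chain in the same order.
    table = defaultdict(list)
    for key, value in ordered:
        table[hash(key) % num_buckets].append((key, value))
    return table


def probe_join(big_rows, table, num_buckets=4):
    """Join big-table rows against the hash table, counting comparisons."""
    joined, comparisons = [], 0
    for key, big_value in big_rows:
        for small_key, small_value in table.get(hash(key) % num_buckets, []):
            comparisons += 1
            if small_key == key:
                joined.append((key, big_value, small_value))
                break  # assumes keys are unique in the small table
    return joined, comparisons


# Hypothetical data: keys 1, 5 and 9 all fall into bucket 1 when num_buckets=4.
small = [(1, "a"), (5, "b"), (9, "c")]
prior = {1: 0.1, 5: 0.2, 9: 0.7}
big = [(9, "x")] * 7 + [(5, "y")] * 2 + [(1, "z")]

table = build_ordered_hash_table(small, prior)
joined, comparisons = probe_join(big, table)
print(len(joined), comparisons)
```

With this data the ten big-table rows are joined using 14 key comparisons; the same chain kept in arrival order (1, 5, 9) would need 26.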

Description

Technical field

[0001] The invention relates to the technical field of Spark big data processing, and in particular to an improved method of the Spark BroadcastHashJoin operation.

Background technique

[0002] Spark is a distributed, parallel big data processing framework that has developed rapidly in recent years, and Spark SQL provides processing for structured data. BroadcastHashJoin is an important operation in Spark SQL, used to handle multi-table joins. It is an optimized form of the join operation.

[0003] In Spark SQL, BroadcastHashJoin first uses broadcast to distribute the small table to each execution node, then computes hash values from the key data in the small table to build a hash table, and finally computes the hash value for each key of the large table and looks it up in the hash table to complete the table join. During the BroadcastHashJoin operation, a small table needs to be used to create a hash tab...
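The build-then-probe procedure described in the background can be sketched as follows. This is a minimal single-process illustration, assuming a fixed number of buckets and chains kept in arrival order; it does not reproduce Spark's internal data structures, and in Spark each execution node would run the probe phase against its own portion of the large table. The function name and sample rows are hypothetical.

```python
def hash_join(small_rows, big_rows, num_buckets=8):
    """Inner-join big_rows with small_rows on the first tuple element."""
    # Build phase: hash each small-table key into a bucket; colliding
    # entries are appended to the bucket's chain in arrival order.
    buckets = [[] for _ in range(num_buckets)]
    for key, value in small_rows:
        buckets[hash(key) % num_buckets].append((key, value))

    # Probe phase: hash each big-table key and scan the matching chain.
    result = []
    for key, big_value in big_rows:
        for small_key, small_value in buckets[hash(key) % num_buckets]:
            if small_key == key:
                result.append((key, big_value, small_value))
    return result


small = [(1, "dept-a"), (2, "dept-b")]
big = [(1, "alice"), (2, "bob"), (3, "carol"), (1, "dave")]
print(hash_join(small, big))
# [(1, 'alice', 'dept-a'), (2, 'bob', 'dept-b'), (1, 'dave', 'dept-a')]
```

The row with key 3 has no match in the small table and is dropped, as in an inner join.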

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/22, G06F16/2456
Inventor: 曹芳
Owner: ZHENGZHOU YUNHAI INFORMATION TECH CO LTD