Optimization system and method for shuffling stage in Hadoop MapReduce

An optimization method for the shuffling stage, applied in the fields of big data and cloud computing. It addresses the impact of shuffle-stage disk I/O on task completion time, so as to shorten data reading time, optimize task completion time, and reduce tail delay.

Active Publication Date: 2019-11-26
SHANGHAI JIAO TONG UNIV

AI Technical Summary

Problems solved by technology

However, dividing a task into a large number of subtasks causes the shuffling phase to read and write a large number of small files.
This large volume of small, random disk I/O becomes the bottleneck of the shuffling stage and seriously affects task completion time.
[0011] At present, no description or report of technology similar to the present invention has been found, and no similar material has been collected at home or abroad.


Examples

Embodiment Construction

[0033] The embodiments of the present invention are described in detail below. This embodiment is implemented on the basis of the technical solution of the present invention, and gives a detailed implementation and specific operating procedures. It should be noted that those skilled in the art can make modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention.

[0034] The embodiment of the present invention provides an optimization system for the shuffling phase in Hadoop MapReduce, comprising a system master node and system worker nodes, wherein:

[0035] The system master node includes a scheduler module and a communication module a. The scheduler module schedules when partition files are merged in advance, when shuffling is performed in advance, and the destination of the shuffling results; the communication module...

Abstract

The invention provides an optimization system for the shuffling stage in Hadoop MapReduce. The optimization system runs as a daemon process on the worker nodes and the master node of Hadoop MapReduce, and communicates with Hadoop MapReduce through inter-process communication and remote procedure calls. The invention also provides an optimization method based on this system. Once running, the optimization system takes over all intermediate data produced while a Hadoop MapReduce task executes. Through pre-merging and pre-shuffling, it makes use of the network bandwidth that is idle during the Map stage, and it reduces small-file reads and writes by merging the intermediate data held on the same node, so that the MapReduce task completion time is optimized.
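The pre-merging idea in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the in-memory data layout, the function names `pre_merge` and `count_segments`, and the sample records are all assumptions made for illustration. Each finished map task on a node is modelled as a dictionary from reduce-partition id to its records, and merging collapses those per-task segments into one segment per partition.

```python
from collections import defaultdict


def pre_merge(map_outputs):
    """Merge the partition segments of all map tasks finished on one node.

    map_outputs: list of dicts, one per map task, mapping a
    reduce-partition id to a list of (key, value) records.
    Returns a dict with one key-sorted record list per partition.
    """
    merged = defaultdict(list)
    for output in map_outputs:
        for partition_id, records in output.items():
            merged[partition_id].extend(records)
    return {pid: sorted(recs) for pid, recs in merged.items()}


def count_segments(map_outputs):
    """Number of separate partition segments before merging."""
    return sum(len(output) for output in map_outputs)


# Hypothetical intermediate data from three map tasks on the same node,
# each split into two reduce partitions.
node_outputs = [
    {0: [("apple", 1)], 1: [("pear", 1)]},
    {0: [("avocado", 1)], 1: [("plum", 1)]},
    {0: [("apple", 1)], 1: [("peach", 1)]},
]

merged = pre_merge(node_outputs)
segments_before = count_segments(node_outputs)
segments_after = len(merged)
print(segments_before, segments_after)
```

In this toy case six small segments become two larger ones, so a reducer would fetch one segment from the node instead of three. A real system would merge files on disk and would have to decide when to merge, which the source assigns to the scheduler module.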

Description

Technical field

[0001] The invention relates to the technical field of big data and cloud computing, and in particular to an optimization system and method for the Shuffle stage in Hadoop MapReduce.

Background

[0002] MapReduce is a distributed computing framework for processing big data. Hadoop MapReduce is the best-known and most widely used open-source implementation of MapReduce. By simply writing Map and Reduce algorithms, Hadoop MapReduce users can process massive data (TB-level or even PB-level) in parallel on large-scale clusters (up to thousands of nodes). Moreover, Hadoop MapReduce provides strong fault tolerance to ensure that tasks complete across thousands of nodes.

[0003] Hadoop MapReduce follows the BSP (Bulk Synchronous Parallel) model and abstracts the distributed computing process into three stages: Map, Shuffle, and Reduce.

[0004] The operation of the Map phase is divided into two sub-phases: Map calculation and partition (P...
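The three stages named in [0003] can be sketched in a single process as a word count. This is an illustrative toy, not Hadoop code: the function names, the sample input splits, and the use of `zlib.crc32` as the partition hash are assumptions chosen so that the example is deterministic and self-contained.

```python
import zlib
from collections import defaultdict


def map_phase(split):
    """Map calculation: emit a (word, 1) record for every word."""
    return [(word, 1) for word in split.split()]


def partition(key, num_reducers):
    """Partition: assign a key to a reducer (crc32 is an assumed stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_phase(map_outputs, num_reducers):
    """Shuffle: route every record to its reducer and group values by key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for records in map_outputs:
        for key, value in records:
            buckets[partition(key, num_reducers)][key].append(value)
    return buckets


def reduce_phase(bucket):
    """Reduce: sum the grouped values of each key."""
    return {key: sum(values) for key, values in bucket.items()}


# Hypothetical input splits, one per map task.
splits = ["a b a", "b c", "a c c"]
map_outputs = [map_phase(s) for s in splits]
buckets = shuffle_phase(map_outputs, num_reducers=2)
results = [reduce_phase(b) for b in buckets]
totals = {k: v for part in results for k, v in part.items()}
print(totals)
```

Each stage here finishes before the next begins, which mirrors the bulk-synchronous structure described above. In a cluster the shuffle step moves data between nodes over the network and disk, which is why it is the stage this invention targets.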


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F9/50, G06F16/18
CPC: G06F9/5066, G06F16/1815, Y02D10/00
Inventors: 管海兵, 吴仲轩, 任锐, 戚正伟
Owner: SHANGHAI JIAO TONG UNIV