Method for realizing MapReduce data localization within a job

A job-level data-localization technique, applied in the computer field, that addresses problems such as low practicability, limited applicability, and network bandwidth consumption.

Inactive Publication Date: 2015-09-16
UNIV OF ELECTRONICS SCI & TECH OF CHINA

AI Technical Summary

Problems solved by technology

[0004] 1. Network bandwidth consumed when users upload data from local storage to HDFS;
[0005] 2. Network bandwidth consumed in the Shuffle stage when the output of the Map stage is transferred to the Reduce stage;
[0006] 3. Network bandwidth consumed in the Reduce stage when the processing results are stored in HDFS;
[0007] 4. Network bandwidth consumed by non-localized tasks.
In addition, experiments show that most non-localized tasks appear after the Shuffle phase has started. Tasks that appear at this point compete with the Shuffle phase for network bandwidth, thereby delaying the execution of the job itself.
[0009] Many scheduling strategies exist for improving data localization in the Map stage, but they suffer from problems such as low practicability and a limited scope of application.
Zaharia et al. proposed a delay scheduling algorithm that can effectively improve data localization ("Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling," in Proceedings of the 5th European Conference on Computer Systems. ACM, 2010, pp. 265–278.). However, delay scheduling sacrifices the execution efficiency of the job itself, and it is not widely applicable: when only one or a few jobs are running, it cannot achieve optimal data localization or optimal overall job execution time.
Xie et al. proposed a method that distributes data in advance according to the performance of the computing nodes ("Improving MapReduce performance through data placement in heterogeneous Hadoop clusters," in Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on. IEEE, 2010, pp. 1–9.). This method requires the performance of each computing node to be measured in advance, so it is not very practical on a MapReduce platform where the computing resources of a node can be set dynamically by adjusting parameters.
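For readers unfamiliar with the delay scheduling idea cited above, the following is a minimal single-job sketch, not the published implementation. The names `pick_task`, `max_skips`, and the dictionary layout of `job` are hypothetical and chosen only for illustration. The job declines a free slot when it has no task whose input is on the offering node, and accepts a non-local task only after it has declined `max_skips` offers.

```python
def pick_task(job, node, max_skips):
    """Return (task_id, is_local) for a free slot on `node`, or None to decline.

    job is a dict with:
      'pending': list of (task_id, set of node ids holding the task's input)
      'skips':   number of offers declined since the last local launch
    """
    if not job["pending"]:
        return None
    # Prefer a task whose input block is stored on the offering node.
    for i, (task_id, hosts) in enumerate(job["pending"]):
        if node in hosts:
            job["pending"].pop(i)
            job["skips"] = 0
            return task_id, True
    # No local task: give up locality only after enough declined offers.
    if job["skips"] >= max_skips:
        task_id, _ = job["pending"].pop(0)
        return task_id, False
    job["skips"] += 1
    return None


# Hypothetical job with two pending tasks whose inputs live on nodes 2 and 3.
job = {"pending": [("t0", {2}), ("t1", {3})], "skips": 0}
first = pick_task(job, node=1, max_skips=2)   # declined, skips becomes 1
second = pick_task(job, node=1, max_skips=2)  # declined, skips becomes 2
third = pick_task(job, node=1, max_skips=2)   # threshold reached, non-local launch
fourth = pick_task(job, node=3, max_skips=2)  # local launch of the remaining task
```

The declined offers are the waiting cost that the text above describes as a loss of execution efficiency for the job itself.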

Examples


Embodiment Construction

[0036] A method for realizing MapReduce data localization within a job, the flow of which is shown in Figure 1. On a cluster with n physical computing nodes, for a given scheduled job A, localization is achieved during execution in the following way:

[0037] Step 1: Clusters may be homogeneous or heterogeneous. Before any computation has started, the cluster is assumed to be homogeneous, that is, the computing performance $P_i$ of every physical computing node is assumed to be 1, where $i \in [1, n]$. For job A, let b be the number of data blocks corresponding to the job, with the HDFS default of 3 replicas per data block. Let $F_{Ti}$ denote the number of these data blocks stored on node i; the total number of data blocks is then $\sum_{i=1}^{n} F_{Ti} = 3b$;
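The bookkeeping in Step 1 can be illustrated with a small sketch. The block placement, the function name `local_block_counts`, and the node numbering below are hypothetical; only the replication factor of 3, the initial performance value of 1, and the identity that the per-node counts sum to 3b come from the text.

```python
from collections import Counter

REPLICATION = 3  # HDFS default number of replicas per block, as stated in Step 1


def local_block_counts(placement, n):
    """Return [F_T1, ..., F_Tn]: how many of job A's block replicas each node holds.

    placement maps a block id to the set of node indices (1..n) holding a replica.
    """
    counts = Counter(node for nodes in placement.values() for node in nodes)
    return [counts.get(i, 0) for i in range(1, n + 1)]


# Hypothetical placement: b = 4 blocks on n = 4 nodes, 3 replicas each.
placement = {
    "blk_0": {1, 2, 3},
    "blk_1": {1, 2, 4},
    "blk_2": {1, 3, 4},
    "blk_3": {1, 2, 3},
}
n = 4
b = len(placement)
performance = [1] * n  # P_i = 1 for every node before anything has been measured
f_t = local_block_counts(placement, n)

# The per-node counts must add up to 3b, as in Step 1.
assert sum(f_t) == REPLICATION * b
```

With this placement, node 1 holds a replica of every block and node 4 holds only two, so the per-node counts differ even though the cluster is assumed homogeneous.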

[0038] A min-heap is built, keyed on the number of localized data blocks of job A on each computing node, and the first round of task assignment is performed o...
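The min-heap mentioned in this paragraph can be sketched with Python's standard `heapq` module. The source text is cut off before it explains how the heap drives the first round of task assignment, so this sketch covers only the data structure; the function name `build_node_heap` and the sample counts are hypothetical.

```python
import heapq


def build_node_heap(local_counts):
    """Build a min-heap of (local_block_count, node_id) pairs.

    local_counts maps node_id -> number of job A's data blocks stored on that node.
    """
    heap = [(count, node) for node, count in local_counts.items()]
    heapq.heapify(heap)  # O(n) construction; the smallest count sits at index 0
    return heap


# Hypothetical per-node counts of localized data blocks for job A.
local_counts = {1: 4, 2: 3, 3: 3, 4: 2}
heap = build_node_heap(local_counts)

# The root is the node with the fewest localized blocks.
fewest_count, fewest_node = heap[0]

# Popping repeatedly yields nodes in ascending order of localized blocks;
# ties are broken by the smaller node id because tuples compare element-wise.
drain_order = [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

Keying the heap on a tuple keeps the ordering deterministic when two nodes hold the same number of blocks, which is an illustrative choice and not something the source specifies.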


Abstract

The invention provides a method for realizing MapReduce data localization within a job and belongs to the technical field of computers. The method further improves the degree of task data localization of a job by changing the task scheduling algorithm inside the job. For a given job, when a computing task is scheduled to a computing node, tasks are assigned by jointly considering two quantities: the number of localized data blocks of the job remaining on that node, and the estimated number of tasks the node will need to process in the future, which is obtained through a series of calculations. The method does not require the computing performance of each computing node to be measured in advance, is flexible and convenient to implement, and does not affect the execution efficiency of the job itself. It can reduce the network bandwidth occupied by the Map stage as far as possible, so that the degree of parallelism of cluster jobs is improved and the overall execution time of all jobs is significantly shortened.

Description

Technical field

[0001] The invention belongs to the technical field of computers, and in particular relates to an optimization method for realizing MapReduce data localization within a job.

Background

[0002] As distributed computing has developed, many other distributed computing models have emerged after MapReduce, such as Spark and Storm. Each of these models has its own emphasis in data processing, so large Internet companies run all of them on the same physical cluster at the same time. Although the distributed computing models can be isolated from one another, the network bandwidth of the entire physical cluster is shared. Therefore, optimizing and reducing the network bandwidth consumption of the MapReduce computing model is not only, for the other computing models on the same physical cluster, benefi...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F9/46
Inventor: 高胜立, 薛瑞尼, 管仲洋
Owner: UNIV OF ELECTRONICS SCI & TECH OF CHINA