A method for achieving MapReduce data localization within a job

A job scheduling and data localization technique, applied in the computer field, that addresses the low practicability, limited applicability, and network bandwidth consumption of existing approaches.

Inactive Publication Date: 2018-04-06
UNIV OF ELECTRONICS SCI & TECH OF CHINA
Cites: 3; Cited by: 0

AI Technical Summary

Problems solved by technology

[0004] 1. Network bandwidth consumed when users upload data from local storage to HDFS;
[0005] 2. Network bandwidth consumed in the Shuffle stage when the output of the Map stage is transferred to the Reduce stage;
[0006] 3. Network bandwidth consumed in the Reduce stage when the processing results are stored in HDFS;
[0007] 4. Network bandwidth consumed by non-localized tasks.
In addition, experiments show that most non-localized tasks appear after the Shuffle phase has started. Non-localized tasks that appear at this point compete with the Shuffle phase for network bandwidth, thereby delaying the progress of the job itself.
[0009] Many scheduling strategies have been proposed to improve data localization in the Map stage, but they suffer from problems such as low practicability and a limited scope of application.
Zaharia et al. proposed a delay scheduling algorithm that can effectively improve the degree of data localization ("Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling," in Proceedings of the 5th European Conference on Computer Systems. ACM, 2010, pp. 265–278). However, delay scheduling comes at the cost of the execution efficiency of local jobs, and the algorithm is not widely applicable: when only one or a few jobs are running, it cannot achieve optimal data localization or optimal overall job execution time.
Xie et al. proposed a method that distributes data in advance according to the performance of the computing nodes ("Improving MapReduce performance through data placement in heterogeneous Hadoop clusters," in Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010 IEEE International Symposium on. IEEE, 2010, pp. 1–9). This method requires the performance of each computing node to be measured in advance, so it is not very practical on a MapReduce platform where the computing resources of each node can be changed dynamically by adjusting parameters.




Embodiment Construction

[0036] A method for realizing the localization of MapReduce data within a job, the process of which is shown in Figure 1. On a cluster with n physical computing nodes, for a specific scheduled job A, localization is achieved during execution in the following way:

[0037] Step 1: A cluster may be homogeneous or heterogeneous. Before any computation has started, the cluster is assumed to be homogeneous; that is, the computing performance $P_i$ of every physical computing node is assumed to be 1, where $i \in [1, n]$. For job A, let b be the number of data blocks corresponding to the job, with each data block kept in 3 copies on HDFS by default. Let $F_{Ti}$ denote the number of the job's data blocks stored on computing node i; then the total number of data blocks is $\sum_{i=1}^{n} F_{Ti} = 3b$;
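The bookkeeping in Step 1 can be illustrated with a minimal sketch. It assumes the block placement is available as a mapping from block to the nodes holding its replicas; the `placement` data, the node names, and the function `local_block_counts` are hypothetical and are not taken from the patent text.

```python
REPLICATION = 3  # default HDFS replication factor stated in Step 1


def local_block_counts(placement, nodes):
    """Return F_Ti for each node: how many of the job's block replicas it holds."""
    counts = {node: 0 for node in nodes}
    for replicas in placement.values():
        for node in replicas:
            counts[node] += 1
    return counts


# Hypothetical placement: b = 4 blocks on n = 4 nodes, 3 replicas each.
placement = {
    "blk0": ["n1", "n2", "n3"],
    "blk1": ["n1", "n2", "n4"],
    "blk2": ["n1", "n3", "n4"],
    "blk3": ["n2", "n3", "n4"],
}
nodes = ["n1", "n2", "n3", "n4"]

# Homogeneous assumption before any computation has started: P_i = 1 for all i.
performance = {node: 1 for node in nodes}

f_t = local_block_counts(placement, nodes)
b = len(placement)

# The identity from Step 1: the per-node counts sum to 3b.
assert sum(f_t.values()) == REPLICATION * b
```

In this example every node holds 3 replicas, so the counts sum to 12, which equals 3b for b = 4.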

[0038] The number of localized data blocks of job A on each computing node is used as the key to build a min-heap ("small top heap") and perform the first round of task assignment o...
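Because the paragraph above is truncated, the exact assignment rule is not available here. The following is only a sketch of one plausible reading: nodes are popped from a min-heap keyed on their local block count, and each node receives one not-yet-assigned block that it stores locally. The function `first_round`, the tie-breaking by node name, and the sample data are assumptions made for illustration, not the patented procedure.

```python
import heapq


def first_round(placement, nodes):
    """Assign at most one local block per node, fewest-local-blocks first.

    Assumed rule (not confirmed by the source): a node with few local
    blocks picks before nodes with many, so its scarce local blocks are
    not taken by others.
    """
    # Collect the blocks stored locally on each node.
    local = {node: [] for node in nodes}
    for block, replicas in placement.items():
        for node in replicas:
            local[node].append(block)

    # Min-heap keyed on the number of local blocks; ties break on node name.
    heap = [(len(blocks), node) for node, blocks in local.items()]
    heapq.heapify(heap)

    assigned, taken = {}, set()
    while heap:
        _, node = heapq.heappop(heap)
        for block in local[node]:
            if block not in taken:
                assigned[node] = block
                taken.add(block)
                break
    return assigned


# Hypothetical placement with uneven local block counts.
sample = {
    "blk0": ["n1", "n2", "n3"],
    "blk1": ["n1", "n2", "n3"],
    "blk2": ["n1", "n2", "n4"],
}
result = first_round(sample, ["n1", "n2", "n3", "n4"])
```

With this sample, node n4 holds only one local block and is served first. Node n2 receives nothing in this round because all three blocks are already taken by the time it is popped.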


Abstract

The invention provides a method for realizing MapReduce data localization within a job, and belongs to the technical field of computers. By changing the task scheduling algorithm within the job, the invention further improves the degree of data localization of the job's tasks. For a specified job, when scheduling a computing task to a computing node, the method considers both the number of the job's localized data blocks remaining on that computing node and the number of tasks the computing node is expected to process in the future, the latter being calculated through a series of steps, and allocates tasks accordingly. The method does not need to measure the computing performance of each computing node in advance, is flexible and convenient to implement, and does not affect the execution efficiency of local jobs. It reduces the network bandwidth occupied in the Map stage as far as possible, thereby improving the parallelism of cluster jobs, and it also significantly reduces the overall execution time of the job.

Description

Technical field

[0001] The invention belongs to the technical field of computers, and in particular relates to an optimization method for realizing the localization of MapReduce data within a job.

Background

[0002] With the development of distributed computing, many other distributed computing models have emerged after MapReduce, such as Spark and Storm. Each of these models has its own emphasis in data processing, so in large Internet companies they are all deployed on one physical cluster at the same time. Although the distributed computing models can be isolated from each other, the network bandwidth of the entire physical cluster is shared. Therefore, reducing the network bandwidth consumption of the MapReduce computing model is not only beneficial to the other computing models on the same physical cluster...


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F9/46
Inventor: 高胜立, 薛瑞尼, 管仲洋
Owner: UNIV OF ELECTRONICS SCI & TECH OF CHINA