A method and system for accelerating data distributed processing across multiple data centers

A distributed processing technique for multiple data centers, applied in the field of data analysis. It addresses problems such as insufficient consideration of site heterogeneity, and achieves accurate time estimation together with reduced job processing time and job response time.

Active Publication Date: 2021-05-11
NAT UNIV OF DEFENSE TECH

AI Technical Summary

Problems solved by technology

[0005] The purpose of the present invention is to provide a method and system for accelerating distributed data processing across multiple data centers, so as to remedy the technical defect of the prior art, which does not fully consider the heterogeneity between sites.



Examples


Embodiment 1

[0205] This example evaluates the performance of SDTP by comparing it with several classical task placement methods, using the reductions in average response time and average slowdown as the basis for its conclusions. Here, slowdown is defined as the rate at which the response time of a single job is reduced relative to another method. For example, the slowdown of job A relative to In-Place is the fraction by which the response time of job A under SDTP falls below its response time under In-Place. Average slowdown is the sum of the slowdowns of all jobs divided by the number of jobs.
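The slowdown metric can be sketched in a few lines of Python. This is a minimal illustration and assumes that "response time reduction rate" means the relative reduction (baseline minus SDTP, divided by baseline). The function names and the job timings are hypothetical and do not come from the patent.

```python
from typing import Dict


def slowdown(baseline_time: float, sdtp_time: float) -> float:
    """Relative reduction in response time versus a baseline method (assumed definition)."""
    if baseline_time <= 0:
        raise ValueError("baseline response time must be positive")
    return (baseline_time - sdtp_time) / baseline_time


def average_slowdown(baseline: Dict[str, float], sdtp: Dict[str, float]) -> float:
    """Sum of per-job slowdowns divided by the number of jobs."""
    jobs = baseline.keys() & sdtp.keys()  # only jobs measured under both methods
    if not jobs:
        raise ValueError("no jobs in common")
    return sum(slowdown(baseline[j], sdtp[j]) for j in jobs) / len(jobs)


# Hypothetical response times in seconds, for illustration only.
in_place_times = {"A": 100.0, "B": 80.0}
sdtp_times = {"A": 60.0, "B": 60.0}
print(slowdown(in_place_times["A"], sdtp_times["A"]))
print(average_slowdown(in_place_times, sdtp_times))
```

Under this assumed definition a larger value means a larger improvement over the baseline, and a negative value would mean the job ran slower under SDTP.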

[0206] Figure 9 (a) shows the improvement in average job response time achieved by SDTP for different numbers of sites. SDTP clearly outperforms the other benchmark methods. In particular, when the number of sites is 10, the method of the present invention reduces the average job response time of all job types by...

Embodiment 2

[0214] This example quantifies the impact of various parameters on SDTP, including the number of compute instances and the ratio of the amount of intermediate data to the input data of the data import stage.

[0215] Figure 12 (a) depicts the impact of the ratio of intermediate data to input data. The figure shows the response time for different values of this ratio, expressed relative to a reference response time. It can be seen that the job response time increases as the ratio increases. This is because a larger ratio generates more intermediate data. Both transmitting this intermediate data during the shuffle phase and processing it during the reduce phase may increase the overall response time.

[0216] Figure 12 (b) illustrates the reduction in average response time compared to In-Place, Iridium and Tetrium for different parameter values. It can be observed that as q increases, the decrease in mean response time increases compared to Tetrium, while the decrease in mean response time is relatively...

Embodiment 3

[0220] This embodiment considers the influence of the degree of parallelism in parallel computing. It first evaluates the effect of the forecasting method on the response time of the different stages. It then evaluates, with the degree of parallelism taken into account, how computational properties affect the different methods and how much the average response time improves.

[0221] This example uses BigDataBench to measure the running time of multiple queries on Spark with varying amounts of data and degrees of parallelism. From these results, the present embodiment uses multiple linear regression to build a prediction model for the computation time of each stage. The R² statistics are all greater than 0.9, where R is the correlation coefficient. The value of the F statistic is greater than the critical value from the F distribution table. The probabilities p corresponding to the F statistic are all le...
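The regression step can be illustrated with a small, dependency-free sketch. It assumes a model of the form time = b0 + b1 * data size + b2 * parallelism, which is one possible reading of "multiple linear regression" over these two variables; the patent does not give the feature set. The function names are hypothetical, and the sample measurements are synthetic, generated from a known linear rule so that the fit can be checked.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float, float]  # (data_gb, parallelism, seconds)


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: features are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_stage_time(samples: Sequence[Sample]) -> List[float]:
    """Least-squares fit of [b0, b1, b2] via the normal equations."""
    rows = [[1.0, d, p] for d, p, _ in samples]
    y = [t for _, _, t in samples]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(3)]
    return solve_linear_system(xtx, xty)


def predict(coef: Sequence[float], data_gb: float, parallelism: float) -> float:
    return coef[0] + coef[1] * data_gb + coef[2] * parallelism


def r_squared(coef: Sequence[float], samples: Sequence[Sample]) -> float:
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y = [t for _, _, t in samples]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - predict(coef, d, p)) ** 2 for d, p, t in samples)
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot


# Synthetic measurements from seconds = 2 + 0.5 * data_gb - 0.1 * parallelism.
SAMPLES: List[Sample] = [
    (d, p, 2.0 + 0.5 * d - 0.1 * p)
    for d in (1.0, 2.0, 4.0, 8.0)
    for p in (2.0, 4.0, 8.0)
]
COEF = fit_stage_time(SAMPLES)
print(COEF, r_squared(COEF, SAMPLES))
```

With real measurements the fit would not be exact, and R² would indicate how much of the variation in stage time the two variables explain.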


Abstract

The invention proposes a method for accelerating distributed data processing across multiple data centers. In this method, each site can perform its computing tasks as soon as it obtains the required input data. The input data loading, map computation, shuffle transfer, and reduce computation of each site do not need to wait for other sites to complete the preceding stages. The invention also provides accurate computation time estimation and adapts the method to dynamic wide area network bandwidth, which improves the practicability of SDTP and can greatly reduce job response time. The invention further proposes a corresponding system for accelerating distributed data processing across multiple data centers. The system makes full use of the network and computing resources of geographically distributed sites, so that geo-distributed data can be analyzed effectively without waiting for the bottleneck site of the previous stage to complete its data transmission or computing tasks.
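The benefit of removing per-stage barriers can be seen in a deliberately simplified timing comparison. The sketch below assumes that every site runs the four stages back to back on its own data and ignores cross-site data dependencies during shuffle, so it is not the SDTP model itself. The function names and the per-site stage durations are hypothetical.

```python
from typing import Dict, List

STAGES = ["load", "map", "shuffle", "reduce"]


def barrier_response_time(stage_times: Dict[str, List[float]]) -> float:
    """Each stage starts only after the slowest site has finished the previous stage."""
    return sum(max(stage_times[stage]) for stage in STAGES)


def pipelined_response_time(stage_times: Dict[str, List[float]]) -> float:
    """Each site proceeds through its stages without waiting; the job ends with the last site."""
    n_sites = len(stage_times[STAGES[0]])
    return max(
        sum(stage_times[stage][site] for stage in STAGES) for site in range(n_sites)
    )


# Hypothetical per-site durations in seconds for three heterogeneous sites.
example_times = {
    "load": [4.0, 2.0, 1.0],
    "map": [1.0, 3.0, 2.0],
    "shuffle": [2.0, 1.0, 4.0],
    "reduce": [1.0, 2.0, 1.0],
}
print(barrier_response_time(example_times))
print(pipelined_response_time(example_times))
```

Because a maximum of sums never exceeds a sum of maxima, the pipelined estimate is never larger than the barrier estimate. The gap grows when a different site is the bottleneck in each stage.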

Description

Technical field

[0001] The invention relates to the field of data analysis, and specifically discloses a method and system for accelerating distributed data processing across multiple data centers.

Background technique

[0002] Cloud providers such as Google, Amazon and Alibaba have deployed data centers around the world to provide instant services. These services generate massive amounts of data on a global scale, including transaction data, user logs, and performance logs. Mining this geographically distributed data (also known as wide-area analysis) is critical for business recommendations, anomaly detection, performance upgrades, and system maintenance. Distributed computing frameworks such as Map-Reduce are often used to mine such massive datasets. The main challenge of this computing approach is the heterogeneity of hardware resources among geographically distributed sites, mainly in computation, uplink bandwidth and downlink bandwidth....


Application Information

Patent Type & Authority: Patents (China)
IPC(8): H04L12/24, H04L29/08
CPC: H04L41/0823, H04L67/10, H04L67/60
Inventor: 郭得科, 陈亦婷, 袁昊, 郑龙, 罗来龙
Owner: NAT UNIV OF DEFENSE TECH