Cross-multi-data-center data distributed processing acceleration method and system

A technique for distributed data processing across multiple data centers, applied in the field of data analysis, that addresses problems such as insufficient consideration of site heterogeneity

Active Publication Date: 2021-03-19
NAT UNIV OF DEFENSE TECH

AI Technical Summary

Problems solved by technology

[0005] The purpose of the present invention is to provide a method and system for accelerating data distributed processing across...

Examples

Embodiment 1

[0205] This embodiment evaluates the performance of SDTP by comparing it with several classic task placement methods on two metrics: average response time and average slowdown. Conclusions are drawn from how much SDTP reduces each metric relative to the other methods. Here, slowdown is defined as the rate at which the response time of a single job is reduced compared with another method. For example, if the response time of job A is T_IP using In-Place and T_SDTP using SDTP, then the slowdown of job A relative to In-Place is (T_IP - T_SDTP) / T_IP. The average slowdown is the sum of the slowdowns of all jobs divided by the number of jobs.
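
The two metrics can be sketched in a few lines of Python. This is a minimal illustration, assuming that "reduction rate" means the baseline response time minus the SDTP response time, divided by the baseline response time. The function names and the sample response times are hypothetical and do not come from the patent.

```python
def slowdown(baseline_time, sdtp_time):
    """Fractional reduction in response time relative to a baseline method.

    Assumed reading of the text: (baseline - SDTP) / baseline.
    """
    if baseline_time <= 0:
        raise ValueError("baseline response time must be positive")
    return (baseline_time - sdtp_time) / baseline_time


def average_slowdown(baseline_times, sdtp_times):
    """Sum of the per-job slowdowns divided by the number of jobs."""
    if len(baseline_times) != len(sdtp_times) or not baseline_times:
        raise ValueError("need one baseline and one SDTP time per job")
    per_job = [slowdown(b, s) for b, s in zip(baseline_times, sdtp_times)]
    return sum(per_job) / len(per_job)


# Hypothetical response times in seconds for three jobs (not from the source).
in_place = [100.0, 80.0, 50.0]
sdtp = [60.0, 60.0, 40.0]

avg = average_slowdown(in_place, sdtp)
print(f"average slowdown vs In-Place: {avg:.3f}")
```

Averaging per-job ratios, rather than dividing total times, gives every job equal weight regardless of its length, which matches the "sum of all slowdowns divided by the number of jobs" wording.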

[0206] Figure 9 (a) shows SDTP's improvement in average job response time as the number of sites varies. Clearly, SDTP significantly outperforms the other baseline methods. In particular, when the number of sites is 10, our method reduces the average job response time of all job type...

Embodiment 2

[0214] This embodiment quantifies the effect of various parameters on SDTP, including the ratio of the intermediate data volume to the input data volume and the number of compute instances.

[0215] Figure 12 (a) depicts the impact of the ratio of intermediate data volume to input data volume. The figure shows the response time for different values of this ratio. It can be seen that the job response time increases as the ratio increases. This is because a larger ratio generates more intermediate data. Transmitting this intermediate data during the shuffle phase and processing it during the reduce phase may increase the overall response time.

[0216] Figure 12 (b) illustrates the reduction in average response time compared to In-Place, Iridium and Tetrium for different values of the parameter. It can be observed that as q increases, the reduction in average response time increases compared to Tetrium, while the reduction in average response time is rela...

Embodiment 3

[0220] This embodiment considers the impact of parallelism in parallel computing. It first evaluates the effect of the prediction method on the response time of the different stages. It then evaluates how computation characteristics affect the different methods in parallel computing, and how much the methods improve the average response time once the degree of parallelism is taken into account.

[0221] This embodiment uses BigDataBench to measure the running time of multiple queries on Spark with different data volumes and degrees of parallelism. Based on the results, this embodiment uses a multiple linear regression algorithm to build a prediction model for the computation time of each stage. The results show that the R² statistic is greater than 0.9, where R is the correlation coefficient. The value of the F statistic is greater than the critical value from the F distribution table. The probabilities p corresponding to the F statistic are all less than 0.0001. That is...
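
As a rough illustration of this kind of model, the sketch below fits a multiple linear regression of stage time on data volume and degree of parallelism, then reports R². It is not the patent's model: the choice of features, the helper names, and the measurements are all assumptions, and the data is synthetic and noise-free so that the fit is exact. The fit uses the normal equations, solved by Gaussian elimination, to stay within the standard library.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; features may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear(features, targets):
    """Least-squares fit of targets ~ b0 + b1*x1 + b2*x2 + ... via the normal equations."""
    rows = [[1.0] + list(f) for f in features]  # prepend intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict(coeffs, feature):
    """Predicted stage time for one (data volume, parallelism) pair."""
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], feature))


def r_squared(coeffs, features, targets):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(targets) / len(targets)
    ss_res = sum((t - predict(coeffs, f)) ** 2 for f, t in zip(features, targets))
    ss_tot = sum((t - mean) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot


# Synthetic measurements (not from the source): (data volume, degree of parallelism).
features = [(1, 1), (2, 1), (4, 2), (8, 2), (8, 4), (16, 4)]
# Stage times generated from the made-up rule time = 2 + 3*volume - 0.5*parallelism.
times = [2 + 3 * d - 0.5 * p for d, p in features]

coeffs = fit_linear(features, times)
r2 = r_squared(coeffs, features, times)
print("coefficients:", [round(c, 4) for c in coeffs], "R^2:", round(r2, 4))
```

With real measurements the fit would not be exact, and the F statistic and p-value mentioned in the text would normally come from a statistics package rather than from hand-written code.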

Abstract

The invention provides a method for accelerating distributed data processing across multiple data centers. With this method, each site can execute its computing task as soon as it obtains the required input data. The input data loading, map computation, shuffle transmission and reduce computation at each site do not need to wait for other sites to complete the preceding step. The method also provides accurate computation time estimation and adapts to dynamic wide area network bandwidth, which improves the practicability of SDTP and can greatly shorten job response time. The invention further provides a corresponding system for accelerating distributed data processing across multiple data centers. It can make full use of the network and computing resources of geographically distributed sites, so that geographically distributed data can be analyzed effectively without waiting for the bottleneck site of the previous stage to complete its data transmission or computing task.
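
The benefit of removing cross-site waiting can be illustrated with a toy timing model. This sketch is not SDTP itself. It assumes fixed per-site stage durations (all values hypothetical) and ignores the fact that, in a real job, a site's reduce step depends on shuffle data sent by other sites. It only contrasts a barrier between stages with each site advancing on its own.

```python
STAGES = ("load", "map", "shuffle", "reduce")


def barrier_response_time(site_times):
    """Each stage starts only after every site has finished the previous stage."""
    return sum(max(times[s] for times in site_times.values()) for s in STAGES)


def independent_response_time(site_times):
    """Each site moves to its next stage as soon as its own previous stage is done."""
    return max(sum(times[s] for s in STAGES) for times in site_times.values())


# Hypothetical per-site stage durations in seconds (not from the source).
site_times = {
    "site_a": {"load": 4, "map": 2, "shuffle": 6, "reduce": 1},
    "site_b": {"load": 1, "map": 5, "shuffle": 2, "reduce": 3},
    "site_c": {"load": 2, "map": 3, "shuffle": 3, "reduce": 4},
}

with_barrier = barrier_response_time(site_times)
without_barrier = independent_response_time(site_times)
print(with_barrier, without_barrier)
```

Under a barrier, the job pays for the slowest site at every stage, and a different site may be the bottleneck each time. Without it, the job pays only for the slowest site overall, so the second figure can never exceed the first in this model.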

Description

Technical field

[0001] The invention relates to the field of data analysis, and specifically discloses a method and system for accelerating distributed data processing across multiple data centers.

Background

[0002] Cloud providers such as Google, Amazon, and Alibaba have deployed data centers around the world to provide instant services. These services generate large amounts of data globally, including transaction data, user logs, and performance logs, among others. Mining this geographically distributed data (also known as wide-area analytics) is critical for business recommendations, anomaly detection, performance upgrades, and system maintenance, among others. A distributed computing framework such as Map-Reduce is usually used to mine such massive datasets. The main challenge of this computing method is the heterogeneity of hardware resources among geographically distributed sites, mainly including computing, uplink bandwidth and downlink b...

Application Information

IPC(8): H04L12/24, H04L29/08
CPC: H04L41/0823, H04L67/10, H04L67/60
Inventor: 郭得科, 陈亦婷, 袁昊, 郑龙, 罗来龙
Owner: NAT UNIV OF DEFENSE TECH