Spark-Streaming intermediate data partition method, device, computer equipment and storage medium

An intermediate data partitioning technology, applied in the field of data processing, that addresses problems such as prolonged job execution time, unbalanced reduce task load, and low job execution efficiency. It achieves uniform partitioning, improves job execution efficiency, and reduces time and space overhead.

Active Publication Date: 2021-05-11
HUNAN UNIV

AI Technical Summary

Problems solved by technology

[0004] When the number of tuples assigned to each partition differs, the amount of data in each partition also differs, which unbalances the load of the reduce tasks that process these partitions.
The completion time of the reduce phase is determined by the slowest of the parallel reduce tasks. An overloaded reduce task therefore takes a long time to execute, which prolongs the job execution time and lowers job execution efficiency.
In other words, the traditional intermediate data partition method suffers from low job execution efficiency.
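The effect of skew on the reduce phase can be illustrated with a small Python sketch. It is only an illustration: the helper names `partition_loads` and `imbalance`, the integer keys, and the modulo assignment rule are assumptions made for the example and do not come from the application.

```python
def partition_loads(keys, num_partitions, assign):
    """Count how many tuples each partition receives under `assign`."""
    loads = [0] * num_partitions
    for key in keys:
        loads[assign(key, num_partitions)] += 1
    return loads


def imbalance(loads):
    """Heaviest partition divided by the mean load; 1.0 means perfectly balanced."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 1.0


# A skewed key distribution: key 0 accounts for 70% of the tuples.
keys = [0] * 70 + [1] * 10 + [2] * 10 + [3] * 10
loads = partition_loads(keys, 4, lambda key, n: key % n)

# The reduce phase finishes only when the heaviest partition is done,
# so max(loads), not the mean, bounds the phase's completion time.
print(loads, imbalance(loads))
```

Here one reduce task receives 70 tuples while the others receive 10 each, so the phase takes roughly as long as processing 70 tuples even though the mean load is 25.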



Embodiment Construction

[0051] In order to make the purpose, technical solution and advantages of the present application clearer, the present application will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application, and are not intended to limit the present application.

[0052] The method provided by this application can be applied in the application environment shown in Figure 1. For a batch job, the map tasks read the data and process it in parallel on the nodes, then output intermediate data in the form of key/value pairs, which are partitioned by the Range partitioner; in Figure 1, for example, the output of each map task is divided into 3 parts. Each reduce task then fetches its share of the intermediate data from every map task, processes it, and finally outputs the result. The processing flow of the Range partitioner includes sampling, Key cluster up...
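The range-partitioning step in this flow can be sketched as follows. This is a simplified illustration, not Spark's implementation: the function names `range_partition` and `collect_partitions` are hypothetical, and the sketch assumes the boundary keys are already known, sorted, and act as inclusive upper bounds.

```python
from bisect import bisect_left


def range_partition(key, boundaries):
    """Return the partition index for `key`.

    `boundaries` is a sorted list of inclusive upper-bound keys, so
    len(boundaries) + 1 partitions exist in total.
    """
    return bisect_left(boundaries, key)


def collect_partitions(map_outputs, boundaries):
    """Route every (key, value) pair from every map task to its partition."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for output in map_outputs:
        for key, value in output:
            buckets[range_partition(key, boundaries)].append((key, value))
    return buckets


# Two map tasks, three partitions (two boundary keys), as in the 3-way split above.
map_outputs = [[("a", 1), ("d", 2)], [("b", 3)]]
buckets = collect_partitions(map_outputs, ["a", "c"])
print(buckets)
```

Each reduce task would then process one entry of `buckets`, which holds the pairs for its key range gathered from all map tasks.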


Abstract

The present application relates to a Spark-Streaming intermediate data partition method, device, computer equipment and storage medium. In one embodiment, the method includes: obtaining a plurality of elements from the intermediate data output by a Spark-Streaming map task, and sampling them with the reservoir sampling algorithm to obtain a sampled element cluster; updating the frequency weights of the elements in the element cluster with a time series prediction method, and sorting the elements of the updated element cluster according to a preset element order; solving for the boundary elements of the data partitions by dynamic programming over the sorted element cluster; and partitioning the elements of the updated element cluster according to the boundary elements, so that the sum of the frequency weights of the elements in the largest partition is minimized.
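The steps named in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, exponential smoothing stands in for the time series prediction method (which the abstract does not specify), and the dynamic program is the textbook recurrence for splitting a sorted sequence into contiguous ranges so that the heaviest range is as light as possible.

```python
import random
from itertools import accumulate


def reservoir_sample(stream, k, rng=None):
    """Algorithm R: keep a uniform random sample of up to k items from a stream."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def smooth_weights(previous, observed, alpha=0.5):
    """Blend earlier weights with current counts (exponential smoothing, an assumption)."""
    keys = set(previous) | set(observed)
    return {k: alpha * observed.get(k, 0) + (1 - alpha) * previous.get(k, 0)
            for k in keys}


def partition_boundaries(weights, num_partitions):
    """Split the sorted keys into contiguous ranges minimising the heaviest range.

    Returns (max_load, boundaries), where each boundary is the last key
    of a range (the final range needs no boundary).
    """
    keys = sorted(weights)
    n = len(keys)
    p = min(num_partitions, n)
    prefix = [0] + list(accumulate(weights[k] for k in keys))
    inf = float("inf")
    # best[j][i]: smallest achievable max load when the first i keys form j ranges
    best = [[inf] * (n + 1) for _ in range(p + 1)]
    cut = [[0] * (n + 1) for _ in range(p + 1)]
    best[0][0] = 0
    for j in range(1, p + 1):
        for i in range(j, n + 1):
            for s in range(j - 1, i):  # the last range covers keys s..i-1
                load = max(best[j - 1][s], prefix[i] - prefix[s])
                if load < best[j][i]:
                    best[j][i], cut[j][i] = load, s
    # Walk the cut table backwards to recover the boundary keys.
    boundaries = []
    i = n
    for j in range(p, 1, -1):
        i = cut[j][i]
        boundaries.append(keys[i - 1])
    boundaries.reverse()
    return best[p][n], boundaries


# Frequency weights for five keys, split into three partitions.
print(partition_boundaries({"a": 4, "b": 3, "c": 2, "d": 5, "e": 1}, 3))
# → (6, ['a', 'c'])
```

In this example the ranges are {a}, {b, c} and {d, e} with loads 4, 5 and 6, and no other contiguous 3-way split has a heaviest range below 6. The triple loop costs O(p·n²) time, which is acceptable in this sketch because it runs over the sampled keys rather than the full data stream.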

Description

Technical field

[0001] The present invention relates to the field of data processing, in particular to a Spark-Streaming intermediate data partition method, device, computer equipment and storage medium.

Background

[0002] With the development of information technology and the rapid growth of network information resources, processing data streams in real time is of great significance. MapReduce is a standard programming model for processing large-scale data. Apache Spark is an open source implementation of the MapReduce framework. Spark-Streaming is a real-time computing framework built on Spark, which extends Spark's ability to process large-scale streaming data. Spark-Streaming divides the data stream into continuous micro-batches, and then processes these micro-batches as a series of batch jobs.

[0003] Taking a typical Spark batch job processing as an example, the map task reads data, processes the read data according to the user-defined map f...


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F16/2455
CPC: G06F16/24554
Inventor: 唐卓, 付仲明, 陈岑, 陈建国, 李肯立, 李克勤, 廖湘科
Owner: HUNAN UNIV