
Mass data sorting method and device based on Spark, equipment and storage medium

A Spark-based method for sorting massive data, applied in the field of big data, that addresses data skew and its effect on server performance, thereby avoiding data skew and improving performance.

Pending Publication Date: 2022-04-26
PINGAN INT SMART CITY TECH CO LTD

AI Technical Summary

Problems solved by technology

[0004] The main purpose of the present invention is to solve the problem that the existing Spark-based massive data sorting method causes data skew, which in turn degrades server performance.




Embodiment Construction

[0069] The embodiment of the present invention provides a Spark-based massive data sorting method, device, equipment, and storage medium. Samples of any group are distributed to each partition with equal probability, and a global grouped sort is performed with the assistance of an external storage medium, thereby avoiding data skew.

[0070] The terms "first", "second", "third", "fourth", etc. (if any) in the description and claims of the present invention and in the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It is to be understood that the terms so used are interchangeable under appropriate circumstances, such that the embodiments described herein can be practiced in sequences other than those illustrated or described herein. Furthermore, the terms "comprising" and "having", and any variations thereof, are intended to cover a non-exclusive inclusion, for example, a process, method, system, product or devic...



Abstract

The invention relates to the field of big data, and discloses a Spark-based massive data sorting method, device, equipment and storage medium. The method comprises the following steps: receiving a data sorting request, determining a target resilient distributed dataset (RDD) according to the request, and sampling each partition in the target RDD to obtain a plurality of sample data; calculating the partition boundary of each partition based on a preset number of partitions and the sample data; partitioning the target RDD according to the partition boundaries, and sorting the data within each partition to obtain an intermediate data sequence; and obtaining triple data from an external storage medium, and performing a global grouped sort of the intermediate data sequence according to the triple data to obtain a target data sequence. Because the partition boundaries are calculated first, the partitions are ordered relative to one another before the grouped sort is applied, and the data in any group is distributed to each partition with equal probability. This avoids data skew during the global grouped sort and improves server performance.
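
The sampling, boundary calculation, partitioning and per-partition sorting steps in the abstract can be sketched in plain Python, without Spark, to show why the concatenated partitions end up globally ordered. This is only an illustrative sketch under stated assumptions: the function names, the sample size per partition and the quantile-based boundary rule are hypothetical and are not taken from the patent text. The final grouped sort that uses triple data from an external storage medium is omitted, because this excerpt does not specify what the triples contain.

```python
import bisect
import random


def sample_partitions(partitions, per_partition, rng):
    """Draw up to `per_partition` sample records from every input partition."""
    samples = []
    for part in partitions:
        k = min(per_partition, len(part))
        samples.extend(rng.sample(part, k))
    return samples


def compute_boundaries(samples, num_partitions):
    """Pick num_partitions - 1 split points at evenly spaced sample quantiles.

    Assumption: quantiles of the sorted samples are used as boundaries.
    """
    ordered = sorted(samples)
    if not ordered:
        return []
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]


def range_partition(records, boundaries):
    """Route each record to the output partition whose key range contains it."""
    out = [[] for _ in range(len(boundaries) + 1)]
    for record in records:
        out[bisect.bisect_right(boundaries, record)].append(record)
    return out


def sort_by_range(partitions, num_partitions, per_partition=20, seed=0):
    """Sample, compute boundaries, repartition by range, sort each partition."""
    rng = random.Random(seed)
    samples = sample_partitions(partitions, per_partition, rng)
    boundaries = compute_boundaries(samples, num_partitions)
    records = [r for part in partitions for r in part]
    return [sorted(p) for p in range_partition(records, boundaries)]


# Usage: four unsorted input partitions of 250 integers each.
data_rng = random.Random(42)
input_partitions = [[data_rng.randint(0, 9999) for _ in range(250)]
                    for _ in range(4)]
result = sort_by_range(input_partitions, num_partitions=4)
flattened = [x for part in result for x in part]
```

Because every record in output partition i is smaller than every record in partition i + 1, reading the partitions in order yields a fully sorted sequence without a final merge step. The partition sizes depend on how well the samples reflect the data distribution.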

Description

Technical field

[0001] The present invention relates to the field of big data, in particular to a Spark-based massive data sorting method, device, equipment and storage medium.

Background art

[0002] With the exponential growth of data volume, people have higher requirements for the performance of big data processing. Sorting data is a common requirement in big data processing scenarios. When the amount of data reaches a certain scale, memory-based sorting algorithms can no longer meet production needs. The solutions provided by conventional big data frameworks, such as the Spark distributed computing framework, use a merge-sort-like algorithm for optimization at the bottom layer.

[0003] The existing Spark-based massive data sorting method distributes the data of the same group to the same partition for sorting (multiple groups are allowed in the same partition, but the data of one group must be in the same partition). If the amount of data in a g...
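
The skew described in paragraph [0003] can be illustrated with a small, Spark-free sketch. The record layout, the group names and the use of `zlib.crc32` as a stand-in for a hash partitioner are all assumptions made for illustration; they are not part of the patent text. The point is only that when a partition is chosen by group key alone, one oversized group forces a single partition to hold most of the data.

```python
import zlib
from collections import Counter


def partition_sizes_by_group(records, num_partitions):
    """Count records per partition when the group key alone picks the partition."""
    sizes = Counter()
    for group, _value in records:
        # crc32 gives a stable hash; every record of a group lands together.
        sizes[zlib.crc32(group.encode("utf-8")) % num_partitions] += 1
    return [sizes[i] for i in range(num_partitions)]


# Hypothetical data: one large group of 9000 records and ten groups of 100.
records = [("big", i) for i in range(9000)]
records += [(f"small-{g}", i) for g in range(10) for i in range(100)]

sizes = partition_sizes_by_group(records, num_partitions=4)
largest_share = max(sizes) / sum(sizes)
```

Whichever partition receives the large group holds at least 90% of the records here, so the task that sorts it dominates the run time while the other partitions sit nearly idle.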


Application Information

IPC(8): G06K9/62
CPC: G06F18/2113, G06F18/214
Inventor 赵英龙
Owner PINGAN INT SMART CITY TECH CO LTD