A re-partitioning method and system based on cross-domain data skew, a terminal and a storage medium

By using a cross-domain data skew repartitioning method, which employs hash partitioning and a greedy strategy to identify and correct data skew, the problem of uneven data distribution in cross-geographic big data computing is solved, and more efficient task execution is achieved.

CN122064913B · Active · Publication Date: 2026-06-23 · GUANGDONG LAB OF ARTIFICIAL INTELLIGENCE & DIGITAL ECONOMY (SZ)

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
GUANGDONG LAB OF ARTIFICIAL INTELLIGENCE & DIGITAL ECONOMY (SZ)
Filing Date
2026-04-21
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

In cross-geographic big data computing scenarios, existing technologies cannot achieve strict global computing load balancing and efficient job execution optimization, resulting in uneven data distribution, forming a 'long tail effect', and affecting task completion time.

Method used

By acquiring key-value pair statistics, identifying skewed partitions using preset hash partitioning rules and skew detection threshold coefficients, and employing a greedy bucketing strategy for repartitioning, efficient data allocation is achieved.

Benefits of technology

It reduces the data scale and computational complexity of cross-domain transmission, improves the balance of data distribution, reduces task execution time, and improves job efficiency.



Abstract

The application relates to the technical field of data processing, and discloses a re-partitioning method and system based on cross-domain data skew, a terminal and a storage medium. The method comprises the following steps: acquiring a plurality of key-value pair statistical information, and performing partitioning statistics on the plurality of key-value pair statistical information according to a preset hash partitioning rule to obtain a plurality of initial data volumes; acquiring a preset skew detection threshold coefficient, and performing skew determination on all the initial data volumes according to the skew detection threshold coefficient to obtain a skew partitioning set; acquiring a stage identifier and a target capacity of a current job, performing re-partitioning on keys in the skew partitioning set according to a greedy bucketing strategy, the stage identifier and the target capacity to obtain a mapping relationship, and performing reallocation on the skew partitioning set according to the mapping relationship to obtain an allocation result. The application identifies skew based on key-value statistics and hash partitioning, performs re-partitioning according to a greedy strategy, and realizes efficient allocation of data.

Description

Technical Field

[0001] This invention relates to the field of data processing technology, and in particular to a repartitioning method, system, terminal, and computer-readable storage medium based on cross-domain data skew.

Background Technology

[0002] In cross-domain scenarios, task scheduling requires prior decision-making regarding task partitioning and data transmission across data centers, based on the input data size and data center resource information, before task execution. Because data skew detection and correction by methods such as LAHP (Learning Automaton-based Hybrid Partitioning) and SCID (Skew-Centric Intermediate Data) occur after data has reached the target data center, their optimization granularity is limited to the partitioning and execution node level within the data center, and cannot affect the upper-level cross-domain scheduling strategy.

[0003] Therefore, the current coarse-grained partitioning method based on the average data size or a fixed proportion during the task allocation phase makes it difficult to perceive the potential skew risks introduced by different key distributions and hotspot clusters. Under this constraint, if the input data exhibits significant uneven key distribution or concentrated hotspot clusters globally, some data centers may be allocated a much higher effective computing load than other data centers during the initial task partitioning phase. Even if node-level load balancing is subsequently achieved within the data center using methods such as LAHP and SCID, the load imbalance problem between data centers cannot be eliminated. Ultimately, the completion time of the entire job is still constrained by the data center with the heaviest load, forming a significant "long tail effect," thereby reducing the overall execution efficiency of the job.

[0004] Traditional data skew handling methods (such as SCID and LAHP in single data center scenarios) are limited to the data center in terms of optimization granularity (only local repartitioning is performed after data cross-domain transmission is completed), cannot detect and correct uneven data distribution in advance during the global task partitioning stage, and may introduce additional cross-domain re-aggregation overhead due to the splitting of hot keys. As a result, it is difficult to achieve efficient data allocation in big data computing scenarios across geographical domains, which has become an urgent problem to be solved.

[0005] Therefore, existing technologies still need to be improved and developed.

Summary of the Invention

[0006] The main objective of this invention is to provide a repartitioning method, system, terminal, and computer-readable storage medium based on cross-domain data skew, aiming to solve the problem in the prior art that it is impossible to achieve strict balancing of global computing load and efficient optimization of job execution efficiency in big data computing scenarios across geographical domains, and it is difficult to improve the efficient allocation of data.

[0007] To achieve the above objectives, the present invention provides a repartitioning method based on cross-domain data skew, the repartitioning method based on cross-domain data skew comprising the following steps:

[0008] Obtain multiple key-value pair statistics, and perform partition statistics on the multiple key-value pair statistics according to a preset hash partitioning rule to obtain multiple initial data volumes;

[0009] Obtain a preset skew detection threshold coefficient, and perform skew determination on all the initial data volumes based on the skew detection threshold coefficient to obtain a skewed partition set;

[0010] Obtain the stage identifier and target capacity of the current job. Repartition the keys in the skewed partition set according to the greedy bucketing strategy, the stage identifier, and the target capacity to obtain a mapping relationship. Reallocate the skewed partition set according to the mapping relationship to obtain the allocation result.

[0011] Optionally, in the repartitioning method based on cross-domain data skew, the key-value pair statistics include the key and the amount of data.

[0012] The process of obtaining multiple key-value pair statistics and partitioning these statistics according to a preset hash partitioning rule to obtain multiple initial data volumes specifically includes:

[0013] Obtain multiple keys and multiple data volumes, perform a pre-aggregation operation on all the keys and all the data volumes, and obtain the target statistical information;

[0014] Obtain the hash value, and perform partition statistics on the target statistical information according to the preset hash partitioning rules and the hash value to obtain multiple initial data volumes.

[0015] Optionally, in the cross-domain data skew-based repartitioning method, the skew detection threshold coefficient includes a first skew detection threshold coefficient and a second skew detection threshold coefficient.

[0016] The step of obtaining a preset skew detection threshold coefficient and performing skew determination on all the initial data volumes based on the skew detection threshold coefficient to obtain a skewed partition set specifically includes:

[0017] Obtain the number of partitions, and determine the first skew detection threshold coefficient and the second skew detection threshold coefficient based on the number of partitions;

[0018] The median is obtained by sorting all the initial data volumes using a cross-domain data skew detection algorithm.

[0019] Based on the median, the first skew detection threshold coefficient, and the second skew detection threshold coefficient, skew determination is performed on all the initial data volumes to obtain a skewed partition set.

[0020] Optionally, the repartitioning method based on cross-domain data skew, wherein the step of determining the skewness of all the initial data volumes based on the median, the first skew detection threshold coefficient, and the second skew detection threshold coefficient to obtain a set of skewed partitions specifically includes:

[0021] The median is multiplied by the first skew detection threshold coefficient to obtain the first threshold, and by the second skew detection threshold coefficient to obtain the second threshold.

[0022] Determine whether the initial data volume is greater than the first threshold or whether the initial data volume is less than the second threshold;

[0023] If the initial data volume is greater than the first threshold or less than the second threshold, then the skewed partition set is determined according to the partition identifier corresponding to the initial data volume.

[0024] Optionally, the repartitioning method based on cross-domain data skew, wherein obtaining the stage identifier and target capacity of the current job, repartitioning the keys in the skewed partition set according to the greedy bucketing strategy, the stage identifier, and the target capacity to obtain a mapping relationship, and reallocating the skewed partition set according to the mapping relationship to obtain an allocation result, specifically includes:

[0025] Obtain the stage identifier and target capacity of the current job, and aggregate the keys in the skewed partition set based on the cross-domain balanced repartitioning algorithm to obtain the total data volume;

[0026] Based on the total amount of data, all keys are sorted in descending order to obtain multiple target key values;

[0027] Based on the greedy bucketing strategy, the stage identifier, and the target capacity, all the target key values are repartitioned to obtain a mapping relationship. Based on the mapping relationship, the skew partition data of the skew partition set is redistributed to obtain the allocation result.

[0028] Optionally, the repartitioning method based on cross-domain data skew, wherein the repartitioning of all target key values according to a greedy bucketing strategy, the stage identifier, and the target capacity to obtain a mapping relationship, and the redistribution of the skewed partition data in the skewed partition set according to the mapping relationship to obtain an allocation result, specifically includes:

[0029] Obtain the load capacity of the first bucket set, and determine whether the load capacity is less than the target capacity;

[0030] If the load capacity is less than the target capacity, then all the target key values are repartitioned according to the greedy bucketing strategy, the stage identifier and the target capacity to obtain a mapping table;

[0031] The mapping relationship is determined according to the mapping table, and the skew partition data of the skew partition set is redistributed according to the mapping relationship to obtain the allocation result.

[0032] Optionally, the repartitioning method based on cross-domain data skew, wherein after obtaining the load capacity of the first bucket set and determining whether the load capacity is less than the target capacity, further includes:

[0033] If the load capacity is not less than the target capacity, then a second bucket set is established, and multiple target key values are assigned to the second bucket set;

[0034] Obtain a preset coefficient and an optimal bucket set. Multiply the preset coefficient with the target capacity to obtain the small partition capacity. If the data volume of the second bucket set is less than the small partition capacity, then allocate the target key value of the second bucket set to the optimal bucket set.

[0035] Furthermore, to achieve the above objectives, the present invention also provides a repartitioning system based on cross-domain data skew, wherein the repartitioning system based on cross-domain data skew includes:

[0036] The data statistics module is used to obtain statistical information of multiple key-value pairs, and perform partition statistics on the multiple key-value pair statistical information according to a preset hash partitioning rule to obtain multiple initial data volumes;

[0037] The data skew determination module is used to obtain a preset skew detection threshold coefficient, and to determine the skewness of all the initial data based on the skew detection threshold coefficient to obtain a set of skewed partitions.

[0038] The data allocation module is used to obtain the stage identifier and target capacity of the current job, repartition the keys in the skewed partition set according to the greedy bucketing strategy, the stage identifier and the target capacity to obtain the mapping relationship, and reallocate the skewed partition set according to the mapping relationship to obtain the allocation result.

[0039] Furthermore, to achieve the above objectives, the present invention also provides a computer-readable storage medium, wherein the computer-readable storage medium stores a repartitioning program based on cross-domain data skew, and the repartitioning program based on cross-domain data skew, when executed by a processor, implements the steps of the repartitioning method based on cross-domain data skew as described above.

[0040] This invention acquires multiple key-value pair statistics, partitions these statistics according to a preset hash partitioning rule to obtain multiple initial data volumes, acquires a preset skew detection threshold coefficient, and determines the skewness of all initial data volumes based on this threshold coefficient to obtain a skewed partition set, acquires the stage identifier and target capacity of the current job, repartitions the keys in the skewed partition set according to a greedy bucketing strategy, the stage identifier, and the target capacity to obtain a mapping relationship, and then reallocates the skewed partition set according to the mapping relationship to obtain an allocation result. This invention identifies skew based on key-value statistics and hash partitioning, and repartitions according to a greedy strategy to achieve efficient data allocation.

Attached Figure Description

[0041] Figure 1 is a flowchart of a preferred embodiment of the repartitioning method based on cross-domain data skew of the present invention;

[0042] Figure 2 is a schematic diagram of cross-domain task allocation in a preferred embodiment of the repartitioning method based on cross-domain data skew of the present invention;

[0043] Figure 3 is a schematic diagram of data detection and repartitioning in a preferred embodiment of the repartitioning method based on cross-domain data skew of the present invention;

[0044] Figure 4 is a structural diagram of a preferred embodiment of the repartitioning system based on cross-domain data skew of the present invention;

[0045] Figure 5 is a structural diagram of a preferred embodiment of the terminal of the present invention.

Detailed Implementation

[0046] To make the objectives, technical solutions, and advantages of this invention clearer and more explicit, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

[0047] Traditional data skew handling methods (such as SCID and LAHP in single-datacenter scenarios) are limited in their optimization granularity to the data center (only performing local repartitioning after cross-domain data transfer), cannot detect and correct uneven data distribution in advance during the global task partitioning stage, and may introduce additional cross-domain re-aggregation overhead due to splitting hot keys. As a result, it is difficult to achieve efficient data allocation in cross-geographic big data computing scenarios. Therefore, a repartitioning method based on cross-domain data skew is needed, which identifies skew based on key-value statistics and hash partitioning, and repartitions according to a greedy strategy to achieve efficient data allocation.

[0048] The preferred embodiment of the present invention describes a repartitioning method based on cross-domain data skew. As shown in Figure 1 and Figure 2, the repartitioning method based on cross-domain data skew includes the following steps:

[0049] Step S10: Obtain multiple key-value pair statistics, and perform partition statistics on the multiple key-value pair statistics according to the preset hash partitioning rules to obtain multiple initial data volumes.

[0050] Step S10 includes:

[0051] Step S11: Obtain multiple keys and multiple data volumes, and perform a pre-aggregation operation on all the keys and all the data volumes to obtain target statistical information;

[0052] Step S12: Obtain hash values, and perform partition statistics on the target statistical information according to the preset hash partitioning rules and the hash values to obtain multiple initial data volumes.

[0053] Specifically, the key-value pair statistics include keys and data volumes. Multiple keys and multiple data volumes are obtained, and all keys and all data volumes are pre-aggregated to obtain the target statistical information. This key pre-aggregation is performed locally in each data center to reduce the communication overhead of transmitting key-value data across domains and to improve the efficiency of subsequent skew detection. Hash values are then obtained, and the target statistical information is partitioned according to the preset hash partitioning rules and the hash values to obtain multiple initial data volumes, expressed in the form (key, data volume).

[0054] In this embodiment, data center 1 contains data (apple, apple, have, have, eat, have, have), data center 2 contains data (we, we, eat, eat, have, have, have), and data center 3 contains data (out, out, give, give, eat). First, each data center performs a pre-aggregation operation, which merges records with the same key into one record. Here, the key is the word itself. After pre-aggregation, the statistics for data center 1 are ((apple, 2), (have, 4), (eat, 1)), the statistics for data center 2 are ((we, 2), (eat, 2), (have, 3)), and the statistics for data center 3 are ((out, 2), (give, 2), (eat, 1)).
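The pre-aggregation in this example can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the names `raw_records`, `pre_aggregate`, `local_stats` and `global_stats` are hypothetical, and the final per-key summation across data centers is an assumed way of combining the local statistics.

```python
from collections import Counter

# Raw records per data center, copied from the example above.
raw_records = {
    "dc1": ["apple", "apple", "have", "have", "eat", "have", "have"],
    "dc2": ["we", "we", "eat", "eat", "have", "have", "have"],
    "dc3": ["out", "out", "give", "give", "eat"],
}


def pre_aggregate(records):
    """Merge records with the same key into one (key, data volume) entry."""
    return Counter(records)


# Local step: each data center compresses its own records before any transfer.
local_stats = {dc: pre_aggregate(recs) for dc, recs in raw_records.items()}

# Assumed global step: only the compact (key, data volume) pairs cross domains
# and are summed per key.
global_stats = Counter()
for stats in local_stats.values():
    global_stats.update(stats)
```

Only three (key, data volume) pairs leave each data center instead of five to seven raw records, which is the reduction in cross-domain transfer that the pre-aggregation step aims for.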

[0055] For example, as shown in Figure 2, a company has multiple data centers distributed across regions A, B, and C. A daily log statistics task is performed for user behavior analysis. The data key is product ID + user ID; popular products have larger data volumes, while less popular products have smaller data volumes. This leads to data skew, where some partitions contain large amounts of data while others contain very small amounts. Consequently, some data centers may receive excessively large task loads, creating a 'long tail effect'. This invention detects data skew and then repartitions the data to achieve a more balanced partitioning, reducing the impact of data skew on the overall cross-domain task.

[0056] Furthermore, this preprocessing step can compress the scale of data requiring cross-domain transmission, reducing both the time and bandwidth consumption required for public network transmission and the computational complexity of the subsequent global data skew detection stage. After collecting key statistics, the system can identify potential data skewed partitions based on the key distribution of each partition and further perform balanced repartitioning operations, providing a more accurate and reasonable partition input basis for subsequent cross-domain task scheduling. After collecting (key, size) data from each data center, data skew detection is required.

[0057] Step S20: Obtain a preset skew detection threshold coefficient, and perform skew determination on all the initial data volumes based on the skew detection threshold coefficient to obtain a skewed partition set.

[0058] Step S20 includes:

[0059] Step S21: Obtain the number of partitions, and determine the first skew detection threshold coefficient and the second skew detection threshold coefficient based on the number of partitions;

[0060] Step S22: Sort all the initial data volumes using a cross-domain data skew detection algorithm to obtain the median;

[0061] Step S23: Based on the median, the first skew detection threshold coefficient, and the second skew detection threshold coefficient, perform skew determination on all the initial data volumes to obtain a skewed partition set.

[0062] Specifically, the number of partitions is obtained, and a first skew detection threshold coefficient and a second skew detection threshold coefficient are determined based on the number of partitions. In the algorithm, β (the first skew detection threshold coefficient) is set to 2 and γ (the second skew detection threshold coefficient) is set to 0.2, meaning that a partition is judged as skewed when its data volume is greater than 2 times, or less than 0.2 times, the median partition data volume. β defaults to 2 and γ defaults to 0.2 because the number of partitions is itself determined by the size of the data. All the initial data volumes are then sorted using the cross-domain data skew detection algorithm to obtain the median, that is, the median of the data volumes of all partitions. Finally, skew determination is performed on all the initial data volumes based on the median, the first skew detection threshold coefficient, and the second skew detection threshold coefficient to obtain a skewed partition set; if no skewed partition is detected, the resulting set is empty.

[0063] In this embodiment, based on the (key, size) data of each data center, the data is hashed and partitioned according to the key, and the data volume of each partition is counted. Here, the key is the word itself, and partitioning is based on the hash value of the key. Assuming it is divided into three partitions, partition 1 contains the words (apple, give), partition 2 contains the words (have, we), and partition 3 contains the word (out). Therefore, the data in partition 1 is ((apple, 2), (give, 2)), the data in partition 2 is ((have, 7), (eat, 2)), and the data in partition 3 is (out, 2). Then, the data volume of partition 1 can be calculated as 4, the data volume of partition 2 as 9, and the data volume of partition 3 as 2.
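The per-partition counting described above can be sketched as follows. The sketch makes several assumptions that are not in the source: CRC32 stands in for the unspecified preset hash partitioning rule, partitions are numbered from 1, and the per-key volumes are the per-key sums of the pre-aggregation example. Because the hash function is an assumption, the resulting key-to-partition assignment will generally differ from the illustrative assignment in the paragraph above.

```python
import zlib


def partition_of(key: str, num_partitions: int) -> int:
    """Stable hash partitioning: a key always maps to the same partition (1-based)."""
    # CRC32 is a stand-in; the source does not name the hash function.
    return zlib.crc32(key.encode("utf-8")) % num_partitions + 1


def partition_volumes(key_sizes: dict, num_partitions: int) -> dict:
    """Sum the data volume of every key that hashes into each partition."""
    volumes = {pid: 0 for pid in range(1, num_partitions + 1)}
    for key, size in key_sizes.items():
        volumes[partition_of(key, num_partitions)] += size
    return volumes


# Per-key data volumes, assumed to be the sums over the three data centers.
key_sizes = {"apple": 2, "have": 7, "eat": 4, "we": 2, "out": 2, "give": 2}
volumes = partition_volumes(key_sizes, num_partitions=3)
```

The output `volumes` plays the role of the initial data volumes: one number per partition, which is all the skew detection step needs.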

[0064] Furthermore, since the number of partitions is determined by the size of the data:

[0065] $N = \lceil D / S \rceil$;

[0066] where $D$ is the total amount of data, $S$ is the default partition size (128 MB), and $N$ is the number of partitions.
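As a small sketch of how the number of partitions follows from the total data volume and the 128 MB default partition size: the function name `num_partitions` is hypothetical, and rounding up to a whole partition (with a minimum of one) is an assumption, since the source only states that the count is determined by the data size.

```python
DEFAULT_PARTITION_SIZE = 128 * 1024 * 1024  # 128 MB, the default stated above


def num_partitions(total_bytes: int, partition_size: int = DEFAULT_PARTITION_SIZE) -> int:
    """Number of partitions for a given total data volume.

    Rounding up and the minimum of one partition are assumptions.
    """
    return max(1, (total_bytes + partition_size - 1) // partition_size)


one_gib = 1024 * 1024 * 1024
partitions_for_one_gib = num_partitions(one_gib)
```

Because the partition count scales with the data volume, a typical partition stays near the default size regardless of job size, which is why fixed coefficients relative to the median can serve as skew thresholds.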

[0067] As an example, in an ideal scenario without data skew, the data volume of each partition should be close to the default partition size, so both the mean and the median stay at about 128 MB. When data skew occurs, only a small number of partitions have a significantly larger or significantly smaller data volume, while the data volume of most partitions remains around the default partition size. This is because the number of skewed partitions is usually much smaller than the total number of partitions. Therefore, the median of the partition data size distribution is still determined by the non-skewed partitions and fluctuates around 128 MB. During task execution, each partition in Spark (the computation engine) corresponds to a parallel task, and its execution time is approximately proportional to the partition data size. For skewed partitions, repartitioning makes the partition data more balanced and reduces the overall computation time, but it also introduces additional repartitioning overhead. Therefore, the benefit of repartitioning offsets this overhead only when a skewed partition's data size is significantly larger or significantly smaller than that of other partitions, which is when repartitioning is practically meaningful.

[0068] Step S23 includes:

[0069] Step S231: Multiply the median by the first skew detection threshold coefficient to obtain the first threshold, and by the second skew detection threshold coefficient to obtain the second threshold;

[0070] Step S232: Determine whether the initial data volume is greater than the first threshold or whether the initial data volume is less than the second threshold;

[0071] Step S233: If the initial data volume is greater than the first threshold or the initial data volume is less than the second threshold, then determine the skewed partition set according to the partition identifier corresponding to the initial data volume.

[0072] Specifically, the median, the first skew detection threshold coefficient, and the second skew detection threshold coefficient are multiplied to obtain a first threshold and a second threshold. It is then determined whether the initial data volume is greater than the first threshold (4*2) or less than the second threshold (4*0.2). If the initial data volume is greater than the first threshold or less than the second threshold, the skewed partition set is determined according to the partition identifier corresponding to the initial data volume.

[0073] In this embodiment, the data volumes of all partitions are sorted and the median is calculated. The algorithm then sets the skew detection thresholds to fixed multiples of the median (β times and γ times the median) and iterates through all partitions one by one, marking any partition whose data volume is above the upper threshold or below the lower threshold as a skewed partition. Sorting the partition data volumes gives: partition 3 (data volume 2), partition 1 (data volume 4), partition 2 (data volume 9), so the median is 4. Here, β and γ are set to 2 and 0.2, respectively, meaning that partitions with data volumes greater than 4*2 = 8 or less than 4*0.2 = 0.8 are determined to be skewed partitions. Partition 2 (data volume 9) exceeds 8, so it is identified as a skewed partition.
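The median-based detection in this example can be sketched as follows, using the partition data volumes 4, 9 and 2 and the default coefficients β = 2 and γ = 0.2 from the text. The function name `detect_skewed_partitions` and the dictionary layout are hypothetical, and returning an empty set for empty input is an assumption.

```python
from statistics import median


def detect_skewed_partitions(volumes: dict, beta: float = 2.0, gamma: float = 0.2) -> set:
    """Return the ids of partitions whose volume is > beta*median or < gamma*median."""
    if not volumes:
        return set()  # assumption: no partitions means no skew
    m = median(volumes.values())
    first_threshold = beta * m    # upper bound, 4 * 2 = 8 in the example
    second_threshold = gamma * m  # lower bound, 4 * 0.2 = 0.8 in the example
    return {
        pid
        for pid, volume in volumes.items()
        if volume > first_threshold or volume < second_threshold
    }


# Partition id -> data volume, as in the example above.
volumes = {1: 4, 2: 9, 3: 2}
skewed = detect_skewed_partitions(volumes)  # {2}
```

Partition 3 (data volume 2) is above the lower threshold of 0.8, so only partition 2 is flagged.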

[0074] Step S30: Obtain the stage identifier and target capacity of the current job; repartition the keys in the skewed partition set according to the greedy bucketing strategy, the stage identifier and the target capacity to obtain the mapping relationship; reallocate the skewed partition set according to the mapping relationship to obtain the allocation result.

[0075] As shown in Figure 3, step S30 includes:

[0076] Step S31: Obtain the stage identifier and target capacity of the current job, and aggregate the keys in the skewed partition set based on the cross-domain balanced repartitioning algorithm to obtain the total data volume;

[0077] Step S32: Sort all keys in descending order according to the total amount of data to obtain multiple target key values;

[0078] Step S33: Repartition all target key values according to the greedy bucketing strategy, the stage identifier and the target capacity to obtain a mapping relationship, and redistribute the skew partition data of the skew partition set according to the mapping relationship to obtain the allocation result.

[0079] Specifically, the current job's stage identifier and target capacity are obtained. Based on the cross-domain balanced repartitioning algorithm, the keys in the skewed partition set are aggregated to obtain the total data volume (Size(key)). All keys are sorted in descending order according to the total data volume to obtain multiple target key values. All target key values are repartitioned according to the greedy bucketing strategy, the stage identifier, and the target capacity to obtain a mapping relationship. The skewed partition data in the skewed partition set is redistributed according to the mapping relationship to obtain the allocation result (after detecting a data skewed partition, a repartitioning operation needs to be performed on the skewed partition data in the skewed partition set to achieve load balancing).

[0080] In this embodiment, the skewed partition set Q is known from the detection algorithm, so the repartitioning algorithm only needs to operate on the skewed partition data, as shown in Algorithm 2. First, it is necessary to traverse all data (key, size) in the skewed partition, and accumulate the data volume of the same key to obtain the total data volume (Size(key)) corresponding to each key.

[0081] Step S33 includes:

[0082] Step S331: Obtain the load capacity of the first bucket set and determine whether the load capacity is less than the target capacity;

[0083] Step S332: If the load capacity is less than the target capacity, then repartition all the target key values according to the greedy bucketing strategy, the stage identifier and the target capacity to obtain a mapping table;

[0084] Step S333: Determine the mapping relationship according to the mapping table, and redistribute the skew partition data of the skew partition set according to the mapping relationship to obtain the allocation result.

[0085] Specifically, the load capacity of the first bucket set is obtained, and it is determined whether the load capacity is less than the target capacity. If the load capacity is less than the target capacity, all the target key values are repartitioned according to the greedy bucketing strategy, the stage identifier and the target capacity to obtain a mapping table (Map(key)=pid). The mapping relationship is determined according to the mapping table, and the skew partition data of the skew partition set is redistributed according to the mapping relationship to obtain the allocation result.

[0086] In this embodiment, after key aggregation, the algorithm sorts all keys in descending order by Size(key), so that keys with larger data volumes are bucketed first. A greedy strategy is then used for repartitioning: a bucket set B is initialized, where each bucket corresponds to a future new partition, and the sorted keys are processed one by one. For the current key, the algorithm first searches the existing bucket set for a bucket b whose load after adding the key does not exceed the target capacity. The target capacity is the ideal size of each partition after repartitioning; by default it is the median partition data volume obtained during skewed partition detection.

[0087] As an example, among the buckets that satisfy this constraint, the one with the largest Size(b) is chosen, so that bucket loads stay close to the target capacity. If such a bucket exists, the key is assigned to it and the bucket's load is updated. If no such bucket exists, a new bucket `b_new` is created and the key is assigned to it. After all keys have been bucketed, if a bucket's data volume is less than 0.2 times the target capacity (the small partition capacity), it would produce a small partition, so its keys are redistributed to the other buckets one by one, prioritizing the bucket with the smallest load. The algorithm then assigns a unique new partition number (pid) to each bucket and establishes a mapping between keys and new partitions: all keys in a bucket are mapped to that bucket's pid, forming the mapping table Map(key) = pid. For example, if bucket 1 contains the keys (have, eat), the final entries are (have, new partition 1) and (eat, new partition 1).

[0088] In this embodiment, after step S331, the method further includes: if the load capacity is not less than the target capacity, then a second bucket set is established, and multiple target key values ​​are allocated to the second bucket set; a preset coefficient and an optimal bucket set are obtained; the preset coefficient is multiplied by the target capacity to obtain the small partition capacity; if the data volume of the second bucket set is less than the small partition capacity, then the target key values ​​of the second bucket set are allocated to the optimal bucket set.
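
The small-partition handling can be illustrated with a short sketch. It assumes that the "optimal bucket set" is the currently lightest remaining bucket, as suggested by the "prioritizing the smallest bucket" rule in [0087]; the function name `merge_small_buckets`, the default coefficient of 0.2, and the bucket representation (one dict of key sizes per bucket) are hypothetical. The sketch does not re-check the target capacity after merging, since the text does not say whether it should.

```python
from typing import Dict, List


def merge_small_buckets(
    buckets: List[Dict[str, int]],
    target_capacity: float,
    coefficient: float = 0.2,
) -> List[Dict[str, int]]:
    """Fold buckets lighter than coefficient * target_capacity into the lightest kept bucket."""
    small_capacity = coefficient * target_capacity   # small partition capacity
    kept = [dict(b) for b in buckets if sum(b.values()) >= small_capacity]
    small = [b for b in buckets if sum(b.values()) < small_capacity]
    if not kept:
        # Every bucket is small, so there is nothing to merge into.
        return [dict(b) for b in buckets]
    for bucket in small:
        for key, size in bucket.items():
            # Re-evaluate the lightest bucket for each key that is moved.
            lightest = min(kept, key=lambda b: sum(b.values()))
            lightest[key] = size
    return kept
```

With a target capacity of 10, the small partition capacity is 2, so a bucket holding a single key of size 1 is dissolved and its key moves to the lightest of the remaining buckets.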

[0089] For example, consider skewed partition 2 with keys (have, eat), where the data volume of each key is ((have, 7), (eat, 2)). Since the median of the partition data volumes obtained during skew partition detection is 4, the target capacity is 4. Skewed partition 2 is then repartitioned. Bucketing starts with the key that has the largest data volume, "have". Since no bucket exists yet, a new bucket is created to hold "have". Because "have" has a data volume of 7 and the default bucket limit is 4, the new bucket is already full. When "eat" is processed, another new bucket must be created. Two new buckets are finally obtained: bucket 1 holds (have) and bucket 2 holds (eat). Skewed partition 2 is thus split into two new partitions: new partition 1 contains (have), and new partition 2 contains (eat).

[0090] Furthermore, in the repartitioning algorithm, the target capacity defaults to the median of the partition data volumes in the current job. The median is chosen as the target size for repartitioning because, compared with the mean, it is less sensitive to extremely large partitions and more accurately reflects the actual data size of most partitions. For example, suppose there are v skewed partitions whose data size is much larger than that of normal partitions:

[0091] $S_i \gg M, \quad i \in P_{skew}, \quad |P_{skew}| = v$;

[0092] where $S_i$ denotes the data volume of the $i$-th partition, $M$ denotes the median of the partition data volumes, $P_{skew}$ is the skewed partition set, and $n$ denotes the number of partitions.

[0093] The partition mean is:

[0094] $\bar{S} = \frac{1}{n}\sum_{i=1}^{n} S_i$;

[0095] We can obtain:

[0096] $\bar{S} = \frac{1}{n}\left(\sum_{i \in P_{skew}} S_i + \sum_{i \notin P_{skew}} S_i\right) \ge \frac{1}{n}\sum_{i \in P_{skew}} S_i$;

[0097] Here, $\bar{S}$ is the mean. Because $S_i \gg M$ for $i \in P_{skew}$, even though $v$ is small relative to $n$, $\bar{S}$ can still be much larger than $M$. It can be seen that the mean can lose its representativeness in some cases. Moreover, when the goal is to minimize the total absolute deviation of the partition loads, the optimal target value is the median of the partition data sizes. Therefore, this choice keeps the new partitions produced by repartitioning as consistent as possible in data size with the non-skewed partitions, avoiding new partitions that are still significantly too large or too small, and resulting in a more balanced load across the parallel tasks in the subsequent task execution phase.
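
The two points made about the median, robustness to a few very large partitions and optimality for absolute deviation, can be checked numerically. The partition sizes below are hypothetical values chosen for illustration (five normal partitions and one skewed partition), and `total_abs_deviation` is a helper introduced only for this sketch.

```python
from statistics import mean, median

# Hypothetical partition data volumes: five normal partitions and one skewed one.
partition_sizes = [4, 4, 5, 3, 4, 40]


def total_abs_deviation(sizes, target):
    """Sum of |size - target| over all partitions."""
    return sum(abs(s - target) for s in sizes)


m = median(partition_sizes)   # stays at the size of a typical partition
mu = mean(partition_sizes)    # pulled upward by the single skewed partition

dev_at_median = total_abs_deviation(partition_sizes, m)
dev_at_mean = total_abs_deviation(partition_sizes, mu)

# Search all integer targets to confirm that none does better than the median.
best_integer_target_dev = min(
    total_abs_deviation(partition_sizes, t) for t in range(0, 41)
)
```

Here the median is 4 while the mean is 10, a value that matches none of the normal partitions. The total absolute deviation is 38 at the median and 60 at the mean, and no integer target between 0 and 40 yields less than 38.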

[0098] Furthermore, as shown in Figure 4, based on the above repartitioning method based on cross-domain data skew, the present invention also provides a repartitioning system based on cross-domain data skew, which includes:

[0099] Data statistics module 51 is used to obtain statistical information of multiple key-value pairs, and perform partition statistics on the multiple key-value pair statistical information according to a preset hash partitioning rule to obtain multiple initial data volumes;

[0100] The data skew determination module 52 is used to obtain a preset skew detection threshold coefficient, and to perform skew determination on all the initial data volumes based on the skew detection threshold coefficient to obtain a skewed partition set;

[0101] The data allocation module 53 is used to obtain the stage identifier and target capacity of the current job, repartition the keys in the skewed partition set according to the greedy bucketing strategy, the stage identifier and the target capacity to obtain the mapping relationship, and reallocate the skewed partition set according to the mapping relationship to obtain the allocation result.

[0102] Furthermore, as shown in Figure 5, based on the above repartitioning method and system based on cross-domain data skew, the present invention also provides a terminal, which includes a processor 10, a memory 20 and a display 30. Figure 5 shows only some of the terminal's components; it should be understood that not all of the components shown are required, and more or fewer components may be implemented instead.

[0103] In some embodiments, the memory 20 may be an internal storage unit of the terminal, such as a hard disk or memory. In other embodiments, the memory 20 may be an external storage device of the terminal, such as a plug-in hard disk, smart media card (SMC), secure digital card (SD), flash card, etc. Further, the memory 20 may include both internal and external storage devices. The memory 20 is used to store application software and various types of data installed on the terminal, such as the program code installed on the terminal. The memory 20 can also be used to temporarily store data that has been output or will be output. In one embodiment, the memory 20 stores a cross-domain data skew-based repartitioning program 40, which can be executed by the processor 10 to implement the cross-domain data skew-based repartitioning method of this application.

[0104] In some embodiments, the processor 10 may be a central processing unit (CPU), a microprocessor, or other data processing chip, used to run program code stored in the memory 20 or process data, such as executing the repartitioning method based on cross-domain data skew.

[0105] In some embodiments, the display 30 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, or an OLED (Organic Light-Emitting Diode) touchscreen. The display 30 is used to display information on the terminal and to display a visual user interface. The components of the terminal communicate with each other via a system bus.

[0106] In one embodiment, when the processor 10 executes the cross-domain data skew-based repartitioning program 40 in the memory 20, the following steps are performed:

[0107] Obtain multiple key-value pair statistics, and perform partition statistics on the multiple key-value pair statistics according to a preset hash partitioning rule to obtain multiple initial data volumes;

[0108] Obtain a preset skew detection threshold coefficient, and perform skew determination on all the initial data volumes based on the skew detection threshold coefficient to obtain a skewed partition set;

[0109] Obtain the stage identifier and target capacity of the current job; repartition the keys in the skewed partition set according to the greedy bucketing strategy, the stage identifier and the target capacity to obtain the mapping relationship; and reallocate the skewed partition set according to the mapping relationship to obtain the allocation result.

[0110] The key-value pair statistics include the key and the amount of data.

[0111] The process of obtaining multiple key-value pair statistics and partitioning these statistics according to a preset hash partitioning rule to obtain multiple initial data volumes specifically includes:

[0112] Obtain multiple keys and multiple data volumes, perform a pre-aggregation operation on all the keys and all the data volumes, and obtain the target statistical information;

[0113] Obtain the hash value, and perform partition statistics on the target statistical information according to the preset hash partitioning rules and the hash value to obtain multiple initial data volumes.
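
The pre-aggregation and partition statistics steps can be sketched as follows. The concrete hash partitioning rule is not given in this passage, so the sketch assumes CRC32 of the key modulo the number of partitions; the names `partition_of` and `initial_data_volumes` and the `(key, size)` record layout are likewise illustrative.

```python
import zlib
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def partition_of(key: str, num_partitions: int) -> int:
    """Assumed hash partitioning rule: crc32(key) mod num_partitions."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def initial_data_volumes(
    records: Iterable[Tuple[str, int]], num_partitions: int
) -> Tuple[Dict[str, int], List[int]]:
    """Pre-aggregate (key, size) records, then total the sizes per hash partition."""
    key_sizes: Counter = Counter()
    for key, size in records:
        # Pre-aggregation: collapse repeated reports of a key into one total.
        key_sizes[key] += size

    volumes = [0] * num_partitions
    for key, size in key_sizes.items():
        volumes[partition_of(key, num_partitions)] += size
    return dict(key_sizes), volumes
```

CRC32 is used here only because it gives the same value on every run and every machine, which a partitioning rule shared across sites needs; any stable hash would serve the same purpose in this sketch.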

[0114] The skew detection threshold coefficient includes a first skew detection threshold coefficient and a second skew detection threshold coefficient.

[0115] The step of obtaining a preset skew detection threshold coefficient and performing skew determination on all the initial data volumes based on the skew detection threshold coefficient to obtain a skewed partition set specifically includes:

[0116] Obtain the number of partitions, and determine the first skew detection threshold coefficient and the second skew detection threshold coefficient based on the number of partitions;

[0117] The median is obtained by sorting all the initial data volumes using a cross-domain data skew detection algorithm.

[0118] Based on the median, the first skew detection threshold coefficient, and the second skew detection threshold coefficient, skew determination is performed on all the initial data volumes to obtain a skewed partition set.

[0119] Specifically, the step of performing skew determination on all the initial data volumes based on the median, the first skew detection threshold coefficient, and the second skew detection threshold coefficient to obtain a skewed partition set includes:

[0120] The median is multiplied by the first skew detection threshold coefficient and by the second skew detection threshold coefficient, respectively, to obtain the first threshold and the second threshold.

[0121] Determine whether the initial data volume is greater than the first threshold or whether the initial data volume is less than the second threshold;

[0122] If the initial data volume is greater than the first threshold or less than the second threshold, then the skewed partition set is determined according to the partition identifier corresponding to the initial data volume.
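
The threshold test in the steps above can be sketched as follows. How the two coefficients are derived from the number of partitions is not specified in this passage, so the sketch simply takes them as parameters; the names `detect_skewed_partitions`, `upper_coeff` and `lower_coeff` are illustrative, and partition identifiers are assumed to be list indices.

```python
from statistics import median
from typing import List, Set


def detect_skewed_partitions(
    volumes: List[float], upper_coeff: float, lower_coeff: float
) -> Set[int]:
    """Return ids of partitions above upper_coeff * median or below lower_coeff * median."""
    med = median(volumes)                   # statistics.median sorts the data internally
    first_threshold = upper_coeff * med     # partitions above this are too large
    second_threshold = lower_coeff * med    # partitions below this are too small
    return {
        pid
        for pid, volume in enumerate(volumes)
        if volume > first_threshold or volume < second_threshold
    }
```

For volumes [4, 4, 9, 4, 1] with hypothetical coefficients 2.0 and 0.5, the median is 4, the thresholds are 8 and 2, and partitions 2 and 4 are reported as skewed.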

[0123] The process of obtaining the stage identifier and target capacity of the current job, repartitioning the keys in the skewed partition set according to the greedy bucketing strategy, the stage identifier, and the target capacity to obtain a mapping relationship, and reallocating the skewed partition set according to the mapping relationship to obtain the allocation result specifically includes:

[0124] Obtain the stage identifier and target capacity of the current job, and aggregate the keys in the skewed partition set based on the cross-domain balanced repartitioning algorithm to obtain the total data volume;

[0125] Based on the total amount of data, all keys are sorted in descending order to obtain multiple target key values;

[0126] Based on the greedy bucketing strategy, the stage identifier, and the target capacity, all the target key values ​​are repartitioned to obtain a mapping relationship. Based on the mapping relationship, the skew partition data of the skew partition set is redistributed to obtain the allocation result.
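
The aggregation and ordering of keys that precede bucketing can be sketched as follows. The sketch assumes the statistics for the skewed partition set arrive as `(key, size)` records that may repeat a key, for example when several sites report the same key; the function name `sorted_target_keys` and the alphabetical tie-break are illustrative.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def sorted_target_keys(records: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Total the data volume per key, then order keys from largest to smallest."""
    totals: Counter = Counter()
    for key, size in records:
        totals[key] += size
    # Descending by total size; equal totals are ordered by key name (an assumption).
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
```

The resulting list is the order in which a greedy bucketing pass would consume the keys, largest first.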

[0127] Specifically, the step of repartitioning all target key values ​​according to the greedy bucketing strategy, the stage identifier, and the target capacity to obtain a mapping relationship, and then redistributing the skewed partition data of the skewed partition set according to the mapping relationship to obtain the allocation result, includes:

[0128] Obtain the load capacity of the first bucket set, and determine whether the load capacity is less than the target capacity;

[0129] If the load capacity is less than the target capacity, then all the target key values ​​are repartitioned according to the greedy bucketing strategy, the stage identifier and the target capacity to obtain a mapping table;

[0130] The mapping relationship is determined according to the mapping table, and the skew partition data of the skew partition set is redistributed according to the mapping relationship to obtain the allocation result.

[0131] The step of obtaining the load capacity of the first bucket set and determining whether the load capacity is less than the target capacity further includes:

[0132] If the load capacity is not less than the target capacity, then a second bucket set is established, and multiple target key values ​​are assigned to the second bucket set;

[0133] Obtain a preset coefficient and an optimal bucket set. Multiply the preset coefficient with the target capacity to obtain the small partition capacity. If the data volume of the second bucket set is less than the small partition capacity, then allocate the target key value of the second bucket set to the optimal bucket set.

[0134] The present invention also provides a computer-readable storage medium, wherein the computer-readable storage medium stores a repartitioning program based on cross-domain data skew, the repartitioning program based on cross-domain data skew implementing the steps of the repartitioning method based on cross-domain data skew as described above when executed by a processor.

[0135] In summary, this invention provides a repartitioning method, system, terminal, and storage medium based on cross-domain data skew. The method includes: acquiring multiple key-value pair statistics; performing partitioning statistics on the multiple key-value pair statistics according to a preset hash partitioning rule to obtain multiple initial data volumes; acquiring a preset skew detection threshold coefficient; determining skewness in all the initial data volumes according to the skew detection threshold coefficient to obtain a skewed partition set; acquiring the stage identifier and target capacity of the current job; repartitioning the keys in the skewed partition set according to a greedy bucketing strategy, the stage identifier, and the target capacity to obtain a mapping relationship; and reallocating the skewed partition set according to the mapping relationship to obtain an allocation result. This invention identifies skew based on key-value statistics and hash partitioning, and repartitions according to a greedy strategy to achieve efficient data allocation.

[0136] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal system that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal system. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or terminal system that includes that element.

[0137] Of course, those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware (such as a processor, controller, etc.). The program can be stored in a computer-readable storage medium, and when executed, it can include the processes described in the above method embodiments. The computer-readable storage medium can be a memory, magnetic disk, optical disk, etc.

[0138] It should be understood that the application of the present invention is not limited to the examples above. Those skilled in the art can make improvements or modifications based on the above description, and all such improvements and modifications should fall within the protection scope of the appended claims.

Claims

1. A repartitioning method based on cross-domain data skew, characterized in that the repartitioning method based on cross-domain data skew includes: obtaining multiple key-value pair statistics, and performing partition statistics on the multiple key-value pair statistics according to a preset hash partitioning rule to obtain multiple initial data volumes; obtaining a preset skew detection threshold coefficient, and performing skew determination on all the initial data volumes based on the skew detection threshold coefficient to obtain a skewed partition set; obtaining the stage identifier and target capacity of the current job, repartitioning the keys in the skewed partition set according to the greedy bucketing strategy, the stage identifier and the target capacity to obtain a mapping relationship, and reallocating the skewed partition set according to the mapping relationship to obtain an allocation result; wherein the skew detection threshold coefficient includes a first skew detection threshold coefficient and a second skew detection threshold coefficient; the step of obtaining a preset skew detection threshold coefficient and performing skew determination on all the initial data volumes based on the skew detection threshold coefficient to obtain a skewed partition set specifically includes: obtaining the number of partitions, and determining the first skew detection threshold coefficient and the second skew detection threshold coefficient based on the number of partitions; sorting all the initial data volumes using a cross-domain data skew detection algorithm to obtain the median; and performing skew determination on all the initial data volumes based on the median, the first skew detection threshold coefficient, and the second skew detection threshold coefficient to obtain the skewed partition set;
the step of obtaining the stage identifier and target capacity of the current job, repartitioning the keys in the skewed partition set according to the greedy bucketing strategy, the stage identifier, and the target capacity to obtain a mapping relationship, and reallocating the skewed partition set according to the mapping relationship to obtain the allocation result specifically includes: obtaining the stage identifier and target capacity of the current job, and aggregating the keys in the skewed partition set based on the cross-domain balanced repartitioning algorithm to obtain the total data volume; sorting all keys in descending order based on the total data volume to obtain multiple target key values; and repartitioning all the target key values based on the greedy bucketing strategy, the stage identifier, and the target capacity to obtain the mapping relationship, and redistributing the skewed partition data of the skewed partition set based on the mapping relationship to obtain the allocation result.

2. The repartitioning method based on cross-domain data skew according to claim 1, characterized in that, The key-value pair statistics include the key and the amount of data; The process of obtaining multiple key-value pair statistics and partitioning these statistics according to a preset hash partitioning rule to obtain multiple initial data volumes specifically includes: Obtain multiple keys and multiple data volumes, perform a pre-aggregation operation on all the keys and all the data volumes, and obtain the target statistical information; Obtain the hash value, and perform partition statistics on the target statistical information according to the preset hash partitioning rules and the hash value to obtain multiple initial data volumes.

3. The repartitioning method based on cross-domain data skew according to claim 1, characterized in that the step of performing skew determination on all the initial data volumes based on the median, the first skew detection threshold coefficient, and the second skew detection threshold coefficient to obtain a skewed partition set specifically includes: multiplying the median by the first skew detection threshold coefficient and by the second skew detection threshold coefficient, respectively, to obtain the first threshold and the second threshold; determining whether the initial data volume is greater than the first threshold or whether the initial data volume is less than the second threshold; and if the initial data volume is greater than the first threshold or less than the second threshold, determining the skewed partition set according to the partition identifier corresponding to the initial data volume.

4. The repartitioning method based on cross-domain data skew according to claim 1, characterized in that, The process of repartitioning all target key values ​​according to the greedy bucketing strategy, the stage identifier, and the target capacity to obtain a mapping relationship, and then redistributing the skewed partition data of the skewed partition set according to the mapping relationship to obtain the allocation result, specifically includes: Obtain the load capacity of the first bucket set, and determine whether the load capacity is less than the target capacity; If the load capacity is less than the target capacity, then all the target key values ​​are repartitioned according to the greedy bucketing strategy, the stage identifier and the target capacity to obtain a mapping table; The mapping relationship is determined according to the mapping table, and the skew partition data of the skew partition set is redistributed according to the mapping relationship to obtain the allocation result.

5. The repartitioning method based on cross-domain data skew according to claim 4, characterized in that, after obtaining the load capacity of the first bucket set and determining whether the load capacity is less than the target capacity, the method further includes: if the load capacity is not less than the target capacity, establishing a second bucket set and assigning multiple target key values to the second bucket set; obtaining a preset coefficient and an optimal bucket set; multiplying the preset coefficient by the target capacity to obtain the small partition capacity; and if the data volume of the second bucket set is less than the small partition capacity, allocating the target key values of the second bucket set to the optimal bucket set.

6. A repartitioning system based on cross-domain data skew, characterized in that, The cross-domain data skew-based repartitioning system is applied to the cross-domain data skew-based repartitioning method of any one of claims 1-5, wherein the cross-domain data skew-based repartitioning system comprises: The data statistics module is used to obtain statistical information of multiple key-value pairs, and perform partition statistics on the multiple key-value pair statistical information according to a preset hash partitioning rule to obtain multiple initial data volumes; The data skew determination module is used to obtain a preset skew detection threshold coefficient, and to determine the skewness of all the initial data based on the skew detection threshold coefficient to obtain a set of skewed partitions. The data allocation module is used to obtain the stage identifier and target capacity of the current job, repartition the keys in the skewed partition set according to the greedy bucketing strategy, the stage identifier and the target capacity to obtain the mapping relationship, and reallocate the skewed partition set according to the mapping relationship to obtain the allocation result.

7. A terminal, characterized in that, The terminal includes: a memory, a processor, and a cross-domain data skew-based repartitioning program stored in the memory and executable on the processor, wherein the cross-domain data skew-based repartitioning program, when executed by the processor, implements the steps of the cross-domain data skew-based repartitioning method as described in any one of claims 1-5.

8. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a repartitioning program based on cross-domain data skew, which, when executed by a processor, implements the steps of the repartitioning method based on cross-domain data skew as described in any one of claims 1-5.