A dynamic partitioning method and system based on data skew model

By employing a dynamic partitioning optimization method based on evolutionary algorithms and Q-learning, the data skew problem on the Spark platform was solved, achieving efficient data partitioning and resource utilization, thereby improving the execution efficiency of Spark jobs and system performance.

CN117931939B · Active · Publication Date: 2026-06-26 · HUANENG YIMIN COAL POWER CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: HUANENG YIMIN COAL POWER CO LTD
Filing Date: 2023-12-15
Publication Date: 2026-06-26

IPC: G06F16/27; G06F16/22; G06F9/50

CPC: G06F16/278; G06F16/2255; G06F9/5083; Y02D10/00

AI Technical Summary

Technical Problem

Existing dynamic partitioning algorithms based on data skew models suffer from several problems on the Spark platform, including insufficient static partitioning strategies, inadequate accuracy of data skew models, high computational overhead, uneven resource utilization, and insufficient generalization ability. These issues make it difficult to adapt to complex data skew patterns and real-time requirements.

Method used

An evolutionary algorithm-based data skew evaluation model is adopted, which combines random number partitioning, random number strategy for secondary allocation of hash partitions and adjacent position hash partitions. Dynamic partition optimization is performed through fitness function and Q-learning reinforcement learning algorithm, and the data distribution is monitored in real time and the partitioning strategy is adjusted.

Benefits of technology

It improves the execution efficiency of Spark jobs, balances the load on the computing cluster, makes full use of resources, enhances the adaptability and real-time performance of the system, and improves resource utilization and performance.

Abstract

The application discloses a dynamic partitioning method and system based on a data skew model, relating to the field of big data technology. The method includes: collecting data and preprocessing it; establishing a data skew evaluation model that predicts, from the characteristics of the preprocessed data, the degree of data skew during job processing, and determining a partitioning optimization strategy; and designing an evaluation experiment to compare dynamic partitioning with static partitioning while repeatedly optimizing the dynamic partitioning. The proposed dynamic partitioning algorithm selects a suitable partitioning strategy according to the skew degree of the data being processed. Performance comparison experiments that process data with three Spark-based dynamic partitioning optimization methods are designed to verify the universality and efficiency of the scheme, which balances the load across the whole computing cluster, makes full use of the cluster's computing resources, and completes data processing and computation more efficiently.

Description

Technical Field

[0001] This invention relates to the field of big data, and in particular to a dynamic partitioning method and system based on a data skew model.

Background Technology

[0002] Currently, large-scale data processing jobs running on the Spark platform involve complex data distributions and processing requirements. As data volume increases, data skew becomes a significant problem, where some data partitions contain far more data than others, leading to uneven job execution times, performance degradation, and unbalanced resource utilization. Traditional static partitioning strategies cannot effectively handle this real-time, dynamic data skew issue, thus requiring a more dynamically adaptable solution.

[0003] Existing dynamic partitioning algorithms based on data skew models have several shortcomings on the Spark platform. First, some algorithms still employ static partitioning strategies, failing to dynamically adjust during job execution and thus limiting the system's adaptability to real-time data skew. Second, the accuracy of data skew models may be insufficient, failing to comprehensively capture complex data skew patterns, leading to inaccuracies in dynamic partitioning strategies and consequently impacting system performance. Furthermore, some algorithms introduce significant computational overhead, especially in the context of real-time performance adjustments, potentially increasing job execution time and reducing the algorithm's practical utility. Scalability is also an issue; some algorithms may perform poorly on large-scale datasets because they may require global data analysis, resulting in performance degradation. Inappropriate resource utilization is another challenge; some algorithms may lead to uneven resource allocation, resulting in resource waste and poor performance.

[0004] Furthermore, these algorithms lack generalization ability, making it difficult to adapt to data skew situations that may occur in various real-world applications, thus limiting their adaptability in diverse scenarios. Finally, some algorithms lack real-time performance adjustment mechanisms, making it difficult to respond instantly to data skew that occurs in real time, reducing the system's dynamism and real-time performance. These shortcomings collectively affect the effectiveness of existing technologies in handling data skew on the Spark platform, requiring further improvement and innovation to enhance their adaptability, performance, and practicality.

Summary of the Invention

[0005] This invention is proposed in view of the problems in existing dynamic partitioning methods and systems based on data skew models.

[0006] Therefore, the problem to be solved by this invention is how to monitor data distribution in real time and dynamically adjust the partitioning strategy according to the actual situation in order to maintain the balanced execution of jobs.

[0007] To solve the above-mentioned technical problems, the present invention provides the following technical solution:

[0008] In a first aspect, embodiments of the present invention provide a dynamic partitioning method based on a data skew model, comprising: collecting data and preprocessing the data; establishing a data skew evaluation model to predict the preprocessed data, predicting the degree of data skew during job processing based on data characteristics, and determining a partitioning optimization strategy; designing an evaluation experiment to compare dynamic partitioning with static partitioning, and repeatedly optimizing the dynamic partitioning.

[0009] As a preferred embodiment of the dynamic partitioning method based on the data skew model described in this invention, the data skew evaluation model is an evolutionary algorithm-based model, comprising: establishing a chromosome containing multiple genes, each chromosome being an individual and each gene corresponding to one partitioning operation. The chromosome comprises three parts, which represent, respectively, random number partitioning, hash partitioning with secondary allocation using a random number strategy, and adjacent position hash partitioning, denoted as follows:

[0010] Chromosome=(Gene1, Gene2, Gene3)

[0011] Where Gene1 represents random number partitioning, Gene2 represents hash partitioning based on secondary allocation using the random number strategy, and Gene3 represents hash partitioning based on adjacent positions. A fitness function is designed, considering task execution time, resource utilization, load balancing, and combinations of different partitioning strategies, while also taking into account the tail shape and heterogeneity of the data. The formula is expressed as:

[0012]

[0013] Here, X_i is the i-th observation in the dataset, n is the number of observations in the dataset, and α and β are parameters that control tail shape and heterogeneity. The formula also uses the mean of the dataset, and its two additional terms account for the tail shape and heterogeneity of the data.

[0014] As a preferred embodiment of the dynamic partitioning method based on the data skew model described in this invention, the fitness function is improved by introducing a tradeoff factor γ that varies within the range of [0, 1], as expressed by the formula:

[0015]

[0016] When γ is 0, the formula is equivalent to the original Skewness formula; when γ is 1, it means that the influence of the two has equal weight.

[0017] As a preferred embodiment of the dynamic partitioning method based on the data skew model described in this invention, the fitness function is used to calculate the selection probability, expressed by the following formula:

[0018]

[0019] Here, Weight_Fitness, Weight_TailShape, and Weight_Heterogeneity are the weights for fitness, tail shape, and heterogeneity, respectively.

[0020] As a preferred embodiment of the dynamic partitioning method based on the data skew model described in this invention, the fitness function is applied to evaluate individuals and, combined with the characteristics of the individuals, including tail shape and heterogeneity, it is determined whether the current solution of the fitness function is the optimal solution. The optimal solution is determined either when the number of iterations reaches the upper limit or when the fitness converges, in which case the maximum fitness value is taken as the optimal solution. The solution at the current moment is compared with the maximum value; if it is greater than the maximum value, it becomes the new maximum value. If the differences between the solution at the current moment, the solution at the previous moment, and the current maximum value do not exceed m% of the current maximum value, the current maximum value is taken as the optimal solution.

[0021] As a preferred embodiment of the dynamic partitioning method based on the data skew model described in this invention, the partitioning optimization strategy includes random number partitioning, random number strategy-based secondary allocation hash partitioning, and adjacent position hash partitioning. Random number partitioning assumes n <key, value> key-value pairs and m partitions. If the n data items are randomly allocated to the m partitions, the allocation result is uniform; a random function returns partition numbers between 0 and (m-1). Random number strategy-based secondary allocation hash partitioning first uses a random number function to return a partition number. From the total number of data items and the set number of partitions, the average number of items per partition is obtained, which also serves as the threshold. A Map cache stores the number of data items in each partition. For each item, it is determined whether the number of data items in the chosen partition has reached the threshold; if it has, a partition that has not yet reached the threshold is found and the item is stored there. Adjacent position hash partitioning uses random numbers to obtain an initial partition number and then finds the partition numbers of the two adjacent positions. The number of partitions is defined, and the number of data items in each partition is stored in the Map cache. The remainder of the hash partition number divided by the number of partitions identifies the partition whose data count is looked up, and the data counts of the two adjacent partition numbers are looked up as well. The data counts of the three partition numbers are compared, the data is placed in the partition holding the least data, and that partition number is returned.
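The three strategies in paragraph [0021] can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the use of a `defaultdict` as the "Map cache", the CRC32 hash, and the tie-breaking rule are all assumptions. The source is ambiguous about whether the initial partition number in the adjacent position strategy comes from a random number or from the hash; the sketch assumes the key's hash modulo the number of partitions.

```python
import random
import zlib
from collections import defaultdict


def random_partition(num_partitions, rng):
    """Random number partitioning: return a partition number in [0, m-1]."""
    return rng.randrange(num_partitions)


def secondary_allocation_partition(counts, num_partitions, threshold, rng):
    """Draw a random partition; if it is full, reallocate below the threshold."""
    pid = rng.randrange(num_partitions)
    if counts[pid] >= threshold:
        # Partitions whose item count has not yet reached the threshold.
        open_ids = [p for p in range(num_partitions) if counts[p] < threshold]
        if open_ids:  # assumption: keep the first draw if every partition is full
            pid = rng.choice(open_ids)
    counts[pid] += 1
    return pid


def adjacent_position_partition(key, counts, num_partitions):
    """Pick the least-loaded of the hashed partition and its two neighbours."""
    # Assumption: a stable CRC32 hash stands in for the "hash partition number".
    base = zlib.crc32(str(key).encode("utf-8")) % num_partitions
    candidates = [(base - 1) % num_partitions, base, (base + 1) % num_partitions]
    # Assumption: ties are broken in favour of the smaller partition number.
    pid = min(candidates, key=lambda p: (counts[p], p))
    counts[pid] += 1
    return pid
```

In this sketch the threshold would be the ceiling of n / m, for example `-(-n // m)`, and `counts` would be a fresh `defaultdict(int)` per job. With a single hot key, the adjacent position strategy spreads its records over three partitions instead of one.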

[0022] As a preferred embodiment of the dynamic partitioning method based on the data skew model described in this invention, the repeated optimization process includes: adjusting the data skew evaluation model using a loss function, expressed as follows:

[0023]

[0024] Here, the loss compares the model's predicted value with the actual data skewness of the i-th sample. The Q-learning reinforcement learning algorithm is introduced, and the Q-value is updated in each iteration as follows:

[0025] Q(s,a) ← Q(s,a) + α·(R + γ·max_a′ Q(s′,a′) − Q(s,a))

[0026] Here, Q(s, a) is the current Q value of performing action a in state s, R is the immediate reward obtained after performing action a, α is the learning rate, γ is the discount factor, and max_a′ Q(s′, a′) is the Q value of the best action a′ available in the next state s′.

[0027] In a second aspect, embodiments of the present invention provide a dynamic partitioning system based on a data skew model, comprising: an acquisition module for acquiring data and preprocessing it; a construction module for constructing a model and selecting the optimal partitioning optimization strategy; and an optimization module for optimizing the model to achieve real-time adjustment of dynamic partitions.

[0028] In a third aspect, embodiments of the present invention provide a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement any step of the above-described dynamic partitioning method based on a data skew model.

[0029] In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium having a computer program stored thereon, wherein, when the computer program is executed by a processor, it implements any step of the above-described dynamic partitioning method based on a data skew model.

[0030] The beneficial effects of this invention are that it proposes a dynamic partitioning algorithm based on a data skew model for the Spark platform. This algorithm can select a suitable partitioning strategy according to the degree of data skewness, efficiently completing data partitioning and improving the execution efficiency of Spark jobs under data skew conditions. Comparative experiments are conducted to process data using algorithms based on three optimization methods for Spark dynamic partitioning, designing and verifying the universality and efficiency of the proposed dynamic partitioning scheme. This enables load balancing across the entire computing cluster, fully utilizing the cluster's computing resources and completing data processing and computation more efficiently. Comparative experiments with the default partitioning scheme verify the universality and efficiency of the proposed scheme.

Attached Figure Description

[0031] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort. Wherein:

[0032] Figure 1 is a flowchart of the present invention;

[0033] Figure 2 is a flowchart of the random number partitioning strategy;

[0034] Figure 3 is a flowchart of the random number strategy-based secondary allocation hash partitioning strategy;

[0035] Figure 4 is a flowchart of the adjacent position hash partitioning strategy.

Detailed Implementation

[0036] To make the above-mentioned objects, features, and advantages of the present invention more apparent and understandable, specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the protection scope of the present invention.

[0037] Many specific details are set forth in the following description in order to provide a full understanding of the invention. However, the invention may also be practiced in other ways different from those described herein, and those skilled in the art can make similar extensions without departing from the spirit of the invention. Therefore, the invention is not limited to the specific embodiments disclosed below.

[0038] The term "one embodiment" or "embodiment" as used herein refers to a specific feature, structure, or characteristic that may be included in at least one implementation of the present invention. The phrase "in one embodiment" appearing in different places in this specification does not necessarily refer to the same embodiment, nor to a separate or alternative embodiment that is mutually exclusive with other embodiments.

[0039] Example 1

[0040] Referring to Figures 1-4, the first embodiment of the present invention provides a dynamic partitioning method based on a data skew model, including:

[0041] S1: Collect data and preprocess it.

[0042] Collect historical data on system operation, including task execution time, resource utilization, and data distribution. Ensure the dataset is diverse, including data with different characteristics and skewness levels. Conduct multiple experiments, each using a different subset or randomly selected data to reduce the impact of randomness on the results. Select publicly available standard datasets for comparison to ensure the results are comparable to other studies. Normalize the data to complete dataset creation.

[0043] S2: Establish a data skew evaluation model to make predictions on the preprocessed data, predict the degree of data skew during job processing based on data characteristics, and determine the partitioning optimization strategy.

[0044] Models based on evolutionary algorithms specifically include:

[0045] S2.1: Each individual is represented as a chromosome containing multiple genes, with each gene corresponding to one partitioning operation. The chromosome consists of three parts, which represent, respectively, random number partitioning, hash partitioning with secondary allocation using a random number strategy, and adjacent position hash partitioning. It is expressed as:

[0046] Chromosome=(Gene1, Gene2, Gene3)

[0047] Where Gene1 represents the random number partition, Gene2 represents the hash partition for secondary allocation using the random number strategy, and Gene3 represents the hash partition for adjacent positions.

[0048] During the population initialization phase, a set of individuals is randomly generated, each containing the values of three randomly selected genes. These values should be randomly selected within an appropriate range to ensure that the diversity of partitioning strategies can be explored.

[0049] S2.2: Design a fitness function that considers metrics such as task execution time, resource utilization, and load balancing, as well as combinations of different partitioning strategies, and takes into account the tail shape and heterogeneity of the data. The formula is as follows:

[0050]

[0051] Here, X_i is the i-th observation in the dataset, n is the number of observations in the dataset, and α and β are parameters that control tail shape and heterogeneity. The formula also uses the mean of the dataset, and its two additional terms account for the tail shape and heterogeneity of the data. These terms allow the Skewness calculation to weight different parts of the data distribution to varying degrees, in order to capture the skewness of the distribution more comprehensively.

[0052] In this invention, a tradeoff factor γ is introduced, varying within the range [0, 1]. This factor is adjusted to balance the effects of tail shape and heterogeneity, as expressed by the formula:

[0053]

[0054] When calculating tail shape and heterogeneity, the relative influence of the two can be adjusted through the tradeoff factor γ. When γ is 0, the formula is equivalent to the original Skewness formula; when γ is 1, the two have equal weight.
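The extended fitness formula is not reproduced in this text, so only the γ = 0 baseline can be illustrated. The sketch below assumes that the "original Skewness formula" is the standard moment-based skewness (third central moment divided by the 1.5 power of the second central moment); the function name and the use of per-partition record counts as input are illustrative assumptions.

```python
def sample_skewness(values):
    """Moment-based skewness m3 / m2**1.5 using population moments."""
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # all values equal: no skew
    return m3 / m2 ** 1.5
```

Applied to the record counts of each partition, a value near 0 indicates a balanced layout, while a large positive value indicates that a few partitions hold far more data than the rest.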

[0055] S2.3: Individuals are evaluated based on the fitness function to determine which individuals will be selected for the next generation of evolution. Since the fitness function already includes considerations for tail shape and heterogeneity, the goal of the selection operation should be to retain these advantageous traits. In the selection strategy, the weighting of these traits can be increased, or selection probabilities that specifically consider tail shape and heterogeneity can be introduced. Furthermore, assuming that the tail shape and heterogeneity components are within the same range and the sum of their weights is 1, the improved selection probability calculation formula is as follows:

[0056]

[0057] Here, Weight_Fitness, Weight_TailShape, and Weight_Heterogeneity are the weights for fitness, tail shape, and heterogeneity, respectively.

[0058] Individuals with good tail shape and heterogeneity traits are selectively retained, which can be achieved by performing some additional processing on these individuals. This includes identifying individuals with good characteristics based on the tail shape and heterogeneity terms in the fitness function before the selection operation.

[0059] Individuals identified as having better characteristics are given a higher probability of selection or are given special selection mechanisms to make them more likely to be selected.

[0060] By combining special treatments with other conventional selection strategies, the entire selection process is made diverse and targeted.
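One way to read step S2.3 is as roulette-wheel selection over a weighted score. The selection probability formula itself is not reproduced in this text, so the sketch below is an assumption: it combines fitness, tail shape, and heterogeneity linearly with the three weights and normalizes the result into a probability. The dictionary keys and function names are hypothetical, and scores are assumed to be non-negative.

```python
import random


def selection_probabilities(individuals, w_fitness, w_tail, w_het):
    """Normalize a weighted score per individual into a probability."""
    scores = [
        w_fitness * ind["fitness"] + w_tail * ind["tail"] + w_het * ind["het"]
        for ind in individuals
    ]
    total = sum(scores)
    if total <= 0:
        # Degenerate case: fall back to a uniform distribution.
        return [1.0 / len(individuals)] * len(individuals)
    return [score / total for score in scores]


def select(individuals, probabilities, k, rng):
    """Roulette-wheel selection of k individuals, with replacement."""
    return rng.choices(individuals, weights=probabilities, k=k)
```

Raising `w_tail` or `w_het` relative to `w_fitness` corresponds to the "increased weighting" of advantageous traits described above.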

[0061] S2.4: After selection, a crossover operation is performed, taking into account the characteristics of the fitness function, to ensure that information on tail shape and heterogeneity is properly transmitted and combined. A three-point crossover adjustment is used, in which genes related to tail shape and heterogeneity in the fitness function have a greater chance of becoming crossover points. For the crossover points to be selected, it is necessary to ensure that: (1) gene segments related to tail shape and heterogeneity in the fitness function are retained as much as possible; (2) through appropriate combination strategies, the offspring chromosomes can inherit the advantageous characteristics from the parents. This ensures that the offspring chromosomes reasonably combine the genes of the parents.

[0062] S2.5: Introduce mutation to ensure population diversity. Considering tail shape and heterogeneity, the magnitude of mutation can be reduced for genes with favorable traits so that the mutated gene still retains the advantageous trait. When mutating, the focus should be on local adjustments rather than large-scale global changes, which helps to preserve the locally advantageous traits of individuals.

[0063] For individuals with favorable traits, it may be advisable to directly replicate the individual without mutation. This ensures that advantageous traits are preserved.
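The mutation rules of step S2.5 can be sketched as follows. The gene encoding is not specified in the source, so this illustration assumes each gene is a float in [0, 1]; the `step` and `damping` parameters, the `favorable` flags, and the function names are hypothetical.

```python
import random


def mutate(genes, favorable, rng, step=0.2, damping=0.25, low=0.0, high=1.0):
    """Perturb each gene locally; favorable genes get a smaller step."""
    mutated = []
    for value, keep in zip(genes, favorable):
        scale = step * damping if keep else step
        new_value = value + rng.uniform(-scale, scale)
        mutated.append(min(high, max(low, new_value)))  # clamp to the range
    return mutated


def next_individual(genes, favorable, is_elite, rng):
    """Individuals with favorable traits are copied unchanged; others mutate."""
    if is_elite:
        return list(genes)
    return mutate(genes, favorable, rng)
```

Bounding the perturbation by `step` keeps the change local, and the `damping` factor shrinks it further for genes that already carry an advantageous trait.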

[0064] S2.6: Apply the fitness function to evaluate each newly generated individual, considering individual characteristics such as tail shape and heterogeneity, to ensure the accuracy and comprehensiveness of the evaluation. The purpose of the evaluation is to quantify the merits of each individual.

[0065] S2.7: In order to ensure that individuals in the population gradually tend to be more adaptable during the evolutionary process, individuals with better adaptability are selected from the parents and offspring to form the next generation of the population.

[0066] Repeat steps S2.3 to S2.7 until the iteration limit is reached or the fitness reaches the optimal solution. Select the chromosome with the optimal solution at this point and take its partitioning strategy as the optimal partitioning strategy.

[0067] The number of iterations is adjusted according to the fitness, and the maximum fitness value is taken as the optimal solution. The solution at the current moment is compared with the maximum value; if it is greater, it becomes the new maximum value. If the differences between the solution at the current moment, the solution at the previous moment, and the current maximum value do not exceed 5% of the current maximum value, the current maximum value is taken as the optimal solution.
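The stopping rule in paragraph [0067] can be sketched as follows, under one reading of the text: evolution stops when the current solution, the previous solution, and the running maximum all lie within 5% of the maximum of each other, or when the iteration limit is reached. The function names and the representation of each generation by a single best-fitness number are assumptions.

```python
def has_converged(current, previous, best, tolerance=0.05):
    """True when current, previous and best differ by at most tolerance * best."""
    limit = tolerance * abs(best)
    return (
        abs(current - previous) <= limit
        and abs(best - current) <= limit
        and abs(best - previous) <= limit
    )


def run_until_converged(fitness_per_generation, max_iterations, tolerance=0.05):
    """Return (best fitness, generations used) for a sequence of fitness values."""
    best = float("-inf")
    previous = None
    used = 0
    for current in fitness_per_generation[:max_iterations]:
        used += 1
        best = max(best, current)  # keep the running maximum
        if previous is not None and has_converged(current, previous, best, tolerance):
            break
        previous = current
    return best, used
```

In a full evolutionary loop, `current` would be the best fitness of the generation produced by steps S2.3 to S2.7.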

[0068] S3: Design an evaluation experiment to compare dynamic partitioning with static partitioning, and repeatedly optimize the dynamic partitioning.

[0069] The evaluation experiments include assessments of load balancing and processing efficiency. For load balancing, the execution time ET_i of each task t_i is recorded and the resource utilization of each node N_j, including CPU, memory, and network, is calculated; the two partitioning strategies are then evaluated comprehensively based on the differences between them. For processing efficiency, the overall job execution time is calculated, and the processing speed is initially defined as the reciprocal of the job execution time. The partitioning strategy is evaluated based on the performance indicators under the two partitioning strategies.
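The two evaluation metrics above can be sketched as follows. The processing speed follows the definition in paragraph [0069]; using the difference between the most and least loaded node as the load-balancing indicator is an assumption, as are the function names and the dictionary layout.

```python
def processing_speed(job_execution_time):
    """Processing speed defined as the reciprocal of the job execution time."""
    if job_execution_time <= 0:
        raise ValueError("job execution time must be positive")
    return 1.0 / job_execution_time


def max_node_load_difference(node_utilization):
    """Gap between the most and least utilized node (assumed balance metric)."""
    values = list(node_utilization.values())
    return max(values) - min(values)


def compare_strategies(static, dynamic):
    """Summarize dynamic against static partitioning for one job."""
    return {
        "speedup": processing_speed(dynamic["job_time"])
        / processing_speed(static["job_time"]),
        "load_gap_static": max_node_load_difference(static["utilization"]),
        "load_gap_dynamic": max_node_load_difference(dynamic["utilization"]),
    }
```

A speedup above 1 together with a smaller load gap would indicate that dynamic partitioning both finishes sooner and balances the nodes better for that job.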

[0070] The iterative optimization process includes: adjusting the data skew assessment model, introducing reinforcement learning algorithms, and real-time monitoring and adjustment. The following loss function is used to optimize the accuracy and efficiency of the data skew assessment model:

[0071]

[0072] Here, the loss compares the model's predicted value with the actual data skewness of the i-th sample.

[0073] Introducing the Q-learning reinforcement learning algorithm enables the system to automatically adjust parameters and policies based on real-time performance, updating the Q-value in each iteration, as shown below:

[0074] Q(s,a) ← Q(s,a) + α·(R + γ·max_a′ Q(s′,a′) − Q(s,a))

[0075] Here, Q(s, a) is the current Q value of performing action a in state s, R is the immediate reward obtained after performing action a, α is the learning rate, γ is the discount factor, and max_a′ Q(s′, a′) is the Q value of the best action a′ available in the next state s′.
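The update rule in paragraph [0074] corresponds to a standard tabular Q-learning step, sketched below. How the patent encodes states, actions, and rewards is not specified, so the state labels, the action names, and the default values of `alpha` and `gamma` in the usage example are purely hypothetical.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply Q(s,a) <- Q(s,a) + alpha * (R + gamma * max_a' Q(s',a') - Q(s,a))."""
    # Value of the best action available in the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q[(state, action)]


# Hypothetical usage: the actions are the three partitioning strategies.
q_table = defaultdict(float)
strategies = ["random", "secondary_allocation", "adjacent_position"]
q_update(q_table, "high_skew", "adjacent_position", 1.0, "low_skew", strategies)
```

Here the reward could, for example, be derived from the measured processing speed or load balance after a job, so that strategies that reduce skew accumulate higher Q values over time.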

[0076] By continuously analyzing experimental results, adjusting model and algorithm parameters, and introducing adaptive mechanisms, the system performance is gradually improved, enabling it to better cope with different situations and data distributions and select the most suitable partitioning optimization strategy.

[0077] In conclusion,

[0078] Example 2

[0079] Building upon the first embodiment, this embodiment further provides a dynamic partitioning system based on a data skew model, comprising:

[0080] The acquisition module is used to acquire data and preprocess it;

[0081] The construction module is used to construct the model and select the optimal partitioning optimization strategy;

[0082] The optimization module is used to optimize the model and enable real-time adjustment of dynamic partitions.

[0083] This embodiment also provides a computer device applicable to the dynamic partitioning method based on a data skew model, including a memory and a processor; the memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions to implement the dynamic partitioning method based on a data skew model as proposed in the above embodiment.

[0084] The computer device can be a terminal, comprising a processor, memory, communication interface, display screen, and input devices connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, carrier networks, NFC (Near Field Communication), or other technologies. The display screen can be an LCD screen or an e-ink screen. The input devices can be a touch layer covering the display screen, buttons, a trackball, or a touchpad on the computer device's casing, or an external keyboard, touchpad, or mouse.

[0085] This embodiment also provides a storage medium on which a computer program is stored. When the program is executed by a processor, it implements the dynamic partitioning method based on the data skew model proposed in the above embodiments.

[0086] The storage medium proposed in this embodiment and the dynamic partitioning method proposed in the above embodiments belong to the same inventive concept. Technical details not described in detail in this embodiment can be found in the above embodiments, and this embodiment has the same beneficial effects as the above embodiments.

[0087] Example 3

[0088] Based on the previous two embodiments, this embodiment provides a dynamic partitioning method based on a data skew model. To verify the beneficial effects of the present invention, scientific demonstration is carried out through economic benefit calculations and simulation experiments.

[0089] Compared with existing technologies, this invention exhibits significant advantages in real-time dynamic adjustment, accurate data skew model, optimized computational overhead, rational resource utilization, better generalization ability, and overall performance improvement. By introducing a real-time dynamic adjustment mechanism and adopting a more accurate data skew model, this invention achieves superior performance, as shown in Table 1:

[0090] Table 1 Comparison of Advantages

[0091]

[0092] This invention achieves more even resource utilization and demonstrates excellent generalization ability by introducing a real-time dynamic adjustment mechanism and employing a more accurate data skew model, thus comprehensively improving overall performance. In contrast, existing technologies may be limited by static partitioning and model accuracy, making it difficult to adapt to real-time requirements and diverse scenarios, resulting in a disadvantage in overall performance.

[0093] Similarly, the experimental data show that this invention outperforms existing technologies in job execution time, resource utilization, accuracy of data skew handling, and real-time performance adjustment overhead. These data reflect more concretely the performance improvements of this invention in practical applications, as detailed in Table 2.

[0094] Table 2 Performance Comparison Table

[0095]

[0096]

[0097] In the experimental data, this invention outperforms existing technologies on the key performance indicators: job execution time, average task execution time, maximum node load difference, accuracy of data skew handling, real-time performance adjustment overhead, system stability, and data distribution uniformity. Specifically, it achieves shorter job execution time, shorter average task execution time, a smaller maximum node load difference, higher accuracy of data skew handling, lower real-time performance adjustment overhead, higher system stability, and more uniform data distribution, which together improve overall system performance. Existing technologies may be subject to a series of limitations that leave them behind this invention on these indicators.

[0098] It should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, and all such modifications or substitutions should be covered within the scope of the claims of the present invention.

Claims

1. A dynamic partitioning method based on a data skew model, characterized by comprising:
collecting data and preprocessing it;
establishing a data skew assessment model to make predictions on the preprocessed data, predicting the degree of data skew during job execution based on data characteristics, and determining the partition optimization strategy;
wherein the data skew assessment model is an evolutionary-algorithm-based model, including: constructing a chromosome containing multiple genes, each chromosome representing an individual and each gene corresponding to one partitioning operation, the chromosome consisting of three parts, namely random number partitioning, hash partitioning with secondary allocation using a random number strategy, and hash partitioning of adjacent positions, represented as follows: [formula not reproduced], where the three symbols denote, in order, random number partitioning, hash partitioning with secondary allocation using a random number strategy, and hash partitioning of adjacent positions;
designing a fitness function that considers task execution time, resource utilization, load balancing, and combinations of different partitioning strategies, while also taking into account the tail shape and anomalousness of the data, expressed as: [formula not reproduced], where the symbols denote, in order, the i-th observation in the dataset, the mean of the dataset, the number of observations in the dataset, the two parameters that control tail shape and anomalousness, and the two terms that account for tail shape and anomalousness;
wherein the partition optimization strategy includes random number partitioning, hash partitioning with secondary allocation using a random number strategy, and hash partitioning of adjacent positions;
for random number partitioning, assume there are n <key, value> pairs and m partitions; if the n data items are assigned to the m partitions at random, the assignment result is uniform; a random function is used to return a partition number between 0 and (m-1);
hash partitioning with secondary allocation using a random number strategy uses a random number function to return a partition number and then obtains the total number of data items; next, based on the set number of partitions, the average number of items per partition is obtained, which also serves as the threshold; a Map cache stores the number of data items placed in each partition; on each assignment, it is checked whether the number of items in the chosen partition has reached the threshold; if the threshold has been reached, the partition numbers that have not yet reached the threshold are traversed and the data item is stored in such a partition;
hash partitioning of adjacent positions uses a random number to obtain an initial partition number and then finds the partition numbers at the two positions adjacent to it; the per-partition item counts are stored in the Map cache; the number of partitions is defined; the partition number obtained from hash partitioning is taken modulo the number of partitions and the item count of that partition is obtained; the item counts of the two adjacent partition numbers are also obtained; the item counts of the three partition numbers are compared, the data item is placed in the partition with the fewest items, and that partition number is returned;
designing an evaluation experiment that compares dynamic partitioning with static partitioning, and repeatedly optimizing the dynamic partitioning.
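
The three partitioning strategies recited in claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class and function names are hypothetical, the threshold is rounded up to a whole number of items, neighbouring partition numbers are assumed to wrap around modulo m, and ties are broken in favour of the smaller partition number.

```python
import random
from collections import defaultdict


def random_partition(m, rng):
    """Random number partitioning: return a partition number in [0, m-1]."""
    return rng.randrange(m)


class SecondaryAllocationPartitioner:
    """Random partitioning with threshold-based secondary allocation."""

    def __init__(self, total_items, m, rng):
        self.m = m
        self.rng = rng
        # Threshold = average items per partition (rounding up is an assumption).
        self.threshold = -(-total_items // m)
        self.counts = defaultdict(int)  # plays the role of the Map cache

    def assign(self):
        p = self.rng.randrange(self.m)
        if self.counts[p] >= self.threshold:
            # Chosen partition is full: traverse the partition numbers and
            # take the first one that is still below the threshold.
            for q in range(self.m):
                if self.counts[q] < self.threshold:
                    p = q
                    break
        self.counts[p] += 1
        return p


class AdjacentPositionPartitioner:
    """Pick the least-loaded of a random partition and its two neighbours."""

    def __init__(self, m, rng):
        self.m = m
        self.rng = rng
        self.counts = defaultdict(int)  # plays the role of the Map cache

    def assign(self):
        p = self.rng.randrange(self.m)
        # Wrap-around neighbours are an assumption of this sketch.
        candidates = [(p - 1) % self.m, p, (p + 1) % self.m]
        best = min(candidates, key=lambda q: (self.counts[q], q))
        self.counts[best] += 1
        return best
```

With 100 items and 4 partitions, for example, the secondary-allocation sketch caps every partition at 25 items, so the final distribution is exactly even regardless of the random draws.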

2. The dynamic partitioning method based on a data skew model as described in claim 1, characterized in that: the fitness function is improved by introducing a trade-off factor that varies within the range [0, 1], expressed as: [formula not reproduced], wherein when the trade-off factor is 0 the formula is equivalent to the original skewness formula, and when the trade-off factor is 1 the two influences have equal weight.
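
For reference, the baseline that claim 2 reduces to when the trade-off factor is 0 can be illustrated with the standard moment coefficient of skewness. This is only a sketch: the patent's own formula is not reproduced in this text, so the exact form used here (third central moment divided by the 1.5 power of the second central moment) and the function name are assumptions.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # all values equal: no skew
    return m3 / m2 ** 1.5
```

Applied to per-partition item counts, a value near 0 indicates a symmetric load distribution, while a large positive value indicates a few heavily loaded partitions.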

3. The dynamic partitioning method based on the data skew model as described in claim 2, characterized in that: the selection probability is calculated from the fitness function, expressed as follows: [formula not reproduced], wherein the three weights in the formula are the weights for fitness, tail shape, and anomalousness, respectively.

4. The dynamic partitioning method based on the data skew model as described in claim 3, characterized in that: the fitness function is applied to evaluate each individual and, in combination with the individual's characteristics, including tail shape and anomalousness, it is determined whether the current solution of the fitness function is the optimal solution; when the number of iterations reaches the upper limit, the maximum fitness value found is taken as the optimal solution; the solution at the current moment is compared with the maximum value, and if the solution at the current moment is greater than the maximum value, it becomes the new maximum value; likewise, if the differences among the solution at the current moment, the solution at the previous moment, and the current maximum value do not exceed m% of the current maximum value, the current maximum value is taken as the optimal solution.
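
The stopping rule of claim 4 can be sketched as below. The sketch assumes the per-iteration fitness values are supplied as a sequence and reads "the differences do not exceed m%" as two checks: current versus previous solution, and current solution versus current maximum. The function name and this reading are assumptions.

```python
def find_optimum(fitness_values, max_iters, m_percent):
    """Return (best_fitness, iterations_used) under the claim-4 stopping rule."""
    best = float("-inf")
    prev = None
    iterations = 0
    for current in fitness_values:
        if iterations >= max_iters:
            break  # upper limit on iterations reached
        iterations += 1
        if current > best:
            best = current  # current solution becomes the new maximum
        tol = abs(best) * m_percent / 100.0
        if (prev is not None
                and abs(current - prev) <= tol
                and abs(best - current) <= tol):
            return best, iterations  # converged within m% of the maximum
        prev = current
    return best, iterations
```

For the sequence 1.0, 2.0, 3.0, 3.01 with m = 1, the rule stops at the fourth iteration, because 3.01 differs from both the previous value and the running maximum by less than 1% of that maximum.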

5. The dynamic partitioning method based on the data skew model as described in claim 4, characterized in that: the process of repeated optimization includes: adjusting the data skew assessment model using a loss function, expressed as: [formula not reproduced], where the two quantities in the formula are the predicted value of the model and the actual data skewness; and introducing the Q-learning reinforcement learning algorithm and updating the Q value in each iteration, expressed as: $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$, where $Q(s, a)$ is the current Q value when performing action a in state s, R is the immediate reward obtained after performing action a, $\alpha$ is the learning rate, $\gamma$ is the discount factor, and $\max_{a'} Q(s', a')$ is the Q value corresponding to the optimal action a′ to be performed in the next state s′.
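
The Q-value update recited in claim 5 corresponds to the standard tabular Q-learning rule, which can be sketched as follows. The state and action labels in the usage note are hypothetical; the patent text does not specify how states, actions, or rewards are encoded.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """Apply one tabular Q-learning update and return the new Q(state, action)."""
    # Q value of the best action available in the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    # Move Q(s, a) toward the target R + gamma * max_a' Q(s', a').
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q[(state, action)]
```

As a usage example, with `q = defaultdict(float)`, an existing entry `q[("s1", "hash")] = 2.0`, reward 1.0, learning rate 0.5, and discount factor 0.9, updating `("s0", "random")` with next state `"s1"` moves its Q value from 0 to 0.5 × (1.0 + 0.9 × 2.0) = 1.4.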

6. A dynamic partitioning system based on a data skew model, implementing the dynamic partitioning method based on a data skew model as described in any one of claims 1 to 5, characterized by comprising: an acquisition module, used to acquire data and preprocess it; a building module, used to build the model and select the optimal partition optimization strategy; and an optimization module, used to optimize the model and enable real-time adjustment of dynamic partitions.

7. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that: when the processor executes the computer program, it implements the steps of the dynamic partitioning method based on the data skew model as described in any one of claims 1 to 5.

8. A computer-readable storage medium having a computer program stored thereon, characterized in that: when the computer program is executed by a processor, it implements the steps of the dynamic partitioning method based on the data skew model as described in any one of claims 1 to 5.