A Flink load balancing method and system based on multi-index performance evaluation
By constructing a multi-index performance evaluation model and a dynamic resource matching algorithm, Flink's task allocation is optimized, solving the problem of unbalanced computing load caused by differences in node performance, improving resource utilization and system throughput, and optimizing system performance.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: HENAN UNIVERSITY
- Filing Date: 2026-02-05
- Publication Date: 2026-06-30
AI Technical Summary
Flink's default scheduling strategy fails to adequately consider the performance differences of nodes under dynamically changing workloads, resulting in uneven distribution of computing load, overload of some nodes becoming performance bottlenecks, low resource utilization, and impacting the overall system performance.
A comprehensive model based on multi-metric performance evaluation is constructed. Node status information is obtained through Flink Metrics to form a comprehensive performance score. Path utility function and virtual load loading state simulation are used to optimize task allocation. A custom partitioner wrapper is combined to achieve dynamic data flow weight adaptive resource matching.
It significantly improves resource utilization and system throughput, reduces computational latency, optimizes overall system performance, and solves the problem of uneven resource allocation in traditional scheduling strategies.
Figure CN122309129A_ABST
Description
Technical Field
[0001] This invention relates to the field of computer technology, and in particular to a Flink load balancing method and system based on multi-index performance evaluation.
Background Technology
[0002] In recent years, data has exploded, with massive amounts of data generated in fields such as big data, artificial intelligence, and the Internet of Things. From a storage capacity perspective, data volume has rapidly expanded from the early KB, MB, and GB levels to the current EB, ZB, and even YB levels; simultaneously, storage technology has undergone significant transformations from USB drives and floppy disks to hard drives, data centers, and the cloud. According to research and predictions by Seagate and IDC, the global data volume is expected to reach 163 ZB by 2025. This not only reflects the explosive growth in data volume but also places higher demands on data processing speed. Faced with such massive and rapidly generated data, traditional data processing methods are no longer sufficient. Hadoop, Spark, and Flink are increasingly widely used. Apache Flink, as an advanced distributed stream processing framework, is renowned for its excellent in-memory execution efficiency, large-scale computing power, low latency, high throughput, and strong consistency, making it an ideal choice for processing unbounded and bounded data streams. Flink not only efficiently supports batch and stream processing tasks but also possesses excellent scalability and fault tolerance mechanisms, and is widely used in e-commerce real-time recommendation systems, financial transaction settlement, the Internet of Things, and many other fields, establishing its core position in the global big data processing field. However, with the increasing complexity of application scenarios and the widespread adoption of computing environments, Flink's scheduling strategies have gradually revealed their limitations. Under dynamically changing workloads, the default scheduling mechanism does not fully consider the performance differences between nodes, leading to uneven distribution of computing load and some nodes becoming performance bottlenecks due to resource overload. 
At the same time, traditional strategies lack sufficient global resource evaluation, resulting in low resource utilization and severely restricting the overall performance of the platform. Therefore, optimizing task scheduling and resource allocation has become a key issue in improving the performance of streaming computing platforms.
[0003] In Flink, the JobManager acts as the master node, responsible for management and scheduling. The TaskManager is responsible for executing tasks and processing data. Based on the JobManager's schedule, computational tasks are executed and data processing is completed. Job execution is planned and coordinated by the JobManager, transforming it into a series of execution graphs, ultimately forming the physical execution graph. The job execution flow begins with a user-defined StreamGraph, then undergoes a series of transformations into the JobGraph, determining task merging and data transfer paths. Subsequently, within the JobManager, this is further refined into an ExecutionGraph, instantiating parallel tasks and defining data dependencies between tasks, ultimately forming the physical execution graph on the TaskManager. This series of transformations forms the execution model from logical to physical.
[0004] In the Flink computing environment, the data inflow rate is inherently uneven, often fluctuating significantly over time. However, Flink's default scheduling strategy has significant limitations. It does not adequately consider resource utilization, employing simple allocation methods such as round-robin, and cannot effectively handle differences in node performance. When a node is overloaded, it can easily trigger a chain reaction of performance degradation and increased latency; conversely, underloaded nodes leave resources idle, failing to fully utilize their capabilities. This imbalance directly increases data processing latency and reduces system throughput, severely impacting the core real-time requirements of streaming computing. Furthermore, the lack of a dynamic adjustment mechanism prevents flexible task allocation based on real-time node load, further exacerbating the "busy get busier, idle get idler" phenomenon. This severely impacts the reliability of the system's overall operation and its low-latency response capabilities, hindering the performance of the Flink platform.
Summary of the Invention
[0005] To address the issue of uneven computational load distribution in Flink's default scheduling strategy, this invention provides a Flink load balancing method and system based on multi-metric performance evaluation. First, a comprehensive performance evaluation model based on multiple metrics is established. Node status information is obtained through Flink Metrics to generate a comprehensive performance score and obtain an initial set of task allocation schemes. Then, optimization is performed based on path utility functions and virtual load loading state simulation to obtain the final scheme. Finally, the final scheme is used to allocate data flow and optimized paths through the CustomPartitionerWrapper interface provided by the Flink platform, achieving balanced computational load distribution. This invention effectively solves the problem of uneven computational load distribution, significantly improves resource utilization and system throughput, and reduces computational latency.
[0006] To achieve the above objectives, the technical solution of the present invention is as follows:
[0007] The first aspect of this invention proposes a Flink load balancing method based on multi-metric performance evaluation, comprising:
[0008] Step 1: Collect multiple task manager node metrics and build a comprehensive performance evaluation model based on multiple metrics.
[0009] Step 2: Obtain the comprehensive performance score of each task manager node based on the comprehensive performance evaluation model based on multiple indicators, and construct an initial task allocation scheme set based on the comprehensive performance score of each task manager node to facilitate subsequent screening to obtain the optimal scheme;
[0010] Step 3: Obtain the baseline performance score of each scheme in the initial task allocation scheme set according to the preset path utility function, and obtain the optimal recommended scheme and the second-best recommended scheme based on the baseline performance score;
[0011] Step 4: Simulate the virtual load state of the optimal and second-best recommended schemes, and select the optimal scheme based on the simulation results and the baseline performance scores, so that a scheme meeting the requirements is obtained;
[0012] Step 5: Using the optimal solution as the new initial task allocation scheme, repeat steps 3 and 4 to iteratively optimize the optimal solution and obtain the final solution;
[0013] Step 6: Rewrite the partitioning method of Flink's custom partitioner wrapper interface, embed a dynamic data stream weight adaptive resource matching algorithm in the partitioning method to process the final solution, and complete the task manager node allocation.
[0014] Furthermore, the comprehensive performance evaluation model based on multiple indicators is expressed by the following formula:
[0015]
[0016]
[0017] where P(i) is the comprehensive performance score of node i, C(i) is the CPU performance metric, M(i) is the memory performance metric, the remaining metric term is the network speed performance metric, and α, β, and γ are the weighting percentages of CPU, memory, and network speed, respectively.
[0018] Furthermore, the CPU performance metrics are expressed by the following formula:
[0019]
[0020] where C(i) is the CPU performance metric of node i, θ is the preset ideal load point, and the other two terms are the average CPU utilization of node i and the standard deviation of the CPU utilization of node i;
[0021] The memory performance metrics are expressed by the following formula:
[0022]
[0023] where M(i) is the memory performance metric of node i, and the other two terms are the average memory usage of node i and the maximum memory capacity of node i.
[0024] The network speed performance index is expressed by the following formula:
[0025]
[0026] where the result is the network speed performance metric of node i, and the other two terms are the average measured network speed of node i and the theoretical maximum network speed of node i.
[0027] Furthermore, the path utility function is expressed by the following formula:
[0028]
[0029]
[0030] where the result is the baseline performance score; the three aggregated terms are the aggregated evaluation values of the critical execution path corresponding to scheme G in terms of CPU, memory, and network speed; the remaining term is the bottleneck index; and max denotes the maximum-value operation.
[0031] Furthermore, step three specifically includes:
[0032] The baseline performance score of each scheme in the initial task allocation scheme set is obtained based on the preset path utility function;
[0033] All solutions are ranked according to their benchmark performance scores. The solution with the highest benchmark performance score is selected as the optimal recommended solution, and the solution with the second highest benchmark performance score is selected as the suboptimal recommended solution.
[0034] Furthermore, step four includes:
[0035] The optimal recommended solution is simulated under virtual load conditions. Node resource data in the simulation environment is collected and substituted into the path utility function to obtain the expected baseline performance score under the simulation conditions.
[0036] Calculate the deviation rate between the baseline performance score of the optimal recommended solution and the expected baseline performance score under simulated conditions. If the deviation rate is less than a preset threshold, the evaluation is deemed consistent and the optimal recommended solution is implemented directly; if the deviation rate is greater than the preset threshold, the evaluation is deemed inconsistent and a second evaluation is performed. The deviation rate is computed from two quantities: the baseline performance score of the optimal recommended solution and the expected baseline performance score under simulated conditions.
[0037] In the second evaluation, the bottleneck index of the optimal recommended solution and the bottleneck index of the second-best recommended solution are calculated and compared, and the solution with the smaller bottleneck index is selected as the current optimal solution.
[0038] Furthermore, the dynamic data stream weight adaptive resource matching algorithm includes:
[0039] Calculate the overall performance score of each task manager node in the final solution, use the overall performance score as the weight of the task manager node, and obtain the total weight based on the weight;
[0040] Initialize a target set and a total weight. Add each task manager node in the final solution to the target set and accumulate the weight of the task manager node to the total weight. The target set also includes the weight of each task manager node and the task execution invoker.
[0041] After calculating the total weight of all task manager nodes, if the total weight is less than or equal to 0, it means that the weight in the target set is invalid or there is a problem with the data, and an empty list is returned directly. If the target set contains only one task manager node, the task execution caller of that task manager node is returned directly.
[0042] If the target set contains multiple task manager nodes, a weighted random number is generated, ranging from 0 to the total weight. The target set is traversed, and the weighted random number is gradually reduced according to the weight of the task manager node. Each time, the weight of the current task manager node is subtracted. When the weighted random number is less than or equal to 0, it indicates that the current target is selected. The algorithm returns the task execution caller for that target so that the task execution caller can be used to assign a task to the selected target. If no suitable target is found after traversing all task manager nodes, an empty list is returned and an error is reported.
[0043] A second aspect of this invention proposes a Flink load balancing system based on multi-metric performance evaluation, comprising:
[0044] The evaluation model building module is used to collect metrics from multiple task manager nodes and build a comprehensive performance evaluation model based on multiple metrics.
[0045] The initial scheme module is used to obtain the comprehensive performance score of each task manager node based on a multi-index comprehensive performance evaluation model, and to construct an initial task allocation scheme set based on the comprehensive performance score of each task manager node, so as to facilitate subsequent screening to obtain the optimal scheme.
[0046] The filtering module is used to obtain the baseline performance score of each scheme in the initial task allocation scheme set according to the preset path utility function, and to obtain the optimal recommended scheme and the second-best recommended scheme based on the baseline performance score.
[0047] The comparison module is used to simulate the virtual load loading state of the optimal and second-best recommended solutions. Based on the simulation results and the benchmark performance score, the optimal solution is obtained, which facilitates the selection of a solution that meets the requirements.
[0048] The iteration module is used to take the optimal solution as a new initial task allocation scheme and to invoke the filtering module and the comparison module repeatedly, iteratively optimizing the optimal solution to obtain the final solution.
[0049] The execution module is used to rewrite the partitioning method of Flink's custom partitioner wrapper interface. It embeds a dynamic data stream weight adaptive resource matching algorithm in the partitioning method to process the final solution and complete the task manager node allocation.
[0050] A third aspect of the present invention provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method described in the first aspect above.
[0051] A fourth aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the method described in the first aspect above.
[0052] The beneficial effects of this invention are:
[0053] This invention constructs a comprehensive performance evaluation model based on multiple indicators such as memory, CPU, and network speed, which achieves reasonable allocation of memory usage, increases the CPU load ratio of high-performance nodes, significantly improves network transmission efficiency, optimizes the overall system performance, and effectively solves the problem of uneven resource allocation in traditional scheduling strategies.
Attached Figure Description
[0054] Figure 1 is a flowchart of a Flink load balancing method based on multi-index performance evaluation, provided as an embodiment of the present invention.
[0055] Figure 2 is a schematic diagram of the Flink architecture provided in an embodiment of the present invention.
[0056] Figure 3 is a schematic diagram of the experimental topology provided in an embodiment of the present invention.
[0057] Figure 4 is a schematic diagram of the experimental comparison and analysis provided in an embodiment of the present invention.
[0058] Figure 5 is an architecture diagram of a Flink load balancing system based on multi-index performance evaluation, provided as an embodiment of the present invention.
Detailed Implementation
[0059] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of the embodiments of this invention will be clearly described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.
[0060] Example 1
[0061] As shown in Figure 1, a Flink load balancing method based on multi-metric performance evaluation includes:
[0062] S101: Collect multiple task manager node metrics and construct a comprehensive performance evaluation model based on multiple metrics.
[0063] S102: Obtain the comprehensive performance score of each task manager node based on the comprehensive performance evaluation model based on multiple indicators, and construct an initial task allocation scheme set based on the comprehensive performance score of each task manager node.
[0064] S103: Obtain the baseline performance score of each scheme in the initial task allocation scheme set according to the preset path utility function, and obtain the optimal recommended scheme and the second-best recommended scheme based on the baseline performance score.
[0065] S104: Perform virtual load loading state simulation on the optimal and suboptimal recommended schemes, and obtain the optimal scheme based on the virtual load loading state simulation results and the benchmark performance score.
[0066] S105: Using the optimal solution as the new initial task allocation scheme, repeat S103 and S104 to iteratively optimize the optimal solution and obtain the final solution.
[0067] S106: Rewrite the partitioning method of Flink's custom partitioner wrapper interface, embed a dynamic data stream weight adaptive resource matching algorithm in the partitioning method to process the final solution and complete the task manager node allocation.
[0068] This invention constructs a comprehensive performance evaluation model based on multiple indicators to obtain a comprehensive performance score for each task manager node. Based on this comprehensive performance score, an initial task allocation scheme is constructed. The initial task allocation scheme is then optimized using a preset path utility function to obtain an optimal and a second-best recommended scheme. The optimal and second-best recommended schemes are compared to determine the current optimal scheme. This process is then iteratively repeated to obtain the final scheme. This invention improves the CPU load ratio of high-performance nodes, significantly enhances network transmission efficiency, optimizes overall system performance, and effectively solves the problem of uneven resource allocation in traditional scheduling strategies.
[0069] Example 2
[0070] Based on the above embodiments, this invention proposes a specific implementation process for a Flink load balancing method based on multi-index performance evaluation, including:
[0071] S201: Collect multiple task manager node metrics and construct a comprehensive performance evaluation model based on multiple metrics.
[0072] Specifically, a comprehensive performance evaluation model based on multiple indicators is established: real-time quantification of node CPU utilization, memory usage, and network speed performance indicators. The comprehensive performance evaluation model based on multiple indicators is expressed by the following formula:
[0073]
[0074]
[0075] where P(i) is the comprehensive performance score of node i, C(i) is the CPU performance metric, M(i) is the memory performance metric, the remaining metric term is the network speed performance metric, and α, β, and γ are the weighting percentages of CPU, memory, and network speed, respectively.
[0076] The average memory usage of node i is defined as:
[0077]
[0078] where the summed term is the memory usage of node i at time point t, the result is the average memory usage of node i, and n is the number of time points.
[0079] The memory utilization rate is converted into the performance metric M(i) as follows:
[0080]
[0081] where M(i) is the memory performance metric of node i, and the other two terms are the average memory usage of node i and the maximum memory capacity of node i.
[0082] The average CPU utilization of node i is defined as:
[0083]
[0084] where the result is the average CPU utilization of node i, the summed term is the CPU utilization of node i at time point t, and n is the number of time points. The CPU utilization is converted into a performance metric C(i) by the following formula:
[0085]
[0086] where C(i) is the CPU performance metric of node i, θ is the preset ideal load point, and the remaining term is the standard deviation of the CPU utilization of node i.
[0087] Network speed performance metrics are expressed using the following formula:
[0088]
[0089] where the result is the network speed performance metric of node i, and the other two terms are the average measured network speed of node i and the theoretical maximum network speed of node i.
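The per-node scoring in S201 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the formulas of this application: the memory metric is assumed to be the average free share of memory, the network metric the ratio of average measured speed to theoretical maximum speed, the CPU metric a closeness-to-θ term damped by the standard deviation, and P(i) a linear weighted sum. The function names and the default values of θ, α, β, and γ are hypothetical.

```python
from statistics import mean, pstdev


def memory_metric(mem_samples, mem_capacity):
    """Assumed form: average share of memory that is still free."""
    return max(0.0, 1.0 - mean(mem_samples) / mem_capacity)


def network_metric(speed_samples, max_speed):
    """Assumed form: average measured speed over theoretical maximum."""
    return min(1.0, mean(speed_samples) / max_speed)


def cpu_metric(cpu_samples, theta=0.7):
    """Assumed form: closeness of the mean utilization to the ideal
    load point theta, damped by the volatility of the samples."""
    closeness = max(0.0, 1.0 - abs(mean(cpu_samples) - theta))
    return closeness / (1.0 + pstdev(cpu_samples))


def comprehensive_score(c, m, n, alpha=0.4, beta=0.3, gamma=0.3):
    """Assumed form of P(i): weighted sum of the three metrics."""
    return alpha * c + beta * m + gamma * n


# Example: one node sampled at three time points.
c = cpu_metric([0.7, 0.7, 0.7])           # steady load at the ideal point
m = memory_metric([2.0, 4.0, 6.0], 8.0)   # usage in GB, capacity in GB
n = network_metric([50.0, 150.0], 200.0)  # speed in MB/s
score = comprehensive_score(c, m, n)
```

Because every metric is normalized to the range 0 to 1 and the assumed weights sum to 1, the resulting score is also in that range, which makes it directly usable as a node weight later in the method.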
[0090] S202: Obtain the comprehensive performance score of each task manager node based on the comprehensive performance evaluation model based on multiple indicators, and construct an initial task allocation scheme set based on the comprehensive performance score of each task manager node.
[0091] Specifically, based on the results of the multi-index comprehensive performance evaluation model, the CPU utilization, memory utilization, and network speed performance indicators of each TaskManager node are collected to form an initial task allocation scheme set.
[0092] S203: Obtain the baseline performance score of each scheme in the initial task allocation scheme set according to the preset path utility function, and obtain the optimal recommended scheme and the second-best recommended scheme based on the baseline performance score.
[0093] Specifically, the baseline performance score of each scheme in the initial task allocation scheme set is obtained according to a preset path utility function. All schemes are then ranked based on their baseline performance scores, with the scheme with the highest baseline performance score being designated as the optimal recommended scheme, and the scheme with the second highest baseline performance score being designated as the suboptimal recommended scheme.
[0094] The path utility function is expressed by the following formula:
[0095]
[0096]
[0097] where the result is the baseline performance score; the three aggregated terms are the aggregated evaluation values of the critical execution path corresponding to scheme G in terms of CPU, memory, and network speed; the remaining term is the bottleneck index; and max denotes the maximum-value operation. The aggregate evaluation takes the arithmetic average of the metrics along the execution path in each of the three dimensions (CPU, memory, and network speed), yielding a value that reflects the overall resource state of the path.
[0098] The geometric mean term ensures that the optimized scheme is balanced across all performance dimensions and prevents a strong single indicator from masking overall performance problems.
[0099] The bottleneck index measures the impact of bottleneck tasks on a path and is designed to guide path selection toward load balancing. It identifies bottlenecks by evaluating node resource usage (such as CPU and memory) and assesses the bottleneck impact of each node according to task type (such as CPU-intensive or memory-intensive). During calculation, the positive performance metrics of each node are converted into resource stress coefficients, and the maximum is taken over all nodes and all resources in path G. This maximum identifies the most severe resource stress on the path, that is, its weakest link, and serves as the bottleneck index for that path.
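The ranking in S203 can be sketched in Python as follows. The text states that the utility contains a geometric mean term over the path-level aggregates and a bottleneck index obtained with a max operation; how the two are combined is not recoverable here, so the product used in `path_utility` is an assumption, as is the stress coefficient `1 - metric`. All function names and the sample schemes are hypothetical.

```python
RESOURCES = ("cpu", "mem", "net")


def aggregate(path_nodes, key):
    """Arithmetic average of one positive metric over the nodes on a path."""
    return sum(node[key] for node in path_nodes) / len(path_nodes)


def bottleneck_index(path_nodes):
    """Largest resource stress coefficient over all nodes and resources.
    Assumption: stress coefficient = 1 - positive performance metric."""
    return max(1.0 - node[key] for node in path_nodes for key in RESOURCES)


def path_utility(path_nodes):
    """Assumed combination: geometric mean of the three aggregates,
    scaled down by the bottleneck index."""
    c, m, n = (aggregate(path_nodes, key) for key in RESOURCES)
    geometric_mean = (c * m * n) ** (1.0 / 3.0)
    return geometric_mean * (1.0 - bottleneck_index(path_nodes))


def top_two(schemes):
    """Return the names of the optimal and second-best recommended schemes."""
    ranked = sorted(schemes, key=lambda name: path_utility(schemes[name]),
                    reverse=True)
    return ranked[0], ranked[1]


schemes = {
    "A": [{"cpu": 0.8, "mem": 0.8, "net": 0.8}],
    "B": [{"cpu": 0.9, "mem": 0.9, "net": 0.9},
          {"cpu": 0.3, "mem": 0.9, "net": 0.9}],
    "C": [{"cpu": 0.5, "mem": 0.5, "net": 0.5}],
}
best, second = top_two(schemes)
```

Scheme B shows why the bottleneck term matters: its averages look healthy, but one node with a weak CPU metric drives the bottleneck index up and pushes the scheme below the uniformly mediocre scheme C.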
[0100] S204: Perform virtual load loading state simulation on the optimal and suboptimal recommended schemes, and obtain the optimal scheme based on the virtual load loading state simulation results and the benchmark performance score.
[0101] Specifically, the optimal recommended solution is simulated under virtual load conditions. Node resource data in the simulation environment is collected and substituted into the path utility function to obtain the expected baseline performance score under the simulation conditions.
[0102] Calculate the deviation rate between the baseline performance score of the optimal recommended solution and the expected baseline performance score under the simulation state; if the deviation rate is less than the preset threshold, it is determined that the evaluation is consistent, the optimal recommended solution is directly executed, and S205 is skipped and S206 is executed directly.
[0103] If the deviation rate is greater than the preset value, it is determined that the evaluation is inconsistent and a second evaluation is performed.
[0104] The deviation rate is computed from two quantities: the baseline performance score of the optimal recommended solution and the expected baseline performance score under simulated conditions.
[0105] In the second evaluation, the bottleneck index of the optimal recommended solution and the bottleneck index of the second-best recommended solution are calculated and compared, and the solution with the smaller bottleneck index is selected as the current optimal solution.
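The consistency check and second evaluation in S204 can be sketched as below. The deviation rate is assumed here to be the relative difference between the two scores; the threshold value of 0.1, the dictionary layout, and the function names are hypothetical. The text does not say what happens when the deviation rate equals the threshold, so this sketch treats that case as inconsistent.

```python
def deviation_rate(baseline_score, simulated_score):
    """Assumed form: relative difference between the baseline score and
    the expected score observed under the simulated virtual load."""
    if baseline_score <= 0:
        raise ValueError("baseline score must be positive")
    return abs(baseline_score - simulated_score) / baseline_score


def choose_scheme(best, second, simulated_best_score, threshold=0.1):
    """best and second are dicts with 'name', 'score', and 'bottleneck'.

    Consistent evaluation: keep the optimal recommended scheme.
    Inconsistent evaluation: pick the scheme with the smaller bottleneck
    index (ties keep the optimal recommended scheme)."""
    if deviation_rate(best["score"], simulated_best_score) < threshold:
        return best["name"]
    return min((best, second), key=lambda s: s["bottleneck"])["name"]


best = {"name": "A", "score": 0.8, "bottleneck": 0.4}
second = {"name": "B", "score": 0.7, "bottleneck": 0.3}

consistent = choose_scheme(best, second, simulated_best_score=0.78)
inconsistent = choose_scheme(best, second, simulated_best_score=0.60)
```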
[0106] S205: Using the optimal solution as the new initial task allocation scheme, repeat S203 and S204 to iteratively optimize the optimal solution and obtain the final solution.
[0107] Specifically, based on the current optimal solution, special optimization and comprehensive evaluation are performed cyclically, with a maximum number of iterations and a number of consecutive rounds without improvement set. When the maximum number of iterations is reached or the performance converges, the iteration is terminated, the final solution is output as the optimization result, and it is passed to the subsequent dynamic weight adaptive resource matching module.
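The termination rule in S205 (a maximum number of iterations plus a limit on consecutive rounds without improvement) can be sketched as a generic loop. The `improve` and `score` callables stand in for the optimization and evaluation of S203 and S204; their names, the parameter names, and the default limits are hypothetical.

```python
def iterate(initial, improve, score, max_iters=50, patience=5):
    """Repeat improve/score until max_iters is reached or the score has
    not improved for `patience` consecutive rounds."""
    best, best_score = initial, score(initial)
    stale_rounds = 0
    for _ in range(max_iters):
        candidate = improve(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
            stale_rounds = 0
        else:
            stale_rounds += 1
            if stale_rounds >= patience:
                break  # performance has converged
    return best, best_score


# Toy example: the "scheme" is an integer and the best value is 3.
final, final_score = iterate(
    initial=0,
    improve=lambda x: min(x + 1, 3),
    score=lambda x: -(x - 3) ** 2,
)
```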
[0108] S206: Rewrite the partitioning method of Flink's custom partitioner wrapper interface, embed a dynamic data stream weight adaptive resource matching algorithm in the partitioning method to process the final solution and complete the task manager node allocation.
[0109] Specifically, the dynamic data stream weighted adaptive resource matching algorithm includes:
[0110] Calculate the overall performance score of each task manager node in the final solution, use the overall performance score as the weight of the task manager node, and obtain the total weight based on the weight.
[0111] Initialize a target set and a total weight. Add each task manager node in the final solution to the target set and accumulate the weight corresponding to the task manager node into the total weight. The target set also includes the weight and task execution caller corresponding to each task manager node.
[0112] After calculating the total weight of all task manager nodes, if the total weight is less than or equal to 0, it indicates that the weight in the target set is invalid or there is a problem with the data, and an empty list is returned directly. If the target set contains only one task manager node, the task execution caller of that task manager node is returned directly.
[0113] If the target set contains multiple task manager nodes, a weighted random number is generated, with the weighted random number ranging from 0 to the total weight. The target set is traversed, and the weighted random number is gradually reduced according to the weight of the task manager node. Each time, the weight of the current task manager node is subtracted. When the weighted random number is less than or equal to 0, it indicates that the current target is selected. The algorithm returns the task execution caller for that target so that the task execution caller can be used to assign tasks to the selected target.
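The selection steps above can be sketched in Python as follows. This is an illustrative sketch rather than the partitioner code of this application: targets are modeled as `(invoker, weight)` pairs, the `draw` parameter is a hypothetical hook that returns a number in the range 0 to 1 so that the random choice can be made deterministic, and an empty list is returned when no target can be selected.

```python
import random


def select_target(targets, draw=random.random):
    """Weighted random selection over (invoker, weight) pairs.

    Returns the invoker of the selected target, or an empty list when
    the total weight is not positive or no target is selected."""
    total_weight = sum(weight for _, weight in targets)
    if total_weight <= 0:
        return []  # invalid weights or bad data
    if len(targets) == 1:
        return targets[0][0]

    # Weighted random number in the range 0 to total weight.
    remaining = draw() * total_weight
    for invoker, weight in targets:
        remaining -= weight  # subtract the weight of the current node
        if remaining <= 0:
            return invoker   # the current target is selected
    return []  # no suitable target found after traversing all nodes


targets = [("tm-1", 1.0), ("tm-2", 3.0)]
first = select_target(targets, draw=lambda: 0.1)   # 0.4 falls in tm-1's range
second = select_target(targets, draw=lambda: 0.5)  # 2.0 falls in tm-2's range
```

With weights of 1.0 and 3.0, the second node is chosen about three times as often as the first over many draws, which is how a higher performance score translates into a larger share of the data flow.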
[0114] Example 3
[0115] Based on the above embodiments, this invention proposes a specific implementation method for a Flink load balancing method based on multi-index performance evaluation, including:
[0116] Job topology abstraction: The JobManager, as the master node, is responsible for management and scheduling. The TaskManager is responsible for executing tasks and processing data. Based on the JobManager's scheduling, computational tasks are executed and data processing is completed. Job execution is planned and coordinated by the JobManager, which transforms the job into a series of execution graphs, ultimately forming a physical execution graph. The job execution flow begins with a user-defined StreamGraph, which is then transformed into a JobGraph, determining task merging and data transmission paths. This is further refined into an ExecutionGraph within the JobManager, instantiating parallel tasks and defining data dependencies between tasks, ultimately forming a physical execution graph on the TaskManager. This series of transformations forms an execution model from logical to physical. The data flow graph, a directed acyclic graph (DAG), is the execution graph of a Flink job; it details the operators and execution sequence within the job, ensuring efficient and orderly execution of the entire job process. The Flink architecture of this invention is shown in Figure 2.
[0117] Performance data collection and evaluation: Use Flink Metrics to periodically collect runtime metrics for each TaskManager node: CPU utilization, memory usage, and network speed.
[0118] Based on the collected average CPU utilization, average memory utilization, and average network speed, standardized performance metrics for each node are calculated, namely CPU performance metrics, memory performance metrics, and network speed performance metrics. Weights α, β, and γ are then assigned according to task characteristics and cluster status to calculate the overall performance score P(i).
[0119] Multi-stage path optimization:
[0120] Input and Initialization: Path optimization uses the node-level score P(i) of the multi-index comprehensive performance evaluation model as input. The system uses the comprehensive performance score P(i) of each node as the weight basis for task allocation. Then, it traverses each parallel subtask in the job topology H, determining the execution node of the task by generating random numbers and matching node weight ranges. Nodes with better performance have higher weights and a greater probability of being assigned tasks. After traversal, an initial allocation scheme is formed. This scheme serves as the starting point for optimization, and its performance benchmark is calculated using the path utility function. To verify the algorithm's effectiveness, multiple rounds of parameter optimization tests were conducted, and the final parameter configuration was calibrated as a CPU utilization threshold, a memory utilization threshold, and a network speed threshold; the CPU and memory thresholds correspond to reserving approximately 20-25% of resources. All thresholds are designed as configurable parameters, which can be adjusted according to the hardware performance and job type of the specific cluster, thereby enhancing the adaptability of the strategy in different scenarios.
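The initialization step (drawing a random number for each parallel subtask and matching it against node weight ranges) can be sketched as below. The function name, the subtask and node identifiers, and the use of cumulative weight bounds with binary search are illustrative assumptions; only the idea of weight-range matching on P(i) comes from the text.

```python
import bisect
import itertools
import random


def initial_allocation(subtasks, node_scores, rng):
    """Assign each subtask to a node with probability proportional to
    the node's comprehensive performance score P(i)."""
    nodes = list(node_scores)
    # Upper bound of each node's weight range, e.g. [0.9, 1.0].
    bounds = list(itertools.accumulate(node_scores[n] for n in nodes))
    if not bounds or bounds[-1] <= 0:
        raise ValueError("node scores must have a positive total")
    total = bounds[-1]

    plan = {}
    for task in subtasks:
        r = rng.random() * total
        index = bisect.bisect_right(bounds, r)  # range that contains r
        plan[task] = nodes[min(index, len(nodes) - 1)]
    return plan


rng = random.Random(42)  # seeded only to make the example repeatable
plan = initial_allocation(
    ["map-0", "map-1", "map-2"],
    {"tm-1": 0.9, "tm-2": 0.1},
    rng,
)
```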
[0121] Detailed optimization strategies:
[0122] CPU optimization N1: When a node's CPU utilization exceeds the set CPU utilization threshold, computing resources are dynamically reallocated, and the match between each node's computing capability and its load is reassessed to achieve a balanced distribution of CPU resources.
[0123] Memory optimization N2: When a node's memory usage exceeds the set memory utilization threshold, task allocation weights and data flow routing strategies are adjusted to shift computing load from nodes under high memory pressure to nodes with sufficient memory resources, avoiding system performance degradation caused by memory bottlenecks.
[0124] Network speed optimization N3: When a node's network speed falls below the set network speed threshold, the data streams allocated to that node are reduced and data is directed to high-speed nodes, which optimizes network transmission efficiency and reduces processing latency caused by network bottlenecks.
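The three trigger conditions N1 to N3 can be expressed as a small rule check. The sketch below is illustrative only: the calibrated threshold values are not given in this text, so `CPU_MAX`, `MEM_MAX`, and `NET_MIN` are hypothetical placeholders (the CPU and memory values are merely chosen to be consistent with reserving roughly 20-25% of resources), and the `NodeMetrics` type is introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class NodeMetrics:
    cpu_util: float   # fraction of CPU in use, 0..1
    mem_util: float   # fraction of memory in use, 0..1
    net_speed: float  # measured network speed, MB/s

# Hypothetical, configurable thresholds (not the calibrated values).
CPU_MAX = 0.80
MEM_MAX = 0.75
NET_MIN = 50.0


def triggered_rules(m: NodeMetrics) -> list:
    """Return the optimization rules (N1/N2/N3) that a node currently triggers."""
    rules = []
    if m.cpu_util > CPU_MAX:
        rules.append("N1")  # CPU overloaded: rebalance computing load
    if m.mem_util > MEM_MAX:
        rules.append("N2")  # memory pressure: shift load to other nodes
    if m.net_speed < NET_MIN:
        rules.append("N3")  # slow network: route less data to this node
    return rules


hot_node = triggered_rules(NodeMetrics(cpu_util=0.92, mem_util=0.60, net_speed=30.0))
```

Keeping the thresholds as module-level parameters mirrors the statement that all thresholds are configurable per cluster and job type.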
[0125] Path utility function evaluation: For each candidate solution, the CPU, memory, and network speed evaluation values are aggregated at the path level and then substituted into the path utility function. The above steps are repeated for iterative optimization.
[0126] The final solution obtained after iteration is passed to the next stage, the dynamic data flow weight adaptive resource matching algorithm, which performs the final scheduling and execution of the data flow.
[0127] Override the `partition` method of Flink's `CustomPartitionerWrapper` interface. In the `partition` method, call the weighted adaptive resource matching algorithm:
[0128] Input: A list of target nodes on the optimization path P in the final solution. The weight of each target (i.e., the task manager node) can be set as P(i) of that node, and the proportional weight is calculated based on P(i). Calculate the total weight weightSum. Generate a random number sum = random.nextLong(weightSum).
[0129] The core logic of this weighted adaptive resource matching algorithm is as follows:
[0130] First, an empty target set `destinations` is initialized to store the weight of each target and its corresponding list of `invokers` (task execution callers). Simultaneously, `weightSum` is initialized to 0 to accumulate the sum of the weights of all targets in the target set. Next, a thread-safe random number generator `random` is created to generate weighted random numbers, thus achieving random target selection. Then, the input target list `input_destinations` is iterated over, and each target (including its weight and `invokers`) is added to the `destinations` list, with its weight accumulated in `weightSum`.
[0131] After calculating the total weights of all targets, if weightSum is less than or equal to 0, it indicates that the weights in the target set are invalid or there is a problem with the data. In this case, an empty list [] is returned directly. If the target set contains only one target, the invokers of that target are returned directly to avoid unnecessary calculations.
[0132] If the target set contains multiple targets, a weighted random number `sum` is generated in the range from 0 to `weightSum`. The target set is then iterated through, and the weight of the current target is subtracted from `sum` at each step. When `sum` becomes less than or equal to 0, the current target is selected and the algorithm returns the invokers for that target. Preferably, if no suitable target is found after iterating through all targets, an empty list `[]` is returned, indicating that no valid target was selected, and an error is reported. By using the targets' weight information, tasks are allocated so that the load is distributed according to each node's capabilities. An example is given below:
[0133] Assume the current set contains 3 available TaskManager nodes (Nodes A, B, and C) whose weights, calculated by the system from the multi-index evaluation model, are 5 (high performance), 3 (medium), and 2 (average), respectively. The algorithm first initializes the target set and accumulates a total weight of 10. If an invalid total weight (≤0) or a set containing only a single node is detected at this point, the corresponding result is returned directly. If the check passes, the algorithm generates a random integer (assumed to be 6) within the range [0, 10). The algorithm then iterates through the node list, subtracting the weight of each node from the random number in turn: in the first round, the weight of Node A (5) is subtracted, leaving 1 (>0), so Node A is not selected; in the second round, the weight of Node B (3) is subtracted, leaving -2 (≤0), which satisfies the termination condition, and Node B is selected as the execution node. This mechanism ties the probability of each node obtaining a task to its performance score (target shares of A: 50%, B: 30%, C: 20%), avoiding the bottleneck effect caused by traditional round-robin strategies and improving the overall throughput and resource utilization efficiency of the cluster.
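The selection logic and the worked example above can be reproduced with a short sketch. It is written in Python for brevity and is not the Flink partitioner code itself; the function name `select_invokers` and the `inclusive` flag are hypothetical, and the random draw is passed in explicitly so that the example is deterministic.

```python
from collections import Counter


def select_invokers(destinations, draw, inclusive=True):
    """Pick the invokers of one target by weighted subtraction.

    destinations: list of (weight, invokers) pairs.
    draw: integer in [0, weight_sum), e.g. from random.randrange(weight_sum).
    inclusive: True stops when the remainder is <= 0 (as described above);
               False stops only when it is < 0.
    """
    weight_sum = sum(weight for weight, _ in destinations)
    if weight_sum <= 0:
        return []                      # invalid weights
    if len(destinations) == 1:
        return destinations[0][1]      # single target: nothing to choose
    remaining = draw
    for weight, invokers in destinations:
        remaining -= weight
        if (remaining <= 0) if inclusive else (remaining < 0):
            return invokers
    return []                          # no target matched


nodes = [(5, ["A"]), (3, ["B"]), (2, ["C"])]
picked = select_invokers(nodes, draw=6)  # 6-5=1 (>0), then 1-3=-2 (<=0)

# Enumerate every possible integer draw to see the resulting shares.
shares_inclusive = Counter(select_invokers(nodes, d)[0] for d in range(10))
shares_strict = Counter(select_invokers(nodes, d, inclusive=False)[0] for d in range(10))
```

With a draw of 6 the sketch selects Node B, matching the example. Enumerating all ten integer draws shows a subtlety of the termination test in this sketch: stopping at a remainder of ≤ 0 yields shares of 6:3:1 for A, B, and C, whereas stopping only at a remainder of < 0 yields exactly 5:3:2. An implementation that needs the shares to match the weights exactly with integer draws may therefore prefer the strict comparison.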
[0134] Example 4
[0135] Based on the above embodiments, this invention proposes a verification method for a Flink load balancing method based on multi-index performance evaluation, specifically including:
[0136] To verify the effectiveness of the proposed method, it was evaluated in a real-world cluster consisting of six physical servers, including one JobManager node and five TaskManager nodes. Each node had identical software (CentOS-7.9, Hadoop-3.1.3, Flink-1.13.0, JDK 1.8, Kafka 3.6.3, Zookeeper 3.6.3) and hardware configurations, as shown in Table 1. The nodes were located in three different racks, each containing two nodes, as shown in Figure 3.
[0137]
[0138] A representative benchmark program, WordCount, which counts the frequency of each word in the input data, was selected for testing. After multiple rounds of parameter optimization and testing, the key parameters were set to α=0.3, β=0.4, and γ=0.3. The experimental results, shown in Figure 4, indicate a significant improvement in resource utilization:
[0139] Memory usage, Figure 4(a): Under the multi-stage path optimization (UGA) strategy, memory usage rises gradually and more stably, avoiding the drastic fluctuations of the round-robin strategy and the low utilization of the Shuffle strategy, and thus achieving a more reasonable allocation.
[0140] CPU utilization, Figure 4(b): The UGA strategy significantly increased the share of CPU load allocated to high-performance nodes and greatly increased CPU utilization, indicating that resource allocation is more reasonable.
[0141] Network speed, Figure 4(c): Under sustained load, network speed under the UGA strategy maintained a steady upward trend and resumed growth after a brief dip. In contrast, the round-robin and Shuffle strategies exhibited low and volatile network speeds. UGA improved network transmission efficiency by approximately 16% by directing data to high-speed nodes.
[0142] Example 5
[0143] Based on the above embodiments, as shown in Figure 5, this invention proposes a Flink load balancing system based on multi-index performance evaluation, comprising:
[0144] The evaluation model building module is used to collect metrics from multiple task manager nodes and build a comprehensive performance evaluation model based on multiple metrics.
[0145] The initial scheme module is used to obtain the comprehensive performance score of each task manager node based on a multi-index comprehensive performance evaluation model, and to construct an initial task allocation scheme set based on the comprehensive performance score of each task manager node.
[0146] The filtering module is used to obtain the baseline performance score of each scheme in the initial task allocation scheme set according to the preset path utility function, and to obtain the optimal recommended scheme and the second-best recommended scheme based on the baseline performance score.
[0147] The comparison module is used to simulate the virtual load loading state of the optimal and second-best recommended schemes, and to obtain the optimal solution based on the virtual load loading state simulation results and the baseline performance score.
[0148] The iteration module is used to take the optimal solution as the new initial task allocation scheme and to repeatedly invoke the filtering module and the comparison module to iteratively optimize the optimal solution and obtain the final solution.
[0149] The execution module is used to override the partition method of Flink's custom partitioner wrapper interface and to embed a dynamic data stream weight adaptive resource matching algorithm in the partition method, which processes the final solution and completes the task manager node allocation.
[0150] It should be noted that the Flink load balancing system based on multi-index performance evaluation provided in this embodiment of the invention is used to implement the above-mentioned Flink load balancing method based on multi-index performance evaluation. For its specific functions, refer to the above method embodiments; they are not repeated here.
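The decision made by the comparison module can be sketched as follows. This is a hedged illustration under stated assumptions: the deviation rate is assumed to be the relative difference between the baseline score and the simulated score, the bottleneck indices are taken as precomputed inputs, and the function name `choose_solution`, the dictionary keys, and the default threshold of 0.1 are all hypothetical.

```python
def choose_solution(best, second, simulated_score, threshold=0.1):
    """Choose between the optimal and second-best recommended schemes.

    best / second: dicts with a 'score' (baseline performance score) and a
    'bottleneck' (bottleneck index, lower is better).
    simulated_score: expected baseline score of `best` under virtual load.
    """
    if best["score"] <= 0:
        raise ValueError("baseline score must be positive")
    # Assumed form of the deviation rate: relative difference of the scores.
    deviation = abs(best["score"] - simulated_score) / best["score"]
    if deviation < threshold:
        return best  # evaluation is consistent: use the optimal scheme
    # Inconsistent (a deviation equal to the threshold is treated the same
    # way here): fall back to the scheme with the smaller bottleneck index.
    return best if best["bottleneck"] <= second["bottleneck"] else second


best = {"name": "G1", "score": 0.80, "bottleneck": 0.90}
second = {"name": "G2", "score": 0.76, "bottleneck": 0.70}

consistent = choose_solution(best, second, simulated_score=0.78)
fallback = choose_solution(best, second, simulated_score=0.60)
```

In the first call the simulated score stays close to the baseline, so `G1` is kept; in the second call the deviation is large and `G2` is chosen because its bottleneck index is smaller. The iteration module would then feed the chosen scheme back in as the new starting point.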
[0151] Example 6
[0152] Based on the above embodiments, the present invention proposes an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the method described in Embodiment 1 above.
[0153] Example 7
[0154] Based on the above embodiments, the present invention proposes a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method described in Embodiment 1 above.
[0155] In summary, this invention achieves a reasonable allocation of memory usage, increases the CPU load ratio of high-performance nodes, significantly improves network transmission efficiency, optimizes the overall system performance, and effectively solves the problem of uneven resource allocation in traditional scheduling strategies by constructing a comprehensive performance evaluation model based on memory, CPU, and network speed.
[0156] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A Flink load balancing method based on multi-index performance evaluation, characterized in that it comprises: Step 1: collecting metrics from multiple task manager nodes and building a comprehensive performance evaluation model based on multiple metrics; Step 2: obtaining the comprehensive performance score of each task manager node based on the multi-index comprehensive performance evaluation model, and constructing an initial task allocation scheme set based on the comprehensive performance score of each task manager node; Step 3: obtaining the baseline performance score of each scheme in the initial task allocation scheme set according to the preset path utility function, and obtaining the optimal recommended scheme and the second-best recommended scheme based on the baseline performance score; Step 4: simulating the virtual load loading state for the optimal and second-best recommended schemes, and obtaining the optimal solution based on the virtual load loading state simulation results and the baseline performance score; Step 5: using the optimal solution as the new initial task allocation scheme, repeating Steps 3 and 4 to iteratively optimize the optimal solution and obtain the final solution; Step 6: overriding the partition method of Flink's custom partitioner wrapper interface, and embedding a dynamic data stream weight adaptive resource matching algorithm in the partition method to process the final solution and complete the task manager node allocation.
2. The Flink load balancing method based on multi-index performance evaluation according to claim 1, characterized in that the multi-index comprehensive performance evaluation model is expressed by a formula (not reproduced in this text) that gives the comprehensive performance score P(i) of node i in terms of the CPU performance metric, the memory performance metric, and the network speed performance metric of node i, where α, β, and γ are the weights of CPU, memory, and network speed, respectively.
3. The Flink load balancing method based on multi-index performance evaluation according to claim 2, characterized in that: the CPU performance metric is expressed by a formula (not reproduced in this text) defined in terms of the average CPU utilization of node i, the preset ideal load point, and the standard deviation of CPU utilization of node i; the memory performance metric is expressed by a formula (not reproduced in this text) defined in terms of the average memory usage of node i and the maximum memory capacity of node i; and the network speed performance metric is expressed by a formula (not reproduced in this text) defined in terms of the average measured network speed of node i and the theoretical maximum network speed.
4. The Flink load balancing method based on multi-index performance evaluation according to claim 2, characterized in that the path utility function is expressed by a formula (not reproduced in this text) that gives the baseline performance score in terms of the aggregated evaluation values of the critical execution path corresponding to solution G for CPU, memory, and network speed, together with a bottleneck indicator, where max denotes the operation of taking the maximum value.
5. The Flink load balancing method based on multi-index performance evaluation according to claim 1, characterized in that Step 3 specifically includes: obtaining the baseline performance score of each scheme in the initial task allocation scheme set based on the preset path utility function; and ranking all schemes according to their baseline performance scores, selecting the scheme with the highest baseline performance score as the optimal recommended scheme and the scheme with the second-highest baseline performance score as the second-best recommended scheme.
6. The Flink load balancing method based on multi-index performance evaluation according to claim 4 or 5, characterized in that Step 4 includes: simulating the optimal recommended scheme under virtual load conditions, collecting the node resource data in the simulation environment, and substituting it into the path utility function to obtain the expected baseline performance score under the simulated conditions; calculating the deviation rate between the baseline performance score of the optimal recommended scheme and the expected baseline performance score under the simulated conditions; if the deviation rate is less than a preset threshold, determining that the evaluation is consistent and directly implementing the optimal recommended scheme; if the deviation rate is greater than the preset threshold, determining that the evaluation is inconsistent and performing a second evaluation; wherein the deviation rate is expressed by a formula (not reproduced in this text) in terms of the baseline performance score of the optimal recommended scheme and the expected baseline performance score under the simulated conditions; and, in the second evaluation, calculating and comparing the bottleneck index of the optimal recommended scheme and the bottleneck index of the second-best recommended scheme, and selecting the scheme with the smaller bottleneck index as the current optimal solution.
7. The Flink load balancing method based on multi-index performance evaluation according to claim 1, characterized in that the dynamic data stream weight adaptive resource matching algorithm includes: calculating the comprehensive performance score of each task manager node in the final solution, using that score as the weight of the task manager node, and obtaining the total weight from the weights; initializing a target set and a total weight, adding each task manager node in the final solution to the target set, and accumulating the weight of each task manager node into the total weight, wherein the target set includes the weight of each task manager node and its task execution invokers; after the total weight of all task manager nodes has been calculated, if the total weight is less than or equal to 0, which indicates that the weights in the target set are invalid or the data is faulty, returning an empty list directly; if the target set contains only one task manager node, returning the task execution invokers of that task manager node directly; if the target set contains multiple task manager nodes, generating a weighted random number in the range from 0 to the total weight, traversing the target set, and subtracting the weight of the current task manager node from the weighted random number at each step; when the weighted random number becomes less than or equal to 0, the current target is selected, and the algorithm returns the task execution invokers of that target so that they can be used to assign a task to the selected target; and if no suitable target is found after traversing all task manager nodes, returning an empty list and reporting an error.
8. A Flink load balancing system based on multi-index performance evaluation, characterized in that it comprises: an evaluation model building module, used to collect metrics from multiple task manager nodes and build a comprehensive performance evaluation model based on multiple metrics; an initial scheme module, used to obtain the comprehensive performance score of each task manager node based on the multi-index comprehensive performance evaluation model, and to construct an initial task allocation scheme set based on the comprehensive performance score of each task manager node; a filtering module, used to obtain the baseline performance score of each scheme in the initial task allocation scheme set according to the preset path utility function, and to obtain the optimal recommended scheme and the second-best recommended scheme based on the baseline performance score; a comparison module, used to simulate the virtual load loading state of the optimal and second-best recommended schemes, and to obtain the optimal solution based on the virtual load loading state simulation results and the baseline performance score; an iteration module, used to take the optimal solution as the new initial task allocation scheme and to repeatedly invoke the filtering module and the comparison module to iteratively optimize the optimal solution and obtain the final solution; and an execution module, used to override the partition method of Flink's custom partitioner wrapper interface and to embed a dynamic data stream weight adaptive resource matching algorithm in the partition method to process the final solution and complete the task manager node allocation.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the method as described in any one of claims 1 to 7.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the method as described in any one of claims 1 to 7.