A cloud workflow scheduling method based on structure awareness and adaptive evolution
By adopting a cloud workflow scheduling method based on structure awareness and adaptive evolution, the problems of intense resource competition and low search efficiency in existing technologies are solved, achieving efficient and flexible cloud resource scheduling, adapting to complex topologies and heterogeneous environments, and improving system stability and decision availability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHONGQING UNIV OF POSTS & TELECOMM
- Filing Date
- 2026-03-23
- Publication Date
- 2026-06-19
AI Technical Summary
Existing cloud workflow scheduling algorithms suffer from problems such as intense resource contention, low search efficiency, simplistic scheduling strategies, and blind resource matching when dealing with complex scientific applications, resulting in low resource utilization, high computational overhead, and high decision-making delays.
By employing a structure-aware and adaptive evolution approach, a multi-objective cloud workflow scheduling model is constructed. This model combines task priority evaluation, adaptive operator pooling, and elite retention mechanisms to optimize task mapping and resource allocation, thereby achieving efficient and flexible scheduling decisions.
It significantly improves search efficiency and resource utilization, reduces computational overhead, enhances system stability and decision availability, adapts to different topologies and resource environments, and supports real-time scheduling scenarios.
Figure CN122240318A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of resource allocation technology, specifically to a cloud workflow scheduling method based on structure awareness and adaptive evolution.
Background Technology
[0002] With the rapid development of big data and artificial intelligence technologies, complex scientific applications such as bioinformatics sequence alignment, astronomical survey data processing, and weather forecasting models are typically expressed as workflows. These workflows consist of thousands of tasks with logical dependencies and are usually modeled as directed acyclic graphs (DAGs). In a cloud computing environment, mapping these tasks efficiently, cost-effectively, and reliably onto a large-scale pool of heterogeneous virtual machines (VMs) is a core challenge for cloud service providers (CSPs).
[0003] While significant progress has been made in existing research on cloud workflow scheduling, many bottlenecks still remain to be addressed:
[0004] 1. Traditional heuristic algorithms, such as the Heterogeneous Earliest Finish Time (HEFT) algorithm and the Critical-Path-on-a-Processor (CPOP) algorithm, have significant shortcomings in the depth of topology mining. Their task priority calculation often focuses only on the depth analysis of the critical path, while ignoring the instantaneous throughput pressure on system resources caused by the parallel load of task fan-out. This leads to intense local resource competition when processing wide, flat or complex mesh DAGs, reducing the parallel utilization of resources.
[0005] 2. Population-based evolutionary algorithms exhibit a serious contradiction between search efficiency and convergence performance when facing large-scale tasks. The exponentially growing search space results in extremely low quality of solutions generated by random initialization. Not only does it require a long iterative process to enter the feasible region, but it is also prone to premature convergence or search stagnation in the later stages of evolution.
[0006] 3. Existing evolutionary scheduling frameworks generally rely on a single, rigid scheduling strategy. Their fixed proportions of genetic operators cannot respond to the different needs of global exploration and local exploitation at different stages of evolution, which makes it difficult to obtain a uniformly distributed Pareto front in multi-objective trade-offs.
[0007] 4. Heterogeneous resources in the cloud environment differ in their strengths across CPU, memory, input/output (I/O), and so on, yet existing algorithms often reduce resources to a single computational overhead and ignore the affinity between tasks and resources, resulting in blind resource matching and poor actual deployment performance.
Summary of the Invention
[0008] In view of this, this application discloses a cloud workflow scheduling method based on structure awareness and adaptive evolution to solve the problems in the prior art, including:
[0009] S1. Obtain the workflow description file and construct a multi-objective cloud workflow scheduling model and evaluation function;
[0010] S2. Perform task priority evaluation and guided initialization based on structure awareness and volume constraints on the multi-objective cloud workflow scheduling model to obtain the initial population;
[0011] S3. Perform task-refined mapping evolution on the initial population. The evolution process is based on an adaptive operator pool to obtain a task-refined population. The adaptive operator pool is constructed based on an evaluation function.
[0012] S4. Based on the feedback mechanism of elite retention and dominance relationship, global optimization evolution is performed on the task-refined population to obtain the Pareto optimal solution set.
[0013] The beneficial effects of this application include:
[0014] This application designs a structure-aware search guidance mechanism that significantly reduces the blindness in the scheduling and optimization process. By introducing structure awareness, key task nodes are identified in the workflow DAG and given priority weights, enabling the algorithm to focus on exploring high-potential areas in the early stages of the search, reducing invalid evaluations and redundant trials, thereby improving search efficiency and search quality.
[0015] This application can significantly accelerate the convergence speed of the algorithm in a large-scale environment; by combining structure-aware guidance, adaptive operators and local / global cooperative strategies, it can maintain efficient optimization even for workflows with thousands of nodes; experimental results show that the method designed in this application can enter the steady state in advance when processing thousands of task nodes, thus achieving significant savings in overall computational overhead and running time, and is suitable for large-scale, real-time or near-real-time scheduling scenarios.
[0016] This application proposes a mechanism based on volume constraints and concurrency prediction for high resource contention periods. This mechanism can effectively alleviate the problem of intense resource contention. The volume constraints can quantify the risk of concurrent resource occupation in the optimization objective. Combined with task execution window and resource availability prediction, it can avoid contention behavior in the high-concurrency stage in advance, smooth the system resource utilization curve, reduce failures or retries caused by peak load and resource conflicts, thereby improving the stability and overall throughput of the system.
[0017] This application generates Pareto solutions through multi-objective non-dominated sorting, providing flexible and interpretable decision-making references for operations and cloud administrators. The generated Pareto front simultaneously covers multiple dimensions such as total scheduling cost, completion time, resource consumption, and reliability, enabling managers to quickly select appropriate solutions from the solution set based on real-time business needs (such as prioritizing task completion or cost saving), supporting dynamic switching by strategy and reducing decision latency, thereby enhancing the availability and operability of scheduling solutions.
[0018] The method designed in this application has good scalability and robustness. Through adaptive weight design and a topology-aware parameter adjustment strategy, the algorithm can adapt to workflow topologies of different scales and forms (including chain, tree, and mesh structures) and to heterogeneous resource environments. It copes effectively with load fluctuations and topology changes, maintains stable optimization performance, and still yields robust, high-quality scheduling results in heterogeneous cloud environments or hybrid cloud deployments.
Attached Figure Description
[0019] Figure 1 is a schematic diagram of the cloud workflow scheduling method based on structure awareness and adaptive evolution in the embodiments of this application.
Detailed Implementation
[0020] To make the objectives, technical solutions, features, and advantages of this application clearer and to facilitate a better understanding of the technical solutions of this application by those skilled in the art, the following detailed description of this application is provided in conjunction with the accompanying drawings and embodiments.
[0021] This embodiment provides a cloud workflow scheduling method based on structure awareness and adaptive evolution which, as shown in Figure 1, includes:
[0022] S1. Obtain the workflow description file and construct a multi-objective cloud workflow scheduling model and evaluation function; including:
[0023] S11. Parse the workflow description file and construct the logical topology of the DAG.
[0024] By parsing the workflow description file (for example, JSON or XML), extract the task node set $T = \{t_1, t_2, \dots, t_n\}$ and the directed edge set $E$. For each task $t_i$ in the workflow, assign a standard instruction load $L_i$, and construct adjacency lists that record all direct predecessors $pred(t_i)$ and direct successors $succ(t_i)$ of $t_i$. This allows the logical topology of the DAG to be completely reconstructed.
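Step S11 can be illustrated with a minimal Python sketch. The JSON layout used here (a `tasks` list with `id` and `load` fields, and an `edges` list of source/destination pairs) is a hypothetical schema chosen only for illustration; the application does not specify the format of the description file.

```python
import json

def build_dag(description: str):
    """Parse a workflow description and return (loads, preds, succs).

    Assumed (hypothetical) JSON schema:
      {"tasks": [{"id": "t1", "load": 100.0}, ...],
       "edges": [["t1", "t2"], ...]}
    """
    doc = json.loads(description)
    # Standard instruction load of every task.
    loads = {t["id"]: float(t["load"]) for t in doc["tasks"]}
    # Adjacency lists: direct predecessors and direct successors.
    preds = {tid: [] for tid in loads}
    succs = {tid: [] for tid in loads}
    for src, dst in doc["edges"]:
        if src not in loads or dst not in loads:
            raise ValueError(f"edge ({src}, {dst}) references an unknown task")
        succs[src].append(dst)
        preds[dst].append(src)
    return loads, preds, succs

example = json.dumps({
    "tasks": [{"id": "t1", "load": 100}, {"id": "t2", "load": 50},
              {"id": "t3", "load": 70}, {"id": "t4", "load": 30}],
    "edges": [["t1", "t2"], ["t1", "t3"], ["t2", "t4"], ["t3", "t4"]],
})
loads, preds, succs = build_dag(example)
print(succs["t1"], preds["t4"])
```

Keeping both predecessor and successor lists lets later steps walk the graph in either direction without rescanning the edge set.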
[0025] S12. Perform heterogeneous resource profiling modeling for computing or storage resources.
[0026] The heterogeneous resource profiling model, namely the multi-objective cloud workflow scheduling model, includes a virtual machine resource pool model and a task-resource affinity prediction matrix. The virtual machine resource pool model represents each VM node $vm_j$ as a multi-dimensional vector that includes at least computing power (MIPS), memory size, network I/O bandwidth, and rental cost per unit time. The task-resource affinity prediction matrix describes the execution gain of different types of tasks on a specific hardware architecture.
[0027] S13. Construct a multidimensional evaluation function.
[0028] The quality of a scheduling scheme $S$ is measured by the following three functions: the global completion time (Makespan) function, the resource utilization (Utilization) function, and the task-resource affinity (Affinity) function.
[0029] The Makespan function calculates the maximum completion time of all exit node tasks. It represents the workflow's response speed and reflects the system's processing efficiency: the smaller the value, the faster the entire task set completes and, usually, the higher the system throughput. The formula is:
[0030] $$\mathrm{Makespan}(S) = \max_{t_i \in T_{exit}} FT(t_i)$$
[0031] where $T_{exit}$ denotes the set of exit node tasks and $FT(t_i)$ denotes the completion time of the $i$-th task under scheduling scheme $S$.
[0032] The Utilization function calculates the ratio of the effective working time of all VMs during workflow execution to the total uptime, in order to assess whether resources sit idle. Its value lies in $[0, 1]$. Low utilization indicates idle resources and wasted cost, so optimizing this function can reduce system operating costs. The formula is:
[0033] $$\mathrm{Utilization}(S) = \frac{\sum_{j=1}^{M} \sum_{t_i \in T_j} ET(t_i, vm_j)}{M \cdot T_{avail}}$$
[0034] where $T_j$ denotes the set of tasks allocated to computing resource node $vm_j$, $ET(t_i, vm_j)$ denotes the actual processing time of task $t_i$ on computing resource node $vm_j$, and $M \cdot T_{avail}$ denotes the total time-window capacity that the system can provide within the scheduling period, with $M$ the number of nodes and $T_{avail}$ the available time of each node.
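The first two evaluation functions described above can be sketched in a few lines of Python. The data layout (a dictionary of finish times, a task-to-VM assignment, and a dictionary of processing times keyed by task and VM) and all function and parameter names are assumptions made for this illustration, not structures defined by the application.

```python
def makespan(finish_time, exit_tasks):
    """Maximum completion time over all exit tasks."""
    return max(finish_time[t] for t in exit_tasks)

def utilization(assignment, proc_time, num_vms, avail_time):
    """Busy time of all VMs divided by the total time-window capacity.

    assignment: task -> VM index
    proc_time:  (task, VM index) -> actual processing time on that VM
    """
    busy = sum(proc_time[(task, vm)] for task, vm in assignment.items())
    return busy / (num_vms * avail_time)

finish_time = {"t3": 12.0, "t4": 15.0}
assignment = {"t1": 0, "t2": 1, "t3": 0}
proc_time = {("t1", 0): 4.0, ("t2", 1): 6.0, ("t3", 0): 5.0}

print(makespan(finish_time, ["t3", "t4"]))        # 15.0
print(utilization(assignment, proc_time, 2, 10))  # 0.75
```

In the second call the VMs are busy for 4 + 6 + 5 = 15 time units out of a capacity of 2 × 10 = 20, which gives 0.75.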
[0035] The Affinity function evaluates the degree of matching between tasks and resources based on the vector cosine similarity formula, with values ranging from [value range missing]. An affinity close to 1 means that the task has been assigned to the hardware node best suited to executing it (e.g., a node with a specific accelerator card), so that the hardware performance can be fully utilized. The formula is:
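[0036] $$\mathrm{Affinity}(S) = \frac{1}{N} \sum_{i=1}^{N} \frac{\mathbf{R}_i \cdot \mathbf{C}_i}{\lVert \mathbf{R}_i \rVert \, \lVert \mathbf{C}_i \rVert}$$
[0037] where $N$ denotes the number of requirement items involved in the calculation, $\mathbf{R}_i$ denotes the demand vector of the $i$-th task, and $\mathbf{C}_i$ denotes the capability vector of the virtual machine paired with that demand.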
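As a rough illustration of a cosine-similarity-based affinity score, the sketch below averages the cosine similarity over all (demand vector, capability vector) pairs. Averaging over the pairs is an assumption made here, as are the function names and the convention of returning 0 for a zero-length vector.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def affinity(pairs):
    """Mean cosine similarity over (task demand, VM capability) pairs.

    Assumption: the per-pair similarities are combined by a plain average.
    """
    return sum(cosine(r, c) for r, c in pairs) / len(pairs)

# Demand and capability vectors over two illustrative dimensions.
pairs = [
    ([1.0, 0.0], [2.0, 0.0]),  # task placed on a perfectly matching VM
    ([1.0, 0.0], [0.0, 3.0]),  # task placed on a mismatched VM
]
print(affinity(pairs))  # 0.5
```

Cosine similarity ignores vector magnitude, so this score captures whether a VM's capability profile has the same shape as the task's demand profile, not whether the VM is large enough.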
[0038] Based on a multidimensional evaluation function, the scheduling scheme is evaluated to determine whether the task is assigned to the resource most suitable for its attributes.
[0039] S2. Perform task priority evaluation and guided initialization based on structure awareness and volume constraints on the multi-objective cloud workflow scheduling model to obtain the initial population; including:
[0040] S21. Estimate the expected capabilities of the task.
[0041] Compute the expected execution time $\bar{w}_i$ of each task $t_i$ at the average computing power of the cluster, and compute the total processing capacity of the system per unit time, $C_{total}$. These provide the benchmark for the subsequent volume calculation.
[0042] S22. Calculate the path depth $D(t_i)$ through reverse topological traversal.
[0043] Specifically, based on the known baseline execution time of each task, the path depth $D(t_i)$ of each task node is calculated by performing a reverse topological traversal of the directed acyclic graph of task dependencies. The greater a task's path depth, the more decisive its impact on the overall workflow duration. The path depth $D(t_i)$ is calculated as follows:
[0044] $$D(t_i) = \begin{cases} \bar{w}_i, & succ(t_i) = \varnothing \\ \bar{w}_i + \max\limits_{t_j \in succ(t_i)} D(t_j), & \text{otherwise} \end{cases}$$
[0045] In the formula, $succ(t_i)$ denotes the set of direct successor tasks of task $t_i$; when task $t_i$ is an exit task of the workflow (i.e., it has no successor node), its path depth $D(t_i)$ equals its own expected execution time $\bar{w}_i$; $D(t_j)$ denotes the length of the longest path from successor task $t_j$ to the end of the entire workflow; and $t_j$ denotes any one of the direct successor tasks of task $t_i$.
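The reverse topological traversal can be sketched as follows. This is a minimal illustration that starts from the exit tasks and works backwards, so that every successor's depth is known before a task is processed; the function name and the dictionary-based graph representation are assumptions of the sketch.

```python
from collections import deque

def path_depth(exec_time, succs):
    """Longest remaining path (own time included) from each task to the workflow end."""
    preds = {t: [] for t in exec_time}
    pending = {}  # number of successors whose depth is still unknown
    for t in exec_time:
        pending[t] = len(succs.get(t, []))
        for s in succs.get(t, []):
            preds[s].append(t)

    depth = {}
    queue = deque(t for t, n in pending.items() if n == 0)  # exit tasks first
    while queue:
        t = queue.popleft()
        # Exit tasks have no successors, so the max defaults to 0.
        depth[t] = exec_time[t] + max((depth[s] for s in succs.get(t, [])), default=0.0)
        for p in preds[t]:
            pending[p] -= 1
            if pending[p] == 0:
                queue.append(p)
    if len(depth) != len(exec_time):
        raise ValueError("dependency graph contains a cycle")
    return depth

exec_time = {"t1": 2.0, "t2": 3.0, "t3": 1.0, "t4": 4.0}
succs = {"t1": ["t2", "t3"], "t2": ["t4"], "t3": ["t4"], "t4": []}
print(path_depth(exec_time, succs))
```

For this diamond-shaped example the exit task `t4` has depth 4, `t2` has 3 + 4 = 7, `t3` has 1 + 4 = 5, and the entry task `t1` has 2 + max(7, 5) = 9.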
[0046] S23. Calculate the parallel load volume $V(t_i)$ based on the expected capacity estimate. The formula is:
[0047] $$V(t_i) = \frac{\sum_{t_j \in succ(t_i)} \bar{w}_j}{C_{total}}$$
[0048] where $succ(t_i)$ denotes the set of direct successors of node $t_i$, $\bar{w}_j$ denotes the expected execution time of successor node $t_j$ at the average computing level of the cluster, and $C_{total}$ denotes the total processing capacity of the system per unit time. This term captures the following: once task $t_i$ finishes executing, it immediately releases a batch of successor tasks. In traditional solutions, severe queuing occurs when the total computational demand of these successor tasks far exceeds the total capacity of the system; volume constraints built on the load volume can predict such congestion in advance.
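A minimal sketch of the parallel load volume follows, assuming it is the summed expected execution time of a task's direct successors divided by the system's total processing capacity per unit time. The function and parameter names are illustrative only.

```python
def load_volume(exec_time, succs, total_capacity):
    """Fan-out pressure of each task: successor demand relative to system capacity."""
    if total_capacity <= 0:
        raise ValueError("total_capacity must be positive")
    return {
        t: sum(exec_time[s] for s in succs.get(t, [])) / total_capacity
        for t in exec_time
    }

exec_time = {"t1": 2.0, "t2": 3.0, "t3": 5.0, "t4": 4.0}
succs = {"t1": ["t2", "t3"], "t2": ["t4"], "t3": ["t4"], "t4": []}
print(load_volume(exec_time, succs, total_capacity=4.0))
```

Here `t1` releases successors with a combined demand of 3 + 5 = 8, giving a volume of 2.0, while `t2` and `t3` each release only `t4` (volume 1.0) and the exit task `t4` releases nothing (volume 0.0). A scheduler can use the larger value to flag `t1` as the point where contention is most likely.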
[0049] S24. Construct workflow constraints based on the path depth and the parallel load volume, and synthesize the task priority. The initial population is then generated by combining the priority ordering with random perturbation.
[0050] Specifically, the priority of each task is synthesized from its path depth and its parallel load volume. To generate an initial population of a given size, tasks are first processed in priority order, following the idea of the heterogeneous earliest finish time algorithm, and each task is assigned to the computing resource node that can achieve the earliest completion time, which yields a high-quality guided seed scheme. A preset perturbation probability factor (preferred value range: [value range missing]) is then applied to the guided seed scheme. The random perturbation logic performs a random redistribution operation on each task node in the seed scheme: when the generated random number falls within the perturbation threshold range, the task is redirected from its original optimal resource node to a random candidate node in the resource pool. In addition, the local scheduling order is permuted within sets of tasks that share the same path depth. Together these operations generate variant individuals with significant diversity and complete the population initialization.
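The guided initialization can be sketched as below. This is a simplified illustration under several assumptions: priorities are supplied as a precomputed dictionary (the priority formula itself is not reproduced), an individual is encoded only as a task-to-VM mapping, communication costs are ignored, and the permutation of tasks with equal path depth is omitted. All names are hypothetical.

```python
import random

def guided_seed(priority, preds, exec_time, num_vms):
    """Greedy earliest-finish-time mapping, visiting ready tasks by descending priority."""
    vm_free = [0.0] * num_vms  # time at which each VM becomes idle
    finish, mapping = {}, {}
    remaining = set(priority)
    while remaining:
        ready = [t for t in sorted(remaining)
                 if all(p in finish for p in preds.get(t, []))]
        if not ready:
            raise ValueError("dependency graph contains a cycle")
        task = max(ready, key=lambda t: priority[t])
        # A task cannot start before all of its predecessors have finished.
        est = max((finish[p] for p in preds.get(task, [])), default=0.0)
        vm = min(range(num_vms),
                 key=lambda j: max(vm_free[j], est) + exec_time[task][j])
        finish[task] = max(vm_free[vm], est) + exec_time[task][vm]
        vm_free[vm] = finish[task]
        mapping[task] = vm
        remaining.remove(task)
    return mapping

def perturb(seed, num_vms, p_perturb, rng):
    """Redirect each task to a random VM with probability p_perturb."""
    return {t: (rng.randrange(num_vms) if rng.random() < p_perturb else vm)
            for t, vm in seed.items()}

def init_population(seed, num_vms, size, p_perturb, rng):
    """Keep the seed itself and fill the rest of the population with perturbed variants."""
    return [dict(seed)] + [perturb(seed, num_vms, p_perturb, rng)
                           for _ in range(size - 1)]

priority = {"t1": 9.0, "t2": 7.0, "t3": 5.0, "t4": 4.0}
preds = {"t2": ["t1"], "t3": ["t1"], "t4": ["t2", "t3"]}
exec_time = {"t1": [2.0, 4.0], "t2": [3.0, 6.0], "t3": [1.0, 2.0], "t4": [4.0, 8.0]}

seed = guided_seed(priority, preds, exec_time, num_vms=2)
population = init_population(seed, 2, size=5, p_perturb=0.2, rng=random.Random(0))
print(seed)
```

In this example the faster VM 0 receives `t1`, `t2`, and `t4`, while `t3` goes to VM 1 because VM 0 is still busy with `t2`. Keeping the unperturbed seed in the population means the starting point is never worse than the guided scheme.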
[0051] The core of this step is to break away from the limitations of traditional algorithms that rely solely on critical path weights. By comprehensively considering the path depth and release volume of tasks in the topology graph, bottleneck tasks that significantly impact system performance are identified. Guiding seeds generated using structure-aware information provide the evolutionary population with a high-quality starting search position, significantly shortening convergence time and thus solving the task scheduling order problem in complex DAGs.
[0052] S3. Perform task-refined mapping evolution on the initial population. The evolution process is based on an adaptive operator pool to obtain a task-refined population. The adaptive operator pool is constructed based on an evaluation function and includes:
[0053] The resource affinity optimization operator scans the current scheduling scheme, selects task mapping pairs with lower Affinity function scores in the evaluation function, and reassigns these task mapping pairs to virtual machines with higher similarity vectors without violating time constraints.
[0054] The dynamic critical path operator selects the critical path that determines the Makespan function value by analyzing the Gantt chart generated by the current scheduling scheme; for tasks on the path, it migrates them to VMs with earlier idle slices or VMs with stronger computing power to achieve precise compression of bottleneck paths.
[0055] The load balancing operator monitors the load distribution of each VM, selects overloaded VMs based on the Utilization function, and migrates non-critical tasks to VMs in idle or low-load states. This operator aims to fill resource gaps and improve overall utilization. Specifically, in the scheduling optimization phase, the Utilization function is used to quantify the load saturation of each virtual machine in the cluster, and nodes whose resource utilization exceeds a preset high-water-mark threshold (e.g., a resource utilization above 0.8) are identified as overloaded virtual machines. Then, from the pending task sequences of the overloaded nodes, the less critical tasks, i.e., those with lower path depth and therefore a weaker influence on the global completion time, are selected and reassigned to lightly loaded or idle virtual machines whose resource utilization falls within a preset low-water-mark range.
[0056] The random perturbation operator is used to prevent the algorithm from getting trapped in local optima in a complex Pareto space. This operator is introduced to perform random task position swaps and resource reassignments, ensuring the diversity of the population.
[0057] Furthermore, the operators are dynamically invoked based on the performance of the evaluation function, and the operator library covers a variety of strategies, ranging from local fine-tuning (such as critical path compression) to global exploration (such as random perturbation). By coordinating the invocation of these operators, the algorithm can perform fine-grained mapping adjustments in the multi-dimensional objective space, improving the distribution quality of the Pareto front.
[0058] The task refinement mapping evolution stops iterating when it reaches the preset number of iterations.
[0059] S4. Based on the elite retention and dominance feedback mechanism, a global optimization evolution is performed on the task-refined population to obtain the Pareto optimal solution set. The dominance feedback mechanism includes: in each generation of evolution, recording the offspring individual $x'$ generated by each operator and its corresponding parent individual $x$; if $x'$ dominates $x$, the operator is considered to have succeeded in this search. Here, dominance means that $x'$ is better than $x$ on at least one evaluation function and no worse on the other evaluation functions.
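The dominance test that decides whether an operator succeeded can be sketched as a small Python function. The sketch assumes every objective is expressed as a value to be minimized, so objectives that should be maximized (such as utilization and affinity) are negated before comparison; that convention and the names used are choices of this illustration.

```python
def dominates(child, parent):
    """True if child is no worse on every objective and strictly better on at least one.

    Both arguments are tuples of objective values, all to be minimized.
    """
    no_worse = all(c <= p for c, p in zip(child, parent))
    strictly_better = any(c < p for c, p in zip(child, parent))
    return no_worse and strictly_better

# Objective tuples: (makespan, -utilization, -affinity)
parent = (12.0, -0.80, -0.85)
child = (10.0, -0.80, -0.90)

print(dominates(child, parent))   # True: shorter makespan, higher affinity
print(dominates(parent, child))   # False
print(dominates(parent, parent))  # False: identical solutions do not dominate
```

An operator would be credited with a success only when the first call returns `True` for the offspring it produced.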
[0060] The iterative process of global optimization evolution includes:
[0061] S41. Assign a dynamic weight $\omega_k$ to each operator $o_k$;
[0062] S42. Determine the dominance relationship between the offspring individual $x'$ generated by the operator and its parent individual $x$; if $x'$ strictly dominates $x$, an additive reward is applied to the selection probability of the operator. The formula is: [formula missing];
[0063] S43. At the end of each generation of evolution, all operator weights are normalized and a minimum selection probability boundary is set. The probability of any operator below this boundary is increased to avoid operator starvation, while the probabilities of operators above it remain unchanged. The normalization process keeps the overall probability distribution valid.
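Steps S41 to S43 can be sketched as follows. The reward size is a hypothetical parameter, since the reward formula is not reproduced in the text, and the operator labels are shorthand for the four operators described earlier. The sketch follows the order given above: normalize, raise probabilities that fall below the minimum boundary, then normalize again.

```python
def apply_reward(weights, op, amount=1.0):
    """Additive reward for an operator whose offspring dominated its parent.

    The default amount is an assumption; the source does not state the reward size.
    """
    weights[op] = weights[op] + amount
    return weights

def selection_probabilities(weights, p_min):
    """Normalize weights, lift entries below p_min, then renormalize."""
    total = sum(weights.values())
    if total <= 0:
        return {op: 1.0 / len(weights) for op in weights}
    probs = {op: w / total for op, w in weights.items()}
    floored = {op: max(p, p_min) for op, p in probs.items()}
    z = sum(floored.values())
    # After renormalization a lifted entry ends up slightly below p_min,
    # but it can never be zero, so no operator is starved.
    return {op: p / z for op, p in floored.items()}

weights = {"affinity": 5.0, "critical_path": 3.0, "load_balance": 2.0, "perturb": 0.0}
apply_reward(weights, "critical_path")
probs = selection_probabilities(weights, p_min=0.05)
print(probs)
```

If the boundary must hold exactly after normalization, one alternative is to reserve `p_min` for every operator and distribute only the remaining probability mass in proportion to the weights.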
[0064] Furthermore, the global optimization evolution adopts dynamic weight updates: the selection probability $P_k$ of operator $o_k$ is dynamically adjusted based on the operator's historical cumulative success rate $s_k$. The formula is:
[0065]
[0066] where $\eta$ denotes the learning rate. In this way, the algorithm tends to use more exploratory operators in the early stages of the search and automatically switches to more exploitative operators in the later stages.
[0067] Furthermore, during the iterative process, the global optimization evolution also employs a fast non-dominated sorting algorithm to stratify all parent generations and their generated offspring into layers, and within the same layer, maintains a uniform distribution of solutions using crowding distance.
[0068] Specifically, after the fast non-dominated sorting is complete, a crowding distance is introduced to evaluate individuals within the same non-dominated layer and so further ensure that the solution set is uniformly distributed in the objective space. For each objective function, all individuals in the layer are first sorted in ascending order of their objective function values, and the crowding distance of the individuals at the boundary is set to infinity to ensure that extreme solutions are preferentially retained. For each intermediate individual, the difference between the objective function values of its two adjacent individuals is then normalized and accumulated over all objectives to obtain its overall crowding distance. This distance reflects how sparse the objective space is around an individual; a larger value indicates a sparser distribution of solutions around it. In the selection phase, when the number of individuals in a non-dominated layer exceeds the remaining retention capacity, the individuals are sorted in descending order of crowding distance, and those with larger crowding distances are preferentially selected to enter the next generation of the population. This ensures the convergence of the solution set while effectively maintaining the diversity and uniformity of the Pareto front distribution. Ultimately, the top-ranked individuals, up to the population size, are retained and advance to the next iteration.
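The crowding-distance procedure described above matches the one commonly used in NSGA-II-style algorithms and can be sketched as below. The function names and the representation of a layer as a list of objective tuples are assumptions of this illustration.

```python
import math

def crowding_distance(front):
    """Crowding distance of each individual in one non-dominated layer."""
    n = len(front)
    if n == 0:
        return []
    dist = [0.0] * n
    for k in range(len(front[0])):
        order = sorted(range(n), key=lambda i: front[i][k])
        low, high = front[order[0]][k], front[order[-1]][k]
        # Boundary individuals are always kept.
        dist[order[0]] = dist[order[-1]] = math.inf
        if high == low:
            continue  # every individual has the same value on this objective
        for pos in range(1, n - 1):
            gap = front[order[pos + 1]][k] - front[order[pos - 1]][k]
            dist[order[pos]] += gap / (high - low)
    return dist

def truncate(front, capacity):
    """Keep the `capacity` individuals with the largest crowding distance."""
    dist = crowding_distance(front)
    keep = sorted(range(len(front)), key=lambda i: dist[i], reverse=True)[:capacity]
    return [front[i] for i in keep]

# One layer with two minimized objectives, e.g. (makespan, cost).
front = [(1.0, 5.0), (2.0, 3.0), (2.5, 2.5), (4.0, 1.0)]
print(crowding_distance(front))
print(truncate(front, capacity=3))
```

With a capacity of three, both extreme solutions survive because their distance is infinite, and the third slot goes to the intermediate solution with the larger accumulated gap to its neighbours.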
[0069] When the evolution reaches the preset number of iterations, or when the distribution of the Pareto front shows no significant change over a number of consecutive iterations, the Pareto optimal solution set is output to the scheduling center. The scheduling center automatically selects a solution from the set according to the current operating strategy of the cloud platform, such as energy-saving mode or high-performance mode, and issues the corresponding execution instructions.
[0070] The overall time complexity of the method designed in this application is: [formula missing], expressed in $O(\cdot)$ notation, which denotes the asymptotic upper bound of the algorithm's running time as the input size increases, in terms of the number of iterations, the population size, the number of tasks, and the number of dependency edges. During execution, the linear-order structure-guidance logic avoids the high computational overhead of an exact solver.
[0071] The method designed in this application is based on a feedback learning mechanism, which dynamically adjusts the selection probability of each operator by monitoring the number of superior offspring it generates during the iteration process. This mechanism enables the algorithm to switch automatically to the most effective optimization strategy for the current search landscape, and finally to output the set of optimal scheduling schemes.
[0072] Finally, it should be noted that the above description covers only some embodiments of this application. Those skilled in the art can conceive of various changes, modifications, substitutions, and variations of these embodiments without departing from the principles and spirit of this application. The scope of protection of this application is defined by the appended claims and their equivalents, and all such changes fall within that scope.
Claims
1. A cloud workflow scheduling method based on structure awareness and adaptive evolution, characterized in that it includes: S1. Obtain the workflow description file and construct a multi-objective cloud workflow scheduling model and evaluation function; S2. Perform task priority evaluation and guided initialization based on structure awareness and volume constraints on the multi-objective cloud workflow scheduling model to obtain the initial population; S3. Perform task-refined mapping evolution on the initial population, the evolution process being based on an adaptive operator pool, to obtain a task-refined population, wherein the adaptive operator pool is constructed based on the evaluation function; S4. Based on the elite retention and dominance feedback mechanism, perform global optimization evolution on the task-refined population to obtain the Pareto optimal solution set; the dominance feedback mechanism includes: in each generation of evolution, recording the offspring individual $x'$ generated by each operator and its corresponding parent individual $x$; if $x'$ dominates $x$, the operator is considered to have succeeded in this search; dominance means that $x'$ is better than $x$ on at least one evaluation function and no worse on the other evaluation functions.
2. The cloud workflow scheduling method based on structure awareness and adaptive evolution according to claim 1, characterized in that the construction of the multi-objective cloud workflow scheduling model includes: S11. Parse the workflow description file and construct the logical topology of the DAG, including: extracting the task node set $T$ and the directed edge set $E$ by parsing the workflow description file, assigning a standard instruction load $L_i$ to each task $t_i$ in the workflow, and constructing adjacency lists that record all direct predecessors and direct successors of $t_i$; S12. Perform heterogeneous resource profiling modeling for computing or storage resources; the heterogeneous resource profiling modeling includes a virtual machine resource pool model and a task-resource affinity prediction matrix.
3. The cloud workflow scheduling method based on structure awareness and adaptive evolution according to claim 1, characterized in that, The evaluation function includes: The global completion time function is used to calculate the maximum completion time of all exit node tasks; The resource utilization function is used to calculate the ratio of the effective working time of all VMs during workflow execution to the total uptime. The task-resource affinity function evaluates the matching degree between tasks and resources based on the vector cosine similarity formula.
4. The cloud workflow scheduling method based on structure awareness and adaptive evolution according to claim 1, characterized in that, The process of performing task priority evaluation and guided initialization based on structure awareness and volume constraints on the multi-objective cloud workflow scheduling model includes: S21. Estimate the expected capabilities of the task; S22. Calculate the path depth by reverse topological traversal; S23. Calculate the parallel load volume based on the expected capacity estimate; S24. Construct workflow constraints based on path depth and parallel load volume, and synthesize priorities. The initial population is generated by combining random perturbations.
5. The cloud workflow scheduling method based on structure awareness and adaptive evolution according to claim 4, characterized in that the generation of the initial population includes: processing tasks in the order given by the task priority ranking and sequentially assigning each task to the computing resource node that can achieve the earliest completion time, thus constructing a guided seed scheme; applying a preset perturbation probability factor to the guided seed scheme, wherein the random perturbation logic performs a random redistribution operation on each task node in the seed scheme, and when the generated random number falls within the perturbation threshold range, the task is redirected from its original optimal resource node to a random candidate node in the resource pool; and, in combination with permuting the local scheduling order within sets of tasks that share the same path depth, generating variant individuals and completing the population initialization.
6. The cloud workflow scheduling method based on structure awareness and adaptive evolution according to claim 5, characterized in that, The adaptive operator pool includes: The resource affinity optimization operator selects task mapping pairs with lower Affinity function scores in the evaluation function and reassigns them to virtual machines with higher similarity vectors without violating temporal constraints. The dynamic critical path operator selects the critical path that determines the value of the Makespan function; for tasks on the path, it migrates the tasks to VMs with earlier idle slices or VMs with stronger computing power. The load balancing operator selects the overloaded VMs based on the Utilization function and migrates non-critical tasks to VMs that are idle or under low load. The random perturbation operator performs random task position swaps and resource reassignments.
7. The cloud workflow scheduling method based on structure awareness and adaptive evolution according to claim 1, characterized in that the iterative process of global optimization evolution includes: S41. Assign a dynamic weight $\omega_k$ to each operator $o_k$; S42. Determine the dominance relationship between the offspring individual $x'$ generated by the operator and its parent individual $x$; if $x'$ strictly dominates $x$, apply an additive reward to the selection probability of the operator, the formula being: [formula missing]; S43. At the end of each generation of evolution, normalize the weights of all operators and set the minimum selection probability boundary.
8. The cloud workflow scheduling method based on structure awareness and adaptive evolution according to claim 7, characterized in that the global optimization evolution adopts dynamic weight updates, and the selection probability $P_k$ of operator $o_k$ is dynamically adjusted based on the operator's historical cumulative success rate $s_k$, the formula being: [formula missing]; where $\eta$ denotes the learning rate.
9. The cloud workflow scheduling method based on structure awareness and adaptive evolution according to claim 7, characterized in that, During the iterative process of global optimization evolution, a fast non-dominated sorting algorithm is used to stratify all parent generations and their generated offspring into layers. Within the same layer, the crowding distance is used to maintain a uniform distribution of solutions.
10. The cloud workflow scheduling method based on structure awareness and adaptive evolution according to claim 1, characterized in that the cloud workflow scheduling method has an overall time complexity of the form $O(\cdot)$ [formula missing], where $O(\cdot)$ denotes the asymptotic upper bound of the algorithm's running time as the input size increases, and the expression is given in terms of the number of iterations, the population size, the number of tasks, and the number of dependency edges.