A dynamic scheduling method for uncertain data-intensive workflow in cloud environment

CN115827176B · Active · Publication Date: 2026-06-23 · NANJING UNIV OF POSTS & TELECOMM

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
NANJING UNIV OF POSTS & TELECOMM
Filing Date
2021-09-17
Publication Date
2026-06-23

Smart Images

  • Figure CN115827176B_ABST

Abstract

The application provides a dynamic scheduling method for uncertain data-intensive workflow in a cloud environment, solves the scheduling problem caused by the lack of transmission data size information, and helps to simultaneously reduce the cross-data center data transmission amount of the workflow and the execution cost of the workflow. First, the workflow structure is abstracted to obtain a DAG graph; then, static task pre-allocation is performed under the condition of partial data size information missing, an executable task forest graph of each data center is obtained, each task node in the forest graph is sorted according to the saved transmission data size, the task node with the largest saved data transmission size is allocated in the corresponding data center, and the predecessor node and the successor node of the node in the data center are also allocated in the data center, until all tasks are pre-allocated; then, dynamic adjustment of task allocation is performed based on the static task pre-allocation result and the actual transmission data size generated after the execution of each task in the workflow, and finally, an allocation scheme is obtained.

Description

Technical Field

[0001] This invention belongs to the field of cloud computing and specifically concerns a dynamic scheduling method for uncertain, data-intensive workflows in a cloud environment.

Background Technology

[0002] Cloud computing provides users with virtualized, scalable, and dynamic resources on a pay-per-use basis. It relieves the burden on local servers and lets people access data from anywhere in the world, so it can also be viewed as a data sharing platform. Because storage can be expanded on demand, data need not be discarded for lack of local space, and the technology is widely used in scientific research to extend data storage capacity. A workflow consists of multiple tasks with an execution order and dependencies between them. Workflows are commonly used to describe the execution of scientific applications deployed on cloud providers and are widely used in fields such as bioinformatics, astronomy, physics, and geology. As more scientific experiments in these fields rely on workflows, workflows have become increasingly data-intensive. Data-intensive workflows transfer large amounts of data between tasks and perform complex calculations, so they need large storage capacity and high computing power to run. Workflow scheduling means assigning each task of the workflow to a suitable service provider so that the user-defined quality of service is met; it is the core problem of workflow management. Data-intensive workflows often acquire, process, and transmit large amounts of data, and placing tasks that exchange data on different service providers incurs transmission costs. An inappropriate scheduling strategy can lead to excessive data transmission and severely reduce the efficiency of scientific workflows, whereas a well-chosen schedule reduces transmission volume and thereby workflow execution cost. Currently, most workflow scheduling methods do not consider the impact of data transmission volume on execution cost. This patent therefore proposes a dynamic scheduling method for uncertain, data-intensive workflows.

Summary of the Invention

[0003] The purpose of this invention is to reduce the amount of data transmitted during the execution of data-intensive workflows when some data transmission size is missing. This invention proposes a method for dynamically allocating tasks to the server based on the actual execution of the workflow, thereby overcoming the defects in the prior art.

[0004] To achieve the above objectives, this invention provides a dynamic scheduling method for uncertain data-intensive workflows in a cloud environment. First, the workflow structure is abstracted based on data dependencies. Then, a pre-allocation scheme for each task is obtained based on the known partial data transmission size and the forest graph of executable tasks in each data center. Next, the allocation scheme is dynamically adjusted based on the actual data transmission size during workflow execution. Finally, an allocation scheme that reduces the data transmission size during the execution of uncertain data-intensive workflows in the cloud environment is obtained.

[0005] 1. The technical solution further defined in this invention is as follows:

[0006] Preferably, the above technical solution includes the following steps:

[0007] Step 1: Abstract the workflow structure. The workflow is represented as a directed acyclic graph (DAG), treated as a directed weighted graph in which each edge connects a predecessor task to a successor task that depends on its data, and the edge weight is the amount of data transferred between the two tasks. This is a typical data flow structure and provides a theoretical basis for subsequent workflow scheduling.

[0008] Step 2: Define the resources used in the workflow. Each data center can process one or more tasks in the workflow, and each task can be scheduled by one or more data centers. This resource model definition provides a theoretical basis for further scheduling of workflow tasks.

[0009] Step 3: Obtain the forest graph of executable tasks for each data center. By combining the resource model definition results with the DAG graph, the forest graph of executable tasks for each data center can be obtained, providing a theoretical basis for further scheduling of workflow tasks.

[0010] Step 4: Based on the known partial data transfer sizes and the forest graph of executable tasks in each data center, static task pre-allocation is performed in cases where partial data size information is missing. According to the forest graph of executable tasks in each data center, the task node that saves the most data transfer size in each data center is calculated. This task node is preferentially allocated to the corresponding data center, and task nodes in the same forest graph that have a direct data transfer relationship with it are also allocated to the same data center. The pre-allocated task nodes are then removed from the forest graph, resulting in an updated forest graph. The next round of pre-allocation is then performed until all tasks are pre-allocated, providing the criteria for the final dynamic task allocation algorithm.

[0011] Step 5: Dynamic task allocation during execution. The workflow entry task $t_{entry}$ is allocated according to the allocation result of step 4. After the entry task has executed, the actual data transfer sizes from $t_{entry}$ to its successor tasks are obtained, and allocation proceeds from the successors of $t_{entry}$. Based on the static pre-allocation result and the actual data transfer sizes produced as each task in the workflow completes, the data transfer size that the task currently being allocated, $t_i$, would save in each candidate data center is calculated, and the task is assigned to the data center that offers the greatest saving.

[0012] 2. Preferably, in the above technical solution, the structure of the abstract workflow in step 1 is as follows:

[0013] Step 11: Abstract the task nodes. $T = \{t_i \mid i \in \{1, 2, \ldots, n\}\}$, where $T$ represents the workflow task set and $t_i$ represents the i-th task in the workflow.

[0014] Step 12: Abstract the data transmission sizes and transmission relationships. $E = \{e_{i,j} \mid e_{i,j} \ge 0,\ i, j \in \{1, 2, \ldots, n\}\}$, where edge $e_{i,j}$ indicates the expected data transfer size between tasks before workflow execution and also indicates that task $t_i$ must complete before task $t_j$ starts. If the size of the data transmitted from $t_i$ to $t_j$ is missing, then $e_{i,j} = 0$. $R = \{r_{i,j} \mid r_{i,j} \ge 0,\ i, j \in \{1, 2, \ldots, n\}\}$, where edge $r_{i,j}$ represents the actual data transfer size between tasks after the workflow begins execution. The workflow entry task is denoted $t_{entry}$ and the exit task $t_{exit}$. The predecessor tasks of task $t_i$ are denoted $t_i^{pre}$ and its successor tasks $t_i^{succ}$. The final workflow is abstracted as $W = \langle T, E, R \rangle$.
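As a minimal sketch of the abstraction W = <T, E, R> defined above, the following Python fragment stores E and R as dictionaries keyed by task pairs. The class name `Workflow` and its method names are illustrative assumptions and do not appear in the patent text; a missing expected size is stored as 0, following the convention stated for e_{i,j}.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Edge = Tuple[int, int]


@dataclass
class Workflow:
    """Hypothetical container for W = <T, E, R>."""
    tasks: List[int]                                          # T
    expected: Dict[Edge, float] = field(default_factory=dict)  # E, known before execution
    actual: Dict[Edge, float] = field(default_factory=dict)    # R, observed during execution

    def add_edge(self, i: int, j: int, size: Optional[float] = None) -> None:
        # The edge records that t_i must finish before t_j starts.
        # A missing size is stored as 0, as in the definition of e_ij.
        self.expected[(i, j)] = 0.0 if size is None else float(size)

    def predecessors(self, i: int) -> List[int]:
        return sorted(a for (a, b) in self.expected if b == i)

    def successors(self, i: int) -> List[int]:
        return sorted(b for (a, b) in self.expected if a == i)

    def entry_tasks(self) -> List[int]:
        return [t for t in self.tasks if not self.predecessors(t)]

    def exit_tasks(self) -> List[int]:
        return [t for t in self.tasks if not self.successors(t)]


# Toy three-task chain (not the workflow of the patent's figures).
w = Workflow(tasks=[1, 2, 3])
w.add_edge(1, 2, 40)      # expected size known
w.add_edge(2, 3)          # expected size missing before execution
w.actual[(2, 3)] = 17.0   # actual size observed once task 2 has run
```

Keeping E and R in separate dictionaries mirrors the distinction the text draws between sizes expected before execution and sizes measured afterwards.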

[0015] 3. Preferably, in the above technical solution, step 2 defines the resources used by the workflow, specifically as follows:

[0016] $DC$ is the set of data centers capable of processing tasks in $W$, represented as $DC = \{dc_i \mid i \in \{1, 2, \ldots, m\}\}$, where $dc_i$ is the i-th data center participating in the scheduling. $CD$ is the set used to list the candidate data centers for a task in $W$, denoted $CD = \{cd_i \mid i \in \{1, 2, \ldots, n\}\}$, where $cd_i$ denotes a candidate data center for task $t$. $CT = \{ct_i \mid i \in \{1, 2, \ldots, n\}\}$, where $ct_i$ denotes a task that the data center $dc$ can process.
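The resource sets above, together with the per-data-center forest graph of step 3, could be represented as follows. This is a sketch under two assumptions that the text does not confirm: the task lists are held in one dictionary mapping each data center to the tasks it can process, and the forest graph of a data center is taken to be the part of the DAG whose edges have both endpoints in that data center's task list. The function names `candidate_centers` and `forest_graph` are hypothetical.

```python
def candidate_centers(ct):
    """Invert the task lists (data center -> tasks) into candidate lists (task -> data centers)."""
    cd = {}
    for dc, tasks in ct.items():
        for t in tasks:
            cd.setdefault(t, []).append(dc)
    return cd


def forest_graph(ct, edges):
    """For each data center, keep only the edges whose two tasks it can both process."""
    return {
        dc: {(i, j): size for (i, j), size in edges.items()
             if i in tasks and j in tasks}
        for dc, tasks in ct.items()
    }


# Toy input (not the data of the patent's figures).
ct = {"dc1": {1, 2}, "dc2": {2, 3}}
edges = {(1, 2): 40.0, (2, 3): 0.0}   # 0.0 marks an unknown expected size

cd = candidate_centers(ct)
forest = forest_graph(ct, edges)
```

In this toy input task 2 has two candidate data centers, so each of its edges survives in exactly one data center's subgraph.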

[0017] 4. Preferably, in the above technical solution, step 4, in the case of missing partial data size information, involves static task pre-allocation, specifically as follows:

[0018] Step 41, set i = 1, max = -1, p = -1, q = -1, atn = 0.

[0019] Step 42: Using the task list $CT$ of $dc_i$ obtained in step 2 and the forest graph obtained in step 3, calculate for each task $ct_j$ in $CT$ the data transfer size $sum_j$ that can be saved by assigning it to $dc_i$. The value of $sum_j$ is the sum of the data transfer sizes $e_{ct_j,ct_k}$ between $ct_j$ and all tasks $ct_k$ in the task list of $dc_i$ that have a direct data transmission with $ct_j$. If $sum_j$ is greater than max, update max to $sum_j$, set p to the task $ct_j$ corresponding to max, and set q to the data center $dc_i$ corresponding to max. This records the task with the largest saved data transfer size and its corresponding data center.

[0020] Step 43: Set i = i + 1. If i > m, then jump to step 44; otherwise, jump to step 42.

[0021] Step 44: Assign task p to data center q. Assign all tasks in data center q that have a direct data transmission relationship with node p to data center q. Record the number of tasks k that have been arranged in the current step. Delete the currently arranged tasks from the forest graph, list CT, and CD to obtain the updated forest graph and list. If atn+k=n, the static pre-allocation is completed; otherwise, set i=1 and jump to step 42.
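Steps 41 to 44 can be sketched as a greedy loop. The following Python fragment is an illustration under stated assumptions, not the patent's reference implementation: the function name `static_preallocate` is hypothetical, the running maximum is reset at the start of every round, ties are broken by scanning data centers and tasks in sorted order, and edges are counted in both directions (to predecessors and to successors) as long as both tasks are in the same data center's task list.

```python
def static_preallocate(ct, edges):
    """Greedy static pre-allocation, a sketch of steps 41-44.

    ct:    dict mapping data center -> set of tasks it can process
    edges: dict mapping (i, j) -> expected transfer size e_ij (0 if unknown)
    Returns a dict mapping task -> data center.
    """
    remaining = {dc: set(tasks) for dc, tasks in ct.items()}
    unassigned = set().union(*ct.values())
    plan = {}

    def neighbours(task, tasks):
        # Tasks in the same data center with a direct edge to or from `task`.
        return {b if a == task else a
                for (a, b) in edges
                if task in (a, b) and a in tasks and b in tasks}

    while unassigned:
        best, p, q = -1, None, None            # step 41 (reset each round: an assumption)
        for dc in sorted(remaining):           # steps 42-43: scan every data center
            tasks = remaining[dc]
            for t in sorted(tasks):
                saved = sum(size for (a, b), size in edges.items()
                            if t in (a, b) and a in tasks and b in tasks)
                if saved > best:
                    best, p, q = saved, t, dc
        group = {p} | neighbours(p, remaining[q])   # step 44: p plus its direct neighbours in q
        for t in group:
            plan[t] = q
        unassigned -= group
        for tasks in remaining.values():       # remove assigned tasks from every task list
            tasks -= group
    return plan


# Toy input (not the data of the patent's figures).
toy_ct = {"dc1": {1, 2, 3}, "dc2": {3, 4}}
toy_edges = {(1, 2): 10, (2, 3): 5, (3, 4): 20}
toy_plan = static_preallocate(toy_ct, toy_edges)
```

In the toy input the first round selects task 3 in dc2 (saving 20) and pulls task 4 in with it; the second round places tasks 1 and 2 in dc1.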

[0022] 5. Preferably, in the above technical solution, the dynamic task allocation during execution in step 5 specifically includes:

[0023] Step 51: Allocate the workflow entry task $t_{entry}$ according to the allocation result of step 4. After the entry task has executed, obtain the actual data transfer sizes $r$ from $t_{entry}$ to its successor tasks. Set i = 1.

[0024] Step 52: Set i = i + 1. If i > n, all tasks have been assigned. Otherwise, obtain the candidate data center list $CD$ of task $t_i$ and set j = 1 and $sum_j = 0$, where j identifies the data center number in the $CD$ of $t_i$ and $sum_j$ is used to calculate the data transfer size that the corresponding data center in $CD$ can save for $t_i$.

[0025] Step 53: If a predecessor node $t_l$ of the current task node is an already assigned task node and the current task node can be scheduled in the same data center as $t_l$, then $sum_j = sum_j + r_{l,j}$. Traverse the task list $ct_k$ of $cd_j$; if $ct_k$ is a successor node of $t_i$, then $sum_j = sum_j + e_{i,ct_k}$. Set j = j + 1. If j exceeds the length of $CD$, jump to step 54; otherwise, continue with step 53.

[0026] Step 54: Find the maximum value of $sum_j$, denoted max, and schedule task $t_i$ to the data center $cd_j$ corresponding to max. Jump to step 52.

Attached Figure Description

[0027] Figure 1 is a schematic diagram defining the workflow.

[0028] Figure 2 is a diagram showing the task processing capacity of the data centers.

[0029] Figure 3 is a diagram of the candidate data centers for the workflow tasks.

[0030] Figure 4 is the initial forest graph of executable tasks for each data center.

[0031] Figure 5 is a diagram illustrating the actual data transfer sizes between workflow tasks.

[0032] Figure 6 is a flowchart of the dynamic scheduling method for uncertain, data-intensive workflows according to the present invention.

Detailed Implementation

[0033] Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings. The embodiments described below with reference to the accompanying drawings are exemplary and are only used to explain the present invention, and should not be construed as limiting the present invention.

[0034] This invention proposes a dynamic scheduling method for uncertain, data-intensive workflows in a cloud environment. It addresses the scheduling problems caused by missing data size information and helps reduce both the cross-data-center data transfer volume and the workflow execution cost. First, the workflow structure is abstracted into a directed acyclic graph (DAG) composed of task nodes and the data transfer sizes between them. Then, with part of the data size information missing, static task pre-allocation is performed: a forest graph of executable tasks is obtained for each data center, the task nodes in the forest graph are sorted by the amount of data transfer they save, and the task node with the largest saving is allocated to the corresponding data center, together with its predecessor and successor nodes in that data center, until all tasks are pre-allocated. In the dynamic task allocation stage, a task is allocated only after all preceding tasks have been executed. The actual allocation is based on the static pre-allocation scheme, the relevant data dependencies, and the candidate data centers.

[0035] Example 1 demonstrates the static pre-allocation scheduling scheme for uncertain, data-intensive workflows in a cloud environment. Workflow G contains 10 tasks, where t1 is the entry task and t10 is the exit task. The values of $e_{i,j}$ are shown in Figure 1; if a data transfer size is missing, then $e_{i,j} = 0$.

[0036] The workflow is served by the set of data centers DC. The task processing capacity of each data center is shown in Figure 2, the candidate data centers for the tasks in W are shown in Figure 3, and the initial executable task forest graph of each data center is shown in Figure 4.

[0037] (1) Static pre-allocation

[0038] Data center dc1 can process tasks t3, t4, t6, t8, and t10. From the task forest graph of dc1, $e_{3,6} = 62$ and $e_{4,6} = 28$; between the remaining tasks there is either no data transfer or the data transfer size is unknown. Therefore, sum[3] = $e_{3,6}$ = 62, sum[4] = $e_{4,6}$ = 28, sum[6] = $e_{3,6} + e_{4,6}$ = 90, sum[8] = 0, and sum[10] = 0.

[0039] Data center dc2 can process tasks t2 and t7. From the task forest graph of dc2, it can be seen that there is no data transmission between t2 and t7. Therefore, sum[2] = 0 and sum[7] = 0.

[0040] Data center dc3 can process tasks t1, t5, t7, and t9. From the task forest graph of dc3, $e_{1,7} = 0$ and $e_{7,9} = 149$, and there is no data transmission between the remaining tasks. Therefore, sum[1] = 0, sum[5] = 0, and sum[7] = sum[9] = $e_{7,9}$ = 149.

[0041] Data center dc4 can only process task t3, so sum[3] = 0.

[0042] Data center dc5 can process tasks t2, t4, t6, t8, and t10. From the task forest graph of dc5, $e_{2,4} = 60$, $e_{2,6} = 0$, $e_{4,6} = 28$, and $e_{6,8} = 44$; between the remaining tasks there is either no data transfer or the data transfer size is unknown. Therefore, sum[2] = $e_{2,4}$ = 60, sum[4] = $e_{4,6} + e_{2,4}$ = 88, sum[6] = $e_{6,8} + e_{4,6}$ = 72, and sum[8] = $e_{6,8}$ = 44.

[0043] After the first traversal, the task node that saves the most data transmission is t6. The data center corresponding to sum[6] = 90 is dc1. In the task list of dc1, the predecessor nodes of t6 are t3 and t4, and there are no successor nodes. Therefore, t3, t4, and t6 are assigned to dc1. These three nodes are then removed from the lists CT and CD and the forest graph. The updated graph and list are obtained.

[0044] Repeat the process of traversing and assigning tasks until all tasks have been assigned. The final pre-assignment scheme is listed below.

[0045]
Data Center Number    Pre-assigned task numbers
1                     3, 4, 6, 8, 10
2                     2
3                     1, 5, 7, 9
4                     (none)
5                     (none)

[0046] (2) Dynamic task allocation during execution

[0047] Example 2 demonstrates a dynamic allocation and scheduling scheme for uncertain, data-intensive workflows in a cloud environment.

[0048] After t1 is allocated to data center dc3 according to the pre-allocation scheme, the workflow begins execution and the previously missing data transfer size $r_{1,7}$ between t1 and t7 is obtained. The value of $r_{1,2}$ also differs slightly from that of $e_{1,2}$. As tasks complete, the missing data transfer sizes become known; the values of $r_{i,j}$ are shown in Figure 5.

[0049] Then, t2 is assigned. The CD list for t2 is dc2, dc5. If t2 is assigned to dc2, no data transfer can be saved. If t2 is assigned to dc5, the already assigned node t1 is not in dc5, but t2 has a data transfer with node t4 in the dc5 task list, with $e_{2,4} = 60$. This gives sum[2] = 0 and sum[5] = 60, so t2 is assigned to dc5.

[0050] The CD list for t3 is dc1, dc4. If t3 is assigned to dc4, no data transfer can be saved. If t3 is assigned to dc1, none of the already assigned nodes are in dc1, but t3 has a data transfer with node t6 in the dc1 task list, with $e_{3,6} = 62$. This gives sum[1] = 62 and sum[4] = 0, so t3 is assigned to dc1.

[0051] The CD list for t4 is dc1, dc5. In dc1, t3 has already been assigned, but there is no direct data transfer between t3 and t4; t4 has a data transfer with node t6 in the dc1 task list, with $e_{4,6} = 28$. In dc5, t4 has a direct data transfer with the unassigned task t6 in the task list, with $e_{4,6} = 28$, and with the already assigned task t2, with $r_{2,4} = 57$. Therefore sum[5] = $e_{4,6} + r_{2,4}$ = 85 and sum[1] = 62, so t4 is allocated to data center dc5.
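The per-task decision walked through above for t2, t3, and t4 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patent's implementation: the function name `choose_center` and the variable names are hypothetical, the task lists and transfer sizes are the ones quoted in the example text (only the edges leaving t2, t3, and t4 are included, and r_{1,2} is omitted because its value is not given), and already placed predecessors contribute their actual size r while successors in a candidate's task list contribute their expected size e.

```python
def choose_center(task, cd, ct, placed, e, r):
    """Pick the candidate data center with the largest saved transfer (sketch of steps 52-54)."""
    best_dc, best_sum = None, -1
    for dc in cd[task]:
        saved = 0
        # Predecessors already placed in this data center: use actual sizes r.
        for (a, b), size in r.items():
            if b == task and placed.get(a) == dc:
                saved += size
        # Successors that this data center can also process: use expected sizes e.
        for (a, b), size in e.items():
            if a == task and b in ct[dc]:
                saved += size
        if saved > best_sum:
            best_dc, best_sum = dc, saved
    return best_dc, best_sum


# Task lists and candidate lists as read from the example.
ct = {"dc1": {3, 4, 6, 8, 10}, "dc2": {2, 7}, "dc3": {1, 5, 7, 9},
      "dc4": {3}, "dc5": {2, 4, 6, 8, 10}}
cd = {2: ["dc2", "dc5"], 3: ["dc1", "dc4"], 4: ["dc1", "dc5"]}
e = {(2, 4): 60, (2, 6): 0, (3, 6): 62, (4, 6): 28}   # expected sizes
r = {(1, 7): 62, (2, 4): 57}                          # actual sizes known so far

placed = {1: "dc3"}   # entry task placed by the static pre-allocation
for t in (2, 3, 4):
    placed[t], _ = choose_center(t, cd, ct, placed, e, r)
```

Under these inputs the sketch selects dc5 for t2, dc1 for t3, and dc5 for t4, the same data centers chosen in the walkthrough.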

[0052] Similarly, after all tasks are assigned, the assignment results are shown in the table below:

[0053] Based on the actual allocation results, the actual saved data transmission size is $r_{sum} = r_{1,7} + r_{7,9} + r_{2,4} + r_{2,6} + r_{4,6} + r_{6,8} = 62 + 132 + 57 + 65 + 25 + 27 = 368$ GB. Alibaba Cloud and Huawei Cloud servers in China are billed based on actual data traffic (GB), so using the above allocation scheme saves the corresponding transmission costs.

[0054]
Data Center Number    Actual assigned task numbers
1                     3, 10
2                     (none)
3                     1, 5, 7, 9
4                     (none)
5                     2, 4, 6, 8

[0055] The foregoing description of specific exemplary embodiments of the invention is for illustrative and explanatory purposes. These descriptions are not intended to limit the invention to the precise forms disclosed, and it will be apparent that many changes and variations can be made in accordance with the foregoing teachings. The exemplary embodiments were chosen and described in order to explain the specific principles of the invention and its practical application, thereby enabling those skilled in the art to implement and utilize various different exemplary embodiments of the invention, as well as various different choices and variations. The scope of the invention is intended to be limited by the claims and their equivalents.

Claims

1. A dynamic scheduling method for uncertain, data-intensive workflows in a cloud environment, characterized in that it includes the following steps:
Step 1: Abstract the structure of the workflow. The workflow is represented as a DAG, treated as a directed weighted graph in which predecessor tasks are connected to successor tasks, connected tasks have data dependencies, and the edge weights are the data transfer sizes between tasks. This is a typical data flow structure, which provides a theoretical basis for subsequent workflow scheduling.
Step 2: Define the resources used in the workflow. Each data center can process one or more tasks in the workflow, and each task can be scheduled by one or more data centers. The resource model definition provides a theoretical basis for further scheduling of workflow tasks.
Step 3: Obtain the forest graph of executable tasks for each data center. By combining the resource model definition with the DAG, the forest graph of executable tasks for each data center can be obtained, which provides a theoretical basis for further scheduling of workflow tasks.
Step 4: Static task pre-allocation before execution. Based on the known partial data transmission sizes and the forest graph of executable tasks in each data center, static task pre-allocation is performed when some data size information is missing, providing a judgment condition for the final dynamic task allocation.
Step 41: Set i = 1, max = -1, p = -1, q = -1, atn = 0.
Step 42: Obtain the task with the largest saved data transfer size and its corresponding data center. Using the task list $CT$ of $dc_i$ obtained in step 2 and the forest graph obtained in step 3, calculate for each task $ct_j$ in $CT$ the data transfer size $sum_j$ that can be saved by assigning it to $dc_i$. The value of $sum_j$ is the sum of the data transfer sizes $e_{ct_j,ct_k}$ between $ct_j$ and all tasks $ct_k$ in the task list of $dc_i$ that have a direct data transmission with $ct_j$. If $sum_j$ is greater than max, update max to $sum_j$, set p to the task $ct_j$ corresponding to max, and set q to the data center $dc_i$ corresponding to max, so as to record the task with the largest saved data transfer size and its corresponding data center.
Step 43: Set i = i + 1. If i > m, jump to step 44; otherwise, jump to step 42.
Step 44: Perform the task allocation on the task and data center determined in step 42. Assign task p to data center q, and assign all tasks in data center q that have a direct data transmission relationship with node p to data center q. Record the number of tasks k that have been arranged in the current step, and delete the currently arranged tasks from the forest graph and the lists CT and CD to obtain the updated forest graph and lists. If atn + k = n, the static pre-allocation is completed; otherwise, set i = 1 and jump to step 42.
Step 5: Dynamic task allocation during execution.
Step 51: Allocate the workflow entry task $t_{entry}$ according to the allocation result of step 4. After the entry task has executed, obtain the actual data transfer sizes $r$ from $t_{entry}$ to its successor tasks, and set i = 1.
Step 52: Set i = i + 1. If i > n, all tasks have been assigned. Otherwise, obtain the candidate data center list $CD$ of task $t_i$ and set j = 1 and $sum_j = 0$, where j identifies the data center number in the $CD$ of $t_i$ and $sum_j$ is used to calculate the data transfer size that the corresponding data center in $CD$ can save for $t_i$.
Step 53: If a predecessor node $t_l$ of the current task node is an already assigned task node and the current task node can be scheduled in the same data center as $t_l$, then $sum_j = sum_j + r_{l,j}$. Traverse the task list $ct_k$ of $cd_j$; if $ct_k$ is a successor node of $t_i$, then $sum_j = sum_j + e_{i,ct_k}$. Set j = j + 1. If j exceeds the length of $CD$, jump to step 54; otherwise, continue with step 53.
Step 54: Find the maximum value of $sum_j$, denoted max, and schedule task $t_i$ to the data center $cd_j$ corresponding to max. Jump to step 52.

2. The dynamic scheduling method for uncertain data-intensive workflows in a cloud environment according to claim 1, characterized in that the specific process of step 1 is as follows:
Step 11: Abstract the task nodes. $T = \{t_i \mid i \in \{1, 2, \ldots, n\}\}$, where $T$ represents the workflow task set and $t_i$ represents the i-th task in the workflow.
Step 12: Abstract the data transmission sizes and transmission relationships. $E = \{e_{i,j} \mid e_{i,j} \ge 0,\ i, j \in \{1, 2, \ldots, n\}\}$, where edge $e_{i,j}$ indicates the expected data transfer size between tasks before workflow execution and also indicates that task $t_i$ must complete before task $t_j$ starts. If the size of the data transmitted from $t_i$ to $t_j$ is missing, then $e_{i,j} = 0$. $R = \{r_{i,j} \mid r_{i,j} \ge 0,\ i, j \in \{1, 2, \ldots, n\}\}$, where edge $r_{i,j}$ represents the actual data transfer size between tasks after the workflow begins execution. The workflow entry task is denoted $t_{entry}$ and the exit task $t_{exit}$. The predecessor tasks of task $t_i$ are denoted $t_i^{pre}$ and its successor tasks $t_i^{succ}$. The final workflow is abstracted as $W = \langle T, E, R \rangle$.

3. The dynamic scheduling method for uncertain data-intensive workflows in a cloud environment according to claim 1, characterized in that the specific process of step 2 is as follows: $DC$ is the set of data centers capable of processing tasks in $W$, represented as $DC = \{dc_i \mid i \in \{1, 2, \ldots, m\}\}$, where $dc_i$ is the i-th data center participating in the scheduling, and the set used to list the candidate data centers for tasks in $W$ is denoted as $CD$.