A task scheduling method and system based on cluster optimization

The task scheduling problem is modeled as an optimization problem. Using load collection and cluster topology query modules together with network topology and capacity conditions, an optimization solution method replaces iterative search strategies, whose efficiency is low in large-scale cluster scheduling, and yields efficient and interpretable scheduling results.

CN116346925B · Active · Publication Date: 2026-06-26 · CHINA TELECOM CLOUD TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHINA TELECOM CLOUD TECH CO LTD
Filing Date
2022-12-06
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing task scheduling strategies based on iterative search suffer from rapidly increasing time complexity when facing large-scale cluster scheduling, making it difficult to effectively utilize computing resources, and the scheduling results lack interpretability.

Method used

The task scheduling problem is abstracted into an optimization problem. Network topology, load, and capability conditions are gathered through the load collection, cluster topology query, and cluster capability modules. A strategy calculation module then computes the optimal target cluster with an optimization solution method, and an execution module ensures that scheduling succeeds.

Benefits of technology

In large-scale cluster scheduling, it significantly improves scheduling efficiency, can solve for the global optimal solution, and improves the interpretability of scheduling results.



Abstract

The application relates to the field of information technology, and particularly discloses a task scheduling method and system based on cluster optimization, which comprises the following steps: a cluster S where an application or a virtual machine C to be scheduled is located initiates a request to a strategy calculation module, and carries the resource requirements and capacity requirements required by the C; the strategy calculation module receives the request, queries corresponding data from a load collection module, a cluster topology module and a cluster capacity module, calculates an optimal target scheduling cluster T that meets the conditions according to the method, and informs the two clusters of the result through an execution module; the application can quantize the related conditions and restrictions in the scheduling problem of the rendering application or the virtual machine, so that the differences between the clusters can be clearly seen during scheduling, and the scheduling result has interpretability.

Description

Technical Field

[0001] This invention relates to the field of information technology, specifically to a task scheduling method and system based on cluster optimization.

Background Technology

[0002] With the implementation of the national strategy of "Eastern Data, Western Computing" and the development of the edge computing industry, various data centers and edge clusters have sprung up like mushrooms after rain. After the computing power needs are met, the next problem to be solved is how to make full use of these computing resources. When facing the scheduling problem between clusters, multiple aspects need to be considered, such as the network topology information and network status between clusters, as well as the current load and capacity of the target cluster. Because certain special types of virtual machines require the target cluster to have specific capabilities when scheduling, when the number of tasks to be scheduled increases and the number of target clusters rises, the existing scheduling strategy based on iterative search will face the problem of rapidly increasing time complexity. When the scale of the problem expands, it will quickly reach the bottleneck.

[0003] To address the above problems, this invention provides a cluster-based optimal task scheduling system.

Summary of the Invention

[0004] The purpose of this invention is to provide a task scheduling system based on cluster optimization, which abstracts the scheduling strategy problem described above into an optimization problem, quantifies the factors, constraints and the final goal to be achieved during scheduling, and thus can directly use the optimization solution approach to find the optimal target scheduling cluster. Although the optimization solution approach requires a longer solution time than the iterative search algorithm when the problem size is small in the early stage, the time required is significantly better than the latter when the problem size expands rapidly.

[0005] To achieve the above objectives, the present invention provides the following technical solution: a cluster-based optimal task scheduling system, the system comprising:

[0006] The load collection module, whose main function is to provide the load status of the corresponding cluster;

[0007] The cluster topology query module's main function is to provide the network relationship between two clusters.

[0008] The cluster capability module is used for scheduling decisions. Specifically, it labels various capabilities of the cluster through a combination of indicator collection and manual maintenance, and provides functions for transforming and querying specific cluster capability data.

[0009] The strategy calculation module is used to receive a scheduling request, which includes the virtual machine, the current cluster where the task is located, and the capacity requirements. Based on the parameters in the request and the data in the cluster capacity module, cluster load collection module, and cluster topology query module, the module performs calculations according to the model to obtain the optimal scheduling cluster that meets the conditions and is the best in the current context.

[0010] Execution module: after the strategy calculation module completes the calculation, the execution module informs clusters S and T of the scheduling target so that both clusters are ready to start scheduling the application or virtual machine. At the same time, it reconfirms the status of the two clusters. If their status does not meet the scheduling requirements, the optimal cluster T is selected again. The process ends only when scheduling succeeds.

[0011] S and T represent two clusters, respectively.

[0012] As a preferred embodiment of the present invention, the load collection module is based on the cluster monitoring data collected by the monitoring system, and preprocesses and aggregates it for use by the subsequent strategy calculation module, providing the function of querying the load of a specific cluster and taking into account the usage of CPU, memory, disk and GPU in the cluster during scheduling.

[0013] As a preferred embodiment of the present invention, the network relationship in the cluster topology query module is the network topology relationship, that is, what network the two clusters are connected through. Its contents are as follows: whether the two clusters are in the same local area network, whether the two clusters are in the same metropolitan area network or need to go to the backbone network to connect, and whether the two clusters are connected by a dedicated line; in addition, network quality data between the two clusters is also required, including network bandwidth and packet loss rate.

[0014] As a preferred embodiment of the present invention, when selecting a scheduling strategy, the cluster capability module should consider whether the target cluster has the corresponding capabilities. That is, when some rendering virtual machines or tasks are scheduled, it is necessary to consider whether the target cluster has the corresponding computing power or resources. Only clusters that meet the requirements can be candidates for target scheduling.

[0015] In a preferred embodiment of the present invention, the policy calculation module measures scheduling parameters including network topology cost, network quality cost, cluster load cost, and cluster capability conditions.

[0016] As a preferred embodiment of the present invention, the detailed method for determining network topology cost is as follows:

[0017] Data acquisition is achieved through an interface provided by the cluster topology query module. During scheduling, the fewer network layers traversed between clusters S and T, the lower the cost. Therefore, the network topology cost between clusters S and T is also lower.

[0018] TopologyCost(S,T) is defined as follows:

[0019] TopologyCost(S,T)=α×Level(S,T),α∈(0,1)

[0020] Here, α is a dynamic factor used to adjust the topology cost. Level(S,T) is obtained from the cluster topology query module and represents the network level that must be traversed when clusters S and T communicate. The lower the level traversed, the lower the cost, so the costs rank from lowest to highest as: dedicated line < LAN < MAN < WAN < backbone network. The corresponding values are as follows:

[0021] Level(S,T) = 1, and S and T are connected by a dedicated line;

[0022] Level(S,T) = 2, and S and T are connected via a local area network;

[0023] Level(S,T)=3, S and T are connected via a metropolitan area network;

[0024] Level(S,T)=4, S and T are connected via a wide area network;

[0025] Level(S,T) = 5, and S and T are connected through a backbone network.
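The level table and the TopologyCost definition above can be sketched in a few lines of Python. This is a minimal illustration only: the enum name `Level`, the function name `topology_cost`, and the default α = 0.5 are assumptions made for the example and do not come from the patent text.

```python
from enum import IntEnum


class Level(IntEnum):
    """Network level traversed between two clusters (values from the text)."""
    DEDICATED_LINE = 1
    LAN = 2
    MAN = 3
    WAN = 4
    BACKBONE = 5


def topology_cost(level: Level, alpha: float = 0.5) -> float:
    """TopologyCost(S,T) = alpha * Level(S,T), with alpha in (0, 1).

    The default alpha is an arbitrary illustrative value.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in the open interval (0, 1)")
    return alpha * int(level)
```

With any fixed α, a pair of clusters joined by a dedicated line always scores lower than a pair that must cross the backbone network, which is the ordering the text describes.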

[0026] As a preferred embodiment of the present invention, the detailed method for determining network quality cost is as follows:

[0027] During scheduling, the network quality cost NetQualityCost(S,T) between clusters S and T is defined as:

[0028] NetQualityCost(S,T)=β×BandWidthCost(S,T)+γ×QualityCost(S,T),

[0029] β∈(0,1), γ∈(0,1)

[0030] Where β and γ are dynamic weighting factors, and BandWidthCost(S,T) is the network bandwidth cost between clusters S and T. The higher the bandwidth between clusters, the lower the cost during scheduling. Therefore, BandWidthCost(S,T) is defined as follows:

[0031]

[0032] BandWidth(S,T) is the bandwidth value between clusters S and T, which can be queried from the cluster topology query module. BandWidth(S,T)∈(0,+∞), and the unit is bps. The definition of BandWidthCost(S,T) means that the larger the network bandwidth between clusters, the lower the cost, and vice versa.

[0033] QualityCost(S,T) represents the network quality cost between clusters S and T. The network quality between clusters is measured by the packet loss rate. QualityCost(S,T) is defined as follows:

[0034] QualityCost(S,T)=Quality(S,T)

[0035] Quality(S,T) represents the packet loss rate between clusters S and T. Quality(S,T)∈[0,1], and Quality(S,T) can be queried through the cluster topology module. Through the definition of QualityCost(S,T), the lower the packet loss rate between clusters, the lower the cost, and vice versa.
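The network quality cost can be sketched as follows. The text here does not reproduce the formula for BandWidthCost(S,T); it only states that the cost falls as bandwidth rises. The reciprocal form used in `bandwidth_cost` is therefore an assumption chosen to satisfy that property, not the patent's definition. The function names and the default weights are likewise illustrative.

```python
def bandwidth_cost(bandwidth_bps: float) -> float:
    """ASSUMED form: reciprocal of the bandwidth in bps.

    The source only requires that a larger bandwidth give a lower cost;
    any strictly decreasing function would fit that description.
    """
    if bandwidth_bps <= 0:
        raise ValueError("bandwidth must be positive")
    return 1.0 / bandwidth_bps


def net_quality_cost(bandwidth_bps: float, loss_rate: float,
                     beta: float = 0.5, gamma: float = 0.5) -> float:
    """NetQualityCost = beta * BandWidthCost + gamma * QualityCost.

    QualityCost(S,T) equals the packet loss rate Quality(S,T) in [0, 1].
    """
    if not (0.0 < beta < 1.0 and 0.0 < gamma < 1.0):
        raise ValueError("beta and gamma must lie in (0, 1)")
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("packet loss rate must lie in [0, 1]")
    return beta * bandwidth_cost(bandwidth_bps) + gamma * loss_rate
```

Because bandwidth is expressed in bps, the reciprocal term is numerically tiny next to the loss-rate term; a real deployment would presumably rescale one of them or tune β and γ accordingly.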

[0036] As a preferred embodiment of the present invention, the detailed method for determining the cluster load cost is as follows:

[0037] During scheduling, clusters with lower loads are scheduled. The load rate of a cluster is jointly measured by the utilization of CPU, memory, and GPU within the cluster. The load cost LoadCost(T) of cluster T is defined as follows:

[0038] LoadCost(T)=δ×CPU(T)+ε×Mem(T)+ζ×GPU(T)

[0039] Where CPU(T) is the sum of the 95th percentile of CPU utilization across all machines in cluster T over a 10-minute period, Mem(T) is the sum of the 95th percentile of memory utilization across all machines in cluster T over a 10-minute period, GPU(T) is the sum of the 95th percentile of GPU utilization across all machines in cluster T over a 10-minute period, and δ, ε, and ζ are dynamic factors that adjust these three values. LoadCost(T) is low when the cluster load is low and high when the load is high.
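A minimal sketch of the load cost follows, reading CPU(T), Mem(T), and GPU(T) as sums over machines of the 95th-percentile utilization within a 10-minute window. The nearest-rank percentile rule, the function names, and the default weights of 1.0 are assumptions for illustration; the patent does not say how the percentile is computed or constrain δ, ε, and ζ.

```python
from typing import Sequence


def p95(samples: Sequence[float]) -> float:
    """95th percentile by the nearest-rank rule (an assumed convention)."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integers
    return ordered[rank - 1]


def cluster_metric(per_machine: Sequence[Sequence[float]]) -> float:
    """Sum of per-machine 95th percentiles over one 10-minute window."""
    return sum(p95(machine) for machine in per_machine)


def load_cost(cpu: Sequence[Sequence[float]],
              mem: Sequence[Sequence[float]],
              gpu: Sequence[Sequence[float]],
              delta: float = 1.0, epsilon: float = 1.0,
              zeta: float = 1.0) -> float:
    """LoadCost(T) = delta*CPU(T) + epsilon*Mem(T) + zeta*GPU(T)."""
    return (delta * cluster_metric(cpu)
            + epsilon * cluster_metric(mem)
            + zeta * cluster_metric(gpu))
```

Each argument is a list with one inner list of utilization samples per machine, so a cluster with more busy machines accumulates a higher cost.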

[0040] The cluster load cost corresponds to the following cluster load conditions:

[0041] During scheduling, the cluster T to be scheduled must be able to meet the load conditions of the application to be scheduled, and the number of resources in the cluster T to be scheduled must be greater than the number of resources required by the application or virtual machine to be scheduled. Considering only the number of CPU cores and available memory in the cluster during scheduling, the constraints for the application or virtual machine C to be scheduled are described as follows:

[0042] CPUCore(T) - CPUCore_need(C) ≥ 0

[0043] Mem(T) - Mem_need(C) ≥ 0

[0044] Where CPUCore(T) is the number of currently idle CPU cores in cluster T, CPUCore_need(C) is the number of CPU cores required by the application or virtual machine C, Mem(T) is the currently available memory of cluster T, and Mem_need(C) is the memory requirement of the application or virtual machine C.

[0045] As a preferred embodiment of the present invention, the cluster capability conditions are as follows:

[0046] The capabilities of the cluster T to be scheduled need to meet the requirements of the application or virtual machine to be scheduled. Information from each cluster is collected through the cluster capability module and organized into a matrix according to a unified standard numbering system. During scheduling, the requirements of the application or virtual machine to be scheduled are likewise abstracted into a capability requirement vector. Let the cluster capability matrix be capacity_{m,n}. The capability vector of each cluster c is defined as capacity_c = (s_1, s_2, ..., s_n)^T, where s_1, s_2, ..., s_n represent the n capabilities of cluster c and each s_i = 0 or 1. With m the total number of clusters, s_{m,n} indicates whether cluster m has the nth capability: s_{m,n} = 1 indicates that it does, and s_{m,n} = 0 that it does not. The cluster capability matrix capacity_{m,n} is:

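[0047]
$$
\mathrm{capacity}_{m,n} =
\begin{pmatrix}
s_{1,1} & s_{1,2} & \cdots & s_{1,n} \\
s_{2,1} & s_{2,2} & \cdots & s_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
s_{m,1} & s_{m,2} & \cdots & s_{m,n}
\end{pmatrix}
$$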

[0048] The capability requirements of a given application or virtual machine C are expressed as the vector R_C = (r_1, r_2, ..., r_n)^T, where r_n represents the need for a particular capability: r_n = 1 means that capability s_n is required, and otherwise it is not required. When screening clusters that meet the capability requirements, the following calculation is required:

[0049]

[0050] This calculation indicates whether cluster c_m can meet the capability requirements of the application or virtual machine C; if it can, c_m can be used as one of the target clusters during scheduling. Res(T) is defined as the result of this calculation for cluster T, representing the capability value of cluster T, and NeedCount_C is defined as the capability requirement value of the specific application or virtual machine C. The necessary and sufficient condition for T to be a candidate cluster for scheduling the application or virtual machine C is:

[0051] Res(T) ≥ NeedCount_C
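One plausible reading of the capability screening is sketched below. The exact calculation is not reproduced in this text, so the sketch assumes that Res is the product of the 0/1 capability matrix with the 0/1 requirement vector (the number of required capabilities each cluster has) and that NeedCount_C is the number of capabilities C requires. Both are assumptions consistent with the condition Res(T) ≥ NeedCount_C, not a confirmed statement of the patent's formula; the function names are illustrative.

```python
from typing import List, Sequence


def capability_scores(capacity: Sequence[Sequence[int]],
                      requirement: Sequence[int]) -> List[int]:
    """ASSUMED calculation: one dot product per cluster row.

    capacity[i][j] is 1 if cluster i has capability j, else 0.
    requirement[j] is 1 if the workload needs capability j, else 0.
    """
    scores = []
    for row in capacity:
        if len(row) != len(requirement):
            raise ValueError("row length must match requirement length")
        scores.append(sum(s * r for s, r in zip(row, requirement)))
    return scores


def candidate_clusters(capacity: Sequence[Sequence[int]],
                       requirement: Sequence[int]) -> List[int]:
    """Indices of clusters with Res >= NeedCount (assumed = sum of r_j)."""
    need_count = sum(requirement)
    scores = capability_scores(capacity, requirement)
    return [i for i, res in enumerate(scores) if res >= need_count]
```

Under these assumptions a cluster qualifies exactly when it has every required capability, since a 0/1 dot product can reach the number of required items only if none is missing.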

[0052] The ultimate goal of the strategy calculation module is to calculate the cluster T that minimizes network topology cost, network quality cost, and cluster load cost while satisfying cluster load conditions and cluster capacity requirements. The details are as follows:

[0053] min TopologyCost(S,T) + NetQualityCost(S,T) + LoadCost(T)

[0054] s.t. CPUCore(T) ≥ CPUCore_need(C)

[0055] Mem(T) ≥ Mem_need(C)

[0056] Res(T) ≥ NeedCount_C

[0057] T∈[1,m]

[0058] Where m is the number of candidate target scheduling clusters. TopologyCost(S,T) represents the network topology cost between clusters S and T, NetQualityCost(S,T) represents the network quality cost between clusters S and T, and LoadCost(T) represents the load cost of cluster T. CPUCore(T) represents the number of idle CPU cores in cluster T, and CPUCore_need(C) represents the CPU resource requirements of the application or virtual machine C. Mem(T) represents the amount of free memory in cluster T, and Mem_need(C) represents the memory requirements of the application or virtual machine C. Res(T) represents the capability value of cluster T, and NeedCount_C represents the capability requirement value of the application or virtual machine C.
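To make the objective and constraints concrete, the sketch below evaluates them by plain enumeration over the candidate clusters. This is a deliberately simple stand-in: the patent intends the model to be handed to an optimization solver, and enumeration is used here only because it shows the feasibility filter and the minimized sum without any third-party dependency. The `Candidate` and `Request` types and all field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """Precomputed costs and resources for one candidate cluster T."""
    name: str
    topology_cost: float
    net_quality_cost: float
    load_cost: float
    idle_cpu_cores: int   # CPUCore(T)
    free_mem: float       # Mem(T)
    res: int              # Res(T)


@dataclass
class Request:
    """Requirements of the application or virtual machine C."""
    cpu_cores_needed: int  # CPUCore_need(C)
    mem_needed: float      # Mem_need(C)
    need_count: int        # NeedCount_C


def is_feasible(t: Candidate, req: Request) -> bool:
    """Check the three constraints of the model."""
    return (t.idle_cpu_cores >= req.cpu_cores_needed
            and t.free_mem >= req.mem_needed
            and t.res >= req.need_count)


def total_cost(t: Candidate) -> float:
    """Objective: topology cost + network quality cost + load cost."""
    return t.topology_cost + t.net_quality_cost + t.load_cost


def select_target(candidates: List[Candidate],
                  req: Request) -> Optional[Candidate]:
    """Return the cheapest feasible cluster, or None if none qualifies."""
    feasible = [t for t in candidates if is_feasible(t, req)]
    if not feasible:
        return None
    return min(feasible, key=total_cost)
```

Keeping the per-term costs as separate fields also supports the interpretability claim: the terms of the chosen cluster can be compared directly with those of each rejected one.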

[0059] A cluster-based optimal task scheduling method, the method comprising the following steps:

[0060] Step S100: The cluster S containing the application or virtual machine C to be scheduled sends a request to the policy calculation module, including the resource and capability requirements of C.

[0061] Step S200: After receiving the request, the strategy calculation module queries the corresponding data from the load collection module, cluster topology module, and cluster capability module.

[0062] Step S300: Calculate the optimal target scheduling cluster T that meets the conditions according to the method proposed in the above system.

[0063] Step S400: Then, the execution module informs the clusters on both sides of the result and reconfirms the status of the clusters on both sides. If the status does not meet the requirements, the calculation is repeated until the scheduling is successful.
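Steps S100 to S400 amount to a compute, confirm, and retry loop, which can be sketched as below. The callables `compute_best` and `confirm_ready` are hypothetical stand-ins for the strategy calculation module and the execution module's status check. Two further assumptions are made for the sketch: a target that fails confirmation is excluded from the next calculation, and the number of attempts is bounded, whereas the text only says the calculation repeats until scheduling succeeds.

```python
from typing import Callable, List, Optional, Set


def schedule(candidates: List[str],
             compute_best: Callable[[List[str]], Optional[str]],
             confirm_ready: Callable[[str], bool],
             max_attempts: int = 5) -> Optional[str]:
    """Pick a target cluster, reconfirm its status, retry on failure."""
    excluded: Set[str] = set()
    for _ in range(max_attempts):
        # S200/S300: recompute the best target among remaining clusters.
        remaining = [c for c in candidates if c not in excluded]
        target = compute_best(remaining)
        if target is None:
            return None  # no cluster satisfies the constraints
        # S400: reconfirm status before notifying both clusters.
        if confirm_ready(target):
            return target
        excluded.add(target)  # assumed policy: skip the failed target
    return None
```

Bounding the loop avoids spinning forever when no cluster can pass confirmation; a caller that wants the unbounded behavior described in the text could re-invoke `schedule` after refreshing the load data.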

[0064] Compared with the prior art, the beneficial effects of the present invention are:

[0065] 1) The method of the present invention can quantify the relevant conditions and constraints in the scheduling problem of rendering applications or virtual machines, so that the differences between clusters can be more clearly seen during scheduling, and the scheduling results are interpretable.

[0066] 2) This invention models the scheduling problem as an integer programming problem, which can be solved using an optimization package. It is more efficient than iterative algorithms when the number of candidate clusters increases, and it can find the global optimal solution.

Description of the Drawings

[0067] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention.

[0068] Figure 1 is a diagram illustrating the architecture of a cluster-based task scheduling system according to the present invention.

[0069] Figure 2 is a flowchart of a cluster-based optimal task scheduling method according to the present invention.

Detailed Implementation

[0070] To make the technical problems to be solved, the technical solutions, and the beneficial effects of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit the present invention.

[0071] Referring to Figures 1-2, this invention provides a task scheduling method and system based on cluster optimization. The system mainly includes a load collection module, a cluster topology query module, a cluster capability module, a strategy calculation module, and an execution module.

[0072] The overall process is as follows: The cluster S where the application or virtual machine C to be scheduled resides sends a request to the policy calculation module, along with the resource and capability requirements of C. After receiving the request, the policy calculation module queries the corresponding data from the load collection module, cluster topology module, and cluster capability module. Then, according to the method proposed in this invention, it calculates the optimal target scheduling cluster T that meets the conditions. The execution module then informs the clusters on both sides of the result and reconfirms the status of the clusters on both sides. If the status does not meet the requirements, the calculation is repeated until the scheduling is successful.

[0073] The functions of each module are described below, and the modeling method proposed in this invention is described in detail in the strategy calculation module.

[0074] 1) Load collection module

[0075] The main function of the load collection module is to provide the load status of the corresponding cluster. This is achieved by preprocessing and aggregating the cluster monitoring data collected by the monitoring system so that the subsequent strategy module can use it directly. Its main function is to provide the function of querying the load of a specific cluster. Currently, the scheduling takes into account the usage of CPU, memory, disk and GPU in the cluster.

[0076] 2) Cluster Topology Query Module

[0077] The main function of the cluster topology query module is to provide the network relationship between two clusters, including the network topology, i.e., what network connects the two clusters. For example, it determines whether the two clusters are in the same local area network (LAN), metropolitan area network (MAN), or require a backbone network for connection, or whether there is a dedicated line connecting them. It also needs to provide network quality data between the two clusters, including network bandwidth and packet loss rate. During scheduling, the network topology and quality data between the two clusters need to be considered. On the one hand, network transmission costs between clusters at the same level are relatively low; for example, if transmission is possible within the same LAN, there's no need to use an intercity network. On the other hand, when scheduling larger virtual machines or applications, the optimal scenario is to schedule between two clusters with high network bandwidth and low packet loss rate.

[0078] 3) Cluster capability module

[0079] The cluster capability module's main functions are divided into two parts: firstly, it can label various cluster capabilities through a combination of metric collection and manual maintenance; secondly, it provides functions for transforming and querying specific cluster capability data. When selecting a scheduling strategy, the suitability of the target cluster for scheduling is considered. In the scenario of this invention, when scheduling rendering virtual machines or tasks, it is necessary to consider whether the target cluster has the corresponding computing power or resources. Only clusters that meet the requirements can be considered as candidates for target scheduling.

[0080] 4) Strategy Calculation Module

[0081] The strategy calculation module is the core component of the system. This module receives scheduling requests, which include the virtual machine, the current cluster where the task is located, and the capacity requirements. Based on the parameters in the request and combined with data from the cluster capacity module, cluster load collection module, and cluster topology query module, it performs calculations according to the model proposed in this invention to determine the optimal scheduling cluster that meets the conditions and is optimal in the current context. The specific model calculation process is described as follows:

[0082] Assuming the current virtual machine or application resides in cluster S, the goal is to calculate the optimal cluster T to be scheduled in the current context using a scheduling algorithm. The method of this invention aims to minimize network transmission overhead between S and T, maximize bandwidth, and minimize the load on T. Furthermore, the capabilities of cluster T must meet the requirements of the scheduled application or virtual machine. Therefore, when evaluating the cost of scheduling, the following aspects need to be considered:

[0083]

[0084]

[0085] Network topology cost: Data is obtained through the interface provided by the cluster topology query module. During scheduling, the fewer network layers traversed between clusters S and T, the lower the cost. Therefore, the network topology cost TopologyCost(S,T) between clusters S and T is defined as follows:

[0086] TopologyCost(S,T)=α×Level(S,T),α∈(0,1)

[0087] Where α is a dynamic factor used to adjust the topology cost, and Level(S,T) is obtained from the cluster topology query module, representing the network level that must be traversed when clusters S and T communicate. This invention assumes that the lower the level traversed, the lower the cost, so the costs rank from lowest to highest as: dedicated line < LAN < MAN < WAN < backbone network. The corresponding values are as follows:

[0088]
Level(S,T)   Scenario
1            S and T are connected by a dedicated line.
2            S and T are connected via a local area network.
3            S and T are connected via a metropolitan area network.
4            S and T are connected via a wide area network.
5            S and T are connected via a backbone network.

[0089] Network Quality Cost: During scheduling, the network quality cost NetQualityCost(S,T) between clusters S and T is defined as:

[0090] NetQualityCost(S,T)=β×BandWidthCost(S,T)+γ×QualityCost(S,T),

[0091] β∈(0,1), γ∈(0,1)

[0092] Where β and γ are dynamic weighting factors, and BandWidthCost(S,T) is the network bandwidth cost between clusters S and T. This invention assumes that the higher the bandwidth between clusters, the lower the cost during scheduling. Therefore, BandWidthCost(S,T) is defined as follows:

[0093]

[0094] BandWidth(S,T) is the bandwidth value between clusters S and T, which can be queried from the cluster topology query module. BandWidth(S,T)∈(0,+∞), and the unit is bps. The definition of BandWidthCost(S,T) means that the larger the network bandwidth between clusters, the lower the cost, and vice versa.

[0095] QualityCost(S,T) represents the network quality cost between clusters S and T. This invention uses the packet loss rate to measure the network quality between clusters, and QualityCost(S,T) is defined as follows:

[0096] QualityCost(S,T)=Quality(S,T)

[0097] Quality(S,T) represents the packet loss rate between clusters S and T. Quality(S,T)∈[0,1] and can be queried through the cluster topology module. By defining QualityCost(S,T), the lower the packet loss rate between clusters, the lower the cost, and vice versa.

[0098] Cluster load cost: During scheduling, it is desirable to schedule clusters with lower loads. In this invention, the cluster load rate is jointly measured by the utilization of CPU, memory, and GPU within the cluster. Therefore, the load cost LoadCost(T) of cluster T is defined as follows:

[0099] LoadCost(T)=δ×CPU(T)+ε×Mem(T)+ζ×GPU(T)

[0100] Where CPU(T) is the sum of the 95th percentile of CPU utilization across all machines in cluster T over a 10-minute period, Mem(T) is the sum of the 95th percentile of memory utilization across all machines in cluster T over a 10-minute period, and GPU(T) is the sum of the 95th percentile of GPU utilization across all machines in cluster T over a 10-minute period. δ, ε, and ζ are dynamic factors that adjust these three values. LoadCost(T) indicates that the cost decreases when the cluster load is low and increases conversely.

[0101] Cluster load conditions: During scheduling, the cluster T to be scheduled needs to meet the load conditions of the application to be scheduled. For example, the cluster T to be scheduled needs to have sufficient memory and CPU resources to meet the needs of the application or virtual machine to be scheduled. Therefore, when selecting the cluster T to be scheduled, it is necessary to consider not only the load cost of the cluster, but also whether the cluster T can meet the resource requirements of the application or virtual machine to be scheduled. Therefore, the number of resources of the cluster T to be scheduled needs to be greater than the number of resources required by the application or virtual machine to be scheduled. This invention only considers the number of CPU cores and memory reserves of the cluster during scheduling. The constraints for the application or virtual machine C to be scheduled are described as follows:

[0102] CPUCore(T) - CPUCore_need(C) ≥ 0

[0103] Mem(T) - Mem_need(C) ≥ 0

[0104] Where CPUCore(T) is the number of currently idle CPU cores in cluster T, CPUCore_need(C) is the number of CPU cores required by the application or virtual machine C, Mem(T) is the currently available memory of cluster T, and Mem_need(C) is the memory requirement of the application or virtual machine C.

[0105] Cluster capability requirements: The capabilities of the cluster T to be scheduled need to meet the requirements of the application or virtual machine to be scheduled. For rendering applications and virtual machines, in most cases the target cluster needs to meet their GPU computing power requirements. Some applications also require the cluster to provide capabilities such as Ceph storage, or have network-related requirements, such as network speed within the cluster or the availability of corresponding network devices. To handle these capabilities, this invention collects information from each cluster through the cluster capability module and organizes it into a matrix according to a unified standard numbering system. During scheduling, the requirements of the application or virtual machine to be scheduled are abstracted into a capability requirement vector. Let the cluster capability matrix be capacity_{m,n}. The capability vector of each cluster c is defined as capacity_c = (s_1, s_2, ..., s_n)^T, where s_1, s_2, ..., s_n represent the n capabilities of cluster c and each s_i = 0 or 1. Assuming there are m clusters in total, their capability vectors can be collected to form the capability matrix of all clusters, where s_{m,n} indicates whether cluster m has the nth capability: s_{m,n} = 1 indicates that it does, and s_{m,n} = 0 that it does not. The cluster capability matrix capacity_{m,n} is:

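[0106]
$$
\mathrm{capacity}_{m,n} =
\begin{pmatrix}
s_{1,1} & s_{1,2} & \cdots & s_{1,n} \\
s_{2,1} & s_{2,2} & \cdots & s_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
s_{m,1} & s_{m,2} & \cdots & s_{m,n}
\end{pmatrix}
$$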

[0107] Suppose that the capability requirement of an application or virtual machine C is the vector R_C = (r_1, r_2, ..., r_n)^T, where r_n represents the need for a particular capability: r_n = 1 means that capability s_n is required, and otherwise it is not required. When selecting clusters that meet the capability requirements, only the following calculation is required:

[0108]

[0109] This calculation indicates whether cluster c_m can meet the capability requirements of the application or virtual machine C; if it can, c_m can be used as one of the target clusters during scheduling. Res(T) is defined as the result for cluster T calculated according to the above steps, representing the capability value of cluster T, and NeedCount_C represents the capability value required by a specific application or virtual machine C. Then, when scheduling an application or virtual machine C, the necessary and sufficient condition for T to be a candidate cluster for scheduling is:

[0110] Res(T) ≥ NeedCount_C

[0111] The strategy calculation module ultimately calculates the cluster T that minimizes network topology cost, network quality cost, and cluster load cost while satisfying cluster load conditions and cluster capacity requirements.

[0112] min TopologyCost(S,T) + NetQualityCost(S,T) + LoadCost(T)

[0113] s.t. CPUCore(T) ≥ CPUCore_need(C)

[0114] Mem(T) ≥ Mem_need(C)

[0115] Res(T) ≥ NeedCount_C

[0116] T∈[1,m]

[0117] Where m is the number of candidate target scheduling clusters. S and C are fixed, so CPUCore_need(C), Mem_need(C), and NeedCount_C are all known: once the application or virtual machine C to be scheduled and its current cluster S are determined, the memory, computing resource, and cluster capability requirements for scheduling are determined as well. The goal is to find the solution to the above problem, that is, the target cluster T.

[0118] 5) Execution Module

[0119] The main function of the execution module is to inform S and T of the scheduling target after the policy calculation module completes the calculation, so that the two clusters are ready to start scheduling applications and virtual machines. At the same time, it reconfirms the status of the two clusters. If the status does not meet the scheduling requirements, it will select the optimal cluster T again. The process will end only when the scheduling is successful.

[0120] In summary, the overall process and overall architecture of this invention are shown in Figure 1 and Figure 2. A specific implementation is as follows:

[0121] 1) Assume there is currently one rendering virtual machine C on cluster S, and there are 3 clusters to choose from.

[0122] Let c1, c2, and c3 be the clusters, and each cluster has three capability items. Therefore, the capability matrix for the three clusters is:

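[0123] $\begin{pmatrix} s_{1,1} & s_{1,2} & s_{1,3} \\ s_{2,1} & s_{2,2} & s_{2,3} \\ s_{3,1} & s_{3,2} & s_{3,3} \end{pmatrix}$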

[0124] Where $s_{m,n}$ indicates whether cluster m has capability n: $s_{m,n} = 1$ indicates that cluster m has capability n, and $s_{m,n} = 0$ indicates that it does not.

[0125] 2) Assume the capability requirement vector of the virtual machine C to be scheduled is $R_C = (r_1, r_2, r_3)^T$. Then [formula not reproduced in the source text], where [symbol not reproduced in the source text] represents the capabilities of cluster c1. If [condition not reproduced in the source text] holds, c1 can meet the requirements for scheduling the virtual machine.
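The capability check for the three-cluster example can be sketched as follows. The formula for Res is not reproduced in the text, so this sketch assumes one plausible reading: Res(c) is the dot product of the cluster's 0/1 capability row with the requirement vector, and NeedCount_C is the number of required capabilities (the sum of the entries of R_C). Under that assumption, Res(c) ≥ NeedCount_C holds exactly when the cluster has every required capability. The function names and the concrete 0/1 values are hypothetical.

```python
from typing import Sequence


def capability_value(capabilities: Sequence[int], requirement: Sequence[int]) -> int:
    """Assumed Res(c): count of required capabilities that the cluster has."""
    return sum(s * r for s, r in zip(capabilities, requirement))


def meets_capability(capabilities: Sequence[int], requirement: Sequence[int]) -> bool:
    """Check the candidate condition Res(c) >= NeedCount_C."""
    need_count = sum(requirement)  # assumed NeedCount_C
    return capability_value(capabilities, requirement) >= need_count


# Assumed capability rows for c1, c2, c3 (three capability items each).
capacity = {
    "c1": (1, 0, 1),
    "c2": (1, 1, 0),
    "c3": (1, 1, 1),
}
r_c = (1, 0, 1)  # assumed requirement: capabilities 1 and 3 are needed

eligible = [name for name, row in capacity.items() if meets_capability(row, r_c)]
```

With these values, c2 lacks capability 3 and is filtered out, while c1 and c3 remain as candidate target clusters.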

[0126] 3) The following integer programming problem can be solved using the method proposed in this invention:

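[0127] $\min_{T}\ \mathrm{TopologyCost}(S,T) + \mathrm{NetQualityCost}(S,T) + \mathrm{LoadCost}(T)$

[0128] $\text{s.t.}\quad \mathrm{CPUCore}(T) \ge \mathrm{CPUCore}_{need}(C)$

[0129] $\mathrm{Mem}(T) \ge \mathrm{Mem}_{need}(C)$

[0130] $\mathrm{Res}(T) \ge \mathrm{NeedCount}_C$

[0131] $T \in [1, m]$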

[0132] TopologyCost(S,T) and NetQualityCost(S,T) can be obtained through the cluster topology module, while LoadCost(T) can be obtained through the cluster load module. Furthermore, once the virtual machine C to be scheduled is known, CPUCore_need(C) and Mem_need(C) can both be obtained by querying the cluster load module, so the cluster T that satisfies the above constraints and minimizes the scheduling cost can be solved for.
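The three cost terms can be computed from the queried data roughly as sketched below, following the definitions given in the claims: TopologyCost = α × Level, NetQualityCost = β × BandWidthCost + γ × QualityCost with QualityCost equal to the packet loss rate, and LoadCost = δ × CPU + ε × Mem + ζ × GPU. The definition of BandWidthCost is not reproduced in the text, so the reciprocal of the bandwidth is used here purely as a stand-in that decreases as bandwidth grows. The function names and default weights are likewise assumptions.

```python
def topology_cost(level: int, alpha: float = 0.5) -> float:
    """TopologyCost(S,T) = alpha * Level(S,T), with Level in 1..5."""
    if level not in (1, 2, 3, 4, 5):
        raise ValueError("Level(S,T) must be an integer from 1 to 5")
    return alpha * level


def net_quality_cost(bandwidth_bps: float, packet_loss: float,
                     beta: float = 0.5, gamma: float = 0.5) -> float:
    """NetQualityCost(S,T) = beta * BandWidthCost + gamma * QualityCost."""
    if bandwidth_bps <= 0:
        raise ValueError("BandWidth(S,T) must be positive")
    if not 0.0 <= packet_loss <= 1.0:
        raise ValueError("Quality(S,T) must lie in [0, 1]")
    # ASSUMPTION: stand-in for the BandWidthCost formula missing from the text.
    bandwidth_cost = 1.0 / bandwidth_bps
    quality_cost = packet_loss  # QualityCost(S,T) = Quality(S,T)
    return beta * bandwidth_cost + gamma * quality_cost


def load_cost(cpu: float, mem: float, gpu: float,
              delta: float = 1.0, epsilon: float = 1.0, zeta: float = 1.0) -> float:
    """LoadCost(T) = delta * CPU(T) + epsilon * Mem(T) + zeta * GPU(T)."""
    return delta * cpu + epsilon * mem + zeta * gpu


# Illustrative inputs (assumed): LAN link, 1 Gbps, 2% packet loss.
cost = (
    topology_cost(level=2)
    + net_quality_cost(bandwidth_bps=1e9, packet_loss=0.02)
    + load_cost(cpu=0.5, mem=0.25, gpu=0.25)
)
```

Keeping each term in its own function means the weights α, β, γ, δ, ε, and ζ can be tuned independently, and the contribution of each term to a cluster's total cost can be reported alongside the scheduling decision.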

[0133] After the optimal scheduling cluster T is determined, the execution module reconfirms the status of S and T. If the status meets expectations, the execution module notifies both clusters to begin scheduling.

[0134] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.

[0135] The above are merely preferred embodiments of the present invention and do not limit the scope of the patent. Any equivalent structural or procedural transformations made based on the description and drawings of the present invention, or direct or indirect applications in other related technical fields, are similarly included within the scope of patent protection of the present invention.

Claims

1. A task scheduling system based on cluster optimization, characterized in that the system includes: a load collection module, whose main function is to provide the load status of the corresponding cluster; a cluster topology query module, whose main function is to provide the network relationship between two clusters; a cluster capability module, used for scheduling decisions, which labels the various capabilities of each cluster through a combination of indicator collection and manual maintenance and provides functions for transforming and querying the capability data of a specific cluster; a strategy calculation module, used to receive a scheduling request that includes the virtual machine or task, the cluster where it is currently located, and its capability requirements, and which, based on the parameters in the request and the data in the cluster capability module, the load collection module, and the cluster topology query module, performs calculations according to the model to obtain the optimal scheduling cluster that meets the conditions in the current context; and an execution module, which, after the strategy calculation module completes the calculation, informs clusters S and T of the scheduling target so that the two clusters are ready to start scheduling applications and virtual machines, and at the same time reconfirms the status of the two clusters; if the status does not meet the scheduling requirements, the optimal cluster T is selected again, and the process ends only when the scheduling is successful; S and T denote the two clusters, respectively.

2. The task scheduling system based on cluster optimization according to claim 1, characterized in that the load collection module is based on cluster monitoring data collected by the monitoring system, which is preprocessed and aggregated for use by the subsequent strategy calculation module; it provides the function of querying the load of a specific cluster, and the usage of CPU, memory, disk, and GPU in the cluster is considered during scheduling.

3. A task scheduling system based on cluster optimization according to claim 2, characterized in that the network relationship in the cluster topology query module refers to the network topology relationship, that is, the type of network through which the two clusters are connected, including: whether the two clusters are in the same local area network, whether the two clusters are in the same metropolitan area network or must connect through the backbone network, and whether the two clusters are connected by a dedicated line; in addition, network quality data between the two clusters is also required, including network bandwidth and packet loss rate.

4. A task scheduling system based on cluster optimization according to claim 3, characterized in that, when a scheduling strategy is selected, the cluster capability module must consider whether the target cluster has the corresponding capabilities; that is, when rendering virtual machines or tasks are scheduled, it is necessary to consider whether the target cluster has the corresponding computing power or resources, and only clusters that meet the requirements can be candidates for target scheduling.

5. A task scheduling system based on cluster optimization according to claim 4, characterized in that the scheduling parameters measured by the strategy calculation module include network topology cost, network quality cost, cluster load cost, and cluster capability conditions.

6. A task scheduling system based on cluster optimization according to claim 5, characterized in that the detailed method for determining the network topology cost is as follows: data is acquired through the interface provided by the cluster topology query module. During scheduling, the fewer network layers traversed between clusters S and T, the lower the cost. Therefore, the network topology cost TopologyCost(S,T) between clusters S and T is defined as: $\mathrm{TopologyCost}(S,T) = \alpha \times \mathrm{Level}(S,T)$, $\alpha \in (0,1)$, where α is a dynamic factor used to adjust the topology cost, and Level(S,T) is obtained from the cluster topology query module and represents the network level that must be traversed when clusters S and T communicate. The lower the level traversed when two clusters communicate, the lower the cost. In order of increasing cost: dedicated line < local area network < metropolitan area network < wide area network < backbone network. The corresponding values are as follows: Level(S,T) = 1 when S and T are connected by a dedicated line; Level(S,T) = 2 when S and T are connected via a local area network; Level(S,T) = 3 when S and T are connected via a metropolitan area network; Level(S,T) = 4 when S and T are connected via a wide area network; Level(S,T) = 5 when S and T are connected through a backbone network.

7. A task scheduling system based on cluster optimization according to claim 6, characterized in that the detailed method for determining the network quality cost is as follows: during scheduling, the network quality cost NetQualityCost(S,T) between clusters S and T is defined as: $\mathrm{NetQualityCost}(S,T) = \beta \times \mathrm{BandWidthCost}(S,T) + \gamma \times \mathrm{QualityCost}(S,T)$, $\beta \in (0,1)$, $\gamma \in (0,1)$, where β and γ are dynamic weighting factors and BandWidthCost(S,T) is the network bandwidth cost between clusters S and T. The higher the bandwidth between clusters, the lower the cost during scheduling. Therefore, BandWidthCost(S,T) is defined as follows: [formula not reproduced in the source text], where BandWidth(S,T) is the bandwidth value between clusters S and T, which can be queried from the cluster topology query module, with $\mathrm{BandWidth}(S,T) \in (0, +\infty)$ in bps. Under this definition, the larger the network bandwidth between clusters, the lower the cost, and vice versa. QualityCost(S,T) represents the network quality cost between clusters S and T. The network quality between clusters is measured by the packet loss rate, and QualityCost(S,T) is defined as: $\mathrm{QualityCost}(S,T) = \mathrm{Quality}(S,T)$, where Quality(S,T) represents the packet loss rate between clusters S and T, with $\mathrm{Quality}(S,T) \in [0,1]$, and can be queried through the cluster topology query module. Under this definition, the lower the packet loss rate between clusters, the lower the cost, and vice versa.

8. A task scheduling system based on cluster optimization according to claim 7, characterized in that the method for determining the cluster load cost is as follows: during scheduling, clusters with lower loads are preferred. The load rate of a cluster is jointly measured by the utilization of CPU, memory, and GPU within the cluster. The load cost LoadCost(T) of cluster T is defined as: $\mathrm{LoadCost}(T) = \delta \times \mathrm{CPU}(T) + \varepsilon \times \mathrm{Mem}(T) + \zeta \times \mathrm{GPU}(T)$, where CPU(T) is the sum of the 95th percentile of CPU utilization of all machines in cluster T over 10 minutes, Mem(T) is the sum of the 95th percentile of memory utilization of all machines in cluster T over 10 minutes, GPU(T) is the sum of the 95th percentile of GPU utilization of all machines in cluster T over 10 minutes, and δ, ε, and ζ are dynamic factors that adjust the three values; under this definition, LoadCost(T) is low when the cluster load is low and high when the cluster load is high. The cluster load cost corresponds to the following cluster load conditions: during scheduling, the cluster T to be scheduled to must be able to meet the load conditions of the scheduled application, and the amount of resources in cluster T must be no less than the amount of resources required by the scheduled application or virtual machine. Considering the number of CPU cores and the available memory of the cluster, the constraints for the application or virtual machine C to be scheduled are: $\mathrm{CPUCore}(T) - \mathrm{CPUCore}_{need}(C) \ge 0$ and $\mathrm{Mem}(T) - \mathrm{Mem}_{need}(C) \ge 0$, where CPUCore(T) is the number of currently idle CPU cores in cluster T, $\mathrm{CPUCore}_{need}(C)$ is the CPU core requirement of the application or virtual machine C, Mem(T) is the currently available memory of cluster T, and $\mathrm{Mem}_{need}(C)$ is the memory requirement of the application or virtual machine C.

9. A task scheduling system based on cluster optimization according to claim 8, characterized in that the cluster capability conditions are as follows: the capabilities of the cluster T to be scheduled to must meet the requirements of the application or virtual machine to be scheduled. Information from each cluster is collected through the cluster capability module and organized into a matrix according to a unified standard numbering system; at the same time, during scheduling, the requirements of the application or virtual machine to be scheduled are abstracted into a capability requirement vector. The capability vector of each cluster c is defined as $\mathrm{capacity}_c = (s_1, s_2, \ldots, s_n)^T$, where $s_1, s_2, \ldots, s_n$ represent the n capabilities of cluster c and each takes the value 0 or 1. With m the total number of clusters, $s_{m,n}$ indicates whether cluster m has the nth capability: $s_{m,n} = 1$ indicates that it does, and otherwise it does not. The cluster capability matrix $\mathrm{Capacity}_{m,n}$ is: [matrix not reproduced in the source text]. The capability requirements of a given application or virtual machine C form the vector $R_C = (r_1, r_2, \ldots, r_n)^T$, where $r_n$ represents the need for a certain capability: $r_n = 1$ means that capability $s_n$ is required, and otherwise it is not required. When screening clusters that meet the capability requirements, the following calculation is performed: [formula not reproduced in the source text]. When [condition not reproduced in the source text] holds, cluster $c_m$ can meet the capability requirements of the application or virtual machine C and can serve as one of the target clusters during scheduling. Res(T) is then defined as the capability value of cluster T, and $\mathrm{NeedCount}_C$ as the capability value required by a specific application or virtual machine C, both calculated according to the steps above, so that the necessary and sufficient condition for T to be a candidate cluster for scheduling the application or virtual machine C is: $\mathrm{Res}(T) \ge \mathrm{NeedCount}_C$. The strategy calculation module ultimately calculates the cluster T that minimizes the sum of the network topology cost, the network quality cost, and the cluster load cost while satisfying the cluster load conditions and the cluster capability conditions, as follows: $\min_{T}\ \mathrm{TopologyCost}(S,T) + \mathrm{NetQualityCost}(S,T) + \mathrm{LoadCost}(T)$, subject to $\mathrm{CPUCore}(T) \ge \mathrm{CPUCore}_{need}(C)$, $\mathrm{Mem}(T) \ge \mathrm{Mem}_{need}(C)$, $\mathrm{Res}(T) \ge \mathrm{NeedCount}_C$, and $T \in [1, m]$, where m is the number of candidate target scheduling clusters; TopologyCost(S,T) represents the network topology cost between clusters S and T, NetQualityCost(S,T) represents the network quality cost between clusters S and T, and LoadCost(T) represents the load cost of cluster T; CPUCore(T) represents the amount of idle CPU resources in cluster T, and $\mathrm{CPUCore}_{need}(C)$ represents the CPU resource requirement of the application or virtual machine C; Mem(T) represents the amount of free memory resources in cluster T, and $\mathrm{Mem}_{need}(C)$ represents the memory requirement of the application or virtual machine C; Res(T) represents the capability value of cluster T, and $\mathrm{NeedCount}_C$ represents the capability requirement value of the application or virtual machine C.

10. A task scheduling method based on cluster optimization, using the system according to any one of claims 1-9, characterized in that the method includes the following steps: Step S100: the cluster S containing the application or virtual machine C to be scheduled sends a request to the strategy calculation module, including the resource and capability requirements of C. Step S200: after receiving the request, the strategy calculation module queries the corresponding data from the load collection module, the cluster topology query module, and the cluster capability module. Step S300: the optimal target scheduling cluster T that meets the conditions is calculated according to the method of the above system. Step S400: the execution module informs the clusters on both sides of the result and reconfirms the status of the clusters on both sides; if the status does not meet the requirements, the calculation is repeated until the scheduling is successful.