Resource configuration method, apparatus and electronic device
By listening to MIG operation events and parsing instance information, the target strategy is selected and configured on the target GPU, solving the problem of complex MIG configuration in cloud computing platforms, achieving efficient and flexible MIG resource management, and reducing labor and time costs.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- INSPUR SUZHOU INTELLIGENT TECH CO LTD
- Filing Date
- 2023-01-31
- Publication Date
- 2026-06-26
AI Technical Summary
In cloud computing platforms, during the configuration of MIG instance resources, the number of GPUs that can be configured for MIG varies on different server nodes in the K8s cluster, and the MIG configuration scheme for each GPU may be different, resulting in complicated and costly manual configuration.
A resource configuration method and apparatus are provided. By listening to MIG operation events, parsing instance information, selecting a target policy from a pre-configured MIG policy set based on configuration parameters and identification information, and calling the target driver interface to configure the policy into the target GPU, the method supports MIG configuration for custom resource types.
It simplifies the MIG configuration process, reduces labor and time costs, provides a flexible and configurable approach, and minimizes intrusion into existing logic.
Figure CN116360977B_ABST
Abstract
Description
Technical Field
[0001] The embodiments of the present invention relate to the field of cloud computing technology, and in particular to a resource allocation method, apparatus and electronic device.
Background Technology
[0002] In recent years, with the rapid development of cloud computing, hardware virtualization technology has also been continuously iterating and updating. For deep learning scenarios, the virtualization of graphics processing units (GPUs) has become particularly important, so that GPU devices in cloud computing clusters can be used by more users.
[0003] In the existing technology, some manufacturers have developed new GPU products that natively support virtualization partitioning technology in hardware to obtain multi-instance GPUs (MIGs). The partitioned GPU MIG instances can achieve data protection, fault isolation and independence, and service stability.
[0004] In cloud computing platforms, Kubernetes (K8s) clusters are typically used to manage MIG instances and to schedule and allocate pods using MIG instance resources. However, during the configuration of MIG instance resources, the number of GPUs that can be configured for MIG varies across different server nodes in the K8s cluster, and the MIG configuration scheme for each GPU may also differ. If the cluster is large, manual configuration is cumbersome and consumes too much manpower and time.
Summary of the Invention
[0005] This application provides a resource allocation method, apparatus, and electronic device to solve some or all of the above-mentioned technical problems in the prior art.
[0006] Firstly, this application provides a resource allocation method, which includes:
[0007] When an operation event of a multi-instance GPU (MIG) is detected, the operation event is parsed to obtain instance information. The instance information includes at least the first identification information of the node to be operated, the configuration action instruction, the configuration parameter information corresponding to the target graphics processing unit (GPU) in the node to be operated, and the second identification information corresponding to the target policy to be configured for the target GPU.
[0008] When it is determined that the configuration action instruction is used to instruct the execution of the first configuration action, the target policy is selected from the pre-configured MIG policy set according to the configuration parameter information and the second identification information.
[0009] Based on the first identification information, determine the node to be operated;
[0010] Based on the configuration parameter information, the target driver interface is called to run the preset code logic, which is used to configure the target strategy into the target GPU.
[0011] Optionally, the MIG can be a custom resource type MIG.
[0012] Optionally, the configuration parameter information includes the target GPU manufacturer information and the target GPU third identification information;
[0013] Based on the configuration parameter information, the target driver interface is invoked to run preset code logic, which is used to configure the target strategy into the target GPU, including:
[0014] Determine the target driver interface based on the target GPU manufacturer information;
[0015] Based on the third identification information, the target GPU in the node to be operated is determined;
[0016] Call the target driver interface to run preset code logic to configure the target strategy into the target GPU.
[0017] Optionally, when it is determined that a configuration action instruction is used to instruct the execution of the first configuration action, a target policy is selected from the pre-configured MIG policy set based on the configuration parameter information and the second identification information, including:
[0018] Based on the third identification information, match the subset of MIG policies corresponding to the target GPU from the MIG policy set;
[0019] Based on the second identifier information, select the target policy from the MIG policy subset.
[0020] Optionally, the MIG policy set is represented as a configmap object.
[0021] Optionally, the MIG policy set includes: a first data structure and a second data structure;
[0022] The first data structure includes: at least one first type field, at least one second type field corresponding to each first type field, and at least one count combination field corresponding to each first type field;
[0023] The second data structure includes a policy subset corresponding to each first type field, wherein each policy in the policy subset is composed of the field value of any second type field corresponding to the first type field and the field value of any count combination field.
[0024] The first type field indicates the GPU type; the second type field indicates the sub-resource type corresponding to the first type field; and the count combination field indicates the quantity of each sub-resource type.
[0025] Optionally, before calling the target driver interface to run preset code logic to configure the target strategy to the target GPU, it also includes:
[0026] Create a processing task container to execute the target policy configuration task;
[0027] By using the processing task container, the target driver interface is called to run preset code logic, which is used to configure the target policy into the target GPU.
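The processing task container described above could, for instance, be expressed as a Kubernetes Job pinned to the node to be operated. The following is only a minimal sketch under stated assumptions: the function name `build_mig_job`, the container image, and the command-line flags are hypothetical and do not come from the application; only the `batch/v1` Job fields (`nodeName`, `restartPolicy`, `backoffLimit`) are standard Kubernetes fields.

```python
def build_mig_job(node: str, vendor: str, gpu_id: int, plan_label: str) -> dict:
    """Build a batch/v1 Job manifest that runs one MIG configuration task on `node`."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"mig-config-{node}-gpu{gpu_id}"},
        "spec": {
            "backoffLimit": 0,  # do not retry automatically; status is checked separately
            "template": {
                "spec": {
                    "nodeName": node,          # run on the node to be operated
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "mig-config",
                            # Hypothetical image and flags, for illustration only.
                            "image": "example.invalid/mig-configurator:latest",
                            "args": [
                                f"--vendor={vendor}",
                                f"--gpu-id={gpu_id}",
                                f"--plan={plan_label}",
                            ],
                        }
                    ],
                }
            },
        },
    }
```

Pinning the task with `nodeName` is one way to make sure the driver call runs on the server that physically hosts the target GPU; a node selector would be an alternative.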
[0028] Optionally, the first configuration action includes: creating a MIG mode and configuring the MIG policy.
[0029] Optionally, when it is determined that a configuration action instruction is used to instruct the execution of a second configuration action, the method includes:
[0030] Replace the MIG policy already configured in the target GPU with the target policy.
[0031] Optionally, the MIG policy already configured in the target GPU can be replaced with the target policy, including:
[0032] Clear the configured MIG mode in the target GPU and delete the configured MIG policy in the target GPU;
[0033] Based on the second identifier information, the target driver interface is invoked to run the preset code logic, which is used to configure the target strategy into the target GPU.
[0034] Optionally, the second configuration action includes updating the MIG policy.
[0035] Optionally, when it is determined that a configuration action instruction is used to instruct the execution of a third configuration action, the method further includes:
[0036] Clear the configured MIG mode in the target GPU and delete the configured MIG policy in the target GPU.
[0037] Optionally, the third configuration action includes deleting the currently configured MIG policy.
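The three configuration actions can be summarized as a dispatch on the configuration action instruction. The sketch below is illustrative only: it assumes the instruction values `CREATE`, `UPDATE`, and `DELETE` used in the example later in this description, and the step names it returns are hypothetical labels, not identifiers from the application.

```python
from enum import Enum


class Action(Enum):
    CREATE = "CREATE"   # first configuration action
    UPDATE = "UPDATE"   # second configuration action
    DELETE = "DELETE"   # third configuration action


def plan_steps(operate: str) -> list:
    """Map a configuration action instruction to an ordered list of step labels."""
    action = Action(operate)  # raises ValueError for an unknown instruction
    if action is Action.CREATE:
        return ["create_mig_mode", "configure_policy"]
    if action is Action.UPDATE:
        # Clear-then-configure variant of the second configuration action.
        return ["clear_mig_mode", "delete_policy", "configure_policy"]
    return ["clear_mig_mode", "delete_policy"]
```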
[0038] Optionally, the instance information may also include configuration instance status information, and the method may further include:
[0039] Filter out GPUs with incomplete policy configuration;
[0040] When it is determined that the MIG policy in the first GPU is consistent with the target policy, the status information in the instance information corresponding to the first GPU is updated to "configuration successful". Here, the first GPU is any GPU among the GPUs that have not completed policy configuration.
[0041] Optionally, when it is determined that the MIG policy in the first GPU is inconsistent with the target policy, the running status of the code logic running in the first GPU is detected;
[0042] When the running status is "running completed", the status information is updated to "configuration failed".
[0043] Alternatively, when the running status is "not completed", the running status of the code logic running in the first GPU is checked again after a preset time interval;
[0044] And when the running status is "running completed", check again whether the MIG policy in the first GPU is consistent with the target policy.
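The status update logic above can be sketched as a small polling routine. This is a simplified illustration, not the application's implementation: `get_policy` and `task_done` are hypothetical callables standing in for the policy query and the running-status query, and the `max_checks` bound and the "pending" result are assumptions added so that the sketch always terminates.

```python
import time
from typing import Callable

SUCCESS = "configuration successful"
FAILED = "configuration failed"
PENDING = "pending"  # assumption: returned when the bound on re-checks is reached


def reconcile_status(
    get_policy: Callable[[], str],
    target_policy: str,
    task_done: Callable[[], bool],
    interval: float = 0.0,
    max_checks: int = 10,
) -> str:
    """Return the status for one GPU whose policy configuration is not yet complete."""
    if get_policy() == target_policy:
        return SUCCESS
    if task_done():
        # The code logic finished but the policy still differs from the target.
        return FAILED
    for _ in range(max_checks):
        time.sleep(interval)  # preset time interval before checking again
        if task_done():
            # Once the task has finished, compare the policy one more time.
            return SUCCESS if get_policy() == target_policy else FAILED
    return PENDING
```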
[0045] Optionally, listen for MIG operation events, including:
[0046] Use a pre-registered MIG listener to monitor MIG operation events in real time.
[0047] Secondly, this application provides a resource allocation apparatus, which includes:
[0048] The listening module is used to listen for operation events of the multi-instance GPU (MIG).
[0049] The parsing module is used to parse the operation event and obtain the instance information when the listening module detects an operation event of the MIG. The instance information includes at least the first identification information of the node to be operated, the configuration action instruction, the configuration parameter information corresponding to the target graphics processing unit (GPU) in the node to be operated, and the second identification information corresponding to the target policy to be configured for the target GPU.
[0050] The processing module is used to determine the MIG configuration action to be executed based on the configuration action instructions;
[0051] The selection module is used to select a target policy from a pre-configured MIG policy set based on configuration parameter information and second identification information when the MIG configuration action to be executed is determined to be the first configuration action.
[0052] The processing module is also used to determine the node to be operated based on the first identification information; and to call the target driver interface to run preset code logic based on the configuration parameter information, so as to configure the target strategy into the target GPU.
[0053] Optionally, the MIG can be a custom resource type MIG.
[0054] Optionally, the configuration parameter information includes the target GPU manufacturer information and the target GPU third identification information;
[0055] The processing module is specifically used to determine the target driver interface based on the target GPU's manufacturer information; determine the target GPU in the node to be operated based on the third identification information; and call the target driver interface to run preset code logic to configure the target strategy into the target GPU.
[0056] Optionally, the selection module is also used to match a subset of MIG policies corresponding to the target GPU from the MIG policy set based on the third identification information; and to select a target policy from the MIG policy subset based on the second identification information.
[0057] Optionally, the MIG policy set is represented as a configmap object.
[0058] Optionally, the MIG policy set includes: a first data structure and a second data structure;
[0059] The first data structure includes: at least one first type field, at least one second type field corresponding to each first type field, and at least one count combination field corresponding to each first type field;
[0060] The second data structure includes a policy subset corresponding to each first type field, wherein each policy in the policy subset is composed of the field value of any second type field corresponding to the first type field and the field value of any count combination field.
[0061] The first type field indicates the GPU type; the second type field indicates the sub-resource type corresponding to the first type field; and the count combination field indicates the quantity of each sub-resource type.
[0062] Optionally, the processing module is also used to create a processing task container that executes the target policy configuration task;
[0063] By using a task processing container, the target driver interface is called to run preset code logic, which is used to configure the target strategy into the target GPU.
[0064] Optionally, the processing module is also used to replace the MIG policy already configured in the target GPU with the target policy when it is determined that the MIG configuration action to be executed is the second configuration action.
[0065] Optionally, the processing module is specifically used to clear the configured MIG mode in the target GPU and delete the configured MIG policy in the target GPU;
[0066] and, based on the second identification information, to call the target driver interface to run the preset code logic, which is used to configure the target policy into the target GPU.
[0067] Optionally, the processing module is also configured to clear the configured MIG mode in the target GPU and delete the configured MIG policy in the target GPU when it is determined that the MIG configuration action to be executed is the third configuration action.
[0068] Optionally, the instance information may also include the status information of the configuration instance;
[0069] The processing module is also used to filter GPUs that have not completed policy configuration;
[0070] When it is determined that the MIG policy in the first GPU is consistent with the target policy, the status information in the instance information corresponding to the first GPU is updated to "configuration successful". Here, the first GPU is any GPU among the GPUs that have not completed policy configuration.
[0071] Optionally, the processing module is also used to detect the running status of the code logic running in the first GPU when it is determined that the MIG policy in the first GPU is inconsistent with the target policy;
[0072] When the running status is "running completed", the status information is updated to "configuration failed".
[0073] Alternatively, when the running status is "not completed", the running status of the code logic running in the first GPU is checked again after a preset time interval;
[0074] And when the running status is "running completed", check again whether the MIG policy in the first GPU is consistent with the target policy.
[0075] Optionally, the listening module is specifically used to monitor MIG operation events in real time with a pre-registered MIG listener.
[0076] Thirdly, an electronic device is provided, including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;
[0077] Memory, used to store computer programs;
[0078] When a processor executes a program stored in memory, it implements the steps of the resource allocation method of any embodiment of the first aspect.
[0079] Fourthly, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the steps of the resource allocation method as described in any embodiment of the first aspect.
[0080] The technical solutions provided in this application have the following advantages compared with the prior art:
[0081] The method provided in this application, when a MIG operation event is detected, parses the operation event to obtain instance information. The parsed instance information includes at least a first identifier of the operation node, indicating which GPU on which operation node the MIG configuration operation is to be performed; a configuration instruction, indicating what operation to perform on the operation node; configuration parameter information corresponding to the target GPU in the operation node; and a second identifier corresponding to the target policy to be configured on the target GPU. When it is determined that the configuration action instruction indicates the execution of a first configuration action, a target policy is selected from a pre-configured MIG policy set based on the configuration parameter information and the second identifier. Then, the operation node is determined based on the first identifier, and the target driver interface is called to execute code logic based on the configuration parameter information to configure the target policy onto the target GPU. Throughout the process, MIG configuration information is configured into the MIG instance. Then, by listening for instance change events, the target strategy is determined based on the instance information described above. A corresponding scheduling task is then generated to call the target driver interface to execute the underlying logic, completing the MIG processing on the target GPU. The entire process is simple and efficient, providing a flexible and configurable approach that significantly reduces the workload of configuring MIG on the GPU, lowering both labor and time costs. 
Furthermore, because this application uses a MIG strategy configuration set approach, the MIG configuration strategy set can be freely modified externally without altering the original logic within the operator component, minimizing the intrusion of changes in business requirements into the project's original logic.
Attached Figure Description
[0082] Figure 1 is a schematic flowchart of a resource allocation method provided in an embodiment of the present invention;
[0083] Figure 2 is a schematic diagram of another resource allocation method provided in an embodiment of the present invention;
[0084] Figure 3 is a schematic diagram of the data structure of the MIG strategy set provided in an embodiment of the present invention;
[0085] Figure 4 is a schematic diagram of another resource allocation method provided in an embodiment of the present invention;
[0086] Figure 5 is a schematic diagram of another resource allocation method provided in an embodiment of the present invention;
[0087] Figure 6 is a simplified overall block diagram of the resource allocation method flow provided in the embodiments of the present invention;
[0088] Figure 7 is a schematic diagram of another resource allocation method provided in an embodiment of the present invention;
[0089] Figure 8 is a simplified flowchart of the overall process for updating the configuration status in the resource configuration method provided in this embodiment of the invention;
[0090] Figure 9 is a schematic diagram of a resource allocation device provided in an embodiment of the present invention;
[0091] Figure 10 is a schematic diagram of an electronic device structure provided in an embodiment of the present invention.
Detailed Implementation
[0092] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0093] To facilitate understanding of the embodiments of the present invention, further explanations and descriptions will be provided below with reference to the accompanying drawings and specific embodiments. These embodiments do not constitute a limitation on the embodiments of the present invention.
[0094] To address the technical problems mentioned in the background section, this application provides a resource allocation method, detailed in the embodiments below. Figure 1 is a schematic flowchart of a resource configuration method provided by an embodiment of the present invention. This method can be applied to a Kubernetes cluster in which a Kubernetes operator component for managing MIG instances is created, and MIG configuration information is mapped to MIG instances. In a specific example, the MIG instance is an instance of a MIG custom resource definition (CRD), abbreviated as MIG CRD. In an optional example, the operator can pre-register a listener in the Kubernetes cluster and use it to monitor MIG CRD instance operation events in real time. The operator's listener can be a controller included in the operator. After an operation event is parsed, the corresponding operation logic is executed according to the instance information, thereby completing MIG management of the underlying GPU.
[0095] Kubernetes, or K8s for short (the 8 stands for the eight characters "ubernete"), is an open-source container orchestration engine originally developed by Google. It supports automated deployment, large-scale scaling, and management of containerized applications. A K8s operator is an application-specific controller that extends the functionality of the Kubernetes API to create, configure, and manage instances of complex applications on behalf of K8s users. It is built upon basic K8s resource and controller concepts but incorporates domain-specific or application-specific knowledge to automate the lifecycle of the applications it manages.
[0096] Before executing the method steps of the embodiments of this application, some preparatory work is required, namely creating a MIG CRD instance.
[0097] Specifically, the MIG CRD instance records the information of the node to be operated and specifies the GPU's MIG configuration scheme, among other things. It can also include other information; see the following example for details:
[0098] apiVersion: V1
[0099] kind: MIG
[0100] metadata:
[0101]   name: example
[0102] spec:
[0103]   node: node2
[0104]   operate: CREATE
[0105]   vendor: NVIDIA
[0106]   migPlans:
[0107]     - gpuID: 0
[0108]       migPlan: 1
[0109]     - gpuID: 1
[0110]       migPlan: 1
[0111] The `apiVersion` field indicates the version of the MIG CRD instance, and the `kind` field indicates the resource kind, `MIG` in the example above. `metadata` holds the metadata, including `name` (here "example", because this is only an example). Under `spec`, `node` indicates the node to be operated on; it can include multiple nodes, although the example above shows only one, `node2`, referring to server node 2 in the cluster. `operate` indicates the configuration action command, such as `CREATE`, which creates the MIG mode and configures the MIG policy; other configuration action commands, such as `UPDATE` or `DELETE`, are also possible. `vendor` indicates the GPU manufacturer and marks which vendor driver interface is called during subsequent underlying MIG processing; NVIDIA is used in this embodiment. `migPlans` indicates the MIG policy scheme. Each entry includes a `gpuID`, in which the GPU card number is configured to mark which GPU card is being processed for MIG, and a `migPlan`, in which the MIG policy label is configured. The `migPlan` value is the second identification information corresponding to the target policy mentioned below.
[0112] After creating the MIG crd instance, the following steps can be performed. These steps include:
[0113] Step 110: When an operation event of MIG is detected, parse the operation event and obtain instance information.
[0114] Specifically, the operator registers a MIG CRD listener with Kubernetes, allowing it to monitor any events related to the MIG CRD in the cluster. When an operation event involving MIG is detected, the operator parses the event to obtain instance information. This instance information includes at least the first identifier of the node to be operated on, the configuration action command, the configuration parameters of the target GPU in the node to be operated on, and the second identifier corresponding to the target policy to be configured on the target GPU.
[0115] The node to be operated on is a server node in the Kubernetes cluster, as described above. The first identification information indicates which node the operation event applies to. The configuration action command, as described above, indicates the configuration operation to be executed, such as creating a MIG mode and configuring a given type of MIG policy. The configuration parameters of the target GPU in the node to be operated on, and the role of the second identification information, are described in detail below.
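As an illustration of Step 110, the parsing of an operation event into instance information might look as follows. This is a sketch under assumptions: the event is taken to be a dictionary with `type` and `object` keys, in the shape Kubernetes watch events use, the field names follow the MIG CRD example above, and the class and function names (`GpuPlan`, `InstanceInfo`, `parse_event`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GpuPlan:
    gpu_id: int      # third identification information (GPU card number)
    plan_label: str  # second identification information (MIG policy label)


@dataclass(frozen=True)
class InstanceInfo:
    node: str        # first identification information
    operate: str     # configuration action instruction
    vendor: str      # GPU manufacturer information
    plans: Tuple[GpuPlan, ...]


def parse_event(event: dict) -> InstanceInfo:
    """Extract instance information from one MIG CRD operation event."""
    spec = event["object"]["spec"]
    plans = tuple(
        GpuPlan(gpu_id=int(p["gpuID"]), plan_label=str(p["migPlan"]))
        for p in spec["migPlans"]
    )
    return InstanceInfo(
        node=spec["node"],
        operate=spec["operate"],
        vendor=spec["vendor"],
        plans=plans,
    )


# Event carrying the example MIG CRD instance shown earlier.
example_event = {
    "type": "ADDED",
    "object": {
        "apiVersion": "V1",
        "kind": "MIG",
        "metadata": {"name": "example"},
        "spec": {
            "node": "node2",
            "operate": "CREATE",
            "vendor": "NVIDIA",
            "migPlans": [
                {"gpuID": 0, "migPlan": 1},
                {"gpuID": 1, "migPlan": 1},
            ],
        },
    },
}
info = parse_event(example_event)
```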
[0116] Step 120: When it is determined that the configuration action instruction is used to instruct the execution of the first configuration action, the target policy is selected from the pre-configured MIG policy set according to the configuration parameter information and the second identification information.
[0117] Specifically, the first configuration action is, for example, CREATE, which creates the MIG mode and configures the MIG policy. Then, based on the configuration parameter information and the second identification information, a target policy can be selected from the pre-configured MIG policy set.
[0118] The configuration parameters may include the third identifier of the target GPU, while the second identifier indicates the specific target policy. Selecting a target policy from the pre-configured MIG policy set can be achieved through the following steps, as shown in Figure 2:
[0119] Step 210: Based on the third identification information, match the subset of MIG policies corresponding to the target GPU from the MIG policy set.
[0120] Step 220: Select a target policy from the MIG policy subset based on the second identifier information.
[0121] Specifically, the third identification information, such as the GPU ID mentioned above, indicates the GPU type or GPU name. Then, a subset of MIG policies corresponding to the third identification information of the target GPU is matched from the MIG policy set. That is, the MIG policy set includes at least one subset of MIG policies, each subset corresponds to one GPU, and each subset includes multiple MIG policies. Furthermore, it includes the second identification information corresponding to each MIG policy. Therefore, the target policy can be selected from the MIG policy subset based on the second identification information.
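Steps 210 and 220 amount to a two-stage lookup. The sketch below assumes the policy set has already been loaded into a nested dictionary keyed first by GPU type and then by policy label; the contents of `POLICY_SET` are hypothetical sample data (in particular, the label "2" is invented for illustration), and how a GPU card number is resolved to a GPU type is left out of scope.

```python
# Hypothetical policy set: GPU type -> {policy label -> policy content}.
POLICY_SET = {
    "A100-40G": {"1": "70000", "2": "13000"},
}


def select_target_policy(policy_set: dict, gpu_type: str, plan_label: str) -> str:
    """Step 210: match the policy subset; Step 220: pick the policy by label."""
    try:
        subset = policy_set[gpu_type]
    except KeyError:
        raise KeyError(f"no MIG policy subset for GPU type {gpu_type!r}") from None
    try:
        return subset[plan_label]
    except KeyError:
        raise KeyError(
            f"no MIG policy labelled {plan_label!r} for GPU type {gpu_type!r}"
        ) from None
```

Separating the two lookups keeps the error messages specific: an unknown GPU type and an unknown policy label fail at different stages.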
[0122] In an optional example, the MIG policy set can be represented as a configmap object. The MIG policy set includes a first data structure and a second data structure:
[0123] The first data structure includes: at least one first type field, at least one second type field corresponding to each first type field, and at least one count combination field corresponding to each first type field;
[0124] The second data structure includes a policy subset corresponding to each first type field, wherein each policy in the policy subset is composed of the field value of any second type field corresponding to the first type field and the field value of any count combination field.
[0125] The first type field indicates the GPU type (corresponding to the third identification information); the second type field indicates the sub-resource type corresponding to the first type field; and the count combination field indicates the quantity of each sub-resource type. The field values of the sub-resource types and the field values of the count combination field under each GPU type constitute the sub-policy set corresponding to that GPU type.
[0126] For details on the data structure of the MIG policy set, see Figure 3. The policy set includes the first data structure, on the left of Figure 3, and the second data structure, on the right of Figure 3. The first data structure includes at least one first-type field, at least one second-type field corresponding to each first-type field, and at least one count combination field corresponding to each first-type field. The second data structure consists of the MIG policy content itself, composed of the field values of any second-type field corresponding to the first-type field and the field values of any count combination field. Of course, the first data structure may also include version information, attribute information, metadata information, etc., as shown in Figure 3:
[0127] The first data structure includes:
[0128] API Version: V1; Kind: configMap; Metadata: Name: migconfig; Data: config.json. The key in config.json within the configmap information (first type field) configures the GPU type; for example, in Figure 3 the key value is A100-40G. The `strategy` attribute field (second type field) configures all possible MIG instances that this GPU can generate. For example, when the GPU type is A100-40G, the possible types of generated MIG instances include 1g.5gb, 2g.10gb, 3g.20gb, 4g.20gb, and 7g.40gb, which are the MIG sub-resource types mentioned above. The `plans` field (count combination field) configures the combination information for the strategy; for example, the plan "1" shown in Figure 3 as "70000" indicates the quantity of each sub-resource type. Together, `strategy` and `plans` form a MIG policy. Multiple policies constitute the policy set, from which the target policy is selected. Detailed policy information for "70000" corresponding to 1g.5gb is shown on the right of Figure 3; this is one MIG policy. Similarly, the right side of Figure 3 also shows detailed policy information for combinations such as "13000" and "002000" corresponding to 1g.5gb. Figure 3 uses only 1g.5gb as an example on the right; other policy subsets are similar and are not shown.
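One possible reading of how a count combination such as "70000" relates to the `strategy` list is sketched below. This is an interpretation, not something the text states outright: it assumes each character of the combination is a single-digit count matched by position to the sub-resource types, and the function name `expand_plan` is hypothetical.

```python
# Sub-resource types listed for A100-40G in the description.
STRATEGY = ["1g.5gb", "2g.10gb", "3g.20gb", "4g.20gb", "7g.40gb"]


def expand_plan(counts: str, strategy: list) -> dict:
    """Pair each digit of a count combination with the sub-resource type at the same position.

    Assumption: one digit per sub-resource type, in the order given by `strategy`.
    """
    if not counts.isdigit() or len(counts) != len(strategy):
        raise ValueError(
            f"count combination {counts!r} does not match {len(strategy)} sub-resource types"
        )
    # Keep only the sub-resource types that are actually instantiated.
    return {name: int(c) for name, c in zip(strategy, counts) if c != "0"}
```

Under this reading, `expand_plan("70000", STRATEGY)` yields seven 1g.5gb instances, and `expand_plan("13000", STRATEGY)` yields one 1g.5gb instance plus three 2g.10gb instances. A combination whose length differs from the number of sub-resource types is rejected rather than guessed at.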
[0129] Step 130: Determine the node to be operated based on the first identification information.
[0130] Specifically, for example, if the first identifier is node1, then the node to be operated on is server 1 (node 1) in the Kubernetes cluster.
[0131] Step 140: Based on the configuration parameter information, call the target driver interface to run the preset code logic to configure the target strategy into the target GPU.
[0132] In an optional example, the configuration parameter information may include, in addition to the third identification information mentioned above, the target GPU manufacturer information.
[0133] Therefore, based on the configuration parameter information, the target driver interface is called to run the preset code logic to configure the target policy into the target GPU. This can be achieved through the following steps, as shown in Figure 4:
[0134] Step 410: Determine the target driver interface based on the target GPU manufacturer information.
[0135] Specifically, different manufacturers have different driver interfaces. Therefore, it is necessary to determine the driver interface to be called based on the manufacturer information of the target GPU.
[0136] Step 420: Determine the target GPU in the node to be operated based on the third identification information.
[0137] In an optional example, the third identification information could be, for example, the GPU ID information in the MIG CRD instance above, or other information such as the name or number of the target GPU.
[0138] Step 430: Call the target driver interface to run the preset code logic to configure the target strategy into the target GPU.
[0139] Specifically, the steps for calling the target driver interface to run the preset code logic and configuring the target strategy into the target GPU can be implemented with existing mature technologies, and are not elaborated further here.
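Steps 410 to 430 can be sketched as a lookup from manufacturer information to a driver object, followed by locating the GPU and applying the policy. The snippet below is only an illustration: the `MigDriver` interface, the `FakeVendorADriver` class, the manufacturer key `"vendor-a"`, and the `configure` function are all hypothetical, and the fake driver returns a description string instead of touching hardware.

```python
from typing import Callable, Dict, List, Protocol


class MigDriver(Protocol):
    """Hypothetical driver interface; one implementation per GPU manufacturer."""

    def apply(self, gpu_id: str, policy: Dict[str, int]) -> str: ...


class FakeVendorADriver:
    """Stand-in for a real vendor driver; it only reports what it would do."""

    def apply(self, gpu_id: str, policy: Dict[str, int]) -> str:
        return f"vendor-a: GPU {gpu_id} <- {policy}"


# Registry used in Step 410: manufacturer information -> driver factory.
DRIVERS: Dict[str, Callable[[], MigDriver]] = {"vendor-a": FakeVendorADriver}


def configure(
    node_gpus: List[str], manufacturer: str, gpu_id: str, policy: Dict[str, int]
) -> str:
    # Step 410: determine the target driver interface from the manufacturer.
    if manufacturer not in DRIVERS:
        raise KeyError(f"no driver registered for manufacturer {manufacturer!r}")
    driver = DRIVERS[manufacturer]()
    # Step 420: determine the target GPU in the node to be operated.
    if gpu_id not in node_gpus:
        raise LookupError(f"GPU {gpu_id!r} not found on this node")
    # Step 430: call the driver to configure the target policy.
    return driver.apply(gpu_id, policy)


print(configure(["0", "1"], "vendor-a", "1", {"1g.5gb": 7}))
```

Keeping the registry separate from `configure` means support for another manufacturer is added by registering one more driver, without changing the three steps.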
[0140] In an optional example, when it is determined that a configuration action instruction is used to instruct the execution of a second configuration action, the method includes:
[0141] Replace the MIG policy already configured in the target GPU with the target policy.
[0142] Specifically, the second configuration action can be updating the MIG policy. When executing the method step of replacing the MIG policy already configured in the target GPU with the target policy, the MIG policy already configured in the target GPU can be directly replaced with the target policy.
[0143] However, considering that some hardware does not support direct replacement of one MIG policy with another, the method may also proceed as shown in Figure 5, which includes the following steps:
[0144] Step 510: Clear the configured MIG mode in the target GPU and delete the configured MIG policy in the target GPU.
[0145] Step 520: Based on the second identification information, call the target driver interface to run the preset code logic to configure the target strategy into the target GPU.
[0146] That is, restore the GPU to its initial state, then recreate the MIG mode and configure the target policy.
[0147] Optionally, when it is determined that a configuration action instruction is used to instruct the execution of a third configuration action, the method further includes:
[0148] Clear the configured MIG mode in the target GPU and delete the configured MIG policy in the target GPU.
[0149] Specifically, the third configuration action instructs the deletion of the currently configured MIG policy. Therefore, it simply clears the configured MIG mode and deletes the configured MIG policy from the target GPU.
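The three configuration actions can be illustrated with a small dispatcher over an in-memory GPU state. This is a sketch under stated assumptions, not the operator's actual code: the action labels `"create"`, `"update"` and `"delete"`, the `GpuState` class, and the `direct_replace_supported` flag are hypothetical names, and no real driver is called.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class GpuState:
    """In-memory stand-in for one GPU's MIG state; no real driver is called."""

    mig_mode: bool = False
    policy: Optional[Dict[str, int]] = None


def create_policy(gpu: GpuState, policy: Dict[str, int]) -> None:
    """First configuration action: create the MIG mode and configure the policy."""
    gpu.mig_mode = True
    gpu.policy = dict(policy)


def delete_policy(gpu: GpuState) -> None:
    """Third configuration action: delete the policy and clear the MIG mode."""
    gpu.policy = None
    gpu.mig_mode = False


def update_policy(
    gpu: GpuState, policy: Dict[str, int], direct_replace_supported: bool
) -> None:
    """Second configuration action: replace the configured policy."""
    if direct_replace_supported:
        gpu.policy = dict(policy)
    else:
        delete_policy(gpu)          # Step 510: restore the initial state
        create_policy(gpu, policy)  # Step 520: recreate MIG mode, configure policy


def handle_action(gpu, action, policy=None, direct_replace_supported=False):
    """Dispatch on the configuration action instruction (labels are assumed)."""
    if action in ("create", "update") and policy is None:
        raise ValueError(f"action {action!r} requires a target policy")
    if action == "create":
        create_policy(gpu, policy)
    elif action == "update":
        update_policy(gpu, policy, direct_replace_supported)
    elif action == "delete":
        delete_policy(gpu)
    else:
        raise ValueError(f"unknown configuration action {action!r}")


gpu = GpuState()
handle_action(gpu, "create", {"1g.5gb": 7})
handle_action(gpu, "update", {"1g.5gb": 1, "2g.10gb": 3})
print(gpu)
```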
[0150] The above covers the implementation process of the three configuration actions. Figure 6 is a simplified diagram of the overall execution architecture; the specific operation process is not described in detail here.
[0151] In an optional example, the instance information also includes configuration instance status information, and the Operator is also configured with MIG state maintenance logic to maintain the MIG state. Therefore, the method also includes the following steps, as shown in Figure 7:
[0152] Step 710: Filter GPUs that have not completed policy configuration.
[0153] Step 720: When it is determined that the MIG policy in the first GPU is consistent with the target policy, update the status information in the instance information corresponding to the first GPU to "configuration successful".
[0154] Specifically, after it starts running, the Operator can initiate the MIG CRD instance state maintenance logic. Upon detecting MIG operation events, it periodically selects the MIG CRD instances in the cluster whose configuration is incomplete and checks whether the MIG policy in each GPU with incomplete policy configuration matches the target policy. When it is determined that the MIG policy in the first GPU matches the target policy, the status information in the instance information corresponding to the first GPU is updated to "configuration successful". Here, the first GPU can be any one of the GPUs with incomplete policy configuration.
[0155] Step 730: When it is determined that the MIG policy in the first GPU is inconsistent with the target policy, the running status of the code logic running in the first GPU is detected.
[0156] Step 740: When the running status is "running completed", update the status information to "configuration failed".
[0157] or,
[0158] Step 750: When the running status is not completed, after a preset time interval, the running status of the code logic running in the first GPU is checked again; and when the running status is completed, the MIG strategy in the first GPU is checked again to see if it is consistent with the target strategy.
[0159] In other words, if the running status is "running completed", and the MIG policy in the first GPU is inconsistent with the target policy, it means that the configuration has failed. Therefore, the status information needs to be updated to "configuration failed".
[0160] Alternatively, if the code logic has not finished running at this time, a configuration failure cannot be determined directly. It is necessary to wait for the code logic to finish running before further determining whether the MIG policy in the first GPU is consistent with the target policy. If they are consistent, the status information is updated to "configuration successful"; otherwise, the status information is updated to "configuration failed".
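The status maintenance in steps 720 to 750 can be sketched for a single GPU as follows. The callables `read_policy` and `task_finished` stand in for whatever queries the operator actually uses, and the `max_checks` bound with its `TimeoutError` is an added assumption so that the loop always terminates; none of these names come from the source.

```python
import time
from typing import Callable, Dict

SUCCESS = "configuration successful"
FAILED = "configuration failed"


def reconcile_status(
    read_policy: Callable[[], Dict[str, int]],
    task_finished: Callable[[], bool],
    target_policy: Dict[str, int],
    interval_seconds: float = 1.0,
    max_checks: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return the status for one GPU that has not completed policy configuration."""
    # Step 720: the policy on the GPU already matches the target policy.
    if read_policy() == target_policy:
        return SUCCESS
    # Steps 730/750: poll the running status at a preset interval.
    checks = 0
    while not task_finished():
        if checks >= max_checks:
            # Assumption: give up after a bounded number of checks.
            raise TimeoutError("configuration task did not finish in time")
        sleep(interval_seconds)
        checks += 1
    # Steps 740/750: the task has finished, so compare the policy once more.
    # Re-reading here also covers a task that finished between the two checks.
    return SUCCESS if read_policy() == target_policy else FAILED


target = {"1g.5gb": 7}
print(reconcile_status(lambda: {}, lambda: True, target))
```

The injected `sleep` parameter lets the waiting behaviour be replaced in tests without changing the logic.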
[0161] For the detailed logical operation procedure, refer to Figure 8, which is a simplified illustration of the entire state update process. The specific operation procedures have already been explained in detail above, so they are not repeated here.
[0162] The resource configuration method provided in this embodiment of the invention, when a MIG operation event is detected, parses the operation event to obtain instance information. The parsed instance information includes at least a first identifier of the operation node, indicating which GPU on which operation node the MIG configuration operation is to be performed; a configuration instruction, indicating what operation to perform on the operation node; configuration parameter information corresponding to the target GPU in the operation node; and a second identifier corresponding to the target policy to be configured on the target GPU. When it is determined that the configuration action instruction indicates the execution of a first configuration action, a target policy is selected from a pre-configured MIG policy set based on the configuration parameter information and the second identifier. Then, the operation node is determined based on the first identifier, and the target driver interface is called to execute code logic based on the configuration parameter information to configure the target policy on the target GPU. Throughout the process, MIG configuration information is configured into the MIG instance. Then, by listening for instance change events, the target strategy is determined based on the instance information described above. A corresponding scheduling task is then generated to call the target driver interface to execute the underlying logic, completing the MIG processing on the target GPU. The entire process is simple and efficient, providing a flexible and configurable approach that significantly reduces the workload of configuring MIG on the GPU, lowering both labor and time costs. 
Furthermore, because this application uses a MIG strategy configuration set approach, the MIG configuration strategy set can be freely modified externally without altering the original logic within the operator component, minimizing the intrusion of changes in business requirements into the project's original logic.
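To make the parsed instance information concrete, the following sketch maps an event onto a typed record. The event shape (a `type` and an `object` with `spec` and `status`) is loosely modelled on Kubernetes watch events, and every field name inside `spec` (`node`, `action`, `manufacturer`, `gpuId`, `plan`) is a hypothetical choice for illustration rather than the CRD schema of this application.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class InstanceInfo:
    node: str              # first identification information (node to be operated)
    action: str            # configuration action instruction
    manufacturer: str      # configuration parameter: target GPU manufacturer
    gpu_id: str            # configuration parameter: third identification information
    plan: str              # second identification information (target policy)
    status: Optional[str]  # configuration instance status information, if any


def parse_event(event: Dict[str, Any]) -> InstanceInfo:
    """Extract the instance information from one MIG operation event."""
    obj = event["object"]
    spec = obj["spec"]
    return InstanceInfo(
        node=spec["node"],
        action=spec["action"],
        manufacturer=spec["manufacturer"],
        gpu_id=str(spec["gpuId"]),
        plan=spec["plan"],
        status=obj.get("status"),
    )


example_event = {
    "type": "ADDED",
    "object": {
        "spec": {
            "node": "node1",
            "action": "create",
            "manufacturer": "vendor-a",
            "gpuId": 0,
            "plan": "70000",
        }
    },
}
info = parse_event(example_event)
print(info.node, info.gpu_id, info.plan)  # node1 0 70000
```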
[0163] The above are several embodiments of resource configuration methods provided in this application. Other embodiments of resource configuration provided in this application will be described below. Please refer to the following for details.
[0164] Figure 9 shows a resource configuration device provided in an embodiment of the present invention, which includes: a listening module 901, a parsing module 902, a processing module 903, and a selection module 904.
[0165] The listening module 901 is used to listen for operation events of the multi-instance GPU (MIG);
[0166] The parsing module 902 is used to parse the operation event and obtain the instance information when the listening module 901 detects an operation event of the MIG. The instance information includes at least the first identification information of the node to be operated, the configuration action instruction, the configuration parameter information corresponding to the target graphics processing unit (GPU) in the node to be operated, and the second identification information corresponding to the target strategy to be configured for the target GPU.
[0167] Processing module 903 is used to determine the MIG configuration action to be executed based on the configuration action instruction;
[0168] The selection module 904 is used to select a target policy from the pre-configured MIG policy set based on the configuration parameter information and the second identification information when it is determined that the MIG configuration action to be executed is the first configuration action.
[0169] The processing module 903 is also used to determine the node to be operated based on the first identification information; and to call the target driver interface to run preset code logic based on the configuration parameter information, so as to configure the target strategy into the target GPU.
[0170] Optionally, the MIG can be a custom resource type MIG.
[0171] Optionally, the configuration parameter information includes the manufacturer information of the target GPU and the third identification information of the target GPU;
[0172] The processing module 903 is specifically used to determine the target driver interface based on the target GPU manufacturer information; determine the target GPU in the node to be operated based on the third identification information; and call the target driver interface to run preset code logic to configure the target strategy into the target GPU.
[0173] Optionally, the selection module 904 is further configured to match a subset of MIG policies corresponding to the target GPU from the MIG policy set based on the third identification information; and to select a target policy from the MIG policy subset based on the second identification information.
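The two-stage selection can be sketched as two dictionary lookups: first from the target GPU to its policy subset, then from the second identification information to one policy. The `gpu_types` mapping from GPU ID to GPU type is an assumption about how the third identification information is resolved, and the data values are illustrative only.

```python
from typing import Dict

Policy = Dict[str, int]

# Illustrative policy set: {gpu_type: {policy_id: policy}}.
strategy_set: Dict[str, Dict[str, Policy]] = {
    "A100-40G": {
        "70000": {"1g.5gb": 7},
        "13000": {"1g.5gb": 1, "2g.10gb": 3},
    }
}

# Assumed lookup from the third identification information (GPU ID) to GPU type.
gpu_types: Dict[str, str] = {"0": "A100-40G"}


def select_target_policy(
    policies: Dict[str, Dict[str, Policy]],
    types_by_gpu: Dict[str, str],
    gpu_id: str,
    policy_id: str,
) -> Policy:
    # Stage 1: match the policy subset corresponding to the target GPU.
    if gpu_id not in types_by_gpu:
        raise KeyError(f"unknown GPU {gpu_id!r}")
    subset = policies.get(types_by_gpu[gpu_id])
    if subset is None:
        raise KeyError(f"no policy subset for GPU type {types_by_gpu[gpu_id]!r}")
    # Stage 2: select the target policy by the second identification information.
    if policy_id not in subset:
        raise KeyError(f"policy {policy_id!r} is not defined for this GPU type")
    return subset[policy_id]


print(select_target_policy(strategy_set, gpu_types, "0", "13000"))
```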
[0174] Optionally, the MIG policy set is represented as a configmap object.
[0175] Optionally, the MIG policy set includes: a first data structure and a second data structure;
[0176] The first data structure includes: at least one first type field, at least one second type field corresponding to each first type field, and at least one count combination field corresponding to each first type field;
[0177] The second data structure includes a policy subset corresponding to each first type field, wherein each policy in the policy subset is composed of the field value of any second type field corresponding to the first type field and the field value of any count combination field.
[0178] The first type field indicates the GPU type; the second type field indicates the sub-resource type corresponding to the first type field; and the count combination field indicates the quantity of each sub-resource type.
[0179] Optionally, the processing module 903 is also used to create a processing task container for executing the target policy configuration task;
[0180] By using the processing task container, the target driver interface is called to run preset code logic, which is used to configure the target strategy into the target GPU.
[0181] Optionally, the first configuration action includes: creating a MIG mode and configuring the MIG policy.
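One possible way to realise the processing task container in a Kubernetes cluster is a one-shot Job pinned to the node to be operated. The source does not state that a Job is used, so this is an assumption; the image name, container name, and command-line flags below are placeholders, and the manifest is built as a plain dictionary without any client library.

```python
import json
from typing import Any, Dict


def build_config_job(
    node: str,
    gpu_id: int,
    plan: str,
    image: str = "example.invalid/mig-configurator:latest",  # placeholder image
) -> Dict[str, Any]:
    """Build a Job manifest whose single container would configure one GPU."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"mig-config-{node}-gpu{gpu_id}"},
        "spec": {
            "backoffLimit": 0,  # do not retry automatically
            "template": {
                "spec": {
                    "nodeName": node,  # run on the node to be operated
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "mig-configurator",
                            "image": image,
                            # Hypothetical flags understood by the placeholder image.
                            "args": ["--gpu-id", str(gpu_id), "--plan", plan],
                        }
                    ],
                }
            },
        },
    }


job = build_config_job("node1", 0, "70000")
print(json.dumps(job, indent=2))
```

Running the configuration in its own container keeps the driver calls out of the operator process, so a failed configuration task does not take the operator down with it.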
[0182] Optionally, the processing module 903 is also configured to replace the MIG policy already configured in the target GPU with the target policy when it is determined that the MIG configuration action to be executed is the second configuration action.
[0183] Optionally, the second configuration action includes updating the MIG policy.
[0184] Optionally, the processing module 903 is specifically used to clear the configured MIG mode in the target GPU and delete the configured MIG policy in the target GPU;
[0185] Based on the second identifier information, the target driver interface is invoked to run the preset code logic, which is used to configure the target strategy into the target GPU.
[0186] Optionally, the processing module 903 is further configured to clear the configured MIG mode in the target GPU and delete the configured MIG policy in the target GPU when it is determined that the MIG configuration action to be executed is the third configuration action.
[0187] Optionally, the third configuration action includes: deleting the currently configured MIG policy.
[0188] Optionally, the instance information may also include the status information of the configuration instance;
[0189] Processing module 903 is also used to filter GPUs that have not completed policy configuration;
[0190] When it is determined that the MIG policy in the first GPU is consistent with the target policy, the status information in the instance information corresponding to the first GPU is updated to "configuration successful". Here, the first GPU is any GPU among the GPUs that have not completed policy configuration.
[0191] Optionally, the processing module 903 is further configured to detect the running status of the code logic running in the first GPU when it is determined that the MIG policy in the first GPU is inconsistent with the target policy;
[0192] When the running status is "running completed", the status information is updated to "configuration failed".
[0193] Alternatively, when the running status is "not completed", the running status of the code logic running in the first GPU is checked again after a preset time interval;
[0194] And when the running status is "running completed", check again whether the MIG policy in the first GPU is consistent with the target policy.
[0195] Optionally, the listening module 901 is specifically used to listen to MIG operation events in real time using a pre-registered MIG listener.
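The idea of a pre-registered listener can be illustrated with a small in-process event bus: a handler is registered once for the MIG resource kind and is then invoked for every matching event. The `EventBus` class and the kind label `"MIG"` are hypothetical; a real operator would typically receive events from the cluster API server rather than from an in-memory dispatcher.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List

Handler = Callable[[Dict[str, Any]], None]


class EventBus:
    """Minimal dispatcher: handlers are registered per resource kind."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def register(self, kind: str, handler: Handler) -> None:
        """Pre-register a listener for one resource kind."""
        self._handlers[kind].append(handler)

    def dispatch(self, kind: str, event: Dict[str, Any]) -> int:
        """Deliver an event to every listener of that kind; return how many ran."""
        handlers = self._handlers.get(kind, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


received: List[Dict[str, Any]] = []
bus = EventBus()
bus.register("MIG", received.append)

bus.dispatch("MIG", {"type": "ADDED", "name": "mig-node1-gpu0"})
bus.dispatch("Pod", {"type": "ADDED", "name": "unrelated"})  # no MIG listener runs
print(received)
```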
[0196] The functions performed by each component in the resource configuration device provided in the embodiments of the present invention have been described in detail in any of the above method embodiments, and therefore will not be repeated here.
[0197] This invention provides a resource configuration device that, upon detecting a MIG operation event, parses the operation event to obtain instance information. The parsed instance information includes at least a first identifier of the operation node, indicating which GPU on which operation node the MIG configuration operation is to be performed; a configuration action instruction, indicating what operation to perform on the operation node; configuration parameter information corresponding to the target GPU in the operation node; and a second identifier corresponding to the target policy to be configured on the target GPU. When the configuration action instruction is determined to be an instruction to execute a first configuration action, a target policy is selected from a pre-configured MIG policy set based on the configuration parameter information and the second identifier. Then, the operation node is determined based on the first identifier, and the target driver interface is called to execute code logic based on the configuration parameter information to configure the target policy onto the target GPU. Throughout the process, MIG configuration information is configured into the MIG instance. Then, by listening for instance change events, the target strategy is determined based on the instance information described above. A corresponding scheduling task is then generated to call the target driver interface to execute the underlying logic, completing the MIG processing on the target GPU. The entire process is simple and efficient, providing a flexible and configurable approach that significantly reduces the workload of configuring MIG on the GPU, lowering both labor and time costs.
Furthermore, because this application uses a MIG strategy configuration set approach, the MIG configuration strategy set can be freely modified externally without altering the original logic within the operator component, minimizing the intrusion of changes in business requirements into the project's original logic.
[0198] As shown in Figure 10, this application provides an electronic device including a processor 111, a communication interface 112, a memory 113, and a communication bus 114, wherein the processor 111, the communication interface 112, and the memory 113 communicate with each other through the communication bus 114.
[0199] Memory 113 is used to store computer programs;
[0200] In one embodiment of this application, when the processor 111 executes a program stored in the memory 113, it implements the resource configuration method provided in any of the foregoing method embodiments, including:
[0201] When an operation event of a multi-instance GPU (MIG) is detected, the operation event is parsed to obtain instance information. The instance information includes at least the first identification information of the node to be operated, the configuration action instruction, the configuration parameter information corresponding to the target graphics processing unit (GPU) in the node to be operated, and the second identification information corresponding to the target policy to be configured for the target GPU.
[0202] When it is determined that the configuration action instruction is used to instruct the execution of the first configuration action, the target policy is selected from the pre-configured MIG policy set according to the configuration parameter information and the second identification information.
[0203] Based on the first identification information, determine the node to be operated;
[0204] Based on the configuration parameter information, the target driver interface is called to run the preset code logic, which is used to configure the target strategy into the target GPU.
[0205] Optionally, the MIG can be a custom resource type MIG.
[0206] Optionally, the configuration parameter information includes the manufacturer information of the target GPU and the third identification information of the target GPU;
[0207] Based on the configuration parameter information, the target driver interface is invoked to run preset code logic, which is used to configure the target strategy into the target GPU, including:
[0208] Determine the target driver interface based on the target GPU manufacturer information;
[0209] Based on the third identification information, the target GPU in the node to be operated is determined;
[0210] Call the target driver interface to run preset code logic to configure the target strategy into the target GPU.
[0211] Optionally, when it is determined that a configuration action instruction is used to instruct the execution of the first configuration action, a target policy is selected from the pre-configured MIG policy set based on the configuration parameter information and the second identification information, including:
[0212] Based on the third identification information, match the subset of MIG policies corresponding to the target GPU from the MIG policy set;
[0213] Based on the second identifier information, select the target policy from the MIG policy subset.
[0214] Optionally, the MIG policy set is represented as a configmap object.
[0215] Optionally, the MIG policy set includes: a first data structure and a second data structure;
[0216] The first data structure includes: at least one first type field, at least one second type field corresponding to each first type field, and at least one count combination field corresponding to each first type field;
[0217] The second data structure includes a policy subset corresponding to each first type field, wherein each policy in the policy subset is composed of the field value of any second type field corresponding to the first type field and the field value of any count combination field.
[0218] The first type field indicates the GPU type; the second type field indicates the sub-resource type corresponding to the first type field; and the count combination field indicates the quantity of each sub-resource type.
[0219] Optionally, before calling the target driver interface to run preset code logic to configure the target strategy into the target GPU, the method also includes:
[0220] Create a processing task container to execute the target policy configuration task;
[0221] By using the processing task container, the target driver interface is called to run preset code logic, which is used to configure the target strategy into the target GPU.
[0222] Optionally, the first configuration action includes: creating a MIG mode and configuring the MIG policy.
[0223] Optionally, when it is determined that a configuration action instruction is used to instruct the execution of a second configuration action, the method includes:
[0224] Replace the MIG policy already configured in the target GPU with the target policy.
[0225] Optionally, the second configuration action includes updating the MIG policy.
[0226] Optionally, replacing the MIG policy already configured in the target GPU with the target policy includes:
[0227] Clear the configured MIG mode in the target GPU and delete the configured MIG policy in the target GPU;
[0228] Based on the second identifier information, the target driver interface is invoked to run the preset code logic, which is used to configure the target strategy into the target GPU.
[0229] Optionally, when it is determined that a configuration action instruction is used to instruct the execution of a third configuration action, the method further includes:
[0230] Clear the configured MIG mode in the target GPU and delete the configured MIG policy in the target GPU.
[0231] Optionally, the third configuration action includes: deleting the currently configured MIG policy.
[0232] Optionally, the instance information may also include configuration instance status information, and the method may further include:
[0233] Filter out GPUs with incomplete policy configuration;
[0234] When it is determined that the MIG policy in the first GPU is consistent with the target policy, the status information in the instance information corresponding to the first GPU is updated to "configuration successful". Here, the first GPU is any GPU among the GPUs that have not completed policy configuration.
[0235] Optionally, when it is determined that the MIG policy in the first GPU is inconsistent with the target policy, the running status of the code logic running in the first GPU is detected;
[0236] When the running status is "running completed", the status information is updated to "configuration failed".
[0237] Alternatively, when the running status is "not completed", the running status of the code logic running in the first GPU is checked again after a preset time interval;
[0238] And when the running status is "running completed", check again whether the MIG policy in the first GPU is consistent with the target policy.
[0239] Optionally, listening for MIG operation events includes:
[0240] Use a pre-registered MIG listener to monitor MIG operation events in real time.
[0241] This application also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of the resource configuration method provided in any of the foregoing method embodiments.
[0242] It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes the element.
[0243] The above are merely specific embodiments of the present invention, enabling those skilled in the art to understand or implement the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the present invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features claimed herein.
Claims
1. A resource allocation method, characterized in that the method includes: when an operation event of a multi-instance GPU (MIG) is detected, the operation event is parsed to obtain instance information, wherein the instance information includes at least the first identification information of the node to be operated, the configuration action instruction, the configuration parameter information corresponding to the target graphics processing unit (GPU) in the node to be operated, and the second identification information corresponding to the target strategy to be configured for the target GPU; when it is determined that the configuration action instruction is used to instruct the execution of the first configuration action, the target policy is selected from the pre-configured MIG policy set according to the configuration parameter information and the second identification information; the node to be operated is determined based on the first identification information; and, based on the configuration parameter information, the target driver interface is invoked to run preset code logic in order to configure the target strategy into the target GPU.
2. The method according to claim 1, characterized in that, The configuration parameter information includes the manufacturer information of the target GPU and the third identification information of the target GPU; The step of calling the target driver interface to run preset code logic based on the configuration parameter information, in order to configure the target strategy into the target GPU, includes: Based on the manufacturer information of the target GPU, determine the target driver interface; Based on the third identification information, the target GPU in the node to be operated is determined; The target driver interface is invoked to run preset code logic, which is used to configure the target strategy into the target GPU.
3. The method according to claim 2, characterized in that, When it is determined that the configuration action instruction is used to instruct the execution of the first configuration action, the step of selecting the target policy from the pre-configured MIG policy set according to the configuration parameter information and the second identification information includes: Based on the third identification information, a subset of MIG policies corresponding to the target GPU is matched from the MIG policy set; The target policy is selected from the MIG policy subset based on the second identification information.
4. The method according to claim 3, characterized in that, The MIG strategy set includes: a first data structure and a second data structure; The first data structure includes: at least one first type field, at least one second type field corresponding to each first type field, and at least one count combination field corresponding to each first type field; The second data structure includes a policy subset corresponding to each of the first type fields, wherein each policy in the policy subset is composed of the field value of any second type field corresponding to the first type field and the field value of any count combination field. The first type field indicates the GPU type; the second type field indicates the sub-resource type corresponding to the first type field; and the count combination field indicates the quantity of each sub-resource type.
5. The method according to claim 2, characterized in that, Before calling the target driver interface to run preset code logic to configure the target strategy to the target GPU, the method further includes: Create a processing task container to execute the target policy configuration task; Using the processing task container, the target driver interface is called to run preset code logic, which is used to configure the target strategy into the target GPU.
6. The method according to claim 1, characterized in that, When it is determined that the configuration action instruction is used to instruct the execution of a second configuration action, the method includes: Replace the MIG policy already configured in the target GPU with the target policy.
7. The method according to claim 6, characterized in that, The step of replacing the MIG policy already configured in the target GPU with the target policy includes: Clear the configured MIG mode in the target GPU and delete the configured MIG policy in the target GPU; Based on the second identification information, the target driver interface is invoked to run preset code logic in order to configure the target strategy into the target GPU.
8. The method according to claim 1, characterized in that, When it is determined that the configuration action instruction is used to instruct the execution of a third configuration action, the method further includes: Clear the configured MIG mode in the target GPU and delete the configured MIG policy in the target GPU.
9. The method according to any one of claims 1-8, characterized in that, The instance information also includes configuration instance status information, and the method further includes: Filter out GPUs with incomplete policy configuration; When it is determined that the MIG policy in the first GPU is consistent with the target policy, the status information in the instance information corresponding to the first GPU is updated to "configuration successful". Here, the first GPU is any GPU among the GPUs that have not completed policy configuration.
10. The method according to claim 9, characterized in that, when it is determined that the MIG policy in the first GPU is inconsistent with the target policy, the method further includes: detecting the running status of the code logic running in the first GPU; when the running status is "running completed", the status information is updated to "configuration failed"; or, when the running status is "not completed", the running status of the code logic running in the first GPU is detected again after a preset time interval, and when the running status is "running completed", it is checked again whether the MIG policy in the first GPU is consistent with the target policy.
11. A resource allocation device, characterized in that the device includes: a listening module, used to listen for operation events of the multi-instance GPU (MIG); a parsing module, used to parse the operation event and obtain instance information when the listening module detects an operation event of the MIG, wherein the instance information includes at least the first identification information of the node to be operated, the configuration action instruction, the configuration parameter information corresponding to the target graphics processing unit (GPU) in the node to be operated, and the second identification information corresponding to the target strategy to be configured for the target GPU; a processing module, used to determine the MIG configuration action to be executed based on the configuration action instruction; and a selection module, used to select the target policy from the pre-configured MIG policy set according to the configuration parameter information and the second identification information when it is determined that the MIG configuration action to be executed is the first configuration action; wherein the processing module is further configured to determine the node to be operated based on the first identification information, and to call the target driver interface to run preset code logic based on the configuration parameter information, so as to configure the target strategy into the target GPU.
12. An electronic device, characterized in that, It includes a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus; Memory, used to store computer programs; A processor, when executing a program stored in memory, implements the steps of the resource allocation method according to any one of claims 1-10.
13. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by a processor, it implements the steps of the resource allocation method as described in any one of claims 1-10.