Universal AI heterogeneous computing power dynamic scheduling and unified management method and system
By constructing a unified resource description object and task stage profile, a heterogeneous compatibility matrix is generated for scheduling, which solves the problem of incompatibility of operating environments for cross-domain AI tasks, optimizes resource utilization and system stability, and achieves efficient AI task scheduling and resource management.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANGHAI BANGCHENG TECHNOLOGY DEVELOPMENT GROUP CO LTD
- Filing Date
- 2026-05-22
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies suffer from incompatibility issues in cross-domain AI task scheduling due to differences in chip architecture, driver stack, and operator support capabilities. They also lack phased resource modeling, leading to increased task completion latency and resource consumption. Furthermore, they lack memory fragmentation prediction and suppression, which affects system stability.
By constructing a unified resource description object and task stage profile, a heterogeneous compatibility matrix is generated, and domain-level and node-level scheduling is performed to achieve compatibility judgment and resource demand matching between task stages and candidate nodes. Furthermore, through memory fragmentation prediction and suppression control, scheduling decisions and resource utilization are optimized.
It improves the rationality of scheduling results, reduces the risk of failure caused by environmental incompatibility, reduces the time for repeated task loading and initialization, and improves resource utilization and system stability.
Figure CN122240340A_ABST
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence technology, and specifically to a method and system for dynamic scheduling and unified management of heterogeneous AI computing power across the entire domain.
Background Technology
[0002] With the rapid development of large-scale model training, fine-tuning, inference, distillation, evaluation, and multi-agent collaborative tasks, the demand for computing resources in AI tasks continues to grow. The deployment model of computing power has also evolved from a single data center and single chip architecture to a collaborative supply model across regions, clusters, vendors, and architectures. In practical applications, computing resources are typically distributed across multiple data centers, edge nodes, private clouds, public clouds, and industry-specific network environments, forming multiple computing domains. Within different computing domains, different types of processors such as CPUs, GPUs, NPUs, and FPGAs may be deployed simultaneously, exhibiting significant differences in chip architecture, driver versions, container runtimes, operator support capabilities, memory allocation methods, network topology, and storage configurations.
[0003] While existing technologies can achieve heterogeneous resource access, resource pooling, and unified monitoring, most remain at the level of resource display, task submission, basic orchestration, and alarm management. They typically allocate resources based solely on metrics such as CPU utilization, GPU utilization, memory utilization, node idle status, or task priority. Such solutions often treat AI tasks as a whole for coarse-grained processing, lacking joint modeling and control of differences in AI task execution stages, data location dependencies, runtime environment compatibility, state transition costs, and memory fragmentation trends.
[0004] In actual operation, the existing technologies described above have at least the following shortcomings. First, chip architectures, driver stacks, operator support capabilities, and checkpoint formats differ across computing domains. When tasks switch domains, recover from disasters, or migrate loads, problems such as incompatible runtime environments, images that cannot be reused directly, or failed state restoration can easily arise. Second, the scheduling process typically does not include model images, local data caches, and checkpoint locations in a unified evaluation, which easily leads to repeated loading, transmission, and initialization and increases task completion latency and resource consumption. Third, AI tasks at different execution stages have different requirements for computing resources, network bandwidth, video memory capacity, topology affinity, and storage bandwidth, yet existing solutions lack stage-based resource modeling, so scheduling results do not match the actual running characteristics of tasks. Fourth, existing resource management systems mostly track only video memory utilization and do not predict or suppress the video memory fragmentation rate, remaining slice shape, or fragmentation evolution trend, so resources can appear idle yet be unable to load new tasks. Fifth, in scenarios such as node overheating, driver anomalies, link jitter, or hardware failures, existing solutions mostly rely on full-checkpoint restarts or task requeuing, which results in long recovery latency and affects the continuous execution of tasks and the overall throughput of the platform.
Summary of the Invention
[0005] The purpose of this invention is to provide a method and system for dynamic scheduling and unified management of heterogeneous AI computing power across the entire domain, so as to solve the problems mentioned in the background art.
[0006] To achieve the above objective, the present invention provides the following technical solution. In a first aspect, the present invention provides a method for dynamic scheduling and unified management of heterogeneous AI computing power across the entire domain, including:
obtaining resource capability information and operating status information of heterogeneous computing power nodes in multiple computing power domains, and constructing a unified resource description object corresponding to each heterogeneous computing power node;
analyzing the AI task to be executed and generating a task stage profile that represents the resource requirements and operational constraints of each execution stage of the AI task;
constructing, based on the unified resource description object and the task stage profile, a heterogeneous compatibility matrix between task stages and candidate nodes;
performing, based on the heterogeneous compatibility matrix, domain-level scheduling on multiple candidate computing power domains to determine the main execution domain and the backup execution domain, and performing node-level scheduling within the main execution domain to determine the target execution node;
controlling the target execution node to execute the AI task and collecting operation feedback information during task execution;
when the operation feedback information meets the preset rescheduling conditions, generating a unified execution state package and an incremental state block corresponding to the AI task, and migrating the unified execution state package and the incremental state block to a recovery node so that the recovery node can recover and continue to execute the AI task; and
performing memory fragmentation prediction and suppression control based on the memory fragmentation status of the heterogeneous computing power nodes.
[0007] Preferably, the unified resource description object includes at least one of the following: node identifier, computing domain identifier, chip type, chip architecture, chip manufacturer, driver version, container runtime version, supported operator set, supported precision mode, total video memory, video memory bandwidth, video memory partitioning granularity, current video memory utilization, video memory fragmentation rate, number of CPU cores, main memory capacity, storage bandwidth, inter-node interconnect type, inter-domain link latency, model image cache status, dataset location identifier, checkpoint format support capability, device temperature, power consumption, and abnormal status identifier.
[0008] Preferably, the task stage profile includes at least one of the following: task type, stage sequence, expected duration of each stage, computing resource requirements of each stage, video memory requirements of each stage, network requirements of each stage, required operator set, required precision mode, required runtime environment, task interruptibility, state preservation granularity, recovery method requirements, and cache reuse preferences. The stage sequence includes at least one of the following: data preprocessing stage, model loading stage, forward computation stage, backward computation stage, parameter update stage, checkpoint saving stage, inference output stage, and cache maintenance stage.
[0009] Preferably, constructing the heterogeneous compatibility matrix between task stages and candidate nodes includes: for each task stage and each candidate node, calculating the chip architecture compatibility, operator set compatibility, runtime environment compatibility, precision mode compatibility, checkpoint loading compatibility, image conversion requirements, checkpoint format conversion requirements, local model image reuse capability, local data cache reuse capability, and video memory slice conflict risk between the task stage and the candidate node; and generating the heterogeneous compatibility matrix from these results. The matrix elements in the heterogeneous compatibility matrix include direct compatibility markers and additional cost markers. A direct compatibility marker indicates whether the corresponding task stage can be executed directly on the corresponding candidate node, and an additional cost marker indicates the additional cost of performing image conversion, state adaptation, or data transfer.
[0010] Preferably, the step of performing domain-level scheduling on multiple candidate computing power domains based on the heterogeneous compatibility matrix, and performing node-level scheduling within the main execution domain, includes: calculating the comprehensive scheduling cost of each candidate computing power domain based on the predicted queuing time, predicted execution time, cross-domain data transfer time, image conversion time and/or checkpoint conversion time, memory fragmentation penalty, energy consumption penalty, and cache reuse benefit; determining the candidate computing power domain with the lowest comprehensive scheduling cost as the main execution domain, and the candidate computing power domain with the second lowest comprehensive scheduling cost as the backup execution domain; and, within the main execution domain, sorting candidate nodes based on the heterogeneous compatibility matrix, current node load, memory fragmentation rate, remaining slice shape, node interconnection topology, and task stage characteristics, and selecting the target execution node. The cache reuse benefit is related to at least one of the following: model image hit rate, dataset local hit rate, and checkpoint local hit rate.
[0011] Preferably, generating the unified execution state package and the incremental state block corresponding to the AI task when the operation feedback information meets the preset rescheduling conditions, and migrating them to the recovery node, includes: triggering rescheduling when at least one of the following occurs: the actual throughput of the task deviates from the predicted throughput by more than a first threshold; the memory fragmentation rate of the target execution node exceeds a second threshold; the device temperature or power consumption exceeds a third threshold; the inter-domain link latency or jitter exceeds a fourth threshold; or a node driver anomaly, hardware failure, abnormal job exit, or runtime incompatibility event occurs. The current task state is encapsulated into a unified execution state package, and an incremental state block is generated from the task state that has changed since the most recent state record. The unified execution state package and the incremental state block are transmitted to the recovery node, so that the recovery node performs runtime environment matching, checkpoint loading, incremental state merging, and memory mapping reconstruction based on the heterogeneous compatibility matrix. The unified execution state package includes at least one of the following: model weight index, optimizer state index, current training round or step, random number seed information, cache mapping information, intermediate tensor mapping summary, operator compatibility identifier, device memory mapping summary, the location of the most recent complete checkpoint, and the location of the most recent incremental state block.
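The incremental state block described above can be illustrated with a minimal sketch. It assumes the task state is a flat dictionary of named byte entries and detects changes by content hash; the function names `make_incremental_block` and `merge_incremental_block`, the SHA-256 digest scheme, and the sample keys are all hypothetical and are not specified by the source.

```python
import hashlib
from typing import Dict, Optional


def digest(value: bytes) -> str:
    """Content hash used to detect which state entries changed."""
    return hashlib.sha256(value).hexdigest()


def make_incremental_block(
    last_digests: Dict[str, str], current_state: Dict[str, bytes]
) -> Dict[str, Optional[bytes]]:
    """Return only the entries that changed since the most recent state record.

    A value of None marks an entry that no longer exists.
    """
    block: Dict[str, Optional[bytes]] = {}
    for key, value in current_state.items():
        if last_digests.get(key) != digest(value):
            block[key] = value
    for key in last_digests:
        if key not in current_state:
            block[key] = None
    return block


def merge_incremental_block(
    checkpoint_state: Dict[str, bytes], block: Dict[str, Optional[bytes]]
) -> Dict[str, bytes]:
    """Rebuild the latest state on the recovery node from checkpoint + block."""
    merged = dict(checkpoint_state)
    for key, value in block.items():
        if value is None:
            merged.pop(key, None)
        else:
            merged[key] = value
    return merged


# Hypothetical state: the weights are unchanged, the optimizer state and step moved on.
checkpoint = {"weights/layer0": b"w0", "optimizer/m": b"m0", "step": b"100"}
recorded = {key: digest(value) for key, value in checkpoint.items()}
current = {"weights/layer0": b"w0", "optimizer/m": b"m1", "step": b"120"}

block = make_incremental_block(recorded, current)
restored = merge_incremental_block(checkpoint, block)
```

In this sketch only the two changed entries are transmitted, which is the mechanism by which an incremental block can reduce the amount of state transferred compared with resending a complete checkpoint.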
[0012] Preferably, performing memory fragmentation prediction and suppression control based on the memory fragmentation status of the heterogeneous computing power nodes includes: predicting the evolution trend of memory fragmentation within a preset time window based on the memory fragmentation rate, remaining slice shape, historical stage switching pattern, and task loading and unloading sequence of the heterogeneous computing power nodes; and, when the prediction indicates that memory fragmentation is deteriorating, performing at least one of the following control operations: delaying the loading of new tasks, diverting short-duration tasks to adjacent nodes, migrating interruptible tasks out at stage boundaries, performing memory defragmentation, and adjusting the slice specifications of subsequent tasks. In addition, the expected duration of each task stage, the parameter weights in the comprehensive scheduling cost, and the node compatibility evaluation results are revised based on the actual execution results of the task.
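As a rough illustration of trend prediction within a time window, the sketch below fits a least-squares line to recent fragmentation-rate samples and extrapolates it. The source does not specify a prediction model, so the linear fit is a swapped-in stand-in. The fragmentation-rate definition (one minus the largest free block divided by total free memory), the function names, and the action labels are likewise assumptions.

```python
from typing import List, Sequence


def fragmentation_rate(free_blocks: Sequence[int]) -> float:
    """Assumed definition: 1 - largest free block / total free memory."""
    total = sum(free_blocks)
    if total == 0:
        return 0.0
    return 1.0 - max(free_blocks) / total


def predict_rate(samples: Sequence[float], steps_ahead: int) -> float:
    """Extrapolate equally spaced samples with a least-squares line."""
    n = len(samples)
    if n < 2:
        return samples[-1] if samples else 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    var = sum((i - mean_x) ** 2 for i in range(n))
    slope = cov / var
    predicted = mean_y + slope * ((n - 1 + steps_ahead) - mean_x)
    return min(1.0, max(0.0, predicted))  # a rate stays within [0, 1]


def choose_actions(predicted: float, current: float, limit: float) -> List[str]:
    """Return suppression operations only when the forecast is worsening past the limit."""
    if predicted <= current or predicted < limit:
        return []
    # Two of the control operations listed in the text; the choice is illustrative.
    return ["delay_new_task_loading", "defragment_memory"]


history = [0.20, 0.25, 0.30, 0.35]  # hypothetical samples, oldest first
forecast = predict_rate(history, steps_ahead=3)
actions = choose_actions(forecast, history[-1], limit=0.45)
```

A production predictor would presumably also use the remaining slice shape and the task loading and unloading sequence mentioned in the text; this sketch uses the rate history alone.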
[0013] In a second aspect, this invention provides a system for dynamic scheduling and unified management of heterogeneous AI computing power across the entire domain, comprising:
a resource access and registration module, used to obtain resource capability information and operating status information of heterogeneous computing power nodes in multiple computing power domains, and construct a unified resource description object corresponding to each heterogeneous computing power node;
a task analysis and stage profiling module, used to analyze the AI task to be executed and generate a task stage profile that represents the resource requirements and operational constraints of each execution stage of the AI task;
a heterogeneous compatibility matrix generation module, used to construct a heterogeneous compatibility matrix between task stages and candidate nodes based on the unified resource description object and the task stage profile;
a scheduling decision module, used to perform domain-level scheduling on multiple candidate computing power domains based on the heterogeneous compatibility matrix, determine the main execution domain and the backup execution domain, and perform node-level scheduling within the main execution domain to determine the target execution node;
an operation monitoring module, used to control the target execution node to execute the AI task and collect operation feedback information during task execution;
a state encapsulation and migration recovery module, used to generate a unified execution state package and an incremental state block corresponding to the AI task when the operation feedback information meets the preset rescheduling conditions, and migrate the unified execution state package and the incremental state block to the recovery node so that the recovery node can recover and continue to execute the AI task; and
a memory fragmentation prediction and suppression module, used to perform memory fragmentation prediction and suppression control based on the memory fragmentation status of the heterogeneous computing power nodes.
[0014] In a third aspect, the present invention provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method of any implementation of the first aspect.
[0015] In a fourth aspect, the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method of any implementation of the first aspect.
[0016] Compared with existing technologies, the beneficial effects of this invention are as follows. By constructing a unified resource description object, this invention achieves a unified representation of heterogeneous computing power nodes across different computing power domains, chip architectures, and operating environments, providing a foundation for unified management and collaborative scheduling of AI computing power resources across the entire domain. Generating task stage profiles characterizes the resource requirements and operational constraints of AI tasks at each execution stage in fine detail, so that scheduling decisions align with the actual operational characteristics of the tasks. Constructing a heterogeneous compatibility matrix determines the compatibility relationship, conversion requirements, and migration costs between task stages and candidate nodes before scheduling, reducing the risk of scheduling failures caused by environmental incompatibility. The two-level dynamic scheduling mechanism, which combines domain-level and node-level scheduling, jointly considers queuing time, execution time, data transmission time, conversion time, memory fragmentation status, energy consumption, and cache reuse benefit, improving the rationality of scheduling results. The migration and recovery mechanism based on the unified execution state package and incremental state block reduces the amount of state transmitted and shortens recovery time when nodes become abnormal or their performance degrades. Memory fragmentation prediction and suppression control reduces cases in which resources are idle but unusable, improving the utilization of heterogeneous computing power resources and the stability of system operation.
Attached Figure Description
[0017] Figure 1 is the overall flowchart of the global AI heterogeneous computing power dynamic scheduling and unified management method of the present invention.
[0018] Figure 2 is a schematic diagram of the overall system architecture of the present invention.
[0019] Figure 3 is a flowchart of generating the heterogeneous compatibility matrix.
[0020] Figure 4 is a flowchart of the two-level scheduling process: domain-level scheduling and node-level scheduling.
[0021] Figure 5 is a flowchart of generating and recovering the unified execution state package and the incremental state block.
[0022] Figure 6 is a flowchart of memory fragmentation prediction and suppression.
[0023] Figure 7 is a block diagram of the system modules of the present invention.
[0024] Figure 8 is a schematic diagram of the electronic device of the present invention.
Detailed Implementation
[0025] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0026] In this specification, "global domain" refers to a deployment environment in which multiple computing power domains collaborate in computing power supply and task execution; "heterogeneous computing power node" refers to a computing node that uses different types of processors, different chip architectures, different driver environments, or different runtime environments; "unified resource description object" refers to a structured data object used to characterize the resource capabilities and operating status of heterogeneous computing power nodes; "task stage profile" refers to a structured data object used to characterize the resource requirements and operating constraints of AI tasks at different execution stages; "heterogeneous compatibility matrix" refers to a matrix structure used to characterize the compatibility relationship and additional costs between task stages and candidate nodes; "unified execution state package" refers to a standardized state encapsulation object used to migrate and restore task state; and "incremental state block" refers to a set of data blocks that have changed relative to the most recent state record.
[0027] As shown in Figure 2, this embodiment provides a dynamic scheduling and unified management system for heterogeneous AI computing power, which includes at least a resource access and registration module 101, a task analysis and stage profiling module 102, a heterogeneous compatibility matrix generation module 103, a scheduling decision module 104, an operation monitoring module 105, a state encapsulation and migration recovery module 106, and a memory fragmentation prediction and suppression module 107.
[0028] The resource access and registration module 101 is used to access heterogeneous computing power nodes in multiple computing power domains, collect resource capability information and operating status information of each heterogeneous computing power node, and construct a unified resource description object corresponding to each heterogeneous computing power node.
[0029] The task analysis and stage profiling module 102 is used to analyze the AI task to be executed and generate a task stage profile that represents the resource requirements and operational constraints of each execution stage of the AI task.
[0030] The heterogeneous compatibility matrix generation module 103 is used to construct a heterogeneous compatibility matrix between task stages and candidate nodes based on the unified resource description object and the task stage profile.
[0031] The scheduling decision module 104 is used to perform domain-level scheduling on multiple candidate computing power domains based on the heterogeneous compatibility matrix, determine the main execution domain and the backup execution domain, and perform node-level scheduling within the main execution domain to determine the target execution node.
[0032] The operation monitoring module 105 is used to control the target execution node to execute the AI task and collect operation feedback information during the task operation process.
[0033] The state encapsulation and migration recovery module 106 is used to generate a unified execution state package and an incremental state block corresponding to the AI task when the operation feedback information meets the preset rescheduling conditions, and migrate the unified execution state package and the incremental state block to the recovery node so that the recovery node can recover and continue to execute the AI task.
[0034] The memory fragmentation prediction and suppression module 107 is used to perform memory fragmentation prediction and suppression control based on the memory fragmentation status of heterogeneous computing nodes.
[0035] In an optional implementation, the system may further include a unified management output interface for providing a unified resource view, unified alarm information, unified audit logs, and a unified policy configuration interface to upper-layer business systems. This part can be set as an extension module and does not affect the implementation of the core technical solution of this invention.
[0036] As shown in Figure 1, this embodiment provides a method for dynamic scheduling and unified management of heterogeneous AI computing power across the entire domain, including the following steps.
[0037] Step S101: Obtain heterogeneous computing power node information and construct a unified resource description object. In this embodiment, the resource access and registration module 101 obtains the resource capability information and operating status information of heterogeneous computing power nodes from multiple computing power domains, and constructs a unified resource description object corresponding to each heterogeneous computing power node.
[0038] In a preferred embodiment, the unified resource description object includes at least one of the following: node identifier, computing domain identifier, chip type, chip architecture, chip manufacturer, driver version, container runtime version, supported operator set, supported precision mode, total video memory, video memory bandwidth, video memory partitioning granularity, current video memory utilization, video memory fragmentation rate, number of CPU cores, main memory capacity, storage bandwidth, inter-node interconnection type, inter-domain link latency, model image cache status, dataset location identifier, checkpoint format support capability, device temperature, power consumption, and abnormal status identifier.
[0039] Preferably, the resource capability information can be collected through domain adaptation agents deployed in each computing domain. These domain adaptation agents can establish connections with the device driver layer, container runtime, monitoring agent, network control unit, and storage management unit to obtain node static attributes and dynamic operating status. Furthermore, the operating status information can be updated using a combination of periodic collection and event-triggered reporting to improve the real-time performance and accuracy of the unified resource description object.
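One way to picture the unified resource description object is as a typed record whose static attributes are set at registration and whose dynamic attributes are refreshed by periodic or event-triggered reports. The sketch below is an assumption-laden illustration: the class name, field names, units, and the `update_status` helper are hypothetical, and only a subset of the listed attributes is modeled.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Attributes that a status report is allowed to overwrite (assumed split).
DYNAMIC_FIELDS = {
    "device_memory_utilization",
    "device_memory_fragmentation_rate",
    "temperature_c",
    "abnormal",
}


@dataclass
class UnifiedResourceDescription:
    # Static attributes, collected once by a domain adaptation agent.
    node_id: str
    domain_id: str
    chip_type: str
    chip_architecture: str
    driver_version: str
    container_runtime_version: str
    supported_operators: FrozenSet[str]
    supported_precisions: FrozenSet[str]
    total_device_memory_gb: float
    checkpoint_formats: FrozenSet[str] = frozenset()
    # Dynamic attributes, refreshed periodically or on events.
    device_memory_utilization: float = 0.0
    device_memory_fragmentation_rate: float = 0.0
    temperature_c: float = 0.0
    abnormal: bool = False

    def update_status(self, **changes: object) -> None:
        """Apply a status report; reject attempts to change static attributes."""
        unknown = set(changes) - DYNAMIC_FIELDS
        if unknown:
            raise KeyError(f"not a dynamic field: {sorted(unknown)}")
        for name, value in changes.items():
            setattr(self, name, value)


# Hypothetical node; every value here is made up for illustration.
node = UnifiedResourceDescription(
    node_id="node-17",
    domain_id="domain-east",
    chip_type="GPU",
    chip_architecture="arch-x",
    driver_version="1.2.3",
    container_runtime_version="2.0",
    supported_operators=frozenset({"matmul", "layer_norm"}),
    supported_precisions=frozenset({"fp16", "bf16"}),
    total_device_memory_gb=80.0,
)
node.update_status(temperature_c=71.5, device_memory_fragmentation_rate=0.18)
```

Separating static from dynamic fields keeps event-triggered reports small, since only the values that can change at runtime need to be sent.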
[0040] Step S102: Analyze the AI task and generate a task stage profile. In this embodiment, the task analysis and stage profiling module 102 analyzes the AI task to be executed and generates a task stage profile that represents the resource requirements and operational constraints of each execution stage of the AI task.
[0041] In a preferred embodiment, the task stage profile includes at least one of the following: task type, stage sequence, expected duration of each stage, computing resource requirements of each stage, video memory requirements of each stage, network requirements of each stage, required operator set, required precision mode, required runtime environment, task interruptibility, state preservation granularity, recovery method requirements, and cache reuse preferences.
[0042] In one example, the stage sequence includes at least one of the following: data preprocessing stage, model loading stage, forward computation stage, backward computation stage, parameter update stage, checkpoint saving stage, inference output stage, and cache maintenance stage. For a training task, the stage sequence may include the model loading stage, forward computation stage, backward computation stage, parameter update stage, and checkpoint saving stage; for an online inference task, the stage sequence may include the model loading stage, inference output stage, and cache maintenance stage.
[0043] Preferably, the task stage profile can be statically generated based on the task configuration file, image metadata, and user-submitted parameters, or it can be dynamically corrected based on historical running data and online running feedback to improve the accuracy of stage estimation.
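A task stage profile for a training task could be sketched as follows. The class and field names, the numeric values, and the exponential-blend correction in `correct_duration` are assumptions made for illustration. The source says only that the profile may be dynamically corrected from historical data and online feedback, without naming a method.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class StageRequirement:
    name: str
    expected_seconds: float
    device_memory_gb: float
    network_gbps: float
    interruptible: bool


@dataclass(frozen=True)
class TaskStageProfile:
    task_type: str
    required_operators: FrozenSet[str]
    required_precision: str
    stages: Tuple[StageRequirement, ...]

    def peak_device_memory_gb(self) -> float:
        """Largest per-stage memory demand across the stage sequence."""
        return max(stage.device_memory_gb for stage in self.stages)


def correct_duration(
    stage: StageRequirement, observed_seconds: float, alpha: float = 0.5
) -> StageRequirement:
    """Blend the static estimate with an observed duration (assumed method)."""
    blended = (1 - alpha) * stage.expected_seconds + alpha * observed_seconds
    return replace(stage, expected_seconds=blended)


# Stage sequence for a training task, following the example in the text.
training = TaskStageProfile(
    task_type="training",
    required_operators=frozenset({"matmul", "layer_norm"}),
    required_precision="bf16",
    stages=(
        StageRequirement("model_loading", 60.0, 20.0, 5.0, True),
        StageRequirement("forward_computation", 100.0, 40.0, 1.0, False),
        StageRequirement("backward_computation", 200.0, 64.0, 10.0, False),
        StageRequirement("parameter_update", 30.0, 48.0, 10.0, False),
        StageRequirement("checkpoint_saving", 45.0, 20.0, 2.0, True),
    ),
)
updated = correct_duration(training.stages[1], observed_seconds=140.0)
```

Keeping requirements per stage, instead of one figure per task, is what lets a scheduler see that the backward computation stage, not model loading, determines the memory slice the task needs.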
[0044] Step S103: Construct a heterogeneous compatibility matrix. In this embodiment, the heterogeneous compatibility matrix generation module 103 constructs a heterogeneous compatibility matrix between task stages and candidate nodes based on the unified resource description object and the task stage profile.
[0045] As shown in Figure 3, for each task stage and each candidate node, the system calculates the chip architecture compatibility, operator set compatibility, runtime environment compatibility, precision mode compatibility, checkpoint loading compatibility, image conversion requirements, checkpoint format conversion requirements, local model image reuse capability, local data cache reuse capability, and video memory slice conflict risk between the task stage and the candidate node.
[0046] In a preferred embodiment, the matrix elements in the heterogeneous compatibility matrix include direct compatibility markers and additional cost markers. A direct compatibility marker indicates whether the corresponding task stage can be executed directly on the corresponding candidate node; an additional cost marker indicates the additional cost of performing image conversion, state adaptation, or data transfer.
[0047] For example, when a candidate node and its corresponding task stage meet the direct execution conditions in terms of chip architecture, operator support, and runtime environment, the direct compatibility marker can be set to executable. If image conversion or checkpoint format conversion is still required, the corresponding cost can be recorded in the additional cost marker. If there are incompatible constraints that cannot be eliminated, the direct compatibility marker of the candidate node for the corresponding task stage can be set to non-executable, and the node can be excluded from subsequent scheduling.
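The marker logic in the example above can be sketched as a function that evaluates one (task stage, candidate node) pair. The class names, the subset of checks, and the two cost constants (in seconds) are hypothetical placeholders. In particular, the sketch treats an uncached image and an unsupported checkpoint format as the only sources of additional cost.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Tuple


@dataclass(frozen=True)
class Node:
    node_id: str
    architecture: str
    operators: FrozenSet[str]
    runtimes: FrozenSet[str]
    cached_images: FrozenSet[str]
    checkpoint_formats: FrozenSet[str]


@dataclass(frozen=True)
class Stage:
    name: str
    architectures: FrozenSet[str]  # architectures the stage can run on
    operators: FrozenSet[str]
    runtime: str
    image: str
    checkpoint_format: str


@dataclass(frozen=True)
class MatrixElement:
    executable: bool    # direct compatibility marker
    extra_cost: float   # additional cost marker, in seconds (assumed unit)


IMAGE_PREPARATION_COST = 120.0      # placeholder: fetch or convert an image
CHECKPOINT_CONVERSION_COST = 60.0   # placeholder: convert checkpoint format


def evaluate(stage: Stage, node: Node) -> MatrixElement:
    """Hard constraints decide executability; soft mismatches add cost."""
    hard_ok = (
        node.architecture in stage.architectures
        and stage.operators <= node.operators
        and stage.runtime in node.runtimes
    )
    if not hard_ok:
        return MatrixElement(False, float("inf"))
    cost = 0.0
    if stage.image not in node.cached_images:
        cost += IMAGE_PREPARATION_COST
    if stage.checkpoint_format not in node.checkpoint_formats:
        cost += CHECKPOINT_CONVERSION_COST
    return MatrixElement(True, cost)


def build_matrix(
    stages: Iterable[Stage], nodes: Iterable[Node]
) -> Dict[Tuple[str, str], MatrixElement]:
    """Matrix keyed by (stage name, node id)."""
    node_list = list(nodes)
    return {(s.name, n.node_id): evaluate(s, n) for s in stages for n in node_list}


forward = Stage("forward_computation", frozenset({"arch-x", "arch-y"}),
                frozenset({"matmul", "softmax"}), "rt-2", "model:v1", "fmt-a")
gpu = Node("gpu-1", "arch-x", frozenset({"matmul", "softmax", "conv"}),
           frozenset({"rt-2"}), frozenset({"model:v1"}), frozenset({"fmt-b"}))
npu = Node("npu-1", "arch-y", frozenset({"matmul"}),
           frozenset({"rt-2"}), frozenset(), frozenset({"fmt-a"}))

matrix = build_matrix([forward], [gpu, npu])
```

Here `gpu-1` is executable with a checkpoint conversion cost, while `npu-1` lacks a required operator and is marked non-executable, so a scheduler can drop it before computing any costs.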
[0048] Step S104: Perform domain-level scheduling and node-level scheduling. After obtaining the heterogeneous compatibility matrix, the scheduling decision module 104 performs domain-level scheduling on multiple candidate computing power domains based on the heterogeneous compatibility matrix to determine the main execution domain and the backup execution domain, and performs node-level scheduling within the main execution domain to determine the target execution node.
[0049] As shown in Figure 4, during the domain-level scheduling phase, the system calculates the comprehensive scheduling cost of each candidate computing power domain based on predicted queuing time, predicted execution time, cross-domain data transfer time, image conversion time and/or checkpoint conversion time, memory fragmentation penalty, energy consumption penalty, and cache reuse benefit. Preferably, the candidate computing power domain with the lowest comprehensive scheduling cost can be determined as the main execution domain, and the candidate computing power domain with the second lowest comprehensive scheduling cost can be determined as the backup execution domain.
[0050] In one implementation, the cache reuse benefit is related to at least one of the model image hit rate, dataset local hit rate, and checkpoint local hit rate. For example, when a candidate computing domain has already cached the target model image and the dataset required by the task, the candidate computing domain can obtain a higher cache reuse benefit, thereby reducing the overall scheduling cost.
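The domain-level selection can be sketched as a cost function followed by a sort. The source lists the cost terms but not how they are combined, so the unweighted sum, the averaging of the three hit rates, the `max_saving_s` constant, and all names and numbers below are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class DomainEstimate:
    domain_id: str
    queue_s: float
    exec_s: float
    transfer_s: float
    conversion_s: float
    fragmentation_penalty: float
    energy_penalty: float
    image_hit_rate: float
    dataset_hit_rate: float
    checkpoint_hit_rate: float


def cache_reuse_benefit(d: DomainEstimate, max_saving_s: float = 300.0) -> float:
    """Assumed form: average hit rate scaled to a maximum time saving."""
    hits = d.image_hit_rate + d.dataset_hit_rate + d.checkpoint_hit_rate
    return max_saving_s * hits / 3


def scheduling_cost(d: DomainEstimate) -> float:
    """Comprehensive scheduling cost: time terms plus penalties minus benefit."""
    return (
        d.queue_s + d.exec_s + d.transfer_s + d.conversion_s
        + d.fragmentation_penalty + d.energy_penalty
        - cache_reuse_benefit(d)
    )


def select_domains(domains: Sequence[DomainEstimate]) -> Tuple[str, Optional[str]]:
    """Lowest cost becomes the main domain, second lowest the backup."""
    if not domains:
        raise ValueError("no candidate computing power domains")
    ranked = sorted(domains, key=scheduling_cost)
    backup = ranked[1].domain_id if len(ranked) > 1 else None
    return ranked[0].domain_id, backup


candidates = [
    DomainEstimate("A", 100, 1000, 0, 0, 20, 30, 1.0, 1.0, 0.0),   # cost 950
    DomainEstimate("B", 10, 900, 200, 60, 5, 20, 0.0, 0.0, 0.0),   # cost 1195
    DomainEstimate("C", 50, 950, 50, 0, 10, 10, 0.0, 1.0, 0.0),    # cost 970
]
main_domain, backup_domain = select_domains(candidates)
```

Domain B has the shortest queue and execution time but pays for cross-domain transfer and conversion, so the cached image and dataset in domain A outweigh its longer queue.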
[0051] During the node-level scheduling phase, the system sorts candidate nodes within the main execution domain based on the heterogeneous compatibility matrix, current node load, memory fragmentation rate, remaining slice shape, node interconnection topology, and task stage characteristics, and selects the target execution node. Preferably, candidate nodes with lower memory fragmentation rates, higher topology affinity, and higher local cache hit rates have higher priority.
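Node-level selection can be sketched as a filter followed by a sort. The lexicographic key below follows the stated preference order (lower fragmentation, higher topology affinity, higher cache hit rate), but the exact ordering, the tie-breakers, and the use of the largest free slice as a stand-in for the remaining slice shape are assumptions.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Candidate:
    node_id: str
    executable: bool              # direct compatibility marker from the matrix
    extra_cost_s: float           # additional cost marker from the matrix
    load: float                   # 0..1
    fragmentation_rate: float     # 0..1
    largest_free_slice_gb: float  # simplified stand-in for slice shape
    topology_affinity: float      # 0..1, higher is better
    cache_hit_rate: float         # 0..1, higher is better


def rank_nodes(candidates: Sequence[Candidate], required_slice_gb: float) -> List[Candidate]:
    """Drop infeasible nodes, then order the rest by the stated preferences."""
    feasible = [
        c for c in candidates
        if c.executable and c.largest_free_slice_gb >= required_slice_gb
    ]
    return sorted(
        feasible,
        key=lambda c: (
            c.fragmentation_rate,
            -c.topology_affinity,
            -c.cache_hit_rate,
            c.load,
            c.extra_cost_s,
        ),
    )


pool = [
    Candidate("n1", True, 0.0, 0.5, 0.30, 24.0, 0.9, 1.0),
    Candidate("n2", True, 60.0, 0.2, 0.10, 40.0, 0.5, 0.0),
    Candidate("n3", True, 0.0, 0.1, 0.05, 8.0, 1.0, 1.0),    # free slice too small
    Candidate("n4", False, 0.0, 0.0, 0.00, 80.0, 1.0, 1.0),  # non-executable
]
ranking = rank_nodes(pool, required_slice_gb=16.0)
target = ranking[0].node_id if ranking else None
```

Node `n3` has the lowest fragmentation rate but cannot hold a 16 GB slice, which illustrates why remaining slice shape is checked in addition to utilization.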
[0052] Step S105: Execute the AI task and collect operational feedback information. After the target execution node is determined, the operation monitoring module 105 controls the target execution node to execute the AI task and collects operation feedback information during the task execution process.
[0053] In a preferred embodiment, the operational feedback information includes at least one of the following: actual task throughput, stage duration, video memory usage, video memory fragmentation rate, device temperature, device power consumption, link latency, link jitter, error logs, and abnormal events.
[0054] Furthermore, the system can maintain a corresponding running status record for each task, which records information such as the current execution stage, the location of the most recent state saving, the location of the most recent incremental state block, the current list of candidate recovery nodes, and the most recent scheduling parameters, for use in subsequent rescheduling and recovery.
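As a non-limiting illustration, the running status record can be represented as a small data structure. The field names, the string locations, and the `advance` helper are hypothetical and are not defined by the specification.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunningStatusRecord:
    """One record per task, kept for later rescheduling and recovery."""
    task_id: str
    current_stage: str
    last_state_save_location: Optional[str] = None
    last_increment_location: Optional[str] = None
    candidate_recovery_nodes: list = field(default_factory=list)
    last_scheduling_params: dict = field(default_factory=dict)

    def advance(self, stage, increment_location=None):
        # Move to the next stage and, if given, note the newest incremental block.
        self.current_stage = stage
        if increment_location is not None:
            self.last_increment_location = increment_location


record = RunningStatusRecord("task-1", "model_loading",
                             candidate_recovery_nodes=["node-7", "node-9"])
record.advance("forward_computation", increment_location="store/inc-0001")
print(record.current_stage, record.last_increment_location)
```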
[0055] Step S106: Trigger rescheduling and generate a unified execution state package and an incremental state block. When the operation feedback information meets the preset rescheduling conditions, the state encapsulation and migration recovery module 106 generates the unified execution state package and the incremental state block corresponding to the AI task, and migrates the unified execution state package and the incremental state block to the recovery node.
[0056] In a preferred embodiment, rescheduling is triggered when at least one of the following conditions is detected: the actual throughput of the task deviates from the predicted throughput by more than a first threshold; the memory fragmentation rate of the target execution node exceeds a second threshold; the device temperature or power consumption exceeds a third threshold; the inter-domain link latency or jitter exceeds a fourth threshold; or a node experiences a driver anomaly, hardware failure, abnormal job exit, or runtime incompatibility event.
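As a non-limiting illustration, the "at least one condition" trigger can be sketched as a function that returns the list of conditions met; rescheduling is triggered when the list is non-empty. The threshold values, the relative form of the throughput deviation, the units, and all names are assumptions for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Feedback:
    actual_throughput: float
    predicted_throughput: float
    fragmentation_rate: float
    temperature_c: float
    power_w: float
    link_latency_ms: float
    link_jitter_ms: float
    abnormal_events: list = field(default_factory=list)


@dataclass
class Thresholds:
    throughput_deviation: float = 0.2   # first threshold (relative deviation)
    fragmentation_rate: float = 0.4     # second threshold
    temperature_c: float = 85.0         # third threshold (temperature)
    power_w: float = 400.0              # third threshold (power)
    link_latency_ms: float = 50.0       # fourth threshold (latency)
    link_jitter_ms: float = 10.0        # fourth threshold (jitter)


def rescheduling_reasons(fb, th):
    """Return the names of all rescheduling conditions currently met."""
    reasons = []
    if fb.predicted_throughput > 0:
        deviation = (abs(fb.actual_throughput - fb.predicted_throughput)
                     / fb.predicted_throughput)
        if deviation > th.throughput_deviation:
            reasons.append("throughput_deviation")
    if fb.fragmentation_rate > th.fragmentation_rate:
        reasons.append("memory_fragmentation")
    if fb.temperature_c > th.temperature_c or fb.power_w > th.power_w:
        reasons.append("temperature_or_power")
    if (fb.link_latency_ms > th.link_latency_ms
            or fb.link_jitter_ms > th.link_jitter_ms):
        reasons.append("link_latency_or_jitter")
    if fb.abnormal_events:
        reasons.append("abnormal_event")
    return reasons


hot = Feedback(70.0, 100.0, 0.2, 90.0, 300.0, 10.0, 1.0)
print(rescheduling_reasons(hot, Thresholds()))
```

Returning the reasons rather than a single boolean lets the scheduler record why a migration was started, which may be useful when scheduling parameters are later adjusted.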
[0057] After a rescheduling is triggered, the system encapsulates the current task state into a unified execution state package and generates an incremental state block based on the task state that has changed since the most recent state record.
[0058] In a preferred embodiment, the unified execution state package includes at least one of the following: model weight index, optimizer state index, current training round or step, random number seed information, cache mapping information, intermediate tensor mapping summary, operator compatibility identifier, device memory mapping summary, the location of the most recent full checkpoint, and the location of the most recent incremental state block.
[0059] Preferably, the incremental state block is used to store state data that has changed since the most recent state record, so as to reduce the amount of data transmitted and shorten the recovery time.
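As a non-limiting illustration, an incremental state block can be modeled as the difference between the most recent state record and the current task state, and incremental state merging as applying that difference on the recovery side. Representing task state as a flat dictionary of keys to shard identifiers is an assumption made only for this example, as are the key names.

```python
def incremental_state_block(previous, current):
    """Keep only entries added or changed since the last record, plus removed keys."""
    changed = {k: v for k, v in current.items()
               if k not in previous or previous[k] != v}
    removed = sorted(k for k in previous if k not in current)
    return {"changed": changed, "removed": removed}


def merge_incremental(base, block):
    """Rebuild the current state from the last record and an incremental block."""
    merged = {k: v for k, v in base.items() if k not in block["removed"]}
    merged.update(block["changed"])
    return merged


last_record = {"step": 1000, "weights/layer0": "shard-a1",
               "weights/layer1": "shard-b1", "rng_seed": 42}
current = {"step": 1200, "weights/layer0": "shard-a1",
           "weights/layer1": "shard-b2", "rng_seed": 42}

block = incremental_state_block(last_record, current)
restored = merge_incremental(last_record, block)
print(block["changed"])
```

Only the two entries that changed are carried in the block, which is the source of the reduced transfer volume described above.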
[0060] Step S107: Restore and continue executing the AI task on the recovery node. As shown in Figure 5, the system transmits the unified execution state package and the incremental state block to the recovery node so that the recovery node can recover and continue to execute the AI task.
[0061] In a preferred embodiment, the recovery node performs runtime environment matching, checkpoint loading, incremental state merging, and memory mapping reconstruction based on the heterogeneous compatibility matrix. When the recovery node and the original execution node are in different chip architectures or different runtime environments, checkpoint format adaptation, image conversion, operator replacement, or precision mode mapping can be performed before task recovery.
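As a non-limiting illustration, the recovery node's preparation can be sketched as building an ordered list of steps. The specification names the adaptation operations but does not state which environment difference calls for which operation; the mapping below, the dictionary keys, and the sample values are assumptions for this example.

```python
BASE_STEPS = [
    "runtime_environment_matching",
    "checkpoint_loading",
    "incremental_state_merging",
    "memory_mapping_reconstruction",
]


def plan_recovery(source_env, target_env):
    """Return adaptation steps (if any) followed by the base recovery steps."""
    steps = []
    if source_env["arch"] != target_env["arch"]:
        steps.append("image_conversion")
    if source_env["checkpoint_format"] not in target_env["checkpoint_formats"]:
        steps.append("checkpoint_format_adaptation")
    if set(source_env["operators"]) - set(target_env["operators"]):
        steps.append("operator_replacement")
    if source_env["precision"] not in target_env["precisions"]:
        steps.append("precision_mode_mapping")
    return steps + BASE_STEPS


source = {"arch": "arch-a", "checkpoint_format": "fmt-a",
          "operators": ["matmul", "fused_attention"], "precision": "bf16"}
target = {"arch": "arch-b", "checkpoint_formats": ["fmt-b"],
          "operators": ["matmul"], "precisions": ["fp16"]}
print(plan_recovery(source, target))
```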
[0062] Step S108: Perform memory fragmentation prediction and suppression control. During task execution, the memory fragmentation prediction and suppression module 107 performs memory fragmentation prediction and suppression control based on the memory fragmentation status of heterogeneous computing power nodes.
[0063] As shown in Figure 6, the system predicts the evolution trend of memory fragmentation within a preset time window based on the memory fragmentation rate, remaining slice shape, historical stage switching mode, and task loading/unloading sequence of heterogeneous computing power nodes. When the prediction result indicates that memory fragmentation is deteriorating, at least one of the following control operations is executed: delaying the loading of new tasks, diverting short-term tasks to adjacent nodes, moving out interruptible tasks at stage boundaries, performing memory consolidation, and adjusting the slice specifications of subsequent tasks.
[0064] For example, when the system predicts that the video memory slice of a target node will form multiple discrete small blocks due to frequent short task loading in the future, medium-sized tasks can be scheduled to adjacent nodes in advance, and some interruptible copies can be moved out when long tasks enter the checkpoint saving stage, so as to reduce the risk of subsequent video memory unavailability.
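As a non-limiting illustration, the prediction and suppression control can be sketched with a simple trend model. The specification does not fix a prediction model; the least-squares linear extrapolation, the 0.5 threshold, and the action names below are assumptions for this example, and only the fragmentation rate history is used as input.

```python
def predict_fragmentation(history, horizon):
    """Extrapolate the fragmentation rate `horizon` samples past the latest one,
    using the least-squares slope of the history (assumed model)."""
    n = len(history)
    if n == 0:
        return 0.0
    if n == 1:
        return history[0]
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(history))
    den = sum((i - mean_x) ** 2 for i in range(n))
    predicted = history[-1] + (num / den) * horizon
    return min(1.0, max(0.0, predicted))   # keep the rate within [0, 1]


def suppression_actions(predicted, threshold=0.5, at_stage_boundary=False):
    """Choose control operations when the predicted rate exceeds the threshold."""
    if predicted <= threshold:
        return []
    actions = ["delay_new_task_loading", "divert_short_tasks_to_adjacent_nodes"]
    if at_stage_boundary:
        # Interruptible tasks are only moved out at a stage boundary.
        actions.append("move_out_interruptible_tasks")
    return actions


rising = [0.10, 0.20, 0.30, 0.40]
predicted = predict_fragmentation(rising, horizon=3)
print(round(predicted, 2), suppression_actions(predicted, at_stage_boundary=True))
```

A practical predictor would also consume the remaining slice shape and the task loading/unloading sequence; those inputs are omitted here to keep the sketch short.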
[0065] Furthermore, the system can also adjust the expected duration of task phases, the parameter weights in the comprehensive scheduling cost, and the node compatibility evaluation results based on the actual execution results of the task, so as to optimize the subsequent scheduling efficiency.
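As a non-limiting illustration, adjusting the expected stage duration from actual execution results can be sketched as an exponential moving average. The specification does not name an update rule; the rule below, the smoothing factor `alpha`, and the function name are assumptions for this example.

```python
def update_expected_duration(expected, actual, alpha=0.3):
    """Blend the previous expectation with the observed duration.

    alpha in (0, 1]: larger values react faster to recent observations.
    """
    return (1 - alpha) * expected + alpha * actual


expected = 100.0                       # seconds, previous expectation
for observed in (120.0, 120.0):        # two later runs of the same stage
    expected = update_expected_duration(expected, observed, alpha=0.5)
print(expected)
```

The same form of update could be applied to the parameter weights of the comprehensive scheduling cost, with the observed prediction error taking the place of the observed duration.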
[0066] In a specific application scenario, a large model training task is submitted to the system of this invention. The system first obtains the unified resource description object of each heterogeneous computing power node from the North China computing power domain, East China computing power domain, and Southwest computing power domain; then it parses the training task and generates a task stage profile including the model loading stage, forward computation stage, backward computation stage, parameter update stage, and checkpoint saving stage.
[0067] The system constructs a heterogeneous compatibility matrix based on a unified resource description object and task stage profile, identifying that both the East China and Southwest computing domains can meet the execution requirements of the training task. The East China computing domain already caches the corresponding model image and training dataset, thus its cache reuse benefits are higher. The system further performs domain-level scheduling, determining the East China computing domain as the primary execution domain and the Southwest computing domain as the backup execution domain. Within the East China computing domain, nodes with lower memory fragmentation rates and better interconnect topology are selected as the target execution nodes.
[0068] During task execution, when the memory fragmentation rate of one of the target execution nodes continues to rise and the device temperature approaches a threshold, the system triggers rescheduling, generates a unified execution state package and an incremental state block, and migrates them to a backup recovery node. After the recovery node completes state recovery based on the heterogeneous compatibility matrix, it continues to execute the training task, thereby reducing fault recovery time and data transfer overhead.
[0069] In another application scenario, online inference tasks are deployed in an edge computing power domain. During execution, the system detects increased link jitter between the edge and central computing power domains, and the driver layer of the edge nodes reports anomalies, meeting the preset rescheduling conditions. The system then generates a unified execution state package and an incremental state block and migrates them to a recovery node in the central computing power domain. The recovery node performs image adaptation, cache mapping restoration, and inference state reconstruction based on the heterogeneous compatibility matrix, and continues to execute the inference task, thus improving service continuity.
[0070] In another application scenario, multiple short-duration inference tasks and a small number of long-duration training tasks run simultaneously within a certain computing power domain. As short-duration inference tasks are frequently loaded and unloaded, the memory slices of multiple nodes gradually form discrete small blocks. Based on historical task loading and unloading sequences and stage switching patterns, the system predicts that some nodes will experience a trend of memory fragmentation deterioration within a preset time window. Therefore, it preemptively diverts newly submitted medium-sized tasks to adjacent nodes and migrates out some interruptible replicas when training tasks enter the checkpoint saving phase, preventing subsequent tasks from entering a long queue due to insufficient slices.
[0071] As shown in Figure 7, the present invention also provides a dynamic scheduling and unified management system for heterogeneous AI computing power across the entire domain, which includes: the resource access and registration module 10, used to obtain resource capability information and operating status information of heterogeneous computing power nodes in multiple computing power domains, and to construct a unified resource description object corresponding to each heterogeneous computing power node; the task analysis and stage profiling module 20, used to analyze the AI task to be executed and generate a task stage profile that represents the resource requirements and operational constraints of each execution stage of the AI task; the heterogeneous compatibility matrix generation module 30, used to construct a heterogeneous compatibility matrix between task stages and candidate nodes based on the unified resource description object and the task stage profile; the scheduling decision module 40, used to perform domain-level scheduling on multiple candidate computing power domains based on the heterogeneous compatibility matrix, determine the main execution domain and the backup execution domain, and perform node-level scheduling within the main execution domain to determine the target execution node; the operation monitoring module 50, used to control the target execution node to execute the AI task and collect operation feedback information during the task execution process; the state encapsulation and migration recovery module 60, used to generate a unified execution state package and an incremental state block corresponding to the AI task when the operation feedback information meets the preset rescheduling conditions, and to migrate the unified execution state package and the incremental state block to the recovery node so that the recovery node can recover and continue to execute the AI task; and the memory fragmentation prediction and suppression module 70, used to perform memory fragmentation prediction and suppression control based on the memory fragmentation status of the heterogeneous computing power nodes.
[0072] Figure 8 is a structural block diagram of an electronic device for implementing a method for dynamic scheduling and unified management of heterogeneous AI computing power across the entire domain, according to an exemplary embodiment. The electronic device can be a server, control node, scheduling host, cloud platform management node, edge computing node, or other computing device with data processing and control capabilities. As shown in Figure 8, the electronic device includes a processor, memory, network interface, and input/output devices connected via a system bus.
[0073] The processor provides computing and control capabilities to execute the steps of the dynamic scheduling and unified management method for heterogeneous AI computing power across the entire domain. The memory includes a non-volatile storage medium and internal memory. The non-volatile storage medium stores the operating system, computer programs, and data related to the unified resource description object, task stage profile, heterogeneous compatibility matrix, unified execution state package, and incremental state block. The internal memory provides an environment for the operation of the operating system and computer programs, and temporarily stores intermediate data and runtime status information during the scheduling process. The network interface communicates with heterogeneous computing power nodes, domain adaptation agents, storage nodes, monitoring nodes, and external business systems in multiple computing power domains to perform operations such as resource access, status acquisition, scheduling control, state migration, and resumption of execution. The input / output device receives task configurations, policy parameters, and control commands, and outputs scheduling results, runtime status information, alarm information, and unified management results.
[0074] When the computer program is executed by the processor, the electronic device performs the following functions: acquiring resource capability information and operating status information of heterogeneous computing power nodes in multiple computing power domains, and constructing a unified resource description object; parsing the AI task to be executed and generating a task stage profile; constructing a heterogeneous compatibility matrix based on the unified resource description object and the task stage profile; performing domain-level scheduling and node-level scheduling based on the heterogeneous compatibility matrix to determine the target execution node; collecting operation feedback information during task execution; when the preset rescheduling conditions are met, generating a unified execution state package and an incremental state block and migrating them to the recovery node so that the recovery node can recover and continue to execute the AI task; and performing memory fragmentation prediction and suppression control based on the memory fragmentation status of the heterogeneous computing power nodes.
[0075] It should be understood that the structure shown in Figure 8 is merely an illustration related to the present application and does not constitute a limitation on the specific structure of the electronic device to which this application applies. In other embodiments, the electronic device may include more or fewer components than those shown in the figure, or combine certain components, or employ different component arrangements. Any electronic device structure capable of implementing the technical solution of this application should fall within the protection scope of this application.
[0076] In one exemplary embodiment, an electronic device is also provided, including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the global AI heterogeneous computing power dynamic scheduling and unified management method described in any embodiment of this application.
[0077] In one exemplary embodiment, a computer-readable storage medium is also provided, on which a computer program is stored. When the computer program is executed by a processor, it enables an electronic device to implement the global AI heterogeneous computing power dynamic scheduling and unified management method described in any embodiment of this application.
[0078] The computer-readable storage medium may be a read-only memory (ROM), random access memory (RAM), programmable read-only memory (PROM), electrically programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory, magnetic tape, magnetic disk, optical disk, solid-state drive, or other non-transitory computer-readable storage medium capable of storing program code.
[0079] Those skilled in the art will understand that all or part of the steps in the methods of the above embodiments can be implemented by computer program instructions and related hardware. The computer program can be stored in a non-transitory computer-readable storage medium and, when executed by a processor, implements the corresponding steps of the methods described in the foregoing embodiments. The memory, database, or other media involved in the above storage medium can include both non-volatile memory and volatile memory. Non-volatile memory can include, for example, ROM, PROM, EPROM, EEPROM, or flash memory; volatile memory can include, for example, RAM. The RAM can include Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), etc.
[0080] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.
Claims
1. A method for dynamic scheduling and unified management of heterogeneous AI computing power across the entire domain, characterized in that it comprises: obtaining resource capability information and operating status information of heterogeneous computing power nodes in multiple computing power domains, and constructing a unified resource description object corresponding to each heterogeneous computing power node; analyzing the AI task to be executed and generating a task stage profile that represents the resource requirements and operational constraints of each execution stage of the AI task; constructing, based on the unified resource description object and the task stage profile, a heterogeneous compatibility matrix between task stages and candidate nodes; performing, based on the heterogeneous compatibility matrix, domain-level scheduling on multiple candidate computing power domains to determine the main execution domain and the backup execution domain, and performing node-level scheduling within the main execution domain to determine the target execution node; controlling the target execution node to execute the AI task and collecting operation feedback information during the task execution process; when the operation feedback information meets the preset rescheduling conditions, generating a unified execution state package and an incremental state block corresponding to the AI task, and migrating the unified execution state package and the incremental state block to the recovery node so that the recovery node can recover and continue to execute the AI task; and performing memory fragmentation prediction and suppression control based on the memory fragmentation status of the heterogeneous computing power nodes.
2. The method for dynamic scheduling and unified management of heterogeneous AI computing power across the entire domain according to claim 1, characterized in that the unified resource description object includes at least one of the following: node identifier, computing power domain identifier, chip type, chip architecture, chip manufacturer, driver version, container runtime version, supported operator set, supported precision mode, total video memory, video memory bandwidth, video memory partitioning granularity, current video memory utilization, video memory fragmentation rate, number of CPU cores, main memory capacity, storage bandwidth, inter-node interconnect type, inter-domain link latency, model image cache status, dataset location identifier, checkpoint format support capability, device temperature, power consumption, and abnormal status identifier.
3. The method for dynamic scheduling and unified management of heterogeneous AI computing power across the entire domain according to claim 1, characterized in that the task stage profile includes at least one of the following: task type, stage sequence, expected duration of each stage, computational resource requirements of each stage, video memory requirements of each stage, network requirements of each stage, required operator set, required precision mode, required runtime environment, task interruptibility, state saving granularity, recovery method requirements, and cache reuse preferences; and the stage sequence includes at least one of the following: data preprocessing stage, model loading stage, forward computation stage, backward computation stage, parameter update stage, checkpoint saving stage, inference output stage, and cache maintenance stage.
4. The method for dynamic scheduling and unified management of heterogeneous AI computing power across the entire domain according to claim 1, characterized in that constructing the heterogeneous compatibility matrix between task stages and candidate nodes includes: for each task stage and each candidate node, calculating the chip architecture compatibility, operator set compatibility, runtime environment compatibility, precision mode compatibility, checkpoint loading compatibility, image conversion requirements, checkpoint format conversion requirements, local model image reuse capability, local data cache reuse capability, and video memory slice conflict risk between the task stage and the candidate node; and generating the heterogeneous compatibility matrix based on the chip architecture compatibility, operator set compatibility, runtime environment compatibility, precision mode compatibility, checkpoint loading compatibility, image conversion requirements, checkpoint format conversion requirements, local model image reuse capability, local data cache reuse capability, and video memory slice conflict risk; wherein the matrix elements in the heterogeneous compatibility matrix include direct compatibility markers and additional cost markers, the direct compatibility markers indicate whether the corresponding task stage can be executed directly on the corresponding candidate node, and the additional cost markers indicate the additional costs associated with performing image conversion, state adaptation, or data transfer.
5. The method for dynamic scheduling and unified management of heterogeneous AI computing power across the entire domain according to claim 1, characterized in that performing domain-level scheduling on multiple candidate computing power domains based on the heterogeneous compatibility matrix, and performing node-level scheduling within the main execution domain, includes: calculating the comprehensive scheduling cost of each candidate computing power domain based on the predicted queuing time, predicted execution time, cross-domain data transfer time, image conversion time and/or checkpoint conversion time, memory fragmentation penalty, energy consumption penalty, and cache reuse benefit; determining the candidate computing power domain with the lowest comprehensive scheduling cost as the main execution domain, and the candidate computing power domain with the second-lowest comprehensive scheduling cost as the backup execution domain; and, within the main execution domain, sorting candidate nodes based on the heterogeneous compatibility matrix, current node load, memory fragmentation rate, remaining slice shape, node interconnection topology, and task stage characteristics, and selecting the target execution node; wherein the cache reuse benefit is related to at least one of the following: model image hit rate, dataset local hit rate, and checkpoint local hit rate.
6. The method for dynamic scheduling and unified management of heterogeneous AI computing power across the entire domain according to claim 1, characterized in that, when the operation feedback information meets the preset rescheduling conditions, generating a unified execution state package and an incremental state block corresponding to the AI task, and migrating the unified execution state package and the incremental state block to the recovery node, includes: triggering rescheduling when at least one of the following is detected: the actual throughput of the task deviates from the predicted throughput by more than a first threshold; the memory fragmentation rate of the target execution node exceeds a second threshold; the device temperature or power consumption exceeds a third threshold; the inter-domain link latency or jitter exceeds a fourth threshold; or a node driver abnormality, hardware failure, abnormal job exit, or runtime incompatibility event occurs; encapsulating the current task state into a unified execution state package, and generating an incremental state block based on the task state that has changed since the most recent state record; and transmitting the unified execution state package and the incremental state block to the recovery node, so that the recovery node performs runtime environment matching, checkpoint loading, incremental state merging, and memory mapping reconstruction based on the heterogeneous compatibility matrix; wherein the unified execution state package includes at least one of the following: model weight index, optimizer state index, current training round or step, random number seed information, cache mapping information, intermediate tensor mapping summary, operator compatibility identifier, device memory mapping summary, the location of the most recent full checkpoint, and the location of the most recent incremental state block.
7. The method for dynamic scheduling and unified management of heterogeneous AI computing power across the entire domain according to claim 1, characterized in that, The memory fragmentation prediction and suppression control based on the memory fragmentation status of heterogeneous computing nodes includes: Based on the memory fragmentation rate, remaining slice shape, historical stage switching mode, and task loading and unloading sequence of heterogeneous computing power nodes, the evolution trend of memory fragmentation within a preset time window is predicted. When the prediction results indicate that memory fragmentation is deteriorating, perform at least one of the following control operations: delay the loading of new tasks, divert short-term tasks to adjacent nodes, move out interruptible tasks at stage boundaries, perform memory defragmentation, and adjust the slice specifications of subsequent tasks. Based on the actual execution results of the task, the expected duration of the task phase, the parameter weights in the comprehensive scheduling cost, and the node compatibility evaluation results are revised.
8. A dynamic scheduling and unified management system for heterogeneous AI computing power across the entire domain, characterized in that it comprises: a resource access and registration module, used to obtain resource capability information and operating status information of heterogeneous computing power nodes in multiple computing power domains, and to construct a unified resource description object corresponding to each heterogeneous computing power node; a task analysis and stage profiling module, used to analyze the AI task to be executed and generate a task stage profile that represents the resource requirements and operational constraints of each execution stage of the AI task; a heterogeneous compatibility matrix generation module, used to construct a heterogeneous compatibility matrix between task stages and candidate nodes based on the unified resource description object and the task stage profile; a scheduling decision module, used to perform domain-level scheduling on multiple candidate computing power domains based on the heterogeneous compatibility matrix, determine the main execution domain and the backup execution domain, and perform node-level scheduling within the main execution domain to determine the target execution node; an operation monitoring module, used to control the target execution node to execute the AI task and collect operation feedback information during the task execution process; a state encapsulation and migration recovery module, used to generate a unified execution state package and an incremental state block corresponding to the AI task when the operation feedback information meets the preset rescheduling conditions, and to migrate the unified execution state package and the incremental state block to the recovery node so that the recovery node can recover and continue to execute the AI task; and a memory fragmentation prediction and suppression module, used to perform memory fragmentation prediction and suppression control based on the memory fragmentation status of the heterogeneous computing power nodes.
9. An electronic device, characterized in that it comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, it implements the method according to any one of claims 1 to 7.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by a processor, it implements the method described in any one of claims 1 to 7.