A load-balancing-based power dispatching model inference acceleration method, device, and storage medium
By optimizing load balancing and model pruning, power dispatch model resources are dynamically allocated, solving the problems of resource waste and latency in existing technologies, and achieving efficient emergency task response and stable power dispatch model inference.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: POWER DISPATCHING CONTROL CENT OF GUANGDONG POWER GRID CO LTD
- Filing Date: 2026-03-20
- Publication Date: 2026-06-19
AI Technical Summary
The existing power dispatch model does not dynamically allocate resources by combining task priority, business scenario complexity and real-time status of computing nodes, resulting in wasted computing resources and fluctuations in inference latency, and cannot meet the second-level response requirements of emergency tasks.
By acquiring hardware status data of the computing card and task status data of power scheduling tasks, load assessment and hierarchical load balancing are performed. Combined with the congestion level of the task queue and the overall load value, resources are dynamically allocated, and model pruning and memory pre-allocation are optimized to ensure that resources are tilted towards high-priority and high-complexity tasks, thereby reducing computing power waste.
It achieves efficient utilization of computing resources, ensures second-level response for emergency tasks, reduces latency fluctuations, and improves the inference speed and stability of the power dispatching model.
Figure CN122240315A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of power dispatching model technology, and in particular to a method, device and storage medium for accelerating power dispatching model inference based on load balancing.
Background Technology
[0002] With the construction of new power systems and the intelligent upgrading of power dispatching operations, power dispatching models are being deeply applied in scenarios such as intelligent question answering, fault handling, and real-time dispatching decision-making, becoming a key technology supporting the safe and stable operation of power systems. However, as the real-time requirements of power dispatching operations continue to increase, the computing power demand in the model inference stage is experiencing explosive growth. This includes urgent tasks such as fault handling requiring millisecond-level responses, highly complex multi-round inference tasks, and non-real-time data statistical analysis tasks.
[0003] The existing power dispatch model does not dynamically allocate resources by combining task priority, business scenario complexity and real-time status of computing nodes during the inference process. As a result, some computing cards are overloaded for long periods while other nodes sit idle. This not only wastes computing resources, but also causes large fluctuations in inference latency, so the core requirement of second-level response for emergency tasks cannot be met.
Summary of the Invention
[0004] This invention provides a load-balancing-based power dispatching model inference acceleration method, device, and storage medium, which address the waste of computing resources in the prior art, where resources are not dynamically allocated according to task priority, business scenario complexity, and the real-time status of computing nodes.
[0005] One embodiment of the present invention provides a method for accelerating inference of a power dispatching model based on load balancing, comprising:
Obtaining the hardware status data of the computing power card and the task status data of the power dispatching task at the current moment, wherein the task status data includes business scenario tags and task priority;
Performing load assessment based on the hardware status data and the task status data to obtain the congestion level of the task queue and the comprehensive load value of each computing power node;
Performing hierarchical load balancing based on the congestion level of the task queue, the comprehensive load value, the task status data, and the task priority corresponding to the power dispatching task to obtain the task distribution result;
Performing operator fusion and memory pre-allocation on the computing power node links based on the task distribution result and the hardware status data to generate model execution units adapted to the computing power nodes;
Performing model pruning optimization in the model execution unit based on the business scenario tags and task priorities in the task status data;
Allocating the pruned model execution unit and the task distribution result to the corresponding computing power nodes, so that the computing power nodes perform inference on the power dispatching task based on the task distribution result and the pruned model execution unit, obtain task execution data, and update the hardware status data after inference;
Performing real-time monitoring of the computing power card based on the updated hardware status data and the task execution data.
[0006] Furthermore, the hardware status data includes: computing resource status data and video memory status data; the computing resource status data includes GPU utilization; the video memory status data includes video memory occupancy rate; and the task status data also includes task queue length. Based on the hardware status data and the task status data, a load assessment is performed to obtain the task queue congestion level and the comprehensive load value of each computing node, including: The comprehensive load value of each computing node is obtained by weighting the GPU utilization, the video memory usage, and the task queue length. The deviation between the overall load value and the task queue length is taken as the queue adaptation deviation. The congestion level of the task queue is obtained based on the ratio of the queue adaptation deviation to the preset queue length threshold.
[0007] Furthermore, the task status data also includes: the scheduling device identifier; the computing resource status data also includes: the current amount of resources occupied and the total resource capacity; Based on the congestion level of the task queue, the overall load value, the task status data, and the task priority corresponding to the power dispatch task, a hierarchical load balancing is performed to obtain the task distribution result, including: Based on the task priority corresponding to the power dispatch task and the comprehensive load value of each computing node, computing nodes with a comprehensive load value less than the preset load threshold are selected as low load nodes. The low load nodes are then sorted according to task priority and task queue congestion to obtain the sorting results. Power dispatching tasks with the same dispatching device identifier are routed to the same computing power node according to the hierarchical sorting results, so that each computing power node can allocate memory blocks to power dispatching tasks based on task priority and node load rate. The node load rate of the computing power node is determined based on the current occupied resources and the total resource capacity. The hierarchical sorting results and the memory block allocation results of the computing power nodes are used as the task distribution results.
[0008] Furthermore, the video memory status data also includes: total video memory capacity and video memory fragmentation rate; Based on the task distribution results and the hardware status data, operator fusion and memory pre-allocation are performed on the computing power node links to generate model execution units adapted to the computing power nodes, including: Based on the hierarchical sorting results, memory block allocation results, and operator types corresponding to the power dispatching tasks in the task distribution results, the continuously executed operators corresponding to each computing power node are selected, and the selected continuously executed operators are integrated into a composite operator used to characterize the target computing power node; wherein, the operator type of the target computing power node is consistent with the operator type corresponding to the power dispatching task. Based on the total video memory capacity and the video memory fragmentation rate, contiguous video memory blocks are allocated for composite operators and power scheduling tasks in the video memory block allocation results. The allocated contiguous memory blocks and composite operators are integrated to generate model execution units adapted to computing power nodes.
[0009] Furthermore, based on the business scenario tags and task priorities in the task status data, model pruning optimization is performed in the model execution unit, including: Based on the business scenario labels in the task status data, the business type corresponding to the power dispatching task is determined, and the network layer that is strongly related to the business type is selected as the target network layer in the model execution unit. The pruning rate corresponding to the target network layer is determined based on the task priority and the preset pruning rate table; The channels of the target network layer are pruned according to the pruning rate corresponding to the target network layer. Based on the pruned target network layer and channels, the pruned model execution unit is obtained.
[0010] Furthermore, the computing nodes perform inference on the power dispatching task based on the task distribution results and the pruned model execution unit to obtain task execution data, including: After receiving the pruned model execution unit and the task distribution results, the computing node determines the corresponding inference mode based on the task priority of the power dispatching task. When the task priority is high, the composite operator and the allocated contiguous memory block are invoked directly for inference to obtain the first execution data. When the task priority is not high, tasks in the same scenario are grouped into batches according to the business scenario label in the task status data, and it is determined whether the length of the task queue after grouping is greater than the preset batch processing threshold. If so, batch inference is performed based on the pruned target network layer and channels to obtain the second execution data. Otherwise, the composite operator and the contiguous memory block are invoked individually to perform inference in the pruned target network layer to obtain the third execution data. The first execution data, the second execution data, and the third execution data are used as the task execution data.
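The mode selection in [0010] can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function name `plan_inference`, the dictionary keys, the treatment of "P0" as the high priority level, and the batch threshold value are all hypothetical, and the actual inference calls are omitted.

```python
from collections import defaultdict

BATCH_THRESHOLD = 4  # hypothetical value; the text only says it is preset


def plan_inference(tasks, batch_threshold=BATCH_THRESHOLD):
    """Choose an inference mode for each task.

    tasks: list of dicts with keys 'id', 'priority' and 'scenario'.
    Returns a list of (mode, [task ids]) tuples, where mode is
    'direct' (high priority), 'batch' or 'single'.
    """
    plan = []
    groups = defaultdict(list)
    for task in tasks:
        if task["priority"] == "P0":  # assumption: P0 is the high priority
            plan.append(("direct", [task["id"]]))
        else:
            # Group remaining tasks by business scenario tag.
            groups[task["scenario"]].append(task["id"])
    for ids in groups.values():
        if len(ids) > batch_threshold:
            plan.append(("batch", ids))  # batch inference for the group
        else:
            plan.extend(("single", [i]) for i in ids)  # one call per task
    return plan


tasks = [
    {"id": 1, "priority": "P0", "scenario": "fault"},
    {"id": 2, "priority": "P1", "scenario": "query"},
    {"id": 3, "priority": "P1", "scenario": "query"},
    {"id": 4, "priority": "P2", "scenario": "stats"},
]
print(plan_inference(tasks, batch_threshold=1))
# prints [('direct', [1]), ('batch', [2, 3]), ('single', [4])]
```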
[0011] Furthermore, real-time monitoring of the computing card is performed based on the updated hardware status data and the task execution data, including: A Level 1 warning is generated if the updated GPU utilization is not less than the preset first monitoring threshold and the updated video memory usage is not less than the preset second monitoring threshold. The current computing node that generates a Level 1 warning is switched to a restricted state. In this restricted state, the current computing node stops receiving new power scheduling tasks and migrates low-priority power scheduling tasks to other computing nodes until the GPU utilization is less than the preset first monitoring threshold and the video memory usage is less than the preset second monitoring threshold, at which point it switches to the running state.
[0012] Furthermore, it also includes: If the number of computing nodes that generate a Level 1 warning exceeds a preset node number threshold, a Level 2 warning will be generated. When a Level 2 warning is generated, the power scheduling tasks of all computing nodes that switch to a restricted state will be migrated to the standby node.
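The two warning levels in [0011] and [0012] can be read as a small state machine. The sketch below is illustrative only: the class and function names and the threshold values are assumptions (the text only calls them preset), and task migration is left out.

```python
from dataclasses import dataclass

# Hypothetical threshold values; the source only states that they are preset.
GPU_THRESHOLD = 90.0         # first monitoring threshold, percent
MEMORY_THRESHOLD = 90.0      # second monitoring threshold, percent
NODE_COUNT_THRESHOLD = 2     # node number threshold for a Level 2 warning


@dataclass
class NodeMonitor:
    name: str
    state: str = "running"   # "running" or "restricted"

    def update(self, gpu_utilization: float, memory_occupancy: float) -> bool:
        """Apply the Level 1 rule and return True while the node is restricted."""
        if gpu_utilization >= GPU_THRESHOLD and memory_occupancy >= MEMORY_THRESHOLD:
            # Level 1 warning: stop accepting new tasks.
            self.state = "restricted"
        elif gpu_utilization < GPU_THRESHOLD and memory_occupancy < MEMORY_THRESHOLD:
            # Both indicators are back under their thresholds.
            self.state = "running"
        # If only one indicator is under its threshold, the state is unchanged.
        return self.state == "restricted"


def level2_warning(monitors) -> bool:
    """Level 2 fires when the number of restricted nodes exceeds the threshold."""
    restricted = sum(1 for m in monitors if m.state == "restricted")
    return restricted > NODE_COUNT_THRESHOLD
```

A restricted node stays restricted until both indicators fall below their thresholds, which matches the recovery condition stated in [0011].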
[0013] As an improvement to the above solution, another embodiment of the present invention provides a load-balancing-based power dispatching model inference acceleration device, comprising: The data acquisition module is used to acquire the hardware status data of the computing card and the task status data of the power dispatching task at the current moment; the task status data includes business scenario tags and task priorities. The load assessment module is used to perform load assessment based on the hardware status data and the task status data to obtain the congestion level of the task queue and the comprehensive load value of each computing node. The load balancing module is used to perform hierarchical load balancing based on the congestion level of the task queue, the comprehensive load value, the task status data, and the task priority corresponding to the power dispatch task, so as to obtain the task distribution result. The computing node pre-allocation module is used to perform operator fusion and memory pre-allocation on the computing node links according to the task distribution results and the hardware status data, and generate model execution units adapted to the computing nodes. The task inference module is used to perform model pruning optimization in the model execution unit based on the business scenario tags and task priorities in the task status data. The pruned model execution unit and task distribution results are allocated to the corresponding computing power nodes so that the computing power nodes can infer the power dispatch task based on the task distribution results and the pruned model execution unit to obtain task execution data and update the inferred hardware status data. The computing power card monitoring module is used to monitor the computing power card in real time based on the updated hardware status data and the task execution data.
[0014] Another embodiment of the present invention provides a terminal device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor. When the processor executes the computer program, it implements a load balancing-based power dispatch model inference acceleration method as described in the above embodiments.
[0015] Another embodiment of the present invention provides a computer-readable storage medium including a stored computer program, wherein, when the computer program is executed, it controls the device where the computer-readable storage medium is located to execute the load balancing-based power dispatching model inference acceleration method described in the above embodiment.
[0016] By implementing this invention, at least the following beneficial effects are achieved: This invention provides a load-balancing-based method, device, and storage medium for accelerating power dispatch model inference. The method acquires hardware status data from computing cards and task status data from power dispatch tasks. The task status data includes business scenario tags and task priorities, supplementing the allocation criteria lacking in existing technologies. Business scenario tags quantify scenario complexity, task priorities clarify response urgency, and hardware status data reflects node capacity; these three elements together form the foundational data support for dynamic allocation. Load assessment yields the task queue congestion level and the comprehensive load value of each computing node, avoiding the ambiguous judgments of node status found in existing technologies. The comprehensive load value then integrates hardware status and task queue conditions. The task queue congestion level quantifies the task backlog on nodes, preventing the allocation of high-priority tasks to congested nodes and accurately reflecting the true node load. Layered load balancing is performed based on task queue congestion level, comprehensive load value, task status data, and task priorities, ensuring resources are tilted towards high-priority, high-complexity tasks while avoiding wasted resources on idle nodes. Linking task priorities with real-time node status fundamentally solves the imbalance between overloaded and idle nodes, ensuring precise matching of resources and task requirements. Operator fusion and memory pre-allocation, based on task distribution results and hardware status, optimize node links and resource configurations to reduce computing power waste. 
Model pruning optimization combines business scenario tags and task priorities to adapt the model to node computing power and task requirements, improving inference speed and reducing latency fluctuations. Finally, real-time updates of node hardware status ensure that subsequent resource allocation is based on the latest node status, avoiding long-term overload or idleness, further stabilizing inference latency, and guaranteeing continuous second-level response for urgent tasks. Therefore, this invention dynamically allocates resources by linking task priority, scenario complexity, and real-time node status, reducing the burden on overloaded nodes and activating idle nodes, significantly improving the utilization rate of computing power resources and meeting the requirements for timely response to urgent tasks.
Attached Figure Description
[0017] Figure 1 is a flowchart of a load-balancing-based power dispatching model inference acceleration method provided in an embodiment of the present invention. Figure 2 is a schematic structural diagram of a load-balancing-based power dispatching model inference acceleration device provided in an embodiment of the present invention.
Detailed Implementation
[0018] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0019] Referring to Figure 1, to address the waste of computing resources in existing technologies, which do not allocate resources dynamically according to task priority, business scenario complexity, and the real-time status of computing nodes, an embodiment of the present invention provides a load-balancing-based power dispatching model inference acceleration method, comprising: S1. Obtain the hardware status data of the computing power card and the task status data of the power dispatching task at the current moment; the task status data includes business scenario tags and task priorities; Specifically, hardware status data refers to the hardware-related data generated while the computing card is operating, including computing resource status data and video memory status data. This data is obtained through the native SDK of domestically produced computing cards and reflects the hardware capacity and operational status of the computing nodes. Task status data refers to the core attribute data of power dispatching tasks, including business scenario tags and task priorities, recorded through the inference framework callback mechanism. Business scenario tags identify the business type to which the task belongs, such as fault handling or equipment parameter query; task priorities are assigned according to the urgency of the business, such as P0, P1, and P2.
[0020] To illustrate, hardware status data is collected through the native SDK of domestically produced computing cards to ensure data accuracy. Task status data is recorded through the inference framework callback mechanism, and data is transmitted in real time using the UDP protocol to provide basic data support for subsequent load assessment and scheduling.
[0021] S2. Perform load assessment based on the hardware status data and the task status data to obtain the congestion level of the task queue and the comprehensive load value of each computing node. Specifically, load assessment refers to the process of quantifying the load of computing nodes and the backlog of task queues based on hardware status data and task status data using a preset algorithm, outputting the task queue congestion level and the comprehensive load value of each computing node. The comprehensive load value represents a quantified value that integrates multiple dimensions such as GPU utilization, memory usage, and task queue length, used to accurately reflect the true load level of the computing nodes. The task queue congestion level represents an indicator that quantifies the task backlog of computing nodes, used to determine whether a node has the capacity to handle new tasks.
[0022] To illustrate, the acquired hardware status data and task status data are preprocessed, such as cleaning outliers and standardizing indicator formats. Then, a comprehensive load value is obtained through weighted calculation and other algorithms. The congestion level of the task queue is obtained by the ratio of the queue adaptation deviation to a preset threshold, thereby judging the task backlog of nodes and solving the problem of fuzzy judgment of node status in existing technologies.
[0023] Preferably, the hardware status data includes: computing resource status data and video memory status data; the computing resource status data includes GPU utilization; the video memory status data includes video memory occupancy rate; and the task status data further includes task queue length. Based on the hardware status data and the task status data, a load assessment is performed to obtain the task queue congestion level and the comprehensive load value of each computing node, including: The comprehensive load value of each computing node is obtained by weighting the GPU utilization, the video memory usage, and the task queue length. The deviation between the overall load value and the task queue length is taken as the queue adaptation deviation. The congestion level of the task queue is obtained based on the ratio of the queue adaptation deviation to the preset queue length threshold.
[0024] Specifically, computing resource status data, including GPU utilization, is the core data reflecting the computing power of the computing card and is crucial for quantifying the utilization of node computing power. Memory status data, including memory occupancy rate, reflects memory usage and directly affects task execution smoothness. Task queue length, representing the number of power dispatching tasks currently waiting to be executed on a given computing node, is a fundamental indicator for judging queue congestion and is a non-negative integer. Queue adaptation deviation refers to the difference between the contribution of task queue length to the overall load value and the standardized actual task queue length, used to quantify whether the impact of queue congestion on the overall load matches the actual situation. The preset queue length threshold, set as a baseline value (10) based on the parallel processing capabilities of domestic computing cards and the response requirements of power dispatching services, is used to standardize task queue length and determine congestion levels.
[0025] Illustratively, this embodiment specifies the computing resource status data and video memory status data within the hardware status data, together with their core indicators, GPU utilization and video memory occupancy rate. It also adds the task queue length to the task status data, which addresses the vague load assessment indicators and insufficient data dimensions of existing technologies and provides concrete input for accurate assessment. A weighted calculation is adopted, with weights set according to how strongly power dispatching services depend on each hardware resource: GPU utilization has a weight of 0.4, because computing power directly determines inference speed; video memory occupancy rate has a weight of 0.3, because insufficient video memory can block tasks; and task queue length has a weight of 0.3, because queue accumulation significantly increases latency. The calculation formula is:
Comprehensive load value = GPU utilization × 0.4 + video memory occupancy rate × 0.3 + task queue length × 0.3
The congestion level is then obtained in four steps. First, the task queue length is standardized to unify the metric dimensions. Second, the contribution of the task queue length to the comprehensive load value is extracted, where queue contribution = standardized queue length × 0.3. Third, the queue adaptation deviation is calculated. Finally, the congestion level is quantified by the ratio of the queue adaptation deviation to the preset queue length threshold: for example, a ratio ≥ 0.8 indicates severe congestion, 0.5-0.8 indicates moderate congestion, and < 0.5 indicates mild or no congestion. Because the weighted calculation is adapted to the characteristics of power dispatching operations, the comprehensive load value reflects the actual node load and provides a basis for task distribution. The quantitative logic of the queue adaptation deviation and congestion level allows those skilled in the art to implement the solution directly, improving its operability.
[0026] In a preferred embodiment of the present invention, a preset queue length threshold is set to 10. The hardware status data and task status data of a certain computing node are as follows: GPU utilization 85%, video memory usage 75%, task queue length 9. The comprehensive load value is calculated as follows: 85×0.4+75×0.3+9×0.3=34+22.5+2.7=59.2; the queue adaptation deviation is calculated as follows: standardized queue length = (9 / 10)×100%=90%, queue contribution = 90%×0.3=27%, queue adaptation deviation = 27%-90%=-63%; the queue congestion level is calculated as follows: ratio = |-63%| / 10=6.3%, which is determined to be moderate congestion. Therefore, this node only undertakes low-priority tasks of P1 / P2 level and does not assign urgent tasks of P0 level for the time being.
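The load assessment arithmetic can be sketched in a few lines. This is a minimal illustration that reuses the weights, the queue length threshold of 10, and the congestion bands quoted in the description; the class and function names are hypothetical. How the deviation is scaled into the ratio before banding is left to the caller, so `congestion_level` simply takes a ratio as input.

```python
from dataclasses import dataclass

# Weights and threshold quoted from the description; names are illustrative.
W_GPU, W_MEM, W_QUEUE = 0.4, 0.3, 0.3
QUEUE_LENGTH_THRESHOLD = 10


@dataclass
class NodeStatus:
    gpu_utilization: float    # percent, 0-100
    memory_occupancy: float   # percent, 0-100
    queue_length: int         # tasks waiting on the node


def comprehensive_load(s: NodeStatus) -> float:
    """Weighted sum of GPU utilization, memory occupancy and queue length."""
    return (W_GPU * s.gpu_utilization
            + W_MEM * s.memory_occupancy
            + W_QUEUE * s.queue_length)


def queue_adaptation_deviation(s: NodeStatus) -> float:
    """Queue contribution minus standardized queue length, as fractions."""
    standardized = s.queue_length / QUEUE_LENGTH_THRESHOLD
    contribution = standardized * W_QUEUE
    return contribution - standardized


def congestion_level(ratio: float) -> str:
    """Map a congestion ratio to the bands given in the description."""
    if ratio >= 0.8:
        return "severe"
    if ratio >= 0.5:
        return "moderate"
    return "mild"


node = NodeStatus(gpu_utilization=85, memory_occupancy=75, queue_length=9)
load = comprehensive_load(node)
deviation = queue_adaptation_deviation(node)
print(round(load, 1), round(deviation, 2))  # prints 59.2 -0.63
```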
[0027] S3. Perform hierarchical load balancing based on the congestion level of the task queue, the comprehensive load value, the task status data, and the task priority corresponding to the power dispatch task to obtain the task distribution result; Specifically, layered load balancing refers to the process of dynamically allocating tasks and resources based on task priority, queue congestion level, and overall load value through a three-tier architecture (front-end request distribution layer, computing power node allocation layer, and intra-node resource allocation layer). The task distribution result includes the execution basis, such as the target computing power node identifier, resource allocation scheme, and task execution order, and is the output of layered load balancing.
[0028] To illustrate, a three-tier load balancing architecture is constructed. The front-end request distribution layer distributes requests based on task priority and node load. The computing power node allocation layer allocates tasks to the optimal node. The intra-node resource allocation layer optimizes computing power allocation and finally generates task distribution results, achieving precise matching between tasks and resources.
[0029] The three-tier load balancing architecture is illustrated in Table 1:
Table 1
Specifically, the task priority determination for power dispatching tasks is shown in Table 2:
Table 2
Preferably, the task status data further includes: a scheduling device identifier; the computing resource status data further includes: the current amount of resources occupied and the total resource capacity; Based on the congestion level of the task queue, the comprehensive load value, the task status data, and the task priority corresponding to the power dispatch task, a hierarchical load balancing is performed to obtain the task distribution result, including: Based on the task priority corresponding to the power dispatch task and the comprehensive load value of each computing node, computing nodes with a comprehensive load value less than the preset load threshold are selected as low load nodes. The low load nodes are then sorted according to task priority and task queue congestion to obtain the hierarchical sorting results. Power dispatching tasks with the same dispatching device identifier are routed to the same computing power node according to the hierarchical sorting results, so that each computing power node can allocate memory blocks to power dispatching tasks based on task priority and node load rate. The node load rate of the computing power node is determined based on the current occupied resources and the total resource capacity. The hierarchical sorting results and the memory block allocation results of the computing power nodes are used as the task distribution results.
[0030] Specifically, the scheduling device identifier is a unique identifier for the device corresponding to the power dispatching task, such as a device ID. It is used to associate all related tasks of the same device and is core data for improving cache hit rate. Current resource usage represents the amount of hardware resources currently used by the computing node, including the number of GPU cores used and the amount of video memory used. It is the basic data for the computing node load rate. Total resource capacity is the upper limit of the computing node's hardware resources, such as the total number of GPU cores and the total video memory capacity, reflecting the node's maximum carrying capacity. Node load rate is a single-dimensional hardware load indicator, calculated as: Node load rate = (Current resource usage / Total resource capacity) × 100%, used to quickly determine the node's resource usage in a specific dimension. The preset load threshold is a benchmark value set based on the hardware carrying capacity of domestic computing cards and the second-level response requirements of power dispatching services. In this invention, it can be 70%, used to filter low-load nodes with the ability to undertake new tasks. Low-load nodes are computing nodes with a comprehensive load value less than the preset load threshold (70%), possessing the remaining capacity to undertake new tasks. The hierarchical sorting result represents the sorting of low-load nodes by task priority weight (60%) and queue congestion weight (40%), ensuring that high-priority tasks are allocated to the optimal nodes first. The memory block allocation result represents the configuration information such as the size and location of memory blocks allocated to tasks by the computing power nodes based on task priority and their own load rate. The core is to allocate contiguous memory blocks to high-priority tasks.
[0031] Illustratively, the scheduling device identifier is used to associate tasks on the same device, and the current occupied resources and total resource capacity are used to calculate the node load rate; together they enrich the decision dimensions of load balancing and address the insufficient decision-making basis of existing technologies. Based on the priority of power dispatching tasks (for example, P0-level tasks require low-load nodes) and the comprehensive load value of each node, low-load nodes with the capacity to handle the task are selected, which avoids assigning tasks to overloaded nodes and causing delays. Low-load nodes are sorted first by task priority in descending order and then, within the same priority, by queue congestion level from lightest to heaviest, so that high-priority tasks are preferentially assigned to the best low-load, low-congestion nodes and the response speed of emergency tasks is guaranteed. Simultaneously, a consistent hashing algorithm is used to achieve routing consistency. The specific steps are as follows. Constructing the node hash ring: a unique identifier (IP + port) is assigned to each node and mapped to a hash value in the range 0 to 2^32 using the MD5 hash function, and 3-5 virtual replica nodes are added for each node to avoid hash ring skew. Device ID routing mapping: the same hash function is applied to the scheduling device identifier, and the nearest node is searched clockwise on the hash ring to route the task to that node. Consistency guarantee: when a node is added or taken offline, only the affected device IDs are remapped, ensuring a cache hit rate of ≥30%. Computing nodes allocate GPU memory resources to tasks based on task priority and node load rate, generating GPU memory block allocation results that reduce GPU memory fragmentation. For example, P0-level tasks are allocated contiguous GPU memory blocks first, and larger GPU memory blocks are allocated when the load rate is low. 
Finally, the hierarchical sorting results and GPU memory block allocation results are integrated to clarify key information such as the target computing node identifier and GPU memory configuration, providing a direct basis for the subsequent generation of model execution units.
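The hash ring routing steps described in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the class name HashRing, the helper md5_ring_position, the use of the first 4 bytes of the MD5 digest to obtain a position in [0, 2^32), the replica count of 4 (within the stated 3-5 range), and the node addresses are all hypothetical.

```python
import bisect
import hashlib


def md5_ring_position(key: str) -> int:
    """Map a string to a ring position in [0, 2**32) using the first 4 bytes of its MD5 digest."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:4], "big")


class HashRing:
    """Consistent hash ring with virtual replica nodes (illustrative sketch)."""

    def __init__(self, nodes, replicas=4):
        self.replicas = replicas      # virtual replicas per physical node (source: 3-5)
        self._positions = []          # sorted ring positions
        self._owner = {}              # ring position -> physical node identifier
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        for i in range(self.replicas):
            pos = md5_ring_position(f"{node}#{i}")
            if pos in self._owner:    # extremely unlikely collision: keep the first owner
                continue
            bisect.insort(self._positions, pos)
            self._owner[pos] = node

    def remove_node(self, node: str) -> None:
        self._positions = [p for p in self._positions if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def route(self, device_id: str) -> str:
        """Return the first node found clockwise from the hash of the device identifier."""
        if not self._positions:
            raise ValueError("hash ring is empty")
        pos = md5_ring_position(device_id)
        idx = bisect.bisect_left(self._positions, pos) % len(self._positions)
        return self._owner[self._positions[idx]]


# Hypothetical node identifiers in "IP:port" form.
ring = HashRing(["10.0.0.1:8000", "10.0.0.2:8000", "10.0.0.3:8000"])
target = ring.route("D12345")  # every task of device D12345 lands on the same node
```

Because only the ring positions of a removed node disappear, device identifiers that were routed to the remaining nodes keep their mapping, which is the property the consistency guarantee relies on.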
[0032] In a preferred embodiment of the present invention, a preset load threshold of 70% is set. A power dispatching system has three low-load nodes: Node A: overall load value 62%, mild queue congestion, total GPU cores 320, currently occupying 160 cores; Node B: overall load value 58%, moderate queue congestion, total GPU cores 320, currently occupying 128 cores; Node C: overall load value 53%, mild queue congestion, total GPU cores 320, currently occupying 96 cores. Sorted by task priority weight 60% + congestion weight 40%, the result is Node C > Node A > Node B. Two P1-level device parameter query tasks with device ID "D12345" are routed to Node C after MD5 hash function mapping. Node C's GPU load rate = (96 / 320) × 100% = 30%. Each of the two tasks is allocated a 1.8GB contiguous memory block, generating the memory block allocation result. The hierarchical sorting rules in this embodiment ensure that high-priority tasks are assigned to the optimal node first, and the response time of P0 level tasks is stable within 3 seconds; the hash consistency algorithm achieves the consistency of task routing within the same device, improves the cache hit rate, reduces redundant calculations, and reduces computing power consumption; the memory allocation based on priority and load rate reduces memory fragmentation and improves memory resource utilization.
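The node selection in this embodiment can be reproduced with a short sketch. The description does not spell out how the 60% and 40% weights are combined numerically, so the code below makes a simplifying assumption: nodes under the 70% threshold are ordered by congestion level first and by comprehensive load value second. The names Node, CONGESTION_RANK, and rank_low_load_nodes, and the "severe" congestion level, are hypothetical.

```python
from dataclasses import dataclass

LOAD_THRESHOLD = 70.0  # preset load threshold, percent (from the description)
# Assumed ordering of congestion levels; "severe" is a hypothetical third level.
CONGESTION_RANK = {"mild": 0, "moderate": 1, "severe": 2}


@dataclass
class Node:
    name: str
    overall_load: float  # comprehensive load value, percent
    congestion: str      # queue congestion level
    total_cores: int
    used_cores: int

    @property
    def gpu_load_rate(self) -> float:
        """Node load rate = current resource usage / total resource capacity x 100%."""
        return self.used_cores * 100.0 / self.total_cores


def rank_low_load_nodes(nodes):
    """Keep nodes below the load threshold, then sort by congestion and by load (assumed rule)."""
    low_load = [n for n in nodes if n.overall_load < LOAD_THRESHOLD]
    return sorted(low_load, key=lambda n: (CONGESTION_RANK[n.congestion], n.overall_load))


nodes = [
    Node("A", 62.0, "mild", 320, 160),
    Node("B", 58.0, "moderate", 320, 128),
    Node("C", 53.0, "mild", 320, 96),
]
ranking = [n.name for n in rank_low_load_nodes(nodes)]  # ['C', 'A', 'B']
```

Under this assumed rule the ordering matches the embodiment (Node C > Node A > Node B), and Node C's GPU load rate evaluates to 30%.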
[0033] S4. Based on the task distribution results and the hardware status data, perform operator fusion and memory pre-allocation on the computing power node links to generate model execution units adapted to the computing power nodes. Specifically, operator fusion refers to an optimization method that integrates multiple independently executed operators into a composite operator, reducing data transfer overhead between operators. Memory pre-allocation refers to a resource management method that pre-allocates contiguous memory blocks based on task requirements and hardware status, reducing memory fragmentation. Model execution unit refers to an execution carrier adapted to the hardware characteristics of the target computing node, integrating composite operators, pre-allocated memory configuration, and quantization parameters.
[0034] To illustrate, for the hardware architecture of the target computing node, such as the Ascend DVPP instruction set, operator fusion is performed on the computing node link to integrate continuously executed operators into composite operators, and memory pre-allocation is performed to allocate contiguous video memory blocks. This generates model execution units adapted to the node and improves the compatibility of the model with domestic computing cards.
[0035] Preferably, the video memory status data further includes: total video memory capacity and video memory fragmentation rate; Based on the task distribution results and the hardware status data, operator fusion and memory pre-allocation are performed on the computing power node links to generate model execution units adapted to the computing power nodes, including: Based on the hierarchical sorting results, memory block allocation results, and operator types corresponding to the power dispatching tasks in the task distribution results, the continuously executed operators corresponding to each computing power node are selected, and the selected continuously executed operators are integrated into a composite operator used to characterize the target computing power node; wherein, the operator type of the target computing power node is consistent with the operator type corresponding to the power dispatching task. Based on the total video memory capacity and the video memory fragmentation rate, contiguous video memory blocks are allocated for composite operators and power scheduling tasks in the video memory block allocation results. The allocated contiguous memory blocks and composite operators are integrated to generate model execution units adapted to computing power nodes.
[0036] Specifically, the total video memory capacity is the maximum available capacity of the video memory of the computing node, serving as the upper limit for memory pre-allocation and directly determining the scale of tasks the node can handle. Video memory fragmentation rate is the percentage of scattered, underutilized space in the video memory. An excessively high fragmentation rate will prevent the allocation of contiguous video memory blocks, severely impacting inference efficiency. Continuously executed operators represent multiple independent operators sequentially invoked during the power dispatch task inference process, such as text parsing operators, knowledge association operators, and result generation operators; these are the core objects of operator fusion. Composite operators represent dedicated computing units that integrate continuously executed operators, adapted to the target computing node's hardware instruction set, significantly reducing data transfer overhead between operators. The target computing node represents the computing node determined by hierarchical sorting results to undertake the current task; its operator type must be consistent with the operator type corresponding to the power dispatch task.
[0037] To illustrate, based on the operator types in the task distribution results, continuously executed operators supported by the target computing power node are selected, such as the text parsing, fault type identification, and disposal plan generation operators for scheduling tasks. The operator execution logic is reconstructed according to the hardware instruction set of the target computing power node, the parallel computing granularity is optimized, and the operators are integrated into composite operators to reduce data transfer between operators. Then, based on the total video memory capacity and video memory fragmentation rate, contiguous video memory blocks are allocated to composite operators and power scheduling tasks; for example, 4GB of contiguous video memory is allocated to a P0-level task on a node with 32GB of video memory. At the same time, a pre-allocation and dynamic reclamation mechanism is adopted, which promptly reclaims idle video memory after task execution to keep the video memory fragmentation rate ≤30%. Finally, the allocated contiguous video memory blocks and composite operators are integrated so that the model execution unit adapts to the hardware characteristics of the target computing power node and directly calls the computing instruction set of domestic computing cards to perform inference, providing an efficient execution carrier for subsequent inference acceleration.
[0038] In this embodiment, the composite operator is adapted to the instruction set of domestic computing power cards, reducing data transmission overhead between operators and improving execution efficiency; the contiguous memory block allocation and dynamic reclamation mechanism keeps the memory fragmentation rate at or below 30%, avoiding task blocking caused by excessive fragmentation; the model execution unit is precisely adapted to the hardware characteristics of the target computing power node, improving the resource utilization of domestic computing power cards.
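The pre-allocation and dynamic reclamation mechanism can be illustrated with a toy allocator. This is a sketch under assumptions rather than the allocator of the invention: the class VideoMemoryPool, the first-fit placement policy, and the fragmentation metric (the share of free memory lying outside the largest free block) are all hypothetical, since the description only defines the fragmentation rate qualitatively.

```python
class VideoMemoryPool:
    """Toy first-fit allocator for contiguous video memory blocks (sizes in MB)."""

    def __init__(self, capacity_mb: int):
        self.capacity_mb = capacity_mb
        self.free_blocks = [(0, capacity_mb)]  # sorted list of (offset, size)
        self.allocated = {}                    # task_id -> (offset, size)

    def allocate(self, task_id: str, size_mb: int):
        """Reserve one contiguous block; return its offset, or None if none fits."""
        for i, (offset, size) in enumerate(self.free_blocks):
            if size >= size_mb:
                self.allocated[task_id] = (offset, size_mb)
                if size == size_mb:
                    del self.free_blocks[i]
                else:
                    self.free_blocks[i] = (offset + size_mb, size - size_mb)
                return offset
        return None

    def release(self, task_id: str) -> None:
        """Reclaim a block after task execution and merge adjacent free blocks."""
        self.free_blocks.append(self.allocated.pop(task_id))
        self.free_blocks.sort()
        merged = []
        for offset, size in self.free_blocks:
            if merged and merged[-1][0] + merged[-1][1] == offset:
                merged[-1] = (merged[-1][0], merged[-1][1] + size)
            else:
                merged.append((offset, size))
        self.free_blocks = merged

    def fragmentation_rate(self) -> float:
        """Assumed metric: percent of free memory that is not in the largest free block."""
        total_free = sum(size for _, size in self.free_blocks)
        if total_free == 0:
            return 0.0
        largest = max(size for _, size in self.free_blocks)
        return (1 - largest / total_free) * 100.0


pool = VideoMemoryPool(32 * 1024)           # 32GB node, as in the example above
pool.allocate("p0_fault_task", 4 * 1024)    # 4GB contiguous block for a P0-level task
```

Merging neighbouring free blocks on release is what keeps the assumed fragmentation metric low; a scheduler could compare fragmentation_rate() against the 30% bound before accepting new tasks.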
[0039] S5. Based on the business scenario tags and task priorities in the task status data, perform model pruning optimization in the model execution unit, and allocate the pruned model execution unit and task distribution results to the corresponding computing power nodes, so that the computing power nodes can infer the power dispatch task according to the task distribution results and the pruned model execution unit, obtain task execution data, and update the inferred hardware status data. Specifically, model pruning optimization refers to reducing model complexity by trimming redundant network layers, channels, and weights based on business scenarios and task priorities. Task execution data includes a dataset containing information such as inference results, inference time, and inference accuracy; it is the output of computing nodes after executing inference tasks.
[0040] To illustrate, the core network layers of the model are selected based on business scenario labels, and the pruning strategy is determined according to task priority: high-priority tasks use a low pruning rate and low-priority tasks use a higher pruning rate. Redundant parts are pruned and the model is fine-tuned. The optimized model execution units and task distribution results are assigned to the target nodes. After the nodes perform inference, they update the hardware status data, thereby improving inference speed while preserving accuracy.
[0041] Preferably, based on the business scenario tags and task priorities in the task status data, model pruning optimization is performed in the model execution unit, including: Based on the business scenario labels in the task status data, the business type corresponding to the power dispatching task is determined, and the network layer that is strongly related to the business type is selected as the target network layer in the model execution unit. The pruning rate corresponding to the target network layer is determined based on the task priority and the preset pruning rate table; The channels of the target network layer are pruned according to the pruning rate corresponding to the target network layer. Based on the pruned target network layer and channels, the pruned model execution unit is obtained.
[0042] Specifically, the business type is a power dispatching task category categorized based on business scenario tags, including fault handling, equipment parameter query, procedure clause query, and data statistics. Keyword matching is used to determine the core network layer of the model. The target network layer is the network layer in the model execution unit that is strongly related to the current business type, such as the fault type identification sublayer corresponding to the fault handling business; this is the core part that needs to be retained after pruning and optimization. The preset pruning rate table is a pruning rate standard set based on task priority. High-priority tasks at level P0 correspond to a low pruning rate of ≤20%, while low-priority tasks at levels P1 and P2 correspond to a normal pruning rate of ≤40%, used to balance model accuracy and inference efficiency. The pruning rate represents the proportion of redundant network layers, channels, or weights pruned, and is a core parameter controlling model complexity. A higher pruning rate results in a lighter model, but it is necessary to ensure that the accuracy loss is ≤1%.
[0043] To illustrate, core keywords of business scenario tags are extracted from task status data, such as "tripping" and "overload" corresponding to fault handling business. These are compared with a preset set of scenario keywords, such as fault handling keywords: tripping, overload, short circuit. The matching degree is calculated; if the matching degree is ≥60%, it is determined to be the corresponding business type, clarifying the core direction of model optimization. Then, the Transformer architecture of the model execution unit is analyzed. Based on domain knowledge, network layers strongly related to the current business type are identified, such as the professional terminology recognition sublayer of the text encoding layer and the scenario matching sublayer of the knowledge association layer. Redundant network layers unrelated to scheduling business, such as general semantic understanding and sentiment analysis, are marked to provide clear targets for pruning. A preset pruning rate table is queried, and the pruning rate is determined according to task priority. P0-level tasks need to retain more core parameters to ensure accuracy, with a pruning rate ≤20%; P1 / P2-level tasks can be moderately pruned to improve efficiency, with a pruning rate ≤40%, achieving a balance between accuracy and efficiency. Next, the importance score of each channel in the target network layer is calculated. The importance score is calculated as the L1 norm of the channel weight multiplied by the business correlation coefficient, which is labeled by domain experts as 0.8-1.0. Channels are then sorted by importance score, and redundant channels with scores below a threshold are pruned. For the retained channels, the weights with absolute values below 1e-5 are reset to 0, forming a sparse matrix adapted to the sparse computing instruction set of domestic computing cards. 
Finally, the pruned target network layer is incrementally fine-tuned using a scheduling domain-labeled dataset to recover the accuracy loss caused by pruning. The fine-tuned sparse matrix is then integrated with the retained core network layer to obtain the pruned model execution unit, adapted to the hardware characteristics of the target computing node.
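The channel scoring and pruning steps above can be sketched in plain Python. This is an illustrative sketch only: the function names are hypothetical, the upper bounds in the pruning rate table are used directly as the applied rate, and the number of pruned channels is derived from that rate instead of from a score threshold, which the description does not quantify. The incremental fine-tuning step is omitted.

```python
import math

# Upper bounds taken from the preset pruning rate table in the description.
PRUNING_RATE_TABLE = {"P0": 0.20, "P1": 0.40, "P2": 0.40}
WEIGHT_EPSILON = 1e-5  # retained weights with smaller absolute value are reset to 0


def channel_importance(weights, correlation):
    """Importance score = L1 norm of the channel weights x business correlation coefficient."""
    return sum(abs(w) for w in weights) * correlation


def prune_layer(channels, correlations, priority):
    """Drop the lowest-scoring channels and sparsify the rest.

    channels:     list of per-channel weight lists for the target network layer
    correlations: expert-labelled coefficients (0.8-1.0), one per channel
    Returns a dict mapping each retained channel index to its sparsified weights.
    """
    rate = PRUNING_RATE_TABLE[priority]
    n_prune = math.floor(len(channels) * rate)  # assumption: prune exactly this many
    scores = [channel_importance(w, c) for w, c in zip(channels, correlations)]
    lowest_first = sorted(range(len(channels)), key=lambda i: scores[i])
    pruned = set(lowest_first[:n_prune])
    return {
        i: [0.0 if abs(x) < WEIGHT_EPSILON else x for x in weights]
        for i, weights in enumerate(channels)
        if i not in pruned
    }


# Hypothetical five-channel layer.
layer = [[0.5, -0.5], [0.01, 0.02], [1.0, 2e-6], [0.3, 0.3], [0.001, 0.001]]
coeffs = [1.0, 0.8, 0.9, 1.0, 0.8]
kept_p0 = prune_layer(layer, coeffs, "P0")  # one channel removed (20% of 5)
kept_p1 = prune_layer(layer, coeffs, "P1")  # two channels removed (40% of 5)
```

The P0-level call keeps more channels than the P1-level call, which mirrors the accuracy-versus-efficiency trade-off expressed by the pruning rate table.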
[0044] This embodiment dynamically adjusts the pruning rate according to priority, balancing the accuracy of urgent tasks with the efficiency of non-urgent tasks, resulting in a highly flexible technical solution. After pruning, the model parameter size is reduced, memory usage is decreased, and inference speed is improved, making it suitable for nodes with weaker computing power. The sparse matrix structure is compatible with the sparse computing instruction set of domestic computing cards, further improving inference efficiency.
[0045] Preferably, the computing nodes perform inference on the power dispatching task based on the task distribution results and the pruned model execution unit to obtain task execution data, including: After receiving the pruned model execution unit and the task distribution results, the computing node determines the corresponding inference mode based on the task priority of the power scheduling task. When the task priority is high priority, the composite operator and the allocated contiguous memory block are directly invoked for inference to obtain the first execution data. When the task priority is not high priority, tasks in the same scenario are grouped in batches according to the business scenario label in the task status data, and it is determined whether the length of the task queue after grouping is greater than the preset batch processing threshold. If so, the second execution data is obtained by batch inference based on the pruned target network layer and channels. Otherwise, the composite operator and contiguous memory block are called separately to perform inference on the pruned target network layer to obtain the third execution data. The first execution data, the second execution data, and the third execution data are used as task execution data.
[0046] Specifically, the inference mode refers to the execution mode determined based on task priority, including individual inference mode and batch inference mode. High priority refers to P0 level tasks, corresponding to urgent business such as fault handling, with a response requirement of seconds (≤3s), requiring individual inference to avoid grouping delays. The preset batch processing threshold is a batch trigger threshold set based on the parallel processing capability of domestic computing cards; in this embodiment, it is 5. Batch inference is triggered when the queue length of low-priority tasks in the same scenario is ≥ the threshold. Batch inference means grouping low-priority tasks of the same business type and priority into groups of ≤32, adapting to the parallel capability of domestic computing cards, and executing inference through batch calculation instructions of domestic computing cards, with the core being to improve throughput. Individual inference means that high-priority tasks execute inference independently, without being grouped with other tasks, ensuring response speed.
[0047] Illustratively, after receiving the pruned model execution unit and task distribution results, the computing node selects the inference mode according to the task priority. High-priority tasks at level P0 use the individual inference mode to avoid delays caused by grouping; low-priority tasks at levels P1 / P2 use the batch inference mode to improve overall processing efficiency. High-priority task inference directly calls the composite operator and pre-allocated contiguous memory blocks, adapts to the pruned sparse matrix structure, calls the sparse computing instruction set of the domestic computing card, and executes inference individually to ensure second-level response. Low-priority task inference can first query the intelligent cache system. If the cache contains historical inference results with the same scheduling device identifier and the same business type, the result is extracted directly as the basic execution data without occupying computing resources. If the cache is not hit, tasks of the same scenario are grouped according to the business scenario label, with each group containing ≤32 tasks. It is then determined whether the length of the task queue after grouping is ≥ the preset batch processing threshold, which is 5 in this embodiment. If so, the composite operator and contiguous memory blocks are called, and batch inference is performed using the batch computing instructions of the domestic computing card to obtain the second execution data. If not, the composite operator is called separately to execute inference and obtain the third execution data. The first execution data (high-priority inference results), the second execution data (batch inference results), and the third execution data (low-priority individual inference results) are integrated with performance metrics such as inference time, accuracy, and peak memory usage to generate complete task execution data.
[0048] In a preferred embodiment of the present invention, the computing node receives three types of tasks: Task 1: P0 level, fault handling; Tasks 2-6: P1 level, device parameter query, queue length 5; Task 7: P2 level, data statistics, queue length 1. Task 1 uses the individual inference mode, Tasks 2-6 use the batch inference mode, and Task 7 uses the low-priority individual inference mode. Task 1 calls the fault handling composite operator and a 4GB contiguous memory block, adapts to the sparse matrix, and calls the Ascend sparse computation instruction; the inference time is 2.7s, yielding the first execution data. Tasks 2-6 miss the cache and are grouped into group 1; since the queue length of 5 is ≥ the threshold of 5, they call the parameter query composite operator and a 3GB contiguous memory block and use the Ascend batch computation instruction; the inference time is 4.5ms, yielding the second execution data. Task 7 misses the cache and separately calls the data statistics composite operator and a 1GB contiguous memory block; the inference time is 7.8ms, yielding the third execution data. The inference results, latency, accuracy, and peak memory usage of the three task types are integrated to form the complete task execution data. In this embodiment, high-priority P0 tasks are inferred independently, with a stable response time of less than 3 seconds, meeting the urgent operational needs of power dispatch. Low-priority tasks are inferred in batches with intelligent caching, increasing throughput by 2-3 times and reducing computing resource consumption by 30%. Dynamic switching of inference modes adapts to different task requirements, making the technical solution highly practical.
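The inference mode selection in this embodiment can be expressed as a small planning function. This is a sketch under assumptions: the names Task and plan_inference, the cache represented as a set of (device identifier, business type) pairs, and the device identifiers used for Tasks 1 and 7 are hypothetical. Queues are keyed by business type and priority, the threshold test uses ≥ as in this embodiment, and queues that qualify for batching are split into groups of at most 32.

```python
from collections import defaultdict
from dataclasses import dataclass

BATCH_THRESHOLD = 5   # preset batch processing threshold in this embodiment
MAX_GROUP_SIZE = 32   # upper bound on tasks per batch group


@dataclass(frozen=True)
class Task:
    task_id: int
    priority: str   # "P0", "P1" or "P2"
    scenario: str   # business scenario label / business type
    device_id: str  # scheduling device identifier


def plan_inference(tasks, cache):
    """Assign each task to an inference mode.

    cache is a set of (device_id, scenario) pairs that already have a stored result.
    """
    plan = {"individual_high": [], "cached": [], "batch": [], "individual_low": []}
    queues = defaultdict(list)
    for task in tasks:
        if task.priority == "P0":
            plan["individual_high"].append(task.task_id)  # never grouped
        elif (task.device_id, task.scenario) in cache:
            plan["cached"].append(task.task_id)           # cache hit, no inference needed
        else:
            queues[(task.scenario, task.priority)].append(task.task_id)
    for queue in queues.values():
        if len(queue) >= BATCH_THRESHOLD:
            for start in range(0, len(queue), MAX_GROUP_SIZE):
                plan["batch"].append(queue[start:start + MAX_GROUP_SIZE])
        else:
            plan["individual_low"].extend(queue)
    return plan


tasks = [Task(1, "P0", "fault handling", "D00001")]
tasks += [Task(i, "P1", "device parameter query", "D12345") for i in range(2, 7)]
tasks += [Task(7, "P2", "data statistics", "D00007")]
plan = plan_inference(tasks, cache=set())
```

With an empty cache, Task 1 is planned for individual inference, Tasks 2-6 form one batch group, and Task 7 falls back to low-priority individual inference, as in the embodiment.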
[0049] S6. Real-time monitoring of the computing card is performed based on the updated hardware status data and the task execution data.
[0050] Specifically, real-time monitoring of computing power cards refers to the process of monitoring the operating status of computing power nodes, triggering abnormal alarms, and handling them based on updated hardware status data and task execution data.
[0051] To illustrate, the updated hardware status data and task execution data are extracted and compared with preset thresholds to trigger tiered alarms; anomalies are handled through task migration, backup node startup, and other methods, while full-link logs are recorded to form a closed-loop optimization mechanism to ensure stable system operation.
[0052] Preferably, real-time monitoring of the computing card is performed based on the updated hardware status data and the task execution data, including: A Level 1 warning is generated if the updated GPU utilization is not less than the preset first monitoring threshold and the updated video memory usage is not less than the preset second monitoring threshold. The current computing node that generates a Level 1 warning is switched to a restricted state, so that the current computing node stops receiving new power scheduling tasks and migrates low-priority power scheduling tasks to other computing nodes until the GPU utilization is less than the preset first monitoring threshold and the video memory usage is less than the preset second monitoring threshold, then it is switched to the running state. Preferably, it further includes: If the number of computing nodes that generate a Level 1 warning exceeds a preset node number threshold, a Level 2 warning will be generated. When a Level 2 warning is generated, the power scheduling tasks of all computing nodes that switch to a restricted state will be migrated to the standby node.
[0053] Specifically, the preset first monitoring threshold represents the alarm trigger threshold for GPU utilization, set at 90% based on the hardware capacity limit of domestic computing cards. Exceeding this threshold indicates that the node's computing resources are nearing saturation. The preset second monitoring threshold represents the alarm trigger threshold for video memory utilization, set at 95% based on the total video memory capacity and business security redundancy. Exceeding this threshold indicates that video memory resources are about to be exhausted. Level 1 warning is the alarm level when a single node's hardware load is close to its limit or slightly abnormal, affecting only a single node and requiring no backup node activation. Level 2 warning is the alarm level when multiple nodes are abnormal or there is a serious hardware failure, affecting the entire cluster and requiring the emergency activation of backup nodes to avoid business interruption. The preset node number threshold is the threshold for the number of Level 1 warning nodes that trigger Level 2 warnings; in this embodiment, it is 2, used to determine cluster-level anomalies. The restricted state represents the temporary operating state of a Level 1 warning node, the core of which is to stop receiving new tasks and migrate low-priority tasks until the load returns to normal. The operating state represents the normal operating state of the computing node, which can normally receive and execute various power scheduling tasks. Backup nodes refer to pre-deployed redundant computing power nodes, which are used to take over all tasks of faulty or overloaded nodes when a level 2 warning occurs, ensuring that business is not interrupted.
[0054] To illustrate, GPU utilization and memory usage are extracted from the updated hardware status data. When GPU utilization is greater than or equal to the preset first monitoring threshold (90%) and memory usage is greater than or equal to the preset second monitoring threshold (95%), a Level 1 warning is generated to accurately identify the risk of single-node overload. The number of computing nodes that generate Level 1 warnings is counted. When the number is greater than or equal to the preset node number threshold, a Level 2 warning is generated to identify cluster-level anomaly risks and prevent the spread of faults. The computing nodes that generate Level 1 warnings are switched to the restricted state and stop receiving new power scheduling tasks. The low-priority P2 tasks of such a node are migrated to low-load nodes until its GPU utilization is less than 90% and its memory usage is less than 95%, at which point it is switched back to the running state, quickly reducing the load on the single node. When a Level 2 warning is handled, a backup computing node is immediately activated, and all power scheduling tasks of the restricted nodes, including P0 / P1 / P2 levels, are migrated to the backup node. The faulty node is isolated and maintenance personnel are notified for repair. The resource allocation weight is adjusted to increase the resource ratio of high-priority tasks to 70% to ensure that core business is not interrupted. Finally, the full-link monitoring data is recorded in JSON format, including monitoring indicator values, alarm triggering, handling results, and task execution details. The log retention period is ≥365 days, and multi-dimensional queries by task ID, computing node number, and time are supported for compliance auditing and fault tracing.
[0055] In a preferred embodiment of the present invention, the preset first monitoring threshold is 90%, the preset second monitoring threshold is 95%, the preset node quantity threshold is 2, and the system deploys 1 backup node. After the update, Node A's GPU utilization was 93% and its memory usage was 96%, meeting both threshold conditions and generating a Level 1 warning. Node B's GPU utilization was 94% and its memory usage was 97%, also generating a Level 1 warning. Since two nodes generated Level 1 warnings, reaching the preset node number threshold, a Level 2 warning was generated. When nodes A and B switched to the restricted state, they stopped receiving new tasks and migrated P2-level tasks to low-load nodes C and D. Node A's load decreased to 72% GPU utilization and 68% memory usage, and it switched back to the running state. Node B remained in the restricted state due to a memory read / write error. During the Level 2 warning handling, a backup node was activated, and all tasks from node B, including one P0-level fault handling task and three P1-level device parameter query tasks, were migrated to the backup node. Node B was isolated, and maintenance personnel were notified for repair. The backup node executed the migrated tasks normally, with the P0-level task inference taking 2.9 seconds without service interruption. Finally, the monitoring indicators, alarm trigger times, handling steps, and task migration details of nodes A and B were recorded in JSON format, and the logs were saved to the audit database. The tiered alarm mechanism accurately identifies anomalies at different levels, with a false alarm rate of less than 5%, thus improving monitoring reliability.
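The tiered alarm logic of this embodiment can be sketched as follows. This is a minimal illustration, not the monitoring module itself: the function names, the metric dictionary layout, the JSON field names, and the values for nodes C and D are hypothetical. The Level 2 test uses ≥ the node number threshold, following this embodiment, where two warning nodes with a threshold of 2 trigger a Level 2 warning.

```python
import json

GPU_THRESHOLD = 90.0      # preset first monitoring threshold, percent
MEM_THRESHOLD = 95.0      # preset second monitoring threshold, percent
NODE_COUNT_THRESHOLD = 2  # preset node number threshold


def is_level1(metric):
    """Level 1: GPU utilization and video memory usage both reach their thresholds."""
    return metric["gpu"] >= GPU_THRESHOLD and metric["mem"] >= MEM_THRESHOLD


def can_resume(metric):
    """A restricted node returns to the running state once both values drop below the thresholds."""
    return metric["gpu"] < GPU_THRESHOLD and metric["mem"] < MEM_THRESHOLD


def evaluate_cluster(metrics):
    """Return warning levels and node states for one monitoring cycle."""
    level1 = sorted(name for name, m in metrics.items() if is_level1(m))
    return {
        "level1_nodes": level1,
        "level2": len(level1) >= NODE_COUNT_THRESHOLD,
        "states": {name: ("restricted" if name in level1 else "running") for name in metrics},
    }


def to_log_record(metrics, result):
    """Serialize one monitoring cycle as a JSON log line (field names are assumptions)."""
    return json.dumps({"metrics": metrics, "alarm": result}, sort_keys=True)


metrics = {
    "A": {"gpu": 93.0, "mem": 96.0},
    "B": {"gpu": 94.0, "mem": 97.0},
    "C": {"gpu": 40.0, "mem": 50.0},  # hypothetical low-load node
    "D": {"gpu": 35.0, "mem": 45.0},  # hypothetical low-load node
}
result = evaluate_cluster(metrics)
record = to_log_record(metrics, result)
```

For the values in this embodiment, nodes A and B are reported as Level 1 nodes and the Level 2 flag is set; after Node A drops to 72% GPU utilization and 68% memory usage, can_resume reports that it may return to the running state.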
[0056] By implementing this embodiment, hardware status data of the computing power card and task status data of power dispatching tasks are obtained. The task status data includes business scenario tags and task priorities, supplementing the allocation basis missing in existing technologies. Business scenario tags quantify scenario complexity, task priorities clarify response urgency, and hardware status data reflects node carrying capacity. These three together constitute the basic data support for dynamic allocation. Through load assessment, the congestion level of the task queue and the comprehensive load value of each computing power node are obtained, avoiding the ambiguous judgment of node status in existing technologies. Then, the comprehensive load value integrates hardware status and task queue status. The congestion level of the task queue quantifies the task backlog on nodes, preventing the allocation of high-priority tasks to congested nodes and accurately reflecting the true load of nodes. Based on the congestion level of the task queue, the comprehensive load value, task status data, and task priorities, hierarchical load balancing is performed to ensure that resources are tilted towards high-priority, high-complexity tasks, while avoiding the waste of idle node resources. By linking task priorities with real-time node status, the imbalance of allocation between some nodes being overloaded and others being idle is fundamentally solved, ensuring accurate matching of resources and task requirements. Operator fusion and memory pre-allocation, based on task distribution results and hardware status, optimize node links and resource configurations to reduce computing power waste. Model pruning optimization combines business scenario tags and task priorities to adapt the model to node computing power and task requirements, improving inference speed and reducing latency fluctuations. 
Finally, real-time updates of node hardware status ensure that subsequent resource allocation is based on the latest node status, avoiding long-term overload or idleness, further stabilizing inference latency, and guaranteeing continuous second-level response for urgent tasks. Therefore, this invention dynamically allocates resources by linking task priority, scenario complexity, and real-time node status, reducing the burden on overloaded nodes and activating idle nodes, significantly improving the utilization rate of computing power resources and meeting the requirements for timely response to urgent tasks.
[0057] See Figure 2, which is a schematic diagram of the structure of a load-balancing-based power dispatch model inference acceleration device according to an embodiment of the present invention, comprising: The data acquisition module is used to acquire the hardware status data of the computing card and the task status data of the power dispatching task at the current moment; the task status data includes business scenario tags and task priorities. The load assessment module is used to perform load assessment based on the hardware status data and the task status data to obtain the congestion level of the task queue and the comprehensive load value of each computing node. The load balancing module is used to perform hierarchical load balancing based on the congestion level of the task queue, the comprehensive load value, the task status data, and the task priority corresponding to the power dispatch task, so as to obtain the task distribution result. The computing node pre-allocation module is used to perform operator fusion and memory pre-allocation on the computing node links according to the task distribution results and the hardware status data, and to generate model execution units adapted to the computing nodes. The task inference module is used to perform model pruning optimization in the model execution unit based on the business scenario tags and task priorities in the task status data, and to allocate the pruned model execution unit and task distribution results to the corresponding computing power nodes, so that the computing power nodes can infer the power dispatch task based on the task distribution results and the pruned model execution unit to obtain task execution data and update the hardware status data after inference. The computing power card monitoring module is used to monitor the computing power card in real time based on the updated hardware status data and the task execution data.
[0058] This invention provides a load-balancing-based power dispatch model inference acceleration device. The data acquisition module acquires hardware status data of the computing power card and task status data of the power dispatch task at the current moment; the task status data includes business scenario tags and task priorities. The load assessment module performs load assessment based on the hardware status data and task status data to obtain the task queue congestion level and the comprehensive load value of each computing power node. The load balancing module performs hierarchical load balancing based on the task queue congestion level, the comprehensive load value, the task status data, and the task priority corresponding to the power dispatch task to obtain the task distribution result. The computing power node pre-allocation module performs operator fusion and memory pre-allocation on the computing node links according to the task distribution results and the hardware status data, generating model execution units adapted to the computing nodes. The task inference module performs model pruning optimization in the model execution units based on the business scenario tags and task priorities in the task status data, and then allocates the pruned model execution units and task distribution results to the corresponding computing nodes, so that the computing nodes can infer the power dispatch task based on the task distribution results and the pruned model execution units to obtain task execution data and update the hardware status data after inference. Finally, the computing card monitoring module monitors the computing card in real time based on the updated hardware status data and the task execution data.
[0059] By acquiring hardware status data from computing cards and task status data from power dispatch tasks, including business scenario tags and task priorities, the invention supplies the allocation basis missing in existing technologies. Business scenario tags quantify scenario complexity, task priorities clarify response urgency, and hardware status data reflects node carrying capacity; together, these three form the foundational data for dynamic allocation. Load assessment yields the task queue congestion level and the comprehensive load value of each computing node, avoiding the ambiguous judgments of node status found in existing technologies. The comprehensive load value integrates hardware status and task queue conditions, while the task queue congestion level quantifies the task backlog on each node; this prevents high-priority tasks from being allocated to congested nodes and accurately reflects the true node load. Based on the task queue congestion level, comprehensive load value, task status data, and task priority, hierarchical load balancing ensures that resources are tilted towards high-priority, high-complexity tasks while avoiding wasted resources on idle nodes. Linking task priority with real-time node status addresses the root cause of the imbalance between overloaded and idle nodes, ensuring precise matching of resources and task requirements. Operator fusion and memory pre-allocation, based on the task distribution result and hardware status, optimize node links and resource configurations to reduce wasted computing power. Model pruning optimization combines business scenario tags and task priorities to adapt the model to node computing power and task requirements, improving inference speed and reducing latency fluctuations.
Finally, real-time updates of node hardware status ensure that subsequent resource allocation is based on the latest node status, avoiding long-term overload or idleness, further stabilizing inference latency, and guaranteeing continuous second-level response for urgent tasks. Therefore, this invention dynamically allocates resources by linking task priority, scenario complexity, and real-time node status, reducing the burden on overloaded nodes and activating idle nodes, significantly improving the utilization rate of computing power resources and meeting the requirements for timely response to urgent tasks.
[0060] It should be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Furthermore, in the accompanying drawings of the device embodiments provided by this invention, the connection relationships between modules indicate that they have communication connections, which can be specifically implemented as one or more communication buses or signal lines. Those skilled in the art can understand and implement this without any creative effort.
[0061] Those skilled in the art will understand that, for convenience and brevity, reference may be made to the corresponding process in the foregoing method embodiments for the specific working process of the device described above, which will not be repeated here.
[0062] Another embodiment of the present invention provides a terminal device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor. When the processor executes the computer program, it implements a load-balancing-based power dispatch model inference acceleration method as described in the above embodiments. The terminal device may be a desktop computer, laptop, handheld computer, cloud server, or other computing device. The terminal device may include, but is not limited to, a processor and a memory.
[0063] The processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor can be a microprocessor or any conventional processor. The processor is the control center of the terminal device, connecting all parts of the terminal device via various interfaces and lines.
[0064] The memory can be used to store the computer program. The processor implements various functions of the terminal device by running or executing the computer program stored in the memory and calling data stored in the memory. The memory may mainly include a program storage area and a data storage area. The program storage area may store the operating system, at least one application program required for a function, etc.; the data storage area may store data created through use of the terminal device, etc. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, at least one disk storage device, flash memory device, or other solid-state storage device.
[0065] Another embodiment of the present invention provides a computer-readable storage medium including a stored computer program, wherein, when the computer program is executed, it controls the device where the computer-readable storage medium is located to execute the load balancing-based power dispatching model inference acceleration method described in the above embodiment.
[0066] The storage medium is a computer-readable storage medium, and the computer program is stored in the computer-readable storage medium. When the computer program is executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable file, or some intermediate form. The computer-readable medium can include: any entity or device capable of carrying the computer program code, recording media, USB flash drive, portable hard drive, magnetic disk, optical disk, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc.
[0067] The above description represents the preferred embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principles of the present invention, and these improvements and modifications are also considered to be within the scope of protection of the present invention.
Claims
1. A load-balancing-based power dispatch model inference acceleration method, characterized in that it comprises: acquiring the hardware status data of the computing card and the task status data of the power dispatch task at the current moment, wherein the task status data includes business scenario tags and task priorities; performing load assessment based on the hardware status data and the task status data to obtain the task queue congestion level and the comprehensive load value of each computing node; performing hierarchical load balancing based on the task queue congestion level, the comprehensive load value, the task status data, and the task priority corresponding to the power dispatch task to obtain the task distribution result; performing operator fusion and memory pre-allocation on the computing node links based on the task distribution result and the hardware status data to generate model execution units adapted to the computing nodes; performing model pruning optimization in the model execution unit based on the business scenario tags and task priorities in the task status data, and allocating the pruned model execution unit and the task distribution result to the corresponding computing nodes, so that the computing nodes perform inference on the power dispatch task based on the task distribution result and the pruned model execution unit to obtain task execution data and update the hardware status data after inference; and monitoring the computing card in real time based on the updated hardware status data and the task execution data.
2. The load-balancing-based power dispatch model inference acceleration method according to claim 1, characterized in that the hardware status data includes computing resource status data and video memory status data; the computing resource status data includes GPU utilization; the video memory status data includes video memory usage; and the task status data further includes task queue length; wherein performing load assessment based on the hardware status data and the task status data to obtain the task queue congestion level and the comprehensive load value of each computing node comprises: weighting the GPU utilization, the video memory usage, and the task queue length to obtain the comprehensive load value of each computing node; taking the deviation between the comprehensive load value and the task queue length as the queue adaptation deviation; and obtaining the task queue congestion level based on the ratio of the queue adaptation deviation to a preset queue length threshold.
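The load assessment of claim 2 can be illustrated with a minimal sketch. The claim does not fix the weights, the units of the inputs, or the exact form of the "deviation", so the following are assumptions made only for illustration: utilization and usage are percentages, the weights are (0.5, 0.3, 0.2), the deviation is an absolute difference, and the names `NodeStatus`, `comprehensive_load`, and `congestion_level` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    gpu_utilization: float  # percent, 0-100 (assumed unit)
    vram_usage: float       # video memory usage, percent, 0-100 (assumed unit)
    queue_length: int       # number of queued tasks on the node


def comprehensive_load(node, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of GPU utilization, video memory usage and queue length.

    The weight values are illustrative assumptions, not taken from the claims.
    """
    w_gpu, w_mem, w_queue = weights
    return (w_gpu * node.gpu_utilization
            + w_mem * node.vram_usage
            + w_queue * node.queue_length)


def congestion_level(node, queue_threshold=100.0, weights=(0.5, 0.3, 0.2)):
    """Ratio of the queue adaptation deviation to a preset queue length threshold.

    The deviation is assumed here to be the absolute difference between the
    comprehensive load value and the task queue length.
    """
    deviation = abs(comprehensive_load(node, weights) - node.queue_length)
    return deviation / queue_threshold


node = NodeStatus(gpu_utilization=80.0, vram_usage=60.0, queue_length=10)
print(comprehensive_load(node), congestion_level(node))
```

With these assumed inputs the comprehensive load value is 60 and the congestion level is 0.5; a real deployment would calibrate the weights and the threshold against measured node behaviour.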
3. The load-balancing-based power dispatch model inference acceleration method according to claim 2, characterized in that the task status data further includes a dispatching device identifier; the computing resource status data further includes the current amount of occupied resources and the total resource capacity; wherein performing hierarchical load balancing based on the task queue congestion level, the comprehensive load value, the task status data, and the task priority corresponding to the power dispatch task to obtain the task distribution result comprises: based on the task priority corresponding to the power dispatch task and the comprehensive load value of each computing node, selecting computing nodes whose comprehensive load value is less than a preset load threshold as low-load nodes, and sorting the low-load nodes according to task priority and task queue congestion level to obtain hierarchical sorting results; routing power dispatch tasks with the same dispatching device identifier to the same computing node according to the hierarchical sorting results, so that each computing node allocates memory blocks to power dispatch tasks based on task priority and node load rate, wherein the node load rate of a computing node is determined based on the current amount of occupied resources and the total resource capacity; and taking the hierarchical sorting results and the memory block allocation results of the computing nodes as the task distribution result.
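The hierarchical load balancing of claim 3 can be sketched roughly as follows. The claim leaves the exact sort keys and the routing policy open, so this sketch assumes that low-load nodes are ranked by ascending congestion level, that tasks are served in priority order (a smaller number meaning more urgent), and that each new dispatching device identifier is bound to the next ranked node in round-robin fashion. All field names, `distribute`, and `node_load_rate` are hypothetical.

```python
def node_load_rate(occupied, total):
    """Node load rate from occupied resources and total resource capacity."""
    return occupied / total if total > 0 else 1.0


def distribute(tasks, nodes, load_threshold=0.7):
    """Route tasks to low-load nodes, keeping one device identifier on one node.

    tasks: dicts with 'id', 'priority' (smaller = more urgent) and 'device_id'.
    nodes: dicts with 'name', 'load' and 'congestion'.
    Returns a mapping {task id: node name}.
    """
    # Step 1: keep only nodes below the preset load threshold.
    low_load = [n for n in nodes if n["load"] < load_threshold]
    if not low_load:
        raise RuntimeError("no low-load node available")
    # Step 2: rank the low-load nodes, least congested first (assumed key).
    ranked = sorted(low_load, key=lambda n: (n["congestion"], n["load"]))
    # Step 3: serve urgent tasks first; the same device always maps to one node.
    device_to_node = {}
    assignment = {}
    next_idx = 0
    for task in sorted(tasks, key=lambda t: t["priority"]):
        dev = task["device_id"]
        if dev not in device_to_node:
            device_to_node[dev] = ranked[next_idx % len(ranked)]["name"]
            next_idx += 1
        assignment[task["id"]] = device_to_node[dev]
    return assignment


nodes = [
    {"name": "A", "load": 0.9, "congestion": 0.1},
    {"name": "B", "load": 0.3, "congestion": 0.5},
    {"name": "C", "load": 0.5, "congestion": 0.2},
]
tasks = [
    {"id": "t1", "priority": 2, "device_id": "d1"},
    {"id": "t2", "priority": 1, "device_id": "d2"},
    {"id": "t3", "priority": 3, "device_id": "d1"},
]
print(distribute(tasks, nodes))
```

In this example node A is excluded as overloaded, the most urgent task t2 takes the least congested node C, and t1 and t3 share node B because they carry the same device identifier.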
4. The load-balancing-based power dispatch model inference acceleration method according to claim 3, characterized in that the video memory status data further includes total video memory capacity and video memory fragmentation rate; wherein performing operator fusion and memory pre-allocation on the computing node links based on the task distribution result and the hardware status data to generate model execution units adapted to the computing nodes comprises: based on the hierarchical sorting results, the memory block allocation results, and the operator types corresponding to the power dispatch tasks in the task distribution result, selecting the continuously executed operators corresponding to each computing node, and fusing the selected continuously executed operators into a composite operator used to characterize a target computing node, wherein the operator type of the target computing node is consistent with the operator type corresponding to the power dispatch task; based on the total video memory capacity and the video memory fragmentation rate, allocating contiguous video memory blocks for the composite operators and the power dispatch tasks in the memory block allocation results; and integrating the allocated contiguous video memory blocks and the composite operators to generate model execution units adapted to the computing nodes.
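The two ideas in claim 4, fusing consecutively executed operators and pre-allocating contiguous video memory, can be shown with a toy sketch. It is not a GPU implementation: operators are plain Python callables, memory is modelled as integer offsets, and the assumption that only `total_capacity * (1 - fragmentation_rate)` is available as one contiguous region is introduced here for illustration. The names `fuse` and `preallocate` are hypothetical.

```python
from functools import reduce


def fuse(operators):
    """Compose consecutively executed operators into one composite callable."""
    def composite(x):
        return reduce(lambda acc, op: op(acc), operators, x)
    return composite


def preallocate(requests, total_capacity, fragmentation_rate):
    """Assign each request a contiguous range [offset, offset + size).

    requests: list of (name, size) pairs, in allocation order.
    Assumption: fragmentation shrinks the usable contiguous region to
    total_capacity * (1 - fragmentation_rate).
    """
    usable = int(total_capacity * (1.0 - fragmentation_rate))
    offset = 0
    blocks = {}
    for name, size in requests:
        if offset + size > usable:
            raise MemoryError(f"not enough contiguous memory for {name}")
        blocks[name] = (offset, offset + size)
        offset += size
    return blocks


composite = fuse([lambda x: x + 1, lambda x: x * 2])
blocks = preallocate([("composite_op", 100), ("task_buffers", 200)],
                     total_capacity=1000, fragmentation_rate=0.5)
print(composite(3), blocks)
```

A model execution unit in the sense of the claim would then pair the composite operator with its reserved block, so that inference does not need to allocate memory on the critical path.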
5. The load-balancing-based power dispatch model inference acceleration method according to claim 4, characterized in that performing model pruning optimization in the model execution unit based on the business scenario tags and task priorities in the task status data comprises: determining the business type corresponding to the power dispatch task based on the business scenario tags in the task status data, and selecting, in the model execution unit, the network layer that is strongly related to the business type as the target network layer; determining the pruning rate corresponding to the target network layer based on the task priority and a preset pruning rate table; pruning the channels of the target network layer according to the pruning rate corresponding to the target network layer; and obtaining the pruned model execution unit based on the pruned target network layer and channels.
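The table-driven pruning of claim 5 can be sketched as below. The claim does not state the contents of the pruning rate table, how layers are matched to business types, or which channels are removed, so everything concrete here is an assumption: the table values, the scenario-to-layer mapping, and the use of the L1 norm as the channel importance criterion (a common heuristic, swapped in for illustration).

```python
# Hypothetical preset pruning rate table: task priority -> fraction of channels removed.
PRUNING_RATE_TABLE = {"high": 0.5, "medium": 0.25, "low": 0.0}

# Hypothetical mapping from business scenario tag to strongly related layers.
SCENARIO_LAYERS = {"fault_handling": ["layer3"], "qa": ["layer1", "layer2"]}


def prune_channels(channels, rate):
    """Drop the given fraction of channels with the smallest L1 norm.

    channels: list of channels, each a list of float weights.
    The surviving channels keep their original order.
    """
    n_prune = int(len(channels) * rate)
    norms = [sum(abs(w) for w in ch) for ch in channels]
    order = sorted(range(len(channels)), key=lambda i: norms[i])
    drop = set(order[:n_prune])
    return [ch for i, ch in enumerate(channels) if i not in drop]


def prune_model(model, scenario_tag, priority):
    """Prune only the target layers for this scenario; other layers are untouched."""
    rate = PRUNING_RATE_TABLE[priority]
    targets = set(SCENARIO_LAYERS.get(scenario_tag, []))
    return {name: prune_channels(chs, rate) if name in targets else chs
            for name, chs in model.items()}


model = {
    "layer1": [[1.0, 1.0], [0.5, 0.5]],
    "layer3": [[1.0, -1.0], [0.1, 0.0], [3.0, 0.5], [0.2, 0.2]],
}
print(prune_model(model, "fault_handling", "high"))
```

Here only `layer3` is pruned, and half of its channels (the two with the smallest L1 norm) are removed; whether higher priority should mean more or less pruning is a design decision the table encodes, and the direction used above is only one possibility.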
6. The load-balancing-based power dispatch model inference acceleration method according to claim 5, characterized in that the computing nodes performing inference on the power dispatch task based on the task distribution result and the pruned model execution unit to obtain task execution data comprises: after receiving the pruned model execution unit and the task distribution result, the computing node determines the corresponding inference mode based on the task priority of the power dispatch task; when the task priority is high priority, the composite operator and the allocated contiguous video memory block are invoked directly for inference to obtain first execution data; when the task priority is not high priority, tasks of the same scenario are grouped into batches according to the business scenario tags in the task status data, and it is determined whether the task queue length after grouping is greater than a preset batch processing threshold; if so, batch inference is performed based on the pruned target network layer and channels to obtain second execution data; otherwise, the composite operator and the contiguous video memory block are invoked for each task individually to perform inference in the pruned target network layer to obtain third execution data; and the first execution data, the second execution data, and the third execution data are taken as the task execution data.
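The three inference modes of claim 6 amount to a small dispatch rule, sketched below. Only the branching logic is shown; the actual inference calls are omitted. The field names, the mode labels `direct`, `batch`, and `single`, and the threshold value are hypothetical.

```python
from collections import defaultdict


def plan_inference(tasks, batch_threshold=2):
    """Decide the inference mode for each task.

    tasks: dicts with 'id', 'priority' and 'scenario'.
    Returns {'direct': [ids], 'batch': [[ids], ...], 'single': [ids]}.
    """
    plan = {"direct": [], "batch": [], "single": []}
    groups = defaultdict(list)
    for task in tasks:
        if task["priority"] == "high":
            # High priority: run immediately on the fused operator and reserved memory.
            plan["direct"].append(task["id"])
        else:
            # Otherwise group tasks by business scenario tag.
            groups[task["scenario"]].append(task["id"])
    for ids in groups.values():
        if len(ids) > batch_threshold:
            plan["batch"].append(ids)   # group is long enough for batch inference
        else:
            plan["single"].extend(ids)  # too few tasks: infer one by one
    return plan


tasks = [
    {"id": 1, "priority": "high", "scenario": "fault"},
    {"id": 2, "priority": "low", "scenario": "stats"},
    {"id": 3, "priority": "low", "scenario": "stats"},
    {"id": 4, "priority": "low", "scenario": "stats"},
    {"id": 5, "priority": "medium", "scenario": "qa"},
]
print(plan_inference(tasks))
```

In this example task 1 is inferred directly, tasks 2 to 4 form one batch, and task 5 is processed on its own, which corresponds to the first, second, and third execution data of the claim respectively.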
7. The load-balancing-based power dispatch model inference acceleration method according to claim 2, characterized in that monitoring the computing card in real time based on the updated hardware status data and the task execution data comprises: generating a Level 1 warning if the updated GPU utilization is not less than a preset first monitoring threshold and the updated video memory usage is not less than a preset second monitoring threshold; and switching the current computing node that generated the Level 1 warning to a restricted state, wherein, in the restricted state, the current computing node stops receiving new power dispatch tasks and migrates low-priority power dispatch tasks to other computing nodes until the GPU utilization is less than the preset first monitoring threshold and the video memory usage is less than the preset second monitoring threshold, at which point it switches back to the running state.
8. The load-balancing-based power dispatch model inference acceleration method according to claim 7, characterized in that it further comprises: generating a Level 2 warning if the number of computing nodes that have generated a Level 1 warning exceeds a preset node number threshold; and, when the Level 2 warning is generated, migrating the power dispatch tasks of all computing nodes that have switched to the restricted state to a standby node.
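The two-level warning logic of claims 7 and 8 behaves like a small state machine per node plus a cluster-wide count. The sketch below shows only the state transitions and the Level 2 condition; task migration itself is not modelled. The threshold values, the state names, and the functions `next_state` and `level2_warning` are hypothetical.

```python
def next_state(state, gpu_util, vram_usage, gpu_threshold=0.9, vram_threshold=0.9):
    """Return the node state after one monitoring sample.

    A running node becomes restricted when both metrics reach their thresholds
    (Level 1 warning); a restricted node resumes only when both fall below them.
    """
    over = gpu_util >= gpu_threshold and vram_usage >= vram_threshold
    recovered = gpu_util < gpu_threshold and vram_usage < vram_threshold
    if state == "running" and over:
        return "restricted"
    if state == "restricted" and recovered:
        return "running"
    return state


def level2_warning(states, node_count_threshold=1):
    """Level 2 warning: more restricted nodes than the preset node number threshold."""
    restricted = sum(1 for s in states.values() if s == "restricted")
    return restricted > node_count_threshold


states = {"n1": "running", "n2": "running", "n3": "running"}
samples = {"n1": (0.95, 0.92), "n2": (0.97, 0.93), "n3": (0.40, 0.35)}
states = {n: next_state(states[n], *samples[n]) for n in states}
print(states, level2_warning(states))
```

In this example n1 and n2 enter the restricted state and, with a node number threshold of 1, a Level 2 warning is raised, which under claim 8 would trigger migration of their tasks to a standby node.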
9. A load-balancing-based power dispatch model inference acceleration device, characterized in that it comprises: a data acquisition module, used to acquire the hardware status data of the computing card and the task status data of the power dispatch task at the current moment, wherein the task status data includes business scenario tags and task priorities; a load assessment module, used to perform load assessment based on the hardware status data and the task status data to obtain the task queue congestion level and the comprehensive load value of each computing node; a load balancing module, used to perform hierarchical load balancing based on the task queue congestion level, the comprehensive load value, the task status data, and the task priority corresponding to the power dispatch task, so as to obtain the task distribution result; a computing node pre-allocation module, used to perform operator fusion and memory pre-allocation on the computing node links according to the task distribution result and the hardware status data, and to generate model execution units adapted to the computing nodes; a task inference module, used to perform model pruning optimization in the model execution unit based on the business scenario tags and task priorities in the task status data, and to allocate the pruned model execution unit and the task distribution result to the corresponding computing nodes, so that the computing nodes perform inference on the power dispatch task based on the task distribution result and the pruned model execution unit to obtain task execution data and update the hardware status data after inference; and a computing card monitoring module, used to monitor the computing card in real time based on the updated hardware status data and the task execution data.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium includes a stored computer program, wherein, when the computer program is executed, it controls the device where the computer-readable storage medium is located to perform the load-balancing-based power dispatch model inference acceleration method according to any one of claims 1 to 8.