One-stop large model intelligent agent development and operation platform integrating computing power scheduling and model management

By deploying in-memory computing units within high-bandwidth memory and combining them with policy networks and adaptive conformal prediction modules, the problems of resource waste and memory bandwidth bottlenecks in large model inference clusters are solved, achieving efficient computing power scheduling and model management, and improving system stability and decision confidence.

CN122019199B · Active · Publication Date: 2026-06-26 · SHANDONG UNIVALSOFT JOINT- CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHANDONG UNIVALSOFT JOINT- CO LTD
Filing Date
2026-04-15
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing scheduling schemes for large model inference clusters are prone to resource waste when there are load fluctuations or uneven node performance, and memory bandwidth becomes the main bottleneck for inference latency. The stability of scheduling decisions and resource constraints are not adequately guaranteed.

Method used

In-memory computing units are deployed within the storage cores of high-bandwidth memory to perform near-data attention calculations and collect local computing power status indicators. A global computing power status table is generated through the computing power scheduling control node. Task allocation is performed in conjunction with the policy network and the adaptive conformal prediction module, establishing a closed-loop collaborative operation and maintenance mechanism at both the hardware and software levels.

Benefits of technology

It effectively alleviates the memory bandwidth bottleneck, improves the adaptive response capability of scheduling decisions and the long-term stability of resource constraints, reduces end-to-end latency, and enhances the confidence quantification capability of scheduling decisions.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122019199B_ABST
Patent Text Reader

Abstract

The application discloses a one-stop large model intelligent agent development and operation platform integrating computing power scheduling and model management, and relates to the technical field of artificial intelligence. The platform comprises: a heterogeneous computing power resource pool that contains multiple inference nodes and generates node-level computing power status summary records; a computing power scheduling control node that collects the node-level computing power status summary records of all inference nodes to obtain a global computing power status table, and takes the difference between a scheduling throughput component and a penalty component based on the queue depth values of resource-constrained virtual queues as an instant reward signal; and an adaptive conformal prediction confidence verification module that, according to whether a confidence interval radius exceeds a preset radius threshold, either issues the selected task allocation scheme or performs deterministic backoff scheduling based on the global computing power status table. The application can effectively relieve the memory bandwidth bottleneck in ultra-long context inference scenarios and improve the adaptive response capability of scheduling decisions to real-time load changes and environmental drift.

Description

Technical Field

[0001] This invention relates to the field of artificial intelligence technology, and in particular to an industrial internet platform, specifically a one-stop development and operation platform for large model intelligent agents that integrates computing power scheduling and model management.

Background Technology

[0002] With the continuous expansion of large-scale model inference services, how to achieve efficient resource scheduling and operation and maintenance management in heterogeneous computing power clusters has become a core issue of concern in the industry. Existing large-scale model inference cluster scheduling schemes can be broadly classified into three categories. The first category is rule-based static scheduling schemes, including methods such as round-robin scheduling, weighted round-robin, and consistent hashing. These schemes are simple to implement and have low overhead, but they cannot perceive the real-time load status of each inference node. In scenarios with load fluctuations or uneven node performance, this can easily lead to some nodes being overloaded while others are idle, resulting in resource waste. The second category is reinforcement learning-based dynamic scheduling schemes, which generate task allocation decisions based on the real-time status of the cluster through a policy network, possessing adaptive adjustment capabilities. However, existing reinforcement learning scheduling schemes typically take maximizing scheduling throughput as the sole optimization objective, lacking long-term stability guarantees for resource constraints such as memory usage and response latency. This causes the policy network to frequently trigger resource overruns in the pursuit of short-term throughput, resulting in insufficient system stability. The third category is threshold-based operation and maintenance monitoring schemes, which trigger scaling up or down or task migration operations when resource indicators exceed preset thresholds. Such solutions are passive response mechanisms, unable to predict the reliability of scheduling decisions in advance, nor can their risks be quantitatively assessed before the scheduling scheme is issued and executed. 
Furthermore, at the hardware level, existing large-model inference architectures generally rely on the main computing chip to centrally execute all attention calculations. Key-value caches need to be moved from high-bandwidth memory to the main computing chip to participate in computation. In scenarios with extremely long contexts, the transmission bandwidth of the memory interface becomes the main bottleneck for inference latency. Although in-memory computing technology has shown potential in academic research to offload computation to the memory side to reduce data movement, existing solutions have not yet formed a closed-loop linkage between the hardware-level computing power status indicators generated by the in-memory computing unit and the upper-level scheduling decision system. The real-time load awareness capability at the hardware level cannot be effectively utilized by the scheduling strategy at the software level. In summary, existing technologies still have shortcomings in terms of computing power awareness accuracy, long-term stability assurance of resource constraints, and confidence quantification of scheduling decisions.

Summary of the Invention

[0003] The purpose of this invention is to provide a one-stop development and operation platform for large model intelligent agents that integrates computing power scheduling and model management. It can effectively alleviate the memory bandwidth bottleneck in ultra-long context reasoning scenarios and improve the adaptive response capability of scheduling decisions to real-time load changes and environmental drift.

[0004] To address the aforementioned technical problems, this invention provides a one-stop development and operation platform for large-scale intelligent agents that integrates computing power scheduling and model management, comprising:

[0005] The heterogeneous computing power resource pool contains multiple inference nodes. Each inference node is equipped with a main computing chip and a high-bandwidth memory. The high-bandwidth memory has in-memory computing units deployed within its storage core. The in-memory computing units are used to perform near-data attention calculations for large model inference and periodically collect local computing power status indicators. The main computing chip of each inference node summarizes the local computing power status indicators of all in-memory computing units in the node and generates a node-level computing power status summary record.

[0006] The computing power scheduling and control node is used to collect the node-level computing power state summary records of all inference nodes to obtain a global computing power state table, maintain a resource-constrained virtual queue for each inference node, and deploy a scheduling agent containing a policy network, a value network, and a throughput prediction network. The policy network outputs a task allocation scheme based on a scheduling state vector composed of the global computing power state table, the attribute information of the tasks to be scheduled, and the queue depth value of the resource-constrained virtual queue. The instant reward signal is the difference between the scheduling throughput component and the penalty component of the queue depth value of the resource-constrained virtual queue.

[0007] The adaptive conformal prediction confidence verification module runs in the computing power scheduling control node. It maintains a calibration sample sliding buffer and performs adaptive conformal prediction based on the absolute difference between the total number of inference tasks actually completed in each scheduling cycle and the number of tasks predicted to be completed by the throughput prediction network. It obtains the confidence interval radius and selects a task allocation scheme or performs deterministic backoff scheduling based on the global computing power status table based on whether the confidence interval radius exceeds the preset radius threshold.

[0008] Furthermore, the in-memory computing unit includes a dot product array, an exponential approximation function unit based on a piecewise linear lookup table, and a set of local feature registers.

[0009] Furthermore, the near-data attention computation is executed in three stages. In the first stage, the dot product array of each in-memory computing unit reads the key vector slices residing in the local memory, performs element-wise multiplication of the query vector with each key vector, and accumulates the products stage by stage along the vector dimension to obtain a local attention score sequence. The maximum value of the local attention score sequence is taken as the local extreme value scalar and sent back to the main computing chip. The main computing chip takes the maximum of all local extreme value scalars as the global extreme value scalar and broadcasts it to all in-memory computing units. In the second stage, each in-memory computing unit subtracts the global extreme value scalar from each score in the local attention score sequence to obtain an offset score sequence, and uses the exponential approximation function unit to perform exponential mapping on the offset score sequence to obtain a local exponential value sequence. It then sums the local exponential value sequence to obtain a local exponential accumulation scalar, multiplies each exponential value in the local exponential value sequence with the value vector at the corresponding position in the local memory and accumulates the results to obtain a local weighted value vector, and sends the local exponential accumulation scalar and the local weighted value vector back to the main computing chip. In the third stage, the main computing chip accumulates all local weighted value vectors element-wise to obtain a global weighted value vector, sums all local exponential accumulation scalars to obtain a global exponential accumulation scalar, and divides each element of the global weighted value vector by the global exponential accumulation scalar to obtain the final output vector of the current attention head.
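The three stages can be illustrated with a small software model. This is only a sketch under stated assumptions: the function names are hypothetical, each list in `key_slices` and `value_slices` stands for the data held by one in-memory computing unit, and the exact `math.exp` stands in for the hardware exponential approximation function unit.

```python
import math

def local_scores(query, keys):
    """Stage 1 (per unit): dot product of the query with each resident key vector."""
    return [sum(q * k for q, k in zip(query, key)) for key in keys]

def local_partial(scores, values, global_max):
    """Stage 2 (per unit): offset by the global maximum, exponentiate, then
    accumulate the exponent sum and the exponent-weighted value vector."""
    # Assumption: exact exp is used here in place of the lookup-table unit.
    exps = [math.exp(s - global_max) for s in scores]
    exp_sum = sum(exps)
    dim = len(values[0])
    weighted = [sum(e * v[j] for e, v in zip(exps, values)) for j in range(dim)]
    return exp_sum, weighted

def near_data_attention(query, key_slices, value_slices):
    """Run the three stages over slices held by separate in-memory units."""
    # Stage 1: every unit computes its scores; the main chip takes the
    # maximum of the local maxima as the global extreme value scalar.
    all_scores = [local_scores(query, keys) for keys in key_slices]
    global_max = max(max(scores) for scores in all_scores)
    # Stage 2: every unit returns (local exponent sum, local weighted vector).
    partials = [local_partial(s, v, global_max)
                for s, v in zip(all_scores, value_slices)]
    # Stage 3: the main chip merges the partial results and normalizes.
    total = sum(p[0] for p in partials)
    dim = len(partials[0][1])
    merged = [sum(p[1][j] for p in partials) for j in range(dim)]
    return [m / total for m in merged]
```

With exact exponentials, the merged result equals ordinary softmax attention over the concatenated slices, while each unit only sends back one scalar maximum, one scalar sum, and one vector of the head dimension.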

[0010] Furthermore, the local computing power status indicators include three items. The first item is the storage row activation count, which is the cumulative number of times the in-memory computing unit triggers storage row activation operations in the current acquisition cycle. The second item is the computing channel occupancy time ratio, which is the ratio of the number of clock cycles in which each computing channel of the in-memory computing unit is in an active computing state in the current acquisition cycle to the total number of clock cycles in the acquisition cycle. The third item is the number of bytes transmitted through the interface, which is the total number of bytes of data sent by the in-memory computing unit to the main computing chip through the high-bandwidth memory interface in the current acquisition cycle. The node-level computing power status summary record is a record generated by taking the arithmetic mean of each of the above three indicator values over all in-memory computing units under the inference node.

[0011] Furthermore, the resource-constrained virtual queue includes a video memory usage virtual queue and a response latency virtual queue. The video memory usage virtual queue is updated as follows: the actual video memory usage of the inference node in the current scheduling period is divided by the platform's preset video memory capacity threshold to obtain the normalized video memory usage rate, and 1 is subtracted from the normalized video memory usage rate to obtain a difference. When the difference is greater than 0, the difference is added to the current queue depth value of the video memory usage virtual queue. When the difference is not greater than 0, the absolute value of the difference is deducted from the current queue depth value, and the result is clamped to a lower limit of 0. The response latency virtual queue is updated in the same way, with its normalized index being the response latency of the inference node in the current scheduling period divided by the platform's preset response latency upper limit value.
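The update rule for either virtual queue can be sketched in a few lines. The function name and argument names are hypothetical; `limit` stands for the preset video memory capacity threshold or the preset response latency upper limit.

```python
def update_virtual_queue(depth, measured, limit):
    """Return the new queue depth after one scheduling period."""
    diff = measured / limit - 1.0           # normalized usage minus 1
    if diff > 0:
        return depth + diff                 # constraint violated: depth grows
    return max(0.0, depth - abs(diff))      # constraint met: depth drains, floored at 0
```

For example, a node that uses 90 units against a threshold of 80 adds 0.125 to its queue depth, and a later period at 60 units drains the queue back to 0.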

[0012] Furthermore, the scheduling state vector is a fixed-dimensional vector, constructed as follows: the node-level computing power status summary records of all inference nodes in the global computing power status table are arranged in order of node number; the current task queue to be scheduled is arranged in descending order of context token number, and the context token number and concurrent request number of the first M tasks are taken. The empty positions with less than M tasks are filled with 0. Then, the current queue depth value of the resource constraint virtual queue of all inference nodes is concatenated, where M is the upper limit of the number of task slots preset by the platform.
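A minimal sketch of this construction follows. The data layout is an assumption made for illustration: node records are tuples of the three indicator means keyed by node number, tasks are `(context_tokens, concurrent_requests)` pairs, and each node carries two queue depths (video memory and response latency).

```python
def build_state_vector(node_records, pending_tasks, queue_depths, max_slots):
    """Concatenate node records, the top-M task attributes, and queue depths.

    node_records:  {node_id: (row_activations, channel_occupancy, interface_bytes)}
    pending_tasks: list of (context_tokens, concurrent_requests)
    queue_depths:  {node_id: (memory_queue_depth, latency_queue_depth)}
    max_slots:     M, the preset upper limit on task slots
    """
    node_ids = sorted(node_records)                  # order by node number
    state = []
    for nid in node_ids:
        state.extend(node_records[nid])
    # Tasks in descending order of context token count, first M only.
    top = sorted(pending_tasks, key=lambda t: t[0], reverse=True)[:max_slots]
    for tokens, requests in top:
        state.extend((tokens, requests))
    state.extend([0] * (2 * (max_slots - len(top))))  # zero-fill empty slots
    for nid in node_ids:
        state.extend(queue_depths[nid])
    return state
```

Under this layout the vector length is always `3 * nodes + 2 * M + 2 * nodes`, regardless of how many tasks are waiting, which is what keeps the policy network input at a fixed dimension.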

[0013] Furthermore, after receiving the scheduling state vector, the policy network outputs, for each task to be scheduled, a probability distribution over the inference nodes to which the task may be allocated. The computing power scheduling control node performs one random sampling for each task to be scheduled based on the probability distribution to determine the target inference node. After all tasks are sampled, a task allocation scheme is formed. The scheduling throughput component is the total number of inference tasks actually completed in the current scheduling cycle, and the penalty component is the sum of the current queue depth values of the resource-constrained virtual queues of all inference nodes multiplied by a fixed adjustment coefficient. The instant reward signal is the result of subtracting the penalty component from the scheduling throughput component. The scheduling agent performs gradient updates on the policy network and the value network according to the clipped update rule of the proximal policy optimization (PPO) algorithm.
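The sampling step and the reward computation can be sketched as follows. The function names are hypothetical, the policy network itself is not modeled, and `task_probs` simply stands for its output (one probability list per task, indexed by inference node).

```python
import random

def sample_allocation(task_probs, rng):
    """Draw one target node per task from that task's distribution."""
    return [rng.choices(range(len(probs)), weights=probs, k=1)[0]
            for probs in task_probs]

def instant_reward(completed_tasks, queue_depths, penalty_coeff):
    """Throughput component minus the coefficient-scaled sum of queue depths."""
    return completed_tasks - penalty_coeff * sum(queue_depths)
```

A caller might pass `random.Random(seed)` as `rng` to make the sampled allocation reproducible during testing. Here `queue_depths` is the flat list of all virtual queue depths across all nodes, and `penalty_coeff` is the fixed adjustment coefficient, whose value the text does not specify.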

[0014] Furthermore, the fixed capacity of the calibration sample sliding buffer is N. After each scheduling cycle is completed, the absolute difference between the total number of inference tasks actually completed and the number of tasks predicted by the throughput prediction network is written as the nonconformity value to the end of the calibration sample sliding buffer, and the oldest record is removed. The adaptive conformal prediction is executed as follows: the N nonconformity values in the calibration sample sliding buffer are assigned exponential decay coefficients according to time order, so that recent samples obtain larger coefficient values. All nonconformity values are arranged in ascending order, and the corresponding exponential decay coefficients are accumulated item by item in that order until the accumulated sum first reaches or exceeds the product of the preset confidence level and the sum of all exponential decay coefficients. The nonconformity value at which this occurs is taken as the conformal prediction critical value, and the conformal prediction critical value is used as the confidence interval radius of the current scheduling cycle.
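The weighted quantile described here can be sketched as below. The names are hypothetical, and the geometric form `decay ** age` for the exponential decay coefficients is an assumption; the text only requires that newer samples receive larger coefficients.

```python
from collections import deque

def confidence_radius(scores, confidence, decay):
    """Weighted quantile of the buffered values (oldest first in `scores`)."""
    n = len(scores)
    # Assumed weighting: newest sample gets weight 1, older ones decay geometrically.
    weights = [decay ** (n - 1 - i) for i in range(n)]
    threshold = confidence * sum(weights)
    running = 0.0
    for score, weight in sorted(zip(scores, weights)):   # ascending by value
        running += weight
        if running >= threshold:
            return score
    return max(scores)   # guard against floating-point rounding at the tail

def record_cycle(buffer, actual_completed, predicted_completed):
    """Append |actual - predicted|; a full deque drops its oldest entry."""
    buffer.append(abs(actual_completed - predicted_completed))
```

A buffer created as `deque(maxlen=N)` gives the fixed-capacity sliding behavior directly. The scheduler would then compare `confidence_radius(list(buffer), confidence, decay)` with the preset radius threshold to choose between the policy output and the deterministic backoff path.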

[0015] Furthermore, deterministic backoff scheduling is executed as follows: the average storage row activation count in the node-level computing power status summary record of each inference node is read from the global computing power status table as the current load of that inference node; the tasks to be scheduled are arranged in descending order of context token number; and each task is assigned in turn to the inference node with the smallest current load. After each assignment, the context token number of the assigned task is added to the current load of the receiving inference node. Once all tasks are assigned, the allocation is issued for execution.
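This greedy procedure can be sketched as follows. The function name and dictionary layout are hypothetical, and breaking load ties by the lower node number is an assumption, since the text does not say how ties are resolved.

```python
def backoff_schedule(node_loads, tasks):
    """Greedy longest-task-first assignment to the least-loaded node.

    node_loads: {node_id: mean storage row activation count}
    tasks:      {task_id: context_tokens}
    """
    loads = dict(node_loads)          # work on a copy of the status table
    assignment = {}
    ordered = sorted(tasks.items(), key=lambda kv: kv[1], reverse=True)
    for task_id, tokens in ordered:
        # Assumption: ties in load go to the lower node number.
        target = min(loads, key=lambda n: (loads[n], n))
        assignment[task_id] = target
        loads[target] += tokens       # token count is added to the node's load
    return assignment
```

Because the procedure uses no random sampling and no learned parameters, the same status table and task list always produce the same allocation.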

[0016] Furthermore, when the confidence interval radius exceeds the preset radius threshold, the computing power scheduling control node marks the current scheduling period as a low confidence period. Within the training batch collected by the current policy version of the scheduling agent, the state transition samples generated in the low confidence period are multiplied by an increased loss scaling factor when calculating the policy gradient, so that the policy network increases the policy adjustment range for high uncertainty state regions in the current batch gradient update.
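One way to picture this weighting is a per-sample scale applied to the clipped surrogate objective of proximal policy optimization. The sketch below makes several assumptions: the function names are hypothetical, and the clip range `eps=0.2` and the scale `low_conf_scale=2.0` are illustrative values, not figures from the text.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO clipped objective for one sample."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)

def batch_policy_loss(samples, low_conf_scale=2.0, eps=0.2):
    """Mean policy loss with low-confidence samples scaled up.

    samples: list of (probability_ratio, advantage, low_confidence_flag)
    """
    total = 0.0
    for ratio, advantage, low_confidence in samples:
        scale = low_conf_scale if low_confidence else 1.0
        total += -scale * clipped_surrogate(ratio, advantage, eps)
    return total / len(samples)
```

Samples flagged as coming from a low confidence period contribute a proportionally larger gradient, which matches the stated aim of larger policy adjustments in high-uncertainty state regions.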

[0017] The one-stop large-scale model intelligent agent development and operation platform integrating computing power scheduling and model management of the present invention has the following beneficial effects: By deploying in-memory computing units within the storage cores of high-bandwidth memory, the dot product operation and exponential mapping operation, which involve the largest amount of data in attention calculation, are moved to the vicinity of the physical residence location of the key-value cache. This allows most of the data to be consumed within the storage core, requiring only the return of intermediate results with dimensions far lower than the original key-value cache to the main computing chip. This significantly reduces the amount of data transmitted through the high-bandwidth memory interface, alleviates the bandwidth bottleneck in ultra-long context inference scenarios, and significantly reduces the end-to-end latency of a single attention calculation. Simultaneously, during the execution of attention calculation, the in-memory computing unit synchronously collects three computing power status indicators using hardware counters: the storage row activation count, the ratio of computation channel occupancy time, and the number of bytes transmitted through the interface. These indicators are aggregated into a node-level summary by the main computing chip of the inference node and then directly used by the scheduling agent. This allows scheduling decisions to be based on load information directly observed at the hardware level, resulting in higher perception accuracy and lower acquisition latency compared to schemes that rely solely on software-level statistical indicators. At the scheduling decision level, this invention maintains virtual queues for memory usage and response latency for each inference node. 
This transforms the accumulated resource constraint violations into a penalty component that adaptively increases over time in the immediate reward signal. This allows the policy network to implicitly meet the long-term stability requirements of multi-dimensional resource constraints while maximizing scheduling throughput, avoiding the problem of frequent resource overruns when pursuing short-term throughput gains. At the scheduling execution level, this invention introduces an adaptive conformal prediction confidence verification mechanism. Before task allocation, the confidence level of the current decision is independently quantified. When the confidence interval radius exceeds a threshold, it automatically switches to deterministic backoff scheduling based on in-memory computing hardware load indicators. This provides predictable fallback protection for the system under high uncertainty conditions. Simultaneously, low-confidence signals are fed back to the policy network's training process to drive continuous improvement, forming a three-layer closed-loop collaborative operation and maintenance mechanism of hardware awareness, dynamic scheduling, and confidence quantification.

Attached Figure Description

[0018] Figure 1 shows the 95th percentile response latency comparison curves and the memory over-limit rate comparison curves for the three scheduling methods provided in the embodiments of the present invention over 500 consecutive scheduling cycles; (a) is the response latency comparison curve; (b) is the memory over-limit rate comparison curve.

[0019] Figure 2 is a schematic diagram, provided in an embodiment of the present invention, showing the relative approximation error of the output value of the exponential approximation function unit with respect to the true exponential function value as a function of the input value.

[0020] Figure 3 is a schematic diagram illustrating the process of determining weighted quantiles in adaptive conformal prediction provided in an embodiment of the present invention.

[0021] Figure 4 is a schematic diagram showing how the confidence interval radius output by the adaptive conformal prediction method and by the fixed-weight conformal prediction method provided in the embodiments of the present invention evolves with the scheduling period.

Detailed Implementation

[0022] A one-stop development and operation platform for large-scale intelligent agents that integrates computing power scheduling and model management, including:

[0023] The heterogeneous computing power resource pool contains multiple inference nodes. Each inference node is equipped with a main computing chip and a high-bandwidth memory. The high-bandwidth memory has in-memory computing units deployed within its storage core. The in-memory computing units are used to perform near-data attention calculations for large model inference and periodically collect local computing power status indicators. The main computing chip of each inference node summarizes the local computing power status indicators of all in-memory computing units in the node and generates a node-level computing power status summary record.

[0024] The computing power scheduling and control node is used to collect the node-level computing power state summary records of all inference nodes to obtain a global computing power state table, maintain a resource-constrained virtual queue for each inference node, and deploy a scheduling agent containing a policy network, a value network, and a throughput prediction network. The policy network outputs a task allocation scheme based on a scheduling state vector composed of the global computing power state table, the attribute information of the tasks to be scheduled, and the queue depth value of the resource-constrained virtual queue. The instant reward signal is the difference between the scheduling throughput component and the penalty component of the queue depth value of the resource-constrained virtual queue.

[0025] The adaptive conformal prediction confidence verification module runs in the computing power scheduling control node. It maintains a calibration sample sliding buffer and performs adaptive conformal prediction based on the absolute difference between the total number of inference tasks actually completed in each scheduling cycle and the number of tasks predicted to be completed by the throughput prediction network. It obtains the confidence interval radius and selects a task allocation scheme or performs deterministic backoff scheduling based on the global computing power status table based on whether the confidence interval radius exceeds the preset radius threshold.

[0026] The following detailed description, in conjunction with specific implementation methods, provides a detailed explanation of the architectural composition of the heterogeneous computing power resource pool, the hardware structure of the in-memory computing unit, the phased execution process of near-data attention computing, and the layered collection process of computing power status indicators.

[0027] In a typical deployment scenario, a heterogeneous computing resource pool comprises multiple inference nodes, each connected to a computing scheduling and control node via a high-speed interconnect network. Each inference node is equipped with one main computing chip and at least one set of high-bandwidth memory. The main computing chip handles tasks such as fully connected layer computation, layer normalization computation, and cross-memory chip data merging during model inference. The high-bandwidth memory employs a multi-layer stacked packaging structure, consisting of several memory chips vertically stacked and interconnected with the bottom logic chip via through-silicon vias. In one implementation, one set of high-bandwidth memory contains eight memory chips, each internally divided into several independent storage partitions, each with its own row buffer and column decoding path. In actual deployment, the number of memory chips and the granularity of storage partitions can be adjusted according to the context length and model parameter scale of the inference task. For example, in a scenario with a context length of 128,000 tokens, a high-bandwidth memory containing 16 memory chips can be selected to accommodate larger key-value cache shards.

[0028] One in-memory compute unit is deployed within each memory chip. This in-memory compute unit is positioned adjacent to the row buffer of the memory array, allowing it to read data residing in the local memory via the shortest physical path. This avoids bandwidth bottlenecks and power consumption associated with data transmission through through-silicon vias (TSVs) to the main compute chip. This deployment location is chosen based on the data access characteristics of attention computation: during attention computation, the amount of data in the query vector is much smaller than that in the key and value vectors. Key-value caches can reach tens of gigabytes in ultra-long context scenarios. If all key-value caches were moved to the main compute chip for computation, the transmission bandwidth of the high-bandwidth memory interface would become the main bottleneck for inference latency. By placing the dot product operation near the physical location of the key-value cache, most data can be consumed within the memory chip, requiring only the return of intermediate results with dimensions far lower than the original key-value cache, thus reducing the amount of data transmitted through the interface by one to two orders of magnitude.

[0029] Each in-memory computation unit contains three functional components: a dot product array, an exponential approximation function unit based on a piecewise linear lookup table, and a set of local feature registers.

[0030] The dot product array consists of multiple multiply-accumulate units arranged in a pipelined manner. In one implementation, the dot product array contains 16 multiply-accumulate units, each multiplying one pair of floating-point values within a single clock cycle and adding the product to a local accumulator. The 16 multiply-accumulate units are arranged sequentially along the pipeline, allowing the dot product operation of a 128-dimensional vector pair to be completed within 8 clock cycles. The bit width of the dot product array and the number of multiply-accumulate units can be configured according to the attention head dimension of the model; for example, 8 multiply-accumulate units can be configured when the attention head dimension is 64, and 32 multiply-accumulate units can be configured when the attention head dimension is 256, keeping the number of cycles for the dot product operation of a single vector pair within a reasonable range. Regarding computational precision, the multiply-accumulate units support 16-bit floating-point input and 32-bit floating-point internal accumulation to maintain the numerical precision of the attention score while controlling hardware area.
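The cycle counts quoted above follow from a simple throughput model, sketched here. It is an idealization that assumes every multiply-accumulate unit handles one element pair per clock cycle and that pipeline fill latency is ignored; the function name is hypothetical.

```python
def dot_product_cycles(head_dim, mac_units):
    """Cycles to process one vector pair: ceil(head_dim / mac_units)."""
    return -(-head_dim // mac_units)   # integer ceiling division
```

Under this model all three configurations in the text (128 with 16 units, 64 with 8 units, 256 with 32 units) take 8 cycles per vector pair.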

[0031] The exponential approximation function unit performs exponential mapping operations locally within the in-memory computing unit, without needing to send intermediate attention scores back to the main computing chip for execution. Since implementing a complete transcendental function circuit at the edge of the memory array incurs excessive area overhead, the exponential approximation function unit employs a piecewise linear lookup table scheme for approximation. Specifically, the input interval of the exponential function is pre-divided into multiple equal-width sub-intervals. Within each sub-interval, a linear function approximates the true value of the exponential function. Each linear function is determined by a slope value and an intercept value. The slope and intercept values for all sub-intervals are pre-calculated and stored in the lookup table. During runtime, after receiving an input value, the exponential approximation function unit first reads the corresponding slope and intercept values from the lookup table based on the index of the sub-interval into which the input value falls. Then, it performs one multiplication and one addition to obtain the exponential approximation output value. In one implementation, the input interval is set to the range from -16 to 0 (the attention score, after the global extremum offset, is never greater than 0, so the input value always falls within this range) and divided into 64 equal-width sub-intervals, each with a width of 0.25. The lookup table stores 64 sets of slope and intercept values. With this configuration, the maximum relative approximation error of the exponential approximation function unit does not exceed 1.5%, meeting the accuracy requirements of attention calculation. 
In scenarios with higher accuracy requirements, the number of sub-intervals can be increased to 128 or 256 to reduce the approximation error; in scenarios more sensitive to area and power consumption, the number of sub-intervals can be reduced to 32, accepting a slightly higher approximation error. As an optional implementation, the exponential approximation function unit can also use a quadratic polynomial lookup table scheme instead of a piecewise linear lookup table scheme. In this case, each sub-interval stores one quadratic coefficient, one linear coefficient, and one constant term, performing two multiplications and two additions during runtime, achieving a lower approximation error with the same number of sub-intervals.
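The lookup table scheme can be modeled in software as below. This is a sketch, not the hardware design: the function names are hypothetical, and fitting each segment as the chord through its two endpoints is an assumption about how the slope and intercept values are pre-calculated. Inputs outside the table range are clamped to it.

```python
import math

def build_exp_table(lo=-16.0, hi=0.0, segments=64):
    """Precompute (slope, intercept) per equal-width sub-interval."""
    width = (hi - lo) / segments
    table = []
    for i in range(segments):
        x0 = lo + i * width
        x1 = x0 + width
        # Assumption: each line passes through exp at both segment endpoints.
        slope = (math.exp(x1) - math.exp(x0)) / width
        intercept = math.exp(x0) - slope * x0
        table.append((slope, intercept))
    return table, lo, width

def approx_exp(x, table, lo, width):
    """One table read, one multiplication, one addition."""
    hi = lo + width * len(table)
    x = min(max(x, lo), hi)                           # clamp into table range
    index = min(int((x - lo) / width), len(table) - 1)
    slope, intercept = table[index]
    return slope * x + intercept
```

Sweeping `approx_exp` over the interval from -16 to 0 and comparing against `math.exp` is a quick way to check a given segment count against an error budget such as the 1.5% bound cited above.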

[0032] refer to Figure 2 The horizontal axis represents the input value of the exponential approximation function unit, and the vertical axis represents the relative approximation error, expressed as a percentage. A 1.5% error baseline is also marked in the graph for reference. Three characteristics can be observed from the curve trend. The first characteristic is that the relative approximation error exhibits periodic sawtooth-like fluctuations along the input value direction, with each sawtooth corresponding to a sub-interval. This is an inherent characteristic of piecewise linear approximation: at the two ends of each sub-interval, i.e., at the nodes of the lookup table, the linear function intersects the true exponential function precisely, with zero error; near the midpoint of the sub-interval, the linear function deviates furthest from the convex arc of the exponential function, and the error reaches its peak within that sub-interval. The second characteristic is that the relative approximation error is larger in intervals where the input value is close to 0 and smaller in intervals where the input value is far from 0. This phenomenon is caused by the fact that the curvature of the exponential function is larger in intervals close to 0, meaning the function curve is more drastic, and the linear function's approximation deviation is greater within sub-intervals of equal width; while in intervals far from 0, the exponential function value approaches zero, the function curve is almost flat, and the accuracy of the linear approximation is naturally higher. The third characteristic is that increasing the number of sub-intervals significantly reduces the peak error: the peak error with a 32-sub-interval configuration is approximately 6% to 7%, the peak error with a 64-sub-interval configuration drops to approximately 1.2% to 1.5%, and the peak error with a 128-sub-interval configuration further decreases to approximately 0.3% to 0.4%. 
Taking the 64-sub-interval configuration as an example, its maximum relative approximation error across the entire domain does not exceed the 1.5% error baseline, meeting the accuracy requirements of attention calculation. When the number of sub-intervals increases from 64 to 128, the absolute reduction in error is significantly smaller than when increasing from 32 to 64, exhibiting a diminishing marginal effect. Therefore, in actual deployments, 64 sub-intervals are chosen as the default configuration, achieving a balance between lookup table storage overhead (64 sets of slope and intercept values, totaling 128 values) and approximation accuracy. For accuracy-sensitive applications, the configuration can be switched to 128 sub-intervals, while for applications sensitive to area and power consumption, the configuration can be downgraded to 32 sub-intervals.
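As a non-limiting illustration, the piecewise linear lookup table scheme can be sketched in software as follows. The input range of -22 to 0, the function names, and the sampling density are assumptions introduced here for illustration only; the text does not state the input range of the hardware unit.

```python
import math

def build_pwl_table(num_intervals, x_min=-22.0, x_max=0.0):
    """Build slope/intercept pairs for equal-width sub-intervals (assumed range)."""
    step = (x_max - x_min) / num_intervals
    slopes, intercepts = [], []
    for i in range(num_intervals):
        x0 = x_min + i * step
        y0, y1 = math.exp(x0), math.exp(x0 + step)
        slope = (y1 - y0) / step           # the line passes through both table nodes
        slopes.append(slope)
        intercepts.append(y0 - slope * x0)
    return slopes, intercepts, x_min, step

def pwl_exp(x, table):
    """Approximate exp(x) with one lookup, one multiplication and one addition."""
    slopes, intercepts, x_min, step = table
    index = min(max(int((x - x_min) / step), 0), len(slopes) - 1)
    return slopes[index] * x + intercepts[index]

def peak_relative_error(num_intervals, samples_per_interval=200):
    """Sample the whole assumed domain and return the largest relative error."""
    table = build_pwl_table(num_intervals)
    _, _, x_min, step = table
    peak = 0.0
    for k in range(num_intervals * samples_per_interval + 1):
        x = min(x_min + k * step / samples_per_interval, 0.0)
        exact = math.exp(x)
        peak = max(peak, abs(pwl_exp(x, table) - exact) / exact)
    return peak

# Peak relative error for the three configurations discussed in the text.
errors = {n: peak_relative_error(n) for n in (32, 64, 128)}
```

Each table entry stores one slope and one intercept, so a 64-entry table holds 128 values, matching the storage overhead stated above.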

[0033] The local feature registers are a set of dedicated registers, independent of the data path of the dot product array. They are used to cache the collected results of computing power status indicators without interfering with the attention calculation. The reading and writing of the local feature registers are driven by the hardware counter logic inside the in-memory computing unit, and do not occupy the instruction issue bandwidth of the main computing chip.

[0034] When performing attention computation in a large model inference task, the main computing chip of the inference node first divides the key vectors and value vectors corresponding to the current inference request into multiple slices according to their sequence positions, and writes them into the local storage of each in-memory computing unit. The slice division method is as follows: assuming the context of the current inference request contains $L$ token positions and the high-bandwidth memory of the inference node contains $M$ in-memory computing units, each in-memory computing unit holds in its local storage the key vectors and value vectors corresponding to $\lfloor L/M \rfloor$ contiguous token positions, where $L$ is the total number of tokens in the current context, $M$ is the total number of in-memory computing units, and $\lfloor \cdot \rfloor$ represents the floor operation. Each key vector and value vector has a dimension of $d$, where $d$ represents the head dimension of the current attention head. In one implementation, $L$ is 32000, $M$ is 8, each in-memory computing unit holds the key vectors and value vectors of 4000 token positions, and $d$ is 128.
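As a non-limiting illustration, the contiguous slice division can be sketched as follows. The function name and the handling of any remainder tokens (assigned here to the last unit) are assumptions introduced for illustration; the text only specifies the floor division.

```python
def slice_bounds(total_tokens, num_units):
    """Return one contiguous [start, end) token range per in-memory computing unit."""
    per_unit = total_tokens // num_units      # floor division, as in the text
    bounds = []
    for unit in range(num_units):
        start = unit * per_unit
        # Assumption: the last unit absorbs any remainder tokens.
        end = total_tokens if unit == num_units - 1 else start + per_unit
        bounds.append((start, end))
    return bounds

# Configuration from the described implementation: 32000 tokens, 8 units.
bounds = slice_bounds(32000, 8)
```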

[0035] Once the key-value slices are ready, the main computing chip broadcasts the query vector of the current inference request to all in-memory computing units via the high-bandwidth memory interface. The dimension of the query vector is also the head dimension $d$. Its data volume is much smaller than the total amount of key-value cache, so the bandwidth occupied on the interface by the broadcast is negligible.

[0036] Attention calculation is then divided into three stages, executed collaboratively between the in-memory computing units and the main computing chip. The fundamental reason for splitting attention calculation into three stages, rather than completing all operations at once within the in-memory computing unit, is that the normalization operation in the attention mechanism requires global information on the attention scores for all token positions. However, each in-memory computing unit can only access the attention scores within its local slice and cannot independently calculate the correct normalization result. If each in-memory computing unit performs normalization independently on its local slice, the normalization denominator for each slice is only a partial sum of the exponential values within that slice, not the global sum of the exponential values for all token positions. Mathematically, this is not equivalent to the output of the standard attention mechanism, leading to deviations in the calculation results. Therefore, attention calculation requires two global synchronizations across the in-memory computing units: the first synchronization is used to determine the global extremum to ensure the numerical stability of the exponential mapping; the second synchronization is used to aggregate the local exponential sums and local weighted vectors of each slice to complete global normalization. The specific execution process of the three stages is as follows.
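The equivalence argument can be written out as follows. The notation is introduced here for illustration only: $q$ is the query vector, $k_i$ and $v_i$ are the key and value vectors at token position $i$, $S_p$ is the set of token positions held by the $p$-th in-memory computing unit, and $m$ is the global extremum.

```latex
m = \max_{i} \, q \cdot k_i, \qquad
s_p = \sum_{i \in S_p} e^{\,q \cdot k_i - m}, \qquad
w_p = \sum_{i \in S_p} e^{\,q \cdot k_i - m} \, v_i

\text{standard attention:} \quad
o = \frac{\sum_{i} e^{\,q \cdot k_i - m} \, v_i}{\sum_{i} e^{\,q \cdot k_i - m}}
  = \frac{\sum_{p} w_p}{\sum_{p} s_p}

\text{per-slice normalization:} \quad
\tilde{o} = \sum_{p} \frac{w_p}{s_p} \;\neq\; o \quad \text{in general}
```

Only the pair $(s_p, w_p)$ needs to leave each unit, which is why one scalar and one vector per unit suffice for the second synchronization.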

[0037] The first stage is the local dot product calculation and global extremum determination stage. Each in-memory computing unit's dot product array sequentially reads the key vectors of each token position from local storage, performs element-wise multiplication between the query vector and each key vector, and accumulates the results along the vector dimension to obtain a scalar dot product value. This scalar dot product value is the original attention score between the query vector and the current key vector. After traversing all local token positions, the in-memory computing unit obtains a sequence of local attention scores, the length of which is equal to the number of locally residing token positions. Subsequently, the in-memory computing unit traverses the local attention score sequence, compares each score, and retains the maximum value as the local extremum scalar. Each in-memory computing unit sends its local extremum scalar back to the main computing chip. The main computing chip collects all the local extremum scalars sent back by the in-memory computing units, selects the maximum value as the global extremum scalar, and broadcasts the global extremum scalar back to all in-memory computing units via the high-bandwidth memory interface. The purpose of determining the global extremum scalar is to subtract it from each attention score when performing exponential mapping in the subsequent stage, so that the input values of the exponential mapping are all no greater than 0 and the output values of the exponential mapping are constrained to the range of 0 to 1, thereby avoiding numerical overflow caused by directly performing exponential mapping on large positive values.

[0038] The second stage is the local exponential mapping and local weighted accumulation stage. After receiving the global extremum scalar broadcast by the main computing chip, each in-memory computing unit subtracts the global extremum scalar from each score in the local attention score sequence to obtain an offset score sequence. Each value in the offset score sequence is not greater than 0. The in-memory computing unit uses the exponential approximation function unit to perform exponential mapping on each offset score in the offset score sequence one by one to obtain a local exponential value sequence. The in-memory computing unit sums all elements in the local exponential value sequence to obtain one local exponential accumulation scalar. Simultaneously with the summation, the in-memory computing unit uses each exponential value in the local exponential value sequence as a scaling factor for the value vector of the corresponding token position, multiplies each exponential value element-wise with the corresponding value vector in the local storage, and accumulates all scaled value vectors one by one to obtain a single local weighted vector whose dimension is the head dimension $d$. The in-memory computing unit sends both the local exponential accumulation scalar and the local weighted vector back to the main computing chip. During this stage, each in-memory computing unit sends back one scalar plus one vector of dimension $d$. With a head dimension of 128 and a 16-bit floating-point format, the amount of data returned is approximately 258 bytes, which is more than three orders of magnitude less than the amount of data required to transmit all key-value caches to the main computing chip for centralized computation.
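The data volume comparison can be checked with the following arithmetic. It assumes the implementation described earlier (4000 token positions per in-memory computing unit, head dimension 128) and further assumes that the key-value cache is also stored in a 16-bit format, which the text does not state explicitly.

```latex
\text{returned per unit:} \quad (1 + 128) \times 2\ \text{bytes} = 258\ \text{bytes}

\text{key-value slice per unit:} \quad 4000 \times 2 \times 128 \times 2\ \text{bytes} = 2{,}048{,}000\ \text{bytes}

\text{ratio:} \quad \frac{2{,}048{,}000}{258} \approx 7.9 \times 10^{3}
```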

[0039] The third stage is the global normalization stage. After receiving the local weighted vectors and local exponential accumulation scalars returned by all in-memory computing units, the main computing chip performs the following merging operation: the local weighted vectors are summed element-wise to obtain a single global weighted vector whose dimension is the head dimension $d$; the local exponential accumulation scalars are summed to obtain a global exponential accumulation scalar, which mathematically equals the sum of the exponential mappings at all token positions. Finally, each element of the global weighted vector is divided by the global exponential accumulation scalar to obtain the final output vector of the current attention head. The final output vector also has a dimension of $d$ and is passed to subsequent layers of the large model for processing.
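As a non-limiting illustration, the three stages can be sketched in software and compared with a centrally computed softmax attention. The sketch uses the exact exponential function rather than the lookup table approximation, and all function names, sizes, and random test data are assumptions introduced for illustration.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reference_attention(q, keys, values):
    """Standard softmax attention for one query, computed in one place."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def three_stage_attention(q, key_slices, value_slices):
    # Stage 1: local dot products and local maxima; the main chip takes
    # the maximum of the local maxima as the global extremum.
    local_scores = [[dot(q, k) for k in ks] for ks in key_slices]
    global_max = max(max(scores) for scores in local_scores)

    # Stage 2: each unit returns one scalar (local exponential sum)
    # and one local weighted vector.
    dim = len(value_slices[0][0])
    partials = []
    for scores, vs in zip(local_scores, value_slices):
        exps = [math.exp(s - global_max) for s in scores]
        weighted = [sum(e * v[j] for e, v in zip(exps, vs)) for j in range(dim)]
        partials.append((sum(exps), weighted))

    # Stage 3: the main chip merges the partial results and normalizes once.
    global_sum = sum(s for s, _ in partials)
    global_vec = [sum(w[j] for _, w in partials) for j in range(dim)]
    return [x / global_sum for x in global_vec]

# Small hypothetical configuration: 4 units, 6 token positions each, head dimension 8.
rng = random.Random(0)
dim, units, per_unit = 8, 4, 6
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(units * per_unit)]
values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(units * per_unit)]
query = [rng.gauss(0, 1) for _ in range(dim)]
key_slices = [keys[u * per_unit:(u + 1) * per_unit] for u in range(units)]
value_slices = [values[u * per_unit:(u + 1) * per_unit] for u in range(units)]

merged = three_stage_attention(query, key_slices, value_slices)
expected = reference_attention(query, keys, values)
```

The two results agree up to floating-point rounding, because the merged numerator and denominator are the same sums as in the centralized computation, only grouped by slice.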

[0040] It should be noted that the above three stages can be executed in parallel head by head in the multi-head attention mechanism: if the model contains multiple attention heads, the key-value caches of different attention heads can reside in different in-memory computing unit groups, and the three-stage attention calculation of each head can be carried out independently and in parallel on their respective in-memory computing unit groups. There is no data dependency between heads, thereby making full use of the parallel bandwidth of multiple storage chips in high-bandwidth memory.

[0041] While the attention calculation continues, each in-memory computing unit synchronously collects three local computing power status indicators at a fixed collection period and writes them to its local feature register. The length of the fixed collection period can be set according to the scheduling control granularity, and in one implementation, it is set to 10 milliseconds. The collection of computing power status indicators is completed by independent hardware counter logic within the in-memory computing unit, which does not occupy the computing resources of the dot product array and has no additional impact on the execution latency of the attention calculation.

[0042] The first local computing power status indicator is the storage row activation count, defined as the cumulative number of times the in-memory computing unit triggers a storage row activation operation within the current acquisition cycle. A storage row activation operation refers to the operation where the storage array reads data from a storage unit at a specified row address into the row buffer. Each time the dot product array reads a key vector or value vector, if the storage row containing the target data is different from the row address cached in the current row buffer, a row activation operation needs to be triggered. Therefore, the storage row activation count directly reflects the data access intensity of the current in-memory computing unit within the acquisition cycle: a higher activation count indicates that the dot product array is intensively accessing key-value data in multiple different storage rows, resulting in a heavier local computing load; a lower activation count indicates that the current in-memory computing unit has a lighter computational task or better data locality. In terms of hardware implementation, an edge counter is deployed at the output of the storage array's row decoder. Each time a row activation operation is triggered, the counter value is incremented by 1. At the end of the acquisition cycle, the current counter value is written to the local feature register, and the counter is cleared.

[0043] The second local computing power status indicator is the computation channel occupancy time ratio, defined as the ratio of the number of clock cycles during which each computation channel of the in-memory computing unit is in an active computing state within the current acquisition cycle to the total number of clock cycles in the acquisition cycle. A computation channel refers to the parallel computing path formed by the multiply-accumulate units in the dot product array. When a multiply-accumulate unit is performing a multiplication or accumulation operation, the corresponding computation channel is in an active computing state; when a multiply-accumulate unit is waiting for data to be loaded from the row buffer or is idle due to pipeline bubbles, the corresponding computation channel is in an inactive state. The computation channel occupancy time ratio reflects how efficiently the dot product array is utilized: a ratio close to 1 indicates that the array has almost no idle waiting time and computing resources are fully utilized; a low ratio indicates that the array frequently stalls while waiting for data. In terms of hardware implementation, a 1-bit active flag signal is set for each multiply-accumulate unit of the dot product array. The signal is sampled in each clock cycle and accumulated into the active cycle counter. At the end of the acquisition cycle, the value of the active cycle counter is divided by the product of the total number of clock cycles in the acquisition cycle and the number of multiply-accumulate units. The quotient is the computation channel occupancy time ratio. The ratio is then written to the local feature register and the counter is cleared.

[0044] The third local computing power status indicator is the number of bytes transmitted via the interface, defined as the total number of bytes of data sent by the in-memory computing unit to the main computing chip via the high-bandwidth memory interface during the current acquisition cycle. This indicator reflects the data output pressure from the in-memory computing unit to the main computing chip: a higher number of bytes transmitted indicates that a large number of intermediate results (such as local extreme value scalars, local exponential accumulation scalars, and local weighted vectors) are being transmitted back to the main computing chip via the through-silicon via (TSV) and bottom logic chip, resulting in significant bandwidth usage on the interface path. In terms of hardware implementation, a byte counter is deployed in the data output buffer between the in-memory computing unit and the TSV interface. Each time a data packet is output to the interface, the byte length of that data packet is accumulated into the byte counter. At the end of the acquisition cycle, the current value of the byte counter is written to the local feature register, and the byte counter is cleared.
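As a non-limiting illustration, the behavior of the three hardware counters can be modeled in software as follows. The class and method names are hypothetical, and the event callbacks stand in for hardware signals; this is a behavioral sketch, not a description of the circuit.

```python
from dataclasses import dataclass

@dataclass
class FeatureSnapshot:
    row_activations: int        # storage row activation count
    channel_occupancy: float    # computation channel occupancy time ratio
    interface_bytes: int        # bytes transmitted via the interface

class LocalCounters:
    """Behavioral model of the counters feeding the local feature registers."""

    def __init__(self, num_mac_units):
        self.num_mac_units = num_mac_units
        self._reset()

    def _reset(self):
        self.row_activations = 0
        self.active_cycles = 0
        self.total_cycles = 0
        self.interface_bytes = 0

    def on_row_activation(self):
        self.row_activations += 1

    def on_clock(self, active_flags):
        # One 1-bit active flag per multiply-accumulate unit, sampled every clock.
        self.active_cycles += sum(1 for flag in active_flags if flag)
        self.total_cycles += 1

    def on_packet(self, num_bytes):
        self.interface_bytes += num_bytes

    def end_of_cycle(self):
        """Latch the indicators into a snapshot and clear all counters."""
        denom = self.total_cycles * self.num_mac_units
        occupancy = self.active_cycles / denom if denom else 0.0
        snapshot = FeatureSnapshot(self.row_activations, occupancy, self.interface_bytes)
        self._reset()
        return snapshot

# Hypothetical acquisition cycle with 4 multiply-accumulate units and 2 clock cycles.
counters = LocalCounters(num_mac_units=4)
counters.on_row_activation()
counters.on_row_activation()
counters.on_clock([1, 1, 1, 1])
counters.on_clock([1, 1, 0, 0])
counters.on_packet(2)      # e.g. one 16-bit local extremum scalar
counters.on_packet(258)    # e.g. one scalar plus one 128-dimensional vector
snapshot = counters.end_of_cycle()
```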

[0045] At the end of each acquisition cycle, the main computing chip of each inference node reads the local feature registers of all in-memory computing units under its jurisdiction through the internal register access path. It then takes the arithmetic mean of the storage row activation counts, the arithmetic mean of the computing channel occupancy time ratios, and the arithmetic mean of the number of bytes transmitted via the interface, and encapsulates these three averages into a single node-level computing power status summary record. Using node-level average aggregation instead of unit-by-unit reporting reduces the number of data entries that the computing power scheduling control node needs to process without losing macro-level load trend information. In a cluster with 32 inference nodes, each containing 8 in-memory computing units, unit-by-unit reporting requires processing 256 records, while node-level aggregation only requires processing 32 records. This keeps the state vector dimension of the subsequent scheduling agent within a reasonable range, reducing the input complexity of the policy network. In scenarios with higher diagnostic requirements for load distribution uniformity, the standard deviation of each indicator among all in-memory computing units within the node can be additionally appended to the node-level computing power status summary record to provide the scheduling agent with information on the degree of load imbalance within the node.
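As a non-limiting illustration, the node-level aggregation can be sketched as follows. The function name, the dictionary keys, and the use of the population standard deviation for the optional spread fields are assumptions introduced for illustration.

```python
from statistics import fmean, pstdev

def summarize_node(node_id, unit_records, with_spread=False):
    """Aggregate per-unit indicators into one node-level summary record.

    unit_records: list of (row_activations, occupancy_ratio, interface_bytes),
    one tuple per in-memory computing unit of the node.
    """
    columns = list(zip(*unit_records))
    summary = {
        "node_id": node_id,
        "mean_row_activations": fmean(columns[0]),
        "mean_occupancy": fmean(columns[1]),
        "mean_interface_bytes": fmean(columns[2]),
    }
    if with_spread:
        # Optional: standard deviation of each indicator across the node's units.
        summary["std"] = tuple(pstdev(column) for column in columns)
    return summary

# Hypothetical node with two units.
record = summarize_node(7, [(100, 0.5, 1000), (300, 0.7, 3000)], with_spread=True)
```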

[0046] At the start of each scheduling cycle, the platform's computing power scheduling control node reads the node-level computing power status summary records of all inference nodes through the management network, and aggregates them to obtain a global computing power status table. Each record in the global computing power status table corresponds to one inference node and includes three average indicators and the corresponding inference node number. The length of the scheduling cycle is not shorter than the length of the acquisition cycle to ensure that at least one fully updated computing power status snapshot can be obtained in each scheduling cycle. In one implementation, the acquisition cycle is set to 10 milliseconds, the scheduling cycle is set to 100 milliseconds, and the global computing power status table is refreshed 10 times in each scheduling cycle. The computing power scheduling control node takes the result of the most recent refresh at the start of the scheduling cycle as the global computing power status table for the current scheduling cycle.

[0047] The following detailed explanations, in conjunction with specific implementation methods, cover the global computing power state aggregation process of the computing power scheduling control node, the maintenance and update mechanism of the resource-constrained virtual queue, the network structure and training process of the scheduling agent, the method for constructing the scheduling state vector, the synthesis logic of the instant reward signal, and the complete execution process and deterministic backoff scheduling strategy of the adaptive conformal prediction confidence test.

[0048] At the start of each scheduling cycle, the computing power scheduling control node initiates a read request to all inference nodes via the management network. Each inference node returns the node-level computing power status summary record generated at the end of the most recent acquisition cycle. After receiving the data returned by all inference nodes, the computing power scheduling control node arranges the data in ascending order of inference node number to form a global computing power status table. The global computing power status table is a two-dimensional table structure with the number of rows equal to the total number of inference nodes. Each row contains four fields: inference node number, average storage row activation count, average computation channel occupancy time ratio, and average number of bytes transmitted via the interface. In a cluster with 32 inference nodes, the global computing power status table has 32 rows, each with four fields, and the data volume is only a few hundred bytes, making the management network transmission overhead negligible. The global computing power status table is treated as a static snapshot within each scheduling cycle: it is not updated during the cycle, and data is collected again only at the start of the next scheduling cycle.
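As a non-limiting illustration, the table construction and its data volume can be sketched as follows. The binary row layout (one 32-bit unsigned node number followed by three 32-bit floats) and the sample values are assumptions introduced for illustration; the text does not specify a field encoding.

```python
import struct

# Assumed layout: uint32 node number + three float32 averages = 16 bytes per row.
ROW_FORMAT = "<Ifff"

def build_global_table(summaries):
    """Arrange node-level records in ascending order of inference node number."""
    return sorted(summaries, key=lambda row: row[0])

def pack_table(table):
    """Serialize the table under the assumed row layout."""
    return b"".join(struct.pack(ROW_FORMAT, *row) for row in table)

# Hypothetical records from 32 inference nodes, arriving out of order.
summaries = [(n, 1200.0 + n, 0.5, 4096.0) for n in reversed(range(32))]
table = build_global_table(summaries)
packed = pack_table(table)
```

Under this assumed layout the 32-row table occupies 512 bytes, consistent with the stated order of a few hundred bytes.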

[0049] The computing power scheduling control node maintains two virtual queues for each inference node: one for memory usage and one for response latency. These virtual queues are not physical queues that actually store data to be processed; instead, they are scalar values called queue depth values, used to measure the extent to which the corresponding resource metric has cumulatively exceeded the capacity threshold over several scheduling cycles. The reason for introducing the virtual queue mechanism is that directly writing resource constraints as hard penalties would cause the scheduling policy to oscillate violently near the constraint boundaries. The virtual queue depth value, however, has integral memory characteristics: as resources continuously exceed limits, the queue depth value monotonically increases, gradually amplifying the penalty; as resources fall below the threshold, the queue depth value gradually decays, gradually weakening the penalty. This progressive penalty mechanism provides the policy network with smooth gradient signals, enabling it to learn robust resource allocation strategies over multiple scheduling cycles, rather than overreacting to constraints in a single cycle.

[0050] The virtual memory usage queue is updated once at the end of each scheduling cycle. The update process is as follows: The computing power scheduling control node obtains the actual memory usage of the inference node in the current scheduling cycle, divides the actual memory usage by the platform's preset memory capacity threshold, and obtains the normalized memory usage rate. The memory capacity threshold is not the total physical memory capacity of the inference node, but a safety waterline set by the platform's operations and maintenance personnel based on operational experience. It is usually set to 80% to 90% of the total physical memory capacity to reserve buffer space for sudden memory demand. In one implementation, the total physical memory capacity of the inference node is 80 gigabytes, and the memory capacity threshold is set to 64 gigabytes. When the actual memory usage is 70 gigabytes, the normalized memory usage rate is 70 divided by 64, which is approximately equal to 1.09. Then, the difference between the normalized memory usage rate and 1 is taken. When the difference is greater than 0, it indicates that the memory usage in the current scheduling cycle exceeds the safe threshold, and the difference is added to the current queue depth value of the virtual queue for memory usage. When the difference is not greater than 0, it indicates that the memory usage is within a safe range, and the absolute value of the difference is subtracted from the current queue depth value, with the lower limit of the subtraction result truncated to 0, meaning the queue depth value will not be negative. Using the above numerical example, if the difference is 0.09, the current queue depth value increases from 0.15 in the previous cycle to 0.24; if the difference is -0.05, the current queue depth value decreases from 0.24 to 0.19. 
Through normalization, the change in queue depth value is always calculated in a dimensionless ratio space based on 1, avoiding the problem of numerical scale imbalance caused by differences in physical dimensions of different resource indicators in subsequent reward synthesis.
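As a non-limiting illustration, the queue depth update can be sketched as follows. The function and parameter names are assumptions introduced for illustration. Adding a positive difference and subtracting the absolute value of a non-positive difference are the same signed addition, so the two branches collapse into one expression with a lower bound of 0.

```python
def update_queue_depth(depth, observed, threshold):
    """Update one virtual queue depth value at the end of a scheduling cycle."""
    excess = observed / threshold - 1.0    # normalized usage minus 1
    return max(depth + excess, 0.0)        # the depth value is never negative

# 70 GB used against a 64 GB threshold: the exact ratio is 1.09375.
grown = update_queue_depth(0.15, observed=70.0, threshold=64.0)

# 60.8 GB used against a 64 GB threshold: the difference is -0.05.
decayed = update_queue_depth(0.24, observed=60.8, threshold=64.0)

# A large shortfall cannot push the depth below zero.
floored = update_queue_depth(0.02, observed=32.0, threshold=64.0)
```

Without the intermediate rounding of the ratio to 1.09, the first call yields 0.24375 rather than the 0.24 quoted in the numerical example.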

[0051] The response latency virtual queue is updated in the same way as the memory usage virtual queue, the only difference being the source of the normalization metric. The normalization metric for the response latency virtual queue is the 95th percentile response latency of the inference node within the current scheduling cycle divided by the platform's preset response latency upper limit. The reason for choosing the 95th percentile instead of the average as the latency metric is that in high-concurrency inference scenarios, the average response latency is not sensitive to a small number of extremely long requests and may mask the deterioration of tail latency; yet when the inference node is under high load, it is precisely the requests with the longest queuing time that best reflect the node's congestion level. The 95th percentile response latency captures "the worst experience for the vast majority of requests," triggering the growth of the virtual queue depth value more accurately and driving the policy network to promptly transfer new tasks to idle nodes. In one implementation, the platform presets a response latency upper limit of 200 milliseconds. If an inference node processes 500 inference requests within the current scheduling cycle, the latencies of the 500 requests are arranged in ascending order, and the value at the 475th position is the 95th percentile response latency. For more stringent service quality scenarios, the percentile threshold can be adjusted to the 99th percentile.
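As a non-limiting illustration, the percentile selection can be sketched with the nearest-rank convention, which reproduces the 475th position for 500 requests. The function name and the sample latencies are assumptions introduced for illustration; other percentile conventions with interpolation would give slightly different values.

```python
def nearest_rank_percentile(values, percent):
    """Return the value at rank ceil(n * percent / 100) in ascending order.

    percent is an integer such as 95 or 99; integer arithmetic avoids
    floating-point rounding in the rank computation.
    """
    ordered = sorted(values)
    rank = (len(ordered) * percent + 99) // 100    # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies of 1 ms to 500 ms, one per request.
latencies_ms = list(range(1, 501))
p95 = nearest_rank_percentile(latencies_ms, 95)
p99 = nearest_rank_percentile(latencies_ms, 99)

# Normalization against the 200 ms upper limit from the described implementation.
normalized_latency = p95 / 200.0
```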

[0052] The computing power scheduling control node deploys a scheduling agent. The agent's responsibility is to generate a task allocation scheme based on the current global computing power status and the information of tasks to be scheduled in each scheduling cycle, and to continuously improve the allocation strategy through continuous interaction with the environment. The scheduling agent internally contains three independent feedforward neural networks: a policy network, a value network, and a throughput prediction network. The three networks share the same input but each has independent network parameters.

[0053] Before explaining the network structure, we will first describe the method for constructing the scheduling state vector. The scheduling state vector is a numerical representation of the environmental state observed by the scheduling agent in each scheduling cycle. Its dimension remains fixed throughout all scheduling cycles to meet the requirement of consistent input dimension for feedforward neural networks. The construction process involves concatenating the following three numerical segments sequentially:

[0054] The first segment is the expanded value of the global computing power status table. The node-level computing power status summary records of all inference nodes in the global computing power status table are arranged in ascending order of node number, and each record is expanded into three values (average storage row activation count, average computation channel occupancy time ratio, and average number of bytes transmitted via the interface). These values are then concatenated sequentially to form the first segment. If the cluster contains $N$ inference nodes, then the length of the first segment is $3N$, where $N$ represents the total number of inference nodes.

[0055] The second segment contains the attribute information of the tasks to be scheduled. All tasks in the current queue of tasks to be scheduled are sorted in descending order of context token count, and the context token count and concurrent request count are extracted for each of the first $K$ tasks after sorting. The context token count reflects the key-value cache capacity required by a single task and plays a decisive role in the consumption of GPU memory resources; the concurrent request count reflects the request density from the same task source and directly affects the utilization of the computation channels and the interface transmission volume. The context token count and concurrent request count of each task are arranged in pairs, so the first $K$ tasks generate a total of $2K$ values. When the number of tasks in the queue is fewer than $K$, the empty positions are filled with 0 to ensure that the length of the second segment is always $2K$. $K$ is the maximum number of task slots preset by the platform and should not be less than the maximum number of tasks that the cluster may receive simultaneously within a single scheduling cycle. In one implementation, the number of inference nodes $N$ is 32 and $K$ is 64, while in practice the number of tasks to be scheduled in most scheduling cycles does not exceed 50. Sorting in descending order of context token count and placing large tasks first ensures that the task consuming the most GPU memory always occupies the most informative front position in the state vector; when the number of tasks is fewer than $K$, the zero padding appears only at the end of the sequence and does not obscure the feature information of critical tasks.

[0056] The third segment represents the queue depth values of the resource-constrained virtual queues. The current queue depth values of the memory usage virtual queue and the response latency virtual queue for all inference nodes are concatenated sequentially according to node number, totaling $2N$ values, where $N$ is the total number of inference nodes.

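[0057] The three numerical segments above, when concatenated sequentially, form a scheduling state vector of dimension $3N + 2K + 2N = 5N + 2K$, where $N$ is the total number of inference nodes and $K$ is the maximum number of task slots. In the implementation where $N$ is 32 and $K$ is 64, the dimension of the scheduling state vector is 288.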
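As a non-limiting illustration, the construction of the scheduling state vector can be sketched as follows. The function name, the tuple layouts, and the per-node interleaving of the two queue depth values in the third segment are assumptions introduced for illustration; the text only states that the values are concatenated according to node number.

```python
def build_state_vector(table, tasks, mem_depths, lat_depths, max_slots):
    """Concatenate the three segments into a fixed-length state vector.

    table:      rows of (node_id, mean_row_activations, mean_occupancy, mean_bytes)
    tasks:      (context_tokens, concurrent_requests) per task to be scheduled
    mem_depths: memory usage queue depth per node, in node-number order
    lat_depths: response latency queue depth per node, in node-number order
    """
    # Segment 1: three indicators per node, in ascending node-number order.
    seg1 = [x for row in sorted(table) for x in row[1:]]

    # Segment 2: largest-context tasks first, zero-padded to 2 * max_slots values.
    top = sorted(tasks, key=lambda t: t[0], reverse=True)[:max_slots]
    seg2 = [x for task in top for x in task]
    seg2 += [0] * (2 * max_slots - len(seg2))

    # Segment 3: two queue depth values per node (assumed interleaved per node).
    seg3 = [x for pair in zip(mem_depths, lat_depths) for x in pair]
    return seg1 + seg2 + seg3

# Hypothetical cycle: 32 nodes, 64 task slots, 50 tasks waiting.
table = [(n, 100.0 + n, 0.5, 2048.0) for n in range(32)]
tasks = [(1000 * (i + 1), 2) for i in range(50)]
state = build_state_vector(table, tasks, [0.0] * 32, [0.0] * 32, max_slots=64)
```

With 32 nodes and 64 slots the three segments contribute 96, 128, and 64 values, giving the dimension of 288 stated above.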

[0058] The policy network is structured as a 3-layer fully connected feedforward network. The input layer receives the scheduling state vector. The first hidden layer contains 256 neurons, and the second hidden layer contains 256 neurons. Rectified linear units are used as activation functions between hidden layers. The dimension of the output layer is determined by the decision space of task allocation: for each task to be scheduled, the policy network needs to output a probability distribution over the inference nodes to which the task may be allocated. Therefore, for each task, the $N$ output values, one per inference node, are converted into a probability distribution using a normalized exponential function. In actual implementation, the policy network outputs one probability distribution for each of the $K$ task slots. For empty task slots filled with 0, the computing power scheduling control node skips them without performing an allocation. For each non-empty task, the computing power scheduling control node performs one random sampling based on the probability distribution output by the policy network. The sampling result is the target inference node number for that task. After sampling all non-empty tasks, the task allocation scheme for the current scheduling cycle is formed. Random sampling, rather than directly taking the maximum probability, is used to maintain sufficient exploration capability during the training of the policy network: deterministic greedy choices tend to make the policy network converge quickly to a local optimum, while random sampling allows the policy network to try lower-probability but potentially better allocation combinations, thus discovering a globally better scheduling strategy over long-term training. In the actual operation phase after the policy network training converges, a temperature coefficient can be introduced into the normalized exponential function to reduce the random fluctuation of the allocation results. Setting the temperature coefficient to a small value between 0.1 and 0.5 makes the probability distribution more concentrated, improving the stability of the allocation scheme without completely eliminating exploration.
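As a non-limiting illustration, the temperature-scaled normalized exponential function and the per-slot sampling can be sketched as follows. The function names, the representation of raw network outputs as lists of per-node scores, and the sample values are assumptions introduced for illustration.

```python
import math
import random

def softmax(scores, temperature=1.0):
    """Normalized exponential function with a temperature coefficient."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)                              # subtract the maximum for stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def assign_tasks(slot_scores, slot_is_empty, temperature=1.0, rng=random):
    """Sample one target node per non-empty task slot; skip zero-padded slots."""
    allocation = {}
    for slot, (scores, empty) in enumerate(zip(slot_scores, slot_is_empty)):
        if empty:
            continue
        probs = softmax(scores, temperature)
        allocation[slot] = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return allocation

# Hypothetical raw outputs for two task slots over three inference nodes.
raw = [[2.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
training_probs = softmax(raw[0])                   # temperature 1 during training
deployed_probs = softmax(raw[0], temperature=0.2)  # sharper distribution in operation
allocation = assign_tasks(raw, [False, True], temperature=0.2, rng=random.Random(0))
```

A lower temperature concentrates probability on the highest-scoring node while still leaving nonzero probability for the others.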

[0059] The value network has a similar structure to the policy network: it is also a 3-layer fully connected feedforward network with 256 neurons in each hidden layer. The input is the same scheduling state vector, and the output is a single scalar value used to estimate the expected cumulative reward obtained by continuing to execute according to the current policy in the current state. In the proximal policy optimization algorithm, the value network provides a baseline for policy gradient calculation, reduces the variance of gradient estimation, and thus accelerates the convergence process of the policy network.

[0060] The throughput prediction network is a two-layer fully connected feedforward network. Its hidden layer contains 128 neurons, and the output layer contains one neuron. The activation function of the hidden layer is also a rectified linear unit; no activation function is applied to the output layer, so that the output can take arbitrary real values. The throughput prediction network receives the scheduling state vector as input and outputs a scalar value representing the prediction of the total number of inference tasks to be completed in the current scheduling cycle. The core difference between the throughput prediction network and the value network is that the value network estimates the expected cumulative return over multiple future scheduling cycles, including the discount factor for the decay of long-term returns, and its output value does not directly correspond to the observables of a single cycle. The throughput prediction network estimates a measurable physical metric for the current single scheduling cycle, i.e., the total number of inference tasks actually completed. The reason for separating the two instead of reusing the value network output is that the subsequent adaptive conformal prediction confidence test requires comparing the predicted value with the actual observation value within the same cycle to calculate the degree of inconsistency. This requires that the predicted value and the actual observation value strictly correspond in terms of physical semantics and time range, a requirement that the expected cumulative return of the value network does not meet. The throughput prediction network is trained using a mean squared error loss function.
The total number of inference tasks actually completed at the end of each scheduling cycle is used as the supervision label, and the mean squared error between this label and the predicted number of completed tasks output by the throughput prediction network at the beginning of that cycle is used as the loss value for gradient descent updates. The throughput prediction network, policy network, and value network synchronously perform parameter updates at the end of each scheduling cycle.

[0061] The immediate reward signal for the scheduling agent is composed of two components. The first is the scheduling throughput component, which is the total number of inference tasks actually completed in the current scheduling cycle, that is, the number of tasks that have gone through the entire process from receipt of the request to return of the inference result within the time window of the current scheduling cycle, excluding tasks that are still queued or still executing. The second is the penalty component, which is calculated by summing, over all inference nodes, the current queue depth of the memory usage virtual queue and the current queue depth of the response latency virtual queue, and then multiplying this sum of all virtual queue depths by a fixed adjustment coefficient. The adjustment coefficient is a positive real number that balances the goal of maximizing throughput against the constraint of resource stability. The larger its value, the more the policy network tends to prioritize satisfying the resource constraints and to keep the virtual queue depth from growing continuously during training, even at the cost of some throughput. The smaller its value, the more the policy network tends to maximize throughput and the higher its tolerance for resource overruns. In one implementation, the adjustment coefficient is set to 0.5. In practice, it can first be set to a small value while the trend of the virtual queue depth is observed during training: if the queue depth keeps rising and does not converge, the coefficient is gradually increased; if the queue depth stays close to 0 but throughput is significantly low, the coefficient is appropriately reduced. The immediate reward signal is the scheduling throughput component minus the penalty component. Because the virtual queue depth has an integral memory characteristic, the penalty component is amplified cycle by cycle when resources are continuously exceeded. Even if the exceedance in a single cycle is small, the cumulative effect significantly suppresses the immediate reward signal after several cycles, forcing the policy network to adjust the allocation scheme proactively. This mechanism essentially transforms long-term resource constraints into an adaptive penalty term in the immediate reward signal. Without explicitly solving a constrained optimization problem, the policy network implicitly learns a scheduling strategy that satisfies the resource constraints simply by maximizing the expected cumulative value of the immediate reward signal.
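The reward composition described above can be sketched in a few lines. This is a minimal illustration only: the function and parameter names are hypothetical, and the per-node queue depths are assumed to be supplied as plain lists.

```python
from typing import Sequence


def immediate_reward(completed_tasks: int,
                     memory_queue_depths: Sequence[float],
                     latency_queue_depths: Sequence[float],
                     adjustment_coefficient: float = 0.5) -> float:
    """Scheduling throughput component minus the virtual queue penalty component."""
    # Penalty: sum of all virtual queue depths over all inference nodes,
    # multiplied by the fixed adjustment coefficient (0.5 in one implementation).
    total_depth = sum(memory_queue_depths) + sum(latency_queue_depths)
    penalty = adjustment_coefficient * total_depth
    return completed_tasks - penalty


# Hypothetical cycle: three nodes, 120 tasks completed, total queue depth of 10.
reward = immediate_reward(120, [0.0, 4.0, 1.0], [2.0, 0.0, 3.0])
print(reward)  # 115.0
```

If the queue depths keep growing over consecutive cycles, the same throughput yields a steadily smaller reward, which is the amplification effect the paragraph describes.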

[0062] At the end of each scheduling cycle, the scheduling agent performs gradient updates on the policy network and the value network according to the clipped update rule of the proximal policy optimization (PPO) algorithm. The core idea of PPO is to limit the step size of each policy network update so that the policy does not deviate too far from the current policy after a single update, which could lead to performance collapse. Specifically, at the end of each scheduling cycle, the computing power scheduling control node calculates the probability ratio, that is, the probability that the policy network assigns to the current state-action pair under the new parameters divided by the probability it assigned under the old parameters. When the probability ratio falls outside the interval centered at 1 whose radius is the clipping range parameter, the probability ratio is truncated to the interval boundary. The product of the truncated probability ratio and the advantage function is used as the loss value of the policy network. The advantage function is defined as the immediate reward signal actually obtained in the current scheduling cycle, plus the discount factor multiplied by the value network's output for the state of the next scheduling cycle, minus the value network's output for the state of the current scheduling cycle. The discount factor is a real number between 0 and 1 that controls how quickly the influence of future returns on the current decision decays. In one implementation, the clipping range parameter is set to 0.2 and the discount factor is set to 0.99. The value network is updated with a mean squared error loss function: the target value is the immediate reward signal of the current scheduling cycle plus the discount factor multiplied by the value network's output for the state of the next scheduling cycle, and the mean squared error between this target value and the value network's output for the state of the current scheduling cycle is used as the loss value. Both the policy network and the value network perform one gradient update at the end of each scheduling cycle, using an adaptive moment estimation (Adam) optimizer. In one implementation, the learning rate is set to 0.0003.
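The quantities in this update can be illustrated with scalar arithmetic. The sketch below is not the patented implementation: the function names are hypothetical, the loss is negated so that minimizing it maximizes the surrogate, and the `min` between the unclipped and clipped terms follows the commonly published PPO objective, which is an assumption beyond the wording of the paragraph.

```python
def advantage(reward: float, value_now: float, value_next: float,
              discount: float = 0.99) -> float:
    """One-step advantage: reward + discount * V(next state) - V(current state)."""
    return reward + discount * value_next - value_now


def clipped_policy_loss(prob_new: float, prob_old: float, adv: float,
                        clip_range: float = 0.2) -> float:
    """Clipped surrogate loss for a single state-action pair."""
    ratio = prob_new / prob_old
    # Truncate the ratio to the interval [1 - clip_range, 1 + clip_range].
    clipped = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Assumption: take the more pessimistic surrogate, then negate for minimization.
    return -min(ratio * adv, clipped * adv)


def value_loss(reward: float, value_now: float, value_next: float,
               discount: float = 0.99) -> float:
    """Squared error between the one-step target and the current value estimate."""
    target = reward + discount * value_next
    return (target - value_now) ** 2


adv = advantage(reward=115.0, value_now=100.0, value_next=102.0)
print(adv, clipped_policy_loss(0.5, 0.25, adv), value_loss(115.0, 100.0, 102.0))
```

In a real training loop these terms would be averaged over the batch and differentiated by an automatic differentiation framework; the scalar form is only meant to make the arithmetic explicit.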

[0063] The following describes the complete execution process of the adaptive conformal prediction confidence test.

[0064] The computing power scheduling control node maintains a calibration sample sliding buffer with a fixed capacity of N. The buffer is a first-in-first-out circular storage structure that stores, in chronological order, the inconsistency values of the N most recently completed scheduling cycles, with each completed scheduling cycle corresponding to one inconsistency value. In one implementation, N is set to 200. After each scheduling cycle ends, its inconsistency value is generated and written to the end of the calibration sample sliding buffer, and the frontmost record in the buffer is removed, so that the buffer always contains the records of the most recent N scheduling cycles.

[0065] The inconsistency value for each scheduling cycle is generated as follows: the absolute difference between the total number of inference tasks actually completed in that scheduling cycle and the number of predicted completed tasks output by the throughput prediction network at the beginning of that scheduling cycle. The inconsistency value characterizes the prediction deviation of the throughput prediction network under a specific scheduling state: when the throughput prediction network has high prediction accuracy for a certain type of scheduling state, the inconsistency value approaches 0; when the scheduling environment experiences an unprecedented load pattern or unexpected situations such as hardware degradation of the inference node, the prediction deviation of the throughput prediction network increases, and the inconsistency value increases accordingly.
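A minimal sketch of the sliding buffer and the inconsistency value follows. The class and method names are hypothetical; `collections.deque` with `maxlen` is used here as one convenient way to obtain first-in-first-out behavior with a fixed capacity.

```python
from collections import deque


class CalibrationBuffer:
    """Fixed-capacity FIFO buffer of inconsistency values, oldest first."""

    def __init__(self, capacity: int = 200):
        # A deque with maxlen drops the frontmost record automatically when full.
        self.scores = deque(maxlen=capacity)

    def record_cycle(self, predicted_completed: float, actual_completed: int) -> float:
        # Inconsistency value: absolute difference between the actual and the
        # predicted number of completed tasks for the same scheduling cycle.
        score = abs(actual_completed - predicted_completed)
        self.scores.append(score)
        return score


# Small capacity chosen only to make the eviction visible.
buf = CalibrationBuffer(capacity=3)
for predicted, actual in [(100.0, 98), (100.0, 103), (95.0, 95), (90.0, 110)]:
    buf.record_cycle(predicted, actual)
print(list(buf.scores))  # [3.0, 0.0, 20.0]
```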

[0066] In the current scheduling cycle, after the policy network in step 2 has output the task allocation scheme but before the scheme is issued for execution, the computing power scheduling control node starts the adaptive conformal prediction confidence test to determine whether the current task allocation scheme is reliable and to decide the subsequent execution path accordingly.

[0067] Traditional conformal prediction methods use fixed-order quantiles on the calibration sample set as confidence interval boundaries. Their valid coverage relies on the exchangeability assumption between the calibration samples and the current sample, that is, the calibration samples and the current sample should come from the same distribution and be order-independent. However, in real-world scheduling systems, cluster load patterns change over time, and inference nodes may experience performance drift due to hardware aging or maintenance, so the exchangeability assumption is difficult to maintain in the long term. If a fixed quantile method with equal weights for all samples in the buffer is still used, older samples no longer reflect recent environmental characteristics, which leads to confidence intervals that are too wide or too narrow and to a loss of discriminative ability for the current environmental state. Therefore, this scheme introduces an exponential decay coefficient that assigns different importance to calibration samples from different time points, giving recent samples more influence in the quantile calculation while the influence of older samples gradually decreases over time. This approach allows the confidence interval radius to track the drift of the environment without assuming a fixed distribution: when the prediction deviation of recent scheduling cycles generally increases, the large inconsistency values with high recent weights push up the quantile, so that the confidence interval radius expands in time; conversely, when the recent prediction deviation decreases, the confidence interval radius narrows accordingly.

[0068] Referring to Figure 3, the horizontal axis represents the sequence position of the N inconsistency values in the calibration sample sliding buffer after they are sorted in ascending order, and the vertical axis represents the accumulated value obtained by adding up, along the ascending order, the exponential decay coefficients corresponding to each inconsistency value. In this embodiment, N is set to 200, the attenuation base is set to 0.995, and the preset confidence level is set to 0.9. The figure shows the curve of the cumulative exponential decay coefficient as the sequence position increases. This curve increases monotonically from its initial value at position 1 and finally reaches the sum of all exponential decay coefficients. The figure also plots the confidence level threshold, i.e., the product of the preset confidence level and the sum of all exponential decay coefficients, as a horizontal reference line. The intersection of the cumulative curve with the confidence level threshold is the critical sequence position. The inconsistency value at this position is taken as the conformal prediction critical value and is used directly as the confidence interval radius for the current scheduling cycle. The difference between the weighted quantile and the traditional equal-weight quantile can be understood intuitively from the figure. In the traditional equal-weight method, each sample contributes equally to the accumulation, so the critical position always falls at a fixed sequence position determined only by the preset confidence level and the number of samples. With exponential decay coefficients, the coefficients of recent samples are much larger than those of distant samples. Wherever a recently entered inconsistency value lands in the sorted sequence, its larger coefficient makes the accumulated value grow faster at that position. If several recent inconsistency values are large (for example, because drift in the scheduling environment has increased the prediction deviation of the throughput prediction network), these large values occupy later positions in the ascending order. Because their exponential decay coefficients are large, a large share of the total coefficient sum is concentrated at these later positions, so the accumulation curve reaches the confidence level threshold only at a later position. A later critical position corresponds to a larger conformal prediction critical value, which ultimately expands the confidence interval radius. This mechanism ensures that during environmental drift the confidence interval radius adaptively expands to reflect the true level of current prediction uncertainty, rather than being diluted by a large number of samples from a long stable period. The specific critical sequence position and the corresponding conformal prediction critical value are marked in the figure for reference and verification during implementation.

[0069] The specific execution process of adaptive conformal prediction is as follows. Let the inconsistency values of the $N$ calibration samples in the sliding buffer, arranged from furthest to nearest in time, be $s_1, s_2, \ldots, s_N$, where $s_1$ is the inconsistency value of the earliest record to enter the buffer and $s_N$ is the inconsistency value generated in the most recent scheduling cycle. Each inconsistency value $s_i$ is assigned an exponential decay coefficient $w_i$, calculated as $w_i = \lambda^{N-i}$, where $\lambda$ is the attenuation base, a real number greater than 0 and not exceeding 1, $i$ is the time position number of the inconsistency value in the buffer, and $N$ is the buffer capacity. The most recent sample, $s_N$, has the coefficient $w_N = \lambda^{0} = 1$, the largest among all samples; the furthest sample, $s_1$, has the coefficient $w_1 = \lambda^{N-1}$, which approaches 0 as $N$ increases. The value of $\lambda$ determines how quickly long-term samples decay. The closer $\lambda$ is to 1, the slower the decay and the longer older samples retain a large influence, which suits scenarios with relatively stable load patterns. The smaller $\lambda$ is, the faster the decay, and the confidence interval is determined mainly by samples from the few most recent cycles, which suits scenarios with frequently fluctuating load patterns. In one implementation, $\lambda$ is set to 0.995; when $N$ is 200, the coefficient of the furthest sample is approximately 0.37, which still retains some influence but is significantly lower than that of recent samples. In scenarios where load patterns change drastically, $\lambda$ can be adjusted to 0.98 or 0.95 to accelerate the forgetting of long-term samples.
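The effect of the attenuation base can be checked numerically. The sketch below assumes that the coefficient for time position i (1 for the oldest record, N for the most recent) is the attenuation base raised to the power N minus i, a form consistent with the value of roughly 0.37 quoted for the oldest sample; the function name is hypothetical.

```python
from typing import List


def decay_coefficients(n: int, base: float) -> List[float]:
    """Exponential decay coefficients for time positions 1..n (1 = oldest)."""
    if not 0.0 < base <= 1.0:
        raise ValueError("attenuation base must be greater than 0 and at most 1")
    # Assumed form: the most recent record gets base**0 = 1, the oldest base**(n-1).
    return [base ** (n - i) for i in range(1, n + 1)]


for base in (0.995, 0.98, 0.95):
    coeffs = decay_coefficients(200, base)
    # Oldest coefficient shrinks quickly as the base moves away from 1.
    print(base, round(coeffs[0], 5), coeffs[-1])
```

With a base of 0.995 the oldest of 200 records keeps a coefficient of about 0.37, whereas with 0.98 or 0.95 it is close to zero, so the quantile is driven almost entirely by recent cycles.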

[0070] After all exponential decay coefficients have been obtained, the $N$ inconsistency values are sorted in ascending order of magnitude to obtain the sorted sequence $s_{(1)} \le s_{(2)} \le \cdots \le s_{(N)}$, where $s_{(k)}$ denotes the inconsistency value at the $k$-th position after sorting; the subscript in parentheses indicates the order after sorting. Each $s_{(k)}$ keeps the exponential decay coefficient that was assigned according to its original time position before sorting; this coefficient is denoted $w_{(k)}$. The exponential decay coefficients are then accumulated one by one in ascending order: $w_{(1)}$, corresponding to $s_{(1)}$, is taken as the starting value, then $w_{(2)}$, corresponding to $s_{(2)}$, is added, then $w_{(3)}$, corresponding to $s_{(3)}$, and so on. Let $k^{*}$ be the first position at which the cumulative sum reaches or exceeds $q \cdot W$, that is, the smallest $k$ satisfying $\sum_{j=1}^{k} w_{(j)} \ge q \cdot W$, where $q$ is the preset confidence level, a real number greater than 0 and less than 1 (set to 0.9 in one implementation), and $W$ is the sum of all exponential decay coefficients. The inconsistency value $s_{(k^{*})}$ at this position is taken as the conformal prediction critical value and used as the confidence interval radius for the current scheduling cycle.
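The weighted quantile procedure can be sketched as follows. The function name and arguments are hypothetical, the coefficient form (attenuation base raised to the number of cycles since the record was written) is an assumption consistent with the description, and the scores are assumed to be ordered from oldest to newest.

```python
from typing import Sequence


def confidence_radius(scores: Sequence[float], base: float = 0.995,
                      confidence: float = 0.9) -> float:
    """Weighted quantile of inconsistency values (scores ordered oldest to newest)."""
    n = len(scores)
    if n == 0:
        raise ValueError("at least one calibration sample is required")
    # Assumed coefficient form: most recent record has coefficient 1.
    weights = [base ** (n - i) for i in range(1, n + 1)]
    threshold = confidence * sum(weights)
    running = 0.0
    # Walk through the scores in ascending order, accumulating each score's own weight.
    for score, weight in sorted(zip(scores, weights)):
        running += weight
        if running >= threshold:
            return score
    return max(scores)  # guard against floating point rounding


# Hypothetical buffer: 185 stable cycles followed by 15 cycles with large deviations.
scores = [2.0] * 185 + [30.0] * 15
print(confidence_radius(scores, base=0.995))  # adaptive weighting
print(confidence_radius(scores, base=1.0))    # equal weights for comparison
```

In this example the 15 recent large deviations hold more than 10% of the total coefficient sum under exponential decay, so the adaptive radius jumps to the large value, while the equal-weight quantile still reports the small value because only 7.5% of the samples are large.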

[0071] Referring to Figure 4, the horizontal axis represents the scheduling cycle number, and the vertical axis represents the confidence interval radius. The figure compares the adaptive conformal prediction method with the fixed-weight (equal-weight) conformal prediction method over 400 scheduling cycles, divided into a stable period of 200 cycles and a drift period of 200 cycles, to simulate the real-world scenario in which the scheduling environment switches from a stable load pattern to a fluctuating load pattern. The figure also marks the preset radius threshold, which is set to 15 in this embodiment. When the confidence interval radius of a scheduling cycle exceeds the radius threshold, the computing power scheduling control node abandons the task allocation scheme output by the policy network and instead executes deterministic backoff scheduling. During the stable period, the confidence interval radii of both methods remain low and follow similar trends, and neither reaches the radius threshold. This indicates that the prediction deviation of the throughput prediction network is small when the load pattern is stable, and that the calibration effect of the two methods does not differ significantly. When the scheduling environment enters the drift period around the 200th cycle, the behavior of the two methods diverges significantly. The confidence interval radius of the adaptive conformal prediction method rises rapidly after the drift occurs, climbing above the radius threshold within dozens of scheduling cycles and triggering deterministic backoff scheduling as a safety net. This rapid response stems from the weighted amplification effect of the exponential decay coefficient on recent samples: although the large inconsistency values generated early in the drift are only a minority in the buffer, their exponential decay coefficients are much larger than those of the earlier stable samples, so they rapidly raise the conformal prediction critical value in the weighted quantile calculation. The confidence interval radius of the fixed-weight conformal prediction method also rises, but with a significant lag. This is because, in the equal-weight method, samples newly entering the buffer during the drift period carry the same weight as the stable-period samples that still make up the majority of the buffer. The small inconsistency values of the many stable-period samples dilute the quantile calculation and thus delay the expansion of the confidence interval radius. This lag means that in the critical window of the early drift stage, the fixed-weight method still outputs a relatively small confidence interval radius, so the computing power scheduling control node misjudges the policy network's decision as reliable and continues to issue potentially unreasonable task allocation schemes, increasing the risk of resource overruns. The backoff scheduling trigger regions marked in the figure show that the adaptive method initiates deterministic backoff scheduling earlier and more frequently than the fixed-weight method, thus providing more timely safety guarantees in the early stages of environmental drift.

[0072] The physical meaning of the confidence interval radius is as follows: under the drift characteristics of the current load environment, the probability that the throughput prediction network's predicted value for the current scheduling cycle deviates from the actual value by more than the confidence interval radius does not exceed one minus the preset confidence level (10% when the preset confidence level is 0.9). The smaller the confidence interval radius, the more reliable the prediction, indicating that the recent prediction deviations of the throughput prediction network under similar load patterns are concentrated in a narrow range. Conversely, the larger the confidence interval radius, the more volatile the recent prediction deviations, indicating that the allocation decisions made by the policy network in the current state lack a reliable prediction basis.

[0073] The computing power scheduling control node then determines whether the confidence interval radius exceeds the preset radius threshold. The radius threshold should match the business's tolerance for scheduling jitter: if it is too small, backoff scheduling is triggered frequently and the adaptive advantage of reinforcement learning scheduling is lost; if it is too large, the policy network's allocation scheme is still executed when the prediction is obviously unreliable, increasing the risk of resource overruns. In one implementation, the radius threshold is set to 20% of the median of the numbers of completed tasks predicted by the throughput prediction network over the most recent 200 scheduling cycles; that is, if the median prediction is 100 tasks, the radius threshold is 20. It can also be set to a fixed constant, such as 15 or 25, depending on the specific business scenario.

[0074] When the confidence interval radius does not exceed the preset radius threshold, the computing power scheduling control node determines that the scheduling agent's decision confidence in the current state is sufficient, and directly distributes the task allocation scheme output by the policy network to each target inference node for execution. After receiving the task list assigned to itself, each target inference node arranges the tasks into the inference execution queue according to its local scheduling policy.

[0075] When the confidence interval radius exceeds the preset radius threshold, the computing power scheduling control node determines that the scheduling agent's decision confidence in the current state is insufficient, abandons the task allocation scheme output by the policy network, and instead executes deterministic backoff scheduling. A deterministic backoff strategy is adopted, rather than continued probabilistic sampling from the policy network, because in high-uncertainty states the policy network's probability distribution may deviate significantly from a reasonable allocation scheme, and continued sampling carries the risk of concentrating a large number of tasks on nodes that are already heavily loaded. Deterministic backoff scheduling uses hardware load indicators observed directly at the in-memory computing unit level as the allocation basis and does not rely on any neural network prediction, providing a predictable safety net in case the policy network fails.
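The threshold rule and the resulting execution path can be sketched as below. The function names are hypothetical; the 20% fraction and the comparison direction follow the text, and the list of recent predictions is assumed to be available from the control node.

```python
from statistics import median
from typing import Sequence


def radius_threshold(recent_predictions: Sequence[float], fraction: float = 0.2) -> float:
    """Radius threshold as a fraction of the median predicted task count."""
    return fraction * median(recent_predictions)


def use_policy_allocation(radius: float, threshold: float) -> bool:
    """True: issue the policy network's scheme; False: deterministic backoff."""
    return radius <= threshold


# Hypothetical predictions from recent cycles with a median of 100 tasks.
threshold = radius_threshold([90.0, 100.0, 110.0])
print(threshold)
print(use_policy_allocation(12.0, threshold), use_policy_allocation(27.0, threshold))
```

A fixed constant such as 15 or 25 can be passed as `threshold` directly when the business scenario calls for it.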

[0076] The execution process of deterministic backoff scheduling is as follows. The computing power scheduling control node reads, from the global computing power status table, the average storage row activation count in the node-level computing power status summary record of each inference node, and uses this average as the current load of that inference node. The average storage row activation count comes directly from the hardware counters of the in-memory computing units without any neural network processing, and it is the most primitive indicator of the current attention computing intensity of the inference node. The higher the average storage row activation count, the more intensively the inference node is performing attention computation and the smaller its margin for accepting new tasks. The tasks to be scheduled are sorted in descending order of context token count. Descending order is used because tasks with larger context token counts occupy more key-value cache space and consume more resources on the inference node; allocating large tasks first avoids the situation, late in the allocation process, in which all nodes are close to full load and cannot accommodate a large task. The tasks are taken from the sorted list one by one, and each is allocated to the inference node with the smallest current load. After each allocation, the context token count of the allocated task is added to the current load of the receiving inference node. The context token count is added to the load, instead of the load being updated from the storage row activation count, because the task has not yet started execution and the in-memory computing units have not yet generated new storage row activation operations. Therefore, the task's context token count can only serve as an approximate estimate of the additional load that the task will bring. The context token count is positively correlated with the number of storage row activations: the more context tokens, the more storage rows reside in the key-value cache shard, and the more row activation operations are triggered when the dot product array reads the key vectors. After all tasks have been allocated, they are distributed to their target inference nodes for execution.
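The allocation loop described above amounts to a greedy "largest task first, least loaded node" assignment. The following sketch uses hypothetical node and task identifiers and assumes the per-node average storage row activation counts have already been read from the global computing power status table.

```python
from typing import Dict


def deterministic_backoff(node_loads: Dict[str, float],
                          task_tokens: Dict[str, int]) -> Dict[str, str]:
    """Assign each task to the node with the smallest current load."""
    loads = dict(node_loads)  # work on a copy; the status table itself is not modified
    assignment: Dict[str, str] = {}
    # Largest context token count first.
    ordered = sorted(task_tokens.items(), key=lambda item: item[1], reverse=True)
    for task_id, tokens in ordered:
        target = min(loads, key=loads.get)
        assignment[task_id] = target
        # The task has not run yet, so its token count stands in for the added load.
        loads[target] += tokens
    return assignment


# Hypothetical loads (average row activation counts) and task sizes (context tokens).
plan = deterministic_backoff({"node-a": 1000.0, "node-b": 4000.0},
                             {"t1": 8000, "t2": 2000, "t3": 500})
print(plan)  # {'t1': 'node-a', 't2': 'node-b', 't3': 'node-b'}
```

Ties between equally loaded nodes are broken here by dictionary order, which is a choice of this sketch; the text does not specify a tie-breaking rule.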

[0077] When the confidence interval radius exceeds the preset radius threshold and deterministic backoff scheduling is triggered, the computing power scheduling control node also marks the current scheduling cycle as a low-confidence cycle. The role of this mark appears in the subsequent training updates of the policy network. The training batch that the scheduling agent collects under the current policy version contains state transition samples from several consecutive scheduling cycles (each sample contains a scheduling state vector, the action output by the policy network, an immediate reward signal, and the scheduling state vector of the next cycle), some from low-confidence cycles and some from normal-confidence cycles. For state transition samples generated in low-confidence cycles, the loss is multiplied by an increased loss scaling factor when the policy gradient is calculated, amplifying the contribution of these samples to the policy network parameter update. The rationale is that a low-confidence cycle means the policy network's allocation scheme in the corresponding state region did not pass the conformal prediction test based on the throughput prediction network, which indicates that the policy for this type of state is not yet mature and should receive more learning resources during training to accelerate its improvement. In one implementation, the increased loss scaling factor is set to 2 to 5 times that of normal samples. It should be emphasized that this loss scaling is performed only within the training batch that the scheduling agent collected with the current policy version and does not involve replaying state transition samples collected under previous policy versions, so that the on-policy consistency requirement of the proximal policy optimization algorithm is maintained. In an optional implementation, if the scheduling agent is replaced by an off-policy reinforcement learning algorithm that supports experience replay (such as the soft actor-critic algorithm), state transition samples generated during low-confidence cycles can be kept in the experience replay buffer for a long time with an increased sampling priority and sampled repeatedly in subsequent training rounds.
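The per-sample loss scaling can be illustrated as follows. The function name is hypothetical, the scale of 3.0 is one value inside the 2 to 5 range given in the text, and averaging over the batch size is an assumption of this sketch rather than something the text specifies.

```python
from typing import Sequence


def scaled_policy_loss(sample_losses: Sequence[float],
                       low_confidence_flags: Sequence[bool],
                       scale: float = 3.0) -> float:
    """Batch policy loss with low-confidence samples weighted more heavily."""
    if len(sample_losses) != len(low_confidence_flags):
        raise ValueError("one flag is required per sample")
    if not sample_losses:
        raise ValueError("the training batch is empty")
    total = 0.0
    for loss, is_low_confidence in zip(sample_losses, low_confidence_flags):
        # Normal samples keep a factor of 1; low-confidence samples are amplified.
        total += (scale if is_low_confidence else 1.0) * loss
    return total / len(sample_losses)  # assumption: average over the batch size


# Hypothetical batch of three samples, the second from a low-confidence cycle.
batch_loss = scaled_policy_loss([1.0, 2.0, 3.0], [False, True, False])
print(batch_loss)
```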

[0078] During the platform's cold start phase, i.e., the initial run before the calibration sample sliding buffer has been filled with N records, the adaptive conformal prediction confidence test is adjusted as follows. When the number of existing records in the buffer is less than the preset minimum number of calibration samples (set to 30 in one implementation), the computing power scheduling control node does not perform the conformal prediction test and always adopts deterministic backoff scheduling. When the number of records has reached the minimum number of calibration samples but is still less than N, the adaptive conformal prediction process described above is executed on all existing records in the current buffer, with the allocation method of the exponential decay coefficient unchanged; only the effective capacity of the buffer is temporarily less than N. As scheduling cycles progress, the buffer gradually fills, and once N records have been accumulated the system enters the normal sliding update mode.
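The three cold start regimes can be expressed as a small selector. The function name and the returned mode labels are hypothetical; the defaults of 200 and 30 are the values given for one implementation.

```python
def verification_mode(record_count: int, capacity: int = 200,
                      min_samples: int = 30) -> str:
    """Select how the confidence test behaves for the current buffer fill level."""
    if record_count < min_samples:
        # Too few calibration samples: skip the test, always use backoff scheduling.
        return "backoff_only"
    if record_count < capacity:
        # Run the adaptive test on the records available so far.
        return "conformal_partial"
    # Buffer is full: normal sliding update mode.
    return "conformal_sliding"


for count in (0, 29, 30, 199, 200):
    print(count, verification_mode(count))
```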

[0079] The entire process described above constitutes a three-layer closed-loop collaborative operation and maintenance mechanism: in-memory computing hardware layer computing power awareness, reinforcement learning dynamic scheduling, and adaptive conformal prediction confidence quantification. While performing attention calculations, the in-memory computing unit uses a hardware counter to perceive the computing power status in real time, and after hierarchical aggregation, provides environmental observations for the scheduling agent. The scheduling agent uses a policy network combined with a penalty mechanism of a resource-constrained virtual queue to generate a task allocation scheme that balances throughput and resource stability. Adaptive conformal prediction verification provides a statistical confidence judgment independent of the policy network before the allocation scheme is executed. In high-uncertainty states, it automatically switches to deterministic backoff scheduling as a fallback guarantee and feeds back low-confidence signals to the policy network during training to drive its continuous improvement.

[0080] Referring to Figure 1, in panel (a) the horizontal axis is the scheduling cycle and the vertical axis is the 95th percentile response latency in milliseconds. The figure also shows the platform's preset upper limit for response latency, which is set to 200 milliseconds in this embodiment. During normal load periods, the 95th percentile response latency of the method of this invention stays within the range of 100 to 140 milliseconds, far below the upper limit. When the cluster experiences a load surge between the 250th and 350th scheduling cycles, the 95th percentile response latency of the method of this invention rises to approximately 150 to 170 milliseconds, still below the 200-millisecond upper limit. This performance is attributed to the integral penalty mechanism of the Lyapunov virtual queue: when the 95th percentile response latency of some inference nodes approaches the upper limit, the queue depth of the response latency virtual queue begins to accumulate, the penalty component increases accordingly, and the immediate reward signal is suppressed. This drives the policy network to direct newly arrived inference tasks to inference nodes with more latency margin in subsequent scheduling cycles, keeping overall latency within an acceptable range during load surges. The method that uses only proximal policy optimization scheduling, without virtual queue constraints, shows a sharp rise in 95th percentile response latency to the range of 220 to 280 milliseconds during load surges, significantly exceeding the upper limit. This is because, without the long-term cumulative penalty signal of the virtual queue, the policy network relies solely on the throughput component of the single-cycle immediate reward to drive decisions, tends to concentrate tasks on nodes with good recent throughput, and ignores the queuing pressure already accumulated on those nodes. The round-robin scheduling baseline shows high latency across all scheduling cycles, surging to over 300 milliseconds during load surges, which reflects the complete inability of a deterministic round-robin strategy to adjust the allocation scheme according to the real-time load status of nodes. In panel (b), the horizontal axis is the scheduling cycle and the vertical axis is the memory overrun rate, expressed as a percentage. The memory overrun rate is defined as the proportion of cycles, within a sliding window of 20 scheduling cycles, in which the actual memory usage of the inference node exceeds the platform's preset memory capacity threshold. The method of this invention keeps the memory overrun rate below 5% during normal load phases; the rate rises to approximately 12% to 18% during load surges and then declines rapidly. The method using only proximal policy optimization scheduling reaches an overrun rate of 40% to 50% during surges, and the round-robin baseline reaches 55% to 65%. The memory overrun rate of the method of this invention is significantly lower than that of the other two methods, which verifies the constraining effect of the memory usage virtual queue on memory resources: when the normalized memory usage of an inference node continues to exceed 1.0, the queue depth of its memory usage virtual queue keeps increasing, and the amplification of the penalty component forces the policy network to rapidly reduce the probability of allocating new tasks to this node, so that the memory overrun state is relieved within several scheduling cycles.
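The memory overrun rate defined above is a sliding-window proportion and can be computed as in the sketch below. The function name is hypothetical, and evaluating the first cycles over a partially filled window is an assumption of this sketch.

```python
from collections import deque
from typing import Iterable, List


def overrun_rates(overrun_flags: Iterable[bool], window: int = 20) -> List[float]:
    """Per-cycle memory overrun rate (%) over a sliding window of cycles."""
    recent = deque(maxlen=window)
    rates: List[float] = []
    for exceeded in overrun_flags:
        recent.append(bool(exceeded))
        # Assumption: before the window is full, use the cycles seen so far.
        rates.append(100.0 * sum(recent) / len(recent))
    return rates


# Hypothetical trace: 20 normal cycles followed by 5 cycles over the memory threshold.
rates = overrun_rates([False] * 20 + [True] * 5)
print(rates[19], rates[-1])  # 0.0 25.0
```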

[0081] The above description of the embodiments is only for the purpose of helping to understand the method and core ideas of the present invention. It should be noted that those skilled in the art can make several improvements and modifications to the present invention, and these improvements and modifications also fall within the protection scope of the claims of the present invention.

Claims

1. A one-stop large-scale model intelligent agent development and operation and maintenance platform integrating computing power scheduling and model management, characterized by comprising:
a heterogeneous computing power resource pool containing multiple inference nodes, wherein each inference node is equipped with a main computing chip and a high-bandwidth memory, and in-memory computing units are deployed within the storage core of the high-bandwidth memory; the in-memory computing unit is used to perform near-data attention calculations for large model inference and to periodically collect local computing power status indicators; the local computing power status indicators include three items: the first item is the storage row activation count value, the second item is the computing channel occupancy time ratio, and the third item is the number of bytes transmitted through the interface; the main computing chip of each inference node summarizes the local computing power status indicators of all in-memory computing units of its node and generates a node-level computing power status summary record, which is obtained by taking the arithmetic mean of each of the above three indicator values over all in-memory computing units under the inference node;
a computing power scheduling control node, used to collect the node-level computing power status summary records of all inference nodes to obtain a global computing power status table; the computing power scheduling control node maintains a resource-constrained virtual queue for each inference node, the resource-constrained virtual queue including a memory usage virtual queue and a response latency virtual queue, and deploys a scheduling agent containing a policy network, a value network, and a throughput prediction network; the policy network outputs a task allocation scheme based on a scheduling state vector composed of the global computing power status table, the attribute information of the tasks to be scheduled, and the queue depth values of the resource-constrained virtual queues; the instantaneous reward signal is the difference between a scheduling throughput component and a penalty component on the queue depth values of the resource-constrained virtual queues, wherein the scheduling throughput component is the total number of inference tasks actually completed in the current scheduling cycle, and the penalty component is the sum of the current queue depth values of the resource-constrained virtual queues of all inference nodes multiplied by a fixed adjustment coefficient;
an adaptive conformal prediction confidence verification module, which runs in the computing power scheduling control node, maintains a calibration sample sliding buffer, and performs adaptive conformal prediction based on the absolute difference between the total number of inference tasks actually completed in each scheduling cycle and the number of tasks predicted to be completed by the throughput prediction network, so as to obtain a confidence interval radius; depending on whether the confidence interval radius exceeds a preset radius threshold, the module either adopts the task allocation scheme or performs deterministic backoff scheduling based on the global computing power status table; the calibration sample sliding buffer has a fixed capacity of N; after each scheduling cycle is completed, the absolute difference between the total number of inference tasks actually completed and the number of tasks predicted by the throughput prediction network is written, as an inconsistency value, to the end of the calibration sample sliding buffer, and the record at the front of the buffer is removed;
the adaptive conformal prediction is executed as follows: the N inconsistency values in the calibration sample sliding buffer are assigned exponential decay coefficients in time order, so that more recent samples receive larger coefficient values; all inconsistency values are arranged in ascending order, and the corresponding exponential decay coefficients are accumulated item by item in that order until the accumulated sum first reaches or exceeds the product of the preset confidence level and the sum of all exponential decay coefficients; the inconsistency value at which this occurs is taken as the conformal prediction critical value, and the conformal prediction critical value is used as the confidence interval radius of the current scheduling cycle.
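By way of non-limiting illustration, the weighted-quantile procedure recited in claim 1 can be sketched in Python as follows. The class name `ConformalRadius`, the parameter names, and the specific coefficient form `decay ** age` are assumptions made for this example; the claim only requires exponential decay coefficients that are larger for more recent samples.

```python
from collections import deque


class ConformalRadius:
    """Sliding buffer of inconsistency values with a weighted-quantile radius."""

    def __init__(self, capacity, confidence=0.9, decay=0.95):
        self.buffer = deque(maxlen=capacity)  # oldest record is dropped automatically
        self.confidence = confidence          # preset confidence level
        self.decay = decay                    # hypothetical decay base, 0 < decay <= 1

    def record(self, completed, predicted):
        """Store |actual completed tasks - predicted completed tasks| for one cycle."""
        self.buffer.append(abs(completed - predicted))

    def radius(self):
        """Return the conformal prediction critical value for the current buffer."""
        n = len(self.buffer)
        if n == 0:
            raise ValueError("no calibration samples recorded yet")
        # Assumed coefficient form: newest sample has age 0 and weight 1.
        weights = [self.decay ** (n - 1 - i) for i in range(n)]
        target = self.confidence * sum(weights)
        cumulative = 0.0
        ranked = sorted(zip(self.buffer, weights))  # ascending inconsistency value
        for score, weight in ranked:
            cumulative += weight
            if cumulative >= target:
                return score
        return ranked[-1][0]  # guard against floating-point shortfall
```

A caller would invoke `record` once per completed scheduling cycle and compare `radius()` with the preset radius threshold before dispatching tasks.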

2. The one-stop large-scale model intelligent agent development and operation and maintenance platform integrating computing power scheduling and model management according to claim 1, characterized in that the in-memory computing unit includes a dot product array, an exponential approximation function unit based on a piecewise linear lookup table, and a set of local feature registers.

3. The one-stop large-scale model intelligent agent development and operation and maintenance platform integrating computing power scheduling and model management according to claim 2, characterized in that the near-data attention calculation is performed in three stages: in the first stage, the dot product array of each in-memory computing unit reads the key vector slices residing in local storage, multiplies the query vector element-wise with each key vector, and accumulates the products stage by stage along the vector dimension to obtain a local attention score sequence; the maximum value of the local attention score sequence is taken as the local extreme value scalar and sent back to the main computing chip; the main computing chip takes the maximum of all local extreme value scalars as the global extreme value scalar and broadcasts it to all in-memory computing units; in the second stage, each in-memory computing unit subtracts the global extreme value scalar from each score in the local attention score sequence to obtain an offset score sequence; the exponential approximation function unit performs exponential mapping on the offset score sequence to obtain a local exponential value sequence; the in-memory computing unit sums the local exponential value sequence to obtain a local exponential accumulation scalar, multiplies the value vector at the corresponding position in local storage by each exponential value in the local exponential value sequence and accumulates the products to obtain a local weighted value vector, and sends the local exponential accumulation scalar and the local weighted value vector back to the main computing chip; in the third stage, the main computing chip accumulates all local weighted value vectors element-wise to obtain a global weighted value vector, sums all local exponential accumulation scalars to obtain a global exponential accumulation scalar, and divides each element of the global weighted value vector by the global exponential accumulation scalar to obtain the final output vector of the current attention head.
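By way of non-limiting illustration, the three stages of claim 3 can be simulated in software as in the following Python sketch. All function names are hypothetical. Two simplifications are assumed: `math.exp` stands in for the piecewise linear lookup-table approximation of claim 2, and each in-memory computing unit is modeled as a `(keys, values)` pair held in ordinary lists.

```python
import math


def local_scores(query, keys):
    """Stage 1 (per unit): dot product of the query with each local key vector."""
    return [sum(q * k for q, k in zip(query, key)) for key in keys]


def local_partials(scores, values, global_max):
    """Stage 2 (per unit): offset scores, exponentiate, and accumulate locally."""
    # math.exp is a stand-in for the piecewise linear lookup table (assumption).
    exps = [math.exp(s - global_max) for s in scores]
    exp_sum = sum(exps)  # local exponential accumulation scalar
    dim = len(values[0])
    weighted = [sum(e * v[d] for e, v in zip(exps, values)) for d in range(dim)]
    return exp_sum, weighted  # both are sent back to the main computing chip


def sharded_attention(query, shards):
    """Stages 1 to 3 for one attention head; shards is a list of (keys, values)."""
    all_scores = [local_scores(query, keys) for keys, _ in shards]
    # Main chip: maximum of the local extreme value scalars, then broadcast.
    global_max = max(max(scores) for scores in all_scores)
    partials = [
        local_partials(scores, values, global_max)
        for scores, (_, values) in zip(all_scores, shards)
    ]
    # Stage 3 (main chip): combine partial sums and normalize.
    total_exp = sum(exp_sum for exp_sum, _ in partials)
    dim = len(partials[0][1])
    total_weighted = [sum(w[d] for _, w in partials) for d in range(dim)]
    return [x / total_exp for x in total_weighted]
```

Because every unit subtracts the same global extreme value scalar, the combined result equals an ordinary softmax-weighted sum over all keys, while only one scalar and one vector per unit cross the memory interface in the second stage.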

4. The one-stop large-scale model intelligent agent development and operation and maintenance platform integrating computing power scheduling and model management according to claim 1, characterized in that the first item, the storage row activation count value, is the cumulative number of times the in-memory computing unit triggers storage row activation operations in the current acquisition cycle; the second item, the computing channel occupancy time ratio, is the ratio of the number of clock cycles in which each computing channel of the in-memory computing unit is in an active computing state in the current acquisition cycle to the total number of clock cycles in the acquisition cycle; and the third item, the number of bytes transmitted through the interface, is the total number of bytes of data sent by the in-memory computing unit to the main computing chip via the high-bandwidth memory interface in the current acquisition cycle.
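By way of non-limiting illustration, the three indicators of claim 4 and the arithmetic-mean aggregation of claim 1 can be sketched in Python as follows. The names `UnitMetrics` and `node_summary` are hypothetical, and the sketch assumes that the active clock cycles of a unit's computing channels have already been reduced to a single count per acquisition cycle.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class UnitMetrics:
    """Indicators collected by one in-memory computing unit in one acquisition cycle."""
    row_activations: int   # storage row activation count value
    active_cycles: int     # clock cycles spent in the active computing state
    total_cycles: int      # total clock cycles in the acquisition cycle
    interface_bytes: int   # bytes sent to the main computing chip

    @property
    def occupancy_ratio(self):
        """Computing channel occupancy time ratio."""
        return self.active_cycles / self.total_cycles


def node_summary(units):
    """Node-level summary record: arithmetic mean of each indicator over all units."""
    return (
        fmean(u.row_activations for u in units),
        fmean(u.occupancy_ratio for u in units),
        fmean(u.interface_bytes for u in units),
    )
```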

5. The one-stop large-scale model intelligent agent development and operation and maintenance platform integrating computing power scheduling and model management according to claim 1, characterized in that the memory usage virtual queue is updated as follows: the actual memory usage of the inference node in the current scheduling cycle is divided by the platform's preset memory capacity threshold to obtain the normalized memory usage rate, and the difference between the normalized memory usage rate and 1 is taken; when the difference is greater than 0, the difference is added to the current queue depth value of the memory usage virtual queue; when the difference is not greater than 0, the absolute value of the difference is deducted from the current queue depth value, and the result is floored at 0; the response latency virtual queue is updated in the same way, its normalized index being the response latency of the inference node in the current scheduling cycle divided by the platform's preset response latency upper limit value.
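By way of non-limiting illustration, the queue update of claim 5 and the instantaneous reward signal of claim 1 can be sketched in Python as follows. The function and parameter names are hypothetical. The two cases of the update rule collapse into a single expression, since adding a positive difference and deducting the absolute value of a non-positive difference are both an addition of the signed difference.

```python
def update_queue(depth, measured, limit):
    """Return the new queue depth value of one resource-constrained virtual queue.

    depth    : current queue depth value
    measured : memory usage (or response latency) in the current scheduling cycle
    limit    : preset memory capacity threshold (or response latency upper limit)
    """
    difference = measured / limit - 1.0  # normalized index minus 1
    return max(0.0, depth + difference)  # floored at 0


def instantaneous_reward(completed_tasks, queue_depths, coefficient):
    """Scheduling throughput component minus the queue depth penalty component."""
    penalty = coefficient * sum(queue_depths)
    return completed_tasks - penalty
```

A node that stays above its threshold accumulates queue depth cycle after cycle, so its contribution to the penalty grows until the policy network routes tasks elsewhere.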

6. The one-stop large-scale model intelligent agent development and operation and maintenance platform integrating computing power scheduling and model management according to claim 1, characterized in that the scheduling state vector is a fixed-dimensional vector constructed as follows: the node-level computing power status summary records of all inference nodes in the global computing power status table are arranged in order of node number; the current queue of tasks to be scheduled is sorted in descending order of context token number, and the context token number and concurrent request number of the first M tasks are taken, where, if fewer than M tasks are pending, the empty slots are filled with 0; the current queue depth values of the resource-constrained virtual queues of all inference nodes are then concatenated; M is the upper limit of the number of task slots preset by the platform.
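By way of non-limiting illustration, the construction of claim 6 can be sketched in Python as follows. The function name, the dictionary-based inputs, and the tuple layouts are assumptions made for this example. With K inference nodes and M task slots, the resulting vector has 3K + 2M + 2K elements.

```python
def build_state_vector(node_summaries, tasks, queue_depths, max_slots):
    """Assemble the fixed-dimensional scheduling state vector.

    node_summaries : {node_number: (row_activations, occupancy_ratio, interface_bytes)}
    tasks          : list of (context_token_number, concurrent_request_number)
    queue_depths   : {node_number: (memory_queue_depth, latency_queue_depth)}
    max_slots      : M, the preset upper limit of task slots
    """
    state = []
    # Node-level summary records in order of node number.
    for node in sorted(node_summaries):
        state.extend(float(x) for x in node_summaries[node])
    # First M tasks by descending context token number.
    top = sorted(tasks, key=lambda t: t[0], reverse=True)[:max_slots]
    for tokens, concurrency in top:
        state.extend((float(tokens), float(concurrency)))
    # Zero padding for empty task slots.
    state.extend([0.0] * (2 * (max_slots - len(top))))
    # Queue depth values of all inference nodes (node order is an assumption).
    for node in sorted(queue_depths):
        state.extend(float(x) for x in queue_depths[node])
    return state
```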

7. The one-stop large-scale model intelligent agent development and operation and maintenance platform integrating computing power scheduling and model management according to claim 1, characterized in that, after receiving the scheduling state vector, the policy network outputs, for each task to be scheduled, a probability distribution over the inference nodes; the computing power scheduling control node performs one random sampling for each task to be scheduled according to the probability distribution to determine the target inference node; after all tasks have been sampled, the task allocation scheme is formed; the scheduling agent performs gradient updates on the policy network and the value network according to the clipped update rule of the proximal policy optimization algorithm.
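By way of non-limiting illustration, the per-task sampling of claim 7 and the standard clipped surrogate term of proximal policy optimization can be sketched in Python as follows. The function names are hypothetical, and the probability ratio, advantage estimate, and clipping range `eps=0.2` are ordinary PPO quantities chosen for this example rather than values recited in the claim.

```python
import random


def sample_allocation(task_probs, rng):
    """Draw one target inference node per task from its probability distribution.

    task_probs : {task_id: [probability of node 0, probability of node 1, ...]}
    rng        : a random.Random instance, so that sampling can be seeded
    """
    return {
        task: rng.choices(range(len(probs)), weights=probs, k=1)[0]
        for task, probs in task_probs.items()
    }


def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

Clipping the ratio limits how far a single gradient update can move the policy away from the version that collected the training batch.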

8. The one-stop large-scale model intelligent agent development and operation and maintenance platform integrating computing power scheduling and model management according to claim 1, characterized in that the deterministic backoff scheduling is executed as follows: the mean storage row activation count value in the node-level computing power status summary record of each inference node is read from the global computing power status table as the current load of that inference node; the tasks to be scheduled are arranged in descending order of context token number, and each task in turn is assigned to the inference node with the smallest current load; after each assignment, the context token number of the assigned task is added to the current load of the receiving inference node; once all tasks have been assigned, they are dispatched for execution.
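By way of non-limiting illustration, the greedy assignment of claim 8 can be sketched in Python as follows. The function name and the dictionary-based inputs are hypothetical, and ties between equally loaded nodes are broken by dictionary order, which the claim does not specify.

```python
def backoff_schedule(node_loads, tasks):
    """Assign each task to the currently least-loaded inference node.

    node_loads : {node_id: mean storage row activation count value}
    tasks      : {task_id: context token number}
    Returns {task_id: node_id}.
    """
    loads = dict(node_loads)  # work on a copy; the status table is left untouched
    assignment = {}
    # Largest tasks are placed first.
    ordered = sorted(tasks.items(), key=lambda item: item[1], reverse=True)
    for task_id, tokens in ordered:
        target = min(loads, key=loads.get)  # node with the smallest current load
        assignment[task_id] = target
        loads[target] += tokens             # account for the newly placed task
    return assignment
```

Because the procedure uses no learned component, the same status table and task list always produce the same allocation.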

9. The one-stop large-scale model intelligent agent development and operation and maintenance platform integrating computing power scheduling and model management according to claim 1, characterized in that, when the confidence interval radius exceeds the preset radius threshold, the computing power scheduling control node marks the current scheduling cycle as a low-confidence cycle; within the training batch collected under the current policy version of the scheduling agent, the state transition samples generated in low-confidence cycles are multiplied by an increased loss scaling factor when the policy gradient is calculated, so that the policy network applies a larger policy adjustment to high-uncertainty state regions in the gradient update of the current batch.
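By way of non-limiting illustration, the loss scaling of claim 9 can be sketched in Python as follows. The function name, the scaling factor `boost=2.0`, and the averaging over the batch are assumptions made for this example; the claim only requires that samples from low-confidence cycles receive an increased loss scaling factor.

```python
def weighted_policy_loss(sample_losses, low_confidence_flags, boost=2.0):
    """Batch policy loss with low-confidence samples scaled up.

    sample_losses        : per-sample policy loss terms for one training batch
    low_confidence_flags : True where the sample came from a low-confidence cycle
    boost                : hypothetical increased loss scaling factor (> 1)
    """
    if len(sample_losses) != len(low_confidence_flags):
        raise ValueError("one flag is required per sample")
    weights = [boost if flag else 1.0 for flag in low_confidence_flags]
    total = sum(w * loss for w, loss in zip(weights, sample_losses))
    return total / len(sample_losses)  # assumed mean reduction over the batch
```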