Large model inference scheduling method and system, storage medium and computer program product
By having computing nodes actively report status data and performance metrics, and adjusting to a computing node-led scheduling mode, the problems of load imbalance and high latency in traditional solutions are solved, achieving more efficient load balancing and stability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- IFLYTEK CO LTD
- Filing Date
- 2026-03-10
- Publication Date
- 2026-06-26
AI Technical Summary
Traditional distributed scheduling schemes cannot adapt to the characteristics of large model inference loads, resulting in uneven distribution of inference tasks, decreased throughput, and high inference latency.
By having computing nodes proactively perceive their own status data, calculate performance metrics, and report them to the gateway, the gateway uses these metrics to allocate inference tasks and adjusts to a computing node-led scheduling mode to achieve load balancing.
It improves the overall system throughput, enhances inference efficiency and service stability, and solves the problems of unbalanced load and high latency in traditional solutions.
[Figure: abstract drawing, CN121809702B]
Abstract
Description
Technical Field
[0001] This application relates to the field of large model inference scheduling, and more specifically to a large model inference scheduling method, system, storage medium, and computer program product.
Background Technology
[0002] With the widespread adoption of large language models (LLMs), inference services have become the core of industrial-grade AI systems. A typical inference architecture has two tiers: a front-end gateway layer and back-end compute nodes (inference workers). The gateway layer is responsible for routing and load balancing, while the compute nodes are responsible for scheduling the model inference pipeline.
[0003] Traditional distributed scheduling schemes are gateway-centric, employing either request-oriented allocation strategies or round-robin strategies. Request-oriented allocation strategies primarily assign different IP addresses to compute nodes for different types of requests to suit client preferences. Round-robin strategies, on the other hand, allocate requests sequentially to different compute nodes using a polling method.
[0004] In real-world production environments, large model inference workloads have the following characteristics:
[0005] The distribution of user request sequence lengths is highly uneven, with large real-time fluctuations.
[0006] The Prefill stage has high computational power requirements, while the Decode stage is sensitive to GPU memory utilization.
[0007] The performance of a compute node is strongly correlated with real-time status factors such as GPU memory, concurrency, and KV cache size.
[0008] The overall system throughput is limited by the uneven load between computing nodes and insufficient gateway decision information.
[0009] Traditional distributed scheduling schemes are not suited to the characteristics of large model inference loads. When they are applied directly to large model inference scheduling, problems such as uneven distribution of inference tasks, decreased throughput, and high inference latency are likely to occur.
Summary of the Invention
[0010] In view of the above problems, this application is proposed to provide a method, system, storage medium, and computer program product for large model inference scheduling, so as to improve the inference efficiency of the large model inference scheduling process and enhance service stability. The specific solution is as follows:
[0011] In a first aspect, a large model inference scheduling method is provided, including:
[0012] Obtain the status data of this computing node, which includes the status data of the processor inside the node and the status data of the inference task carried by the node;
[0013] The performance metrics of this computing node are determined based on the status data of this computing node.
[0014] The performance metrics are reported to the gateway according to the configured reporting strategy, so that the gateway can allocate computing nodes to inference tasks according to the performance metrics of each computing node.
[0015] Upon receiving a target inference task assigned by the gateway, the target inference task is executed using the inference model deployed on this computing node, and the inference result is fed back to the gateway.
[0016] In one possible design, in another implementation of the first aspect of the embodiments of this application, the reporting strategy includes at least one of the following:
[0017] Upon receiving an inference task assigned by the gateway, the performance metric is reported.
[0018] The reporting of the performance metrics is triggered upon completion of an inference task;
[0019] The performance metric is reported when the set reporting period is reached.
[0020] In one possible design, another implementation of the first aspect of the embodiments of this application further includes:
[0021] While reporting the performance metrics to the gateway, the first flag is also reported to the gateway. The first flag indicates whether the compute node wants to pull inference tasks, so that the gateway can filter compute nodes that can be assigned inference tasks.
[0022] In one possible design, in another implementation of the first aspect of the embodiments of this application, the state data of the processor inside the node includes the available video memory and KV Cache occupancy rate of the processor, and the state data of the inference tasks carried by the node includes the number of Prefill parallel tasks, the number of Decode parallel tasks, and the number of queued inference tasks.
[0023] The process of determining the performance metrics of this computing node based on its state data includes:
[0024] The hardware resource index C is calculated based on the available video memory and KV Cache utilization rate of the processor. The hardware resource index C is positively correlated with the available video memory and negatively correlated with the KV Cache utilization rate.
[0025] The sum of the number of parallel Prefill tasks and the number of parallel Decode tasks is taken as the number of tasks in inference, L.
[0026] The hardware resource index C, the number of tasks in inference L, and the number of queued inference tasks Q are used as the performance indexes.
[0027] In one possible design, in another implementation of the first aspect of the embodiments of this application, the computing node that receives the target inference task assigned by the gateway is:
[0028] The computing node with the highest fit score to the target inference task among all computing nodes, wherein the fit score of any computing node to the target inference task is positively correlated with the hardware resource index C of the computing node and negatively correlated with the number of tasks being inferred L and the number of inference tasks in the queue Q.
[0029] In one possible design, in another implementation of the first aspect of the embodiments of this application, the target inference task is:
[0030] An inference task that the gateway retrieves from the inference task queue in priority order and then allocates. The inference tasks in the inference task queue are ordered by priority, which the gateway assigns to each received inference task according to its service-level agreement (SLA).
[0031] In a second aspect, a large model inference scheduling method is provided, including:
[0032] Receive and update the performance metrics reported by the computing nodes. The performance metrics are calculated by the computing nodes based on their own state data, which includes the state data of the processor inside the computing node and the state data of the inference task carried by the node.
[0033] Obtain target inference tasks to be assigned from the inference task queue, and assign target computing nodes to the target inference tasks according to the latest performance indicators of each computing node;
[0034] The target inference task is sent to the target computing node.
[0035] In one possible design, another implementation of the second aspect of the embodiments of this application further includes:
[0036] The system receives the inference result of the target inference task from the target computing node and sends the inference result to the client corresponding to the target inference task.
[0037] In one possible design, another implementation of the second aspect of the embodiments of this application further includes:
[0038] Receive and update the first flag reported by the compute node, which indicates whether the compute node wants to fetch the inference task;
[0039] The process of allocating target computing nodes to the target inference task according to the latest performance metrics of each computing node includes:
[0040] Select candidate computing nodes that are marked as needing to retrieve inference tasks from all computing nodes;
[0041] Based on the latest performance metrics of each candidate computing node, a target computing node is assigned to the target inference task.
[0042] In one possible design, in another implementation of the second aspect of the embodiments of this application, the performance indicators of the computing node include hardware resource indicators C, the number of tasks in inference L, and the number of queued inference tasks Q;
[0043] The process of allocating target computing nodes to the target inference task according to the latest performance metrics of each computing node includes:
[0044] Based on the hardware resource index C of each computing node, the number of tasks in the inference process L, and the number of inference tasks in the queue Q, the fit score between the computing node and the target inference task is determined. The fit score is positively correlated with the hardware resource index C and negatively correlated with the number of tasks in the inference process L and the number of inference tasks in the queue Q.
[0045] The target inference task is assigned to the target computing node with the highest fit score.
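The selection rule above can be sketched as follows. The source only states the direction of each correlation, so the linear form `C - w_l * L - w_q * Q` and the weights `w_l` and `w_q` are assumptions made for illustration, as are the names `NodeMetrics`, `fit_score`, and `pick_node`.

```python
from dataclasses import dataclass


@dataclass
class NodeMetrics:
    node_id: str
    C: float  # hardware resource index reported by the node
    L: int    # number of tasks currently in inference
    Q: int    # number of queued inference tasks


def fit_score(m: NodeMetrics, w_l: float = 1.0, w_q: float = 1.0) -> float:
    # Assumed linear form: rises with C, falls with L and Q.
    return m.C - w_l * m.L - w_q * m.Q


def pick_node(nodes: list[NodeMetrics]) -> NodeMetrics:
    # The gateway assigns the task to the node with the highest fit score.
    return max(nodes, key=fit_score)
```

Any other scoring function with the same monotonic behavior would satisfy the description equally well; the linear form is simply the shortest one to write down.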
[0046] In one possible design, in another implementation of the second aspect of the embodiments of this application, the hardware resource index C is calculated by the computing node based on the available video memory and KV Cache utilization rate of its own internal processor. The hardware resource index C is positively correlated with the available video memory and negatively correlated with the KV Cache utilization rate.
[0047] The number L of tasks in inference is the sum of the number of Prefill parallel tasks and the number of Decode parallel tasks carried by the computing node itself.
[0048] In one possible design, in another implementation of the second aspect of the embodiments of this application, after assigning the target inference task to the target computing node with the highest fit score, the method further includes:
[0049] Increment the number of inference tasks L in the performance metrics of the target computing node by 1, or increment the number of queued inference tasks Q in the performance metrics of the target computing node by 1.
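A minimal sketch of this local update is shown below. It lets the gateway account for a task it has just assigned before the node's next report arrives. The source does not say how the gateway chooses between incrementing L and incrementing Q, so the `max_parallel` capacity threshold used here is a hypothetical rule.

```python
def bump_after_assign(metrics: dict, max_parallel: int) -> dict:
    """Return an updated copy of the gateway's local metrics for a node."""
    updated = dict(metrics)
    if updated["L"] < max_parallel:
        updated["L"] += 1  # assumption: the node can start the task at once
    else:
        updated["Q"] += 1  # assumption: the task waits in the node's queue
    return updated
```

The node's next report overwrites this estimate, so an imprecise guess only affects decisions made between two reports.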
[0050] In one possible design, in another implementation of the second aspect of the embodiments of this application, the inference tasks in the inference task queue are ordered according to priority, wherein the priority is assigned by the gateway to each received inference task according to its service-level agreement (SLA);
[0051] The process of obtaining target inference tasks to be assigned from the inference task queue includes:
[0052] The target inference tasks to be assigned are obtained sequentially from the inference task queue according to their priority.
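One way to keep the inference task queue ordered by priority is a binary heap, sketched below. The SLA tier names, the convention that a smaller number means a higher priority, and the class name `TaskQueue` are all assumptions. The source only says that priority is derived from the SLA.

```python
import heapq
import itertools

# Hypothetical SLA tiers; a smaller value is served first.
SLA_PRIORITY = {"premium": 0, "standard": 1, "best_effort": 2}


class TaskQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker: FIFO within one tier

    def push(self, task_id: str, sla: str) -> None:
        heapq.heappush(self._heap, (SLA_PRIORITY[sla], next(self._seq), task_id))

    def pop(self) -> str:
        # Returns the highest-priority task that has waited longest.
        return heapq.heappop(self._heap)[2]
```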
[0053] In a third aspect, a large model inference scheduling system is provided, including: a gateway and multiple computing nodes;
[0054] Each of the aforementioned computing nodes is used to execute the large model inference scheduling method described in any of the first aspects above;
[0055] The gateway is used to execute the large model inference scheduling method described in any of the second aspects above.
[0056] In a fourth aspect, a readable storage medium is provided on which a computer program is stored, which, when executed by a processor, implements the steps of the large model inference scheduling method described in any one of the first or second aspects.
[0057] In a fifth aspect, a computer program product is provided, comprising a computer program that, when executed by a processor, implements the steps of the large model inference scheduling method described in any one of the first or second aspects.
[0058] With the above technical solution, the large model inference scheduling method of this application replaces the traditional gateway "blind push" mode with a mode led by the computing nodes. Each computing node dynamically and comprehensively perceives its own status data, such as processor status and the status of the inference tasks it carries, determines its performance indicators from this status data, and actively reports them to the gateway. The gateway can then allocate a target computing node to each target inference task to be assigned based on the latest performance indicators of each computing node, which achieves better load balancing and thereby improves overall system throughput, inference efficiency, and the stability of the inference service.
Attached Figure Description
[0059] Various other advantages and benefits will become apparent to those skilled in the art upon reading the following detailed description of preferred embodiments. The accompanying drawings are for illustrative purposes only and are not intended to limit the scope of this application. Furthermore, the same reference numerals denote the same parts throughout the drawings. In the drawings:
[0060] Figure 1 is a schematic diagram of a large model inference scheduling system architecture provided in an embodiment of this application;
[0061] Figure 2 is a schematic diagram of another large model inference scheduling system architecture provided in an embodiment of this application;
[0062] Figure 3 is a schematic diagram of a large model inference scheduling method from the perspective of a computing node, provided in an embodiment of this application;
[0063] Figure 4 is a schematic diagram of a large model inference scheduling method from the perspective of the gateway, provided in an embodiment of this application.
Detailed Implementation
[0064] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0065] It is understood that before using the technical solutions disclosed in the various embodiments of this application, users should be informed of the types, scope of use, and usage scenarios of the personal information involved in this application in an appropriate manner in accordance with relevant laws and regulations, and user authorization should be obtained.
[0066] In traditional distributed scheduling schemes, the gateway is typically the central hub, and common load balancing strategies include:
[0067] 1. Round-robin and weighted round-robin (pre-assigning fixed weights to different computing nodes);
[0068] 2. Random;
[0069] 3. Simple dynamic scheduling based on connection count / RTT (dynamic scheduling is based on the statistical connection count between the gateway and the compute node. The connection count can be automatically updated after the gateway assigns an inference task to the compute node, such as connection count +1, and then updated again after receiving the inference result returned by the compute node, such as connection count -1).
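For comparison, the connection-count strategy in item 3 can be sketched as below. The class and method names are hypothetical. Note that the only state the gateway tracks is its own count of outstanding tasks per node.

```python
class LeastConnections:
    def __init__(self, node_ids):
        # Connection count per node, as seen from the gateway.
        self.conns = {n: 0 for n in node_ids}

    def assign(self) -> str:
        # Pick the node with the fewest outstanding tasks, then count +1.
        node = min(self.conns, key=self.conns.get)
        self.conns[node] += 1
        return node

    def on_result(self, node: str) -> None:
        # Count -1 once the node has returned the inference result.
        self.conns[node] -= 1
```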
[0070] In the above mechanism, the gateway can only perceive network layer metrics and cannot perceive the internal state data of the compute nodes, such as the current remaining GPU memory, KV cache usage, and prefill / decode concurrency status. Large model inference workloads have their own characteristics compared to traditional distributed computing tasks, including:
[0071] The distribution of user request sequence lengths is highly uneven, with large real-time fluctuations.
[0072] The Prefill stage has high computational power requirements, while the Decode stage is sensitive to GPU memory utilization.
[0073] The performance of a compute node is strongly correlated with real-time status factors such as GPU memory, concurrency, and KV cache size.
[0074] The overall system throughput is limited by the uneven load between computing nodes and insufficient gateway decision information.
[0075] If traditional distributed scheduling schemes are applied directly to large model inference scheduling, the gateway cannot obtain the internal state of the computing nodes. This leads to unreasonable allocation of inference tasks, uneven load across computing nodes, decreased overall system throughput, and high inference latency, which in turn affect inference efficiency and service stability.
[0076] To address this, this application presents a large model inference scheduling scheme centered on compute nodes. It replaces the traditional gateway-centric "blind push" model with a compute-node-led model in which each compute node proactively perceives its internal state, derives its performance metrics, and actively reports them to the gateway, which then allocates inference tasks based on the performance metrics of each compute node. A compute node has a more complete view of its own internal state data than the gateway does, so the metrics it calculates give the gateway a sounder basis for allocating inference tasks, improving load balancing, inference efficiency, and service stability.
[0077] This application provides a large model inference scheduling method that can be applied to, for example, the system architecture shown in Figure 1, which can include a client 100, a gateway 200, and multiple computing nodes 300.
[0078] The gateway can be of types such as Nginx, Envoy, Kong, or HAProxy. Each compute node is configured with a large inference model, and the inference engine invokes this model to execute inference tasks assigned to that compute node.
[0079] Client 100 sends an inference task request to the gateway, which then allocates compute nodes. Upon receiving the allocated inference task, the compute node executes the task, obtains the inference result, and sends it back to the gateway, which then forwards it to the corresponding client.
[0080] Figure 2 illustrates another architecture for a large model inference scheduling system. In this embodiment, computing nodes can be clustered. Specifically, multiple computing nodes related to the same inference model can be grouped into the same cluster, such as grouping 10 computing nodes related to the DeepSeekv3 model into the same cluster. Computing nodes related to different inference models can be grouped into different clusters. A performance model can be configured for each computing cluster, which is used to calculate performance metrics for the computing nodes within the cluster.
[0081] The performance model is used to calculate and output the node's performance metrics based on the internal state data monitored by the node. The performance model is related to parameters such as the size and activation size of the inference model deployed on the node. Therefore, in this embodiment, the same performance model is configured for all computing nodes that deploy the same inference model.
[0082] The following sections of this application will introduce the large-scale model inference scheduling method from the perspectives of compute node 300 and gateway 200, respectively.
[0083] First, the large model inference scheduling method is introduced from the perspective of computing node 300. As shown in Figure 3, the method may include the following steps:
[0084] Step S100: Obtain the status data of this computing node, which includes the status data of the processor inside the node and the status data of the inference task carried by the node.
[0085] Each compute node can monitor its own status data in real time or periodically, including the status data of its internal processors (such as GPUs) and the status data of the inference tasks it hosts. Examples of internal processor status data include GPU available memory (mem_free) and KV cache occupancy (kv_usage). The status data of the inference tasks hosted by the node includes the execution status data of the inference tasks allocated to the current compute node, including but not limited to: the number of prefill parallel tasks, the number of decode parallel tasks, and the number of queued inference tasks.
[0086] Among them, the large inference model belongs to the autoregressive model, and its inference process includes two stages:
[0087] 1. Prefill stage:
[0088] Process the entire prompt for the input.
[0089] Computational characteristics: Computationally intensive. The model needs to compute attention for each token in the prompt and build the key-value (KV) cache from which the first output token is generated.
[0090] The Prefill Parallelism Count indicates the capacity to process a certain number of user requests (Prompts) simultaneously during the Prefill phase.
[0091] 2. Decode stage:
[0092] After Prefill, the model begins to autoregressively generate output tokens. Each time a new token is generated, it is used as part of the input to generate the next one.
[0093] Computational characteristics: Memory bandwidth intensive. Each generation involves the computation of only one (or a batch) of new tokens, with a very small computational load. However, the entire process heavily relies on reading from and writing to the KV cache, so the performance bottleneck is often memory bandwidth rather than computing power.
[0094] The Decode parallelism count indicates how many requests (sequences) being generated can be processed simultaneously during the Decode phase, each generating its next token.
[0095] If a compute node receives too many inference tasks, some of these tasks will be queued. Therefore, the inference task status data monitored in this step includes the number of queued inference tasks.
[0096] It should be noted that the above only illustrates a portion of the node's state data, not all of it. This application can also monitor other state data within the node to more accurately measure the node's performance status.
[0097] Step S110: Determine the performance indicators of this computing node based on the status data of this computing node.
[0098] Specifically, based on the state data obtained by the computing node, the computing node can further calculate the performance index of the node. The performance index measures the latest performance of the computing node, and the performance of the computing node is related to the number of inference tasks it can accept.
[0099] As shown in Figure 2, when a computing node calculates performance metrics, it can call the performance model configured for the computing cluster to which it belongs. In the example of Figure 2, compute node A and compute node B deploy the same large inference model and are grouped into compute cluster 1. Compute cluster 1 is configured with performance model 1, which is used by all compute nodes within compute cluster 1 to measure their performance metrics. Similarly, compute node C and compute node D deploy the same large inference model and are grouped into compute cluster 2. Compute cluster 2 is configured with performance model 2, which is used by all compute nodes within compute cluster 2 to measure their performance metrics.
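The cluster-to-model relationship in this example can be sketched as a pair of lookup tables. The `PerformanceModel` fields `alpha` and `beta` correspond to the performance model parameters α and β named in the description, but the numeric values, the table layout, and the function name `model_for` are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerformanceModel:
    alpha: float
    beta: float


# Hypothetical parameter values, one model per compute cluster.
CLUSTER_MODELS = {
    "cluster-1": PerformanceModel(alpha=1.0, beta=20.0),
    "cluster-2": PerformanceModel(alpha=0.5, beta=10.0),
}

# Nodes A and B share cluster 1; nodes C and D share cluster 2.
NODE_CLUSTER = {"A": "cluster-1", "B": "cluster-1", "C": "cluster-2", "D": "cluster-2"}


def model_for(node_id: str) -> PerformanceModel:
    # Every node in a cluster resolves to the same performance model.
    return CLUSTER_MODELS[NODE_CLUSTER[node_id]]
```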
[0100] In some possible implementations, taking the state data of the internal processor of the node, including the available video memory mem_free and KV cache occupancy kv_usage, and the state data of the inference tasks carried by the node, including the number of prefill parallel tasks prefill_concurrency, the number of decode parallel tasks decode_concurrency, and the number of queued inference tasks Q, as an example, the calculation process of the node's performance indicators is introduced. The performance indicators may include: hardware resource indicators C, the number of tasks in inference L, and the number of queued inference tasks Q.
[0101] The calculation process for hardware resource index C:
[0102] The performance model is invoked to calculate the hardware resource metric C based on the processor's available video memory (mem_free) and KV cache occupancy (kv_usage).
[0103] The hardware resource metric C is positively correlated with available video memory (mem_free) and negatively correlated with the KV cache occupancy rate (kv_usage). One possible calculation method is shown in the following formula:
[0104] [Formula not reproduced in the source text.]
[0105] Here, α and β are the parameters of the performance model. The performance model can be obtained from offline testing, empirical estimation, or learning from historical data.
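The formula itself does not survive in this text. One form that is consistent with the stated correlations and with the two parameters α and β is a linear combination. This is an assumption for illustration, not the formula of the application:

```latex
C = \alpha \cdot \mathrm{mem\_free} - \beta \cdot \mathrm{kv\_usage},
\qquad \alpha > 0,\ \beta > 0
```

Under this assumed form, with α = 1, β = 20, mem_free = 30 and kv_usage = 0.5, the index is C = 30 − 10 = 20. Increasing mem_free raises C and increasing kv_usage lowers it, as the description requires.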
[0106] The calculation process for the number L of tasks currently in inference:
[0107] The sum of the number of parallel prefill tasks (prefill_concurrency) and the number of parallel decode tasks (decode_concurrency) is taken as the number L of tasks currently in inference:
[0108] L = prefill_concurrency + decode_concurrency.
[0109] The number of inference tasks in the queue, Q, can be obtained directly from the internal state data monitored by the node.
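Taken together, the three metrics can be computed from the monitored state as sketched below. The field names follow the identifiers used in the description. The linear expression for C is an assumed form that merely satisfies the stated correlations, and the memory unit (GiB) is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    mem_free: float           # available GPU memory (GiB assumed)
    kv_usage: float           # KV cache occupancy, in [0, 1]
    prefill_concurrency: int  # number of parallel prefill tasks
    decode_concurrency: int   # number of parallel decode tasks
    queued: int               # number of queued inference tasks


def performance_metrics(s: NodeStatus, alpha: float, beta: float):
    # Assumed linear form: C rises with mem_free and falls with kv_usage.
    C = alpha * s.mem_free - beta * s.kv_usage
    L = s.prefill_concurrency + s.decode_concurrency
    Q = s.queued
    return C, L, Q
```

For example, `performance_metrics(NodeStatus(30.0, 0.5, 2, 6, 3), alpha=1.0, beta=20.0)` yields C = 20.0, L = 8, and Q = 3.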
[0110] Step S120: Report the performance metrics to the gateway according to the configured reporting strategy, so that the gateway can allocate computing nodes to the inference task according to the performance metrics of each computing node.
[0111] Specifically, this application allows for pre-configuration of strategies for computing nodes to report performance metrics. Several optional reporting strategies are provided below:
[0112] 1. Trigger performance metric reporting upon receiving an inference task assigned by the gateway.
[0113] Specifically, after receiving an inference task assigned by the gateway, the compute node can trigger the calculation of performance metrics because the newly added inference task will affect the performance of the compute node. At this time, the performance metrics are calculated based on the latest internal state of the compute node, and the operation of reporting the latest performance metrics to the gateway is further triggered so that the gateway can obtain the latest performance metrics of the compute node in a timely manner and guide the subsequent inference task allocation work.
[0114] 2. Trigger performance metric reporting upon completion of an inference task.
[0115] Specifically, when a compute node completes an inference task, its internal state changes, which can trigger the calculation of performance metrics. Based on the latest internal state of the compute node, the performance metrics are calculated, and then the latest performance metrics are reported to the gateway. This allows the gateway to obtain the latest performance metrics of the compute node in a timely manner and guide the subsequent allocation of inference tasks.
[0116] 3. Trigger the reporting of performance indicators when the set reporting period is reached.
[0117] In addition to the two reporting strategies mentioned above, this application can also set a fixed reporting cycle, for example, to perform the calculation and reporting of performance indicators once every fixed time interval T.
[0118] Of course, in addition to the reporting strategies exemplified above, other reporting strategies can be set, which will not be listed one by one in this application embodiment.
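The three reporting triggers can be combined in one small component, sketched here. The class name `Reporter`, the callbacks `compute` and `send`, and the choice to restart the periodic timer after an event-driven report are assumptions. Time is passed in explicitly so the sketch stays free of threads and timers.

```python
from typing import Callable


class Reporter:
    def __init__(self, compute: Callable[[], tuple],
                 send: Callable[[tuple], None], period_s: float):
        self.compute = compute      # returns the latest (C, L, Q)
        self.send = send            # delivers the metrics to the gateway
        self.period_s = period_s    # fixed reporting interval T
        self.last_report = float("-inf")

    def _report(self, now: float) -> None:
        self.send(self.compute())
        self.last_report = now

    def on_task_received(self, now: float) -> None:
        self._report(now)           # strategy 1: a task was assigned

    def on_task_completed(self, now: float) -> None:
        self._report(now)           # strategy 2: a task finished

    def tick(self, now: float) -> None:
        # Strategy 3: report once the period has elapsed since the last report.
        if now - self.last_report >= self.period_s:
            self._report(now)
```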
[0119] In this embodiment, the computing nodes actively report the latest performance indicators according to the configured reporting strategy, which can guide the gateway to allocate inference tasks according to the performance indicators of each computing node, avoiding the "blind push" in traditional solutions and effectively improving the load balancing effect.
[0120] Step S130: Upon receiving the target inference task assigned by the gateway, execute the target inference task through the inference model deployed on this computing node, and feed back the inference result to the gateway.
[0121] Specifically, for any compute node, after receiving a target inference task assigned by the gateway, it can invoke the inference model deployed on this compute node to execute the target inference task. Various task scheduling strategies can be employed within the compute node, for example:
[0122] PagedAttention and RadixAttention;
[0123] Dynamic batching / Token scheduling;
[0124] Continuous batching;
[0125] Parallel execution of the prefill and decode phases, etc.
[0126] After obtaining the inference result of the target inference task, the computing node feeds back the inference result to the gateway, which then feeds back the inference result of the target inference task to the client that issued the target inference task, thus completing the scheduling of the inference task and the feedback of the result.
[0127] In some possible implementations, the compute node that receives the target inference task assigned by the gateway in this step is the compute node with the highest fit score for the target inference task among all compute nodes. Specifically, the gateway calculates the fit score between each compute node and the target inference task and assigns the target inference task to the compute node with the highest fit score, i.e., the compute node that receives the target inference task in this embodiment. The fit score between any compute node and the target inference task is positively correlated with the compute node's hardware resource index C and negatively correlated with the number of tasks currently in inference L and the number of queued inference tasks Q. The detailed process by which the gateway calculates the fit score can be found in the gateway-side scheme description below.
[0128] In some possible implementations, the target inference task received by this computing node is:
[0129] An inference task that the gateway retrieves from the inference task queue in priority order and then allocates. The inference tasks in the inference task queue are ordered by priority, which the gateway assigns to each received inference task according to its service-level agreement (SLA). For details, please refer to the gateway-side solution description below.
[0130] In some possible implementations, compute nodes can asynchronously write the runtime information of the task inference process to the log system for training performance models. The runtime information includes, but is not limited to, actual latency, memory consumption, batch throughput, etc.
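A minimal sketch of asynchronous runtime logging using Python's standard `logging.handlers.QueueHandler` and `QueueListener` is given below. The logger name, the JSON record layout, and the field names are assumptions. The source only lists actual latency, memory consumption, and batch throughput as examples of runtime information.

```python
import json
import logging
import logging.handlers
import queue


def make_async_logger(handler: logging.Handler):
    """Return a logger whose records are written by a background thread."""
    q = queue.Queue()
    logger = logging.getLogger("inference.runtime")  # hypothetical name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(logging.handlers.QueueHandler(q))
    listener = logging.handlers.QueueListener(q, handler)
    listener.start()
    return logger, listener


def log_runtime(logger, task_id, latency_ms, mem_gib, batch_tps):
    # The inference path only enqueues the record; the listener does the I/O.
    logger.info(json.dumps({
        "task_id": task_id,
        "latency_ms": latency_ms,
        "mem_gib": mem_gib,
        "batch_tps": batch_tps,
    }))
```

Calling `listener.stop()` at shutdown drains the queue, so records still pending are written before the process exits.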
[0131] The large model inference scheduling method provided in this application replaces the traditional gateway "blind push" mode with a mode led by the computing nodes. Each computing node dynamically and comprehensively perceives its own status data, such as processor status and the status of the inference tasks it carries, determines its performance indicators from this status data, and actively reports them to the gateway. The gateway can then allocate a target computing node to each target inference task to be allocated based on the latest performance indicators of each computing node, which achieves better load balancing and thereby improves overall system throughput, inference efficiency, and the stability of the inference service.
[0132] In some embodiments of this application, a scheduling scheme for computing nodes to actively pull inference tasks is provided.
[0133] Specifically, different computing nodes may have different business needs at different times. For example, during off-peak periods for inference requests (e.g., at night), some computing nodes can be configured to stop executing inference tasks and execute model training tasks instead. During peak periods for inference requests, those computing nodes can be switched back to executing inference tasks. Furthermore, the internal state of a computing node changes over the course of inference. When a computing node determines that its current load is too high, it can request to temporarily stop receiving new inference tasks, and it can resume receiving them after the load decreases. To flexibly adapt to these differing needs, this embodiment puts the computing node in the lead: the traditional scheduling mode, in which the gateway pushes "blindly" and computing nodes passively receive inference tasks, is replaced with a mode in which computing nodes actively pull inference tasks.
[0134] Specifically:
[0135] When a compute node reports performance metrics to the gateway, it can carry a first flag. The first flag indicates whether the compute node wants to pull inference tasks, so that the gateway can filter compute nodes that can be assigned inference tasks.
[0136] For example, when the first flag is a first value (e.g., a value of 1), it means that the compute node needs to fetch the inference task; when the first flag is a second value (e.g., a value of 0), it means that the compute node does not need to fetch the inference task.
[0137] When a compute node that is not performing inference tasks determines that it needs to switch to performing inference tasks, it can report its performance metrics to the gateway together with a first flag set to the first value. Upon receiving this report, the gateway sets its locally stored first flag for that compute node to the first value. Subsequently, when allocating inference tasks, the gateway selects only from the compute nodes whose first flag is set to the first value.
[0138] When a compute node is performing an inference task and determines that it needs to be changed to not perform inference tasks, it can report performance metrics to the gateway along with a first flag of a second value. Upon receiving this, the gateway can locally modify the first flag of the compute node to the second value.
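The flag handling in the paragraphs above can be sketched as follows. This is an illustrative sketch under stated assumptions: the names `NodeReport`, `build_report`, and `GatewayState` are hypothetical, and the load threshold `max_load` is an invented rule for deciding when a node considers its load too high.

```python
from dataclasses import dataclass

PULL = 1       # first value: the node wants to pull inference tasks
NO_PULL = 0    # second value: the node does not want new inference tasks


@dataclass
class NodeReport:
    node_id: str
    C: float        # hardware resource index
    L: int          # tasks currently in inference
    Q: int          # queued inference tasks
    first_flag: int


def build_report(node_id, C, L, Q, serving_inference, max_load=8):
    """Node side: derive the flag from local state (max_load is hypothetical)."""
    wants_tasks = serving_inference and (L + Q) < max_load
    return NodeReport(node_id, C, L, Q, PULL if wants_tasks else NO_PULL)


class GatewayState:
    """Gateway side: keeps the latest report, including the flag, per node."""

    def __init__(self):
        self.nodes = {}

    def on_report(self, report):
        # Overwrites both the metrics and the locally stored first flag.
        self.nodes[report.node_id] = report

    def candidates(self):
        return [r for r in self.nodes.values() if r.first_flag == PULL]


gw = GatewayState()
gw.on_report(build_report("node-a", C=0.7, L=2, Q=1, serving_inference=True))
gw.on_report(build_report("node-b", C=0.9, L=0, Q=0, serving_inference=False))  # e.g. training
gw.on_report(build_report("node-c", C=0.4, L=6, Q=3, serving_inference=True))   # overloaded
candidate_ids = [r.node_id for r in gw.candidates()]
```

In this example only `node-a` remains a candidate: `node-b` has been switched away from inference, and `node-c` has temporarily opted out because its load exceeds the assumed threshold.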
[0139] Next, the large model inference scheduling method is described from the perspective of the gateway 200. As shown in Figure 4, this large model inference scheduling method may include the following steps:
[0140] Step S200: Receive and update the performance metrics reported by the computing nodes.
[0141] The performance metrics are calculated by the computing node based on its own state data, which includes the state data of the processor inside the computing node and the state data of the inference task carried by the node.
[0142] Compute nodes can monitor their own status data and dynamically update their performance metrics based on this data. They then report these performance metrics to the gateway according to the configured reporting strategy. Examples of performance metrics, as described in the previous embodiments, include hardware resource metrics C, the number of tasks currently in inference L, and the number of queued inference tasks Q.
[0143] After receiving the performance metrics reported by the compute node, the gateway updates the performance metrics of that compute node stored locally with the latest reported performance metrics.
[0144] In this embodiment, a strategy is adopted in which the compute nodes actively report performance metrics. The compute nodes can comprehensively monitor their internal state and calculate their performance metrics accordingly. In contrast, in traditional solutions, the gateway can only obtain network metrics between itself and the compute nodes, and cannot obtain the internal state data of the compute nodes, thus making it difficult to achieve good load balancing.
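As a rough sketch of how a compute node might derive the three metrics from its own status data and decide when to report them: the product formula for C below is only a stand-in that preserves the stated correlations (C rises with available GPU memory and falls with KV cache utilization), not the performance model of this application. The names `NodeStatus`, `performance_metrics`, and `should_report` are hypothetical; the three trigger events follow the reporting strategy recited in claim 2.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    free_gpu_mem_gb: float
    total_gpu_mem_gb: float
    kv_cache_util: float      # fraction in [0, 1]
    prefill_tasks: int
    decode_tasks: int
    queued_tasks: int


def performance_metrics(s):
    """Return (C, L, Q) for one node.

    The product form of C is a placeholder assumption: it only preserves the
    stated monotonicity (rises with free memory, falls with KV cache use).
    """
    C = (s.free_gpu_mem_gb / s.total_gpu_mem_gb) * (1.0 - s.kv_cache_util)
    L = s.prefill_tasks + s.decode_tasks    # tasks currently in inference
    Q = s.queued_tasks
    return C, L, Q


# Reporting triggers: task received, task completed, reporting period reached.
REPORT_EVENTS = {"task_received", "task_completed", "period_elapsed"}


def should_report(event):
    return event in REPORT_EVENTS


status = NodeStatus(free_gpu_mem_gb=40.0, total_gpu_mem_gb=80.0,
                    kv_cache_util=0.25, prefill_tasks=2, decode_tasks=5,
                    queued_tasks=3)
C, L, Q = performance_metrics(status)
```

None of these inputs is visible to a gateway that only observes network metrics, which is why the node computes them locally and reports the result.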
[0145] Step S210: Obtain the target inference task to be assigned from the inference task queue, and assign a target computing node to the target inference task according to the latest performance metrics of each computing node.
[0146] Specifically, the gateway can receive inference tasks sent by the client and add the inference tasks to the inference task queue.
[0147] When performing task allocation, the target inference task to be allocated can be obtained from the inference task queue, and the target inference task requirements can be matched in real time according to the latest performance indicators of each computing node to determine the target computing node that matches the target inference task.
[0148] In this step, because the gateway holds the latest performance metrics of each computing node, it can select a target computing node whose current capacity can accommodate the target inference task, thereby achieving better load balancing.
[0149] Step S220: Send the target inference task to the target computing node.
[0150] After identifying the target computing node that matches the target inference task, the gateway sends the target inference task to the target computing node, which then calls its deployed large inference model to execute the inference process.
[0151] In some possible solutions, the gateway can further receive the inference results of the target inference task fed back by the target computing node, and send the inference results to the client corresponding to the target inference task.
[0152] The large model inference scheduling method proposed in this application replaces the traditional gateway "blind push" mode with a mode led by the computing nodes. Each computing node can dynamically and comprehensively perceive its own status data, such as the processor status and the status of the inference tasks it carries, and determine its performance metrics from this status data. The computing nodes actively report these performance metrics to the gateway. The gateway can then allocate a target computing node to each target inference task to be assigned based on the latest performance metrics of each computing node, which achieves better load balancing and thereby improves overall system throughput, inference efficiency, and the stability of the inference service.
[0153] As discussed earlier for the method from the perspective of the compute nodes, a compute node can also report to the gateway a first flag indicating whether it wants to pull inference tasks. Accordingly, the gateway can receive the first flag reported by each compute node and update it in local storage.
[0154] In that case, the process in step S210 of allocating a target computing node to the target inference task according to the latest performance metrics of each computing node may include:
[0155] From all computing nodes, select as candidate computing nodes those whose first flag indicates that they need to pull inference tasks.
[0156] Based on the latest performance metrics of each candidate computing node, assign a target computing node to the target inference task.
[0157] In this embodiment, the traditional gateway "blind push" and compute node passive reception scheduling mode is adjusted to a compute node active pull scheduling mode. That is, compute nodes can carry a first flag when reporting performance indicators based on their own business needs. Different values of the first flag indicate whether the compute node needs to pull inference tasks. When allocating inference tasks, the gateway can select compute nodes from the candidate compute nodes identified by the first flag that need to pull inference tasks. This allows compute nodes to actively control whether they receive new inference tasks.
[0158] In some embodiments of this application, after receiving inference tasks sent by the client, the gateway can assign priorities to each received inference task according to the SLA (Service Level Agreement), and send the received inference tasks into the inference task queue according to priority. The inference tasks in the inference task queue are sorted in order of priority.
[0159] In one alternative approach, the inference task queue may include multiple queues with different priorities, each containing inference tasks of the same priority. Taking as an example an SLA that prioritizes tasks by request length, this embodiment provides the following way of setting up the priority queues:
[0160] Q1: Short sequences;
[0161] Q2: Medium to long sequences;
[0162] Q3: Long sequences;
[0163] Q4: Background batch tasks.
[0164] The priority of Q1-Q4 decreases sequentially. Among them, the background batch tasks are background tasks with low timeliness requirements, so their priority is the lowest.
[0165] The remaining inference tasks are prioritized according to the length of the task request; the shorter the task request, the higher the priority.
[0166] The process by which the gateway obtains the target inference task to be assigned from the inference task queue may include:
[0167] Retrieve target inference tasks from the inference task queue in order of priority.
[0168] By prioritizing inference tasks, computation nodes can be allocated sequentially according to this priority, ensuring that high-priority inference tasks receive timely feedback.
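The four-level queue and priority-ordered retrieval described above can be sketched as follows. The length thresholds `SHORT_MAX` and `MEDIUM_MAX`, the task dictionary layout, and the class name `PriorityTaskQueue` are assumptions for illustration; the text does not specify where the boundaries between short, medium-to-long, and long sequences lie.

```python
from collections import deque

# Hypothetical length thresholds (in tokens); the text does not give values.
SHORT_MAX = 512
MEDIUM_MAX = 4096


def classify(task):
    """Map a task to a queue index: 0 = Q1 (highest) ... 3 = Q4 (lowest)."""
    if task.get("background_batch"):
        return 3
    if task["length"] <= SHORT_MAX:
        return 0
    if task["length"] <= MEDIUM_MAX:
        return 1
    return 2


class PriorityTaskQueue:
    def __init__(self):
        self.queues = [deque() for _ in range(4)]

    def put(self, task):
        self.queues[classify(task)].append(task)

    def get(self):
        # Scan from Q1 to Q4 and pop the oldest task of the first non-empty queue.
        for q in self.queues:
            if q:
                return q.popleft()
        return None


tq = PriorityTaskQueue()
tq.put({"id": "t1", "length": 9000})                           # long -> Q3
tq.put({"id": "t2", "length": 100})                            # short -> Q1
tq.put({"id": "t3", "length": 50, "background_batch": True})   # batch -> Q4
tq.put({"id": "t4", "length": 2000})                           # medium -> Q2
order = [tq.get()["id"] for _ in range(4)]
```

Here the tasks leave the queue in the order t2, t4, t1, t3, regardless of arrival order. Note that the background batch task is served last even though it is short.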
[0169] In some embodiments of this application, the process by which the gateway allocates target computing nodes to the target inference task according to the performance indicators of each computing node is described.
[0170] The following description takes as an example performance metrics that include the hardware resource metric C, the number of tasks in inference L, and the number of queued inference tasks Q:
[0171] The gateway can determine the fit score S between the computing node and the target inference task based on the hardware resource index C of each computing node, the number of tasks in inference L, and the number of queued inference tasks Q.
[0172] The fit score S is positively correlated with the hardware resource metric C and negatively correlated with the number of tasks in inference L and the number of queued inference tasks Q. For the calculation of the hardware resource metric C and the number of tasks in inference L, refer to the description of the computing node-side scheme above; it is not repeated here.
[0173] The target inference task is assigned to the target computing node with the highest fit score S.
[0174] Specifically, to enable a unified comparison of metrics across different dimensions, the metrics of each computation node can be quantified:
[0175] Quantized hardware resource metric of the i-th computing node:
[0176] $\hat{C}_i = \dfrac{C_i - C_{\min}}{C_{\max} - C_{\min}}$.
[0177] where $C_{\min}$ and $C_{\max}$ denote the minimum and maximum values of the hardware resource metric across all computing nodes, respectively.
[0178] Quantized number of tasks in inference at the i-th computing node:
[0179] $\hat{L}_i = \dfrac{L_i - L_{\min}}{L_{\max} - L_{\min}}$.
[0180] where $L_{\min}$ and $L_{\max}$ denote the minimum and maximum number of tasks in inference across all computing nodes, respectively.
[0181] Quantized number of queued inference tasks at the i-th computing node:
[0182] $\hat{Q}_i = \dfrac{Q_i - Q_{\min}}{Q_{\max} - Q_{\min}}$.
[0183] where $Q_{\min}$ and $Q_{\max}$ denote the minimum and maximum number of queued inference tasks across all computing nodes, respectively.
[0184] The fit score between the i-th computing node and the target inference task is:
[0185] $S_i = \alpha \hat{C}_i - \beta \hat{L}_i - \gamma \hat{Q}_i$.
[0186] where $\alpha$, $\beta$, and $\gamma$ are adjustable weighting coefficients, which can be set based on offline testing or experience.
[0187] Finally, the target computing node with the highest score is selected.
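The scoring procedure above can be sketched in a few lines. The sketch assumes min-max normalization of each metric across nodes and a weighted linear combination for the score; the weight values, the node data, and the function names `min_max` and `fit_scores` are illustrative assumptions, and the handling of the case where all nodes share the same value is a choice made for this example.

```python
def min_max(values):
    """Scale values to [0, 1]; if all values are equal, return zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fit_scores(nodes, alpha=0.5, beta=0.3, gamma=0.2):
    """nodes: list of dicts with keys 'id', 'C', 'L', 'Q'. Returns {id: score}."""
    c = min_max([n["C"] for n in nodes])
    l = min_max([n["L"] for n in nodes])
    q = min_max([n["Q"] for n in nodes])
    # Higher C raises the score; higher L and Q lower it.
    return {
        n["id"]: alpha * c[i] - beta * l[i] - gamma * q[i]
        for i, n in enumerate(nodes)
    }


nodes = [
    {"id": "node-a", "C": 0.8, "L": 4, "Q": 2},
    {"id": "node-b", "C": 0.6, "L": 1, "Q": 0},
    {"id": "node-c", "C": 0.2, "L": 6, "Q": 5},
]
scores = fit_scores(nodes)
target = max(scores, key=scores.get)   # node with the highest fit score
```

With these assumed weights, `node-b` is selected even though `node-a` has more free hardware resources, because `node-a` already carries more running and queued tasks.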
[0188] In some embodiments of this application, another alternative implementation scheme for the gateway to perform inference task allocation is provided.
[0189] In this embodiment, the gateway can obtain multiple inference tasks to be assigned from the inference task queue. For example, multiple inference tasks are obtained from the same priority queue to form a set of inference tasks to be assigned in the current batch. For each inference task in the set, a fit score between each computing node and the inference task is calculated.
[0190] In this embodiment, the resource requirements R of each inference task (such as the computing power requirements, memory requirements, etc.) can be considered. When calculating the fit score between the inference task and the computing node, both the performance indicators of the computing node and the resource requirements R of the inference task are considered. The calculation formula is as follows:
[0191] $S_{ij} = \alpha \hat{C}_i - \beta \hat{L}_i - \gamma \hat{Q}_i - \delta R_j$.
[0192] where $S_{ij}$ denotes the fit score between the j-th inference task in the set of inference tasks to be assigned and the i-th computing node; $\hat{C}_i$, $\hat{L}_i$, and $\hat{Q}_i$ are the quantized performance metrics of the i-th computing node; $\alpha$, $\beta$, $\gamma$, and $\delta$ are adjustable weighting coefficients, which can be set based on offline testing or experience; and $R_j$ denotes the resource requirement of the j-th inference task. Since the resource requirement is positively correlated with the length of the inference task, one possible quantification strategy is to quantify the resource requirement based on the length of the inference task, as shown in the following formula:
[0193] $R_j = k \cdot \mathrm{len}_j$.
[0194] where $\mathrm{len}_j$ denotes the length of the j-th inference task and $k$ is the quantization coefficient, which can be set based on offline testing or experience.
[0195] Determine the maximum among all fit scores, establish a pairing between the inference task and the computing node corresponding to that maximum score, and assign the inference task to the computing node according to the pairing.
[0196] After node allocation is completed for one inference task, the above allocation process can be repeated until all pending inference tasks have been allocated or a set termination condition has been met.
[0197] The method provided in this embodiment takes the resource requirements of the inference tasks to be assigned into account when computing their fit scores with the computing nodes. As a result, for inference tasks in the same priority queue, the allocation order can be adjusted according to their resource requirements, which makes the allocation strategy more flexible.
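The batch procedure above (score every task-node pair, assign the best pair, repeat) can be sketched as a greedy loop. Several points here are assumptions rather than statements of this application: the resource requirement is taken as `k * length` and subtracted from the node-side score, the node metrics are min-max normalized, and the chosen node's queued count is incremented between rounds so that later rounds see the new load. All names and numeric values are illustrative.

```python
def min_max(values):
    """Scale values to [0, 1]; if all values are equal, return zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def greedy_batch_assign(nodes, tasks, alpha=0.5, beta=0.3, gamma=0.2,
                        delta=0.2, k=0.001):
    """Return a list of (task_id, node_id) pairs in assignment order."""
    nodes = [dict(n) for n in nodes]      # copies, so the caller's data is untouched
    pending = list(tasks)
    plan = []
    while pending:
        c = min_max([n["C"] for n in nodes])
        l = min_max([n["L"] for n in nodes])
        q = min_max([n["Q"] for n in nodes])
        best = None                        # (score, node, task)
        for i, node in enumerate(nodes):
            node_part = alpha * c[i] - beta * l[i] - gamma * q[i]
            for task in pending:
                # Assumed form: requirement R_j = k * length, weighted by delta.
                score = node_part - delta * k * task["length"]
                if best is None or score > best[0]:
                    best = (score, node, task)
        _, node, task = best
        plan.append((task["id"], node["id"]))
        pending.remove(task)
        node["Q"] += 1                     # assumed bookkeeping between rounds
    return plan


nodes = [
    {"id": "node-a", "C": 0.8, "L": 2, "Q": 1},
    {"id": "node-b", "C": 0.6, "L": 1, "Q": 0},
    {"id": "node-c", "C": 0.2, "L": 6, "Q": 5},
]
tasks = [
    {"id": "t1", "length": 3000},
    {"id": "t2", "length": 200},
    {"id": "t3", "length": 1000},
]
plan = greedy_batch_assign(nodes, tasks)
```

Under these assumptions the shortest task is paired first, and once `node-a` has absorbed two tasks its score drops below that of `node-b`, which receives the third.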
[0198] In some embodiments of this application, network latency between the gateway and the compute nodes may fluctuate. After the gateway assigns an inference task to a compute node, network latency may delay delivery of the task to the compute node, and the compute node may therefore be late in reporting its updated performance metrics. In the meantime, the gateway may assign a large number of inference tasks to the same compute node within a short period, exceeding the node's actual capacity. This embodiment therefore provides the following solution:
[0199] After allocating a target compute node to the target inference task according to the latest performance metrics of each compute node, the gateway further performs the following step:
[0200] Increment by 1 the number of tasks in inference L or the number of queued inference tasks Q in the performance metrics of the target compute node.
[0201] In other words, the gateway automatically updates the performance metrics of the target compute node after assigning an inference task to it, thus avoiding the assignment of a large number of inference tasks to the same compute node in a short period of time under network latency conditions.
[0202] Understandably, once the target compute node reports the latest performance metrics again, the gateway will update the target compute node's performance metrics to complete the calibration of the performance metrics.
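The local increment and the later calibration can be sketched as follows. The class name `NodeMetricsCache` and its method names are hypothetical; the choice of incrementing Q by default rather than L is likewise an assumption, since the text allows either.

```python
class NodeMetricsCache:
    """Gateway-side copy of each node's metrics (C, L, Q)."""

    def __init__(self):
        self.metrics = {}

    def on_report(self, node_id, C, L, Q):
        # A fresh report replaces the local estimate (calibration).
        self.metrics[node_id] = {"C": C, "L": L, "Q": Q}

    def on_assign(self, node_id, counter="Q"):
        # Count the task at once, before the node's next report arrives.
        if counter not in ("L", "Q"):
            raise ValueError("counter must be 'L' or 'Q'")
        self.metrics[node_id][counter] += 1


cache = NodeMetricsCache()
cache.on_report("node-a", C=0.8, L=2, Q=0)
cache.on_assign("node-a")                    # first task sent
cache.on_assign("node-a")                    # second task sent, no report yet
estimate = dict(cache.metrics["node-a"])     # gateway's view while reports lag
cache.on_report("node-a", C=0.7, L=3, Q=1)   # node's own view arrives
calibrated = cache.metrics["node-a"]
```

Between reports the gateway's estimate of Q rises with each assignment, so the node's fit score falls immediately. The next report then overwrites the estimate with the node's actual state.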
[0203] Next, the traditional distributed scheduling scheme and the large-model inference scheduling scheme of this application will be compared using Table 1 below:
[0204] Table 1
[0205]
[0206] "Prone to long-sequence pollution" refers to the situation where one or more requests with abnormally long processing times ("long sequences") occur in the system, occupying critical resources (such as GPU memory) for an extended period. This forces a large number of subsequent short requests to wait in the queue, thus "polluting" the entire queue and causing these requests, which should have completed quickly, to experience high latency.
[0207] "p95 / p99": p95 is the response time within which 95% of requests complete (the 95th percentile of response time); p99 is the response time within which 99% of requests complete (the 99th percentile). These are key metrics for measuring system latency and stability.
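As a small worked example of these two metrics, the sketch below computes them with the nearest-rank method, which is one common percentile definition among several. The function name `percentile` and the sample data are illustrative assumptions.

```python
import math


def percentile(latencies_ms, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


latencies_ms = list(range(1, 101))   # 100 requests taking 1 ms .. 100 ms
p95 = percentile(latencies_ms, 95)   # 95
p99 = percentile(latencies_ms, 99)   # 99
```

Because these metrics look at the slow tail rather than the average, a few long requests that hold up many short ones show up clearly in p95 and p99 even when the mean latency changes little.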
[0208] This application also provides a computer program product including computer-readable instructions, which, when executed on an electronic device, cause the electronic device to implement any of the large model inference scheduling methods provided in this application.
[0209] This application also provides a computer-readable storage medium that carries one or more computer programs. When the one or more computer programs are executed by an electronic device, the electronic device can implement any of the large model inference scheduling methods provided in this application.
[0210] It should also be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. In addition, in the device embodiment drawings provided in this application, the connection relationship between modules indicates that they have a communication connection, which can be implemented as one or more communication buses or signal lines.
[0211] Through the above description of the embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus necessary general-purpose hardware, or it can be implemented by special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memory, special-purpose components, etc. Generally, any function performed by a computer program can be easily implemented by corresponding hardware, and the specific hardware structure used to implement the same function can also be diverse, such as analog circuits, digital circuits, or special-purpose circuits. However, for this application, software program implementation is more often the preferred implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, mobile hard disk, ROM, RAM, magnetic disk, or optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, training equipment, or network device, etc.) to execute the methods described in the various embodiments of this application.
[0212] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product.
[0213] The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center via wired (e.g., coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a training device or data center, that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid-state drives (SSDs)).
[0214] The various embodiments in this specification are described in a progressive manner. Each embodiment focuses on the differences from other embodiments. The various embodiments can be combined as needed, and the same or similar parts can be referred to each other.
Claims
1. A large-scale model inference scheduling method, characterized in that, Each computing node deploys an inference model. Computing nodes with the same inference model are grouped into the same computing cluster. Each computing cluster is configured with a related performance model for use by the computing nodes within the cluster. The performance metrics of the computing nodes are defined by the following method: Obtain the status data of this computing node, which includes the status data of the processor inside the node and the status data of the inference task carried by the node; Based on the status data of this computing node, the performance indicators of this computing node are determined by calling the performance model related to the computing cluster to which this computing node belongs. The performance metrics and the first flag are reported to the gateway according to the configured reporting strategy, so that the gateway can allocate computing nodes to inference tasks according to the performance metrics of each computing node. The first flag is used to indicate whether the computing node wants to pull inference tasks, so that the gateway can filter computing nodes that can be allocated inference tasks. Upon receiving a target inference task assigned by the gateway, the target inference task is executed using the inference model deployed on this computing node, and the inference result is fed back to the gateway. The status data of the processor inside the node includes the available video memory and KV cache utilization rate of the processor, and the status data of the inference tasks carried by the node includes the number of prefill parallel tasks, the number of decode parallel tasks, and the number of queued inference tasks. 
The process of determining the performance metrics of this computing node based on the state data of this computing node includes: The hardware resource index C is calculated based on the available video memory and KV Cache utilization rate of the processor. The hardware resource index C is positively correlated with the available video memory and negatively correlated with the KV Cache utilization rate. The sum of the number of Prefill parallel tasks and the number of Decode parallel tasks is taken as the number of tasks in inference L. The hardware resource index C, the number of tasks in inference L, and the number of queued inference tasks Q are used as the performance metrics.
2. The method according to claim 1, characterized in that, The reporting strategy includes at least one of the following: Upon receiving an inference task assigned by the gateway, the performance metric is reported. The reporting of the performance metrics is triggered upon completion of an inference task; The performance metric is reported when the set reporting period is reached.
3. The method according to claim 1, characterized in that, The computing node that receives the target inference task assigned by the gateway is: The computing node with the highest fit score to the target inference task among all computing nodes, wherein the fit score of any computing node to the target inference task is positively correlated with the hardware resource index C of the computing node and negatively correlated with the number of tasks being inferred L and the number of inference tasks in the queue Q.
4. The method according to claim 1, characterized in that, The target inference task is: an inference task that the gateway retrieves from the inference task queue and allocates in order of priority. The inference tasks in the inference task queue are ordered by priority, which is assigned by the gateway to each received inference task according to the Service Level Agreement (SLA).
5. A large-scale model inference scheduling method, characterized in that, Each computing node deploys an inference model. Computing nodes with the same inference model are grouped into the same computing cluster. Each computing cluster is configured with a related performance model for use by the computing nodes within the cluster. The performance metrics of the computing nodes are defined by the following method: The system receives and updates performance metrics and a first flag reported by compute nodes. The first flag indicates whether the compute node wants to pull inference tasks. The performance metrics are calculated by the compute node based on its own state data by calling the performance model related to the compute cluster to which the compute node belongs. The performance metrics of the compute node include hardware resource metrics C, the number of tasks in progress L, and the number of queued inference tasks Q. The state data includes the state data of the compute node's internal processor and the state data of the inference tasks carried by the node. The state data of the internal processor includes the available GPU memory and KV Cache utilization. The state data of the inference tasks carried by the node includes the number of Prefill parallel tasks, the number of Decode parallel tasks, and the number of queued inference tasks. The hardware resource metrics C are calculated by the compute node based on the available GPU memory and KV Cache utilization of its own internal processor. The hardware resource metrics C are positively correlated with the available GPU memory and negatively correlated with the KV Cache utilization. The number of tasks in progress L is the sum of the number of Prefill parallel tasks and the number of Decode parallel tasks carried by the compute node itself. 
Obtain the target inference task to be assigned from the inference task queue, select from all computing nodes the candidate computing nodes whose first flag indicates that they need to pull inference tasks, and assign a target computing node to the target inference task according to the latest performance metrics of each candidate computing node. The target inference task is sent to the target computing node.
6. The method according to claim 5, characterized in that, Also includes: The system receives the inference result of the target inference task from the target computing node and sends the inference result to the client corresponding to the target inference task.
7. The method according to claim 5, characterized in that, The process of allocating target computing nodes to the target inference task according to the latest performance metrics of each computing node includes: Based on the hardware resource index C of each computing node, the number of tasks in the inference process L, and the number of inference tasks in the queue Q, the fit score between the computing node and the target inference task is determined. The fit score is positively correlated with the hardware resource index C and negatively correlated with the number of tasks in the inference process L and the number of inference tasks in the queue Q. The target inference task is assigned to the target computing node with the highest fit score.
8. The method according to claim 7, characterized in that, After assigning the target inference task to the target computing node with the highest fit score, the process further includes: Increment the number of inference tasks L in the performance metrics of the target computing node by 1, or increment the number of queued inference tasks Q in the performance metrics of the target computing node by 1.
9. The method according to claim 5, characterized in that, The inference tasks in the inference task queue are sorted by priority, which is assigned by the gateway to each received inference task according to the Service Level Agreement (SLA). The process of obtaining target inference tasks to be assigned from the inference task queue includes: The target inference tasks to be assigned are obtained sequentially from the inference task queue according to their priority.
10. A large-scale model inference scheduling system, characterized in that it comprises: a gateway and multiple computing nodes; Each of the computing nodes is used to execute the large model inference scheduling method according to any one of claims 1-4; The gateway is used to execute the large model inference scheduling method according to any one of claims 5-9.
11. A readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by the processor, it implements the various steps of the large model inference scheduling method as described in any one of claims 1-9.
12. A computer program product, comprising a computer program, characterized in that, When the computer program is executed by the processor, it implements the steps of the large model inference scheduling method as described in any one of claims 1-9.