Token-aware batch and online scheduling method for serverless NLP inference

By using token-aware batch processing and online scheduling methods, token buckets are dynamically partitioned and resource allocation is optimized, which solves the computational redundancy and resource fragmentation of NLP inference tasks in serverless platforms, improving efficiency and reducing costs.

CN122198150A · Pending · Publication Date: 2026-06-12 · NANJING UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: NANJING UNIV OF SCI & TECH
Filing Date: 2026-05-11
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Existing serverless platforms face challenges in handling NLP inference tasks, including computational redundancy due to token length variability, resource imbalance, and fragmentation of heterogeneous node resources, resulting in low inference efficiency and high operating costs.

Method used

By adopting a token-aware batch processing method, we can optimize resource configuration and scheduling strategies by dynamically dividing token buckets and combining offline performance data and resource matching guidance, thereby ensuring service level objectives while reducing costs.

Benefits of technology

The method reduces computational redundancy and resource fragmentation, improves inference efficiency, lowers operating costs, and meets strict service level objective (SLO) latency constraints.

Generated by Eureka AI based on patent content.


Abstract

The application discloses a token-aware batch processing and online scheduling method for serverless NLP inference. The method comprises the following steps: receiving inference requests in real time and extracting their token length features, with offline performance profiling carried out for different batching parameters and resource configurations; performing token-aware batch partitioning, in which requests with similar length features are dynamically grouped into token buckets by quantifying the padding waste generated during computation; for each bucket, matching the most economical computing resource quota based on the offline performance profile and the remaining execution time, so that cost is minimized while the service level objective (SLO) latency is met; and adopting an online scheduling strategy guided by resource coverage, in which each task is dispatched to the best-matching node instance by evaluating how well the remaining capacity of each worker node matches the current task demand. The application significantly reduces the resource cost and execution latency of serverless inference services, reduces resource fragmentation, and effectively improves system efficiency.

Description

Technical Field

[0001] This invention belongs to the interdisciplinary field of serverless computing and natural language processing, and in particular, it is a token-aware batch processing and online scheduling method for serverless NLP inference.

Background Technology

[0002] Serverless computing (SC), as an emerging cloud computing paradigm, is gradually becoming the preferred deployment platform for large-scale Natural Language Processing (NLP) inference tasks due to its extremely high elasticity and scalability, on-demand billing economy, and maintenance-free characteristics. In a serverless environment, developers can deploy NLP models as cloud functions, automatically triggering inference tasks based on real-time traffic, thus effectively handling the fluctuating load demands of internet applications. However, in real-world large-scale NLP inference scenarios, existing serverless platforms face three key technical bottlenecks when handling complex workloads:

[0003] First, there's the computational redundancy issue caused by token length variability. Traditional batching techniques are a primary means of improving inference throughput, but in NLP scenarios, the token lengths of input requests often exhibit significant differences. Existing serverless platforms often lack awareness of token characteristics during batching. To align requests of varying lengths within the same batch, the system must insert a large number of placeholders. This "padding waste" not only causes unnecessary memory allocation but also leads to substantial ineffective computational overhead, significantly reducing inference efficiency.

[0004] Secondly, there is a mismatch between resource allocation schemes and Service Level Objective (SLO) constraints. Pre-trained language models (such as BERT and GPT) have extremely high computational complexity, and their performance is highly sensitive to computing resources (such as the number of CPU cores and the share of GPU compute). Although serverless platforms allow fine-grained resource quotas, most existing allocation schemes are static or coarse-grained and fail to capture the non-linear mapping between input sequence length, batch size, and heterogeneous resource combinations. As a result, when facing strict SLO latency constraints, the system often over-provisions resources to avoid SLO violations, resulting in high operating costs.

[0005] Finally, there is the issue of resource fragmentation in heterogeneous node environments. In scenarios with dynamically arriving inference requests, the available resources of each worker node in a Serverless cluster exhibit high fragmentation and heterogeneity. Traditional greedy scheduling algorithms (such as round-robin or least load first) only focus on the node's load metrics, ignoring the spatial matching degree between the resource requirements of the batch to be executed and the node's remaining resources. This blindness in scheduling leads to a large number of small resource remnants not being effectively utilized, frequently triggering unnecessary cold starts, and severely limiting the overall throughput performance of the system.

[0006] Therefore, there is an urgent need to design an online scheduling method that can deeply perceive token characteristics, achieve dynamic optimal resource allocation, and effectively suppress resource fragmentation, so as to maximize the resource utilization of the serverless system and reduce operating costs while ensuring the quality of service (QoS) of NLP inference.

Summary of the Invention

[0007] The purpose of this invention is to address the deficiencies or shortcomings of the existing technology by providing a token-aware batch processing and online scheduling method for Serverless NLP inference. This method can effectively address the challenges of NLP request token length variability, heterogeneous resource configuration imbalance, and worker node resource fragmentation. Under the premise of strictly ensuring the service SLO latency constraint, it significantly reduces inference costs and improves the overall resource utilization of the system.

[0008] The technical solution to achieve the purpose of this invention is: a token-aware batch processing and online scheduling method for Serverless NLP inference, the method comprising the following steps:

[0009] Step 1: Receive user inference requests and extract their token length features to obtain the token length distribution of the current request sequence;

[0010] Step 2: Based on the token length distribution, perform batch division and dynamically merge requests into at least one token bucket;

[0011] Step 3: For each token bucket, configure computing resources based on preset offline performance data, while meeting service level objective (SLO) constraints;

[0012] Step 4: Based on the resource requirements of each token bucket, the task is assigned to the target node instance using an online scheduling strategy guided by resource matching degree.

[0013] Furthermore, after step 1 and before step 2, the method further includes performing:

[0014] Step 1.1, Constructing Pre-defined Offline Performance Data: Through offline performance profiling, a model inference latency performance table, i.e., an offline performance table, is pre-built under different token lengths, batch sizes, and CPU and GPU resource configuration combinations.

[0015] Furthermore, after step 1.1, the method further includes performing:

[0016] Step 1.2: Establish a system model to formally define the scheduling and configuration problem.

[0017] Furthermore, step 1.2 specifically includes:

[0018] Establish a latency model to characterize end-to-end latency constraints, including batch processing wait, network communication, and inference execution.

[0019] Establish a token-aware batch model to explicitly characterize the variability of token length;

[0020] Establish a cost model to model operating costs based on resource continuous occupancy time and fixed call overhead;

[0021] Establish a resource capacity model to constrain the total resources consumed by all token buckets to not exceed the available resource capacity of all worker nodes;

[0022] Joint optimization is carried out with the goal of minimizing total operating costs.

[0023] Furthermore, step 2 involves batch segmentation, specifically including:

[0024] Step 2-1: Sort the set of requests to be processed in ascending order by token length;

[0025] Step 2-2: Iterate through the sorted requests one by one and quantify the incremental padding waste caused by adding a new request to the current token bucket;

[0026] Step 2-3: When the preset splitting threshold condition is met, close the current token bucket and make the new request the first element of the next token bucket.

[0027] Furthermore, the splitting threshold condition described in step 2-3 includes at least one of the following conditions:

[0028] Condition 1: The number of requests in the token bucket exceeds the preset maximum batch capacity limit;

[0029] Condition 2: The expected execution delay of the current token bucket under the maximum resource configuration violates the deadline constraint of any of its internal requests;

[0030] Condition 3: The padding waste increment of the current token bucket exceeds the preset proportional threshold.

[0031] The padding waste is quantified by calculating the sum of the differences between the longest request length in the token bucket and the actual length of each request, expressed as:

[0032] $$W(B_h) = \sum_{r_i \in B_h} \left(L_h - l_i\right)$$

[0033] In the formula, $W(B_h)$ denotes the padding waste of token bucket $B_h$, $L_h$ is the length of the longest request in token bucket $B_h$, and $l_i$ is the actual length of each request $r_i$.
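
As an illustration of the padding waste quantification described above, the following minimal Python sketch computes the waste of a single bucket from its token lengths. The function name `padding_waste` is hypothetical and not taken from the application.

```python
from typing import Sequence


def padding_waste(token_lengths: Sequence[int]) -> int:
    """Sum of (longest length - actual length) over all requests in one bucket."""
    if not token_lengths:
        return 0
    longest = max(token_lengths)
    # Every request is padded up to the longest request in the bucket.
    return sum(longest - n for n in token_lengths)


# One long request forces heavy padding on the three short ones.
mixed_bucket = [12, 15, 16, 40]
uniform_bucket = [16, 16, 16]
print(padding_waste(mixed_bucket), padding_waste(uniform_bucket))
```

Grouping requests of similar length keeps this sum small, which is the motivation for the bucket-splitting conditions listed above.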

[0034] Furthermore, step 3 involves configuring computing resources, specifically including:

[0035] Retrieve the offline performance data or offline performance table, and select the minimum necessary combination of CPU and GPU resources that meets the deadline constraint, given the token length and batch size.

[0036] Furthermore, the online scheduling strategy described in step 4 specifically includes:

[0037] Arrange the token buckets that have completed resource allocation in descending order of resource demand;

[0038] Calculate the matching degree between each token bucket and the idle resource vector of each candidate worker node, i.e., the resource coverage;

[0039] Based on the resource coverage, the token bucket is preferentially scheduled to the node instance with the highest matching degree for execution.

[0040] Furthermore, the instance matching logic is as follows:

[0041] Preferentially execute on a warm (pre-warmed) function instance located on the node that yields the highest resource coverage value;

[0042] If no warm function instance exists, create a new function instance with resource configuration $(c_h, g_h)$ on the node with the highest resource coverage value, where $(c_h, g_h)$ is the resource allocation of the token bucket $B_h$ awaiting scheduling.

[0043] Furthermore, the resource coverage is measured using cosine similarity to quantify how closely the node's remaining resources match the token bucket's demand. The calculation formula is as follows:

[0044] $$\mathrm{Cov}(B_h, v_n) = \frac{c_h\, C_n^{\mathrm{cpu}} + g_h\, C_n^{\mathrm{gpu}}}{\sqrt{c_h^2 + g_h^2}\,\sqrt{\left(C_n^{\mathrm{cpu}}\right)^2 + \left(C_n^{\mathrm{gpu}}\right)^2}}$$

[0045] In the formula, $\mathrm{Cov}(B_h, v_n)$ denotes the degree of matching between token bucket $B_h$ and the idle resources of candidate worker node $v_n$; $C_n^{\mathrm{cpu}}$ and $C_n^{\mathrm{gpu}}$ denote the idle CPU resources and idle GPU resources on the $n$-th candidate worker node; $c_h$ and $g_h$ denote the CPU and GPU resources required for inference by the $h$-th token bucket.

[0046] Compared with the prior art, the significant advantages of this invention are:

[0047] (1) This invention effectively addresses the challenges of computational redundancy and padding waste in serverless environments by recognizing the variability of NLP inference tasks. It establishes a token-aware dynamic bucketing mechanism and, with rigorous theoretical proof of the padding waste approximation ratio, systematically optimizes the length alignment problem in batch processing. Compared to traditional coarse-grained partitioning schemes, it theoretically guarantees the minimization of padding overhead, providing an efficient zero-redundancy solution for processing long-tailed NLP requests.

[0048] (2) It achieves a precise trade-off between SLO performance guarantees and economic costs, eliminating the blindness of Serverless resource allocation. This invention constructs a nonlinear inference performance model and cost function to perform adaptive configuration optimization in a multi-dimensional heterogeneous resource space. This not only satisfies SLO latency constraints in privacy- and performance-sensitive Serverless scenarios, but also alleviates the cost pressure caused by over-provisioning from the perspective of refined resource management, further enhancing the system's economy under fluctuating loads.

[0049] (3) It significantly improves the resource throughput efficiency of heterogeneous clusters and reduces fragmentation losses during task scheduling. This invention adopts an online scheduling strategy based on resource coverage, balancing batch resource requirements against the remaining capacity of nodes, and achieves optimal mapping onto heterogeneous computing instances. By making maximal use of fragmented node resources, it maximizes the system's concurrency gain, while also addressing system heterogeneity and execution fluctuations in serverless computing.

[0050] The present invention will now be described in further detail with reference to the accompanying drawings.

Attached Figure Description

[0051] Figure 1 is a flowchart illustrating the token-aware batch processing and online scheduling method for serverless NLP inference according to the present invention.

[0052] Figure 2 is a schematic diagram of a token-aware batch processing and adaptive scheduling architecture in one embodiment.

[0053] Figure 3 is a schematic diagram comparing the costs of four algorithms under different SLO values on the RACE dataset in one embodiment.

[0054] Figure 4 is a schematic diagram comparing the costs of four algorithms under different SLO values on the RTE dataset in one embodiment.

[0055] Figure 5 is a schematic diagram comparing the inference time of four algorithms under different SLO values on the RACE dataset in one embodiment.

[0056] Figure 6 is a schematic diagram comparing the inference time of four algorithms under different SLO values on the RTE dataset in one embodiment.

Detailed Implementation

[0057] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0058] It should be noted that if the embodiments of the present invention involve descriptions such as "first" and "second," these descriptions are for descriptive purposes only and should not be construed as indicating or implying their relative importance or implicitly specifying the number of technical features indicated. Therefore, a feature defined with "first" and "second" may explicitly or implicitly include at least one of those features. Furthermore, the technical solutions of the various embodiments can be combined with each other, but this must be based on the ability of those skilled in the art to implement them. When the combination of technical solutions is contradictory or impossible to implement, it should be considered that such a combination of technical solutions does not exist and is not within the scope of protection claimed by the present invention.

[0059] In one embodiment, a token-aware batch processing and online scheduling method for serverless NLP inference is provided, the method comprising the following steps:

[0060] Step 1: Receive user inference requests and extract their token length features to obtain the token length distribution of the current request sequence;

[0061] Step 2: Based on the token length distribution, perform batch division and dynamically merge requests into at least one token bucket;

[0062] Step 3: For each token bucket, configure computing resources based on preset offline performance data, while meeting service level objective (SLO) constraints;

[0063] Step 4: Based on the resource requirements of each token bucket, the task is assigned to the target node instance using an online scheduling strategy guided by resource matching degree.
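
Step 1 above amounts to tokenizing each request and recording its sequence length. A minimal Python sketch follows. The whitespace `str.split` tokenizer is only a stand-in assumption (a deployment would presumably use the tokenizer of the served model), and the helper names `token_lengths` and `length_histogram` as well as the bin width are hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable, List


def token_lengths(texts: Iterable[str],
                  tokenize: Callable[[str], list] = str.split) -> List[int]:
    """Return the token count of each request text."""
    return [len(tokenize(text)) for text in texts]


def length_histogram(lengths: Iterable[int], bin_width: int = 16) -> Counter:
    """Coarse token length distribution: lower bin edge -> number of requests."""
    return Counter((n // bin_width) * bin_width for n in lengths)


requests = ["what is serverless computing", "translate this sentence please now", "hi"]
lengths = token_lengths(requests)
print(lengths, length_histogram(lengths, bin_width=4))
```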

[0064] Furthermore, in one embodiment, after step 1 and before step 2, the method further includes performing:

[0065] Step 1.1, Constructing Pre-defined Offline Performance Data: Through offline performance profiling, a model inference latency performance table, i.e., an offline performance table, is pre-built under different token lengths, batch sizes, and CPU and GPU resource configuration combinations.
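
One way to picture the offline performance table is as a lookup keyed by token length, batch size, and CPU/GPU configuration. The sketch below is an assumption about its shape, not the application's actual data structure. `build_profile_table` is a hypothetical helper, and `fake_measure` is a placeholder formula standing in for a real benchmark run; its numbers are not measured data.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Tuple

# (token_length, batch_size, cpu_cores, gpu_share)
ProfileKey = Tuple[int, int, int, float]


def build_profile_table(token_lengths: Iterable[int],
                        batch_sizes: Iterable[int],
                        cpu_options: Iterable[int],
                        gpu_options: Iterable[float],
                        measure: Callable[[int, int, int, float], float]
                        ) -> Dict[ProfileKey, float]:
    """Profile every configuration combination once and store its latency (ms)."""
    table: Dict[ProfileKey, float] = {}
    for key in product(token_lengths, batch_sizes, cpu_options, gpu_options):
        table[key] = measure(*key)
    return table


def fake_measure(token_len: int, batch: int, cpu: int, gpu: float) -> float:
    # Placeholder latency model, NOT measured data: more work is slower,
    # more resources are faster.
    return 0.05 * token_len * batch / (cpu + 10.0 * gpu)


table = build_profile_table([64, 128], [1, 4], [1, 2], [0.5], fake_measure)
print(len(table), table[(128, 4, 2, 0.5)])
```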

[0066] Furthermore, in one embodiment, after step 1.1, the method further includes performing:

[0067] Step 1.2: Establish a system model to formally define the scheduling and configuration problem.

[0068] Here, consider a distributed serverless cluster with $N$ heterogeneous worker nodes $\mathcal{V} = \{v_1, \dots, v_N\}$. The $n$-th worker node $v_n$ is characterized by its idle computing resources, i.e., CPU capacity $C_n^{\mathrm{cpu}}$ and GPU capacity $C_n^{\mathrm{gpu}}$. The cluster hosts a total of $M$ function instances $\mathcal{F} = \{f_{n,m}\}$, where $f_{n,m}$ denotes the $m$-th instance on node $v_n$. Assume that a stream of $I$ inference requests arrives at the gateway, represented as $\mathcal{R} = \{r_1, \dots, r_I\}$. Each request $r_i$ (the $i$-th request) is defined by a tuple $(a_i, l_i, s_i)$, where $a_i$ is the time at which the request arrives at the gateway, $l_i$ is the token length of the input sequence of the $i$-th request, and $s_i$ is the maximum tolerable latency specified by the SLO. These requests need to be partitioned into a set of token buckets $\mathcal{B} = \{B_1, \dots, B_H\}$ (the number of token buckets is $H$), where $B_h$ denotes the $h$-th token bucket, and the CPU/GPU resource configuration and node assignment of each bucket must be determined so as to minimize total operating costs while satisfying all SLO constraints.

[0069] Preferably, in some embodiments, step 1.2 specifically includes:

[0070] Establish a latency model to characterize end-to-end latency constraints, including batch processing wait, network communication, and inference execution.

[0071] Establish a token-aware batch model to explicitly characterize the variability of token length;

[0072] Establish a cost model to model operating costs based on resource continuous occupancy time and fixed call overhead;

[0073] Establish a resource capacity model to constrain the total resources consumed by all token buckets to not exceed the available resource capacity of all worker nodes;

[0074] Joint optimization is carried out with the goal of minimizing total operating costs.

[0075] More specifically:

[0076] For token bucket $B_h$, define the decision variable $x_{i,h} \in \{0, 1\}$, which indicates whether request $r_i$ is assigned to token bucket $B_h$.

[0077] Each request must be assigned to one and only one batch, i.e., satisfying:

[0078] $$\sum_{h=1}^{H} x_{i,h} = 1, \quad \forall i$$

[0079] The latency model characterizes the end-to-end time chain and SLO constraints of inference requests. Inference request $r_i$ is submitted to the gateway by the user at start time $a_i$. Its end-to-end latency consists of three parts: batch waiting time $t_i^{\mathrm{wait}}$, network communication delay $t_i^{\mathrm{net}}$, and inference execution time $t_i^{\mathrm{exec}}$. The time at which the $i$-th inference request $r_i$ reaches the designated function instance is then:

[0080] $$t_i^{\mathrm{arr}} = a_i + t_i^{\mathrm{wait}} + t_i^{\mathrm{net}}$$

[0081] The network communication delay $t_i^{\mathrm{net}}$ follows a distribution within the range of [1, 10] milliseconds. To ensure SLO compliance, the completion time of a request must not exceed its absolute deadline, expressed as:

[0082] $$t_i^{\mathrm{arr}} + t_i^{\mathrm{exec}} \le D_i$$

[0083] That is, the following constraint must be met:

[0084] $$a_i + t_i^{\mathrm{wait}} + t_i^{\mathrm{net}} + t_i^{\mathrm{exec}} \le D_i$$

[0085] where $D_i$ denotes the deadline of inference request $r_i$;
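
The time chain of the latency model can be checked with a few lines of Python. This is a sketch under stated assumptions: the `Request` class and function names are hypothetical, and the absolute deadline is taken to be the gateway arrival time plus the SLO latency, which the text implies but does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Request:
    arrival_ms: float  # time the request reaches the gateway
    token_len: int     # token length of the input sequence
    slo_ms: float      # maximum tolerable latency from the SLO


def instance_arrival_ms(req: Request, wait_ms: float, net_ms: float) -> float:
    """Time at which the request reaches its function instance."""
    return req.arrival_ms + wait_ms + net_ms


def meets_deadline(req: Request, wait_ms: float, net_ms: float, exec_ms: float) -> bool:
    """Completion time must not exceed the absolute deadline."""
    deadline_ms = req.arrival_ms + req.slo_ms  # assumption: deadline = arrival + SLO
    return instance_arrival_ms(req, wait_ms, net_ms) + exec_ms <= deadline_ms


req = Request(arrival_ms=0.0, token_len=128, slo_ms=200.0)
print(meets_deadline(req, wait_ms=20.0, net_ms=5.0, exec_ms=150.0))
print(meets_deadline(req, wait_ms=20.0, net_ms=5.0, exec_ms=180.0))
```

Batch waiting time counts against the same budget as execution time, so holding a request longer to form a better bucket directly shrinks the latency left for inference.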

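[0086] For request $r_i$, the inference execution time $t_i^{\mathrm{exec}}$ is defined as the weighted sum of the latencies of the batches it may belong to, expressed as:

[0087] $$t_i^{\mathrm{exec}} = \sum_{h=1}^{H} x_{i,h}\, T_h$$

[0088] where $x_{i,h} \in \{0, 1\}$ indicates whether request $r_i$ is assigned to token bucket $B_h$, and $T_h$ is the execution latency of token bucket $B_h$ after it is allocated CPU resources $c_h$ and GPU resources $g_h$. $T_h$ is a nonlinear function $f$ derived from the offline performance table established above, expressed as:

[0089] $$T_h = f(c_h, g_h, L_h, b_h)$$

[0090] In the formula, the function $f$ describes the mapping between resource allocation, input size (the effective token length $L_h$ and batch size $b_h$ of the bucket), and inference latency.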

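[0091] More specifically:

[0092] The batch model described here differs from traditional size-based coarse-grained batching by explicitly characterizing the variability of token length. With $x_{i,h} \in \{0, 1\}$ indicating whether request $r_i$ (of token length $l_i$) is assigned to token bucket $B_h$, the effective token length $L_h$ and batch size $b_h$ of token bucket $B_h$ are determined by the following formulas:

[0093] $$L_h = \max_{i} \; x_{i,h}\, l_i$$

[0094] $$b_h = \sum_{i=1}^{I} x_{i,h}$$

[0095] The inference of token bucket $B_h$ must satisfy the deadline constraints of all its internal requests. That is, with $t_i^{\mathrm{arr}}$ the time at which request $r_i$ reaches the function instance and $D_i$ its deadline, the batch inference latency $T_h$ must satisfy:

[0096] $$t_i^{\mathrm{arr}} + T_h \le D_i, \quad \forall i \ \text{with} \ x_{i,h} = 1$$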

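[0097] More specifically:

[0098] The cost model captures the operating cost of a single instance invocation based on the duration of resource occupancy and a fixed invocation overhead. The execution cost $\mathrm{Cost}_h$ of token bucket $B_h$ is defined as:

[0099] $$\mathrm{Cost}_h = \left(\iota\, c_h + \zeta\, g_h\right) T_h + \mu$$

[0100] In the formula, $c_h$ and $g_h$ are the CPU and GPU resources allocated to token bucket $B_h$, $T_h$ is its execution latency, ι and ζ are the unit-time prices of CPU resources and GPU resources, respectively, and μ is the fixed cost of a single function instance invocation.

[0101] The total operating cost $\mathrm{Cost}_{\mathrm{total}}$ of all requests is the sum of the costs of each batch:

[0102] $$\mathrm{Cost}_{\mathrm{total}} = \sum_{h=1}^{H} \mathrm{Cost}_h$$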
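
The cost model is simple enough to evaluate directly. The sketch below assumes the per-bucket cost takes the form "resource price rate times execution time plus a fixed invocation fee", as the description of ι, ζ, and μ suggests. The function names and the example prices are hypothetical.

```python
from typing import Iterable, Tuple


def bucket_cost(cpu: float, gpu: float, exec_s: float,
                cpu_price: float, gpu_price: float, call_fee: float) -> float:
    """Resource price rate times occupancy time, plus the fixed invocation fee."""
    return (cpu_price * cpu + gpu_price * gpu) * exec_s + call_fee


def total_cost(buckets: Iterable[Tuple[float, float, float]],
               cpu_price: float, gpu_price: float, call_fee: float) -> float:
    """Sum the cost over (cpu, gpu, exec_s) triples, one per token bucket."""
    return sum(bucket_cost(c, g, t, cpu_price, gpu_price, call_fee)
               for c, g, t in buckets)


# Hypothetical prices: 0.01 per CPU-second, 0.1 per GPU-second, 0.001 per call.
print(bucket_cost(2, 0.5, 0.2, 0.01, 0.1, 0.001))
print(total_cost([(2, 0.5, 0.2), (1, 0.0, 0.5)], 0.01, 0.1, 0.001))
```

Because the fixed fee is paid once per bucket, merging requests into fewer buckets saves invocation overhead, while padding waste inflates the execution time term; the bucketing step trades these two effects against each other.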

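[0103] More specifically:

[0104] The total CPU and GPU resources consumed by the token buckets must not exceed the available resource capacity of the worker nodes, with the specific constraints being:

[0105] $$\sum_{h=1}^{H} \sum_{m} y_{h,n,m}\, c_h \le C_n^{\mathrm{cpu}}, \quad \forall n$$

[0106] $$\sum_{h=1}^{H} \sum_{m} y_{h,n,m}\, g_h \le C_n^{\mathrm{gpu}}, \quad \forall n$$

[0107] where the resource allocations $c_h, g_h \ge 0$ of token bucket $B_h$ belong to the discrete set of available configurations, $C_n^{\mathrm{cpu}}$ and $C_n^{\mathrm{gpu}}$ are the idle CPU and GPU capacities of worker node $v_n$, and $y_{h,n,m} \in \{0, 1\}$ indicates whether token bucket $B_h$ is scheduled to instance $f_{n,m}$ (the $m$-th instance on node $v_n$). Each batch must be scheduled to one and only one instance:

[0108] $$\sum_{n=1}^{N} \sum_{m} y_{h,n,m} = 1, \quad \forall h$$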

[0109] More specifically:

[0110] The input-aware inference service scheduling and configuration problem is formalized as a joint optimization problem of request grouping, resource allocation, and scheduling, with the objective of minimizing the total operating cost, expressed as:

[0111] $$\min_{x,\, c,\, g,\, y} \ \mathrm{Cost}_{\mathrm{total}} = \sum_{h=1}^{H} \left[\left(\iota\, c_h + \zeta\, g_h\right) T_h + \mu\right]$$

[0112] subject to the constraints defined in the above models.

[0113] Furthermore, in one embodiment, step 2 involves batch division, specifically including:

[0114] Step 2-1: Sort the set of requests to be processed in ascending order by token length;

[0115] Step 2-2: Iterate through the sorted requests one by one and quantify the incremental padding waste caused by adding a new request to the current token bucket;

[0116] Step 2-3: When the preset splitting threshold condition is met, close the current token bucket and make the new request the first element of the next token bucket.

[0117] Preferably, in some embodiments, the splitting threshold condition in step 2-3 includes at least one of the following conditions:

[0118] Condition 1: The number of requests in the token bucket exceeds the preset maximum batch capacity limit;

[0119] Condition 2: The expected execution delay of the current token bucket under the maximum resource configuration violates the deadline constraint of any of its internal requests;

[0120] Condition 3: The padding waste increment of the current token bucket exceeds the preset proportional threshold.

[0121] More preferably, in some embodiments, the padding waste is quantified as follows: the sum of the differences between the longest request length in the token bucket and the actual length of each request is calculated, expressed as:

[0122] $$W(B_h) = \sum_{r_i \in B_h} \left(L_h - l_i\right)$$

[0123] In the formula, $W(B_h)$ denotes the padding waste of token bucket $B_h$, $L_h$ is the length of the longest request in token bucket $B_h$, and $l_i$ is the actual length of each request $r_i$.

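[0124] More preferably, in some embodiments, assume the current bucket is $B_h$. The requests are traversed one by one, and the padding waste $\Delta W_h$ that adding the new request $r_{i+1}$ incurs in the bucket is calculated. Because the requests are processed in ascending order of length, $r_{i+1}$ becomes the longest request in the bucket, so every request already in the bucket must be padded up to its length:

[0125] $$\Delta W_h = \sum_{r_j \in B_h} \left(l_{i+1} - l_j\right)$$

[0126] where $l_{i+1}$ denotes the token length of the input sequence of the $(i+1)$-th request $r_{i+1}$, and $l_j$ denotes the token length of the input sequence of the $j$-th request $r_j$.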

[0127] More preferably, in some embodiments, condition 2 is that the expected execution latency of the current token bucket $B_h$ under the maximum resource configuration violates the deadline constraint of any of its internal requests, i.e.:

[0128] $$\exists\, r_j \in B_h:\ t_j^{\mathrm{arr}} + T_h^{+} > D_j \quad \text{or} \quad t_{i+1}^{\mathrm{arr}} + T_h^{+} > D_{i+1}$$

[0129] where $t_j^{\mathrm{arr}}$ denotes the time at which request $r_j$ reaches the function instance, $t_{i+1}^{\mathrm{arr}}$ denotes the time at which request $r_{i+1}$ reaches the function instance, $T_h^{+}$ denotes the execution latency after request $r_{i+1}$ is added to token bucket $B_h$, and $D_j$ and $D_{i+1}$ denote the deadlines of requests $r_j$ and $r_{i+1}$, respectively.

[0130] More preferably, in some embodiments, condition three is expressed as:

[0131]

[0132] In the formula, the threshold parameter is the preset limit on the proportion of padding waste in the token bucket.
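
The sort-and-split procedure of steps 2-1 to 2-3 can be sketched as a single greedy pass. This is an illustrative reading, not the application's exact algorithm: the function name `split_into_buckets` is hypothetical, condition 2 is delegated to an optional caller-supplied predicate, and condition 3 is modeled as "padding divided by total padded tokens exceeds the threshold", a denominator chosen here as an assumption because the original formula is not recoverable.

```python
from typing import Callable, List, Optional, Sequence


def split_into_buckets(lengths: Sequence[int], max_batch: int, waste_ratio: float,
                       violates_deadline: Optional[Callable[[List[int]], bool]] = None
                       ) -> List[List[int]]:
    """Greedy token-aware bucketing over requests sorted by token length."""
    buckets: List[List[int]] = []
    current: List[int] = []
    for n in sorted(lengths):  # step 2-1: ascending order
        if current:
            # Step 2-2: padding needed if n joins and becomes the longest request.
            waste = sum(n - m for m in current)
            padded_total = (len(current) + 1) * n
            too_big = len(current) + 1 > max_batch                          # condition 1
            too_slow = (violates_deadline is not None
                        and violates_deadline(current + [n]))               # condition 2
            too_wasteful = padded_total > 0 and waste / padded_total > waste_ratio  # condition 3
            if too_big or too_slow or too_wasteful:
                buckets.append(current)  # step 2-3: close the bucket
                current = []
        current.append(n)
    if current:
        buckets.append(current)
    return buckets


print(split_into_buckets([10, 12, 11, 50, 52, 200], max_batch=4, waste_ratio=0.2))
```

Sorting first means each new request is at least as long as everything already in the bucket, so the padding check only has to compare against the newcomer's length.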

[0133] Furthermore, in one embodiment, configuring computing resources in step 3 specifically includes:

[0134] Retrieve the offline performance data or offline performance table, and select the minimum necessary combination of CPU and GPU resources that meets the deadline constraint, given the token length and batch size.
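
The table lookup in step 3 can be illustrated as follows. This sketch assumes the table is a dictionary keyed by (token length, batch size, CPU, GPU) with latency values in milliseconds, and it interprets "minimum necessary" as the feasible configuration with the lowest resource cost (price rate times latency); both the table layout and this ranking rule are assumptions, and `cheapest_config` is a hypothetical name.

```python
from typing import Dict, Optional, Tuple

ProfileKey = Tuple[int, int, int, float]  # (token_length, batch_size, cpu, gpu)


def cheapest_config(table: Dict[ProfileKey, float], token_len: int, batch: int,
                    budget_ms: float, cpu_price: float, gpu_price: float
                    ) -> Optional[Tuple[int, float]]:
    """Return the (cpu, gpu) pair that meets the latency budget at lowest cost."""
    best = None
    for (t, b, cpu, gpu), latency_ms in table.items():
        if t != token_len or b != batch or latency_ms > budget_ms:
            continue  # wrong workload shape, or too slow for the deadline
        cost = (cpu_price * cpu + gpu_price * gpu) * latency_ms
        if best is None or cost < best[0]:
            best = (cost, cpu, gpu)
    return None if best is None else (best[1], best[2])


# Illustrative latencies only, not measured data.
table = {
    (128, 4, 1, 0.0): 300.0,
    (128, 4, 2, 0.0): 180.0,
    (128, 4, 2, 0.5): 60.0,
    (128, 4, 4, 1.0): 40.0,
}
print(cheapest_config(table, 128, 4, budget_ms=200.0, cpu_price=1.0, gpu_price=10.0))
print(cheapest_config(table, 128, 4, budget_ms=100.0, cpu_price=1.0, gpu_price=10.0))
```

A tighter remaining time budget pushes the choice toward larger, more expensive configurations, which is why the budget should reflect the time left after batch waiting and network delay.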

[0135] Furthermore, in one embodiment, the online scheduling strategy described in step 4 specifically includes:

[0136] Arrange the token buckets that have completed resource allocation in descending order of resource demand;

[0137] Calculate the matching degree between each token bucket and the idle resource vector of each candidate worker node, i.e., the resource coverage;

[0138] Based on the resource coverage, the token bucket is preferentially scheduled to the node instance with the highest matching degree for execution.

[0139] Preferably, in some embodiments, the instance matching logic is as follows:

[0140] Preferentially execute on a warm (pre-warmed) function instance located on the node that yields the highest resource coverage value;

[0141] If no warm function instance exists, create a new function instance with resource configuration $(c_h, g_h)$ on the node with the highest resource coverage value, where $(c_h, g_h)$ is the resource allocation of the token bucket $B_h$ awaiting scheduling.

[0142] Preferably, in some embodiments, the resource coverage is measured using cosine similarity to quantify how closely the node's remaining resources match the token bucket's demand. The calculation formula is as follows:

[0143] $$\mathrm{Cov}(B_h, v_n) = \frac{c_h\, C_n^{\mathrm{cpu}} + g_h\, C_n^{\mathrm{gpu}}}{\sqrt{c_h^2 + g_h^2}\,\sqrt{\left(C_n^{\mathrm{cpu}}\right)^2 + \left(C_n^{\mathrm{gpu}}\right)^2}}$$

[0144] In the formula, $\mathrm{Cov}(B_h, v_n)$ denotes the degree of matching between token bucket $B_h$ and the idle resources of candidate worker node $v_n$; $C_n^{\mathrm{cpu}}$ and $C_n^{\mathrm{gpu}}$ denote the idle CPU resources and idle GPU resources on the $n$-th candidate worker node; $c_h$ and $g_h$ denote the CPU and GPU resources required for inference by the $h$-th token bucket.
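
The coverage-guided placement can be sketched in Python as below. Several points are assumptions rather than statements from the application: nodes that cannot fit the demand are filtered out before scoring, the warm-instance rule is read as "choose among nodes holding a warm instance first, and fall back to all feasible nodes otherwise", and the names `coverage` and `place_bucket` are hypothetical.

```python
import math
from typing import Dict, Optional, Tuple

Vec = Tuple[float, float]  # (cpu, gpu)


def coverage(demand: Vec, idle: Vec) -> float:
    """Cosine similarity between a bucket's demand and a node's idle resources."""
    dot = demand[0] * idle[0] + demand[1] * idle[1]
    norm = math.hypot(*demand) * math.hypot(*idle)
    return dot / norm if norm > 0 else 0.0


def place_bucket(demand: Vec, nodes: Dict[str, Tuple[float, float, bool]]
                 ) -> Optional[Tuple[str, str]]:
    """nodes maps a node name to (idle_cpu, idle_gpu, has_warm_instance)."""
    feasible = [
        (coverage(demand, (cpu, gpu)), name, warm)
        for name, (cpu, gpu, warm) in nodes.items()
        if cpu >= demand[0] and gpu >= demand[1]  # assumption: demand must fit
    ]
    if not feasible:
        return None
    # Prefer nodes that already hold a warm instance; otherwise use all candidates.
    candidates = [c for c in feasible if c[2]] or feasible
    _, name, warm = max(candidates, key=lambda item: item[0])
    return name, ("reuse-warm" if warm else "create-new")


nodes = {"A": (8.0, 1.0, True), "B": (4.0, 2.0, False), "C": (1.0, 4.0, True)}
print(place_bucket((2.0, 1.0), nodes))
```

Cosine similarity rewards nodes whose leftover CPU-to-GPU ratio resembles the bucket's demand, so placing the bucket there tends to leave a remainder that is still usable rather than a lopsided fragment.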

[0145] As a specific example, in some embodiments, with reference to Figure 1, a token-aware batch processing and online scheduling method for serverless NLP inference is provided, which includes the following steps:

[0146] Step 1: Receive user inference requests, convert their input text into token sequences using a tokenizer, and record the sequence lengths to obtain the token length distribution of the current batch of requests;

[0147] Step 2: Through offline performance profiling, a model inference latency performance table is pre-established under different token lengths, batch sizes, and CPU / GPU resource configurations;

[0148] Step 3: Define the input-aware inference service scheduling and configuration problem, the goal of which is to minimize the total resource cost while satisfying the service level objective (SLO) latency of all inference requests and the resource capacity constraints of each worker node; establish a system model including a latency model, a batch model, a cost model and a resource capacity model, and formalize the optimization problem.

[0149] Here, the input-aware inference service scheduling and configuration problem, system model and formal optimization problem are detailed as follows:

[0150] Step 3-1: Consider a distributed serverless cluster with $N$ heterogeneous worker nodes $\mathcal{V} = \{v_1, \dots, v_N\}$. The $n$-th worker node $v_n$ is characterized by its idle computing resources, i.e., CPU capacity $C_n^{\mathrm{cpu}}$ and GPU capacity $C_n^{\mathrm{gpu}}$. The cluster hosts a total of $M$ function instances $\mathcal{F} = \{f_{n,m}\}$, where $f_{n,m}$ denotes the $m$-th instance on node $v_n$. Assume that a stream of $I$ inference requests arrives at the gateway, represented as $\mathcal{R} = \{r_1, \dots, r_I\}$. Each request $r_i$ is defined by a tuple $(a_i, l_i, s_i)$, where $a_i$ is the time at which the request arrives at the gateway, $l_i$ is the token length of the input sequence, and $s_i$ is the maximum tolerable latency specified by the SLO. These requests need to be partitioned into a set of token buckets $\mathcal{B} = \{B_1, \dots, B_H\}$ (the number of token buckets is $H$), and the CPU / GPU resource configuration and node assignment of each bucket must be determined so as to minimize total operating costs while satisfying all SLO constraints.

[0151] Step 3-2: Establish a latency model to characterize the end-to-end time chain and SLO constraints of inference requests. Inference request $r_i$ is submitted to the gateway by the user at start time $a_i$. Its end-to-end latency consists of three parts: batch waiting time $t_i^{\mathrm{wait}}$, network communication delay $t_i^{\mathrm{net}}$, and inference execution time $t_i^{\mathrm{exec}}$. The time at which request $r_i$ reaches the designated function instance is:

[0152] $$t_i^{\mathrm{arr}} = a_i + t_i^{\mathrm{wait}} + t_i^{\mathrm{net}}$$

[0153] The network communication delay $t_i^{\mathrm{net}}$ follows a distribution within the range of [1, 10] milliseconds. To ensure SLO compliance, the completion time of a request must not exceed its absolute deadline $D_i$, expressed as:

[0154] $$t_i^{\mathrm{arr}} + t_i^{\mathrm{exec}} \le D_i$$

[0155] That is, the following constraint must be met:

[0156] $$a_i + t_i^{\mathrm{wait}} + t_i^{\mathrm{net}} + t_i^{\mathrm{exec}} \le D_i$$

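[0157] Step 3-3: Establish a token-aware batch model, which, unlike traditional size-based coarse-grained batching, explicitly characterizes the variability of token length. For token bucket $B_h$, define the decision variable $x_{i,h} \in \{0, 1\}$, which indicates whether request $r_i$ (of token length $l_i$) is assigned to bucket $B_h$. The effective token length $L_h$ and batch size $b_h$ of bucket $B_h$ are determined by the following formulas:

[0158] $$L_h = \max_{i} \; x_{i,h}\, l_i$$

[0159] $$b_h = \sum_{i=1}^{I} x_{i,h}$$

[0160] The execution latency $T_h$ of bucket $B_h$ when allocated CPU resources $c_h$ and GPU resources $g_h$ is a nonlinear function $f$ derived from the offline performance table established in step 2, expressed as:

[0161] $$T_h = f(c_h, g_h, L_h, b_h)$$

[0162] The function $f$ describes the mapping between resource allocation, input size, and inference latency. Each request must be assigned to exactly one batch:

[0163] $$\sum_{h=1}^{H} x_{i,h} = 1, \quad \forall i$$

[0164] The inference execution time $t_i^{\mathrm{exec}}$ of request $r_i$ is defined as the weighted sum of the latencies of the batches to which it may belong:

[0165] $$t_i^{\mathrm{exec}} = \sum_{h=1}^{H} x_{i,h}\, T_h$$

[0166] The inference of bucket $B_h$ must satisfy the deadline constraints of all its internal requests. That is, with $t_i^{\mathrm{arr}}$ the time at which request $r_i$ reaches the function instance and $D_i$ its deadline, the batch inference latency must satisfy:

[0167] $$t_i^{\mathrm{arr}} + T_h \le D_i, \quad \forall i \ \text{with} \ x_{i,h} = 1$$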

[0168] Step 3-4: Establish a cost model, which models the operating cost of a single instance invocation based on resource occupancy time and a fixed invocation overhead. The execution cost $C_h$ of token bucket $B_h$, which is allocated CPU resources $c_h$ and GPU resources $g_h$ and has execution delay $T_h$, is defined as:

[0169] $$C_h = (\iota \, c_h + \zeta \, g_h) \, T_h + \mu$$

[0170] Where ι and ζ are the unit time prices of CPU and GPU resources, respectively, and μ is the fixed cost of a single function instance call. The total operating cost $C$ of all requests is the sum of the costs of each batch:

[0171] $$C = \sum_{h} C_h$$
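As a worked illustration of the cost model, the sketch below assumes that the cost of one bucket is the priced resource allocation multiplied by the execution time, plus the fixed call fee, as the paragraph above describes. The default prices are the values quoted in the embodiment (paragraph [0239]). The function names are hypothetical.

```python
from typing import Iterable, Tuple

CPU_PRICE = 1.3e-5   # iota: USD per vCPU-second, as quoted in the embodiment
GPU_PRICE = 8.1e-5   # zeta: USD per vGPU-second, as quoted in the embodiment
CALL_FEE = 2e-7      # mu: fixed USD cost of one function instance call


def bucket_cost(cpu: float, gpu: float, exec_seconds: float) -> float:
    """Cost of one token bucket: priced resources times occupancy time, plus the call fee."""
    return (CPU_PRICE * cpu + GPU_PRICE * gpu) * exec_seconds + CALL_FEE


def total_cost(buckets: Iterable[Tuple[float, float, float]]) -> float:
    """Total operating cost: the sum over all buckets of (cpu, gpu, exec_seconds)."""
    return sum(bucket_cost(cpu, gpu, t) for cpu, gpu, t in buckets)
```

For instance, a bucket using 2 vCPU and 1 vGPU for 0.5 s costs (2.6e-5 + 8.1e-5) × 0.5 + 2e-7 = 5.37e-5 USD.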

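[0172] Step 3-5: Establish a resource capacity model. The total CPU and GPU resources consumed by all token buckets must not exceed the available resource capacity on all worker nodes. With $c_h$ and $g_h$ the CPU and GPU resources allocated to bucket $B_h$, and $R_n^{cpu}$ and $R_n^{gpu}$ the available CPU and GPU capacity of worker node $n$, the constraints are:

[0173] $$\sum_{h} c_h \le \sum_{n} R_n^{cpu}$$

[0174] $$\sum_{h} g_h \le \sum_{n} R_n^{gpu}$$

[0175] The resource allocations satisfy $c_h \ge 0$ and $g_h \ge 0$ and belong to the discrete set of available configurations. A binary variable $y_{h,k} \in \{0, 1\}$ indicates whether bucket $B_h$ is scheduled to instance $k$. Each batch must be scheduled to exactly one instance:

[0176] $$\sum_{k} y_{h,k} = 1, \quad \forall h$$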

[0177] Step 3-6: Formalize the optimization problem. The input-aware inference service scheduling and configuration problem is formalized as a joint optimization of request grouping, resource allocation, and scheduling. The objective is to minimize the total operating cost, expressed as:

[0178] $$\min \sum_{h} C_h$$

[0179] where $C_h$ is the execution cost of token bucket $B_h$, subject to the constraints defined in steps 3-2 to 3-5.

[0180] Step 4, Token-aware batch partitioning and resource allocation: Sort the inference requests arriving within the preset time window according to their input token length, and use a greedy strategy to dynamically merge requests with similar token lengths into the same token bucket. At the same time, query the pre-established offline performance table for each bucket and select the minimum necessary CPU and GPU resource configuration under the overall service level target latency constraint.

[0181] Here, the token-aware batch partitioning and resource allocation described in step 4 aims to solve the "padding waste" problem caused by the variability of NLP request lengths in a serverless environment. Through dynamic bucketing and optimal resource matching, it minimizes cost under SLO constraints. Specifically:

[0182] Step 4-1: After obtaining the token distribution of the current request sequence, token-aware dynamic batch partitioning is performed first. This process is not blind; it is driven by quantifying padding waste. For any token bucket $B_h$ being formed, the system calculates its internal padding overhead $W_h$ in real time.

[0183] (1) Quantitative modeling of padding waste. The padding waste $W_h$ of token bucket $B_h$ is defined as the sum of the differences between the characteristic token length of the bucket (i.e., the longest request length within the bucket) and the actual length of each request:

[0184] $$W_h = \sum_{r_i \in B_h} \left( L_h^{\max} - l_i \right)$$

[0185] where $L_h^{\max}$ is the length of the longest request in token bucket $B_h$, and $l_i$ is the actual length of each request $r_i$.

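[0186] (2) Sort the set of pending requests $\mathcal{Q}$ in ascending order of token length. Let the current token bucket be $B_h$. Iterate through the requests one by one and, for each new request $r_j$ to be added, calculate the resulting increase $\Delta W_h$ in the total padding waste of the bucket. Because the requests are sorted, $r_j$ becomes the longest request, so every request already in the bucket needs additional padding up to the new length:

[0187] $$\Delta W_h = |B_h| \cdot \left( l_j - L_h^{\max} \right)$$

[0188] where $l_j$ is the token length of the input sequence of request $r_j$, $|B_h|$ is the number of requests already in the bucket, and $L_h^{\max}$ is the longest request length in the bucket before $r_j$ is added.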

[0189] (3) Set the maximum batch capacity limit to $b^{\max}$ and introduce a padding growth factor $\theta$ as the splitting threshold. If any of the following conditions is met, the current bucket $B_h$ is closed and the new request $r_j$ becomes the first element of a new bucket $B_{h+1}$.

[0190] Condition 1: the number of requests within the bucket exceeds the limit $b^{\max}$ defined by the hardware or memory;

[0191] Condition 2: the expected execution delay of the current bucket $B_h$ under the maximum resource configuration would violate the deadline constraint of any of its internal requests, i.e.

[0192] $$T_h' > \min\left( \min_{r_i \in B_h} \left( d_i - a_i \right), \; d_j - a_j \right)$$

[0193] where $a_i$ and $a_j$ denote the times at which requests $r_i$ and $r_j$ reach the function instance, $d_i$ and $d_j$ denote the deadlines of requests $r_i$ and $r_j$, and $T_h'$ denotes the execution delay after request $r_j$ is added to token bucket $B_h$.

[0194] Condition 3: the padding waste of the current bucket grows beyond the preset percentage threshold $\theta$, that is:

[0195]
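The greedy partitioning of step 4-1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation. In particular, the exact form of condition 3 is not recoverable from the text, so the sketch assumes the threshold applies to the share of padded tokens in the bucket (padding waste divided by batch size times longest length). The deadline check of condition 2 is passed in as an optional callable, and all names are hypothetical.

```python
from typing import Callable, List, Optional


def padding_waste(lengths: List[int]) -> int:
    """Sum of the gaps between the longest request and each request in the bucket."""
    longest = max(lengths)
    return sum(longest - length for length in lengths)


def greedy_token_buckets(
    lengths: List[int],
    max_batch: int,
    waste_ratio: float,
    fits_deadline: Optional[Callable[[List[int]], bool]] = None,
) -> List[List[int]]:
    """Split requests (given by token length) into buckets of similar length."""
    buckets: List[List[int]] = []
    current: List[int] = []
    for length in sorted(lengths):                 # ascending token length
        candidate = current + [length]
        padded = length * len(candidate)           # tokens computed after padding
        too_big = len(candidate) > max_batch                        # condition 1
        too_slow = fits_deadline is not None and not fits_deadline(candidate)  # condition 2
        # Condition 3, assumed form: share of padded tokens exceeds the threshold.
        too_wasteful = padded > 0 and padding_waste(candidate) / padded > waste_ratio
        if current and (too_big or too_slow or too_wasteful):
            buckets.append(current)                # close the current bucket
            current = [length]                     # new request opens the next bucket
        else:
            current = candidate
    if current:
        buckets.append(current)
    return buckets
```

With lengths 48, 50, 52, 300, 310, and 500, a batch limit of 4, and a 10% threshold, the short requests share one bucket, the two mid-length requests share another, and the 500-token request is served alone, since merging it with either group would pad the shorter requests heavily.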

[0196] Step 4-2: For each generated token bucket $B_h$, determine the optimal CPU/GPU resource combination $(c_h, g_h)$. Using the offline performance table established in step 2, and given the bucket's effective token length $L_h$ and batch size $b_h$, retrieve the minimum resource configuration that satisfies the last formula in step 3-3 (the batch deadline constraint), thereby ensuring that the computational resource consumption of a single inference is the most economical while still satisfying the SLO.
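The table lookup of step 4-2 can be sketched as follows. The sketch assumes that the offline performance table is a dictionary keyed by (token length, batch size, vCPU, vGPU) with the measured latency in seconds as the value, and that "most economical" means the lowest cost under the cost model of step 3-4. The table layout, the function name, and the sample values in the usage note are hypothetical.

```python
from typing import Dict, Optional, Tuple

TableKey = Tuple[int, int, float, float]   # (token length, batch size, vCPU, vGPU)


def cheapest_config(
    table: Dict[TableKey, float],
    token_len: int,
    batch: int,
    budget_s: float,
    cpu_price: float = 1.3e-5,
    gpu_price: float = 8.1e-5,
    call_fee: float = 2e-7,
) -> Optional[Tuple[float, float]]:
    """Return the cheapest (vCPU, vGPU) whose profiled latency fits the time budget."""
    best: Optional[Tuple[float, float]] = None
    best_cost = float("inf")
    for (t, b, cpu, gpu), latency in table.items():
        if t != token_len or b != batch or latency > budget_s:
            continue                               # wrong shape or misses the deadline
        cost = (cpu_price * cpu + gpu_price * gpu) * latency + call_fee
        if cost < best_cost:
            best, best_cost = (cpu, gpu), cost
    return best                                    # None if no configuration is feasible
```

A design point worth noting: the smallest allocation is not always the cheapest, because a slower run occupies its resources for longer. In a table where (1 vCPU, 0.25 vGPU) takes 0.9 s and (2 vCPU, 0.5 vGPU) takes 0.4 s, the larger configuration is cheaper even when both meet the deadline.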

[0197] Step 5, Online scheduling based on resource coverage: Arrange the token buckets that have completed resource configuration in Step 4 in descending order of their resource requirements; for each token bucket, calculate its matching degree with the idle resource vectors of each candidate worker node, and prioritize scheduling it to the existing warm-up function instance on the worker node with the highest matching degree for execution; if there is no matching warm-up instance, create a new function instance on the node with the highest matching degree to execute the batch.

[0198] Here, the online scheduling based on resource coverage described in step 5 refers to the online scheduling of the token bucket with the optimal resource combination determined in step 4. This step addresses the problem of resource fragmentation in serverless worker nodes by introducing the concept of "resource coverage" to achieve efficient mapping between token buckets and heterogeneous nodes. Specifically:

[0199] Step 5-1: First, filter the nodes by traversing all active nodes in the current cluster and selecting those whose idle resources satisfy $R_n^{cpu} \ge c_h$ and $R_n^{gpu} \ge g_h$. If no node is available, the cold start logic is triggered: a new instance is requested according to the serverless platform's elastic scaling strategy, and token bucket $B_h$ is added to the ready queue to wait. Here $c_h$ and $g_h$ denote the optimal CPU and GPU resource configuration parameters of token bucket $B_h$, and $R_n^{cpu}$ and $R_n^{gpu}$ denote the idle CPU and GPU resources of node $n$.

[0200] Step 5-2: For the token bucket $B_h$ to be scheduled and its resource allocation $(c_h, g_h)$, calculate its degree of matching, i.e., the resource coverage $\rho_{h,n}$, with the idle resources $(R_n^{cpu}, R_n^{gpu})$ of each candidate worker node $n$. The resource coverage uses cosine similarity to quantify how closely the node's remaining resources match the request's requirements:

[0201] $$\rho_{h,n} = \frac{c_h \, R_n^{cpu} + g_h \, R_n^{gpu}}{\sqrt{c_h^2 + g_h^2} \; \sqrt{\left(R_n^{cpu}\right)^2 + \left(R_n^{gpu}\right)^2}}$$
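The resource coverage of step 5-2 is the standard cosine similarity between the demand vector of a token bucket and the idle resource vector of a node. A minimal Python sketch follows. The function name is hypothetical, and returning 0 for a zero vector is an assumption made here to avoid division by zero.

```python
import math
from typing import Tuple


def resource_coverage(demand: Tuple[float, float], idle: Tuple[float, float]) -> float:
    """Cosine similarity between a (CPU, GPU) demand and a node's idle (CPU, GPU)."""
    dot = demand[0] * idle[0] + demand[1] * idle[1]
    norm = math.hypot(*demand) * math.hypot(*idle)
    return dot / norm if norm > 0 else 0.0   # assumption: zero vectors never match
```

Cosine similarity compares only the proportions of CPU and GPU, not their magnitudes: a node with idle resources (4, 2) matches a demand of (2, 1) perfectly, while a node with (8, 1) scores lower even though it has more CPU. This is why the feasibility filter of step 5-1 must run first, and it is also how the metric limits fragmentation: placing a bucket on a node with the same CPU-to-GPU ratio leaves leftover capacity that is still balanced.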

[0202] Step 5-3: Perform instance matching and scheduling by traversing all worker nodes. Existing warm-up function instances with sufficient resources are used first: token bucket $B_h$ is dispatched to the instance on the node with the highest resource coverage $\rho_{h,n}$ between the bucket and node $n$, and the resource usage status of that node is updated in real time. If no suitable warm-up instance is found, a new function instance with resource configuration $(c_h, g_h)$ is created on the worker node with the highest $\rho_{h,n}$ to execute $B_h$.

[0203] Step 5-4: After all token bucket scheduling is completed, release the corresponding virtual resource quota and trigger step 3-4 to perform cost statistics.
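Steps 5-1 to 5-3 can be combined into the following illustrative Python sketch. Several modelling choices here are assumptions rather than details stated in the text: buckets are ordered by the sum of their CPU and GPU demand, a warm-up instance is reusable only when its configuration equals the demand exactly, and dispatching deducts the demand from the node's idle resources in both the reuse and the create case. The class `Node` and the function names are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Demand = Tuple[float, float]   # (CPU, GPU) required by one token bucket


@dataclass
class Node:
    name: str
    idle_cpu: float
    idle_gpu: float
    warm: List[Demand] = field(default_factory=list)   # configs of idle warm-up instances


def coverage(demand: Demand, node: Node) -> float:
    """Cosine similarity between the demand and the node's idle resources."""
    dot = demand[0] * node.idle_cpu + demand[1] * node.idle_gpu
    norm = math.hypot(*demand) * math.hypot(node.idle_cpu, node.idle_gpu)
    return dot / norm if norm > 0 else 0.0


def schedule(buckets: List[Demand], nodes: List[Node]) -> List[Tuple[Demand, Optional[str], bool]]:
    """Return (demand, chosen node name or None, reused warm-up instance) per bucket."""
    plan = []
    for demand in sorted(buckets, key=sum, reverse=True):   # largest demand first (assumed key)
        fits = [n for n in nodes
                if n.idle_cpu >= demand[0] and n.idle_gpu >= demand[1]]   # step 5-1 filter
        if not fits:
            plan.append((demand, None, False))   # cold start path: wait in the ready queue
            continue
        warm = [n for n in fits if demand in n.warm]
        best = max(warm or fits, key=lambda n: coverage(demand, n))   # steps 5-2 and 5-3
        reused = bool(warm)
        if reused:
            best.warm.remove(demand)             # the warm-up instance is now busy
        best.idle_cpu -= demand[0]               # update the node's resource status
        best.idle_gpu -= demand[1]
        plan.append((demand, best.name, reused))
    return plan
```

Scheduling the largest buckets first follows the descending order described in step 5, and leaves the smaller buckets to fill whatever capacity remains.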

[0204] In one embodiment, a token-aware batch processing and online scheduling system for serverless NLP inference is provided, the system comprising:

[0205] The first module is used to: receive user inference requests and extract their token length features, and obtain the token length distribution of the current request sequence;

[0206] The second module is used to: divide requests into batches based on the token length distribution and dynamically merge requests into at least one token bucket;

[0207] The third module is used to: configure computing resources for each token bucket based on preset offline performance data, while meeting service level target constraints;

[0208] The fourth module is used to: assign tasks to target node instances based on the resource requirements of each token bucket and using an online scheduling strategy guided by resource matching degree.

[0209] Furthermore, in one embodiment, the system further includes:

[0210] The fifth module, executed after the first module and before the second module:

[0211] The fifth module is used to: construct preset offline performance data: through offline performance profiling, a model inference latency performance table, i.e., an offline performance table, is pre-built under different token lengths, batch sizes, and CPU and GPU resource configuration combinations.
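The construction of the offline performance table can be sketched in Python as a grid sweep. This is a hedged illustration only: the callable `measure` stands for whatever profiling harness runs the model under a given configuration and returns a latency in seconds, and taking the median of a few repeats is an assumption made here to damp measurement noise. Neither detail is specified in the text, and all names are hypothetical.

```python
import itertools
import statistics
from typing import Callable, Dict, Sequence, Tuple

TableKey = Tuple[int, int, float, float]   # (token length, batch size, vCPU, vGPU)


def build_offline_table(
    measure: Callable[[int, int, float, float], float],
    token_lens: Sequence[int],
    batch_sizes: Sequence[int],
    cpus: Sequence[float],
    gpus: Sequence[float],
    repeats: int = 3,
) -> Dict[TableKey, float]:
    """Profile every combination of input shape and resource configuration."""
    table: Dict[TableKey, float] = {}
    for key in itertools.product(token_lens, batch_sizes, cpus, gpus):
        samples = [measure(*key) for _ in range(repeats)]
        table[key] = statistics.median(samples)   # median latency in seconds
    return table
```

The table grows with the product of the four grid sizes, so the grids would in practice be kept coarse, with a request's token length rounded up to the nearest profiled value at lookup time.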

[0212] Furthermore, in one embodiment, the system further includes the following executed after the fifth module:

[0213] The sixth module is used to establish a system model to formally define the scheduling and configuration problem.

[0214] Specific limitations regarding the token-aware batch processing and online scheduling system for Serverless NLP inference can be found in the limitations of the token-aware batch processing and online scheduling method for Serverless NLP inference described above, and will not be repeated here. Each module in the aforementioned token-aware batch processing and online scheduling system for Serverless NLP inference can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device in software form, so that the processor can call and execute the corresponding operations of each module.

[0215] In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements:

[0216] Step 1: Receive user inference requests and extract their token length features to obtain the token length distribution of the current request sequence;

[0217] Step 2: Based on the token length distribution, perform batch division and dynamically merge requests into at least one token bucket;

[0218] Step 3: For each token bucket, configure computing resources based on preset offline performance data, while meeting service level target constraints.

[0219] Step 4: Based on the resource requirements of each token bucket, the task is assigned to the target node instance using an online scheduling strategy guided by resource matching degree.

[0220] Furthermore, in one embodiment, when executing the computer program, the processor further performs the following after step 1 and before step 2:

[0221] Step 1.1, Constructing Pre-defined Offline Performance Data: Through offline performance profiling, a model inference latency performance table, i.e., an offline performance table, is pre-built under different token lengths, batch sizes, and CPU and GPU resource configuration combinations.

[0222] Furthermore, in one embodiment, when the processor executes the computer program, it also performs the following after step 1.1:

[0223] Step 1.2: Establish a system model to formally define the scheduling and configuration problem.

[0224] For specific limitations on each step, please refer to the limitations of the token-aware batch processing and online scheduling method for Serverless NLP inference mentioned above, which will not be repeated here.

[0225] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon, the computer program, when executed by a processor, implementing:

[0226] Step 1: Receive user inference requests and extract their token length features to obtain the token length distribution of the current request sequence;

[0227] Step 2: Based on the token length distribution, perform batch division and dynamically merge requests into at least one token bucket;

[0228] Step 3: For each token bucket, configure computing resources based on preset offline performance data, while meeting service level target constraints.

[0229] Step 4: Based on the resource requirements of each token bucket, the task is assigned to the target node instance using an online scheduling strategy guided by resource matching degree.

[0230] Furthermore, in one embodiment, the computer program, when executed by the processor, further performs the following after step 1 and before step 2:

[0231] Step 1.1, Constructing Pre-defined Offline Performance Data: Through offline performance profiling, a model inference latency performance table, i.e., an offline performance table, is pre-built under different token lengths, batch sizes, and CPU and GPU resource configuration combinations.

[0232] Furthermore, in one embodiment, the computer program, when executed by the processor, also performs the following after step 1.1:

[0233] Step 1.2: Establish a system model to formally define the scheduling and configuration problem.

[0234] As a specific example, the invention will be further verified and illustrated in one embodiment.

[0235] This embodiment takes a large-scale NLP model inference task based on a serverless architecture as an example. As shown in Figure 2, a token-aware batch processing and online scheduling system for Serverless NLP inference is configured, consisting of a preprocessor for inference requests, an offline analyzer for inference request performance, a token-aware batch processor, a central scheduler responsible for online batch scheduling, and a Serverless cluster composed of three worker nodes with heterogeneous computing resources (heterogeneous CPU cores and GPU memory shares).

[0236] This embodiment verifies the batch partitioning and computing resource allocation algorithm proposed in this invention and compares its performance with existing benchmark algorithms to demonstrate its effectiveness. The specific conditions are as follows:

[0237] (1) Evaluation was conducted in an OpenFaaS-based experimental and simulation environment. The simulation environment was built using a Python simulator. All evaluations were performed on a computer configured with a 60-core Intel(R) Core(TM) i9-13900K CPU @ 3.0GHz, 16GB RAM, and an NVIDIA GeForce RTX 4090 graphics card with 48GB GDDR6X.

[0238] (2) The BERT model, which is representative in the field of NLP, is used for inference, and experimental results on the RTE and RACE datasets are given. The RTE (Recognizing Textual Entailment) dataset is used to determine whether an entailment relationship exists between a pair of texts. Its text lengths are concentrated around 50 tokens and show a long-tail distribution. The RACE (ReAding Comprehension from Examinations) dataset is a set of reading comprehension questions based on middle school exams and is mainly used to evaluate a machine's ability to understand long articles. Its text lengths are widely distributed over the range of 50 to 512 tokens.

[0239] (3) Configure serverless function parameters: the unit price of vCPU is $1.3e-5 / vCPU·s, and the unit price of vGPU is $8.1e-5 / vGPU·s. The constant unit cost of function instance calls is $2e-7. In addition, the communication latency from each IoT device to each ES is distributed in the range of [1, 10] milliseconds.

[0240] (4) The benchmark algorithms compared with the algorithm of this invention (TaBatch) are as follows. MBS (Ali A, Pinciroli R, Yan F, et al. Optimizing inference serving on serverless platforms[J]. Proceedings of the VLDB Endowment, 2022, 15(10).): this scheme uses Bayesian optimization to coarsely group requests based on request size. The scheduling algorithm proposed in this invention is integrated into MBS to support efficient scheduling of heterogeneous resources. ESG (Hui X, Xu Y, Guo Z, et al. Esg: Pipeline-conscious efficient scheduling of dnn workflows on serverless platforms with shareable gpus[C]//Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing. 2024: 42-55.): this scheme checks the waiting queue of each request in a round-robin manner and uses an A*-search-based elastic batch scheduling algorithm to allocate task instances to the most suitable nodes.

[0241] Base: a traditional serverless platform in which each request is instantiated independently (without a batch processing mechanism) and requests are allocated to nodes that meet their resource requirements through round-robin scheduling.

[0242] Figures 3 and 4 present the cost comparison between the proposed method and the baseline methods on the RTE and RACE datasets, respectively. The number of requests is fixed at 100, and different service level objective (SLO) values are set in ascending order. In both figures, the blue bars labeled TaBatch represent the cost value calculated by the proposed method. The cost of the proposed algorithm is lower than that of the baseline algorithms across both datasets and all SLO values.

[0243] Meanwhile, Figures 5 and 6 compare the inference time of the proposed method and the baseline methods on the RTE and RACE datasets. In both figures, the blue line with square markers labeled TaBatch represents the inference time value calculated by the proposed method. The inference time of the proposed algorithm is lower than that of the baseline algorithms across both datasets and all SLO values. This is because the other methods either do not consider input variation (such as ESG) or focus only on request size (such as MBS), while the proposed algorithm is specifically optimized for the variable-length nature of natural language processing (NLP) inputs. By implementing token-level fine-grained batch processing, the proposed method effectively reduces both resource cost and inference time.

[0244] In summary, this invention significantly reduces the resource costs and execution latency of serverless inference services, reduces resource fragmentation, and thus effectively improves system efficiency.

[0245] The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the above embodiments. The embodiments and descriptions in the specification are merely illustrative of the principles of the invention. Any modifications, equivalent substitutions, or improvements made within the spirit and principles of the present invention without departing from its spirit and scope should be included within the protection scope of the present invention.

Claims

1. A token-aware batch processing and online scheduling method for serverless NLP inference, characterized in that, The method includes the following steps: Step 1: Receive user inference requests and extract their token length features to obtain the token length distribution of the current request sequence; Step 2: Based on the token length distribution, perform batch division and dynamically merge requests into at least one token bucket; Step 3: For each token bucket, configure computing resources based on preset offline performance data, while meeting service level target constraints. Step 4: Based on the resource requirements of each token bucket, the task is assigned to the target node instance using an online scheduling strategy guided by resource matching degree.

2. The token-aware batch processing and online scheduling method for Serverless NLP inference according to claim 1, characterized in that, After step 1 and before step 2, the method further includes performing: Step 1.1, Constructing Pre-defined Offline Performance Data: Through offline performance profiling, a model inference latency performance table, i.e., an offline performance table, is pre-built under different token lengths, batch sizes, and CPU and GPU resource configuration combinations.

3. The token-aware batch processing and online scheduling method for Serverless NLP inference according to claim 2, characterized in that, Following step 1.1, the method further includes performing: Step 1.2: Establish a system model to formally define the scheduling and configuration problem.

4. The token-aware batch processing and online scheduling method for Serverless NLP inference according to claim 3, characterized in that, Step 1.2 specifically includes: Establish a latency model to characterize end-to-end latency constraints, including batch processing wait, network communication, and inference execution. Establish a token-aware batch model to explicitly characterize the variability of token length; Establish a cost model to model operating costs based on resource continuous occupancy time and fixed call overhead; Establish a resource capacity model to constrain the total resources consumed by all token buckets to not exceed the available resource capacity of all worker nodes; Joint optimization is carried out with the goal of minimizing total operating costs.

5. The token-aware batch processing and online scheduling method for Serverless NLP inference according to claim 1, characterized in that, Step 2 involves batch division, specifically including: Step 2-1: Sort the set of requests to be processed in ascending order by token length; Step 2-2: Iterate through the sorted requests one by one and quantify the incremental padding waste caused by adding a new request to the current token bucket; Step 2-3: When a preset splitting threshold condition is met, close the current token bucket and make the new request the first element of the next token bucket.

6. The token-aware batch processing and online scheduling method for Serverless NLP inference according to claim 5, characterized in that, The splitting threshold conditions described in step 2-3 include at least one of the following conditions: Condition 1: The number of requests in the token bucket exceeds the preset maximum batch capacity limit; Condition 2: The expected execution delay of the current token bucket under the maximum resource configuration violates the deadline constraint of any of its internal requests; Condition 3: The padding waste increment of the current token bucket exceeds the preset proportional threshold.

7. The token-aware batch processing and online scheduling method for Serverless NLP inference according to claim 6, characterized in that, The quantification method for the padding waste is: calculating the sum of the differences between the longest request length in the token bucket and the actual length of each request, expressed as: $$W_h = \sum_{r_i \in B_h} \left( L_h^{\max} - l_i \right)$$ In the formula, $W_h$ denotes the padding waste of token bucket $B_h$, $L_h^{\max}$ is the length of the longest request in token bucket $B_h$, and $l_i$ is the actual length of each request $r_i$.

8. The token-aware batch processing and online scheduling method for Serverless NLP inference according to claim 2, characterized in that, Step 3 involves configuring computing resources, specifically including: Retrieve the offline performance data or offline performance table, and select the minimum necessary combination of CPU and GPU resources that meets the deadline constraint, given the token length and batch size.

9. The token-aware batch processing and online scheduling method for Serverless NLP inference according to claim 1, characterized in that, The online scheduling strategy described in step 4 specifically includes: Arrange the token buckets that have completed resource allocation in descending order of resource demand; Calculate the matching degree between each token bucket and the idle resource vector of each candidate worker node, i.e., the resource coverage; Based on the resource coverage, the token bucket is preferentially scheduled to the node instance with the highest matching degree for execution.

10. The token-aware batch processing and online scheduling method for Serverless NLP inference according to claim 9, characterized in that, The resource coverage is measured using cosine similarity, which quantifies how closely the node's remaining resources match the token bucket demand. The calculation formula is as follows: $$\rho_{h,n} = \frac{c_h \, R_n^{cpu} + g_h \, R_n^{gpu}}{\sqrt{c_h^2 + g_h^2} \; \sqrt{\left(R_n^{cpu}\right)^2 + \left(R_n^{gpu}\right)^2}}$$ In the formula, $\rho_{h,n}$ denotes the degree of matching between token bucket $B_h$ and the idle resources of each candidate worker node $n$; $R_n^{cpu}$ and $R_n^{gpu}$ denote the idle CPU resources and idle GPU resources on the n-th candidate worker node; and $c_h$ and $g_h$ denote the CPU and GPU resources required for inference by the h-th token bucket $B_h$.