Method, system, device and medium for optimizing effective throughput of large language model inference
By performing token-level segmentation of large language model inference requests and dynamically determining the optimal segmentation point, the method resolves resource contention, improves the utilization and throughput of computing resources, meets service level objectives (SLOs), and reduces latency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- UNIV OF SCI & TECH OF CHINA
- Filing Date
- 2026-04-10
- Publication Date
- 2026-06-16
AI Technical Summary
Large language model inference is computationally intensive and suffers from severe resource contention, which causes throughput fluctuations and increased latency, makes stringent Service Level Objectives (SLOs) difficult to meet, and results in low resource utilization.
By performing token-level segmentation on inference requests and dynamically determining the optimal segmentation point, requests are broken down into schedulable micro-requests. A segmentation scheme based on simulated scheduling and performance profiling is adopted to reduce resource contention and mutual interference between the prefill and decoding stages, thereby improving the utilization of computing resources.
While meeting latency constraints, the method improves effective throughput, reduces end-to-end service latency, and increases the utilization of computing resources.
Figure CN121996437B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of large language model inference optimization, and in particular to a method, system, device, and medium for optimizing the effective throughput of large language model inference.
Background Technology
[0002] With the widespread application of large language models in scenarios such as dialogue, search and question answering, code generation, and content creation, online inference with large language models has become a crucial foundational capability. In practical deployments, large language model inference is computationally intensive and relies heavily on GPUs (Graphics Processing Units), so it typically must be deployed on high-performance hardware, which brings high hardware procurement and maintenance costs. Inference services therefore need to maximize throughput to improve resource utilization and reduce the cost per request, while simultaneously meeting Service Level Objectives (SLOs). To capture both goals, the concept of "effective throughput" is introduced: the number of requests completed, or tokens generated, by the system per unit of time while satisfying the SLO requirements. Optimizing the effective throughput of inference services has become critical.
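As a rough illustration of the definition above, effective throughput can be computed by counting only the requests that meet the SLO. The following is a minimal sketch; the function name `effective_throughput`, the per-request latency list, and the use of a single latency SLO are illustrative assumptions and are not specified by the invention.

```python
from typing import Sequence


def effective_throughput(latencies_s: Sequence[float], slo_s: float, window_s: float) -> float:
    """Requests per second that finished within the SLO during one time window.

    Hypothetical helper: a request counts only if its latency is <= slo_s.
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    met = sum(1 for latency in latencies_s if latency <= slo_s)
    return met / window_s


# Four requests completed in a 2-second window; one of them missed a 1.0 s SLO.
goodput = effective_throughput([0.4, 0.9, 1.6, 0.7], slo_s=1.0, window_s=2.0)
```

Raw throughput for the same window would be 4 / 2.0 requests per second; the request that missed the SLO does not contribute to effective throughput.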
[0003] Large language model inference can be divided into two stages: prefill and decoding. The prefill stage mainly involves large-scale matrix operations, which are computationally intensive and consume a lot of computing power. The decoding stage generates subsequent tokens step by step in an autoregressive manner, which requires frequent access to and maintenance of the cache, thus placing demands on memory access and bandwidth. Since the two stages have different resource requirements, they are prone to resource contention and mutual interference when sharing hardware, which can lead to throughput fluctuations and increased latency. Therefore, a prefill and decoding stage separation technique (PD separation) can be adopted, which schedules and executes the two stages on different computing resources or different execution instances to reduce resource contention and mutual interference. In addition, a chunk prefill technique can be used, which divides the long input of the prefill stage into multiple blocks according to the token sequence and executes them in batches, mitigating the interference of a single long sequence prefill on the decoding stage and improving the overall throughput and latency stability of the system.
[0004] However, although chunk prefill alleviates interference to some extent by dividing a long prompt into multiple prefill chunks, its control over service latency remains relatively coarse and cannot satisfy strict SLO requirements. PD separation, for its part, can still allocate resources unevenly when request arrivals surge, request lengths vary widely, or concurrency changes dynamically: one side may have idle computing power or bandwidth while the other becomes the system bottleneck under concentrated load, which limits overall throughput.
[0005] In view of this, the present invention is hereby proposed.
Summary of the Invention
[0006] The purpose of this invention is to provide a method, system, device, and medium for optimizing the effective throughput of large language model inference, which can improve the utilization of computing resources (GPUs) and the effective throughput of large language model inference.
[0007] The objective of this invention is achieved through the following technical solution:
[0008] A method for optimizing the effective throughput of large language model inference includes:
[0009] The system receives inference requests and estimates the decoding length to determine the number of predicted tokens, thereby constructing a logical request. The logical request is then segmented to obtain an initial segmentation scheme. Simulated scheduling is used to determine whether the delay constraint and load balancing constraint are met. If not, the segmentation point is adjusted starting from the initial segmentation scheme, and the simulation scheduling is used to continue to determine whether the delay constraint and load balancing constraint are met. This process is iterated until the optimal segmentation scheme is obtained. The optimal segmentation scheme contains two micro-requests after the logical request has been segmented.
[0010] Each micro-request is sent to the execution module in sequence.
[0011] The execution module performs inference on the received micro-requests by calling the corresponding execution resources.
[0012] A system for optimizing the effective throughput of large language model inference, used to implement the aforementioned method, includes:
[0013] The scheduling module receives inference requests and estimates the decoding length to determine the number of predicted tokens, thereby constructing a logical request. It then segments the logical request to obtain an initial segmentation scheme. Simulated scheduling is used to determine whether the delay and load balancing constraints are met. If not, the initial segmentation scheme is used as the starting point to adjust the segmentation point position, and simulated scheduling is used to continue determining whether the delay and load balancing constraints are met. This process iterates until an optimal segmentation scheme is obtained, which includes two micro-requests after the logical request has been segmented.
[0014] The distribution module is used to send each micro-request to the execution module one by one in sequence;
[0015] The execution module is used to perform inference on the received micro-requests by calling the corresponding execution resources.
[0016] A processing device includes: one or more processors; and a memory for storing one or more programs;
[0017] When the one or more programs are executed by the one or more processors, the one or more processors implement the aforementioned method.
[0018] A readable storage medium storing a computer program that, when executed by a processor, implements the aforementioned method.
[0019] As can be seen from the technical solution provided by the present invention, performing token-level segmentation on inference requests and dynamically determining the optimal segmentation point splits each inference request into schedulable micro-requests. This reduces resource contention and mutual interference between the prefill and decoding stages and improves computing resource utilization, thereby increasing effective throughput and reducing end-to-end service latency while meeting latency constraints.
Description of the Drawings
[0020] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the following description of the embodiments will be briefly introduced. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0021] Figure 1 is a flowchart of a method for optimizing the effective throughput of large language model inference, provided in an embodiment of the present invention.
[0022] Figure 2 is a schematic diagram of a logical request segmentation scheme provided in an embodiment of the present invention.
[0023] Figure 3 is a schematic diagram of the scheduling module provided in an embodiment of the present invention.
[0024] Figure 4 is a flowchart of the segmentation scheme search process provided in an embodiment of the present invention.
[0025] Figure 5 is a schematic diagram of a system for optimizing the effective throughput of large language model inference, provided in an embodiment of the present invention.
[0026] Figure 6 is a schematic diagram of a processing device provided in an embodiment of the present invention.
Detailed Implementation
[0027] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the protection scope of the present invention.
[0028] First, the terms that may be used herein are explained as follows:
[0029] The terms "comprising," "including," "containing," "having," or other similar semantic descriptions should be interpreted as non-exclusive inclusion. For example, including a technical feature element (such as raw material, component, ingredient, carrier, dosage form, material, size, part, component, mechanism, device, step, process, method, reaction conditions, processing conditions, parameter, algorithm, signal, data, product or article of manufacture, etc.) should be interpreted as including not only the expressly listed technical feature element, but also other technical feature elements that are not expressly listed and are well-known in the art.
[0030] The following provides a detailed description of the method, system, device, and medium for optimizing the effective throughput of large language model inference provided by this invention. Contents not described in detail in the embodiments of this invention are prior art known to those skilled in the art. Where specific conditions are not specified in the embodiments of this invention, conventional conditions in the art or conditions recommended by the manufacturer shall apply. Instruments used in the embodiments of this invention, unless otherwise specified by the manufacturer, are all commercially available conventional products.
[0031] Example 1
[0032] This invention provides a method for optimizing the effective throughput of large language model inference. As shown in Figure 1, it mainly includes the following steps:
[0033] Step 1: Receive inference requests and search for the optimal segmentation scheme.
[0034] In this step: receiving inference requests and estimating the decoding length to determine the number of predicted generated tokens, thereby constructing a logical request; segmenting the logical request to obtain an initial segmentation scheme; determining whether the delay constraint and load balancing constraint are met through simulated scheduling; if not, adjusting the segmentation point position based on the initial segmentation scheme, and continuing to determine whether the delay constraint and load balancing constraint are met through simulated scheduling, iterating until the optimal segmentation scheme is obtained, which includes two micro-requests after the logical request is segmented.
[0035] The preferred implementation methods for each step in this process are as follows:
[0036] (1) Receiving the inference request includes: processing the received text-based inference request into an initial token sequence consisting of multiple tokens, and adding it to the queue to be scheduled.
[0037] (2) The process of performing decoding length estimation to determine the number of predicted generated tokens, thereby constructing the logical request, includes:
[0038] (2.1) Decoding length estimation is performed based on one or more of the following to determine the predicted number of generated tokens: the business type of the inference request, the content features of the prompt, historical statistical information, and preset generation parameters. The historical statistical information includes, but is not limited to: the decoding length distribution for prompts of similar length, statistics on the predicted number of generated tokens under different business types, and the mapping between different prompt content features and the predicted number of generated tokens.
[0039] (2.2) The prompt tokens and their count are combined with the predicted number of generated tokens to construct a logical request, which uniformly describes the input scale and the expected output scale of the inference task. As described above, the inference request is processed into an initial token sequence; the prompt tokens are all the tokens in that initial token sequence. After the two micro-requests complete inference, all the generated tokens together with the prompt tokens form the complete token sequence.
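The construction of a logical request in part (2) can be sketched as follows. This is a minimal illustration under stated assumptions: the estimator simply averages historical decoding lengths for the business type and caps the result by a maximum generation length, and the names `LogicalRequest`, `estimate_decode_len`, and `default_len` are hypothetical. The invention permits other information sources, such as prompt content features.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class LogicalRequest:
    prompt_tokens: Sequence[int]  # all tokens of the initial token sequence
    predicted_decode_len: int     # predicted number of generated tokens

    @property
    def total_len(self) -> int:
        # Input scale plus expected output scale, in tokens.
        return len(self.prompt_tokens) + self.predicted_decode_len


def estimate_decode_len(
    business_type: str,
    history: Dict[str, List[int]],
    max_new_tokens: int,
    default_len: int = 128,  # assumed fallback when no history exists
) -> int:
    """Assumed estimator: mean historical decode length, capped by max_new_tokens."""
    samples = history.get(business_type)
    estimate = round(mean(samples)) if samples else default_len
    return max(1, min(estimate, max_new_tokens))


history = {"chat": [100, 140, 120], "code": [400, 600]}
request = LogicalRequest(
    prompt_tokens=[11, 42, 7, 99],
    predicted_decode_len=estimate_decode_len("chat", history, max_new_tokens=256),
)
```

Here the logical request describes 4 prompt tokens and 120 predicted generated tokens, which is all the later segmentation steps need to know about the request.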
[0040] (3) The step of determining whether delay constraints and load balancing constraints are met through simulated scheduling includes:
[0041] (3.1) Maintain historical scheduling results and a performance profile. The historical scheduling results characterize, for each execution resource, its queue status, its scheduled batches, and their execution order. The performance profile is a data structure that characterizes the mapping between batch features and execution time. The batch features include at least the batch size, the number of prompt tokens in the batch, and the number of decoded tokens in the batch, where decoded tokens are the tokens generated by executing the batch. The execution time of a batch can be estimated from its batch features.
[0042] (3.2) The initial segmentation scheme includes a set segmentation point position and the two micro-requests (a preceding micro-request and a subsequent micro-request) obtained by segmenting the logical request at that position. Specifically, the initial segmentation scheme places the segmentation point at the boundary between the prompt tokens and the predicted generated tokens. The two micro-requests are scheduled in simulation based on the historical scheduling results and are executed by their corresponding execution resources. Each execution resource executes in batches, and each micro-request requires multiple batches. The execution time of each batch is estimated from the performance profile, which yields the end-to-end latency estimate and the estimated computation time of each of the two micro-requests. The estimated computation time of a micro-request is the sum of the execution times of its own batches, while the end-to-end latency estimate is the sum of the execution times of the batches the two micro-requests wait for and the batches they execute on their corresponding execution resources. The logic of the simulated scheduling is consistent with the actual scheduling logic of the execution resources.
[0043] In this embodiment of the invention, the preceding and subsequent micro-requests are executed through corresponding execution resources. Each execution resource has one batch. Assuming that the preceding micro-request requires the execution of m batches and the subsequent micro-request requires the execution of y batches, the end-to-end latency estimation result and the estimated computation time of the two micro-requests are calculated as follows: The preceding micro-request is assigned to the first execution resource. After the first execution resource waits for n batches, it executes the m batches corresponding to the preceding micro-request. After the execution is completed, the subsequent micro-request is assigned to the second execution resource. After the second execution resource waits for x batches, it executes the y batches corresponding to the subsequent micro-request. In the above example, the sum of the execution times of the m batches is the estimated computation time of the preceding micro-request, the sum of the execution times of the y batches is the estimated computation time of the subsequent micro-request, and the sum of the execution times of (n+m+x+y) batches is the end-to-end latency estimation result.
[0044] (3.3) Based on the end-to-end delay estimation results and the estimated computation time of the two micro-requests, determine whether the delay constraint and load balancing constraint are satisfied.
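The estimates described in part (3) reduce to sums over batch execution times. The sketch below mirrors the example with n waiting and m executing batches on the first execution resource and x waiting and y executing batches on the second. The function name and the concrete batch times are illustrative assumptions; in the described scheme each batch time would be estimated from the performance profile.

```python
from typing import Sequence, Tuple


def estimate_latency(
    wait_first: Sequence[float],   # n batches queued ahead on the first resource
    exec_pre: Sequence[float],     # m batches of the preceding micro-request
    wait_second: Sequence[float],  # x batches queued ahead on the second resource
    exec_post: Sequence[float],    # y batches of the subsequent micro-request
) -> Tuple[float, float, float]:
    """Return (end_to_end, compute_pre, compute_post) in the unit of the inputs."""
    compute_pre = sum(exec_pre)
    compute_post = sum(exec_post)
    # End-to-end latency covers all (n + m + x + y) batches.
    end_to_end = sum(wait_first) + compute_pre + sum(wait_second) + compute_post
    return end_to_end, compute_pre, compute_post


# n = 2, m = 3, x = 1, y = 4, with assumed batch times in milliseconds.
e2e, pre, post = estimate_latency([10, 12], [20, 20, 15], [8], [5, 5, 5, 5])
```

With these assumed values the preceding micro-request is estimated at 55 ms, the subsequent micro-request at 20 ms, and the end-to-end latency at 105 ms.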
[0045] (4) Based on the end-to-end latency estimation results and the estimated computation time of the two micro-requests, determine whether the latency constraints and load balancing constraints are met, including:
[0046] (4.1) Determine whether the latency constraint is satisfied. The latency constraint is the service level objective: if the end-to-end latency estimate does not exceed the service level objective, the latency constraint is satisfied.
[0047] (4.2) When the delay constraint is met, continue to determine whether the load balancing constraint is met, that is, whether the difference between the estimated computation time of the two micro-requests does not exceed the preset threshold; if so, the load balancing constraint is met.
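The two checks in part (4) can be written as a short predicate. This is a sketch under the assumption that all times share one unit; the function name `check_constraints` and its parameter names are hypothetical.

```python
from typing import Tuple


def check_constraints(
    e2e_latency: float,
    compute_pre: float,
    compute_post: float,
    slo: float,
    balance_threshold: float,
) -> Tuple[bool, bool]:
    """Return (latency_ok, balance_ok).

    The load balancing constraint is evaluated only when the latency
    constraint holds, matching the order of (4.1) and (4.2).
    """
    latency_ok = e2e_latency <= slo
    balance_ok = latency_ok and abs(compute_pre - compute_post) <= balance_threshold
    return latency_ok, balance_ok


# Estimated 105 ms end to end, 55 ms and 20 ms of computation for the two micro-requests.
result = check_constraints(105, 55, 20, slo=120, balance_threshold=40)
```

A segmentation scheme is accepted only when both entries of the returned pair are true.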
[0048] (5) Adjusting the splitting point position based on the initial splitting scheme and determining whether the delay constraint and load balancing constraint are satisfied, iterating continuously until the optimal splitting scheme is obtained, including:
[0049] (5.1) Based on the judgment results of whether the delay constraint and load balancing constraint are met, adjust the split point position as follows. The logical request is divided at the split point position k into two micro-requests, called the preceding micro-request and the following micro-request. If the estimated computation time of the preceding micro-request is greater than that of the following micro-request, move the split point position k forward by a preset step size $\Delta$, that is, $k \leftarrow k - \Delta$; otherwise, move the split point position k backward by the preset step size, that is, $k \leftarrow k + \Delta$, where $\leftarrow$ is the assignment symbol. Each adjustment of the split point position yields a new splitting scheme, which is then checked against the delay constraint and the load balancing constraint using the procedure in parts (3) and (4) above.
[0050] (5.2) Simulated scheduling is used to determine whether the delay constraint and the load balancing constraint are met. If no segmentation scheme satisfying both constraints is obtained after the maximum number of iterations, a delay time is set based on the historical scheduling results, and the iteration continues once the delay time has elapsed. If no segmentation scheme satisfying both constraints is obtained after the specified number of delays, the initial segmentation scheme is taken as the optimal segmentation scheme and the inference request is sent directly to the execution module for execution. The specified number of delays is generally no more than 3, to avoid delaying request execution excessively; the specific number can be set dynamically according to the system load. The initial segmentation scheme is determined by the boundary between the prompt tokens and the predicted generated tokens. In most cases it reduces resource interference between the prefill and decoding stages, so it serves as the fallback scheme for degraded processing.
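The iterative search in part (5), including the delayed retries and the fallback to the initial scheme, can be sketched as below. Several points are assumptions made only for illustration: the `evaluate` callback stands in for simulated scheduling, the step size is fixed, the search restarts from the initial split point after each delay, the split point is clamped so that neither micro-request is empty, and `wait` is a hypothetical hook that would pause for the delay time.

```python
from typing import Callable, Tuple

# evaluate(k) -> (end_to_end, compute_pre, compute_post); stands in for simulated scheduling.
Evaluator = Callable[[int], Tuple[float, float, float]]


def search_split_point(
    prompt_len: int,
    total_len: int,
    evaluate: Evaluator,
    slo: float,
    balance_threshold: float,
    step: int,
    max_iters: int = 8,
    max_delays: int = 3,
    wait: Callable[[], None] = lambda: None,
) -> Tuple[int, bool]:
    """Return (k, found); falls back to the initial split point k = prompt_len."""
    for attempt in range(max_delays + 1):
        k = prompt_len  # initial scheme: boundary between prompt and generated tokens
        for _ in range(max_iters):
            e2e, pre, post = evaluate(k)
            if e2e <= slo and abs(pre - post) <= balance_threshold:
                return k, True
            # Shrink whichever micro-request is estimated to be heavier.
            k = k - step if pre > post else k + step
            k = max(1, min(k, total_len - 1))
        if attempt < max_delays:
            wait()  # the queue status may have changed after the delay
    return prompt_len, False  # degraded processing: use the initial scheme


def toy_evaluate(k: int) -> Tuple[float, float, float]:
    # Assumed cost model: 2 time units per preceding token, 1 per following token,
    # plus 10 units of queueing, for a 100-token logical request.
    pre, post = 2.0 * k, 1.0 * (100 - k)
    return pre + post + 10.0, pre, post


best = search_split_point(60, 100, toy_evaluate, slo=200.0, balance_threshold=10.0, step=10)
```

Under the toy cost model the search moves the split point from 60 to 50, 40, and then 30, where the two estimated computation times differ by exactly the threshold.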
[0051] Step 2: Send each micro-request to the execution module one by one in sequence.
[0052] Based on the foregoing description, each segmentation scheme in this embodiment of the invention includes a segmentation point position and the two micro-requests obtained by segmenting the logical request at that position. The first is the preceding micro-request and the second is the following micro-request. Different segmentation schemes have different segmentation point positions, so their preceding and following micro-requests differ accordingly.
[0053] This step belongs to the distribution phase. In the previous steps, the logical request will be divided into a preceding micro-request and a following micro-request by the splitting point position in the optimal splitting scheme. At this time, the corresponding requests need to be distributed in order, that is: first send the preceding micro-request to the execution module, and after receiving the execution result of the preceding micro-request returned by the execution module, send the following micro-request to the execution module.
[0054] Step 3: The execution module performs inference on the received micro-requests by calling the corresponding execution resources.
[0055] In this step, the execution module adds the received micro-requests to the local pending queue and calls the corresponding execution resources to perform inference.
[0056] The above solution provided by the embodiments of the present invention segments requests, breaking each inference request down into schedulable micro-requests, which reduces mutual interference between the prefill and decoding stages under shared resources and thereby improves effective throughput. Furthermore, by employing a segmentation scheme search based on simulated scheduling and performance profiling, the latency (against the service level objective) and the load balance of different segmentation schemes are evaluated without actually performing inference computation, and a better segmentation scheme is obtained iteratively at low overhead. As a result, the inference service can improve resource utilization, reduce service latency, and improve effective throughput while meeting service level objectives.
[0057] To more clearly demonstrate the technical solution and its effects provided by the present invention, the method provided by the embodiments of the present invention will be described in detail below with reference to specific examples.
[0058] I. Overview of the Scheme.
[0059] The overall process of this invention can be summarized as follows. An inference request is received, processed into an initial token sequence consisting of multiple tokens, and added to a scheduling queue. The decoding length (the number of tokens to be generated) is then predicted, and a logical request is constructed from the prediction result and the prompt tokens (the tokens in the initial token sequence). The logical request is segmented to obtain an initial segmentation scheme, and, starting from that scheme, the segmentation point position is adjusted step by step in a search. Each segmentation scheme is evaluated through simulated scheduling and must meet the latency and load balancing constraints; if it does not, the segmentation point position is adjusted and the search iterates until the optimal segmentation scheme is obtained. Here, segmentation means dividing the logical request into two micro-requests at a certain token, and simulated scheduling together with performance profiling allows the latency and load balance of different segmentation schemes to be evaluated without actually performing inference computation. Once the optimal segmentation scheme is obtained, the logical request is segmented into a preceding micro-request and a following micro-request, which are distributed for execution. During distribution, the following micro-request is distributed only after the preceding micro-request has finished executing on its corresponding computing resource.
[0060] II. Detailed Description of the Scheme.
[0061] 1. Introduction to the principle of the segmentation scheme.
[0062] Figure 2 is a schematic diagram illustrating the principle of the segmentation scheme for a logical request provided in this invention, showing a logical request and its different segmentation methods. The logical request includes at least the prompt tokens and their count, as well as the predicted number of generated tokens. The decoding length is estimated, to determine the predicted number of generated tokens, based on one or more of: the business type of the inference request, the content features of the prompt, historical statistical information, and preset generation parameters. The preset generation parameters include, but are not limited to: maximum generation length (which caps the number of generated tokens), stop words (generation terminates when a stop word is generated), temperature (which controls the randomness of the generated result), top-K (the next token is sampled only from the K candidate tokens with the highest probability), and top-p (the next token is sampled from the smallest set of candidate tokens whose cumulative probability reaches a threshold p).
[0063] The logical request can be segmented at any token position: at the boundary between the prompt tokens and the predicted generated tokens, within the prompt tokens, or within the predicted generated tokens. Segmentation yields two micro-requests, a preceding micro-request and a following micro-request, which are executed sequentially. When the segmentation point lies within the prompt tokens, each micro-request contains at least the prompt token subsequence corresponding to that micro-request and the number of predicted generated tokens associated with it.
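The three segmentation cases described above (at the boundary, within the prompt tokens, within the predicted generated tokens) can be handled by one function. The sketch below is illustrative only: the `MicroRequest` fields and the function name are hypothetical, and how the following micro-request picks up the state produced by the preceding one is outside the scope of the sketch.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class MicroRequest:
    prompt_tokens: Tuple[int, ...]  # prompt subsequence handled by this micro-request
    decode_len: int                 # predicted generated tokens assigned to it


def split_logical_request(
    prompt_tokens: Sequence[int], predicted_decode_len: int, k: int
) -> Tuple[MicroRequest, MicroRequest]:
    """Split after the k-th token of the sequence (prompt followed by predicted output)."""
    p = len(prompt_tokens)
    total = p + predicted_decode_len
    if not 1 <= k <= total - 1:
        raise ValueError("k must leave both micro-requests non-empty")
    prompt_cut = min(k, p)      # prompt tokens that fall before the split point
    decode_cut = max(0, k - p)  # predicted tokens that fall before the split point
    pre = MicroRequest(tuple(prompt_tokens[:prompt_cut]), decode_cut)
    post = MicroRequest(tuple(prompt_tokens[prompt_cut:]), predicted_decode_len - decode_cut)
    return pre, post


prompt = [1, 2, 3, 4]
at_boundary = split_logical_request(prompt, 6, k=4)  # prefill first, all decoding second
in_prompt = split_logical_request(prompt, 6, k=2)    # prompt shared between both
in_decode = split_logical_request(prompt, 6, k=7)    # decoding shared between both
```

Splitting at k = 4 reproduces the initial segmentation scheme, while k = 2 and k = 7 shift work toward the following or the preceding micro-request respectively.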
[0064] 2. Simulation scheduling and constraint judgment.
[0065] In this embodiment of the invention, the entire scheduling process (i.e., step 1 mentioned above) is implemented through a scheduling module. This module not only needs to predict the decoding length to obtain the expected number of generated tokens, but also needs to construct logical requests and complete the segmentation accordingly. Furthermore, it needs to use simulated scheduling to determine whether the segmentation scheme meets latency and load balancing constraints.
[0066] As shown in Figure 3, the scheduling module maintains historical scheduling results and a performance profile. The historical scheduling results characterize the queue status, scheduled batches, and their execution order for each execution resource. The performance profile is a data structure representing the mapping between batch features and execution time. The batch features include at least the batch size, the number of prompt tokens within the batch, and the number of decoded tokens within the batch. Given a batch size, a number of prompt tokens within the batch, and a predicted number of generated tokens within the batch, the performance profile is used to estimate the execution time of the batch, and this estimate serves as the basis for simulated scheduling and constraint verification of the splitting scheme.
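One possible shape for such a performance profile is a lookup table keyed by batch features. The sketch below is an assumption for illustration: the class name, the exact-match lookup, and the nearest-neighbour fallback (by summed absolute feature difference) are not prescribed by the invention, which only requires a mapping from batch features to execution time.

```python
from typing import Dict, Tuple

# (batch size, prompt tokens in batch, decoded tokens in batch)
BatchFeatures = Tuple[int, int, int]


class PerformanceProfile:
    """Assumed table-based profile: exact lookup, else the nearest profiled batch."""

    def __init__(self) -> None:
        self._table: Dict[BatchFeatures, float] = {}

    def record(self, features: BatchFeatures, exec_time_ms: float) -> None:
        # Store a measured execution time for one batch shape.
        self._table[features] = exec_time_ms

    def estimate(self, features: BatchFeatures) -> float:
        if not self._table:
            raise LookupError("profile is empty")
        if features in self._table:
            return self._table[features]
        nearest = min(
            self._table,
            key=lambda f: sum(abs(a - b) for a, b in zip(f, features)),
        )
        return self._table[nearest]


profile = PerformanceProfile()
profile.record((8, 512, 8), 42.0)     # assumed measurement for a prefill-heavy batch
profile.record((8, 0, 8), 11.0)       # assumed measurement for a decode-only batch
profile.record((16, 1024, 16), 90.0)
estimated_ms = profile.estimate((8, 480, 8))
```

A regression model fitted to the same measurements would serve equally well; the table form is chosen here only because it is the simplest mapping to show.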
[0067] The scheduling module queues and simulates the scheduling of newly arriving requests based on historical scheduling results and the performance profile. The scheduling logic here should be consistent with the actual scheduling logic in the corresponding computing instance to ensure that the simulation results are consistent with the actual batch assignment, execution order and latency estimation.
[0068] Furthermore, the scheduling module verifies the splitting scheme, estimating the execution time of each batch based on the performance profile and combining it with the simulation scheduling results to obtain the end-to-end latency estimation result and the estimated computation time of the two micro-requests, which are used to verify the latency constraints and load balancing constraints. When the splitting scheme meets the above two constraints, the scheduling module determines the splitting scheme as the optimal splitting scheme and outputs it; when the above two constraints are not met, the scheduling module adjusts the splitting point position according to the estimated computation time difference and generates a new splitting scheme to continue verification.
[0069] In this invention, a logical request is segmented by introducing a segmentation point position k. The position k lies on the token sequence of the logical request and can take any position in it. The logical request is segmented after the k-th token, which divides it into a preceding micro-request and a following micro-request. A segmentation scheme includes at least the segmentation point position and the preceding and following micro-requests determined by it.
[0070] In this embodiment of the invention, the boundary between the prompt tokens and the generated tokens is selected as the initial segmentation point, and the first segmentation of the request at this point forms the initial segmentation scheme. This reduces resource contention and mutual interference between stages while providing a reasonable initial solution for the subsequent segmentation point search.
[0071] In this invention, the latency constraint refers to the Service Level Objective (SLO): it requires that the end-to-end latency estimate not exceed the Service Level Objective. The load balancing constraint requires that the estimated computation times of the two micro-requests on their corresponding execution resources be as close as possible, specifically that the difference between the two estimated computation times not exceed a preset threshold. The threshold value can be set by the user based on the actual situation or experience, and this invention does not restrict it.
[0072] In this embodiment of the invention, adjusting the splitting point position to generate a new splitting scheme is called splitting scheme search. Specifically: when the estimated computation time of the preceding micro-request under the splitting scheme is greater than that of the succeeding micro-request, the splitting point position k is moved forward by a preset step size $\Delta$, that is, $k \leftarrow k - \Delta$; otherwise, it is moved backward by the preset step size, that is, $k \leftarrow k + \Delta$. The step size $\Delta$ can be set according to the length of the request currently being scheduled, for example as a preset proportion of the prompt length or of the predicted generated length, and can be gradually reduced as the number of iterations increases to improve search accuracy.
[0073] In this embodiment of the invention, if no splitting scheme that satisfies the constraints is obtained after the maximum number of iterations, the scheduling module can set a delay time based on the historical scheduling results. When the delay time expires, the queue status is retrieved again and the splitting scheme search is performed again.
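A step size that starts as a proportion of a request length and shrinks with the iteration count could look like the sketch below. The geometric decay, the default `ratio` and `decay` values, and the floor of one token are assumptions; the text only states that the step size can be reduced gradually as iterations proceed.

```python
def step_size(reference_len: int, iteration: int, ratio: float = 0.25, decay: float = 0.5) -> int:
    """Assumed schedule: ratio * reference_len, multiplied by decay each iteration, at least 1.

    reference_len may be the prompt length or the predicted generated length.
    """
    base = ratio * reference_len
    return max(1, int(base * decay ** iteration))


# Step sizes for the first four iterations of a request with a 100-token reference length.
steps = [step_size(100, i) for i in range(4)]
```

Large early steps let the search cross the request quickly, and the smaller later steps refine the split point near the balance position.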
[0074] 3. Search process for segmentation schemes.
[0075] The following, combining the descriptions in parts 1 and 2 above, gives an overall explanation of the search process for the segmentation scheme. As shown in Figure 4, the main steps are as follows:
[0076] Step A1: Determine whether the maximum number of iterations has been reached. If not, split the logical request according to the current split point position to generate the current splitting scheme. If the maximum number of iterations has been reached, set a delay time based on the historical scheduling results; when the delay time expires, the queue status is retrieved again and the splitting scheme search process is restarted.
[0077] Specifically, a new segmentation scheme can be generated as follows: the adjustment direction of the segmentation point position k is determined by comparing the estimated computation times of the two micro-requests obtained in step A2. When the estimated computation time of the preceding micro-request in the segmentation scheme is greater than that of the following micro-request, the segmentation point position k is moved forward by a preset step size $\Delta$, i.e., $k \leftarrow k - \Delta$; otherwise, k is moved backward by the preset step size, i.e., $k \leftarrow k + \Delta$. The iteration count is then incremented by one.
[0078] Step A2: Simulate scheduling and evaluate the current splitting scheme: Without actually performing inference calculations, queue and schedule the split micro-requests based on historical scheduling results to determine the batch affiliation and execution order of the micro-requests; then estimate the execution time of each batch based on the performance profile to obtain the end-to-end latency estimation results and the estimated computation time of the two micro-requests.
[0079] Step A3: Sequentially determine whether the current splitting scheme meets the latency constraint and the load balancing constraint. First, determine whether the latency constraint is met, i.e., whether the end-to-end latency estimate does not exceed the Service Level Objective (SLO). If the latency constraint is met, further determine whether the load balancing constraint is met, i.e., whether the difference between the estimated computation times of the two micro-requests on their corresponding execution resources does not exceed a preset threshold. When both constraints are satisfied, the current splitting scheme is recorded as the optimal splitting scheme, and the process proceeds to step A4; when either constraint is not satisfied, the process returns to step A1 to adjust the split point and generate a new splitting scheme.
[0080] Step A4: Output the current optimal partitioning scheme; the output optimal partitioning scheme includes the preceding and following micro-requests obtained from the partitioning. As mentioned earlier, if a partitioning scheme that satisfies the delay constraint and load balancing constraint is not obtained after a specified number of delays, the initial partitioning scheme is directly adopted.
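Steps A2 and A3 can be sketched together in Python. Everything named here is hypothetical: `Batch`, `ResourcePlan`, `simulate`, and `check_constraints` are illustrative names, and `profile_ms` is an invented cost formula standing in for a measured performance profile. The sketch assumes the batch assignment has already been produced by the simulated queueing, and it follows the definitions in the claims: a micro-request's estimated computation time is the sum of its own batch times, and the end-to-end estimate adds the waiting batches on both execution resources.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Batch:
    size: int           # number of requests in the batch
    prompt_tokens: int  # prompt tokens processed by the batch
    decode_tokens: int  # tokens decoded by the batch


@dataclass(frozen=True)
class ResourcePlan:
    waiting: List[Batch]  # batches already queued ahead on this resource
    own: List[Batch]      # batches the micro-request itself needs


def profile_ms(batch: Batch) -> int:
    # Hypothetical profile: fixed overhead plus per-token costs, in ms.
    return 10 + batch.prompt_tokens // 10 + 2 * batch.decode_tokens


def simulate(
    front: ResourcePlan, back: ResourcePlan, profile: Callable[[Batch], int]
) -> Tuple[int, int, int]:
    """Estimate (end-to-end latency, t_front, t_back) without running inference."""
    t_front = sum(profile(b) for b in front.own)
    t_back = sum(profile(b) for b in back.own)
    waiting = sum(profile(b) for b in front.waiting + back.waiting)
    return waiting + t_front + t_back, t_front, t_back


def check_constraints(
    latency: int, t_front: int, t_back: int, slo: int, threshold: int
) -> Tuple[bool, Optional[str]]:
    """Check the latency constraint first, then the load balancing constraint."""
    if latency > slo:
        return False, "latency"
    if abs(t_front - t_back) > threshold:
        return False, "load_balance"
    return True, None


front = ResourcePlan(waiting=[Batch(4, 200, 0)], own=[Batch(1, 100, 0)])
back = ResourcePlan(waiting=[Batch(8, 0, 8)], own=[Batch(1, 0, 1)] * 3)
latency, t_front, t_back = simulate(front, back, profile_ms)
verdict = check_constraints(latency, t_front, t_back, slo=150, threshold=20)
```

In a real deployment the profile would presumably be a table or fitted model built from measured batch executions rather than a closed-form expression.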
[0081] In the above-described scheme provided by this invention, the segmented micro-requests are simulated and queued without actual inference computation. The batch execution time is estimated by combining performance profiles, thereby obtaining end-to-end latency estimation results and the estimated computation time of the two micro-requests on their corresponding computing resources. This is used to verify latency constraints and load balancing constraints. By progressively adjusting and searching the segmentation point, this invention can make the estimated computation time of the two micro-requests more balanced while meeting Service Level Objectives (SLOs), reducing waiting and idle time caused by uneven resource matching, improving computing resource utilization, and thus increasing the effective throughput completed according to SLOs.
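The overall search loop of steps A1 to A4 can be condensed into the following sketch. It is a simplification under stated assumptions: the delay-and-retry stage is omitted, so the initial scheme is adopted as soon as the iteration budget is exhausted; "forward" is taken to mean a smaller k; the step size is halved each iteration; and `toy_estimate` is an invented stand-in for simulated scheduling, not the patented estimator.

```python
from typing import Callable, Tuple

TOTAL_LEN = 20  # hypothetical logical request length in tokens


def toy_estimate(k: int) -> Tuple[int, int, int]:
    # Invented cost model: 1 time unit per preceding token, 3 per
    # following token, plus 5 units of queueing delay.
    t_front = k
    t_back = 3 * (TOTAL_LEN - k)
    return t_front + t_back + 5, t_front, t_back


def search_split(
    k_init: int,
    estimate: Callable[[int], Tuple[int, int, int]],
    slo: int,
    threshold: int,
    step: int,
    max_iters: int,
    total_len: int = TOTAL_LEN,
) -> Tuple[int, bool]:
    """Return (k, found); fall back to k_init if no scheme satisfies both constraints."""
    k = k_init
    for _ in range(max_iters):
        latency, t_front, t_back = estimate(k)            # step A2
        if latency <= slo and abs(t_front - t_back) <= threshold:
            return k, True                                # steps A3 and A4
        k = k - step if t_front > t_back else k + step    # step A1 adjustment
        k = max(1, min(total_len - 1, k))                 # keep both parts non-empty
        step = max(1, step // 2)                          # assumed shrinking step
    return k_init, False


found = search_split(k_init=12, estimate=toy_estimate, slo=40,
                     threshold=4, step=4, max_iters=5)
fallback = search_split(k_init=12, estimate=toy_estimate, slo=10,
                        threshold=4, step=4, max_iters=5)
```

With these toy numbers the initial split at k = 12 violates both constraints, one adjustment moves the split to k = 16 where both hold, and an unattainable SLO leads to the fallback to the initial split.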
[0082] Through the above description of the embodiments, those skilled in the art can clearly understand that the above embodiments can be implemented by software, or by using software plus necessary general-purpose hardware platforms. Based on this understanding, the technical solutions of the above embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, mobile hard drive, etc.), including several instructions to cause a computer device (such as a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments of the present invention.
[0083] Example 2
[0084] This invention also provides an effective throughput optimization system for large language model inference, which is mainly used to implement the method provided in the foregoing embodiment. As shown in Figure 5, the system mainly includes:
[0085] The scheduling module receives inference requests and estimates the decoding length to determine the number of predicted tokens, thereby constructing a logical request. It then segments the logical request to obtain an initial segmentation scheme. Simulated scheduling is used to determine whether the delay and load balancing constraints are met. If not, the initial segmentation scheme is used as the starting point to adjust the segmentation point position, and simulated scheduling is used to continue determining whether the delay and load balancing constraints are met. This process iterates until an optimal segmentation scheme is obtained, which includes two micro-requests after the logical request has been segmented.
[0086] The distribution module is used to send each micro-request to the execution module one by one in sequence;
[0087] The execution module is used to perform inference on the received micro-requests by calling the corresponding execution resources.
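The hand-off from the distribution module to the execution module can be sketched as below. Following claim 7, the following micro-request is sent only after the result of the preceding micro-request has been returned. The function name and the blocking `execute` callable are hypothetical; the real execution module interface is not specified in the text.

```python
from typing import Callable, List, Tuple


def dispatch_in_sequence(
    front: str, back: str, execute: Callable[[str], str]
) -> Tuple[str, str]:
    """Send the preceding micro-request, wait for its result, then send the following one."""
    front_result = execute(front)  # assumed to block until the result is returned
    back_result = execute(back)
    return front_result, back_result


order: List[str] = []


def fake_execute(name: str) -> str:
    # Stand-in for the execution module: records the arrival order.
    order.append(name)
    return f"{name}:done"


results = dispatch_in_sequence("front", "back", fake_execute)
```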
[0088] Since the main technical details involved in each module have been described in detail in the previous embodiments, they will not be repeated here.
[0089] Those skilled in the art will understand that, for the sake of convenience and brevity, the above-described division of functional modules is used as an example. In practical applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the system can be divided into different functional modules to complete all or part of the functions described above.
[0090] Example 3
[0091] The present invention also provides a processing device. As shown in Figure 6, it mainly includes: one or more processors; and a memory for storing one or more programs; wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method provided in the foregoing embodiments.
[0092] Furthermore, the processing device also includes at least one input device and at least one output device; in the processing device, the processor, memory, input device, and output device are connected via a bus.
[0093] In this embodiment of the invention, the specific types of the memory, input device, and output device are not limited; for example:
[0094] Input devices can be touchscreens, image acquisition devices, physical buttons, or mice, etc.
[0095] The output device can be a display terminal;
[0096] The memory can be random access memory (RAM) or non-volatile memory, such as disk storage.
[0097] Example 4
[0098] The present invention also provides a readable storage medium storing a computer program that, when executed by a processor, implements the method provided in the foregoing embodiments.
[0099] In this embodiment of the invention, the readable storage medium is a computer-readable storage medium and can be disposed in the aforementioned processing device, for example, as a memory in the processing device. Furthermore, the readable storage medium can also be any medium capable of storing program code, such as a USB flash drive, portable hard drive, read-only memory (ROM), magnetic disk, or optical disk.
[0100] The above description is merely a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims. The information disclosed in the background section is intended only to enhance the understanding of the overall background technology of the present invention and should not be construed as an admission or implication in any way that such information constitutes prior art known to those skilled in the art.
Claims
1. A method for optimizing the effective throughput of large language model inference, characterized in that it comprises: An inference request is received and its decoding length is estimated to determine the number of predicted generated tokens. A logical request is then constructed by combining the prompt tokens and their quantity with the number of predicted generated tokens, wherein the prompt tokens are all tokens in the initial token sequence. The logical request is then segmented to obtain an initial segmentation scheme. This initial segmentation scheme includes a set segmentation point position and the two micro-requests resulting from segmenting the logical request at the set segmentation point position. Simulated scheduling is used to determine whether the latency constraint and the load balancing constraint are met. If not, the segmentation point position is adjusted starting from the initial segmentation scheme, and simulated scheduling continues to determine whether the latency constraint and the load balancing constraint are met. This process iterates until an optimal segmentation scheme is obtained, which includes the two micro-requests resulting from segmenting the logical request. Each micro-request is sent to the execution module in sequence. The execution module performs inference on the received micro-requests by calling the corresponding execution resources.
2. The method for optimizing the effective throughput of large language model inference according to claim 1, characterized in that receiving the inference request includes: processing the received text-based inference request into an initial token sequence consisting of multiple tokens, and adding it to the scheduling queue.
3. The method for optimizing the effective throughput of large language model inference according to claim 2, characterized in that estimating the decoding length to determine the number of predicted generated tokens includes: The decoding length is estimated based on one or more of the following items of information to determine the number of predicted generated tokens: the business type of the inference request, the content features of the prompt tokens, historical statistics, and preset generation parameters; the historical statistics include: the decoding length distribution corresponding to similar prompt lengths, the statistics of the number of predicted generated tokens under different business types, and the mapping relationship between different prompt content features and the number of predicted generated tokens.
4. The method for optimizing the effective throughput of large language model inference according to claim 1, characterized in that determining through simulated scheduling whether the latency constraint and the load balancing constraint are met includes: Historical scheduling results and a performance profile are maintained; wherein the historical scheduling results characterize the queue status, the scheduled batches, and their execution order for each execution resource; the performance profile is a data structure that characterizes the mapping relationship between different batch characteristics and execution time, wherein the batch characteristics include at least the batch size, the number of prompt tokens in the batch, and the number of decoded tokens in the batch, the decoded tokens being the tokens generated after executing the corresponding batch; the execution time of a batch can be estimated from its batch characteristics; Based on the historical scheduling results, the two micro-requests in the initial segmentation scheme are subjected to simulated scheduling. The two micro-requests are executed by their corresponding execution resources, with each execution resource executing in batches. Each micro-request needs to execute multiple batches. The execution time of the batches is estimated based on the performance profile to obtain the end-to-end latency estimation result and the estimated computation times of the two micro-requests. The estimated computation time of a micro-request is the sum of the execution times of its corresponding batches. The end-to-end latency estimation result is the sum of the execution times of the waiting batches and the execution batches of the two micro-requests on their corresponding execution resources. The logic of the simulated scheduling is consistent with the logic of the actual scheduling in the execution resources. The set segmentation point position is the boundary position between the prompt tokens and the predicted generated tokens.
Based on the end-to-end latency estimation results and the estimated computation time of the two micro-requests, it is determined whether the latency constraint and load balancing constraint are satisfied.
5. The method for optimizing the effective throughput of large language model inference according to claim 4, characterized in that determining whether the latency constraint and the load balancing constraint are met based on the end-to-end latency estimation result and the estimated computation times of the two micro-requests includes: Determine whether the latency constraint is met, the latency constraint being the service level objective: determine whether the end-to-end latency estimation result does not exceed the service level objective; if so, the latency constraint is met. When the latency constraint is met, it is then determined whether the load balancing constraint is met, i.e., whether the difference between the estimated computation times of the two micro-requests does not exceed the preset threshold; if so, the load balancing constraint is met.
6. The method for optimizing the effective throughput of large language model inference according to claim 4, characterized in that adjusting the split point position starting from the initial splitting scheme and continuing to iterate through simulated scheduling to determine whether the latency constraint and the load balancing constraint are met, until the optimal splitting scheme is obtained, includes: Based on the determination of whether the latency constraint and the load balancing constraint are met, the split point position is adjusted as follows: The logical request is split into two micro-requests at the split point position, called the preceding micro-request and the following micro-request; if the estimated computation time of the preceding micro-request is greater than that of the following micro-request, the split point position k is moved forward by a preset step size $\Delta$; otherwise, the split point position k is moved backward by the preset step size $\Delta$. The split point position is updated to $k \leftarrow k - \Delta$ or $k \leftarrow k + \Delta$, respectively, where $\leftarrow$ is an assignment operator; Simulated scheduling then determines whether the latency constraint and the load balancing constraint are met. If an optimal splitting scheme that satisfies the latency constraint and the load balancing constraint is not obtained after reaching the maximum number of iterations, a delay time is set based on the historical scheduling results, and the iteration continues after the delay time has elapsed. If a splitting scheme that satisfies the latency constraint and the load balancing constraint is not obtained after the specified number of delays, the initial splitting scheme is taken as the optimal splitting scheme.
7. The method for optimizing the effective throughput of large language model inference according to claim 1 or 6, characterized in that sending each micro-request to the execution module in sequence includes: The two micro-requests included in the optimal segmentation scheme are the preceding micro-request and the following micro-request obtained by segmenting the logical request at the segmentation point position corresponding to the optimal segmentation scheme. The preceding micro-request is sent to the execution module. After the execution result of the preceding micro-request returned by the execution module is received, the following micro-request is sent to the execution module.
8. An effective throughput optimization system for large language model inference, characterized in that it is used to implement the method according to any one of claims 1 to 7 and comprises: The scheduling module, which receives inference requests and estimates the decoding length to determine the number of predicted generated tokens, thereby constructing a logical request. It then segments the logical request to obtain an initial segmentation scheme. Simulated scheduling is used to determine whether the latency constraint and the load balancing constraint are met. If not, the initial segmentation scheme is used as the starting point to adjust the segmentation point position, and simulated scheduling is used to continue determining whether the latency constraint and the load balancing constraint are met. This process iterates until an optimal segmentation scheme is obtained, which includes the two micro-requests obtained after the logical request has been segmented. The distribution module, which is used to send each micro-request to the execution module one by one in sequence; The execution module, which is used to perform inference on the received micro-requests by calling the corresponding execution resources.
9. A processing device, characterized in that it comprises: One or more processors; A memory, used to store one or more programs; Wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method as described in any one of claims 1 to 7.
10. A readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the method as described in any one of claims 1 to 7.