Resource allocation method and electronic device
By intercepting kernel execution requests and dynamically adjusting the resource ratio between pre-filling and decoding processes, the problem of GPU resource contention in large language models is solved, achieving more efficient resource utilization and performance goals.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- LENOVO (BEIJING) LTD
- Filing Date
- 2026-02-26
- Publication Date
- 2026-06-19
AI Technical Summary
In the process of large-scale language model inference, the pre-filling process and the decoding process compete for GPU resources due to differences in computational characteristics and resource requirements. The coarse-grained and slow-response resource adjustment methods in the existing technology are difficult to adapt to the dynamically changing workload in real time, resulting in low GPU resource utilization.
By intercepting kernel execution requests, obtaining grid size and block size, and combining this with the total number of GPU streaming multiprocessors, the scaling ratio of pre-filling and decoding processes is dynamically adjusted to achieve fine-grained resource allocation and fast-response resource management at the kernel level.
It improves the overall utilization of GPU resources, enables fine-grained adaptation to dynamic workloads, and enhances resource utilization efficiency and the achievement rate of performance targets.
Figure CN122240297A_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of computer technology, and in particular to a resource allocation method and electronic device.
Background Technology
[0002] In large-scale language model inference, the pre-filling process and the decoding process have inherent differences in computational characteristics and resource requirements, leading to competition when sharing Graphics Processing Unit (GPU) resources. Related technologies typically adjust these processes based on fixed time intervals or static quotas, resulting in coarse-grained adjustments and delayed responses. Due to the dynamic and diverse nature of inference requests, this coarse-grained, slow-response adjustment method struggles to adapt to constantly changing workloads in real time, leading to low GPU resource utilization.
Summary of the Invention
[0003] This disclosure provides a resource allocation method and an electronic device.
[0004] According to one aspect of this disclosure, a resource allocation method is provided, comprising: intercepting kernel execution requests of a pre-filling process and a decoding process, and obtaining the grid size and block size in each kernel execution request; determining the predicted number of streaming multiprocessors required to execute the kernel based on the grid size, the block size, and the total number of streaming multiprocessors of the GPU; obtaining performance metrics of the pre-filling process and the decoding process, the performance metrics characterizing the latency with which each process generates tokens; determining a scaling ratio for each of the pre-filling process and the decoding process based on the performance metrics, the scaling ratio being used to scale the predicted number of streaming multiprocessors; and determining the actual number of streaming multiprocessors used to execute the kernels of each of the pre-filling process and the decoding process based on the scaling ratio and the predicted number of streaming multiprocessors.
[0005] According to embodiments of this disclosure, determining the scaling ratio of the pre-filling process and the decoding process based on performance metrics includes: comparing the performance metrics of each process with a preset threshold for each process; and adjusting the scaling ratio of the pre-filling process and the decoding process based on the comparison results.
[0006] According to embodiments of this disclosure, adjusting the scaling ratios of the pre-filling process and the decoding process based on the comparison results includes: when the performance metric of any process is greater than or equal to its preset threshold, increasing the scaling ratio of that process and decreasing the scaling ratios of the other processes; and when the performance metric of any process is less than its preset threshold, resetting the scaling ratio of that process to a baseline value.
[0007] According to embodiments of this disclosure, determining the predicted number of streaming multiprocessors required to execute a kernel based on the grid size, block size, and the total number of streaming multiprocessors in the GPU includes: determining an initial number of streaming multiprocessors based on the grid size, block size, and the total number of streaming multiprocessors in the GPU, wherein the initial number of streaming multiprocessors is the number of streaming multiprocessors required to meet theoretical performance; obtaining a performance ratio, wherein the performance ratio characterizes the ratio between target performance and theoretical performance; and determining a predicted number of streaming multiprocessors based on the performance ratio and the initial number of streaming multiprocessors, wherein the predicted number of streaming multiprocessors is the number of streaming multiprocessors required to meet the target performance.
[0008] According to embodiments of this disclosure, obtaining the performance ratio includes: obtaining an initial value for the performance ratio; obtaining the GPU's streaming multiprocessor utilization when the GPU simultaneously executes the kernels of the pre-filling process and the decoding process during a historical period; and adjusting the initial value for the performance ratio based on the streaming multiprocessor utilization to obtain the performance ratio.
[0009] According to embodiments of this disclosure, obtaining performance metrics for the pre-filling process and the decoding process includes: monitoring token generation events of the pre-filling process and the decoding process; in response to detecting that any process has finished generating a token, recording the completion timestamp of that token generation; calculating the most recent token generation latency for each process based on the recorded completion timestamps; and taking the most recent token generation latency of each process as the performance metric of that process.
[0010] According to embodiments of this disclosure, determining the actual number of streaming multiprocessors for each kernel executing the pre-filling process and the decoding process based on the scaling ratio and the predicted number of streaming multiprocessors includes: determining an allocation ratio based on the scaling ratio and the predicted number of streaming multiprocessors, wherein the allocation ratio characterizes the proportional relationship between the number of streaming multiprocessors allocated to different processes; and determining the actual number of streaming multiprocessors for each kernel executing the pre-filling process and the decoding process based on the allocation ratio and the total number of streaming multiprocessors of the GPU.
[0011] According to embodiments of this disclosure, determining the actual number of streaming multiprocessors used to execute the kernels of each of the pre-filling process and the decoding process based on the allocation ratio and the total number of streaming multiprocessors of the GPU includes: determining the streaming multiprocessor allocation status of the peer process of the kernel based on the allocation ratio; and, when the allocation status indicates that the peer process of the kernel has not been allocated any streaming multiprocessors, determining the total number of streaming multiprocessors of the GPU as the actual number of streaming multiprocessors used to execute the kernel.
[0012] According to embodiments of this disclosure, the method further includes: adjusting the kernel execution request based on the actual number of streaming multiprocessors, sending the adjusted request to the GPU, and updating the execution status identifier of the process to which the kernel execution request belongs to indicate that the kernel of the process is executing; monitoring GPU event callbacks associated with the kernel execution request; and updating the execution status identifier in response to the GPU event callback being triggered to indicate that the kernel execution of the process is complete.
[0013] Another aspect of this disclosure provides an electronic device comprising: an interface for data interaction with a GPU; and a processor for: intercepting kernel execution requests from a pre-filling process and a decoding process, and obtaining a grid size and a block size from each kernel execution request; determining, based on the grid size, the block size, and the total number of streaming multiprocessors of the GPU, the predicted number of streaming multiprocessors required to execute the kernel; obtaining performance metrics for the pre-filling process and the decoding process, the performance metrics characterizing the latency with which each process generates tokens; determining, based on the performance metrics, a scaling ratio for each of the pre-filling process and the decoding process, the scaling ratio being used to scale the predicted number of streaming multiprocessors; and determining, based on the scaling ratio and the predicted number of streaming multiprocessors, the actual number of streaming multiprocessors used to execute the kernels of each of the pre-filling process and the decoding process.
[0014] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.
Description of the Drawings
[0015] The accompanying drawings are provided to better understand this solution and do not constitute a limitation of this disclosure. Wherein:
[0016] Figure 1 is a flowchart of a resource allocation method according to an embodiment of the present disclosure;
[0017] Figure 2 is a schematic diagram of a resource allocation method according to an embodiment of the present disclosure;
[0018] Figure 3 is a schematic diagram of a resource allocation method according to another embodiment of the present disclosure;
[0019] Figure 4 is a schematic diagram of a resource allocation method according to yet another embodiment of the present disclosure;
[0020] Figure 5 is a schematic diagram of the structure of an electronic device according to an embodiment of the present disclosure; and
[0021] Figure 6 is a schematic block diagram of an example electronic device used to implement embodiments of the present disclosure.
Detailed Implementation
[0022] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding, and should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.
[0023] In the technical solutions disclosed herein, the collection, storage, use, processing, transmission, provision, disclosure, and application of data (including but not limited to user personal information) comply with the provisions of relevant laws and regulations, necessary confidentiality measures have been taken, and they do not violate public order and good morals.
[0024] In large language model inference services, the inference process is typically divided into a prefill (P) stage and a decoding (D) stage. To improve the overall utilization of GPU resources, a "semi-converged" architecture has been proposed, in which the P and D stages are deployed as independent processes on the same GPU card. Under this architecture, streaming multiprocessor isolation technology is needed to allocate and limit the proportion of computing resources available to the P and D processes, in order to balance the resource competition arising from their different computational characteristics.
[0025] In related technologies, resource allocation methods suffer from slow response times when adjusting resources. Common solutions rely on statistically analyzing performance metrics over fixed time periods before making adjustments, failing to adapt to the dynamic and diverse nature of inference requests at a finer time granularity. Furthermore, these solutions depend on extensive and time-consuming offline performance profiling to pre-determine kernel execution times under different configurations, increasing deployment complexity and cost. Some solutions involve process restarts during resource adjustments, resulting in significant performance overhead.
[0026] In some examples, deficiencies in the related technologies make it difficult for resource allocation methods to achieve precise, real-time, and low-overhead dynamic allocation of GPU computing resources. They are unable to consistently guarantee the respective latency targets of the pre-filling and decoding processes under continuously changing loads, resulting in low GPU resource utilization.
[0027] Figure 1 is a flowchart of a resource allocation method according to an embodiment of the present disclosure.
[0028] As shown in Figure 1, the resource allocation method in this embodiment includes operations S110 to S150.
[0029] In operation S110, the kernel execution requests of the pre-filling process and the decoding process are intercepted, and the grid size and block size in the kernel execution request are obtained.
[0030] In embodiments of this disclosure, an inference request is obtained, which includes an input prompt. The pre-filling process receives and processes this input prompt and generates the first output token, and the decoding process then continues generating subsequent tokens to complete the response to the entire inference request.
[0031] In embodiments of this disclosure, the pre-filling process refers to the process in large language model inference that is responsible for processing the complete input prompt and generating the first output token. It is computationally intensive, typically involving a complete forward pass of the model.
[0032] In the embodiments of this disclosure, the decoding process refers to the process in large language model inference that is responsible for generating, in an autoregressive manner, all output tokens after the first one. Its computation pattern is memory-bound iterative computation.
[0033] In the embodiments of this disclosure, a kernel execution request refers to an instruction or call initiated by the Central Processing Unit (CPU) host to request the execution of a specific computing task (i.e., a kernel) on the GPU device. The kernel execution request contains all the configuration information required to execute the computing task.
[0034] In embodiments of this disclosure, the grid size refers to the number of thread blocks contained within a grid defining a kernel execution configuration in a GPU parallel computing model. The block size refers to the number of threads contained within a thread block in a GPU parallel computing model.
[0035] In embodiments of this disclosure, kernel execution requests from the pre-filling process and the decoding process are intercepted to obtain the grid size and block size from the kernel execution requests. This is achieved by intercepting each kernel execution request that the pre-filling process and the decoding process attempt to submit to the GPU for execution. From the intercepted kernel execution requests, parameters describing the scale of parallel kernel execution, namely the grid size and block size, are extracted.
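The interception step above can be illustrated with a minimal sketch. The `LaunchRequest` type, the `make_interceptor` helper, and the stand-in launch function are hypothetical names introduced only for illustration; the disclosure does not specify how the interception is implemented, and a real interception library would hook the GPU runtime's launch call rather than a Python function.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Dim3 = Tuple[int, int, int]


@dataclass(frozen=True)
class LaunchRequest:
    """Hypothetical stand-in for an intercepted kernel execution request."""
    process: str       # "prefill" or "decode"
    kernel_name: str
    grid: Dim3         # thread blocks per grid along x, y, z
    block: Dim3        # threads per block along x, y, z


def grid_size(req: LaunchRequest) -> int:
    """Total number of thread blocks in the grid."""
    x, y, z = req.grid
    return x * y * z


def block_size(req: LaunchRequest) -> int:
    """Total number of threads in one thread block."""
    x, y, z = req.block
    return x * y * z


def make_interceptor(real_launch: Callable, on_intercept: Callable) -> Callable:
    """Wrap a launch function so every request is inspected before it is forwarded."""
    def intercepted(req: LaunchRequest):
        on_intercept(req, grid_size(req), block_size(req))
        return real_launch(req)
    return intercepted


# Usage: record (process, grid size, block size) for each launch, then forward it.
seen = []
launch = make_interceptor(
    lambda req: "launched:" + req.kernel_name,          # stand-in for the real launch
    lambda req, g, b: seen.append((req.process, g, b)),
)
result = launch(LaunchRequest("decode", "attention", (64, 2, 1), (256, 1, 1)))
```

Here the grid size is taken as the product of the three grid dimensions and the block size as the product of the three block dimensions, matching the definitions given above.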
[0036] In operation S120, based on the grid size and block size, as well as the total number of streaming multiprocessors of the GPU, the predicted number of streaming multiprocessors required to execute the kernel is determined.
[0037] In embodiments of this disclosure, the total number of streaming multiprocessors in a GPU refers to the total number of streaming multiprocessors (SMs) contained within the GPU. SMs are the core units in which the GPU performs computations. For example, a certain model of GPU may contain 80 SMs.
[0038] In embodiments of this disclosure, the predicted number of streaming multiprocessors refers to the estimated number of SMs required when the kernel is executed, based on the kernel's parallel scale (grid size and block size) and GPU hardware specifications (total number of streaming multiprocessors of the GPU).
[0039] In embodiments of this disclosure, the predicted number of streaming multiprocessors required to execute the kernel is determined based on the grid size, block size, and the total number of streaming multiprocessors in the GPU. Based on the grid size, block size, and the total number of streaming multiprocessors in the GPU obtained in the preceding steps, a computational model estimates the amount of SM resources required for the kernel to achieve the expected computational throughput.
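The disclosure does not give the computational model at this point, so the following is only one plausible sketch of such an estimate. It assumes that an SM can keep a fixed number of threads resident (`max_threads_per_sm`, a hypothetical parameter set to 2048 here) and ignores register and shared-memory limits, which also constrain occupancy on real hardware.

```python
import math


def estimate_sm_count(grid_size: int, block_size: int, total_sms: int,
                      max_threads_per_sm: int = 2048) -> int:
    """Estimate how many SMs a kernel needs to run all of its blocks concurrently.

    Assumption: occupancy is limited only by resident threads per SM.
    """
    if grid_size <= 0 or block_size <= 0 or total_sms <= 0:
        raise ValueError("grid_size, block_size and total_sms must be positive")
    # How many thread blocks fit on one SM at the same time (at least one).
    blocks_per_sm = max(1, max_threads_per_sm // block_size)
    # SMs needed to host every block in a single wave.
    needed = math.ceil(grid_size / blocks_per_sm)
    # A kernel can never use more SMs than the GPU has.
    return min(total_sms, needed)


small = estimate_sm_count(grid_size=128, block_size=256, total_sms=80)
large = estimate_sm_count(grid_size=4096, block_size=256, total_sms=80)
```

With 256-thread blocks, eight blocks fit per SM under this assumption, so a 128-block grid is estimated at 16 SMs, while a 4096-block grid saturates the 80-SM GPU from the example above.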
[0040] In operation S130, performance metrics of the pre-filling process and the decoding process are obtained. The performance metrics characterize the latency of the process in generating tokens.
[0041] In the embodiments of this disclosure, a token refers to the basic unit of text processed by the large language model, which can be a character, a word, or a subword. Generating tokens is the core output action of the large language model's inference.
[0042] In the embodiments of this disclosure, performance metrics refer to measures used to quantitatively evaluate the execution efficiency of the pre-filling process or the decoding process. In the scheme of this embodiment, performance metrics specifically refer to latency metrics related to token generation speed.
[0043] In embodiments of this disclosure, the latency of generating a token refers to the time taken from the start of processing to the successful output of a token. For the pre-filling process, this may refer to the latency of generating the first token. For the decoding process, this may refer to the average latency of generating each subsequent token.
[0044] In the embodiments of this disclosure, key latency data reflecting the current execution efficiency of the pre-filling and decoding processes is monitored and collected in real time. Performance metrics directly reflect whether the processes meet their service level objectives.
[0045] For example, the monitoring system shows that the most recent pre-filling request took 150 milliseconds from start to output of the first token; this is the current performance metric of the pre-filling process. At the same time, the decoding process is measured to take an average of 50 milliseconds to generate each token; this is the current performance metric of the decoding process.
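A minimal sketch of how such latency metrics could be derived from token completion timestamps follows. The `TokenLatencyTracker` class and its method names are hypothetical, timestamps are passed in as integer milliseconds for clarity, and the sketch tracks only the most recent latency per process; an average over several tokens could be substituted.

```python
from typing import Dict, Optional


class TokenLatencyTracker:
    """Tracks the most recent token generation latency per process (in ms)."""

    def __init__(self) -> None:
        self._last_event_ms: Dict[str, int] = {}   # previous timestamp per process
        self.latest_ms: Dict[str, int] = {}        # most recent latency per process

    def start(self, process: str, now_ms: int) -> None:
        """Mark the moment a process begins working on a request."""
        self._last_event_ms[process] = now_ms

    def token_done(self, process: str, now_ms: int) -> Optional[int]:
        """Record a token completion and return the latency since the previous event."""
        previous = self._last_event_ms.get(process)
        self._last_event_ms[process] = now_ms
        if previous is not None:
            self.latest_ms[process] = now_ms - previous
        return self.latest_ms.get(process)


tracker = TokenLatencyTracker()
tracker.start("prefill", 0)
first_token_latency = tracker.token_done("prefill", 150)   # time to first token
tracker.start("decode", 150)
tracker.token_done("decode", 200)
per_token_latency = tracker.token_done("decode", 250)      # latest per-token latency
```

In this run the pre-filling metric is 150 ms and the decoding metric is 50 ms, the same figures used in the example above.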
[0046] In operation S140, based on the performance metrics, the scaling ratio of each of the pre-filling process and the decoding process is determined. The scaling ratio is used to scale the predicted number of streaming multiprocessors.
[0047] In embodiments of this disclosure, the scaling ratio refers to a dynamically adjusted coefficient whose value is determined based on the performance metrics of the corresponding process. The scaling ratio is used to scale up or down the predicted number of streaming multiprocessors obtained in the preceding steps to achieve dynamic reallocation of resources.
[0048] In the embodiments of this disclosure, the scaling ratios of the pre-filling process and the decoding process are determined based on performance metrics. The resource weight allocated to each process is dynamically adjusted according to its scaling ratio. If a process experiences excessive latency (its performance target is not met), its scaling ratio is increased so that its subsequent kernels are allocated more SM resources; otherwise, its ratio can be decreased or reset.
[0049] In operation S150, based on the scaling ratio and the predicted number of streaming multiprocessors, the actual number of streaming multiprocessors used to execute the pre-filling process and the decoding process is determined.
[0050] In embodiments of this disclosure, the actual number of streaming multiprocessors refers to the number of streaming multiprocessors (SMs) that is ultimately determined as the limit the GPU may use when executing a particular kernel. It is the result of scaling the predicted number of streaming multiprocessors by the scaling ratio.
[0051] In embodiments of this disclosure, the actual number of SMs used to execute the kernels of each of the pre-filling process and the decoding process is determined based on the scaling ratio and the predicted number of SMs. The dynamic adjustment strategy of each process (the scaling ratio) is applied to the kernel's resource requirement baseline (the predicted number of SMs) to calculate the number of SMs to which the kernel is limited during this execution.
[0052] For example, a decoding process kernel about to be launched is predicted to require 20 SMs. The scaling ratio of the current decoding process has been adjusted to 0.9. Therefore, the actual number of streaming multiprocessors for this kernel is determined to be 20 × 0.9 = 18. This means that when launching a kernel execution request corresponding to this kernel, it will be restricted to executing on a maximum of 18 SMs.
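The arithmetic in this example can be sketched as follows. Rounding to the nearest integer and clamping the result between 1 and the GPU's total SM count are assumptions added for illustration; the disclosure only states that the predicted number is scaled by the ratio.

```python
def actual_sm_count(predicted_sms: int, scaling_ratio: float, total_sms: int) -> int:
    """Scale the predicted SM count by the process's scaling ratio.

    Assumptions: the result is rounded to the nearest integer and clamped
    to the range [1, total_sms].
    """
    scaled = round(predicted_sms * scaling_ratio)
    return max(1, min(total_sms, scaled))


# The decoding kernel from the example: 20 predicted SMs, scaling ratio 0.9.
decode_limit = actual_sm_count(predicted_sms=20, scaling_ratio=0.9, total_sms=80)

# A boosted kernel whose scaled demand exceeds the GPU is capped at the total.
capped_limit = actual_sm_count(predicted_sms=70, scaling_ratio=1.2, total_sms=80)
```

The first call yields the limit of 18 SMs described above; the second shows the assumed cap at the 80 SMs available on the GPU.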
[0053] Through the embodiments of this disclosure, by intercepting kernel execution requests and analyzing their execution scale, the computational resources required by each kernel can be predicted. By monitoring token generation latency in real time, the resource allocation ratio is adjusted adaptively and dynamically, and this ratio is applied to the resource configuration of each kernel. The method of this embodiment achieves fine-grained, fast-response resource allocation at the kernel level, thereby adapting better to dynamic workloads and improving the overall utilization of GPU resources.
[0054] Figure 2 is a schematic diagram of a resource allocation method according to an embodiment of the present disclosure.
[0055] As shown in Figure 2, the figure illustrates the complete workflow in which the CPU and the GPU cooperate to achieve dynamic resource scheduling in a large language model inference service.
[0056] In the embodiments of this disclosure, the inference engine is the core software framework of the large language model inference service and runs on the CPU. It manages the lifecycle of user requests and contains two dedicated subprocesses, the pre-filling process and the decoding process, which it coordinates to complete the entire flow from receiving a prompt to generating a complete response.
[0057] In embodiments of this disclosure, the pre-filling process is the component of the inference engine responsible for handling the first stage of a new request. It is computationally intensive, processing the entire input prompt in parallel in one pass to generate the first token. The decoding process is the component of the inference engine responsible for handling the subsequent stages. It is characterized by iterative computation, generating the complete response autoregressively.
[0058] In the embodiments of this disclosure, the interception library is a functional module injected into the pre-filling process and decoding process of the inference engine in the form of a library file. The interception library intercepts each kernel execution request before it is launched to the GPU and makes real-time decisions and adjustments based on the system state obtained from shared memory, thus acting as an executor for fine-grained, dynamic resource management.
[0059] In the embodiments of this disclosure, shared memory is located on the CPU side. It is a memory region that the pre-filling process and the decoding process can both access quickly and concurrently, and it is used to exchange coordination information between the two processes in real time. For example, it can record whether each process currently has a kernel executing on the GPU, thereby providing an immediate basis for the interception library's decisions.
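As a rough illustration of such a shared status region, the sketch below uses Python's standard `multiprocessing.shared_memory` module with one flag byte per process. The byte layout and the helper names `set_running` and `peer_is_running` are hypothetical, and for brevity a single process plays both roles; in a real deployment each process would attach to the region by name, and updates would need appropriate synchronization.

```python
from multiprocessing import shared_memory

# Hypothetical layout: one status byte per process (0 = idle, 1 = kernel running).
PREFILL, DECODE = 0, 1


def set_running(shm: shared_memory.SharedMemory, proc: int, running: bool) -> None:
    """Publish whether the given process currently has a kernel on the GPU."""
    shm.buf[proc] = 1 if running else 0


def peer_is_running(shm: shared_memory.SharedMemory, proc: int) -> bool:
    """Check the status flag of the other process."""
    return shm.buf[1 - proc] == 1


shm = shared_memory.SharedMemory(create=True, size=2)
try:
    set_running(shm, PREFILL, False)
    set_running(shm, DECODE, False)
    # The pre-filling process launches a kernel and marks itself as running.
    set_running(shm, PREFILL, True)
    decode_sees_peer = peer_is_running(shm, DECODE)     # decode checks prefill
    prefill_sees_peer = peer_is_running(shm, PREFILL)   # prefill checks decode
finally:
    shm.close()
    shm.unlink()   # release the region once it is no longer needed
```

A flag of this kind is what would let an interception library decide, before launching a kernel, whether the peer process is currently occupying SMs.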
[0060] In embodiments of this disclosure, a graphics processing unit (GPU) is a hardware device that performs all computationally intensive tasks, such as matrix multiplication and attention calculation. The GPU is the physical unit that actually performs model computations throughout the process.
[0061] In embodiments of this disclosure, a user issues an inference request. The user (human or client application) submits a request, such as a question or a command. The inference request is sent to the inference engine, which, for a new request, first hands it over to its internal pre-filling process. Before the pre-filling process submits its computational task (kernel) to the GPU, its internal interception library is activated. The interception library intercepts the upcoming GPU kernel execution request and reads and updates the state information in the CPU-side shared memory.
[0062] Furthermore, after the interception library makes its judgments and any adjustments based on the content of shared memory, it submits the computation task to the GPU for actual parallel computation. After the first token is generated, handling of the request is transferred to the decoding process within the inference engine. The task of this process is to generate all subsequent output tokens one by one in an autoregressive manner. Similarly, before each kernel launch, the interception library within the decoding process also performs interception and coordination operations. Once the decoding process has generated the complete response text, the inference engine returns the final result to the user as the response to the request.
[0063] In some embodiments of this disclosure, determining the scaling ratio of the pre-filling process and the decoding process based on performance metrics includes: comparing the performance metrics of each process with a preset threshold for each process; and adjusting the scaling ratio of the pre-filling process and the decoding process based on the comparison results.
[0064] In the embodiments of this disclosure, the performance metric refers to the latency of the pre-filling process or the decoding process in generating tokens. For the pre-filling process, the performance metric may be the time to generate the first token. For the decoding process, the performance metric may be the latency per output token. The performance metric is key data reflecting the current real-time performance of the process.
[0065] In the embodiments of this disclosure, the preset threshold for each process refers to a latency target value pre-set for each process to determine whether its performance meets the target. This threshold is typically set according to service level objectives.
[0066] For example, a threshold of 150 milliseconds can be set for the first-token generation latency of the pre-filling process, and a threshold of 50 milliseconds can be set for the per-token generation latency of the decoding process.
[0067] In embodiments of this disclosure, the performance metrics of each process are compared with preset thresholds for each process. The system continuously acquires the performance metrics of each process and compares them with their respective pre-configured thresholds representing performance targets.
[0068] For example, the system measures the most recent first-token generation latency of the pre-filling process as 180 milliseconds, against a preset threshold of 150 milliseconds. The most recent per-token generation latency of the decoding process is measured as 45 milliseconds, against a preset threshold of 50 milliseconds. The comparison shows that the latency of the pre-filling process (180 ms) is greater than its threshold (150 ms), so its performance is not up to standard, while the latency of the decoding process (45 ms) is less than its threshold (50 ms), so its performance is up to standard.
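The comparison in this example can be written as a short sketch. The `THRESHOLDS_MS` table and the `meets_target` function are hypothetical names; treating a latency equal to the threshold as not meeting the target follows the "greater than or equal to" rule stated in the summary.

```python
from typing import Dict

# Preset latency thresholds per process, in milliseconds (values from the example).
THRESHOLDS_MS: Dict[str, float] = {"prefill": 150.0, "decode": 50.0}


def meets_target(process: str, latency_ms: float,
                 thresholds: Dict[str, float] = THRESHOLDS_MS) -> bool:
    """A process meets its target only when its latency is below its threshold."""
    return latency_ms < thresholds[process]


measured = {"prefill": 180.0, "decode": 45.0}
status = {process: meets_target(process, latency)
          for process, latency in measured.items()}
```

For the measured values above, the pre-filling process is flagged as not meeting its target and the decoding process as meeting it.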
[0069] In the embodiments of this disclosure, the comparison result refers to the conclusion drawn after comparing the performance index with a preset threshold, that is, whether the performance index is greater than, equal to, or less than the threshold. This result can indicate the current performance status of the corresponding process (such as "unsatisfactory", "satisfactory", or "excellent").
[0070] In the embodiments of this disclosure, the scaling ratios of the pre-filling process and the decoding process are adjusted based on the comparison results. According to the comparison results, the system adopts corresponding strategies to update the scaling ratios of the pre-filling process and the decoding process, thereby changing the resource quotas that the kernel can subsequently obtain. The purpose of the adjustment is to reallocate computing resources so that processes that do not meet performance standards can obtain more resources to improve performance, or to release excess resources when performance standards are met.
[0071] Through the embodiments of this disclosure, a closed-loop feedback control system is established by continuously comparing real-time performance indicators with preset thresholds and dynamically adjusting resource allocation ratios accordingly. This enables more refined resource scheduling and effectively guarantees the service quality of each process.
[0072] In some embodiments of this disclosure, the scaling ratios of the pre-filling process and the decoding process are adjusted based on the comparison results, including: when the performance index of any process is greater than or equal to a preset threshold, increasing the scaling ratio of that process and decreasing the scaling ratios of other processes; when the performance index of any process is less than the preset threshold, resetting the scaling ratio of that process to a baseline value.
[0073] In the embodiments of this disclosure, when the performance metric of any process is greater than or equal to its preset threshold, the scaling ratio of that process is increased and the scaling ratios of the other processes are decreased. This step defines the resource reallocation strategy applied when the performance of a process is detected to be substandard. Its core logic is to immediately increase the resource weight (scaling ratio) of the lagging process and reduce the resource weight of the better-performing process by an equal amount, thereby shifting resources toward the lagging process while the total amount of resources stays constant.
[0074] For example, suppose the threshold for the pre-filling process is 100 milliseconds and its current first-token latency is 120 milliseconds (greater than the threshold), which is substandard. Meanwhile, the threshold for the decoding process is 50 milliseconds and its current per-token latency is 40 milliseconds (less than the threshold). In this case, the system increases the scaling ratio of the pre-filling process by 0.1 (e.g., from 1.0 to 1.1) and decreases the scaling ratio of the decoding process by 0.1 (e.g., from 1.0 to 0.9). This allows subsequent kernels of the pre-filling process to obtain more SM resources in an attempt to reduce its latency.
[0075] In embodiments of this disclosure, when the performance metric of any process falls below a preset threshold, the scaling ratio of that process is reset to a baseline value. The baseline value refers to the initial default or neutral value of the scaling ratio. The baseline value is typically set to 1.0, indicating that no additional scaling is performed, and the predicted number of streaming multiprocessors is used directly.
[0076] In the embodiments of this disclosure, the steps of this embodiment define a resource release and state reset strategy when it is detected that the performance of any process has met the requirements. The core logic is that when the performance of a process has met the requirements, its resource weight is restored to the default baseline value, thereby releasing any additional resources that may have been occupied due to previous adjustments, making them available for system rescheduling and avoiding the rigid occupation of resources.
[0077] For example, continuing from the previous example, after adjustment, the latency of the pre-filling process improves to 90 milliseconds (less than its 100 millisecond threshold). The system resets the scaling ratio of the pre-filling process to the baseline value of 1.0. This means that the resource bias in its favor has been removed and it returns to the standard resource allocation. The 0.1 reduction applied to the decoding process may likewise be restored when the decoding process triggers its own adjustment in the future.
[0078] Through the embodiments of this disclosure, resource adjustment behavior is made to have clear direction and reversibility. The method of this embodiment can not only respond quickly to performance degradation and accurately support lagging processes, but also release resources in a timely manner after performance targets are met to prevent over-allocation, thereby achieving a balance between resource utilization and performance target achievement rate.
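The adjustment rule of paragraphs [0072] to [0077] can be illustrated with a minimal sketch. It assumes the adjustment is triggered per process each time that process reports a new performance index, which is one reading consistent with the examples above. The function name `update_scaling`, the process labels, and the 0.1 step are illustrative assumptions, not part of the disclosure.

```python
def update_scaling(ratios, process, metric, threshold, step=0.1, baseline=1.0):
    """Return updated scaling ratios after `process` reports a new metric.

    ratios: dict mapping process name -> current scaling ratio.
    A metric at or above the threshold means the process is lagging.
    """
    updated = dict(ratios)
    if metric >= threshold:
        # Lagging process: raise its ratio and lower every other ratio
        # by the same step, so the sum stays constant for two processes.
        updated[process] = round(updated[process] + step, 6)
        for other in updated:
            if other != process:
                updated[other] = round(updated[other] - step, 6)
    else:
        # Target met: release any extra weight by returning to the baseline.
        updated[process] = baseline
    return updated


ratios = {"prefill": 1.0, "decode": 1.0}
ratios = update_scaling(ratios, "prefill", metric=120, threshold=100)
after_boost = dict(ratios)
ratios = update_scaling(ratios, "prefill", metric=90, threshold=100)
after_reset = dict(ratios)
ratios = update_scaling(ratios, "decode", metric=40, threshold=50)
```

In this sketch the decoding ratio stays at 0.9 after the pre-filling process is reset and only returns to 1.0 when the decoding process reports its own metric, mirroring paragraph [0077].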
[0079] In some embodiments of this disclosure, determining the predicted number of streaming multiprocessors required to execute the kernel based on the grid size, block size, and the total number of streaming multiprocessors in the GPU includes: determining an initial number of streaming multiprocessors based on the grid size, block size, and the total number of streaming multiprocessors in the GPU, wherein the initial number of streaming multiprocessors is the number of streaming multiprocessors required to meet the theoretical performance; obtaining a performance ratio, wherein the performance ratio characterizes the ratio between the target performance and the theoretical performance; and determining the predicted number of streaming multiprocessors based on the performance ratio and the initial number of streaming multiprocessors, wherein the predicted number of streaming multiprocessors is the number of streaming multiprocessors required to meet the target performance.
[0080] In embodiments of this disclosure, the initial number of streaming multiprocessors refers to the minimum number of streaming multiprocessors required for a kernel to operate at its highest theoretical computational throughput (i.e., fully utilizing GPU computing units without bottlenecking due to resource shortages) under ideal conditions. The initial number of streaming multiprocessors reflects the minimum resource requirements of the kernel at peak performance.
[0081] In embodiments of this disclosure, theoretical performance refers to the maximum or peak computational throughput that the GPU hardware can provide under ideal conditions. Meeting theoretical performance means allocating just enough resources to the kernel so that all its thread blocks can be executed in parallel by streaming multiprocessors without waiting, to achieve the highest possible execution speed.
[0082] In embodiments of this disclosure, the initial number of streaming multiprocessors is determined based on the grid size, block size, and the total number of streaming multiprocessors in the GPU. Based on the kernel's parallelism scale (grid size, block size) and the GPU's hardware specifications (total number of streaming multiprocessors), a specific algorithm calculates the minimum number of streaming multiprocessors the kernel requires when there are no external constraints and maximum speed is the goal.
[0083] In embodiments of this disclosure, when the calculated minimum number of required streaming multiprocessors exceeds the total number of streaming multiprocessors, the total number of streaming multiprocessors is determined as the initial number of streaming multiprocessors. When the calculated minimum number of required streaming multiprocessors is less than or equal to the total number of streaming multiprocessors, the calculated minimum number of required streaming multiprocessors is determined as the initial number of streaming multiprocessors.
[0084] For example, let the grid size be M, the block size be N, the total number of streaming multiprocessors in the GPU be SMCount, and the number of compute units within each SM be CoreCount. The initial number of streaming multiprocessors is then $C_{100\%} = \min\{(M \cdot N) / \mathrm{CoreCount},\ \mathrm{SMCount}\}$.
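As a worked illustration of the formula in paragraph [0084], the following sketch computes the initial number of streaming multiprocessors. The function name and the sample values (128 compute units per SM, 80 SMs) are assumptions chosen for the example. The disclosure does not state how a fractional result is rounded, so the sketch returns the raw value.

```python
def initial_sm_count(grid_size, block_size, core_count, sm_count):
    """Initial SM count: min((M * N) / CoreCount, SMCount).

    grid_size (M): number of thread blocks in the kernel launch.
    block_size (N): number of threads per block.
    core_count: compute units per SM (hypothetical hardware value).
    sm_count: total number of SMs on the GPU.
    """
    required = (grid_size * block_size) / core_count
    # The requirement is capped at what the GPU physically has.
    return min(required, sm_count)


small_kernel = initial_sm_count(32, 256, core_count=128, sm_count=80)
large_kernel = initial_sm_count(256, 256, core_count=128, sm_count=80)
```

Here the small kernel needs 8192 / 128 = 64 SMs, which fits within the GPU, while the large kernel would need 512 and is therefore capped at the total of 80.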
[0085] In the embodiments of this disclosure, target performance refers to an acceptable non-peak performance level set in actual deployment based on the overall system resource scheduling strategy, quality of service requirements, or energy efficiency targets. Target performance is typically lower than theoretical performance in exchange for higher resource utilization and system throughput.
[0086] In embodiments of this disclosure, the performance ratio is a coefficient between 0% and 100% (or an equivalent ratio value, such as between 0 and 1), used to represent the percentage of desired target performance relative to theoretical performance. For example, a 70% performance ratio means that the goal is to execute the kernel at 70% of its theoretical peak performance.
[0087] In embodiments of this disclosure, a performance ratio is obtained. The system determines, based on the current scheduling policy, load conditions, or configuration, what percentage of theoretical performance is acceptable for a kernel to be executed. The performance ratio is an adjustable parameter.
[0088] For example, when the system load is high and multiple processes need to be managed simultaneously, a conservative performance ratio can be set, such as 60% (i.e., the target performance is 60% of the theoretical performance). This ratio can be obtained through policy configuration, historical load analysis, or adaptive algorithms.
[0089] In embodiments of this disclosure, the predicted number of streaming multiprocessors refers to the number of streaming multiprocessors expected to meet the target performance requirements for kernel execution, calculated by combining the initial number of streaming multiprocessors (theoretical requirement) and performance ratio.
[0090] In embodiments of this disclosure, a predicted number of streaming multiprocessors is determined based on a performance ratio and an initial number of streaming multiprocessors. This predicted number of streaming multiprocessors is the number of streaming multiprocessors required to meet the target performance. The performance ratio is then converted into a specific number of resources. By scaling the initial number according to the performance ratio (typically non-linear, as performance and resources are not simply proportional), a more economical and reasonable predicted resource value is obtained while achieving the target performance.
[0091] For example, once the initial number of streaming multiprocessors $C_{100\%}$ has been determined, the predicted number of streaming multiprocessors is $C_{n\%} = (M \cdot N) \,/\, \big( (M \cdot N) / C_{100\%} + ((100 - n) / 10) \cdot \mathrm{CoreCount} \big)$, where n% is the performance ratio.
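The formula in paragraph [0091] can be evaluated directly, as in the sketch below. The function name and the sample inputs are illustrative assumptions, and the result is left unrounded because the disclosure does not specify a rounding rule.

```python
def predicted_sm_count(grid_size, block_size, core_count, c100, n_percent):
    """Predicted SM count for a target of n_percent of theoretical performance.

    Implements (M*N) / ((M*N) / C_100% + ((100 - n) / 10) * CoreCount),
    where c100 is the initial SM count C_100%.
    """
    threads = grid_size * block_size
    denominator = threads / c100 + ((100 - n_percent) / 10) * core_count
    return threads / denominator


# Hypothetical kernel: 256 blocks of 256 threads, 128 compute units per SM,
# and an initial SM count of 80.
at_full = predicted_sm_count(256, 256, 128, c100=80, n_percent=100)
at_70 = predicted_sm_count(256, 256, 128, c100=80, n_percent=70)
```

At a performance ratio of 100% the second term of the denominator vanishes and the prediction equals the initial number. At 70% the same kernel is predicted to need between 54 and 55 SMs, which shows the non-linear relationship noted in paragraph [0090].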
[0092] Through embodiments of this disclosure, by introducing performance ratios, resource prediction is expanded from pursuing a single theoretical peak to a flexible estimation that can adapt to multiple performance targets. The method of this embodiment enables the system to proactively reduce resource allocation to non-critical kernels while ensuring basic performance, thereby improving the overall utilization of GPU streaming multiprocessors and system throughput.
[0093] In some embodiments of this disclosure, obtaining the performance ratio includes: obtaining an initial value for the performance ratio; obtaining the GPU's streaming multiprocessor utilization when the GPU simultaneously executes the kernels of the pre-filling process and the decoding process during a historical period; and adjusting the initial value for the performance ratio based on the streaming multiprocessor utilization to obtain the performance ratio.
[0094] In embodiments of this disclosure, the initial performance ratio refers to a starting or default value set for the performance ratio before dynamic adjustment begins. The initial performance ratio is typically determined based on experience, system configuration, or preliminary estimation, serving as a baseline for the adjustment process.
[0095] For example, the initial performance ratio can be set to 85%, which means that the initial goal is to achieve 85% of the theoretical performance.
[0096] In embodiments of this disclosure, an initial performance ratio value is obtained. The system loads or calculates the initial performance ratio value at startup or before scheduling a batch of kernels, which provides an initial strategy for resource allocation.
[0097] In embodiments of this disclosure, a historical period can refer to a recently passed time window used for sampling and analysis. For example, a historical period could be the most recent 100 milliseconds or the last scheduling cycle.
[0098] In the embodiments of this disclosure, the streaming multiprocessor utilization of a GPU refers to the percentage of time during which all streaming multiprocessors in the GPU are actively executing computational tasks within a historical time period. It directly reflects the busyness of the GPU's computing cores. For example, a 70% utilization rate means that, on average, 70% of the streaming multiprocessors (SMs) are working and 30% are idle during the sampling period.
[0099] In the embodiments of this disclosure, the streaming multiprocessor utilization of the GPU is obtained when the kernels of the pre-filling process and the decoding process are executed simultaneously during a historical time period. The system monitors the actual computing resource usage of the GPU during the period when the kernels of the pre-filling process and the decoding process are executed simultaneously. The streaming multiprocessor utilization can reflect the actual workload of the GPU under the current resource allocation strategy.
[0100] For example, the system detected that within the most recent 50-millisecond sampling window, both the pre-filling process and the decoding process had kernels executing on the GPU. By querying the GPU's performance counters, the average streaming multiprocessor utilization of the GPU during this period was calculated to be 70%.
[0101] In the embodiments of this disclosure, the initial value of the performance ratio is adjusted based on the streaming multiprocessor utilization to obtain the performance ratio. The system analyzes the streaming multiprocessor utilization over historical periods and adjusts the performance target (performance ratio) accordingly. If the streaming multiprocessor utilization is too low, it indicates that GPU computing resources are idle. Increasing the performance requirements for individual kernels (i.e., increasing the performance ratio) causes them to use more resources, thereby improving utilization. If the streaming multiprocessor utilization has reached or is close to saturation, it may be necessary to reduce the performance ratio so that the total resources requested do not exceed the limit, which could otherwise lead to queuing or contention.
[0102] For example, when streaming multiprocessor utilization is low (e.g., 70%), it indicates that GPU computing resources are not being fully utilized. The system can adjust the initial performance ratio upwards, for example from the initial 85% to 95%. The system will then expect subsequent kernels to run at a higher target performance (closer to the theoretical peak), so they request more streaming multiprocessors and GPU utilization improves.
[0103] When utilization is high (e.g., 100%), it indicates that GPU computing resources are saturated. Because the combined resources requested by the two processes at the current ratio may exceed the total GPU capacity, the system can adjust the initial performance ratio downwards, for example from 85% to 75%. The system lowers its performance expectations for individual kernels, causing them to request fewer streaming multiprocessors, thereby ensuring that total resource demand does not exceed supply and maintaining system stability.
[0104] Through the embodiments of this disclosure, by introducing a feedback mechanism based on streaming multiprocessor utilization, the performance ratio is changed from a static configuration to a dynamically adjustable parameter, realizing real-time matching between resource allocation strategy and actual GPU load. This improves utilization when resources are idle and avoids excessive competition when resources are scarce, thereby achieving better throughput and resource utilization efficiency.
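A minimal sketch of this utilization feedback is given below. The disclosure gives only example values (70% treated as low, 100% as high, and adjustments of ten percentage points), so the cut-off thresholds `low_util` and `high_util`, the step size, and the function name are all assumptions made for illustration.

```python
def adjust_performance_ratio(initial_percent, sm_util_percent,
                             low_util=80, high_util=95, step=10):
    """Adjust the performance ratio (in percent) from measured SM utilization.

    low_util, high_util and step are hypothetical tuning parameters.
    """
    if sm_util_percent < low_util:
        # SMs are idle: ask kernels to aim closer to theoretical peak.
        adjusted = initial_percent + step
    elif sm_util_percent >= high_util:
        # SMs are saturated: lower per-kernel targets to reduce contention.
        adjusted = initial_percent - step
    else:
        adjusted = initial_percent
    # Keep the ratio within the 0% to 100% range of paragraph [0086].
    return max(0, min(100, adjusted))


raised = adjust_performance_ratio(85, sm_util_percent=70)
lowered = adjust_performance_ratio(85, sm_util_percent=100)
```

With the assumed thresholds, the two calls reproduce the examples of paragraphs [0102] and [0103]: 85% becomes 95% at 70% utilization and 75% at 100% utilization.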
[0105] In some embodiments of this disclosure, obtaining performance metrics for the pre-filling process and the decoding process includes: monitoring token generation events of the pre-filling process and the decoding process; recording the completion timestamp of a token generation in response to detecting that any process has completed a token generation; calculating the most recent token generation delay for each process based on the recorded completion timestamp; and determining the most recent token generation delay for each process as the performance metric of that process.
[0106] In the embodiments of this disclosure, a token generation event refers to a specific, observable program behavior or state change point at which the pre-filling process or the decoding process successfully computes and outputs a complete token. For the pre-filling process, the token generation event specifically refers to the generation of the first output token. For the decoding process, the token generation event specifically refers to the generation of each subsequent output token.
[0107] In the embodiments of this disclosure, token generation events of the pre-filling process and the decoding process are monitored. The system can continuously observe the running status of the two processes by injecting hook functions, listening to specific function calls, or polling the output buffer, and wait for and identify token generation events.
[0108] For example, the system sets monitoring points in the execution path of the pre-filling process. When the process completes all calculations for the input prompt and writes the first resulting token "The" to the output buffer, the monitoring logic captures this change, triggering a token generation event. Similarly, when the decoding process iteratively generates the next token "cat," a separate event is also triggered.
[0109] In the embodiments of this disclosure, the completion timestamp refers to a high-precision current time value recorded by the system when the token generation event occurs. The completion timestamp marks the precise time point at which token generation is completed and serves as the endpoint for calculating delays. The completion timestamp can be obtained using a system clock or a high-performance timer, and its unit can be microseconds or nanoseconds.
[0110] In embodiments of this disclosure, in response to detecting that any process has completed a token generation, a completion timestamp for that token generation is recorded. Once a token generation event is detected, the system immediately captures and saves the current precise time, associating this time with that specific token generation.
[0111] In the embodiments of this disclosure, the most recent token generation delay refers to the time interval from the start of token processing to the completion of generation, calculated for each process based on the completion timestamp of its most recently recorded token generation event. For the pre-filling process, the most recent token generation delay may be the Time To First Token (TTFT). For the decoding process, the most recent token generation delay may be the Time Per Output Token (TPOT).
[0112] In the embodiments of this disclosure, the most recent token generation delay for each process is calculated based on the recorded completion timestamp. The start time of the token processing (which can be obtained by recording the request start time or the completion time of the previous token) is subtracted from the completion timestamp to obtain the actual time spent generating the token.
[0113] In the embodiments of this disclosure, the most recent token generation delay for each process is determined as the performance metric for that process. For example, the system assigns the calculated pre-filling process delay of 123.456 milliseconds to the variable representing the pre-filling process performance metric. The calculated decoding process delay of 50 milliseconds is likewise assigned to the variable representing the decoding process performance metric. The system thereby obtains the latest performance feedback data for these two processes.
[0114] Through the embodiments of this disclosure, the token generation latency is captured and calculated in real time using an event-driven approach, so that performance metrics are obtained with very little delay and high accuracy. This provides immediate and accurate feedback signals for the dynamic resource adjustment algorithm, ensuring that the resource allocation strategy can quickly respond to real-time changes in workload.
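The event-driven latency measurement of paragraphs [0105] to [0113] can be sketched as follows. The class and method names are hypothetical. Timestamps are passed in explicitly so the behavior is reproducible; in practice a high-resolution clock such as Python's `time.perf_counter_ns()` could supply them.

```python
class TokenLatencyTracker:
    """Keeps the most recent token generation delay for each process."""

    def __init__(self):
        self._start_ns = {}          # process -> start of current token work
        self.latest_latency_ms = {}  # process -> most recent delay (ms)

    def mark_start(self, process, timestamp_ns):
        """Record when processing begins, e.g. when a request arrives."""
        self._start_ns[process] = timestamp_ns

    def on_token_generated(self, process, timestamp_ns):
        """Handle a token generation event with its completion timestamp."""
        start = self._start_ns.get(process)
        if start is not None:
            self.latest_latency_ms[process] = (timestamp_ns - start) / 1e6
        # The completion time of this token is the start time of the next
        # one, which gives per-token latency for the decoding process.
        self._start_ns[process] = timestamp_ns
        return self.latest_latency_ms.get(process)


tracker = TokenLatencyTracker()
tracker.mark_start("prefill", 0)
ttft_ms = tracker.on_token_generated("prefill", 120_000_000)
tracker.mark_start("decode", 120_000_000)
first_tpot_ms = tracker.on_token_generated("decode", 160_000_000)
second_tpot_ms = tracker.on_token_generated("decode", 210_000_000)
```

The values in `latest_latency_ms` are what would be compared against the preset thresholds when the scaling ratios are adjusted.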
[0115] In some embodiments of this disclosure, determining the actual number of streaming multiprocessors for each kernel executing the pre-filling process and the decoding process based on the scaling ratio and the predicted number of streaming multiprocessors includes: determining an allocation ratio based on the scaling ratio and the predicted number of streaming multiprocessors, wherein the allocation ratio characterizes the proportional relationship between the number of streaming multiprocessors allocated to different processes; and determining the actual number of streaming multiprocessors for each kernel executing the pre-filling process and the decoding process based on the allocation ratio and the total number of streaming multiprocessors of the GPU.
[0116] In the embodiments of this disclosure, the allocation ratio refers to an intermediate calculation variable whose specific value characterizes the relative amount of streaming multiprocessor resources that the pre-filling process and the decoding process should occupy at a certain moment or for a specific kernel scheduling decision. The allocation ratio is not a fixed value, but is dynamically calculated based on the current scaling ratio of each process and its predicted kernel resource requirements.
[0117] For example, the allocation ratio can be expressed as P process weight : D process weight, or it can be expressed as a specific ratio.
[0118] In embodiments of this disclosure, an allocation ratio is determined based on the scaling ratio and the predicted number of streaming multiprocessors. This allocation ratio characterizes the proportional relationship between the number of streaming multiprocessors allocated to different processes. The scaling ratio and the predicted number of streaming multiprocessors are transformed into a unified proportional scale capable of arbitrating resources between two competing processes. The allocation ratio reflects the relative share of resources that the system considers each of the two processes should receive in the current state.
[0119] For example, suppose a pre-filling process is about to launch a kernel, which is predicted to require 20 streaming multiprocessors, and its current scaling ratio is 1.2. A decoding process is about to launch a kernel, which is predicted to require 15 streaming multiprocessors, and its current scaling ratio is 0.8. The system can calculate an allocation ratio based on this data. The resource demand intensity of each process can be calculated as "predicted number × scaling ratio," and the ratio between the two can then be taken. The demand intensity of the pre-filling process is 20 × 1.2 = 24, and the demand intensity of the decoding process is 15 × 0.8 = 12. The allocation ratio (pre-filling : decoding) is therefore 24:12, or 2:1. This ratio indicates that the system considers the pre-filling process should receive approximately twice the resources of the decoding process.
[0120] In embodiments of this disclosure, the actual number of streaming multiprocessors for each kernel executing the pre-filling process and the decoding process is determined based on the allocation ratio and the total number of streaming multiprocessors on the GPU. Using the allocation ratio and the total number of streaming multiprocessors on the GPU, the specific number of streaming multiprocessors each process's kernel should receive for execution is calculated.
[0121] For example, with an allocation ratio of 2:1 (pre-fill: decode), the total number of GPU streaming multiprocessors is 90. The kernel for the pre-fill process receives 60 streaming multiprocessors, and the kernel for the decode process receives 30 streaming multiprocessors. This determines the actual number of streaming multiprocessors used to execute each kernel, reflecting the intention of dynamic adjustment while ensuring that the total allocation does not exceed the physical limit.
[0122] Through the embodiments of this disclosure, by introducing an allocation ratio, the resource requirements and scaling strategies of individual processes are transformed into a globally unified and quantifiable basis for resource allocation. This enables quantitative and proportional resource allocation between two competing processes, ensuring the fairness of resource allocation. Consequently, limited GPU computing resources can be allocated more rationally according to real-time performance requirements and strategies.
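The examples in paragraphs [0119] and [0121] can be reproduced with a short sketch. The function name is hypothetical, and the rounding scheme (round each share, give the remainder to the last process) is an assumption added so that the shares always sum to the GPU total; the disclosure does not specify one.

```python
def split_sms(predicted, scaling, total_sms):
    """Split total_sms in proportion to predicted SM count x scaling ratio.

    predicted, scaling: dicts keyed by process name.
    Assumes the summed demand is positive.
    """
    demand = {name: predicted[name] * scaling[name] for name in predicted}
    total_demand = sum(demand.values())
    names = list(demand)
    shares = {}
    remaining = total_sms
    for name in names[:-1]:
        shares[name] = round(total_sms * demand[name] / total_demand)
        remaining -= shares[name]
    # The last process takes what is left so the total is never exceeded.
    shares[names[-1]] = remaining
    return demand, shares


demand, shares = split_sms(
    predicted={"prefill": 20, "decode": 15},
    scaling={"prefill": 1.2, "decode": 0.8},
    total_sms=90,
)
```

With these inputs the demand intensities are 24 and 12 (a 2:1 ratio), and the 90 SMs are divided into 60 for the pre-filling kernel and 30 for the decoding kernel.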
[0123] In some embodiments of this disclosure, determining the actual number of streaming multiprocessors for each kernel executing the pre-filling process and the decoding process based on the allocation ratio and the total number of streaming multiprocessors of the GPU includes: determining the streaming multiprocessor allocation status of the peer process of the kernel based on the allocation ratio; and when the allocation status indicates that the peer process of the kernel has not been allocated a streaming multiprocessor, determining the total number of streaming multiprocessors of the GPU as the actual number of streaming multiprocessors for executing the kernel.
[0124] In embodiments of this disclosure, the peer process refers to the other process that corresponds to the process about to issue the kernel execution request among two processes (the pre-filling process and the decoding process). For example, if the kernel execution request about to be issued is by the pre-filling process, then its peer process is the decoding process, and vice versa.
[0125] In the embodiments of this disclosure, the streaming multiprocessor allocation state refers to the state of the peer process at the current moment, whether it has a kernel in use or has been allocated GPU streaming multiprocessor resources. It is mainly divided into two states: "allocated" (a kernel is executing) and "unallocated" (no kernel is executing, all SMs are idle).
[0126] In embodiments of this disclosure, the streaming multiprocessor allocation status of the kernel's peer processes is determined based on the allocation ratio. Before performing regular resource allocation according to the allocation ratio, the system first checks the real-time resource usage of the peer processes.
[0127] For example, suppose the system is about to determine the actual number of streaming multiprocessors for a kernel (denoted as P-Kernel) of the pre-filling process. The system first checks the status of its peer process, the decoding process. If the query finds that the decoding process currently has no submitted or executing kernels (e.g., its command queue is empty), then the streaming multiprocessor allocation status of the decoding process is determined to be "unallocated".
[0128] In the embodiments of this disclosure, when the allocation state indicates that the peer process of the kernel has not been allocated a streaming multiprocessor, the total number of streaming multiprocessors of the GPU is determined as the actual number of streaming multiprocessors used to execute the kernel. When it is determined that the peer process is in an "unallocated" state, the system will go beyond the conventional calculation based on the allocation ratio and directly allocate all available GPU computing resources to the kernel of the current process, thereby achieving full utilization and maximization of resources.
[0129] For example, when the decoding process status is determined to be "unallocated," the system will ignore any previously calculated proportion-based allocation scheme. For the P-Kernel about to be launched, the system directly determines its actual number of streaming multiprocessors as the total number of GPU streaming multiprocessors (e.g., 80). The P-Kernel can then execute exclusively on all SMs, which shortens its completion time.
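The override described in paragraphs [0123] to [0129] amounts to a single check performed before the proportional share is used. The sketch below assumes the peer's state is available as a boolean; the function and parameter names are illustrative.

```python
def actual_sm_count(proportional_share, peer_is_executing, total_sms):
    """Choose the SM count for a kernel that is about to be launched.

    proportional_share: SMs the kernel would get from the allocation ratio.
    peer_is_executing: True if the peer process has a kernel on the GPU.
    """
    if not peer_is_executing:
        # Peer is "unallocated": hand the whole GPU to this kernel.
        return total_sms
    return proportional_share


exclusive = actual_sm_count(53, peer_is_executing=False, total_sms=80)
shared = actual_sm_count(53, peer_is_executing=True, total_sms=80)
```

When the peer is idle the kernel receives all 80 SMs; otherwise it keeps its proportional share of 53.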
[0130] Figure 3 is a schematic diagram of a resource allocation method according to another embodiment of the present disclosure.
[0131] As shown in Figure 3, the timing diagram visually compares the allocation of SM resources without a preemption mechanism and with a preemption mechanism (i.e., the mechanism in the method of this embodiment).
[0132] In embodiments of this disclosure, the horizontal axis represents the progression of time. The vertical axis represents the number of streaming multiprocessors (SMs), from 0 to a maximum value (e.g., 80). The figure illustrates the alternating execution of the kernels (P Kernel 1-5) of the pre-filling process and the kernels (D Kernel 1-5) of the decoding process over time.
[0133] In the embodiments of this disclosure, without a preemption mechanism, when the kernel of one process finishes execution while the kernel of another process is not yet ready, there will be idle gaps in SM resources. For example, after P Kernel 1 finishes execution, P Kernel 2 may not have arrived yet, causing some SMs to be temporarily idle.
[0134] In embodiments of this disclosure, with a preemption mechanism, when the kernel of one process finishes execution and the kernel of the other process is not ready, the currently ready kernel can preempt all idle SMs for execution. For example, as shown in Figure 3, with the preemption mechanism, D Kernel 2 can start immediately after D Kernel 1 ends, occupying all SMs and eliminating idle gaps, resulting in a fuller and more continuous SM utilization curve. P Kernel 3 can start immediately after P Kernel 2 ends, occupying all SMs. D Kernel 5 can start immediately after D Kernel 4 ends, occupying all SMs. After P Kernel 5 starts, the SM resources are reallocated between P Kernel 5 and D Kernel 5.
[0135] The embodiments of this disclosure effectively solve the resource idle problem that may occur when resources are allocated in a fixed ratio through a preemption mechanism. When one process is temporarily free of tasks, the kernel of another process is allowed to occupy all resources momentarily, thereby significantly improving the overall utilization of GPU streaming multiprocessors in the time dimension, accelerating the execution speed of individual kernels, and reducing the overall job completion time.
[0136] In some embodiments of this disclosure, the method further includes: adjusting the kernel execution request based on the actual number of streaming multiprocessors, sending the adjusted request to the GPU, and updating the execution status identifier of the process to which the kernel execution request belongs to indicate that the kernel of the process is executing; monitoring GPU event callbacks associated with the kernel execution request; and updating the execution status identifier in response to the triggering of a GPU event callback to indicate that the kernel execution of the process is complete.
[0137] In embodiments of this disclosure, the execution status flag refers to a variable (e.g., a counter or flag) stored in shared memory on the CPU side, used to record and indicate whether a kernel is currently executing on the GPU for the pre-filling process or the decoding process. For example, an execution status flag of 1 indicates that a kernel is executing, and 0 indicates that it is idle.
[0138] In the embodiments of this disclosure, the kernel execution request is adjusted based on the actual number of streaming multiprocessors (SMs), the adjusted request is sent to the GPU, and the execution status flag of the process to which the kernel execution request belongs is updated to indicate that the kernel of that process is executing. After determining the actual number of SMs available to the kernel, the interception library modifies the configuration data of the kernel execution request, then sends it to the GPU, and updates the "execution status flag" of the process in shared memory.
[0139] For example, suppose a kernel of the decoding process is calculated to occupy 48 SMs. The interception library modifies the kernel's launch configuration to limit it to at most 48 SMs, and then issues the request. The interception library updates the value of the "decoding process execution status flag" in shared memory from 0 to 1, indicating that a kernel of the decoding process is running.
[0140] In the embodiments of this disclosure, GPU event callback refers to an asynchronous notification mechanism in GPU programming. When an "event" is created in association after the kernel starts and a callback function is specified, the GPU driver automatically triggers the execution of this callback function on the CPU side when the kernel finishes execution on the GPU, in order to notify the application that the kernel has ended.
[0141] In embodiments of this disclosure, GPU event callbacks associated with kernel execution requests are monitored. Instead of actively polling its status after a kernel is launched, the system passively listens for a completion signal (GPU event callback) bound to that kernel.
[0142] For example, during the kernel launch of the aforementioned decoding process, the interception library simultaneously creates and associates a GPU event callback. The system then enters a listening state, waiting for the completion signal of the event. This listening process can be implemented by registering the callback function with the Compute Unified Device Architecture (CUDA) runtime or by using a synchronous application programming interface (API).
[0143] In embodiments of this disclosure, in response to a GPU event callback being triggered, the execution status flag is updated to indicate that the kernel execution of the process has completed. When the callback is triggered, it indicates that the kernel has finished executing and GPU resources have been released. The execution status flag of this process is immediately updated so that the peer process can make decisions such as resource preemption.
[0144] For example, once the kernel of the decoding process has finished executing on the GPU, its associated callback function is invoked. Within this callback function, the interception library can atomically update the value of the decoding process execution status flag in shared memory from 1 back to 0.
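The flag lifecycle in the examples above (0 to 1 at launch, 1 back to 0 in the completion callback) can be illustrated with a minimal Python sketch. Everything in it is a hypothetical stand-in: `FakeGpu` is a toy in-process queue rather than a real device, and `StatusFlag` uses a lock in place of an atomic shared-memory update. In a CUDA setting, the completion callback might instead be registered through a runtime call such as `cudaLaunchHostFunc`, which the disclosure does not name.

```python
import threading


class StatusFlag:
    """Stand-in for one process's execution status flag (0 = idle, 1 = running)."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()  # assumption: a lock replaces an atomic update

    def set(self, value):
        with self._lock:
            self._value = value

    def get(self):
        with self._lock:
            return self._value


class FakeGpu:
    """Toy device: queues kernels and fires each completion callback when drained."""

    def __init__(self):
        self._pending = []

    def launch(self, kernel, on_complete):
        self._pending.append((kernel, on_complete))

    def drain(self):
        while self._pending:
            kernel, on_complete = self._pending.pop(0)
            kernel()       # "execute" the kernel
            on_complete()  # completion signal, analogous to the GPU event callback


def launch_with_flag(gpu, kernel, flag):
    """Mark the process as running, then launch with a callback that clears the flag."""
    flag.set(1)
    gpu.launch(kernel, on_complete=lambda: flag.set(0))


gpu = FakeGpu()
decode_flag = StatusFlag()
launch_with_flag(gpu, lambda: None, decode_flag)
running_after_launch = decode_flag.get()  # flag is 1 while the kernel is pending
gpu.drain()
running_after_finish = decode_flag.get()  # flag is back to 0 after the callback
```

The launcher never polls the device: the flag is cleared only by the callback, which matches the passive listening described in [0141].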
[0145] Figure 4 is a schematic diagram of a resource allocation method according to yet another embodiment of the present disclosure.
[0146] As shown in Figure 4, the pre-filling process and the decoding process are two independent processes. Each process contains an interception library, and each process maintains its own independent execution status flag and scaling ratio in the shared memory on the CPU side. The graphics processing unit (GPU) contains command queues and compute units (streaming multiprocessors, SMs). The kernel execution requests of the two processes are sent to their respective independent command queues for queuing and are finally executed by the compute units.
[0147] In the embodiments of this disclosure, the interception library of the pre-filling process adjusts the kernel execution request based on the actual number of SMs, sends the adjusted request to the GPU command queue, and updates the execution status flag corresponding to the current process. After the GPU's compute units complete kernel execution, a callback event is triggered, and the CPU-side interception library updates the execution status flag in shared memory.
[0148] In the embodiments of this disclosure, the status flag is updated once before launch and again, via the callback, after the kernel completes, so that a precise, real-time, and reliable GPU occupancy status signal is maintained for each process in shared memory. The method of this embodiment ensures correct and conflict-free execution of the dynamic resource allocation strategy, achieving closed-loop management of fine-grained resource coordination.
[0149] Figure 5 is a schematic diagram of the structure of an electronic device according to an embodiment of the present disclosure.
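As a rough illustration of the per-process state described for Figure 4, the following Python sketch stores one execution status flag and one scaling ratio per process in a flat byte buffer. The record layout (a 32-bit flag followed by a 64-bit float), the slot order, and the helper names are assumptions made for this sketch and are not specified in the disclosure. A `bytearray` stands in for the shared region; any writable buffer, such as the `buf` of a `multiprocessing.shared_memory.SharedMemory` block, could be passed instead.

```python
import struct

# Assumed record layout per process: int32 status flag + float64 scaling ratio.
RECORD = struct.Struct("<id")
# Assumed slot order inside the shared region.
SLOTS = {"prefill": 0, "decode": 1}


def write_state(buf, process, flag, ratio):
    """Store the flag and scaling ratio of one process in its own slot."""
    RECORD.pack_into(buf, SLOTS[process] * RECORD.size, flag, ratio)


def read_state(buf, process):
    """Return (flag, ratio) for one process."""
    return RECORD.unpack_from(buf, SLOTS[process] * RECORD.size)


def peer_is_idle(buf, process):
    """True when the other process currently has no kernel running."""
    peer = "decode" if process == "prefill" else "prefill"
    flag, _ratio = read_state(buf, peer)
    return flag == 0


# A bytearray stands in for the CPU-side shared memory region.
region = bytearray(RECORD.size * len(SLOTS))
write_state(region, "prefill", 0, 1.0)  # pre-filling: idle, baseline ratio
write_state(region, "decode", 1, 1.5)   # decoding: kernel running, boosted ratio
```

This sketch does not make updates atomic across processes; a real implementation would need atomic writes or another synchronization mechanism for the flag.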
[0150] In this embodiment of the disclosure, the electronic device 500 includes: an interface 510 for data interaction with a GPU; and a processor 520 for: intercepting kernel execution requests from a pre-filling process and a decoding process, and obtaining the grid size and block size in the kernel execution request; determining the predicted number of streaming multiprocessors required to execute the kernel based on the grid size, the block size, and the total number of streaming multiprocessors of the GPU; obtaining performance metrics for the pre-filling process and the decoding process, the performance metrics characterizing the latency with which each process generates tokens; determining the scaling ratio of each of the pre-filling process and the decoding process based on the performance metrics, the scaling ratio being used to scale the predicted number of streaming multiprocessors; and determining the actual number of streaming multiprocessors used to execute the kernels of each of the pre-filling process and the decoding process based on the scaling ratio and the predicted number of streaming multiprocessors.
[0151] In this embodiment of the disclosure, the processor 520 may execute the resource allocation method described above.
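The prediction and allocation steps listed in [0150] can be sketched in Python as follows. The formulas are assumptions chosen for illustration, not the formulas of the disclosure: `predict_sms` supposes that an SM can hold `max_threads_per_sm` resident threads (2048 is a hypothetical default) and ignores the performance ratio, and `actual_sms` supposes that the allocation ratio is proportional to the scaling ratio multiplied by the predicted number of SMs.

```python
import math


def predict_sms(grid_size, block_size, total_sms, max_threads_per_sm=2048):
    """Assumed heuristic: SMs needed to hold every thread block at once."""
    # Resident blocks per SM, limited only by the assumed thread capacity.
    blocks_per_sm = max(1, max_threads_per_sm // block_size)
    return min(total_sms, math.ceil(grid_size / blocks_per_sm))


def actual_sms(predicted, scaling, total_sms):
    """Split total_sms in proportion to scaling ratio * predicted SM count."""
    weights = {name: scaling[name] * predicted[name] for name in predicted}
    active = [name for name, weight in weights.items() if weight > 0]
    if len(active) == 1:
        # The peer has no allocation, so the lone kernel may use the whole GPU.
        return {name: (total_sms if name in active else 0) for name in weights}
    total_weight = sum(weights.values())
    if total_weight == 0:
        return {name: 0 for name in weights}
    # Round down so the two allocations never exceed the SM total.
    return {
        name: math.floor(total_sms * weight / total_weight)
        for name, weight in weights.items()
    }


# Hypothetical workload on a GPU with 108 SMs.
predicted = {
    "prefill": predict_sms(grid_size=1024, block_size=256, total_sms=108),
    "decode": predict_sms(grid_size=384, block_size=256, total_sms=108),
}
scaling = {"prefill": 1.0, "decode": 1.5}  # decoding is currently favoured
allocation = actual_sms(predicted, scaling, total_sms=108)
```

Rounding down can leave a few SMs unassigned, and a process with a very small share could receive zero SMs; a fuller implementation would need a policy for both cases.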
[0152] According to embodiments of this disclosure, this disclosure also provides an electronic device, a readable storage medium, and a computer program product.
[0153] Figure 6 is a schematic block diagram of an example electronic device used to implement embodiments of the present disclosure.
[0154] As shown in Figure 6, an electronic device 600 according to an embodiment of this application includes a processor 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory 602 or a program loaded from a storage portion 608 into a random access memory 603. The processor 601 may include, for example, a general-purpose microprocessor, an instruction set processor and / or an associated chipset and / or a dedicated microprocessor. The processor 601 may also include onboard memory for caching purposes. The processor 601 may include a single processing unit or multiple processing units for executing different steps of the method flow according to an embodiment of this application.
[0155] Random access memory 603 stores various programs and data required for the operation of electronic device 600. Processor 601, read-only memory 602, and random access memory 603 are interconnected via bus 604. Processor 601 executes various steps of the method flow according to embodiments of this application by executing programs in read-only memory 602 and / or random access memory 603. It should be noted that programs may also be stored in one or more memories other than read-only memory 602 and random access memory 603. Processor 601 may also execute various steps of the method flow according to embodiments of this application by executing programs stored in one or more memories.
[0156] According to embodiments of this application, the electronic device 600 may further include an input / output interface 605, which is also connected to a bus 604. The electronic device 600 may also include one or more of the following components connected to the input / output interface 605: an input section 606 including a keyboard, mouse, etc.; an output section 607 including a cathode ray tube, liquid crystal display, etc., and a speaker, etc.; a storage section 608 including a hard disk, etc.; and a communication section 609 including a network interface card, such as a local area network card, modem, etc. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the input / output interface 605 as needed. A removable medium 611, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., is installed on the drive 610 as needed so that computer programs read from it can be installed into the storage section 608 as needed.
[0157] Embodiments of this application also provide a computer-readable storage medium, which may be included in the device / apparatus / system described in the above embodiments, or may exist independently without being assembled into the device / apparatus / system. The computer-readable storage medium carries one or more programs, which, when executed, implement the method according to the embodiments of this application.
[0158] According to embodiments of this application, the computer-readable storage medium can be a non-volatile computer-readable storage medium, including but not limited to: portable computer disks, hard disks, random access memory, read-only memory, erasable programmable read-only memory, portable compact disk read-only memory, optical storage devices, magnetic storage devices, or any suitable combination thereof. In embodiments of this application, the computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. For example, according to embodiments of this application, the computer-readable storage medium may include the read-only memory 602 described above, and / or random access memory 603, and / or one or more memories other than read-only memory 602 and random access memory 603.
[0159] Embodiments of this application also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowchart. When the computer program product is run on a computer system, the program code is used to cause the computer system to implement the methods provided in the embodiments of this application.
[0160] In one embodiment, the computer program may rely on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of signals over a network medium, and downloaded and installed via the communication section 609, and / or installed from the removable medium 611. The program code contained in the computer program can be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination thereof.
[0161] In embodiments of this application, the computer program can be downloaded and installed from a network via communication section 609, and / or installed from removable medium 611. When the computer program is executed by processor 601, it performs the functions defined in the system of embodiments of this application. According to embodiments of this application, the systems, devices, apparatuses, modules, units, etc., described above can be implemented by computer program modules.
[0162] According to embodiments of this application, program code for executing the computer programs provided in the embodiments of this application can be written in any combination of one or more programming languages. Specifically, these computer programs can be implemented using high-level procedural and / or object-oriented programming languages, and / or assembly / machine languages. The program code can be executed entirely on the user's computing device, partially on the user's device, partially on a remote computing device, or entirely on a remote computing device or server. In cases involving remote computing devices, the remote computing device can be connected to the user's computing device via any type of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (e.g., via the Internet using an Internet service provider).
[0163] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0164] Those skilled in the art will understand that the features described in the various embodiments of this application can be combined and / or integrated in various ways, even if such combinations or integrations are not explicitly described in this application. In particular, the features described in the various embodiments of this application can be combined and / or integrated in various ways without departing from the spirit and teachings of this application. All such combinations and / or integrations fall within the scope of this application.
Claims
1. A resource allocation method, comprising: intercepting kernel execution requests from a pre-filling process and a decoding process, and obtaining the grid size and block size from the kernel execution requests; determining, based on the grid size, the block size, and the total number of streaming multiprocessors of the GPU, the predicted number of streaming multiprocessors required to execute the kernel; obtaining performance metrics of the pre-filling process and the decoding process, wherein the performance metrics characterize the latency with which each process generates tokens; determining, based on the performance metrics, the scaling ratios of the pre-filling process and the decoding process, wherein the scaling ratios are used to scale the predicted number of streaming multiprocessors; and determining, based on the scaling ratio and the predicted number of streaming multiprocessors, the actual number of streaming multiprocessors used to execute the pre-filling process and the decoding process.
2. The method according to claim 1, wherein determining the scaling ratios of the pre-filling process and the decoding process based on the performance metrics comprises: comparing the performance metric of each process with the preset threshold of that process; and adjusting the scaling ratios of the pre-filling process and the decoding process based on the comparison results.
3. The method according to claim 2, wherein adjusting the scaling ratios of the pre-filling process and the decoding process based on the comparison results comprises: when the performance metric of any process is greater than or equal to its preset threshold, increasing the scaling ratio of that process and decreasing the scaling ratio of the other process; and when the performance metric of any process is lower than its preset threshold, resetting the scaling ratio of that process to the baseline value.
4. The method according to claim 1, wherein determining the predicted number of streaming multiprocessors required to execute the kernel based on the grid size, the block size, and the total number of streaming multiprocessors of the GPU comprises: determining, based on the grid size, the block size, and the total number of streaming multiprocessors of the GPU, an initial number of streaming multiprocessors, which is the number of streaming multiprocessors required to meet the theoretical performance; obtaining a performance ratio, which characterizes the proportional relationship between the target performance and the theoretical performance; and determining, based on the performance ratio and the initial number of streaming multiprocessors, the predicted number of streaming multiprocessors, which is the number of streaming multiprocessors required to meet the target performance.
5. The method according to claim 4, wherein obtaining the performance ratio comprises: obtaining an initial value of the performance ratio; obtaining the streaming multiprocessor utilization of the GPU during a historical period in which the kernels of the pre-filling process and the decoding process were executed simultaneously; and adjusting the initial value of the performance ratio based on the streaming multiprocessor utilization to obtain the performance ratio.
6. The method according to claim 1, wherein obtaining the performance metrics of the pre-filling process and the decoding process comprises: monitoring token generation events of the pre-filling process and the decoding process; in response to detecting that any process has completed a token generation, recording the completion timestamp of that token generation; calculating, based on the recorded completion timestamps, the latency of the most recent token generation of each process; and determining the most recent token generation latency of each process as the performance metric of that process.
7. The method according to claim 1, wherein determining the actual number of streaming multiprocessors used to execute the pre-filling process and the decoding process, based on the scaling ratio and the predicted number of streaming multiprocessors, comprises: determining an allocation ratio based on the scaling ratio and the predicted number of streaming multiprocessors, wherein the allocation ratio characterizes the proportional relationship between the numbers of streaming multiprocessors allocated to the different processes; and determining, based on the allocation ratio and the total number of streaming multiprocessors of the GPU, the actual number of streaming multiprocessors used to execute the pre-filling process and the decoding process.
8. The method according to claim 7, wherein determining the actual number of streaming multiprocessors used to execute the pre-filling process and the decoding process, based on the allocation ratio and the total number of streaming multiprocessors of the GPU, comprises: determining, based on the allocation ratio, the streaming multiprocessor allocation status of the peer process of the kernel; and when the allocation status indicates that the peer process of the kernel has not been allocated any streaming multiprocessor, determining the total number of streaming multiprocessors of the GPU as the actual number of streaming multiprocessors used to execute the kernel.
9. The method according to claim 1, further comprising: adjusting the kernel execution request based on the actual number of streaming multiprocessors, sending the adjusted request to the GPU, and updating the execution status flag of the process to which the kernel execution request belongs to indicate that the kernel of that process is executing; monitoring the GPU event callback associated with the kernel execution request; and in response to the GPU event callback being triggered, updating the execution status flag to indicate that the kernel execution of the process has completed.
10. An electronic device, comprising: an interface for data interaction with a GPU; and a processor for: intercepting kernel execution requests from a pre-filling process and a decoding process, and obtaining the grid size and block size from the kernel execution requests; determining, based on the grid size, the block size, and the total number of streaming multiprocessors of the GPU, the predicted number of streaming multiprocessors required to execute the kernel; obtaining performance metrics of the pre-filling process and the decoding process, wherein the performance metrics characterize the latency with which each process generates tokens; determining, based on the performance metrics, the scaling ratios of the pre-filling process and the decoding process, wherein the scaling ratios are used to scale the predicted number of streaming multiprocessors; and determining, based on the scaling ratio and the predicted number of streaming multiprocessors, the actual number of streaming multiprocessors used to execute the pre-filling process and the decoding process.