Heterogeneous platform tensor parallelism-based large language model deployment optimization method and system
By storing the target operator weights of a large language model in CPU memory and dividing them into subsets, combined with asynchronous transmission and parallel computing, the deployment method of the large language model is optimized, solving the inference latency problem in resource-constrained environments and improving throughput and real-time performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- UNIV OF SCI & TECH OF CHINA
- Filing Date
- 2026-03-24
- Publication Date
- 2026-06-30
AI Technical Summary
Existing technologies face challenges when deploying large language models on resource-constrained hardware platforms, including high GPU memory requirements, high inference latency, and difficulty in meeting real-time requirements. In particular, when GPU resources are limited, traditional tensor-parallel methods lack a fine-grained design for the computational pipeline between intra-layer and inter-layer operators.
The weight matrix of the target operator is stored in CPU memory and divided into a first and a second weight subset, and a fine-grained pipeline strategy combines asynchronous transmission with parallel computing so that the computing power of both the CPU and the GPU is used. The preceding operator is computed on the GPU while a weight subset is transmitted in parallel, and the local results are then merged to minimize total system latency.
The method distributes part of the GPU's computational burden to the CPU, improves the throughput of large language models in memory-constrained environments, helps meet real-time requirements, and enhances overall inference efficiency.
Figure CN122309142A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of artificial intelligence and high-performance computing technology, and specifically relates to a method and system for optimizing the deployment of large language models based on tensor parallelism on heterogeneous platforms.
Background Technology
[0002] To efficiently deploy large language models and achieve high-throughput inference on hardware platforms, various large language model inference frameworks have been proposed in the industry. Among them, frameworks such as vLLM, SGLang, and TensorRT-LLM, which deeply integrate GPU computing libraries such as CUDA, focus on large model inference on high-end GPUs, making them difficult to apply under resource-constrained conditions.
[0003] FlexGen is a high-performance language model inference framework implemented in PyTorch. Its key feature is offloading some weights to CPU memory or disk, loading them during computation. While this reduces GPU memory requirements and allows deployment in GPU-constrained environments, it is limited by PCIe bandwidth. The inference process is often dominated by data transfer time, and GPU computing units remain idle while waiting for data loading, resulting in high overall inference latency and making it difficult to meet real-time requirements.
[0004] The HeteGen inference framework builds on FlexGen and uses the CPU and GPU together to achieve tensor parallelism for single operators, splitting an operator's matrix multiplication into two parts that are executed separately. By having the CPU complete a portion of the computation, this approach reduces overall latency, but it has significant limitations in practical deployments: the HeteGen logic is largely limited to parallelism within a single operator and lacks a fine-grained design for the computational pipeline between operators within a layer and across layers.
Summary of the Invention
[0005] To address the aforementioned technical problems, this invention provides a method for optimizing the deployment of large language models based on tensor parallelism on heterogeneous platforms, comprising the following steps:
[0006] Step S1: To maximize GPU memory utilization, select target operators from each network layer of the large language model, store the weight matrices of the target operators in CPU memory, and store the weight matrices of all other operators in GPU memory;
[0007] Step S2: Divide the weight matrix of the target operator, according to a segmentation ratio, into a first weight subset and a second weight subset;
[0008] Step S3: Compute the preceding operators of the target operator on the GPU to generate the intermediate activation tensor X while, in parallel, asynchronously transferring the first weight subset from the CPU to the GPU; once the preceding operators have been computed and the intermediate activation tensor has been generated, immediately trigger the CPU-side computation of the target operator, which runs in parallel with the GPU-side computation of the target operator that starts after the first weight subset has been transferred; finally, merge the CPU-side and GPU-side local results on the GPU to obtain the complete output of the target operator;
[0009] Step S4: At different sampling points near the estimated segmentation ratio, record the time consumed in each step; based on the principle of balancing the total time of GPU-side transmission and computation against the total time of GPU-side preceding-operator computation and CPU computation, fit the optimal segmentation ratio that minimizes total system latency, and use it for subsequent inference.
[0010] Beneficial effects:
[0011] This invention provides an optimization method for deploying large language models based on tensor parallelism on heterogeneous platforms, in which the CPU genuinely participates in the core computation path (step S33). The method uses the CPU's computing power to process its share of the target operator's computation in parallel with the GPU's share (step S34), effectively distributing the computational load of the GPU. The invention employs a fine-grained pipeline overlap strategy that uses the time window in which the GPU computes the preceding operators to mask the transmission latency of part of the target operator's weights. In single-card or multi-card environments with limited GPU memory, the method improves the throughput of large language model inference to a certain extent.
Attached Figure Description
[0012] Figure 1 is a schematic flow diagram of the method for optimizing the deployment of large language models based on tensor parallelism on heterogeneous platforms according to the present invention;
[0013] Figure 2 is a flowchart of the tensor-parallel deployment optimization on heterogeneous platforms;
[0014] Figure 3 is a schematic diagram comparing the throughput of the method of the present invention with that of the prior art;
[0015] Figure 4 is a structural block diagram of the large language model deployment optimization system based on heterogeneous platform tensor parallelism according to the present invention.
Detailed Implementation
[0016] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention. Furthermore, the technical features involved in the various embodiments of this invention described below can be combined with each other as long as they do not conflict with each other.
[0017] This invention is implemented on a typical resource-constrained heterogeneous computing server, and the implementation environment and system configuration are as follows:
[0018] Hardware environment:
[0019] CPU: Intel Core i7-7700K (4 cores / 8 threads).
[0020] GPU: 2 NVIDIA GeForce GTX 1080Ti (11GB VRAM per card), representing a consumer-grade graphics card environment with limited VRAM resources.
[0021] Target Model: The ifly-spark-13B (iFlytek Spark 13B Large Language Model) was selected as the test object. This model is a large language model consisting of input / output processing layers and 40 Transformer layers with identical structures. Each Transformer layer can be decomposed into seven operators that implement different computations: qkv, attn_proj, attn_norm, ffn_norm, gate_proj, up_proj, and down_proj. Each operator has parameter tensors that depend on its computation and need to be stored.
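A rough memory estimate illustrates why this configuration counts as resource-constrained. The sketch below assumes a nominal 13 billion parameters stored at 2 bytes each (16-bit precision); neither the exact parameter count nor the precision is stated in the text, so both are assumptions, and activations and the KV cache are ignored.

```python
# Back-of-envelope check: do the raw model weights fit in the available VRAM?
# ASSUMPTIONS (not from the source): 13e9 parameters, 2 bytes per parameter.
NUM_PARAMS = 13e9
BYTES_PER_PARAM = 2
NUM_GPUS = 2            # from the hardware description
VRAM_PER_GPU_GB = 11    # from the hardware description


def weights_gb(num_params: float, bytes_per_param: int) -> float:
    """Size of the raw weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


total_vram_gb = NUM_GPUS * VRAM_PER_GPU_GB
model_gb = weights_gb(NUM_PARAMS, BYTES_PER_PARAM)
# Minimum amount of weight data that cannot stay on the GPUs.
shortfall_gb = max(0.0, model_gb - total_vram_gb)

print(f"weights: {model_gb:.0f} GB, VRAM: {total_vram_gb} GB, "
      f"minimum to offload: {shortfall_gb:.0f} GB")
```

Under these assumptions the weights alone exceed the combined VRAM of the two cards, so some operator weights have to live in CPU memory.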
[0022] Example 1
[0023] As shown in Figure 1, an embodiment of the present invention provides a method for optimizing the deployment of large language models based on tensor parallelism on heterogeneous platforms, comprising the following steps:
[0024] Step S1: To maximize GPU memory utilization, select target operators from each network layer of the large language model, store the weight matrices of the target operators in CPU memory, and store the weight matrices of all other operators in GPU memory;
[0025] Step S2: Divide the weight matrix of the target operator, according to a segmentation ratio, into a first weight subset and a second weight subset;
[0026] Step S3: Compute the preceding operators of the target operator on the GPU to generate the intermediate activation tensor X while, in parallel, asynchronously transferring the first weight subset from the CPU to the GPU; once the preceding operators have been computed and the intermediate activation tensor has been generated, immediately trigger the CPU-side computation of the target operator, which runs in parallel with the GPU-side computation of the target operator that starts after the first weight subset has been transferred; finally, merge the CPU-side and GPU-side local results on the GPU to obtain the complete output of the target operator;
[0027] Step S4: At different sampling points near the estimated segmentation ratio, record the time consumed in each step; based on the principle of balancing the total time of GPU-side transmission and computation against the total time of GPU-side preceding-operator computation and CPU computation, fit the optimal segmentation ratio that minimizes total system latency, and use it for subsequent inference.
[0028] In one embodiment, step S1 above, which aims to maximize GPU memory utilization, involves selecting target operators from each network layer of the large language model, storing the weight matrices of the target operators in CPU memory, and storing the weight matrices of all other operators in the GPU. Specifically, this includes:
[0029] An operator priority for CPU storage is computed from the GPU memory footprint of each operator and its computation time on the CPU; the larger the value, the higher the priority. Operators are stored in CPU memory in descending order of priority until GPU memory is sufficient to store the weights of all remaining operators.
[0030] (Formula for the operator priority, computed from the GPU memory footprint and the CPU computation time of the operator.)
[0031] The purpose of this step is to maximize model capacity and throughput under limited video memory. To this end, a quantitative operator priority index is designed to evaluate each operator's priority for CPU offloading. Operators with larger values, i.e., those whose weight matrices occupy a large amount of GPU memory but whose computation time on the CPU is short, are selected as target operators, and their weight matrices are stored in CPU memory until the model fits in the resource-constrained environment and can complete inference normally.
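The selection rule described above can be sketched as a greedy loop. The priority used here, memory footprint divided by CPU time, is an assumption that merely matches the stated behaviour (large memory and short CPU time give a large value); the class name, function names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Operator:
    name: str
    mem_gb: float       # GPU memory the weights would occupy
    cpu_time_ms: float  # measured time to run the operator on the CPU


def priority(op: Operator) -> float:
    # ASSUMPTION: memory / CPU time. The text only states that large memory
    # and short CPU time should yield a large priority value.
    return op.mem_gb / op.cpu_time_ms


def select_targets(ops, gpu_budget_gb):
    """Move operators to the CPU, highest priority first, until the rest fit."""
    remaining = sum(op.mem_gb for op in ops)
    targets = []
    for op in sorted(ops, key=priority, reverse=True):
        if remaining <= gpu_budget_gb:
            break
        targets.append(op.name)
        remaining -= op.mem_gb
    return targets, remaining


# Hypothetical per-operator totals, for illustration only.
ops = [
    Operator("qkv", 6.0, 30.0),
    Operator("attn_proj", 2.0, 12.0),
    Operator("gate_proj", 5.5, 25.0),
    Operator("up_proj", 5.5, 25.0),
    Operator("down_proj", 5.5, 20.0),
]
targets, remaining = select_targets(ops, gpu_budget_gb=20.0)
print(targets, remaining)
```

With these made-up numbers a single operator is offloaded, after which the remaining weights fit within the assumed 20 GB budget.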
[0032] This embodiment of the invention first parses the configuration file of the target model and, based on the aforementioned target operator selection strategy, identifies the downprojection operator `down_proj` in the MLP layer (feedforward neural network layer) as the target operator. This operator has a large number of parameters and is located at the end of the computation path within the layer, making it suitable for heterogeneous segmentation.
[0033] In one embodiment, step S2 above, in which the weight matrix of the target operator is divided according to the segmentation ratio into a first weight subset and a second weight subset, specifically includes:
[0034] The weight matrix W of the target operator stored in CPU memory is divided into two parts:
[0035] a first weight subset $W_{gpu}$ with proportion $\alpha$, and a second weight subset $W_{cpu}$ with proportion $1-\alpha$;
[0036] where the value of $\alpha$ is estimated from the current device's transmission bandwidth and the computing power of the GPU and CPU.
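One way such an estimate could be computed is sketched below. It assumes a simple linear cost model that is not given in the text: the GPU path costs the transfer plus the GPU computation of its share, the CPU path costs the preceding-operator time plus the CPU computation of its share, and the GPU share (`alpha` here) is chosen so that the two paths take equal time. All parameter names and numbers are hypothetical.

```python
def estimate_alpha(weight_bytes, flops, pcie_bytes_per_s,
                   gpu_flops_per_s, cpu_flops_per_s, t_pre_s=0.0):
    """Initial GPU share of the weight matrix under an ASSUMED linear model.

    GPU path: alpha * (transfer_full + gpu_full)
    CPU path: t_pre_s + (1 - alpha) * cpu_full
    Setting both equal and solving for alpha gives the expression below.
    """
    transfer_full = weight_bytes / pcie_bytes_per_s  # move all of W over PCIe
    gpu_full = flops / gpu_flops_per_s               # whole operator on the GPU
    cpu_full = flops / cpu_flops_per_s               # whole operator on the CPU
    alpha = (t_pre_s + cpu_full) / (transfer_full + gpu_full + cpu_full)
    return min(1.0, max(0.0, alpha))                 # keep the share in [0, 1]


# Hypothetical device figures, for illustration only.
alpha0 = estimate_alpha(weight_bytes=100e6, flops=1e9,
                        pcie_bytes_per_s=10e9,
                        gpu_flops_per_s=1e12, cpu_flops_per_s=1e11,
                        t_pre_s=0.001)
print(round(alpha0, 4))
```

Such a value is only a starting point; real bandwidth and throughput vary, so a measured correction is still needed.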
[0037] In this embodiment of the invention, the weight matrix W of down_proj is physically split along the output feature dimension according to the segmentation ratio $\alpha$:
[0038] The first weight subset $W_{gpu}$, the GPU portion, has size $\alpha W$. During inference, these weights must be transferred dynamically from CPU memory to GPU memory over the PCIe bus before the computation can be completed.
[0039] The second weight subset $W_{cpu}$, the CPU portion, has size $(1-\alpha)W$; these weights stay in CPU memory and are used for computation there. To improve the transmission efficiency of $W_{gpu}$, the CPU memory holding the complete weights is allocated as pinned (page-locked) memory, which supports the subsequent asynchronous transfers via direct memory access (DMA).
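The split along the output feature dimension can be checked with a small pure-Python sketch. It assumes the weights are laid out as in_features × out_features so that Y = XW, in which case the two partial results are merged by concatenating their columns; the function names and the tiny matrices are illustrative only.

```python
def matmul(x, w):
    """x is rows x k, w is k x n, both plain lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]


def split_columns(w, alpha):
    """Split W along the output (column) dimension; the first share goes to the GPU."""
    n_out = len(w[0])
    n_gpu = round(alpha * n_out)
    w_gpu = [row[:n_gpu] for row in w]
    w_cpu = [row[n_gpu:] for row in w]
    return w_gpu, w_cpu


def merge(y_gpu, y_cpu):
    """Concatenate the partial outputs column-wise to rebuild the full output."""
    return [g + c for g, c in zip(y_gpu, y_cpu)]


x = [[1, 2], [3, 4]]                  # activation X
w = [[1, 0, 2, 1], [0, 1, 1, 3]]      # full weight matrix W (2 x 4)
w_gpu, w_cpu = split_columns(w, alpha=0.5)

full = matmul(x, w)
merged = merge(matmul(x, w_gpu), matmul(x, w_cpu))
print(merged == full)
```

Because each output column depends on only one column of W, the two halves can be computed independently on different devices.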
[0040] In one embodiment, step S3 above involves: calculating the preceding operator of the target operator on the GPU to generate an intermediate activation tensor X; simultaneously, asynchronously transferring the first weight subset from the CPU to the GPU in parallel; after the preceding operator calculation is completed and the intermediate activation tensor is generated, immediately triggering the calculation of the target operator on the CPU side, which is completed in parallel with the calculation of the target operator on the GPU side after the first weight subset transfer is finished; finally, merging the local results on the CPU side and the local results on the GPU side on the GPU to obtain the complete output of the target operator, specifically including:
[0041] Step S31: Use the GPU's main compute stream to execute all preceding operators of the target operator in the current layer until the intermediate activation tensor X required by the target operator is generated; here, the preceding operators are the operators that execute before the target operator in the same network layer and after the target operator of the previous network layer, and that have no direct data-dependency conflict with the target operator;
[0042] This step not only completes the necessary calculations, but also adds a "time window" for the transmission of the weight subset in the subsequent step S32, masking part of the transmission time.
[0043] Step S32: In parallel with step S31, the first weight subset is asynchronously prefetched from CPU memory to GPU memory on an independent CPU-to-GPU transfer stream (H2D, Host to Device).
[0044] Step S33: After step S31 completes and the intermediate activation tensor X has been generated, X is asynchronously copied to pinned CPU memory on a D2H (Device to Host) stream; the CPU thread pool then concurrently computes the product of X and the second weight subset $W_{cpu}$ to obtain the CPU-side local result; finally, the CPU-side local result is sent back to the GPU asynchronously on the H2D stream. If step S32 has not finished when step S31 completes, this step runs in parallel with the remainder of step S32.
[0045] Step S34: Once the transfer in step S32 has completed and the intermediate activation tensor X from step S31 is ready, the GPU compute stream computes the product of X and the first weight subset $W_{gpu}$ to obtain the GPU-side local result. If step S33 has not finished when step S32 completes, this step runs in parallel with step S33.
[0046] Step S35: Using the CUDA event synchronization mechanism, after steps S33 and S34 have been completed, perform a reduction operation on the GPU to obtain the final output of the target operator; repeat steps S31 to S35 for the target operator of the next layer.
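The dependency structure of steps S31 to S35 can be illustrated with a CPU-only simulation. This is a sketch, not the patented implementation: worker threads stand in for the CUDA streams, a list copy stands in for the H2D transfer, and `Future.result()` stands in for CUDA event synchronization. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul(x, w):
    """x is rows x k, w is k x n, both plain lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]


def run_layer(x_in, preceding_op, w_gpu_host, w_cpu, pool):
    # S32: start "prefetching" W_gpu; the copy stands in for the H2D transfer.
    prefetch = pool.submit(lambda: [row[:] for row in w_gpu_host])
    # S31: the preceding operators run in the meantime and produce X.
    x = preceding_op(x_in)
    # S33: the CPU branch starts as soon as X exists.
    cpu_part = pool.submit(matmul, x, w_cpu)
    # S34: the "GPU" branch needs both X and the prefetched weights.
    gpu_part = pool.submit(lambda: matmul(x, prefetch.result()))
    # S35: wait for both branches, then merge (column-wise concatenation).
    return [g + c for g, c in zip(gpu_part.result(), cpu_part.result())]


def double(x):
    """Stand-in for the preceding operators of the layer."""
    return [[2 * v for v in row] for row in x]


with ThreadPoolExecutor(max_workers=3) as pool:
    out = run_layer([[1, 2]], double,
                    w_gpu_host=[[1, 0], [0, 1]], w_cpu=[[1], [1]], pool=pool)
print(out)
```

On real hardware the same ordering would be expressed with separate streams and events rather than threads; the point here is only which step waits on which.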
[0047] As shown in Figure 2, step S3 of the present invention designs a fine-grained pipeline based on multi-stream concurrency. It decomposes the computation of the model in the inference stage into four parallel or dependent steps (S31 to S34) so that communication latency is hidden behind computation as far as possible, transforming the originally serial "data loading, then computation" process into an efficient pipeline in which computation masks transmission and both devices compute in parallel.
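The effect of the overlap can be made concrete with a small timing calculation. The sketch below is a simplified dependency model written for illustration, not a formula from the text; the function names and all durations (in milliseconds) are hypothetical.

```python
def serial_latency(t_pre, t_trans_full, t_gpu_full):
    """Load-then-compute baseline: all weights go to the GPU, nothing overlaps."""
    return t_pre + t_trans_full + t_gpu_full


def pipelined_latency(t_pre, t_trans, t_d2h, t_cpu, t_h2d, t_gpu, t_merge=0.0):
    """Finish time of S35 when S31 and S32 both start at time 0."""
    s31_end = t_pre                             # preceding operators on the GPU
    s32_end = t_trans                           # prefetch of the first weight subset
    s33_end = s31_end + t_d2h + t_cpu + t_h2d   # CPU branch, starts once X exists
    s34_end = max(s31_end, s32_end) + t_gpu     # GPU branch, needs X and the weights
    return max(s33_end, s34_end) + t_merge      # S35 waits for both branches


# Hypothetical durations: half of the weights are sent to the GPU.
baseline = serial_latency(t_pre=2, t_trans_full=10, t_gpu_full=1)
overlapped = pipelined_latency(t_pre=2, t_trans=5, t_d2h=0.2,
                               t_cpu=3, t_h2d=0.2, t_gpu=0.5)
print(baseline, overlapped)
```

In this made-up case the CPU and GPU branches finish at nearly the same time, which is the balanced situation the segmentation ratio is tuned to reach.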
[0048] In one embodiment, step S4 above, in which the time consumed in each step is recorded at different sampling points near the estimated segmentation ratio, the optimal segmentation ratio that minimizes total system latency is fitted based on the principle of balancing the total time of GPU-side transmission and computation against the total time of GPU-side preceding-operator computation and CPU computation, and the fitted ratio is used for subsequent inference, specifically includes:
[0049] Step S41: Set a series of discrete sampling points within an interval around the theoretically calculated segmentation ratio;
[0050] Because of the complexity of the hardware environment (PCIe bandwidth fluctuations, CPU instruction set efficiency), the theoretically calculated segmentation ratio often deviates from the true optimum. This invention therefore determines the optimal segmentation ratio with an adaptive fitting method based on warm-up testing: before the inference service starts, a series of discrete sampling points is set within an interval around the theoretically calculated ratio for the target inference scenario (a specific batch size and sequence length).
[0051] Step S42: Record the following time consumption data of steps S31 to S34 at the different sampling points: the computation time of the preceding operators in step S31, $T_{pre}$; the transmission time of the first weight subset in step S32, $T_{trans}$; the CPU-side heterogeneous computing and transmission time in step S33, $T_{cpu}$; and the GPU-side heterogeneous computing time in step S34, $T_{gpu}$;
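The warm-up measurement of steps S41 and S42 could be organized as follows. This is a minimal sketch: the interval half-width, the number of points, the use of the minimum over repeats, and the `steps` mapping of step names to callables are all assumptions introduced for illustration.

```python
import time


def sample_points(alpha0, half_width=0.1, n=5):
    """n >= 2 evenly spaced ratios around alpha0, clipped to [0, 1]."""
    lo = max(0.0, alpha0 - half_width)
    hi = min(1.0, alpha0 + half_width)
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def timed(fn, *args):
    """Wall-clock duration of one call, in seconds."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


def profile(alphas, steps, repeats=3):
    """steps maps a step name to a callable(alpha); returns one record per ratio."""
    records = []
    for a in alphas:
        rec = {"alpha": a}
        for name, fn in steps.items():
            # Keep the fastest of several runs to reduce timing noise.
            rec[name] = min(timed(fn, a) for _ in range(repeats))
        records.append(rec)
    return records


# Placeholder callables; in practice each would run one of steps S31 to S34.
steps = {name: (lambda a: None) for name in ("t_pre", "t_trans", "t_cpu", "t_gpu")}
alphas = sample_points(0.5)
records = profile(alphas, steps)
print(len(records), sorted(records[0]))
```

When the timed work runs asynchronously on a GPU, the device has to be synchronized before the clock is read, otherwise only the launch overhead is measured.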
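[0052] Step S43: Construct the total latency model:
[0053] $T_{total} = \max(T_{1}, T_{2})$;
[0054] $T_{1} = T_{trans} + T_{gpu}$;
[0055] $T_{2} = T_{pre} + T_{cpu}$;
[0056] where $T_{1}$ denotes the total time spent on data transfer and GPU computation, and $T_{2}$ denotes the total time spent on preceding-operator computation and CPU computation; $T_{pre}$, $T_{trans}$, $T_{cpu}$ and $T_{gpu}$ are the times recorded in steps S31, S32, S33 and S34, respectively.
[0057] Polynomial regression is used to fit curves to the sampled data, giving $T_{1}$ and $T_{2}$ as functions of the segmentation ratio $\alpha$. Solving $T_{1}(\alpha) = T_{2}(\alpha)$ gives the intersection point, and the corresponding value of $\alpha$ is the optimal segmentation ratio $\alpha^{*}$, which makes the total time $T_{total}$ shortest. $\alpha^{*}$ is used for subsequent inference.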
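The fit-and-intersect procedure can be sketched with the standard library alone. The example fits a quadratic to the GPU-path time (transfer plus GPU computation) and to the CPU-path time (preceding operators plus CPU computation) as functions of the ratio, then finds their crossing by bisection. The polynomial degree, the bisection solver, and the sample data are assumptions; the text only specifies polynomial regression and solving for the intersection.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit; returns c with y ~ c[0] + c[1]*x + ..."""
    n = degree + 1
    # Normal equations A c = b with A[i][j] = sum x^(i+j), b[i] = sum y * x^i.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - s) / a[i][i]
    return coeffs


def polyval(coeffs, x):
    return sum(c * x ** i for i, c in enumerate(coeffs))


def balance_point(c1, c2, lo, hi, iters=60):
    """Bisection on the difference of the two fitted curves over [lo, hi]."""
    diff = lambda a: polyval(c1, a) - polyval(c2, a)
    if diff(lo) * diff(hi) > 0:
        raise ValueError("the two curves do not cross inside the sampled interval")
    for _ in range(iters):
        mid = (lo + hi) / 2
        if diff(lo) * diff(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


# Hypothetical warm-up measurements (ms) at five sampled ratios.
alphas = [0.3, 0.4, 0.5, 0.6, 0.7]
gpu_path = [4.0, 5.0, 6.0, 7.0, 8.0]   # grows as more weights go to the GPU
cpu_path = [9.0, 8.0, 7.0, 6.0, 5.0]   # shrinks as the CPU share gets smaller

c_gpu = polyfit(alphas, gpu_path, degree=2)
c_cpu = polyfit(alphas, cpu_path, degree=2)
alpha_star = balance_point(c_gpu, c_cpu, alphas[0], alphas[-1])
print(round(alpha_star, 4))
```

Since one curve rises and the other falls with the ratio, the larger of the two is smallest where they cross, which is why the intersection is taken as the operating point.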
[0058] As shown in Figure 3, the ifly-spark-13B model was deployed on a server with 2 × NVIDIA GTX 1080 Ti GPUs using the method of this invention and compared with the original HeteGen baseline. The comparison in Figure 3 shows that, across inference scenarios with different batch sizes and prompt lengths, the heterogeneous platform tensor-parallel deployment optimization method of this invention achieves a certain improvement in throughput over the HeteGen baseline.
[0059] Example 2
[0060] As shown in Figure 4, this embodiment of the invention provides a large language model deployment optimization system based on heterogeneous platform tensor parallelism, including the following modules:
[0061] The target operator selection module 51 is used to select, with the aim of maximizing GPU memory utilization, target operators from each network layer of the large language model, to store the weight matrices of the target operators in CPU memory, and to store the weight matrices of all other operators in GPU memory;
[0062] The weight subset partitioning module 52 is used to divide the weight matrix of the target operator, according to a segmentation ratio, into a first weight subset and a second weight subset;
[0063] The tensor parallel deployment optimization module 53 is used to compute the preceding operators of the target operator on the GPU to generate the intermediate activation tensor X while, in parallel, asynchronously transferring the first weight subset from the CPU to the GPU; once the preceding operators have been computed and the intermediate activation tensor has been generated, to immediately trigger the CPU-side computation of the target operator, which runs in parallel with the GPU-side computation of the target operator that starts after the first weight subset has been transferred; and finally to merge the CPU-side and GPU-side local results on the GPU to obtain the complete output of the target operator;
[0064] The segmentation ratio optimization module 54 is used to record the time consumed in each step at different sampling points near the estimated segmentation ratio; to fit, based on the principle of balancing the total time of GPU-side transmission and computation against the total time of GPU-side preceding-operator computation and CPU computation, the optimal segmentation ratio that minimizes total system latency; and to use it for subsequent inference.
[0065] A large language model deployment optimization device based on heterogeneous platform tensor parallelism includes one or more electronic devices, wherein the one or more electronic devices are used to implement the large language model deployment optimization method based on heterogeneous platform tensor parallelism.
[0066] An electronic device includes: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the large language model deployment optimization method based on heterogeneous platform tensor parallelism.
[0067] A computer-readable storage medium having stored executable instructions thereon, which, when executed by a processor, enable the processor to implement a method for optimizing the deployment of large language models based on heterogeneous platform tensor parallelism.
[0068] A non-transitory computer-readable storage medium storing a computer program that, when executed by a processor, implements an optimization method for deploying large language models based on tensor parallelism on heterogeneous platforms.
[0069] The above description is merely a specific embodiment of the present invention, enabling those skilled in the art to understand or implement this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, the present invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features claimed herein.
Claims
1. A method for optimizing the deployment of large language models based on tensor parallelism on heterogeneous platforms, characterized in that it comprises:
Step S1: To maximize GPU memory utilization, select target operators from each network layer of the large language model, store the weight matrices of the target operators in CPU memory, and store the weight matrices of all other operators in GPU memory;
Step S2: Divide the weight matrix of the target operator, according to a segmentation ratio, into a first weight subset and a second weight subset;
Step S3: Compute the preceding operators of the target operator on the GPU to generate the intermediate activation tensor X while, in parallel, asynchronously transferring the first weight subset from the CPU to the GPU; once the preceding operators have been computed and the intermediate activation tensor has been generated, immediately trigger the CPU-side computation of the target operator, which runs in parallel with the GPU-side computation of the target operator that starts after the first weight subset has been transferred; finally, merge the CPU-side and GPU-side local results on the GPU to obtain the complete output of the target operator;
Step S4: At different sampling points near the estimated segmentation ratio, record the time consumed in each step; based on the principle of balancing the total time of GPU-side transmission and computation against the total time of GPU-side preceding-operator computation and CPU computation, fit the optimal segmentation ratio that minimizes total system latency, and use it for subsequent inference.
2. The method for optimizing the deployment of large language models based on heterogeneous platform tensor parallelism according to claim 1, characterized in that step S1, in which, to maximize GPU memory utilization, target operators are selected from each network layer of the large language model and the weight matrices of the target operators are stored in CPU memory, specifically includes: computing, from the GPU memory footprint of each operator and its computation time on the CPU, an operator priority for CPU storage, where a larger value means a higher priority; and storing operators in CPU memory in descending order of priority until GPU memory is sufficient to store the weights of all remaining operators.
3. The method for optimizing the deployment of large language models based on heterogeneous platform tensor parallelism according to claim 2, characterized in that step S2, in which the weight matrix of the target operator is divided according to the segmentation ratio into a first weight subset and a second weight subset, specifically includes: dividing the weight matrix W of the target operator stored in CPU memory into two parts, namely a first weight subset $W_{gpu}$ with proportion $\alpha$ and a second weight subset $W_{cpu}$ with proportion $1-\alpha$, where the value of $\alpha$ is estimated from the current device's transmission bandwidth and the computing power of the GPU and CPU.
4. The method for optimizing the deployment of large language models based on heterogeneous platform tensor parallelism according to claim 3, characterized in that step S3, in which the preceding operators of the target operator are computed on the GPU to generate the intermediate activation tensor X while the first weight subset is asynchronously transferred from the CPU to the GPU in parallel, the CPU-side computation of the target operator is triggered immediately once the preceding operators have been computed and the intermediate activation tensor has been generated and runs in parallel with the GPU-side computation of the target operator that starts after the first weight subset has been transferred, and the CPU-side and GPU-side local results are finally merged on the GPU to obtain the complete output of the target operator, specifically includes:
Step S31: Use the GPU's main compute stream to execute all preceding operators of the target operator in the current layer until the intermediate activation tensor X required by the target operator is generated; here, the preceding operators are the operators that execute before the target operator in the same network layer and after the target operator of the previous network layer, and that have no direct data-dependency conflict with the target operator;
Step S32: In parallel with step S31, asynchronously prefetch the first weight subset from CPU memory to GPU memory on an independent CPU-to-GPU transfer stream (H2D stream);
Step S33: After step S31 completes and the intermediate activation tensor X has been generated, asynchronously copy X to pinned CPU memory on the D2H stream; then use the CPU thread pool to concurrently compute the product of X and the second weight subset $W_{cpu}$ to obtain the CPU-side local result; finally, send the CPU-side local result back to the GPU asynchronously on the H2D stream; if step S32 has not finished when step S31 completes, this step runs in parallel with the remainder of step S32;
Step S34: Once the transfer in step S32 has completed and the intermediate activation tensor X from step S31 is ready, use the GPU compute stream to compute the product of X and the first weight subset $W_{gpu}$ to obtain the GPU-side local result; if step S33 has not finished when step S32 completes, this step runs in parallel with step S33;
Step S35: Using the CUDA event synchronization mechanism, after steps S33 and S34 have both completed, perform a reduction operation on the GPU to obtain the final output of the target operator; repeat steps S31 to S35 for the target operator of the next layer.
5. The method for optimizing the deployment of large language models based on heterogeneous platform tensor parallelism according to claim 4, characterized in that step S4, in which the time consumed in each step is recorded at different sampling points near the estimated segmentation ratio, the optimal segmentation ratio that minimizes total system latency is fitted based on the principle of balancing the total time of GPU-side transmission and computation against the total time of GPU-side preceding-operator computation and CPU computation, and the fitted ratio is used for subsequent inference, specifically includes:
Step S41: Set a series of discrete sampling points within an interval around the theoretically calculated segmentation ratio;
Step S42: Record the following time consumption data of steps S31 to S34 at the different sampling points: the computation time of the preceding operators in step S31, $T_{pre}$; the transmission time of the first weight subset in step S32, $T_{trans}$; the CPU-side heterogeneous computing and transmission time in step S33, $T_{cpu}$; and the GPU-side heterogeneous computing time in step S34, $T_{gpu}$;
Step S43: Construct the total latency model: $T_{total} = \max(T_{1}, T_{2})$; $T_{1} = T_{trans} + T_{gpu}$; $T_{2} = T_{pre} + T_{cpu}$; where $T_{1}$ denotes the total time spent on data transfer and GPU computation, and $T_{2}$ denotes the total time spent on preceding-operator computation and CPU computation. Polynomial regression is used to fit curves to the sampled data, giving $T_{1}$ and $T_{2}$ as functions of the segmentation ratio $\alpha$; solving $T_{1}(\alpha) = T_{2}(\alpha)$ gives the intersection point, and the corresponding value of $\alpha$ is the optimal segmentation ratio, which is used for subsequent inference.
6. A large language model deployment optimization system based on heterogeneous platform tensor parallelism, characterized in that it includes the following modules:
a target operator selection module, used to select, with the aim of maximizing GPU memory utilization, target operators from each network layer of the large language model, to store the weight matrices of the target operators in CPU memory, and to store the weight matrices of all other operators in GPU memory;
a weight subset partitioning module, used to divide the weight matrix of the target operator, according to a segmentation ratio, into a first weight subset and a second weight subset;
a tensor parallel deployment optimization module, used to compute the preceding operators of the target operator on the GPU to generate the intermediate activation tensor X while, in parallel, asynchronously transferring the first weight subset from the CPU to the GPU; once the preceding operators have been computed and the intermediate activation tensor has been generated, to immediately trigger the CPU-side computation of the target operator, which runs in parallel with the GPU-side computation of the target operator that starts after the first weight subset has been transferred; and finally to merge the CPU-side and GPU-side local results on the GPU to obtain the complete output of the target operator;
a segmentation ratio optimization module, used to record the time consumed in each step at different sampling points near the estimated segmentation ratio; to fit, based on the principle of balancing the total time of GPU-side transmission and computation against the total time of GPU-side preceding-operator computation and CPU computation, the optimal segmentation ratio that minimizes total system latency; and to use it for subsequent inference.
7. A large language model deployment optimization device based on heterogeneous platform tensor parallelism, characterized in that it includes one or more electronic devices, wherein the one or more electronic devices are used to implement the method of any one of claims 1 to 5.
8. An electronic device, characterized in that it includes: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 5.
9. A computer-readable storage medium, characterized in that it stores executable instructions that, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 5.
10. A non-transitory computer-readable storage medium, characterized in that it stores a computer program that, when executed by a processor, implements the method of any one of claims 1 to 5.