Cost optimization method for heterogeneous multi-LoRA fine-tuning tasks

By optimizing pipeline execution through task feature vector clustering and greedy GPU resource allocation, the problems of resource waste and inefficiency in heterogeneous multi-LoRA fine-tuning tasks are solved, achieving more efficient GPU resource utilization and cost reduction.

CN122240312A · Pending · Publication Date: 2026-06-19 · EAST CHINA NORMAL UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: EAST CHINA NORMAL UNIV
Filing Date: 2026-03-19
Publication Date: 2026-06-19

IPC: G06F9/50; G06F9/48

AI Technical Summary

⚠Technical Problem

In multi-LoRA fine-tuning tasks, existing methods have failed to effectively address the resource waste and pipeline inefficiency caused by differences in resource requirements among heterogeneous tasks, and have been unable to maximize GPU resource utilization and reduce computational costs.

⚗Method used

By combining task feature vector clustering and greedy GPU resource allocation with dynamic programming and V-shaped rearrangement strategies, the pipeline execution of multi-LoRA fine-tuning tasks is optimized, reducing resource waste and improving overall efficiency.

🎯Benefits of technology

It significantly reduces the overall GPU time for multi-LoRA fine-tuning tasks, improves resource utilization, and reduces computational costs.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention discloses a cost optimization method for heterogeneous LoRA fine-tuning tasks, comprising GPU resource allocation and pipeline scheduling. GPU resource allocation clusters fine-tuning tasks into multiple replicas according to their resource requirements, so that the resource requirements of tasks within each replica are as similar as possible. GPUs are then allocated greedily among the replicas to improve overall efficiency and minimize total GPU time. Within a single replica, a load-balanced pipeline execution plan is constructed through packing and scheduling that are aware of heterogeneous compute latency, which optimizes pipeline utilization and reduces the memory consumption caused by batch padding. Finally, by combining GPU resource allocation and pipeline scheduling, a cost optimization system for heterogeneous multi-LoRA fine-tuning tasks, Cappuccino, is proposed. It reduces the cost of multi-LoRA fine-tuning tasks in a multi-tenant shared cluster while keeping runtime overhead acceptable.

Description

Technical Field

[0001] This invention relates to fine-tuning optimization techniques for pre-trained large language models, specifically resource orchestration and pipeline execution optimization for heterogeneous multi-LoRA fine-tuning tasks. More specifically, this invention proposes a cost optimization method based on resource requirement analysis, aiming to optimize GPU resource allocation and pipeline scheduling for multi-LoRA fine-tuning tasks to reduce overall GPU time, thereby improving resource utilization and lowering fine-tuning costs in multi-tenant shared cluster environments.

Background Technology

[0002] In recent years, the pre-training-fine-tuning paradigm has become the mainstream approach for developing large language models. By fine-tuning pre-trained large models (such as Llama and Qwen) on specific datasets, developers can provide customized language models for various downstream applications, significantly reducing costs and shortening development cycles. With the increasing demand for customization, large language model fine-tuning is increasingly delivered through a "Model as a Service" (MaaS) architecture and runs in multi-tenant shared cluster environments. In a MaaS architecture, service providers host pre-trained base models, while tenants submit fine-tuning tasks including datasets and hyperparameter configurations. This approach allows tenants to share the same base model, thereby reducing development costs and computational resource consumption. However, to further increase service provider revenue and reduce tenant fine-tuning costs, overall GPU time must be optimized, i.e., considering both the number of GPU resources used and the fine-tuning execution time.

[0003] Compared to full-parameter fine-tuning, LoRA is a common and efficient fine-tuning technique. It significantly reduces resource consumption while maintaining model quality by freezing the pre-trained model and introducing trainable low-rank adapters. Multiple LoRA fine-tuning tasks are typically executed concurrently while sharing the same base model, with each task attaching its own LoRA adapter to the base model. To further improve resource utilization, existing methods optimize the LoRA adapter computation by alternating forward and backward propagation in the pipeline, and reduce kernel launch frequency and fixed launch overhead through batch fine-tuning.

[0004] While these optimizations improve throughput and resource utilization per replica, they typically rely on the assumption that multi-LoRA fine-tuning tasks are homogeneous. In real-world production environments, however, the resource requirements and computational characteristics of multi-LoRA fine-tuning tasks are often highly heterogeneous. This heterogeneity includes differences in task datasets and hyperparameter configurations, such as variations in sequence length distribution, batch size, and LoRA rank among tasks. This multi-dimensional heterogeneity leads to significant differences in the compute and GPU memory requirements of each task, preventing existing optimization methods based on a shared base model from fully exploiting resource sharing, and thereby reducing overall performance and cost-effectiveness. Specifically, heterogeneous multi-LoRA workloads face two main problems. First, resource inefficiency: to meet the resource requirements of the heaviest tasks, the system typically configures the pipeline depth according to those requirements, so lighter tasks execute in excessively deep pipelines. This introduces unnecessary cross-stage communication and significantly increases pipeline bubbles. Second, pipeline execution inefficiency: within a single replica, multiple tasks with heterogeneous LoRA adapters share the same pipeline for forward and backward propagation. Computationally demanding tasks can block pipeline progress, causing multiple stages to idle frequently because of inter-stage dependencies, which forms numerous pipeline bubbles and reduces resource utilization. Therefore, designing GPU resource configuration and pipeline execution scheduling strategies that account for the multidimensional heterogeneity between tasks, maximize GPU resource utilization, and reduce overall execution cost for heterogeneous multi-LoRA fine-tuning tasks has become a critical problem.

Summary of the Invention

[0005] The purpose of this invention is to provide a cost optimization method for heterogeneous multi-LoRA fine-tuning tasks. It aims to reduce the overall GPU time under multi-LoRA workloads by optimizing the GPU resource allocation between replicas and the pipeline execution scheduling within replicas, thereby improving resource utilization and reducing computing costs.

[0006] The specific technical solution for achieving the objective of this invention is as follows:

[0007] A cost optimization method for heterogeneous multi-LoRA fine-tuning tasks includes the following steps:

[0008] Step 1: Obtain a set of multi-LoRA fine-tuning tasks, including the base model, datasets, and training parameter configurations, and use a performance model to analyze the latency and GPU memory requirements of the multi-LoRA fine-tuning tasks under batch fine-tuning;

[0009] Step 2: Based on the memory requirement analysis results of Step 1, construct a resource requirement feature vector for each task, and cluster LoRA fine-tuning tasks with similar resource requirements into the same replica to obtain the replica partitioning; then allocate the minimum required GPU resources to each replica, and perform greedy allocation among the replicas within the overall GPU resource budget, assigning each additional GPU to the replica that yields the maximum marginal benefit, thereby reducing resource waste within replicas and improving GPU resource efficiency;

[0010] Step 3: Based on the replica partitioning of Step 2 and the GPU resources allocated to each replica, generate a pipeline execution plan for each replica; merge different micro-batches into packaged micro-batches for batch fine-tuning while minimizing the extra padding waste caused by forced alignment of micro-batches across tasks, so as to reduce the pipeline bubbles introduced by micro-batch heterogeneity and optimize the pipeline efficiency of each replica; use a V-shaped sorting strategy based on Johnson's rule to rearrange the packaged micro-batches, further mitigating their load imbalance, minimizing the iteration time within each replica, and thus reducing the overall cost of the multi-LoRA fine-tuning tasks.

[0011] Furthermore, step 2 specifically includes:

[0012] 2-1: LoRA fine-tuning task clustering: Using a task feature vector representation, each task $t_i$ is represented as a feature vector $\mathbf{x}_i$, built from resource requirement features that characterize the multidimensional heterogeneity of tasks. The resource requirement features include: a memory requirement feature vector $\mathbf{m}_i$, which characterizes the impact of batch size, high-quantile sequence length, LoRA rank, and the number of target modules on GPU memory usage; and a compute requirement feature vector $\mathbf{c}_i$, which characterizes the computational cost corresponding to the average sequence length and its dispersion. The feature vector of task $t_i$ is expressed as $\mathbf{x}_i = [\mathbf{m}_i; \mathbf{c}_i]$. The tasks are then clustered on these feature vectors to obtain the task grouping. The clustering objective combines minimization of intra-group heterogeneity with regularization of the replica count, which reduces resource requirement differences among tasks in the same group and suppresses replica fragmentation. It is defined as:

[0013] $$\min_{\mathcal{R}} \; \sum_{R \in \mathcal{R}} \sum_{t_i, t_j \in R} d(\mathbf{x}_i, \mathbf{x}_j) + \lambda \lvert \mathcal{R} \rvert$$

[0014] where $\mathcal{R}$ denotes the set of replicas, each replica $R$ instantiating one copy of the base model that is shared by the tasks assigned to it; $d(\cdot,\cdot)$ is a dissimilarity measure in the feature space; and $\lambda$ regularizes the number of replicas. In the implementation, the HDBSCAN algorithm is used to automatically discover task groups in the feature space;

[0015] 2-2: GPU resource allocation among replicas: After the LoRA task clustering is complete, each cluster is mapped to a replica, and the GPU allocation of each replica is determined under the global GPU resource constraint. Each replica first receives the minimum number of GPUs that satisfies its memory constraint as its initial configuration. GPUs are then added one at a time as long as the remaining GPU budget allows. At each step, the marginal GPU time saving of allocating one more GPU is computed for every replica, and the replica with the largest marginal saving receives the GPU. After the allocation, the next marginal saving of that replica is recomputed and the selection order is updated. This process repeats until the GPU budget is exhausted or no replica has a positive marginal saving, which yields the final number of GPUs configured for each replica.

[0016] Furthermore, step 3 specifically includes:

[0017] 3-1: Iteration time modeling for multi-LoRA pipeline parallelism: Under the 1F1B pipeline schedule, a single multi-LoRA pipeline-parallel iteration consists of a pipeline filling phase, a steady-state phase, and a draining phase. The filling phase initializes the pipeline through the forward propagation of the first packaged micro-batch; the steady-state phase overlaps the forward and backward propagation of the intermediate packaged micro-batches; and the draining phase completes with the backward propagation of the last packaged micro-batch. Because the packaging result of the micro-batches is not known before scheduling, the execution time of the filling and draining phases is approximated by the execution time of the longest-running packaged micro-batch under the given partition. The iteration time is therefore defined as:

[0018] $$T_{\text{iter}} = \sum_{k=1}^{K} \left( t_k^{f} + t_k^{b} \right) + (P - 1) \cdot \max_{1 \le k \le K} \left( t_k^{f} + t_k^{b} \right)$$

[0019] where $t_k^{f}$ and $t_k^{b}$ denote the forward and backward propagation times of packaged micro-batch $k$, $K$ is the number of packaged micro-batches, and $P$ is the pipeline depth within the replica;

[0020] 3-2: Construction of packaged micro-batches: All micro-batches in one iteration are sorted by maximum sequence length, and only micro-batches with similar sequence lengths are batched together for fine-tuning, which minimizes the additional padding overhead caused by forced alignment of micro-batches across tasks. The ordered list is divided into several consecutive groups, and each group is treated as one packaged micro-batch in subsequent pipeline execution. The goal is to minimize the iteration time. Because directly minimizing the iteration time involves a non-additive maximum term, an upper bound $\tau$ on the execution time of each packaged micro-batch is introduced, and dynamic programming determines the micro-batch grouping that minimizes the cumulative execution time under that bound. The state transition equations of the dynamic programming are:

[0021] $$D[0][0] = 0$$

[0022] $$D[i][k] = \min_{0 \le j < i,\; c(j+1,\, i) \le \tau} \bigl\{ D[j][k-1] + c(j+1,\, i) \bigr\}$$

[0023] where $D[i][k]$ is the minimum cumulative execution time required to divide the first $i$ micro-batches into $k$ consecutive groups such that the execution time of each group does not exceed $\tau$; and $c(j+1, i)$ is the execution time of the packaged micro-batch generated by merging micro-batches $j+1$ through $i$, which is subject to the per-group time limit $\tau$;

[0024] 3-3: Rearrangement of packaged micro-batches: Even with optimal construction of packaged micro-batches, heterogeneity between tasks leads to a slight but unavoidable load imbalance among them. In a 1F1B pipeline, the execution order of heterogeneous packaged micro-batches affects the duration of pipeline filling and draining as well as the degree of overlap in the steady-state phase, and ultimately the overall iteration time. A V-shaped rearrangement strategy based on Johnson's rule is adopted: packaged micro-batches with shorter execution times are placed at both ends to shorten the filling and draining phases, while packaged micro-batches with longer execution times are placed in the middle to maximize the alternation of forward and backward propagation, thereby reducing idle time in the pipeline.

[0025] This invention proposes a cost optimization method for heterogeneous multi-LoRA fine-tuning tasks, addressing the problem of efficiently configuring and scheduling GPU resources in such tasks. By introducing LoRA task clustering, fine-tuning tasks with similar resource requirements are grouped into the same replica to reduce resource waste. A greedy algorithm based on marginal benefit optimizes the allocation of GPU resources among replicas, improving overall resource utilization and reducing GPU time. Furthermore, this invention designs a pipeline execution optimization strategy for multi-LoRA fine-tuning tasks. Through a micro-batch packaging strategy and dynamic programming-based micro-batch grouping, pipeline bubbles are minimized and the iteration efficiency of each replica is improved. As a result, this invention improves the overall execution efficiency of multi-LoRA fine-tuning tasks in heterogeneous environments, reduces training costs, and has practical application value.

Description of the Drawings

[0026] Figure 1 is a system architecture diagram for implementing the present invention;

[0027] Figure 2 is a flowchart of the present invention;

[0028] Figure 3 is an example diagram of configuring replica-level GPU resources.

Detailed Implementation

[0029] To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.

[0030] This invention proposes a cost optimization method called Cappuccino for heterogeneous multi-LoRA fine-tuning tasks, in order to reduce the fine-tuning cost in multi-tenant shared clusters.

[0031] As shown in Figure 1, the process of implementing this invention is as follows. First, a set of fine-tuning jobs based on the same base model is obtained in a multi-tenant shared cluster environment, and the latency and GPU memory requirements of the multi-LoRA fine-tuning tasks are analyzed through a performance model to provide a basis for subsequent optimization. Based on these analysis results, a workload-aware GPU allocator performs inter-replica optimization to determine a cost-effective replica orchestration scheme, thereby minimizing overall GPU time. Next, a heterogeneous LoRA pipeline scheduler performs pipeline scheduling optimization for each replica, generating a pipeline execution plan suited to that replica and thereby minimizing the iteration time of heterogeneous multi-LoRA workloads.

[0032] Referring to Figure 2, the specific steps of this invention are as follows:

[0033] We acquire a set of multi-LoRA fine-tuning tasks, including base models, datasets, and training parameter configurations, and use performance models to analyze the multi-LoRA fine-tuning tasks and their latency and GPU memory requirements under batch fine-tuning.

[0034] Next, the workload-aware GPU allocator determines the replica orchestration scheme. It constructs a resource requirement feature vector for each task and clusters LoRA fine-tuning tasks with similar resource requirements into the same replica, which yields the replica partitioning. The minimum required GPU resources are then allocated to each replica, and a greedy allocation is performed among the replicas within the overall GPU resource budget, assigning each additional GPU to the replica that yields the greatest marginal benefit. This reduces resource waste within replicas and improves GPU resource efficiency. As shown in Figure 3, the replica configuration is first determined through LoRA task clustering. Using a task feature vector representation, each task $t_i$ is represented as a feature vector $\mathbf{x}_i$, built from resource requirement features that characterize the multidimensional heterogeneity of tasks. These resource requirement features include: a memory requirement feature vector $\mathbf{m}_i$, which characterizes the impact of batch size, high-quantile sequence length, LoRA rank, and the number of target modules on GPU memory usage; and a compute requirement feature vector $\mathbf{c}_i$, which characterizes the computational cost corresponding to the average sequence length and its dispersion. The feature vector of task $t_i$ is expressed as $\mathbf{x}_i = [\mathbf{m}_i; \mathbf{c}_i]$. The tasks are then clustered on these feature vectors to obtain the task grouping. The clustering objective combines minimization of intra-group heterogeneity with regularization of the replica count, which reduces resource requirement differences between tasks in the same group and suppresses replica fragmentation. It is defined as:

[0035] $$\min_{\mathcal{R}} \; \sum_{R \in \mathcal{R}} \sum_{t_i, t_j \in R} d(\mathbf{x}_i, \mathbf{x}_j) + \lambda \lvert \mathcal{R} \rvert$$

[0036] where $\mathcal{R}$ denotes the set of replicas, each replica $R$ instantiating one copy of the base model that is shared by the tasks assigned to it; $d(\cdot,\cdot)$ is a dissimilarity measure in the feature space; and $\lambda$ regularizes the number of replicas. The implementation uses the HDBSCAN algorithm to automatically discover task groups in the feature space.
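The feature construction and clustering objective described above can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the `Task` fields, the use of a 95th-percentile length as the "high-quantile" feature, min-max normalization, Euclidean distance as the dissimilarity measure, and the pairwise form of the intra-group term. The sketch only evaluates the objective for a given grouping; it does not reproduce the HDBSCAN step.

```python
from dataclasses import dataclass
from itertools import combinations
from math import dist


@dataclass(frozen=True)
class Task:
    # Hypothetical task description; field names are illustrative only.
    batch_size: int
    p95_seq_len: int        # high-quantile sequence length (quantile is assumed)
    lora_rank: int
    n_target_modules: int
    mean_seq_len: float
    std_seq_len: float      # dispersion of the sequence length


def feature_vector(t: Task) -> tuple:
    """Concatenate memory-related and compute-related features."""
    memory = (t.batch_size, t.p95_seq_len, t.lora_rank, t.n_target_modules)
    compute = (t.mean_seq_len, t.std_seq_len)
    return tuple(float(v) for v in memory + compute)


def normalize(vectors):
    """Min-max scale every dimension to [0, 1] so no feature dominates."""
    lo = [min(col) for col in zip(*vectors)]
    hi = [max(col) for col in zip(*vectors)]
    return [
        tuple((v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(vec, lo, hi))
        for vec in vectors
    ]


def objective(groups, lam):
    """Pairwise intra-group distance plus lam times the number of groups."""
    intra = sum(dist(a, b) for g in groups for a, b in combinations(g, 2))
    return intra + lam * len(groups)


tasks = [
    Task(4, 128, 8, 2, 70.0, 10.0),        # two light tasks
    Task(4, 160, 8, 2, 90.0, 15.0),
    Task(16, 4096, 64, 4, 3900.0, 800.0),  # two heavy tasks
    Task(16, 3800, 64, 4, 3500.0, 700.0),
]
x = normalize([feature_vector(t) for t in tasks])
split = objective([x[:2], x[2:]], lam=0.5)   # light and heavy tasks separated
merged = objective([x], lam=0.5)             # everything in one replica
```

With these made-up tasks, separating light and heavy tasks scores lower than placing all four in one replica, while a larger `lam` would push the balance back toward fewer replicas.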

[0037] Next, GPU resources are allocated among the replicas. After the LoRA task clustering is complete, each cluster is mapped to its corresponding replica, and the GPU allocation of each replica is determined under the global GPU resource constraint. Each replica first receives the minimum number of GPUs that satisfies its memory constraint as its initial configuration, and the number of GPUs is then increased step by step as the remaining budget allows. Increasing the number of GPUs (that is, increasing the pipeline depth) has a dual effect. On the one hand, a deeper pipeline reduces the residency pressure of model weights and activations on each GPU, freeing memory for building a more balanced batch fine-tuning pipeline. This helps reduce pipeline bubbles and kernel launch overhead, and thus shortens the overall execution time of the replica. On the other hand, a deeper pipeline introduces additional inter-stage communication and requires more micro-batches to fill and drain the pipeline, which may increase the execution time of the replica. When the additional overhead exceeds the efficiency gain, the time consumed per GPU increases, and so does the overall GPU time of the replica.

[0038] To weigh this trade-off, the benefit is quantified by calculating the marginal GPU time saving brought about by allocating an additional GPU. A greedy GPU allocation strategy is then employed: each time a GPU is added, the marginal GPU time saving from allocating one GPU is calculated, and the replica with the largest marginal saving is selected for GPU allocation. After allocation, the next marginal saving for that replica is recalculated, and the selection order is updated. This process is repeated until the budget is exhausted or there are no positive marginal savings, thus obtaining the final replica GPU resource configuration scheme.
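The greedy procedure above can be sketched as follows. This is a minimal illustration, not the patented implementation: `gpu_time(r, g)` stands in for the performance model's estimate of the GPU time of replica `r` on `g` GPUs, and the `PROFILE` numbers are invented purely to exercise the code.

```python
import heapq


def greedy_allocate(min_gpus, gpu_time, budget):
    """Greedy GPU allocation by marginal GPU-time saving.

    min_gpus[r]    -- smallest GPU count satisfying replica r's memory constraint
    gpu_time(r, g) -- estimated GPU time of replica r when it runs on g GPUs
    budget         -- total number of GPUs available
    """
    alloc = list(min_gpus)
    remaining = budget - sum(alloc)
    if remaining < 0:
        raise ValueError("budget cannot cover the minimum allocation")

    def saving(r):
        # Marginal GPU-time saving of giving replica r one more GPU.
        return gpu_time(r, alloc[r]) - gpu_time(r, alloc[r] + 1)

    # Max-heap on saving, implemented with negated keys.
    heap = [(-saving(r), r) for r in range(len(alloc))]
    heapq.heapify(heap)
    while remaining > 0 and heap:
        neg_saving, r = heapq.heappop(heap)
        if -neg_saving <= 0:
            break  # no replica has a positive marginal saving
        alloc[r] += 1
        remaining -= 1
        heapq.heappush(heap, (-saving(r), r))  # recompute for the updated replica
    return alloc


# Invented GPU-time profiles (GPU count -> GPU time) for two replicas.
PROFILE = [
    {1: 100, 2: 80, 3: 75, 4: 78},
    {2: 200, 3: 150, 4: 140, 5: 145},
]


def lookup(r, g):
    # Unprofiled GPU counts are treated as infinitely expensive.
    return PROFILE[r].get(g, float("inf"))


print(greedy_allocate([1, 2], lookup, budget=6))   # -> [2, 4]
print(greedy_allocate([1, 2], lookup, budget=10))  # -> [3, 4]
```

In the second call the loop stops with GPUs left over, because adding a fourth GPU to the first replica or a fifth to the second would increase GPU time in these made-up profiles.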

[0039] Next, the heterogeneous LoRA pipeline scheduler performs pipeline scheduling optimization for each replica to minimize the completion time of that replica. This objective can be recast as optimizing the iteration time of each multi-LoRA pipeline-parallel run. Under the 1F1B pipeline schedule, a single multi-LoRA pipeline-parallel iteration consists of a pipeline filling phase, a steady-state phase, and a draining phase. The filling phase initializes the pipeline through the forward propagation of the first packaged micro-batch; the steady-state phase overlaps the forward and backward propagation of the intermediate packaged micro-batches; and the draining phase completes with the backward propagation of the last packaged micro-batch. Because the packaging result of the micro-batches is not known before scheduling, the execution time of the filling and draining phases is approximated by the execution time of the longest-running packaged micro-batch under the given partition. The iteration time is therefore defined as:

[0040] $$T_{\text{iter}} = \sum_{k=1}^{K} \left( t_k^{f} + t_k^{b} \right) + (P - 1) \cdot \max_{1 \le k \le K} \left( t_k^{f} + t_k^{b} \right)$$

[0041] where $t_k^{f}$ and $t_k^{b}$ denote the forward and backward propagation times of packaged micro-batch $k$, $K$ is the number of packaged micro-batches, and $P$ is the pipeline depth within the replica.
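As a small worked illustration of this iteration time model, the sketch below assumes one form consistent with the description: the steady-state cost is the sum of the forward and backward times of all packaged micro-batches, and the filling and draining phases are charged as (pipeline depth minus one) times the longest packaged micro-batch. The function name and the sample numbers are hypothetical.

```python
def iteration_time(fwd, bwd, depth):
    """Estimate one 1F1B iteration under the assumed model.

    fwd[k], bwd[k] -- forward / backward time of packaged micro-batch k
    depth          -- pipeline depth (number of stages) of the replica
    """
    per_batch = [f + b for f, b in zip(fwd, bwd)]
    steady_state = sum(per_batch)
    fill_and_drain = (depth - 1) * max(per_batch)  # longest batch as proxy
    return steady_state + fill_and_drain


# Three packaged micro-batches on a 4-stage pipeline: 18 + 3 * 9
print(iteration_time([1, 2, 3], [2, 4, 6], depth=4))  # -> 45
```

The example makes the role of the maximum term visible: a single slow packaged micro-batch is paid for `depth - 1` extra times, which is why the later steps bound and balance the per-group execution time.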

[0042] To minimize the additional padding overhead caused by forced alignment of micro-batches across tasks, all micro-batches in a single iteration are sorted by maximum sequence length, and micro-batches with similar sequence lengths are grouped together. The ordered list is divided into several consecutive groups, and each group is treated as one packaged micro-batch in subsequent pipeline execution. The goal is to minimize the iteration time. Because directly minimizing the iteration time involves a non-additive maximum term, an upper bound $\tau$ on the execution time of each packaged micro-batch is introduced, and dynamic programming determines the micro-batch grouping that minimizes the cumulative execution time under that bound. The state transition equations of the dynamic programming are:

[0043] $$D[0][0] = 0$$

[0044] $$D[i][k] = \min_{0 \le j < i,\; c(j+1,\, i) \le \tau} \bigl\{ D[j][k-1] + c(j+1,\, i) \bigr\}$$

[0045] where $D[i][k]$ is the minimum cumulative execution time required to divide the first $i$ micro-batches into $k$ consecutive groups such that the execution time of each group does not exceed $\tau$; and $c(j+1, i)$ is the execution time of the packaged micro-batch generated by merging micro-batches $j+1$ through $i$, which is subject to the per-group time limit $\tau$. To account for the impact of the number of packaged micro-batches on the pipeline bubble ratio, the estimation formula for the iteration time is further modified as follows:

[0046] Here, $N$ denotes the number of micro-batches in one iteration and $K$ denotes the number of packaged micro-batches.

[0047] In the implementation, the execution cost and peak memory requirement of every consecutive range of micro-batches are pre-computed, which yields a finite set of candidate time limits to search over. For each candidate time limit, a two-dimensional table indexed by prefix length and number of groups is filled in by dynamic programming, with two feasibility checks: first, the memory requirement of each group must not exceed the GPU memory limit; second, the execution time of each group must not exceed the current time limit. During dynamic programming, the best predecessor of each state is recorded so that the optimal consecutive partitioning can be recovered. The algorithm then iterates over the feasible numbers of packaged micro-batches and evaluates the iteration time of each scheme, including the bubble ratio correction. Whenever a better candidate is found, a backtracking function recovers the optimal consecutive partitioning from the final dynamic programming state. This method merges different micro-batches into packaged micro-batches for batch fine-tuning, reduces the pipeline bubbles introduced by micro-batch heterogeneity, and minimizes the extra padding waste caused by forced alignment of cross-task micro-batches, thereby optimizing the pipeline efficiency of each replica.
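The search over candidate time limits, the two feasibility checks, and the predecessor-based backtracking can be sketched as below. The sketch rests on several assumptions that are not taken from the patent: a micro-batch is a `(rows, max_seq_len)` pair, a packed group costs a fixed `overhead` plus `per_token` times its padded token count, the padded token count doubles as the memory proxy, and the iteration time is estimated as the cumulative cost plus `(depth - 1)` times the time limit. The bubble ratio correction is not modeled.

```python
from math import inf


def pack_microbatches(microbatches, depth, mem_limit, overhead=1.0, per_token=0.001):
    """Group (rows, max_seq_len) micro-batches into consecutive packaged groups.

    Returns (estimated_iteration_time, groups). The cost and memory models
    are illustrative assumptions, not the patent's performance model.
    """
    mbs = sorted(microbatches, key=lambda m: m[1])  # sort by max sequence length
    n = len(mbs)

    def padded_tokens(i, j):
        # mbs[i:j] is padded to its longest sequence, which is mbs[j - 1][1].
        return sum(rows for rows, _ in mbs[i:j]) * mbs[j - 1][1]

    # Pre-compute the execution cost of every consecutive range.
    cost = {(i, j): overhead + per_token * padded_tokens(i, j)
            for i in range(n) for j in range(i + 1, n + 1)}

    best_time, best_groups = inf, []
    for tau in sorted(set(cost.values())):  # finite set of candidate time limits
        # D[i][k]: min cumulative cost of splitting the first i micro-batches
        # into k consecutive groups, each within memory and time limits.
        D = [[inf] * (n + 1) for _ in range(n + 1)]
        prev = [[None] * (n + 1) for _ in range(n + 1)]
        D[0][0] = 0.0
        for i in range(1, n + 1):
            for k in range(1, i + 1):
                for j in range(k - 1, i):
                    if padded_tokens(j, i) > mem_limit:  # check 1: memory
                        continue
                    if cost[(j, i)] > tau:               # check 2: time limit
                        continue
                    cand = D[j][k - 1] + cost[(j, i)]
                    if cand < D[i][k]:
                        D[i][k], prev[i][k] = cand, j    # record best predecessor
        for k in range(1, n + 1):  # iterate over feasible group counts
            if D[n][k] == inf:
                continue
            est = D[n][k] + (depth - 1) * tau  # fill/drain charged at the limit
            if est < best_time:
                best_time, best_groups = est, backtrack(prev, mbs, n, k)
    return best_time, best_groups


def backtrack(prev, mbs, i, k):
    """Recover the consecutive partition from the recorded predecessors."""
    groups = []
    while k > 0:
        j = prev[i][k]
        groups.append(mbs[j:i])
        i, k = j, k - 1
    return groups[::-1]


est, groups = pack_microbatches(
    [(4, 100), (4, 120), (2, 2000), (2, 2100)], depth=4, mem_limit=10000
)
```

On this made-up input the two short micro-batches are merged while the two long ones stay separate: merging the long pair would save one fixed overhead but raise the longest group's cost, which the fill and drain term multiplies by `depth - 1`. The triple loop per candidate limit is only suitable for small inputs; it is written for readability rather than speed.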

[0048] Even with optimal construction of packaged micro-batches, heterogeneity between tasks leads to a slight but unavoidable load imbalance among them. In a 1F1B pipeline, the execution order of heterogeneous packaged micro-batches affects the duration of pipeline filling and draining as well as the degree of overlap in the steady-state phase, and ultimately the overall iteration time. To mitigate this effect, a V-shaped rearrangement strategy based on Johnson's rule is employed. Shorter packaged micro-batches are placed at both ends to shorten the filling and draining phases, while longer ones are placed in the middle to maximize the alternation of forward and backward propagation, thereby reducing idle time in the pipeline.
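A minimal sketch of the placement described above follows. It implements only the stated layout (short packaged micro-batches at both ends, long ones in the middle) by dealing the sorted micro-batches alternately to the head and the tail; it does not implement Johnson's rule itself, and the function name is hypothetical.

```python
def v_shape_order(durations):
    """Return an execution order (list of indices) with short items at both ends."""
    ascending = sorted(range(len(durations)), key=lambda i: durations[i])
    head, tail = [], []
    for pos, idx in enumerate(ascending):
        # Deal micro-batches alternately to the front and the back.
        (head if pos % 2 == 0 else tail).append(idx)
    return head + tail[::-1]


durations = [5, 1, 4, 2, 3]
order = v_shape_order(durations)
print([durations[i] for i in order])  # -> [1, 3, 5, 4, 2]
```

The resulting sequence rises to the longest packaged micro-batch in the middle and falls again, so the first forward pass and the last backward pass both belong to short micro-batches.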

[0049] Based on the above functions, cost optimization for heterogeneous multi-LoRA fine-tuning tasks has been achieved, reducing the fine-tuning cost in multi-tenant shared clusters.

[0050] Experimental Example

[0051] To verify the performance of the proposed algorithm in heterogeneous multi-LoRA fine-tuning, a heterogeneous multi-LoRA workload was constructed that fine-tunes 16 LoRA adapters in parallel on each of three open-source LLM base models. The selected models span two representative model families with different parameter sizes: Llama-2-7B, Llama-2-13B, and Qwen2.5-32B. Each model instantiated 16 adapters, corresponding to instruction fine-tuning datasets covering four task types: summarization, mathematical reasoning, code generation, and domain-specific understanding. The sequence lengths of the datasets varied significantly (average length from approximately 70 to 3.9k tokens), leading to differences in computational intensity and padding behavior. Each job corresponds to a single LoRA adapter with its own dataset and hyperparameter settings (such as batch size, LoRA rank, and target modules), resulting in runtime heterogeneity. The experiment was conducted on a server equipped with eight NVIDIA A100 GPUs (40 GB VRAM) interconnected via PCIe 4.0 ×16, with 80 vCPUs and 720 GB of main memory.

[0052] Baselines and evaluation metrics: Cappuccino is compared with three baseline methods: (i) Sequential-LoRA, (ii) mLoRA+, and (iii) LoRAFusion+. In Sequential-LoRA, each LoRA adapter executes sequentially as an independent job and is allocated the minimum number of GPUs required to meet its memory needs. In mLoRA+, all LoRA adapters share a single replica of the base model and are allocated enough GPU memory to accommodate all adapter parameters and activation memory simultaneously; mLoRA+ parallelizes multiple LoRA pipelines simply by alternating the execution of micro-batches. In LoRAFusion+, resource allocation is similar to mLoRA+, but a MILP-based micro-batch construction strategy improves pipeline utilization by packing samples from different LoRA adapters. GPU time is used as the primary metric, as it directly reflects the end-to-end cost of multi-LoRA fine-tuning. To further evaluate Cappuccino, four additional metrics are reported: completion time, pipeline bubble ratio, padding ratio, and runtime overhead.
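For concreteness, the primary metric can be computed as in the short sketch below, which assumes that the GPU time of a run is the sum over replicas of the number of GPUs multiplied by the wall-clock time. The function names and all numbers are hypothetical and are not taken from Table 1.

```python
def gpu_time(replicas):
    """Total GPU time for an iterable of (num_gpus, wall_clock_seconds) pairs."""
    return sum(gpus * seconds for gpus, seconds in replicas)


def reduction_pct(baseline, ours):
    """Relative GPU-time reduction of `ours` against `baseline`, in percent."""
    return 100.0 * (baseline - ours) / baseline


# Made-up example: one 8-GPU replica versus two smaller replicas.
single_replica = gpu_time([(8, 62.5)])
two_replicas = gpu_time([(2, 100.0), (4, 50.0)])
print(reduction_pct(single_replica, two_replicas))  # -> 20.0
```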

[0053] Table 1: GPU time (in seconds) of Sequential-LoRA, mLoRA+, LoRAFusion+, and Cappuccino under different experimental configurations.

[0054]

[0055] As shown in Table 1, in the evaluation of 16 heterogeneous LoRA fine-tuning jobs across three base models, the proposed Cappuccino consistently achieves the lowest end-to-end GPU time. Compared with Sequential-LoRA, mLoRA+, and LoRAFusion+, it reduces GPU time by up to 41.8%, 21.3%, and 57.3%, respectively, demonstrating a significant cost-efficiency improvement under heterogeneous multi-LoRA workloads. The gap relative to Sequential-LoRA stems mainly from the inefficiency of sequential job execution. Sequential-LoRA allocates each job the minimum number of GPUs that satisfies its memory constraint, which avoids explicit resource over-allocation. However, sequential execution prevents job-level fixed overheads (such as model loading) from being amortized across jobs, which significantly increases overall GPU time. Compared with mLoRA+ and LoRAFusion+, Cappuccino's advantage lies in its explicit consideration of the heterogeneity of resource requirements across jobs. mLoRA+ and LoRAFusion+ place all jobs on a single pipeline replica in order to amortize memory overhead through a shared base model. However, this single-replica configuration must be provisioned for peak memory demand, so lightweight jobs run in excessively deep pipelines and incur additional cross-stage communication and pipeline bubble overhead, which reduces overall efficiency. Furthermore, mLoRA+ builds independent micro-batches for each adapter and parallelizes multiple LoRA pipelines only through alternating scheduling; the runtime differences between micro-batches exacerbate pipeline bubbles and prolong the pipeline iteration time. The performance degradation of LoRAFusion+ is more pronounced: its data-regrouping micro-batch construction method, which is optimized for sequence filling, brings limited benefit in sequence-filling training systems and may further reduce efficiency under heterogeneous workloads, which shows its limited adaptability.
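The effect of an excessively deep pipeline on a lightweight job can be illustrated with a short sketch. It assumes the standard idealized estimate for a synchronous pipeline with equal-cost micro-batches, in which the idle (bubble) share of an iteration is (p - 1) / (m + p - 1) for pipeline depth p and m micro-batches. This estimate and the function name `bubble_fraction` are illustrative assumptions and are not taken from the application.

```python
def bubble_fraction(depth: int, num_micro_batches: int) -> float:
    """Idealized idle share of one pipeline iteration.

    Assumes every micro-batch costs the same on every stage, so the
    fill and drain phases together waste (depth - 1) micro-batch slots.
    """
    if depth < 1 or num_micro_batches < 1:
        raise ValueError("depth and num_micro_batches must be positive")
    return (depth - 1) / (num_micro_batches + depth - 1)


if __name__ == "__main__":
    # The same small job (8 micro-batches) placed on deeper and deeper pipelines.
    for depth in (1, 2, 4, 8):
        print(depth, round(bubble_fraction(depth, 8), 3))
```

Under this simplified model, a job with 8 micro-batches goes from no bubble at depth 1 to roughly 0.11 at depth 2 and roughly 0.47 at depth 8, which is consistent with the observation that forcing lightweight jobs onto a pipeline sized for peak memory demand wastes GPU time.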

[0056] The embodiments of the present invention also provide a cost optimization system for heterogeneous multi-LoRA fine-tuning tasks. The system includes: an analyzer module for acquiring a set of multi-LoRA fine-tuning tasks submitted by the user, including the base model, dataset, and training parameter configuration; a performance model module for predicting the latency and GPU memory requirements of each task under batch fine-tuning; a load-aware GPU allocator module for clustering heterogeneous LoRA tasks according to their resource requirement feature vectors, mapping the clusters to different replicas, and allocating appropriate GPU resources to each replica so as to reduce resource waste within the replica and improve GPU utilization; and a heterogeneous LoRA pipeline scheduling module for generating a pipeline execution plan for each replica based on the GPU resource configuration results, minimizing pipeline bubbles and padding waste and thereby optimizing the iteration time of each replica.

[0057] The system acquires multi-LoRA fine-tuning tasks submitted by users in a multi-tenant cluster, including base models, datasets, and training parameter configurations. The system can automatically complete GPU resource configuration and pipeline scheduling, and minimize overall GPU time and computational costs while ensuring task execution efficiency.

Claims

1. A cost optimization method for heterogeneous multi-LoRA fine-tuning tasks, characterized in that it includes the following steps:
Step 1: Set up a set of multi-LoRA fine-tuning tasks, including the base model, dataset, and training parameter configuration, and use a performance model to analyze the latency and GPU memory requirements of the multi-LoRA fine-tuning tasks under batch fine-tuning.
Step 2: Based on the memory requirement analysis results of Step 1, construct a resource requirement feature vector for each task, and cluster LoRA fine-tuning tasks with similar resource requirements into the same replica to obtain the replica partitioning result; then allocate the minimum required GPU resources to each replica, and further perform greedy allocation among the replicas within the overall GPU resource budget, each time allocating a GPU to the replica that yields the maximum marginal benefit.
Step 3: Based on the replica partitioning of Step 2 and the GPU resources allocated to each replica, generate a pipeline execution plan for each replica: merge different micro-batches into packed micro-batches for batch fine-tuning while minimizing the extra padding waste caused by forced alignment of micro-batches across tasks, thereby optimizing the pipeline efficiency of each replica; and use a V-shaped sorting strategy based on Johnson's rule to rearrange the packed micro-batches, minimizing the iteration time within each replica and thereby reducing the overall cost of the multi-LoRA fine-tuning tasks.

2. The cost optimization method of claim 1, wherein Step 2 specifically includes:
2-1: LoRA fine-tuning task clustering: A task feature vector representation mechanism is adopted, in which each task is represented as a feature vector built from resource demand features that describe the multi-dimensional heterogeneity of tasks. The resource demand features include: a memory demand feature vector, which represents the influence of batch size, high-quantile sequence length, LoRA rank, and number of target modules on GPU memory usage; and a computation demand feature vector, which represents the computational overhead corresponding to the average sequence length and its dispersion. The feature vector of each task is formed from these features, and the tasks are clustered on the basis of their feature vectors to obtain the task grouping result. The clustering combines minimization of intra-group heterogeneity with regularization of the number of replicas, so as to reduce the differences in resource demand between tasks in the same group and to suppress replica fragmentation. The clustering objective is defined over a set of replicas, each of which instantiates a copy of the base model that is shared by the tasks assigned to it, using a measure of dissimilarity in the feature space and a term that regularizes the number of replicas. In implementations, the HDBSCAN algorithm is used to automatically discover task groupings in the feature space.
2-2: GPU resource allocation among replicas: After the LoRA task clustering is completed, each cluster is mapped to a corresponding replica, and the GPU allocation of each replica is determined under the global GPU resource constraint. For each replica, the minimum number of GPUs that satisfies its memory constraint is allocated as the initial configuration. The number of GPUs is then increased step by step as the remaining GPU budget allows. At each step, the marginal saving in GPU time obtained by allocating one additional GPU is calculated for every replica, and the replica with the largest marginal saving receives the GPU. After the allocation, the next marginal saving of that replica is recalculated and the selection order is updated. This process is repeated until the GPU budget is exhausted or no replica has a positive marginal saving, which yields the final number of GPUs configured for each replica.
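The greedy allocation of step 2-2 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `greedy_allocate`, the `gpu_time` callback (a stand-in for the performance model's predicted GPU time of a replica at a given GPU count), and the profiled numbers in the usage example are all hypothetical.

```python
import heapq
from typing import Callable, Dict


def greedy_allocate(
    min_gpus: Dict[str, int],
    gpu_time: Callable[[str, int], float],
    budget: int,
) -> Dict[str, int]:
    """Start every replica at its memory-feasible minimum, then hand out
    the remaining GPUs one at a time to the replica whose predicted GPU
    time drops the most, stopping when no replica benefits."""
    alloc = dict(min_gpus)
    remaining = budget - sum(alloc.values())
    if remaining < 0:
        raise ValueError("budget cannot cover the minimum configuration")

    def saving(replica: str) -> float:
        g = alloc[replica]
        return gpu_time(replica, g) - gpu_time(replica, g + 1)

    # heapq is a min-heap, so savings are stored negated.
    heap = [(-saving(r), r) for r in alloc]
    heapq.heapify(heap)
    while remaining > 0 and heap:
        neg_saving, replica = heapq.heappop(heap)
        if -neg_saving <= 0:
            break  # the best candidate no longer saves GPU time
        alloc[replica] += 1
        remaining -= 1
        # Only the chosen replica's marginal saving has changed.
        heapq.heappush(heap, (-saving(replica), replica))
    return alloc


if __name__ == "__main__":
    # Hypothetical predicted GPU time (seconds) per replica and GPU count.
    profile = {
        "A": {2: 100.0, 3: 90.0, 4: 88.0, 5: 89.0},
        "B": {1: 50.0, 2: 47.0, 3: 48.0},
    }
    predict = lambda r, g: profile[r].get(g, float("inf"))
    print(greedy_allocate({"A": 2, "B": 1}, predict, budget=6))
```

With these made-up numbers the loop gives replica A one GPU (saving 10), then replica B one (saving 3), then replica A another (saving 2), ending at 4 and 2 GPUs. Raising the budget further changes nothing, because every remaining marginal saving is negative.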

3. The cost optimization method of claim 1, wherein Step 3 specifically includes:
3-1: Parallel iteration time modeling for multi-LoRA pipelines: Under the 1F1B pipeline scheduling method, a single multi-LoRA pipeline-parallel iteration consists of a pipeline filling phase, a steady-state phase, and a draining phase. The filling phase initializes the pipeline through the forward propagation of the first packed micro-batch. The steady-state phase executes the intermediate packed micro-batches with overlapping forward and backward propagation. The draining phase is completed by the backward propagation of the last packed micro-batch. Since the packing result of the micro-batches is not known before scheduling, the execution time of the filling and draining phases is approximated by the execution time of the longest-running packed micro-batch under the given partition. The iteration time is accordingly defined in terms of the forward propagation time and backward propagation time of each packed micro-batch and the pipeline depth within the replica.
3-2: Construction of packed micro-batches: All micro-batches in one iteration are sorted by maximum sequence length, and only micro-batches with similar sequence lengths are allowed to be packed together for batch fine-tuning, so as to minimize the extra padding overhead caused by forced alignment of micro-batches across tasks. The sorted list is divided into several consecutive groups, and each group is treated as one packed micro-batch in the subsequent pipeline execution, with the goal of minimizing the iteration time. Because directly minimizing the iteration time involves a non-additive maximum term, a time upper bound is introduced to limit the execution time of each packed micro-batch. Under this constraint, dynamic programming is used to determine the optimal micro-batch grouping scheme that minimizes the execution time. In the state transition equation of the dynamic programming, the state is the minimum cumulative execution time required to divide the first several micro-batches into a given number of groups of consecutive micro-batches, where the execution time of each group must not exceed the time upper bound; the transition cost is the execution time of the packed micro-batch generated by merging a run of consecutive micro-batches, which is likewise subject to the per-group time upper bound.
3-3: Packed micro-batch rearrangement: Even with optimal packed micro-batch construction, heterogeneity between tasks leads to slight but unavoidable load imbalance among the packed micro-batches. A V-shaped rearrangement strategy based on Johnson's rule is adopted, which places packed micro-batches with short execution times at both ends to shorten the filling and draining phases, and places packed micro-batches with long execution times in the middle to maximize the overlap of forward and backward propagation and reduce idle time in the pipeline.
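Steps 3-2 and 3-3 can be sketched as follows, under stated assumptions. The cost model `group_cost` (a fixed overhead plus a term proportional to the padded token count), the fixed group count `k`, the bound `tau`, and all function names are hypothetical, since the formulas are not reproduced in this text. The rearrangement shown is one simple way to obtain the described "short at both ends, long in the middle" order by alternating placement; it is not a full Johnson's-rule procedure.

```python
from math import inf
from typing import List, Sequence


def group_cost(lengths: Sequence[int], overhead: float = 1.0,
               unit: float = 0.01) -> float:
    """Toy execution time of one packed micro-batch: every member is
    padded to the longest sequence in the group (hypothetical model)."""
    return overhead + unit * len(lengths) * max(lengths)


def pack_micro_batches(lengths: Sequence[int], k: int,
                       tau: float) -> List[List[int]]:
    """Split the length-sorted micro-batches into k consecutive groups,
    each costing at most tau, minimizing the summed group cost."""
    s = sorted(lengths)
    n = len(s)
    # best[g][i]: minimum total cost of splitting the first i items into g groups.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[-1] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for g in range(1, k + 1):
        for i in range(g, n + 1):
            for j in range(g - 1, i):
                c = group_cost(s[j:i])
                if c <= tau and best[g - 1][j] + c < best[g][i]:
                    best[g][i] = best[g - 1][j] + c
                    cut[g][i] = j
    if best[k][n] == inf:
        raise ValueError("no feasible grouping under this time bound")
    groups, i = [], n
    for g in range(k, 0, -1):  # walk the recorded cut points backwards
        j = cut[g][i]
        groups.append(s[j:i])
        i = j
    return groups[::-1]


def v_shaped_order(times: Sequence[float]) -> List[int]:
    """Indices arranged so that short packed micro-batches sit at both
    ends and long ones in the middle (alternating placement)."""
    ranked = sorted(range(len(times)), key=lambda i: times[i])
    head, tail = [], []
    for rank, idx in enumerate(ranked):
        (head if rank % 2 == 0 else tail).append(idx)
    return head + tail[::-1]


if __name__ == "__main__":
    groups = pack_micro_batches([128, 512, 130, 500, 1024, 1000], k=3, tau=25.0)
    times = [group_cost(g) for g in groups]
    print([groups[i] for i in v_shaped_order(times)])
```

In practice the bound `tau` would be swept over candidate values and the grouping with the smallest modeled iteration time kept; that outer search is omitted here to keep the sketch short.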