A model training duration estimation method, device, equipment and storage medium
By determining the operator flow and communication operator through simulation, and combining the cluster hardware configuration and sample data volume, the training time of the model is estimated. This solves the problems of high cost and single strategy in the existing technology, and achieves more accurate and flexible training time estimation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- RUIJIE NETWORKS CO LTD
- Filing Date
- 2024-12-24
- Publication Date
- 2026-06-26
AI Technical Summary
Existing distributed parallel training techniques require the actual training process to obtain training time and only support a single parallel strategy, resulting in high cost and poor flexibility in training time estimation.
The operator stream and communication operator of the model to be trained are determined by simulation. Combined with the cluster hardware configuration and sample data volume, the computation and communication time is estimated, and multiple parallel strategies are supported.
It reduces reliance on GPU cards, lowers estimated costs, and improves the accuracy and applicability of training duration estimation.
Figure CN122285239A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of computer technology, and in particular to a method, apparatus, device and storage medium for estimating model training time.
Background Technology
[0002] With the rapid development of artificial intelligence, traditional single-machine training models are no longer feasible, and distributed parallel training technology has been widely used. Distributed parallel training technology refers to training models simultaneously on multiple GPU cards to improve the speed and scale of model training.
[0003] Existing distributed parallel training techniques involve setting up multiple sub-computing systems to distribute the model and training data across these systems for multiple rounds of iterative training. The training time is recorded along with the data transfer volume and bandwidth to calculate the training duration. Training time is closely related to the resources and costs required for training. However, this approach requires a real training process to obtain the training time, and existing distributed parallel training techniques typically only support a single parallel strategy, lacking flexibility.
[0004] Being able to estimate model training time before training is crucial for companies to make decisions related to model training.
Summary of the Invention
[0005] This application provides a method, apparatus, device, and storage medium for estimating model training time, in order to improve the accuracy of model training time estimation.
[0006] Firstly, this application provides a method for estimating model training time, the method comprising:
[0007] Based on the model architecture of the model to be trained, the operator flow of the model to be trained is determined, and the operator flow includes multiple computation operators; each computation operator has its own computational amount indicated by its parameter configuration.
[0008] Based on the parallel strategy of the model to be trained, multiple communication operators located in the operator stream are determined; each communication operator has its own communication volume indicated by parameter configuration.
[0009] Based on the hardware configuration of the cluster running the model to be trained and the amount of sample data of the model to be trained, the computation time of each computation operator in the operator stream and the communication time of each communication operator in each training round are determined by simulation, thereby obtaining the training time of the model to be trained.
[0010] The model training time estimation method provided in this application first determines multiple computation operators based on the model architecture of the model to be trained, and multiple communication operators based on the parallel strategy of the model to be trained. Then, based on the cluster configuration and sample data volume of the model to be trained, the computation time of each computation operator and the communication time of each communication operator are determined through simulation. Finally, the training time of the model to be trained is determined based on the computation time of each computation operator and the communication time of each communication operator. Since this scheme determines the computation time and communication time through simulation, it is not necessary to actually execute the training task of the model to be trained on GPU cards in order to estimate the training time, which reduces the dependence on GPU cards and lowers the estimation cost. Moreover, this application considers not only computation time but also communication time when estimating the training time, improving the accuracy of the training time estimation of the model to be trained.
[0011] In one possible design, based on the parallel strategy of the model to be trained, multiple communication operators located in the operator stream are determined, including:
[0012] Based on the number of model layers in the model to be trained and the pipeline parallelism (PP) degree in the parallel strategy, the computation operators corresponding to multiple training stages and the PP communication operators between training stages are determined; the computation operators corresponding to any training stage are the computation operators of the model layers assigned to that stage; and / or,
[0013] Based on the data parallelism (DP) degree in the parallel strategy, determine the DP communication operator corresponding to the training phase; and / or,
[0014] Based on the tensor parallelism (TP) degree in the parallel strategy, the computational operators executed by each GPU card in the training phase and the TP communication operators between GPU cards are determined.
[0015] This application is applicable not only to pipeline parallelism (PP) strategies, but also to tensor parallelism (TP) strategies and data parallelism (DP) strategies. Compared with existing model training time prediction schemes that only support PP strategies or a single parallel strategy, this application supports multiple parallel strategies, thereby improving the applicability and flexibility of the model training time prediction method.
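As an illustration of the PP case described above, the following minimal Python sketch splits the model layers into training stages and lists the PP communication operators between adjacent stages. The function names and the even-split rule are assumptions made for illustration; the application does not specify how layers are distributed across stages.

```python
def split_into_stages(num_layers: int, pp_degree: int) -> list[range]:
    """Assign contiguous model layers to PP stages as evenly as possible.

    Assumption: an even split, with any remainder going to the earliest stages.
    """
    if pp_degree < 1 or pp_degree > num_layers:
        raise ValueError("pp_degree must be between 1 and num_layers")
    base, extra = divmod(num_layers, pp_degree)
    stages, start = [], 0
    for stage in range(pp_degree):
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


def pp_comm_operators(pp_degree: int) -> list[tuple[int, int]]:
    """One PP communication operator between each pair of adjacent stages."""
    return [(stage, stage + 1) for stage in range(pp_degree - 1)]


# Example: a 96-layer model with a PP degree of 8.
stages = split_into_stages(96, 8)
links = pp_comm_operators(8)
```

With 96 layers and a PP degree of 8, each stage holds 12 layers and there are 7 stage-to-stage communication operators. The DP and TP cases would add further communication operators within each stage in a similar way.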
[0016] In one possible design, before determining the computation time of each computation operator and the communication time of each communication operator in the operator stream in each training round through simulation, the following steps are also included:
[0017] Based on the parallel strategy of the model to be trained and the hardware configuration of the cluster running the model to be trained, the allocation of GPU cards in the cluster during the model training process is determined.
[0018] Based on the parallel strategy of the model to be trained and the amount of sample data of the model to be trained, the training rounds of the model training are determined.
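The two preparatory steps above can be sketched as follows. This is a hedged illustration only: the application does not give formulas, so the product rule for the number of GPU cards and the names `micro_batch_size` and `grad_accum_steps` are assumptions borrowed from common distributed-training practice.

```python
import math


def gpus_required(pp_degree: int, dp_degree: int, tp_degree: int) -> int:
    """Assumption: each (PP stage, DP replica, TP shard) triple uses one GPU card."""
    return pp_degree * dp_degree * tp_degree


def fits_cluster(pp_degree: int, dp_degree: int, tp_degree: int, cluster_gpus: int) -> bool:
    """Check whether the cluster has enough GPU cards for the parallel strategy."""
    return gpus_required(pp_degree, dp_degree, tp_degree) <= cluster_gpus


def training_rounds(num_samples: int, micro_batch_size: int,
                    dp_degree: int, grad_accum_steps: int = 1) -> int:
    """Assumption: one round consumes one global batch of samples."""
    global_batch = micro_batch_size * dp_degree * grad_accum_steps
    return math.ceil(num_samples / global_batch)


# Example values (hypothetical): PP=8, DP=4, TP=8 on a 256-card cluster.
needed = gpus_required(8, 4, 8)
rounds = training_rounds(num_samples=1000, micro_batch_size=4, dp_degree=4, grad_accum_steps=8)
```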
[0019] In one possible design, the computation time of each computation operator and the communication time of each communication operator in the operator stream are determined through simulation, including:
[0020] For any computational operator in the operator stream, the computation time of the computational operator is determined based on the computational amount of the computational operator and the hardware configuration of the GPU card executing the computational operator;
[0021] For any communication operator in the operator stream, the communication time of the communication operator is determined based on the transmission bandwidth between the communication operator and the communication object and the communication volume of the communication operator.
[0022] Determining the computation time of a computation operator based on the GPU card's hardware configuration and the computational load of the computation operator can improve the accuracy of computation time determination. Determining the communication time of a communication operator based on the transmission bandwidth between communication objects and the communication load of the communication operator can improve the accuracy of communication time determination. Estimating the model training time based on more accurate computation and communication times can improve the accuracy of prediction.
[0023] In one possible design, the computation time of the computation operator is determined based on the amount of computational data and the hardware configuration of the GPU card executing the computation operator, including:
[0024] If the computation operator is a computationally intensive operator, then the maximum value between the memory access time and the computation time of the computation operator is taken as the computation time of the computation operator; wherein, the memory access time is determined based on the amount of computation data of the computation operator and the Layer 2 bandwidth of the GPU card executing the computation operator; the computation time is determined based on the amount of computation data of the computation operator and the hardware computing power of the GPU card executing the computation operator;
[0025] If the computation operator is a memory-intensive operator, then the memory access time of the computation operator is determined as the computation time of the computation operator.
[0026] This application subdivides computation operators into computationally intensive operators and memory-intensive operators, and designs different methods for determining computation time for these two types of computation operators, ensuring the rationality and accuracy of the methods for determining the computation time of computation operators.
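The rule in the two cases above can be written as a short Python sketch. The parameter names, the use of floating-point operations and bytes as the units of "computational amount", and the example hardware figures are assumptions for illustration, not values taken from the application.

```python
def operator_compute_time(flops: float, bytes_moved: float,
                          peak_flops: float, memory_bandwidth: float,
                          compute_intensive: bool) -> float:
    """Estimate the execution time of one computation operator, in seconds.

    memory_bandwidth stands for the bandwidth used for the memory access time
    (bytes per second); peak_flops stands for the hardware computing power.
    """
    memory_access_time = bytes_moved / memory_bandwidth
    if not compute_intensive:
        # Memory-intensive operator: memory access time alone determines the result.
        return memory_access_time
    arithmetic_time = flops / peak_flops
    # Compute-intensive operator: the slower of the two paths dominates.
    return max(memory_access_time, arithmetic_time)


# Hypothetical operator: 2e12 FLOPs and 1e9 bytes on a card with
# 1e14 FLOP/s peak throughput and 1e12 B/s bandwidth.
t_matmul = operator_compute_time(2e12, 1e9, 1e14, 1e12, compute_intensive=True)
t_softmax = operator_compute_time(1e6, 1e9, 1e14, 1e12, compute_intensive=False)
```

In this example the matrix multiplication is limited by arithmetic (about 0.02 s against 0.001 s of memory access), while the memory-intensive operator takes its memory access time of about 0.001 s.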
[0027] In one possible design, determining the communication time of the communication operator based on the transmission bandwidth between the communication operator and the communication object, and the communication volume of the communication operator, includes:
[0028] If the communication objects are located on the same server, the effective bandwidth corresponding to the communication operator is determined based on the trend graph between communication volume and effective bandwidth.
[0029] The communication time of the communication operator is determined based on the communication volume of the communication operator and the effective bandwidth corresponding to the communication operator.
[0030] When the communication objects are located on the same server, the communication time is not determined by a single fixed effective bandwidth. This application takes into account that the effective bandwidth during transmission is related to the amount of communication transmitted. Different amounts of communication correspond to different effective bandwidths. Therefore, this application determines the effective bandwidth corresponding to the transmission of the communication operator based on the communication amount of the communication operator and the trend graph between the communication amount and the effective bandwidth, which can improve the accuracy of determining the communication time of the communication operator.
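One way to read an effective bandwidth off such a trend graph is to store sampled points of the curve and interpolate between them. The sketch below assumes piecewise-linear interpolation, and the sample points in `CURVE` are hypothetical placeholders rather than measurements from the application.

```python
from bisect import bisect_left

# Hypothetical samples of the trend graph: (communication volume in bytes,
# effective bandwidth in bytes per second), sorted by volume.
CURVE = [(1e3, 1e9), (1e6, 50e9), (1e9, 200e9)]


def effective_bandwidth(volume: float, curve=CURVE) -> float:
    """Piecewise-linear lookup of effective bandwidth for a communication volume."""
    volumes = [v for v, _ in curve]
    if volume <= volumes[0]:
        return curve[0][1]
    if volume >= volumes[-1]:
        return curve[-1][1]
    i = bisect_left(volumes, volume)
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return y0 + (y1 - y0) * (volume - x0) / (x1 - x0)


def intra_server_comm_time(volume: float, curve=CURVE) -> float:
    """Communication time = communication volume / effective bandwidth."""
    return volume / effective_bandwidth(volume, curve)
```

Small transfers then see a much lower effective bandwidth than large ones, which is the effect that a single fixed bandwidth value would miss.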
[0031] In one possible design, determining the communication time of the communication operator based on the transmission bandwidth between the communication operator and the communication object, and the communication volume of the communication operator, includes:
[0032] If the communication objects are located on different servers, the communication time of the communication operator is determined based on the communication volume of the communication operator, the inter-server communication bandwidth, and the transmission delay of the switch.
[0033] When communication objects are located on different servers, the transmission of communication operators usually requires the use of switches. Switches have a certain transmission delay. Therefore, when communication objects are located on different servers, determining the communication time of communication operators based on the communication volume of communication operators, inter-server communication bandwidth, and the transmission delay of switches can improve the accuracy of communication time determination.
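A minimal sketch of the inter-server case follows. The application names the three inputs but not how they are combined, so the additive form and the `num_switch_hops` parameter are assumptions.

```python
def inter_server_comm_time(volume_bytes: float,
                           inter_server_bandwidth: float,
                           switch_delay_s: float,
                           num_switch_hops: int = 1) -> float:
    """Assumed model: transfer time plus a fixed delay per switch traversed."""
    transfer_time = volume_bytes / inter_server_bandwidth
    return transfer_time + num_switch_hops * switch_delay_s


# Hypothetical example: 1 GB over a 50 GB/s link through one switch with 2 microseconds of delay.
t_inter = inter_server_comm_time(1e9, 50e9, 2e-6)
```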
[0034] In one possible design, the computation time of each computation operator and the communication time of each communication operator in the operator stream are determined through simulation to obtain the training duration of the model to be trained, including:
[0035] The computation time of each computation operator and the communication time of each communication operator in the operator stream during the forward phase are determined by simulation, thus obtaining the forward phase time.
[0036] The computation time of each computation operator and the communication time of each communication operator in the operator stream during the backward phase are determined by simulation, thus obtaining the backward phase time.
[0037] The computation time of each computation operator and the communication time of each communication operator in the recomputation phase of the operator stream in each training round are determined by simulation, and the recomputation phase time is obtained.
[0038] The time for each training round is determined based on the time spent in the forward phase, the backward phase, the recalculation phase, and the idle waiting time in each training round.
[0039] The training duration of the model to be trained is determined based on the time consumed in each training round and the number of training rounds.
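The aggregation in the steps above can be sketched as follows. The sketch assumes that the four components of a round simply add up and that every round takes the same time; the application only states that the round time is determined "based on" these components, so both points are assumptions.

```python
from dataclasses import dataclass


@dataclass
class RoundTimes:
    """Simulated time components of one training round, in seconds."""
    forward: float
    backward: float
    recompute: float
    idle: float

    def total(self) -> float:
        # Assumption: the components do not overlap and are summed.
        return self.forward + self.backward + self.recompute + self.idle


def training_duration(round_times: RoundTimes, num_rounds: int) -> float:
    """Assumption: every training round takes the same time."""
    return round_times.total() * num_rounds


# Hypothetical per-round times.
one_round = RoundTimes(forward=1.0, backward=2.0, recompute=0.5, idle=0.25)
total_seconds = training_duration(one_round, num_rounds=1000)
```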
[0040] Secondly, this application also provides a model training time prediction device, which includes: a determination unit and a simulation unit;
[0041] The determining unit is used to determine the operator stream of the model to be trained based on the model architecture of the model to be trained. The operator stream includes multiple computation operators; each computation operator has its own computational amount indicated by its parameter configuration.
[0042] The determining unit is further configured to determine multiple communication operators located in the operator stream based on the parallel strategy of the model to be trained; each communication operator has a communication volume indicated by its own parameter configuration.
[0043] The simulation unit is used to determine the computation time of each computation operator and the communication time of each communication operator in the operator stream in each training round by means of simulation, based on the hardware configuration of the cluster running the model to be trained and the amount of sample data of the model to be trained, thereby obtaining the training time of the model to be trained.
[0044] In one possible design, the determining unit is configured to, based on the parallel strategy of the model to be trained, determine multiple communication operators located in the operator stream as follows: based on the number of model layers of the model to be trained and the pipeline parallelism (PP) degree in the parallel strategy, determine the computation operators corresponding to multiple training stages and the PP communication operators between training stages; the computation operator corresponding to any training stage is the computation operator of the model layer number; and / or, based on the data parallelism (DP) degree in the parallel strategy, determine the DP communication operator corresponding to the training stage; and / or, based on the tensor parallelism (TP) degree in the parallel strategy, determine the computation operators executed by each GPU card in the training stage and the TP communication operators between GPU cards.
[0045] In one possible design, before determining the computation time of each computation operator and the communication time of each communication operator in the operator stream during each training round through simulation, the determining unit is further configured to determine the allocation of GPU cards in the cluster during the model training process based on the parallel strategy of the model to be trained and the hardware configuration of the cluster running the model to be trained; and to determine the training rounds of the model training based on the parallel strategy of the model to be trained and the amount of sample data of the model to be trained.
[0046] In one possible design, the simulation unit is used to determine the computation time of each computation operator and the communication time of each communication operator in the operator stream in each training round through simulation: for any computation operator in the operator stream, the computation time of the computation operator is determined based on the computational amount of the computation operator and the hardware configuration of the GPU card executing the computation operator; for any communication operator in the operator stream, the communication time of the communication operator is determined based on the transmission bandwidth between the communication operator and the communication object and the communication volume of the communication operator.
[0047] In one possible design, the simulation unit is configured to determine the computation time of the computation operator based on the amount of computational data of the computation operator and the hardware configuration of the GPU card executing the computation operator: if the computation operator is a computationally intensive operator, the maximum value between the memory access time and the computation time of the computation operator is taken as the computation time of the computation operator; wherein, the memory access time is determined based on the amount of computational data of the computation operator and the Layer 2 bandwidth of the GPU card executing the computation operator; the computation time is determined based on the amount of computational data of the computation operator and the hardware computing power of the GPU card executing the computation operator; if the computation operator is a memory-intensive operator, the memory access time of the computation operator is determined as the computation time of the computation operator.
[0048] In one possible design, the simulation unit is used to determine the communication time of the communication operator based on the transmission bandwidth between the communication operator and the communication object and the communication volume of the communication operator: if the communication object is located in the same server, the effective bandwidth corresponding to the communication operator is determined based on the trend graph between the communication volume and the effective bandwidth; the communication time of the communication operator is determined based on the communication volume of the communication operator and the effective bandwidth corresponding to the communication operator.
[0049] In one possible design, the simulation unit is used to determine the communication time of the communication operator based on the transmission bandwidth between the communication operator and the communication object and the communication volume of the communication operator: if the communication object is located in different servers, the communication time of the communication operator is determined based on the communication volume of the communication operator, the inter-server communication bandwidth, and the transmission delay of the switch.
[0050] In one possible design, the simulation unit is used to determine the computation time of each computation operator and the communication time of each communication operator in the operator stream of each training round through simulation, in order to obtain the training duration of the model to be trained: determining the computation time of each computation operator and the communication time of each communication operator in the forward phase of the operator stream of each training round through simulation, to obtain the forward phase time; determining the computation time of each computation operator and the communication time of each communication operator in the backward phase of the operator stream of each training round through simulation, to obtain the backward phase time; determining the computation time of each computation operator and the communication time of each communication operator in the recomputation phase of the operator stream of each training round through simulation, to obtain the recomputation phase time; determining the duration of each training round based on the forward phase time, backward phase time, recomputation phase time, and idle waiting time in each training round; and determining the training duration of the model to be trained based on the duration of each training round and the number of training rounds.
[0051] Thirdly, this application also provides a model training time estimation device, which includes: a processor and a memory communicatively connected to the processor;
[0052] The memory stores computer-executed instructions;
[0053] The processor executes computer execution instructions stored in the memory to implement the method described in the first aspect above.
[0054] Fourthly, this application also provides a computer-readable storage medium comprising a program that, when executed on a device, causes the device to perform the method as described in any one of the first aspects above.
[0055] Fifthly, this application also provides a computer program product, the computer program product comprising a computer program that, when executed by a processor, implements the method described in the first aspect above.
Description of the Drawings
[0056] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0057] Figure 1 is a flowchart of a model training duration estimation method provided in an embodiment of this application;
[0058] Figure 2 is a schematic diagram of an attention head structure provided in an embodiment of this application;
[0059] Figure 3 is a schematic diagram of the computation operator stream corresponding to a Transformer architecture provided in an embodiment of this application;
[0060] Figure 4 is a schematic diagram of the computation operators and communication operators of a training stage provided in an embodiment of this application;
[0061] Figure 5 is a schematic diagram of determining the computation operators and communication operators under a TP strategy provided in an embodiment of this application;
[0062] Figure 6 is a schematic diagram of the computation operators and communication operators of a GPU card provided in an embodiment of this application;
[0063] Figure 7 is a trend graph between communication volume and effective bandwidth provided in an embodiment of this application;
[0064] Figure 8 is a schematic diagram of the interaction among the computation modeling, communication modeling, and process modeling modules provided in an embodiment of this application;
[0065] Figure 9 is a first schematic diagram of the model training time prediction device provided in an embodiment of this application;
[0066] Figure 10 is a second schematic diagram of the model training time prediction device provided in an embodiment of this application.
Detailed Implementation
[0067] To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0068] The application scenarios described in this application are for the purpose of more clearly illustrating the technical solutions of this application, and do not constitute a limitation on the technical solutions provided in this application. Those skilled in the art will understand that with the emergence of new application scenarios, the technical solutions provided in this application are also applicable to similar technical problems. In the description of this application, unless otherwise stated, "multiple" means two or more.
[0069] Current methods for estimating model training time are mainly for predicting the training time of computer vision (CV) models. CV models are mainly used in server scenarios, such as autonomous driving, medical image analysis, retail management, security monitoring, smart homes, industrial automation and robot navigation, and agricultural inspection and management.
[0070] Existing methods for estimating model training time suffer from two main drawbacks: first, they rely on actually training the model on GPUs, which are expensive, and recording the training time, which increases the cost of estimating training time; second, they only support a single parallel strategy, resulting in limited applicability and poor flexibility. To address these shortcomings, this application proposes the model training time estimation method shown in Figure 1. The method can be executed by a server, a chip within the server, or a functional module within the server; this application does not limit the specific implementation. The following explanation uses a server as the executing entity. The method includes:
[0071] Step 101: Based on the model architecture of the model to be trained, determine the operator flow of the model to be trained. The operator flow includes multiple computation operators; each computation operator has its own computational amount indicated by its parameter configuration.
[0072] For example, in deep learning, model and model architecture are two different concepts. A model usually refers to a specific network instance, that is, a network trained on a specific dataset with specific weights and parameters, such as the Generative Pre-trained Transformer (GPT) developed and trained by OpenAI, a US artificial intelligence research organization. GPT models can generate text that closely resembles human conversation and are widely used in fields such as intelligent customer service, content generation, and coding. Model architecture typically refers to the overall structure of the model, including the number of layers, the type of each layer, and parameter settings. The types of layers include convolutional layers, pooling layers, and fully connected layers. For example, the Transformer architecture is a common model architecture. It is a neural network architecture based on an attention mechanism, generally divided into four parts: the input part, the encoder, the decoder, and the output part. The encoder and the decoder are each composed of several layers that share the same structure but have different parameters. The encoder is mainly responsible for converting the input sequence into a fixed-length vector representation, and the decoder decodes this vector into an output sequence. Furthermore, because the Transformer architecture introduces a self-attention mechanism, which allows the model to focus on different positions in the input sequence when processing sequential data, it enables parallel processing of the input sequence, accelerating the model training process.
[0073] This application can predict the training time of mainstream GPT-series and Large Language and Vision Assistant (LLaVA) series models. The LLaVA series models replicate some of the image dialogue capabilities of OpenAI's models, and their key feature is that they improve upon other open-source solutions while using a simpler model architecture and less training data. This makes the LLaVA series models faster, cheaper, and more suitable for inference on consumer-grade hardware. The following explanation uses GPT-3 as an example to illustrate the model training time prediction method proposed in this application.
[0074] For example, GPT-3 uses the classic Transformer architecture and introduces a multi-head self-attention mechanism. The largest version of GPT-3 has 96 layers, each with 12,288 hidden units and 96 attention heads. In the multi-head self-attention mechanism, the input sequence undergoes different linear transformations through the weight matrices W_Q, W_K, and W_V to obtain the Q, K, and V vectors, as shown in Figure 2. The dashed box in Figure 2 shows a schematic diagram of one attention head structure. GPT-3 has 96 such attention head structures. The outputs of these 96 attention heads are concatenated to form the final attention output vector.
[0075] For example, the parameters of the model to be trained are shown in Table 1, which mainly include vocabulary length, decoder sequence length, hidden layer size, feed-forward network (FFN) hidden layer size, and number of attention heads.
[0076]
[0077]
[0078] Table 1
[0079] For example, an operator flow refers to a process consisting of a series of operators during data processing and analysis. The core idea of an operator flow is to decompose data processing tasks into a series of independent operation units. Each operator performs a specific data processing task and then passes the result to the next operator, ultimately completing the entire data processing flow. In the field of deep learning, a model architecture can be abstracted into a computation operator stream; once the model architecture is determined, the corresponding computation operator stream can be determined. For example, Figure 3 shows the computation operator stream corresponding to the Transformer architecture used by GPT-3. The computation operator stream includes multiple computation operators and indicates the execution order among them. The computation operator stream shown in Figure 3 includes the following operators: Layer Norm, Wqkv, QK, softmax, QKV, FC, Res add, Layer Norm, h->4h, gelu, 4h->h, and Res add. The execution order of these operators follows the direction of the arrows in Figure 3. The computation operators included in Figure 3 are shown in Table 2:
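[0080]
| Operator | input0 (rows, cols) | input1 (rows, cols) | weight (rows, cols) | output (rows, cols) |
| --- | --- | --- | --- | --- |
| Layer Norm | s, h | - | h, 2 | s, h |
| Wqkv | s, h | - | h, 3h | s, 3h |
| QK | s, a | a, s | - | s, s |
| softmax | s, s | - | - | s, s |
| QKV | s, s | s, a | - | s, a |
| FC | s, h | - | h, h | s, h |
| Res add | s, h | s, h | - | s, h |
| Layer Norm | s, h | - | h, 2 | s, h |
| h->4h | s, h | - | h, 4h | s, 4h |
| gelu | s, 4h | - | 4h, h | s, h |
| 4h->h | s, 4h | - | 4h, h | s, h |
| Res add | s, h | s, h | - | s, h |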
[0081] Table 2
[0082] In Table 2, `input0` gives the number of rows and columns of one input matrix of the computation operator, and the other input matrix is either `input1` or `weight`. `output` gives the number of rows and columns of the output matrix, which is the computation result of the computation operator. Specifically, `input0` and `input1` undergo matrix addition or multiplication, or `input0` and `weight` undergo matrix addition or multiplication, and the computation result output by the computation operator is `output`. Taking the computation operator QK in Table 2 as an example, its input matrices are `input0` and `input1`. The input matrix corresponding to `input0` has `s` rows and `a` columns, and the input matrix corresponding to `input1` has `a` rows and `s` columns. The output matrix obtained after performing matrix multiplication on `input0` and `input1` is `output`, which has `s` rows and `s` columns. Table 2 thus shows the input matrices of each computation operator, the operation that the computation operator applies to them, and the computation result of the computation operator.
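To show how the shapes in Table 2 could translate into a computational amount, the following Python sketch derives the output shape and a FLOP count for a matrix-multiplication operator. The 2·m·k·n FLOP convention, the sequence length s = 2048, and the reading of `a` as a per-head dimension of 12288 / 96 = 128 are assumptions for illustration; the application does not state these values.

```python
def matmul_shape_and_flops(lhs: tuple[int, int], rhs: tuple[int, int]) -> tuple[tuple[int, int], int]:
    """Return the output shape and FLOP count of an (m, k) x (k, n) matrix product.

    Assumption: one multiply and one add per inner-product term, i.e. 2*m*k*n FLOPs.
    """
    (m, k), (k2, n) = lhs, rhs
    if k != k2:
        raise ValueError(f"inner dimensions differ: {k} vs {k2}")
    return (m, n), 2 * m * k * n


# Assumed example values (not from the application).
s, h = 2048, 12288
a = h // 96  # assumed per-head dimension: 128

# Row "QK" of Table 2: input0 is (s, a), input1 is (a, s), output is (s, s).
qk_shape, qk_flops = matmul_shape_and_flops((s, a), (a, s))

# Row "h->4h" of Table 2: input0 is (s, h), weight is (h, 4h), output is (s, 4h).
ffn_shape, ffn_flops = matmul_shape_and_flops((s, h), (h, 4 * h))
```

Under these assumptions, QK for a single head produces an (s, s) output, matching Table 2, and the h->4h operator is several orders of magnitude more expensive than one QK product.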
[0083] The following combination Figure 3 The calculation operators in Table 2 are explained below:
[0084] ① The Layer Norm operator is used to normalize the input data. The result of the Layer Norm operator is one of the inputs of the Wqkv operator.
[0085] ② The computation operator Wqkv refers to the entirety of the three operators Wq, Wk, and Wv;
[0086] ③ The inputs to the computation operator QK are Wq and Wk. In the computation operator QK, Wq and Wk are multiplied, and the result is the input to the computation operator softmax.
[0087] ④ The softmax operator is used to convert the calculation result of the input operator QK into a normalized probability distribution and input it into the operator QKV;
[0088] ⑤ One of the inputs to the computation operator QKV is the output of the computation operator softmax, and the other input is Wv. The computation operator QKV performs a multiplication operation on the two input matrices, and the result is one of the inputs to the computation operator FC.
[0089] ⑥ Another input to the computation operator FC is the weight matrix Wo. According to Table 2, the number of rows and columns of the weight matrix Wo is h. The computation operator FC performs a multiplication calculation on the two input matrices, and the result is one of the inputs to the computation operator Res add.
[0090] ⑦ The other input to the computation operator Res add is the initial input of the computation operator stream shown in Figure 3. The computation operator Res add performs an addition operation on the two input matrices, and the result is one of the inputs of the computation operator Layer Norm.
[0091] ⑧ The input matrices of the computation operator h->4h are the output of the computation operator Layer Norm and the weight matrix Wa. According to Table 2, the number of rows and columns of Wa are h and 4h, respectively. The computation operator h->4h performs a multiplication operation on these two input matrices, and the result is the input of the computation operator gelu.
[0092] ⑨ The computation operator gelu includes a Gaussian-based activation function to weight the input, and the result is one of the inputs to the computation operator 4h->h;
[0093] ⑩ The other input to the computation operator 4h->h is the weight matrix Wb. According to Table 2, the number of rows and columns of Wb are 4h and h, respectively. The computation operator 4h->h performs a multiplication operation on the two input matrices, and the result is one of the inputs to the computation operator Res add.
[0094] Assuming the vocabulary length (V) of the model to be trained is 51200, the decoder sequence length (s) is 2048, the hidden layer size (h) is 12288, and the number of multi-heads (a) is 128, then Table 2 can be converted into Table 3:
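[0095]

| Operator | input0 (rows × columns) | input1 (rows × columns) | weight (rows × columns) | output (rows × columns) |
| --- | --- | --- | --- | --- |
| Layer Norm | 2048 × 12288 | | 12288 × 2 | 2048 × 12288 |
| Wqkv | 2048 × 12288 | | 12288 × 18432 | 2048 × 18432 |
| QK | 2048 × 128 | 128 × 2048 | | 2048 × 2048 |
| softmax | 2048 × 2048 | | | 2048 × 2048 |
| QKV | 2048 × 2048 | 2048 × 128 | | 2048 × 128 |
| FC | 2048 × 6144 | | 6144 × 12288 | 2048 × 12288 |
| Res add | 2048 × 12288 | 2048 × 12288 | | 2048 × 12288 |
| Layer Norm | 2048 × 12288 | | 12288 × 2 | 2048 × 12288 |
| h->4h | 2048 × 12288 | | 12288 × 24576 | 2048 × 24576 |
| gelu | 2048 × 24576 | | | 2048 × 24576 |
| 4h->h | 2048 × 24576 | | 24576 × 12288 | 2048 × 12288 |
| Res add | 2048 × 12288 | 2048 × 12288 | | 2048 × 12288 |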
[0096] Table 3
[0097] As can be seen from the above, this application only needs to obtain the model architecture and model parameters of the model to be trained to determine the multiple computational operators included in the operator stream of the model to be trained and the execution order among the multiple computational operators. It should be noted that the computational amount corresponding to each computational operator can be determined according to the model parameters in step 101, or the data amount corresponding to each computational operator can be determined according to the model parameters when determining the computation time of each computational operator in subsequent step 103. This application does not limit this.
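As a minimal sketch of how an operator stream such as the one in Table 2 could be represented in code, the Python below lists the operators in execution order and resolves their shapes from `s`, `h`, and `a`. The `Operator` class, the `build_stream` function, and the operation counts (2·m·k·n for a matrix product, one operation per output element otherwise) are illustrative assumptions and are not specified by this application. The sizes passed in are toy values chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Shape = Tuple[int, int]  # (rows, columns)


@dataclass
class Operator:
    name: str
    input0: Shape
    other: Optional[Shape]  # input1 or weight; None for single-input operators
    output: Shape
    is_matmul: bool = False

    def flops(self) -> int:
        # Assumption: multiplying (m x k) by (k x n) costs 2*m*k*n operations;
        # every other operator is counted as one operation per output element.
        if self.is_matmul:
            m, k = self.input0
            n = self.output[1]
            return 2 * m * k * n
        return self.output[0] * self.output[1]


def build_stream(s: int, h: int, a: int) -> List[Operator]:
    """Return the operators in execution order with shapes resolved from s, h, a."""
    return [
        Operator("Layer Norm", (s, h), (h, 2), (s, h)),
        Operator("Wqkv", (s, h), (h, 3 * h), (s, 3 * h), True),
        Operator("QK", (s, a), (a, s), (s, s), True),
        Operator("softmax", (s, s), None, (s, s)),
        Operator("QKV", (s, s), (s, a), (s, a), True),
        Operator("FC", (s, h), (h, h), (s, h), True),
        Operator("Res add", (s, h), (s, h), (s, h)),
        Operator("Layer Norm", (s, h), (h, 2), (s, h)),
        Operator("h->4h", (s, h), (h, 4 * h), (s, 4 * h), True),
        # gelu is element-wise, so its output shape equals its input shape.
        Operator("gelu", (s, 4 * h), None, (s, 4 * h)),
        Operator("4h->h", (s, 4 * h), (4 * h, h), (s, h), True),
        Operator("Res add", (s, h), (s, h), (s, h)),
    ]


stream = build_stream(s=8, h=16, a=4)  # toy sizes, not the GPT-3 values
for op in stream:
    print(f"{op.name:10s} output={op.output} flops={op.flops()}")
```

Keeping the stream as an ordered list means later steps can insert communication operators between entries without changing how the computation operators are described.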
[0098] Step 102: Based on the parallel strategy of the model to be trained, determine multiple communication operators located in the operator stream; each communication operator has its own communication volume indicated by its parameter configuration.
[0099] For example, this application supports training the model to be trained with a variety of mainstream parallel strategies, such as pipeline parallelism (PP), tensor parallelism (TP), data parallelism (DP), sequence parallelism, context parallelism, and mixture-of-experts (MoE) parallelism. The PP strategy divides the model by layer into multiple stages, each handled by a different device. The TP strategy partitions the model within a layer, splitting the input and parameter matrices into blocks that are computed separately. The DP strategy divides the sample data into multiple batches and assigns each batch to a different device for training; each device holds a complete copy of the model, computes the loss and gradients locally, and updates the model parameters by synchronizing the gradients across all devices. The sequence parallel strategy is mainly used for long sequences: it splits a long sequence into multiple subsequences and processes them on different devices simultaneously. The context parallel strategy can be seen as an enhancement of the sequence parallel strategy: it splits all inputs and all output activations along the sequence dimension and computes them on different devices, improving the training efficiency of context-dependent tasks. The MoE parallel strategy splits the model into multiple experts, each responsible for processing a portion of the input data, thereby making efficient use of computing resources during model training.
[0100] Step 101 determines multiple computation operators in the operator stream. The following uses a combination of PP, TP, and DP strategies as an example to illustrate the determination of multiple communication operators in the operator stream.
[0101] For example, the training stages of the model to be trained are determined first. Each training stage contains at least one GPU card that executes computation operators. Specifically, based on the number of model layers and the pipeline parallelism PP degree (PP parallelism) in the parallel strategy, the computation operators corresponding to each training stage and the PP communication operators between training stages are determined. For instance, the number of training stages is obtained by dividing the number of model layers by the PP parallelism: assuming the model has 96 layers and the PP parallelism is 6, the model is divided into 16 training stages. The computation operators corresponding to any training stage are the computation operators of the model layers that belong to that stage. For example, assume the first training stage corresponds to layers 1-6, the second training stage to layers 7-12, ..., and the 16th training stage to layers 91-96. Since the computation operators of each layer were determined in step 101, the computation operators of the first training stage are those of layers 1-6, the computation operators of the second training stage are those of layers 7-12, and so on, up to the 16th training stage, whose computation operators are those of layers 91-96. After a training stage finishes executing its computation operators, it transfers the execution result to the next training stage. For example, Figure 4 shows 16 training stages, where each rectangle represents the computation operators of one training stage. After the first training stage finishes executing its computation operators, the computation result is transmitted to the second training stage; the communication operator between the first and second training stages is shown as the circle between them.
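The stage division described above can be sketched as follows. This is an illustrative sketch only: the function name `build_pipeline` and its return format are hypothetical, and it assumes the layers divide evenly. It follows the rule stated in the text (number of training stages = number of model layers / PP parallelism) and places one PP communication operator between each pair of adjacent stages.

```python
from typing import List, Tuple


def build_pipeline(num_layers: int, pp_degree: int) -> Tuple[List[Tuple[int, int]], List[Tuple[int, int]]]:
    """Split layers into training stages and list the PP communication operators.

    Returns (stages, pp_comm), where each stage is a (first_layer, last_layer)
    pair using 1-based layer numbers, and each PP communication operator is a
    (sending_stage, receiving_stage) pair using 1-based stage numbers.
    """
    if num_layers % pp_degree != 0:
        raise ValueError("this sketch assumes the layers divide evenly")
    num_stages = num_layers // pp_degree          # rule stated in the text
    layers_per_stage = num_layers // num_stages
    stages = []
    for i in range(num_stages):
        first = i * layers_per_stage + 1
        last = (i + 1) * layers_per_stage
        stages.append((first, last))
    # One PP communication operator between each pair of adjacent stages.
    pp_comm = [(i + 1, i + 2) for i in range(num_stages - 1)]
    return stages, pp_comm


stages, pp_comm = build_pipeline(num_layers=96, pp_degree=6)
print(len(stages), stages[0], stages[1], stages[-1])
print(len(pp_comm), pp_comm[0])
```

With 96 layers and a PP parallelism of 6 this reproduces the 16 stages of the example, with the first stage covering layers 1-6 and the last covering layers 91-96.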
[0102] For example, the division of computational operators for each GPU card in each training phase is based on the TP strategy and the DP strategy. Specifically, based on the tensor parallelism TP degree (TP parallelism) in the parallel strategy, the computational operators executed by each GPU card in the training phase and the TP communication operators between GPU cards are determined. Based on the data parallelism DP degree (DP parallelism) in the parallel strategy, the DP communication operators corresponding to the training phase are determined.
[0103] Take as an example the determination, based on TP parallelism, of the computation operators executed by each GPU card and the TP communication operators between GPU cards. Assuming the TP parallelism is 2, Figure 5 shows the computing tasks assigned to the two GPU cards. Taking the dashed line in Figure 5 as the boundary, the portion above the dashed line represents the computing tasks assigned to one GPU card (denoted GPU1), and the portion below it represents the computing tasks assigned to the other GPU card (denoted GPU2). The computation result of GPU1 is Z1 in Figure 5, and the computation result of GPU2 is Z2 in Figure 5. At g in Figure 5, the two computation results are aggregated using All-Reduce to obtain the final output Z. The transmission between Z1 and Z2 is the TP communication operator between GPU1 and GPU2. All-Reduce is a communication algorithm commonly used in parallel and distributed computing, mainly to aggregate data among multiple computing nodes (e.g., GPUs or CPUs) to improve model training efficiency.
[0104] Figure 4 shows the computation operators of each training stage and the communication operators between training stages. For the computation operators executed by each GPU card and the communication operators between GPU cards, refer to Figure 6, in which the rectangles represent the computation operators executed by the GPU cards and the circles represent the communication operators between GPU cards.
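To make the aggregation of Z1 and Z2 concrete, the following plain-Python sketch splits one matrix product across two simulated GPU cards and sums the partial results. The particular split (columns of the input and rows of the weight) is an assumption made for illustration and may differ from the partitioning drawn in Figure 5. The helper `all_reduce_sum` is a hypothetical stand-in for All-Reduce, and the matrix values are arbitrary.

```python
def matmul(a, b):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def all_reduce_sum(partials):
    """Element-wise sum of per-GPU partial results (stand-in for All-Reduce)."""
    rows, cols = len(partials[0]), len(partials[0][0])
    return [[sum(p[r][c] for p in partials) for c in range(cols)] for r in range(rows)]


# Input X (2 x 4) and weight W (4 x 2); the values are arbitrary.
X = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
W = [[1, 0],
     [0, 1],
     [1, 1],
     [2, 0]]

# TP parallelism of 2: GPU1 takes the first two columns of X and the first two
# rows of W; GPU2 takes the remaining columns and rows.
X1 = [row[:2] for row in X]
X2 = [row[2:] for row in X]
W1, W2 = W[:2], W[2:]

Z1 = matmul(X1, W1)           # partial result computed on GPU1
Z2 = matmul(X2, W2)           # partial result computed on GPU2
Z = all_reduce_sum([Z1, Z2])  # aggregated output

print(Z1, Z2, Z)
```

In this sketch the data exchanged by the TP communication operator is one matrix of the size of Z per GPU card, which is the kind of quantity a communication operator's parameter configuration would need to record.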
[0105] Step 103: Based on the hardware configuration of the cluster running the model to be trained and the amount of sample data of the model to be trained, the computation time of each computation operator in the operator stream and the communication time of each communication operator in each training round are determined by simulation, so as to obtain the training time of the model to be trained.
[0106] For example, before determining the computation time of each computation operator and the communication time of each communication operator in each training round through simulation, it is necessary to first determine the allocation of GPU cards in the cluster during the model training process based on the parallel strategy of the model to be trained and the hardware configuration of the cluster running the model to be trained; and to determine the training rounds of the model training based on the parallel strategy of the model to be trained and the amount of sample data of the model to be trained; wherein, the hardware configuration of the cluster running the model to be trained includes the number of GPU cards, the hardware computing power of the GPU cards, the memory bandwidth of the GPU cards, the layer 2 bandwidth of the GPU cards, etc.
[0107] For example, for any computational operator in the operator stream, this application determines the computation time of the computational operator based on the computational amount of the computational operator and the hardware configuration of the GPU card executing the computational operator. Furthermore, this application classifies computational operators into computationally intensive operators and memory-intensive operators according to computational intensity, and designs different methods for determining the computation time of these two types of computational operators:
[0108] ① If the computation operator is a computationally intensive operator, then the maximum of the memory access time and the computation time of the computation operator is taken as the computation time of the computation operator, that is, computation time = max(memory access time, computation time); where, the memory access time is determined based on the amount of data to be computed by the computation operator and the Layer 2 bandwidth of the GPU card executing the computation operator. The Layer 2 bandwidth of the GPU card can also be called L2 bandwidth. For example, memory access time = amount of data to be computed / L2 bandwidth; the computation time is determined based on the amount of data to be computed by the computation operator and the hardware computing power of the GPU card executing the computation operator. For example, computation time = amount of data to be computed / hardware computing power of the GPU.
[0109] ② If the computation operator is a memory-intensive operator, then the memory access time of the computation operator is determined as the computation time of the computation operator.
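The two cases above can be written as a short function. This is a sketch under stated assumptions: the `GpuSpec` class, the function name `operator_time`, and the units are hypothetical, while the formulas follow the examples in the text (memory access time = amount of data to be computed / L2 bandwidth; computation time = amount of data to be computed / hardware computing power).

```python
from dataclasses import dataclass


@dataclass
class GpuSpec:
    compute_power: float  # amount of data processed per second (hypothetical unit)
    l2_bandwidth: float   # amount of data moved per second (hypothetical unit)


def operator_time(data_amount: float, gpu: GpuSpec, compute_intensive: bool) -> float:
    """Estimate the time of one computation operator on one GPU card."""
    memory_access_time = data_amount / gpu.l2_bandwidth
    if not compute_intensive:
        # Memory-intensive operator: the memory access time is used directly.
        return memory_access_time
    compute_time = data_amount / gpu.compute_power
    # Computationally intensive operator: the slower of the two dominates.
    return max(memory_access_time, compute_time)


gpu = GpuSpec(compute_power=100.0, l2_bandwidth=400.0)  # made-up figures
print(operator_time(800.0, gpu, compute_intensive=True))
print(operator_time(800.0, gpu, compute_intensive=False))
```

Taking the maximum reflects that an operator cannot finish before both its data movement and its arithmetic are done, so whichever is slower sets the estimate.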
[0110] For example, for any communication operator in the operator stream, this application determines the communication time of the communication operator based on the transmission bandwidth between the communication operator and the communication object and the communication volume of the communication operator. Furthermore, this application also considers whether the communication objects are located on the same server.
[0111] ① If the communication objects are located on the same server, first determine the effective bandwidth corresponding to the communication operator from the trend graph relating communication volume to effective bandwidth, and then determine the communication time of the communication operator from its communication volume and that effective bandwidth. For example, communication time = communication volume / effective bandwidth. The trend graph is shown in Figure 7: different communication volumes correspond to different effective bandwidths, and once the communication volume reaches a certain value, the effective bandwidth remains essentially stable and no longer increases. Determining the effective bandwidth from the communication volume indicated by the communication operator improves the accuracy of the effective bandwidth and therefore of the communication time.
[0112] ② If the communication objects are located on different servers, the communication time of the communication operator is determined based on the communication volume of the communication operator, the inter-server communication bandwidth, and the transmission delay of the switch. Since the transmission of the communication operator usually needs to rely on the switch when the communication objects are located on different servers, the transmission delay of the switch should also be taken into account when calculating the communication time. For example, communication time = communication volume / inter-server communication bandwidth + transmission delay of the switch.
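A minimal sketch of the two communication cases follows. The sample points in `CURVE` are invented placeholders for a trend graph relating communication volume to effective bandwidth, and the linear interpolation between points is an assumption; the text only states that the effective bandwidth is read from the trend graph. The function names and units are likewise hypothetical.

```python
from bisect import bisect_left

# Hypothetical (communication volume in bytes, effective bandwidth in bytes/s)
# samples; the bandwidth flattens out once the volume is large enough.
CURVE = [(1e3, 1e9), (1e6, 2e10), (1e8, 1e11), (1e9, 1e11)]


def effective_bandwidth(volume: float, curve=CURVE) -> float:
    """Look up the effective bandwidth for a volume by linear interpolation."""
    xs = [v for v, _ in curve]
    if volume <= xs[0]:
        return curve[0][1]
    if volume >= xs[-1]:
        return curve[-1][1]
    i = bisect_left(xs, volume)
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return y0 + (y1 - y0) * (volume - x0) / (x1 - x0)


def communication_time(volume: float, same_server: bool,
                       inter_server_bandwidth: float = 0.0,
                       switch_delay: float = 0.0) -> float:
    """Estimate the time of one communication operator in seconds."""
    if same_server:
        return volume / effective_bandwidth(volume)
    if inter_server_bandwidth <= 0:
        raise ValueError("inter_server_bandwidth is required across servers")
    return volume / inter_server_bandwidth + switch_delay


print(communication_time(1e9, same_server=True))
print(communication_time(1e9, same_server=False,
                         inter_server_bandwidth=5e10, switch_delay=1e-6))
```

Keeping the curve as data rather than a formula lets measured points for a particular interconnect be swapped in without changing the estimation logic.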
[0113] Further exemplarily, based on the methods for determining computation time and communication time described above, the computation time of each computation operator and the communication time of each communication operator in the operator stream during the forward phase of each training round are determined by simulation to obtain the forward phase time; the computation time of each computation operator and the communication time of each communication operator in the operator stream during the backward phase of each training round are determined by simulation to obtain the backward phase time; and the computation time of each computation operator and the communication time of each communication operator in the operator stream during the recomputation phase of each training round are determined by simulation to obtain the recomputation phase time.
[0114] The forward propagation phase (also called the forward pass) and the backpropagation phase (also called the backward pass) are the two core processes of model training. The forward propagation phase refers to the process of computing and passing data layer by layer through the model, starting from the input layer, until the output result of the model is obtained. In forward propagation, the input data undergoes a linear transformation through the weight matrix or bias matrix of each layer, then a nonlinear transformation through the activation function, and is output to the next layer until the output layer is reached. The forward propagation phase converts the input data into the output result, realizing classification and prediction of the data.
[0115] The backpropagation phase refers to calculating the gradient backwards based on the difference between the model's predictions and the true labels using the chain rule. This gradient is then propagated from the output layer back to every layer of the network to update the model's parameters. During backpropagation, the error in the output layer is first calculated, then propagated from the output layer to the hidden layers, then to shallower hidden layers, and finally to the input layer. Through backpropagation, gradient information about each parameter with respect to the loss function can be obtained, thereby optimizing and updating the parameters. The forward and backpropagation phases work together, iterating multiple times until the model's performance reaches the expected level.
[0116] Recomputation refers to performing two forward propagations in each training round. Although it increases the time spent in each training round, it significantly reduces the GPU memory usage. In layman's terms, recomputation is a technique that trades time for space.
[0117] After determining the forward phase time, this application does not simply reuse the forward phase time as the backward phase time and the recomputation phase time. Instead, it determines the backward phase time and the recomputation phase time separately, thereby improving their accuracy.
[0118] For example, the time consumption of each training round is determined based on the forward phase time, the backward phase time, the recomputation phase time, and the idle waiting time in each training round; for example, these four values are summed, and the sum is used as the time consumption of each training round.
[0119] For example, the training duration of the model to be trained is finally determined based on the time taken for each training round and the number of training rounds; for example, assuming that the time taken for each training round is t and the number of training rounds is n, then t×n can be used as the training duration of the model to be trained.
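The per-round sum and the final t×n product can be sketched as below. The function names and the numeric values are hypothetical. The helper `phase_time` additionally assumes that the operators within a phase run one after another, so that their times simply add up; the text does not state how the per-operator times are combined inside a phase.

```python
def phase_time(compute_times, comm_times):
    # Assumption: operators in a phase run serially, so their times add up.
    return sum(compute_times) + sum(comm_times)


def round_time(forward, backward, recompute, idle):
    """Time of one training round: sum of the three phase times and idle waiting."""
    return forward + backward + recompute + idle


def training_duration(time_per_round, num_rounds):
    """Training duration: t x n."""
    return time_per_round * num_rounds


# Made-up per-operator times in seconds.
forward = phase_time([0.25, 0.5], [0.25])
backward = phase_time([0.5, 1.0], [0.5])
recompute = phase_time([0.25, 0.5], [0.25])

t = round_time(forward, backward, recompute, idle=0.5)
total = training_duration(t, num_rounds=1000)
print(t, total)
```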
[0120] Furthermore, the server executing the model training time prediction method proposed in this application includes a computational modeling module, a communication modeling module, and a process modeling module. The interactions between these modules are shown in Figure 8: the computational modeling module performs step 101, the communication modeling module performs step 102, and the process modeling module performs step 103. The process modeling module further includes a computational simulation module and a communication simulation module. The computational simulation module determines the computation time of the computation operators through simulation, and the communication simulation module determines the communication time of the communication operators through simulation. For a more detailed description of each module in the server, refer to the relevant descriptions in the method embodiment shown in Figure 1, which are not repeated here.
[0121] Figure 9 and Figure 10 are schematic diagrams of possible structures of the model training time prediction device provided in the embodiments of this application. These model training time prediction devices can be used to implement the server functionality described in the above method embodiments, and thus also achieve the beneficial effects of the above method embodiments.
[0122] As shown in Figure 9, the model training time prediction device 900 includes a determining unit 910 and a simulation unit 920. The model training time prediction device 900 is used to implement the functions of the server in the method embodiment shown in Figure 1, as follows:
[0123] The determining unit 910 is used to determine the operator stream of the model to be trained based on the model architecture of the model to be trained. The operator stream includes multiple computation operators. Each computation operator has a computation amount indicated by its own parameter configuration.
[0124] The determining unit 910 is further configured to determine multiple communication operators located in the operator stream based on the parallel strategy of the model to be trained; each communication operator has a communication volume indicated by its own parameter configuration.
[0125] The simulation unit 920 is used to determine the computation time of each computation operator and the communication time of each communication operator in the operator stream in each training round by means of simulation based on the hardware configuration of the cluster running the model to be trained and the amount of sample data of the model to be trained, thereby obtaining the training time of the model to be trained.
[0126] In one possible design, the determining unit 910 is configured to, when determining multiple communication operators located in the operator stream based on the parallel strategy of the model to be trained: determine the computation operators corresponding to multiple training stages and the PP communication operators between training stages based on the number of model layers of the model to be trained and the pipeline parallelism (PP) degree in the parallel strategy; the computation operator corresponding to any training stage is the computation operator of the model layer to which it belongs; and / or, determine the DP communication operator corresponding to the training stage based on the data parallelism (DP) degree in the parallel strategy; and / or, determine the computation operators executed by each GPU card in the training stage and the TP communication operators between GPU cards based on the tensor parallelism (TP) degree in the parallel strategy.
[0127] In one possible design, before determining the computation time of each computation operator and the communication time of each communication operator in the operator stream in each training round through simulation, the determining unit 910 is further configured to determine the allocation of GPU cards in the cluster during the model training process based on the parallel strategy of the model to be trained and the hardware configuration of the cluster running the model to be trained; and to determine the training rounds of the model training based on the parallel strategy of the model to be trained and the amount of sample data of the model to be trained.
[0128] In one possible design, when determining by simulation the computation time of each computation operator and the communication time of each communication operator in the operator stream in each training round, the simulation unit 920 is configured to: for any computation operator in the operator stream, determine the computation time of the computation operator according to the computational amount of the computation operator and the hardware configuration of the GPU card executing the computation operator; and for any communication operator in the operator stream, determine the communication time of the communication operator according to the transmission bandwidth between the communication operator and its communication object and the communication volume of the communication operator.
[0129] In one possible design, when determining the computation time of the computation operator based on the amount of data to be computed by the computation operator and the hardware configuration of the GPU card executing the computation operator, the simulation unit 920 is configured to: if the computation operator is a computationally intensive operator, take the larger of the memory access time and the computation time of the computation operator as the computation time of the computation operator, where the memory access time is determined based on the amount of data to be computed by the computation operator and the Layer 2 bandwidth of the GPU card executing the computation operator, and the computation time is determined based on the amount of data to be computed by the computation operator and the hardware computing power of that GPU card; and if the computation operator is a memory-intensive operator, take the memory access time of the computation operator as the computation time of the computation operator.
[0130] In one possible design, when determining the communication time of the communication operator based on the transmission bandwidth between the communication operator and its communication object and the communication volume of the communication operator, the simulation unit 920 is configured to: if the communication objects are located on the same server, determine the effective bandwidth corresponding to the communication operator based on the trend graph between communication volume and effective bandwidth, and determine the communication time of the communication operator based on the communication volume of the communication operator and the effective bandwidth corresponding to the communication operator.
[0131] In one possible design, when determining the communication time of the communication operator based on the transmission bandwidth between the communication operator and its communication object and the communication volume of the communication operator, the simulation unit 920 is configured to: if the communication objects are located on different servers, determine the communication time of the communication operator based on the communication volume of the communication operator, the inter-server communication bandwidth, and the transmission delay of the switch.
[0132] In one possible design, when determining by simulation the computation time of each computation operator and the communication time of each communication operator in the operator stream in each training round to obtain the training duration of the model to be trained, the simulation unit 920 is configured to: determine by simulation the computation time of each computation operator and the communication time of each communication operator in the operator stream during the forward phase of each training round, to obtain the forward phase time; determine by simulation the computation time of each computation operator and the communication time of each communication operator in the operator stream during the backward phase of each training round, to obtain the backward phase time; determine by simulation the computation time of each computation operator and the communication time of each communication operator in the operator stream during the recomputation phase of each training round, to obtain the recomputation phase time; determine the duration of each training round based on the forward phase time, the backward phase time, the recomputation phase time, and the idle waiting time in each training round; and determine the training duration of the model to be trained based on the duration of each training round and the number of training rounds.
[0133] For a more detailed description of the determining unit 910 and the simulation unit 920, refer to the relevant descriptions in the method embodiment shown in Figure 1, which are not repeated here.
[0134] As shown in Figure 10, the model training time prediction device 1000 includes a processor 1010 and an interface circuit 1020. The processor 1010 and the interface circuit 1020 are coupled to each other. It is understood that the interface circuit 1020 can be a transceiver or an input / output interface. Optionally, the model training time prediction device 1000 may also include a memory 1030 for storing instructions executed by the processor 1010, input data required by the processor 1010 to run instructions, or data generated after the processor 1010 runs instructions.
[0135] When the model training time prediction device 1000 is used to implement the method shown in Figure 1, the processor 1010 is used to implement the function of the determining unit 910, and the interface circuit 1020 is used to implement the function of the simulation unit 920.
[0136] The unit division in this embodiment is illustrative and represents only one logical functional division. In actual implementation, other division methods may be used. Furthermore, the functional units in the various embodiments of this application can be integrated into a single processor, exist as separate physical units, or be integrated into a single unit. The integrated units described above can be implemented in hardware or as software functional units.
[0137] Although preferred embodiments of this application have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of this application.
[0138] Obviously, those skilled in the art can make various modifications and variations to this application without departing from the scope of this application. Therefore, if such modifications and variations fall within the scope of the claims of this application and their equivalents, this application also intends to include such modifications and variations.
Claims
1. A method for predicting model training time, characterized in that the method includes:
based on the model architecture of the model to be trained, determining the operator stream of the model to be trained, the operator stream including multiple computation operators, each computation operator having a computational amount indicated by its own parameter configuration;
based on the parallel strategy of the model to be trained, determining multiple communication operators located in the operator stream, each communication operator having a communication volume indicated by its own parameter configuration; and
based on the hardware configuration of the cluster running the model to be trained and the amount of sample data of the model to be trained, determining by simulation the computation time of each computation operator and the communication time of each communication operator in the operator stream in each training round, thereby obtaining the training time of the model to be trained.
2. The method of claim 1, wherein determining, based on the parallel strategy of the model to be trained, multiple communication operators located in the operator stream includes:
based on the number of model layers of the model to be trained and the pipeline parallelism (PP) degree in the parallel strategy, determining the computation operators corresponding to multiple training stages and the PP communication operators between training stages, the computation operators corresponding to any training stage being the computation operators of the model layers to which that stage belongs; and / or,
based on the data parallelism (DP) degree in the parallel strategy, determining the DP communication operators corresponding to the training stage; and / or,
based on the tensor parallelism (TP) degree in the parallel strategy, determining the computation operators executed by each GPU card in the training stage and the TP communication operators between GPU cards.
3. The method of claim 1, wherein, before determining by simulation the computation time of each computation operator and the communication time of each communication operator in the operator stream in each training round, the method further comprises: determining, based on the parallel strategy of the model to be trained and the hardware configuration of the cluster running the model to be trained, the allocation of GPU cards in the cluster during model training; and determining, based on the parallel strategy of the model to be trained and the amount of sample data of the model to be trained, the number of training rounds of the model training.
4. The method of any one of claims 1-3, wherein determining by simulation the computation time of each computation operator and the communication time of each communication operator in the operator stream in each training round comprises: for any computation operator in the operator stream, determining the computation time of the computation operator based on the computation amount of the computation operator and the hardware configuration of the GPU card executing the computation operator; and for any communication operator in the operator stream, determining the communication time of the communication operator based on the transmission bandwidth between the communication operator and its communication object and the communication volume of the communication operator.
5. The method of claim 4, wherein determining the computation time of the computation operator based on the amount of computation data and the hardware configuration of the GPU card executing the computation operator comprises: if the computation operator is a compute-intensive operator, taking the larger of the memory access time and the computation time of the computation operator as the computation time of the computation operator, wherein the memory access time is determined based on the amount of computation data of the computation operator and the Layer 2 bandwidth of the GPU card executing the computation operator, and the computation time is determined based on the amount of computation data of the computation operator and the hardware computing power of the GPU card executing the computation operator; and if the computation operator is a memory-intensive operator, taking the memory access time of the computation operator as the computation time of the computation operator.
6. The method of claim 4, wherein determining the communication time of the communication operator based on the transmission bandwidth between the communication operator and its communication object and the communication volume of the communication operator comprises: if the communication objects are located on the same server, determining the effective bandwidth corresponding to the communication operator based on a trend graph relating communication volume to effective bandwidth; and determining the communication time of the communication operator based on the communication volume of the communication operator and the effective bandwidth corresponding to the communication operator.
7. The method of claim 4, wherein determining the communication time of the communication operator based on the transmission bandwidth between the communication operator and its communication object and the communication volume of the communication operator comprises: if the communication objects are located on different servers, determining the communication time of the communication operator based on the communication volume of the communication operator, the inter-server communication bandwidth, and the transmission delay of the switch.
8. The method of any one of claims 1-3, wherein determining by simulation the computation time of each computation operator and the communication time of each communication operator in the operator stream in each training round, thereby obtaining the training duration of the model to be trained, comprises: determining by simulation the computation time of each computation operator and the communication time of each communication operator in the operator stream during the forward phase, thereby obtaining the forward phase time; determining by simulation the computation time of each computation operator and the communication time of each communication operator in the operator stream during the backward phase, thereby obtaining the backward phase time; determining by simulation the computation time of each computation operator and the communication time of each communication operator in the operator stream during the recomputation phase in each training round, thereby obtaining the recomputation phase time; determining the time consumed by each training round based on the forward phase time, the backward phase time, the recomputation phase time, and the idle waiting time in each training round; and determining the training duration of the model to be trained based on the time consumed by each training round and the number of training rounds.
9. A model training duration prediction apparatus, characterized in that the apparatus comprises a determining unit and a simulation unit; the determining unit is configured to determine the operator stream of the model to be trained based on the model architecture of the model to be trained, the operator stream including multiple computation operators, each computation operator having a computation amount indicated by its parameter configuration; the determining unit is further configured to determine multiple communication operators located in the operator stream based on the parallel strategy of the model to be trained, each communication operator having a communication volume indicated by its parameter configuration; and the simulation unit is configured to determine by simulation, based on the hardware configuration of the cluster running the model to be trained and the amount of sample data of the model to be trained, the computation time of each computation operator and the communication time of each communication operator in the operator stream in each training round, thereby obtaining the training duration of the model to be trained.
10. A model training duration prediction device, characterized in that it comprises: a processor, and a memory communicatively connected to the processor; the memory stores computer-executable instructions; and the processor executes the computer-executable instructions stored in the memory to implement the method of any one of claims 1-8.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions which, when executed by a processor, implement the method of any one of claims 1-8.
12. A computer program product, characterized in that it comprises a computer program which, when executed by a processor, implements the method of any one of claims 1-8.
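The timing rules recited in claims 5 to 8 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The claims only state which quantities each time is "determined based on", so every concrete formula here is an assumption: dividing work by throughput, splitting the operator's workload into separate FLOP and byte counts, linearly interpolating the volume-to-bandwidth trend graph, adding the switch delay to the transfer time, and treating all training rounds as equal in length. All names (`Gpu`, `compute_time`, `effective_bandwidth`, and so on) and units are hypothetical.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Gpu:
    peak_flops: float    # hardware computing power, FLOP/s (assumed unit)
    l2_bandwidth: float  # Layer 2 bandwidth, bytes/s (assumed unit)


def compute_time(flops: float, bytes_moved: float, gpu: Gpu,
                 compute_intensive: bool) -> float:
    """Claim 5: memory-intensive operators cost their memory access time;
    compute-intensive operators cost the larger of the two times."""
    memory_time = bytes_moved / gpu.l2_bandwidth
    if not compute_intensive:
        return memory_time
    return max(memory_time, flops / gpu.peak_flops)


def effective_bandwidth(volume: float,
                        curve: Sequence[Tuple[float, float]]) -> float:
    """Claim 6: look up effective bandwidth from a trend graph, modelled here
    (assumption) as (volume, bandwidth) points sorted by strictly increasing
    volume, linearly interpolated and clamped at both ends."""
    volumes = [v for v, _ in curve]
    i = bisect_left(volumes, volume)
    if i == 0:
        return curve[0][1]
    if i == len(curve):
        return curve[-1][1]
    (v0, b0), (v1, b1) = curve[i - 1], curve[i]
    return b0 + (b1 - b0) * (volume - v0) / (v1 - v0)


def intra_server_comm_time(volume: float,
                           curve: Sequence[Tuple[float, float]]) -> float:
    """Claim 6: same-server communication time from volume and effective bandwidth."""
    return volume / effective_bandwidth(volume, curve)


def inter_server_comm_time(volume: float, bandwidth: float,
                           switch_delay: float) -> float:
    """Claim 7: cross-server time from volume, bandwidth and switch delay
    (the additive form is an assumption)."""
    return volume / bandwidth + switch_delay


def training_duration(forward: float, backward: float, recompute: float,
                      idle: float, rounds: int) -> float:
    """Claim 8: per-round time is the sum of the phase times and idle waiting;
    the total assumes every round takes the same time."""
    return (forward + backward + recompute + idle) * rounds


if __name__ == "__main__":
    gpu = Gpu(peak_flops=100e12, l2_bandwidth=2e12)  # illustrative values only
    curve = [(1e6, 50e9), (1e9, 200e9)]              # illustrative trend points
    print(compute_time(2e12, 1e9, gpu, compute_intensive=True))
    print(intra_server_comm_time(1e6, curve))
    print(inter_server_comm_time(1e9, 25e9, 5e-6))
    print(training_duration(1.0, 2.0, 0.5, 0.25, rounds=100))
```

A full simulator in the sense of claim 1 would walk the operator stream and apply these per-operator functions to each computation and communication operator; the sketch covers only the per-operator and per-round arithmetic.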