Model training duration estimation method and apparatus, device, and storage medium
By determining the computation and communication time through simulation and combining the model architecture and hardware configuration, the training time of the model can be estimated, which solves the problems of high cost and poor flexibility in the existing technology and improves accuracy and applicability.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- RUIJIE NETWORKS CO LTD
- Filing Date
- 2025-07-03
- Publication Date
- 2026-07-02
AI Technical Summary
Existing distributed parallel training techniques rely on recording the actual training process time, which is costly and only supports a single parallel strategy, resulting in poor flexibility and an inability to effectively predict model training time.
The computation time of computation operators and the communication time of communication operators are determined by simulation. Combined with model architecture, parallel strategies and hardware configuration, the training time of the model is estimated. Multiple parallel strategies are supported, reducing the dependence on GPU cards.
It reduces prediction costs, improves the accuracy and applicability of training duration prediction, supports multiple parallel strategies, and is suitable for different hardware environments.
Methods, devices, equipment and storage media for estimating model training time
[0001] Cross-reference to related applications
[0002] This application claims priority to Chinese Patent Application No. 202411911107.1, filed on December 24, 2024, with the State Intellectual Property Office of the People's Republic of China, entitled "A Method, Apparatus, Device and Storage Medium for Estimating Model Training Duration", the entire contents of which are incorporated herein by reference.
Technical Field
[0003] This application relates to the field of computer technology, and in particular to a method, apparatus, device and storage medium for estimating model training time.
Background Technology
[0004] With the rapid development of artificial intelligence, training models on a single machine is no longer feasible, and distributed parallel training technology has been widely adopted. Distributed parallel training refers to training a model simultaneously on multiple GPU cards to improve the speed and scale of model training.
[0005] Existing distributed parallel training technology sets up multiple sub-computing systems, distributes the model and training data to these sub-computing systems for multiple rounds of iterative training, records the time consumption and data transmission volume, and calculates the training time by combining the data transmission volume and bandwidth.
Summary of the Invention
[0006] Exemplary embodiments of this application provide a method, apparatus, device, and storage medium for estimating model training time.
[0007] In a first aspect, this application provides a method for estimating model training time, the method comprising:
[0008] Based on the model architecture and model parameters of the model to be trained, determine multiple computational operators of the operator flow of the model to be trained;
[0009] Based on the parallel strategy of the model to be trained, multiple communication operators located in the operator stream are determined;
[0010] Based on the multiple computation operators, the multiple communication operators, the hardware configuration of the cluster running the model to be trained, and the amount of sample data of the model to be trained, the computation time of each computation operator and the communication time of each communication operator in the operator stream of each training round are determined by simulation, thereby obtaining the training time of the model to be trained.
[0011] The model training time estimation method provided in this application first determines multiple computation operators based on the model architecture of the model to be trained, and multiple communication operators based on the parallel strategy of the model to be trained. Then, based on the cluster configuration and sample data volume of the model to be trained, the computation time of each computation operator and the communication time of each communication operator are determined through simulation. Finally, the training time of the model to be trained is determined based on the computation time of each computation operator and the communication time of each communication operator. Because this scheme determines the computation time and communication time through simulation, the training task does not have to be executed on GPU cards in order to estimate the training time, which reduces the dependence on GPU cards and lowers the estimation cost. Moreover, this application considers not only computation time but also communication time when estimating the training time, which improves the accuracy of the estimate.
[0012] In one possible design, the model parameters include at least one of the following:
[0013] The vocabulary length, decoder sequence length, hidden layer size, feedforward network (FFN) hidden layer size, or number of attention heads.
[0014] In one possible design, the parallel strategy of the model to be trained includes at least one of the following:
[0015] Pipeline parallelism (PP), tensor parallelism (TP), data parallelism (DP), sequence parallelism, context parallelism, or mixture-of-experts (MoE) parallelism.
[0016] In one possible design, each of the plurality of computation operators has a computational amount indicated by its own parameter configuration.
[0017] In one possible design, each of the plurality of communication operators has a communication volume indicated by its own parameter configuration.
[0018] In one possible design, based on the parallel strategy of the model to be trained, multiple communication operators located in the operator stream are determined, including:
[0019] Based on the number of model layers in the model to be trained and the pipeline parallelism (PP) degree in the parallel strategy, determine the computation operators corresponding to each of the multiple training stages and the PP communication operators between every two training stages; and / or,
[0020] Based on the data parallelism (DP) degree in the parallel strategy, determine the DP communication operators corresponding to the training stage; and / or,
[0021] Based on the tensor parallelism (TP) degree in the parallel strategy, determine the computation operators executed by each GPU card in the training stage and the TP communication operators between GPU cards.
[0022] This application is applicable not only to pipeline parallelism (PP) strategies, but also to tensor parallelism (TP) strategies and data parallelism (DP) strategies. Compared with existing model training time prediction schemes that only support PP strategies or a single parallel strategy, this application supports multiple parallel strategies, thereby improving the applicability and flexibility of the model training time prediction method.
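As a minimal sketch of how a PP degree could turn a layer count into stages and PP communication operators, the following Python assumes that layers are divided evenly among stages and that one PP communication operator sits between each pair of adjacent stages. The `Operator` class, the function name, and the operator naming scheme are hypothetical and are not taken from this application.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Operator:
    name: str
    kind: str   # "compute" or "comm"
    stage: int  # index of the pipeline stage that owns the operator


def build_pipeline_stream(num_layers: int, pp_degree: int) -> List[Operator]:
    """Assign layers to stages and insert PP communication operators.

    Assumption: layers divide evenly across stages, and a PP communication
    operator is placed only between adjacent stages.
    """
    if pp_degree < 1 or num_layers % pp_degree != 0:
        raise ValueError("this sketch assumes layers divide evenly across stages")
    layers_per_stage = num_layers // pp_degree
    stream: List[Operator] = []
    for stage in range(pp_degree):
        for i in range(layers_per_stage):
            layer = stage * layers_per_stage + i
            stream.append(Operator(f"layer_{layer}", "compute", stage))
        # Every stage except the last hands its activations to the next stage.
        if stage < pp_degree - 1:
            stream.append(Operator(f"pp_{stage}_to_{stage + 1}", "comm", stage))
    return stream


stream = build_pipeline_stream(num_layers=96, pp_degree=8)
num_comm = sum(1 for op in stream if op.kind == "comm")
```

DP and TP communication operators could be appended to the same stream in a similar way, keyed on the DP and TP degrees.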
[0023] In one possible design, before determining the computation time of each computation operator and the communication time of each communication operator in the operator stream of each training round through simulation, the following steps are also included:
[0024] Based on the parallel strategy of the model to be trained and the hardware configuration of the cluster running the model to be trained, the allocation of GPU cards in the cluster during the model training process is determined.
[0025] Based on the parallel strategy of the model to be trained and the amount of sample data of the model to be trained, the training rounds of the model training are determined.
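The two preparatory steps above can be illustrated with a short Python sketch. It assumes that the number of GPU cards needed is the product of the TP, PP and DP degrees, and that one training round consumes one global batch (micro-batch size × DP degree × gradient accumulation steps). Both rules, and all function and parameter names, are illustrative assumptions rather than details stated in this application.

```python
import math


def gpus_required(tp_degree: int, pp_degree: int, dp_degree: int) -> int:
    """Assumed rule: each (TP, PP, DP) coordinate occupies one GPU card."""
    return tp_degree * pp_degree * dp_degree


def fits_cluster(required: int, num_servers: int, gpus_per_server: int) -> bool:
    """Check the requirement against a hypothetical homogeneous cluster."""
    return required <= num_servers * gpus_per_server


def training_rounds(num_samples: int, micro_batch_size: int,
                    dp_degree: int, grad_accum_steps: int = 1) -> int:
    """Assumed rule: one round processes one global batch of samples."""
    global_batch = micro_batch_size * dp_degree * grad_accum_steps
    return math.ceil(num_samples / global_batch)


needed = gpus_required(tp_degree=8, pp_degree=8, dp_degree=4)
rounds = training_rounds(num_samples=1_000_000, micro_batch_size=4,
                         dp_degree=8, grad_accum_steps=16)
```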
[0026] In one possible design, the computation time of each computation operator and the communication time of each communication operator in the operator stream of each training round are determined through simulation, including:
[0027] The computation time of the first computation operator is determined based on the computational amount of the first computation operator and the hardware configuration of the GPU card executing the first computation operator, wherein the operator stream in each training round includes the first computation operator.
[0028] Determining the computation time of a computation operator based on the GPU card's hardware configuration and the computational load of the computation operator can improve the accuracy of computation time determination. Determining the communication time of a communication operator based on the transmission bandwidth between communication objects and the communication load of the communication operator can improve the accuracy of communication time determination. Estimating the model training time based on more accurate computation and communication times can improve the accuracy of prediction.
[0029] In one possible design, the computation time of the first computation operator is determined based on the amount of computational data and the hardware configuration of the GPU card executing the first computation operator, including:
[0030] If the first computation operator is a computationally intensive operator, then the maximum value between the memory access time and the computation time of the first computation operator is taken as the computation time of the first computation operator; wherein, the memory access time is determined based on the amount of computation data of the first computation operator and the Layer 2 bandwidth of the GPU card executing the first computation operator; the computation time is determined based on the amount of computation data of the first computation operator and the hardware computing power of the GPU card executing the first computation operator;
[0031] If the first computation operator is a memory-intensive operator, then the memory access time of the first computation operator is determined as the computation time of the first computation operator.
[0032] This application subdivides computation operators into computationally intensive operators and memory-intensive operators, and designs different methods for determining computation time for these two types of computation operators, ensuring the rationality and accuracy of the methods for determining the computation time of computation operators.
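The two cases can be written compactly as a sketch. The code below assumes that the "amount of computation data" is available as a floating-point operation count for the computing-power term and as a byte count for the memory-access term; that split, the units, and all names are assumptions made for illustration. The `memory_bandwidth` parameter stands in for the bandwidth that the text calls the Layer 2 bandwidth of the GPU card.

```python
def operator_compute_time(flops: float, bytes_moved: float,
                          peak_flops: float, memory_bandwidth: float,
                          compute_intensive: bool) -> float:
    """Estimate the time of one computation operator, in seconds.

    flops / peak_flops         -> pure computation time (assumed units: FLOP, FLOP/s)
    bytes_moved / memory_bw    -> memory access time (assumed units: bytes, bytes/s)
    """
    memory_time = bytes_moved / memory_bandwidth
    if compute_intensive:
        # Compute-intensive operator: the slower of the two terms dominates.
        return max(memory_time, flops / peak_flops)
    # Memory-intensive operator: only the memory access time counts.
    return memory_time


matmul_time = operator_compute_time(2e12, 1e9, 1e14, 1e12, compute_intensive=True)
softmax_time = operator_compute_time(1e9, 1e9, 1e14, 1e12, compute_intensive=False)
```

Taking the maximum for compute-intensive operators means that an operator with little arithmetic relative to its data volume still falls back to the memory access time.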
[0033] In one possible design, the computation time of each computation operator and the communication time of each communication operator in the operator stream of each training round are determined through simulation, including:
[0034] The communication time of the first communication operator is determined based on the transmission bandwidth between the first communication operator and the communication object and the communication volume of the first communication operator, wherein the plurality of communication operators includes the first communication operator.
[0035] In one possible design, determining the communication time of the first communication operator based on the transmission bandwidth between the first communication operator and the communication object, and the communication volume of the communication operator, includes:
[0036] If the communication objects are located on the same server, the effective bandwidth corresponding to the first communication operator is determined based on the trend graph between the communication volume and the effective bandwidth.
[0037] The communication time of the first communication operator is determined based on the communication volume of the first communication operator and the effective bandwidth corresponding to the first communication operator.
[0038] When the communication objects are located on the same server, the communication time is not determined by a single fixed effective bandwidth. This application takes into account that the effective bandwidth during transmission is related to the amount of communication transmitted. Different amounts of communication correspond to different effective bandwidths. Therefore, this application determines the effective bandwidth corresponding to the transmission of the communication operator based on the communication amount of the communication operator and the trend graph between the communication amount and the effective bandwidth, which can improve the accuracy of determining the communication time of the communication operator.
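One way to use such a trend graph in a simulator is to store it as sampled points and interpolate. The sketch below assumes piecewise-linear interpolation between measured (communication volume, effective bandwidth) points and clamps outside the measured range. The interpolation scheme and the sample values are hypothetical, not taken from Figure 7.

```python
from bisect import bisect_left
from typing import List, Tuple

# Hypothetical samples: (communication volume in bytes, effective bandwidth in bytes/s).
CURVE: List[Tuple[float, float]] = [(1e3, 1e9), (1e6, 5e10), (1e9, 2e11)]


def effective_bandwidth(volume: float, curve: List[Tuple[float, float]]) -> float:
    """Look up the effective bandwidth for a volume by linear interpolation."""
    sizes = [size for size, _ in curve]
    if volume <= sizes[0]:
        return curve[0][1]
    if volume >= sizes[-1]:
        return curve[-1][1]
    i = bisect_left(sizes, volume)  # first sample with size >= volume
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return y0 + (y1 - y0) * (volume - x0) / (x1 - x0)


def intra_server_comm_time(volume: float, curve: List[Tuple[float, float]]) -> float:
    """Communication time = volume / effective bandwidth at that volume."""
    return volume / effective_bandwidth(volume, curve)
```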
[0039] In one possible design, determining the communication time of the first communication operator based on the transmission bandwidth between the first communication operator and the communication object, and the communication volume of the first communication operator, includes:
[0040] If the communication objects are located on different servers, the communication time of the first communication operator is determined based on the communication volume of the first communication operator, the inter-server communication bandwidth, and the transmission delay of the switch.
[0041] When communication objects are located on different servers, the transmission of communication operators usually requires the use of switches. Switches have a certain transmission delay. Therefore, when communication objects are located on different servers, determining the communication time of communication operators based on the communication volume of communication operators, inter-server communication bandwidth, and the transmission delay of switches can improve the accuracy of communication time determination.
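A minimal sketch of the inter-server case follows. It assumes that the three quantities combine additively, that is, transfer time plus a per-switch delay multiplied by the number of switches traversed. The additive form and the `num_switch_hops` parameter are assumptions made for illustration; the text only names the inputs.

```python
def inter_server_comm_time(volume: float, inter_server_bandwidth: float,
                           switch_delay: float, num_switch_hops: int = 1) -> float:
    """Estimate the communication time between GPU cards on different servers.

    volume                  -- bytes to transfer
    inter_server_bandwidth  -- bytes per second between the servers
    switch_delay            -- seconds of transmission delay per switch
    num_switch_hops         -- switches on the path (hypothetical parameter)
    """
    return volume / inter_server_bandwidth + switch_delay * num_switch_hops


t = inter_server_comm_time(volume=1e9, inter_server_bandwidth=1e10, switch_delay=2e-6)
```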
[0042] In one possible design, the computation time of each computation operator and the communication time of each communication operator in the operator stream of each training round are determined by simulation to obtain the training duration of the model to be trained, including:
[0043] The computation time of each computation operator and the communication time of each communication operator in the forward phase of the operator stream of each training round are determined by simulation, and the forward phase time is obtained.
[0044] The computation time of each computation operator and the communication time of each communication operator in the backward phase of the operator stream of each training round are determined by simulation, and the backward phase time is obtained.
[0045] The computation time of each computation operator and the communication time of each communication operator in the recomputation phase of the operator stream of each training round are determined by simulation, and the recomputation phase time is obtained.
[0046] The time consumed by each training round is determined based on the forward phase time, the backward phase time, the recomputation phase time, and the idle waiting time in that training round.
[0047] The training duration of the model to be trained is determined based on the time consumed in each training round and the number of training rounds.
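The aggregation in these steps can be sketched as below, assuming that the four per-round components simply add up and that every training round takes the same time. The text says only that the round time is "based on" these components, so the plain sum and all names are illustrative assumptions.

```python
def round_time(forward: float, backward: float,
               recompute: float, idle_wait: float) -> float:
    """Assumed rule: the round time is the sum of its four components (seconds)."""
    return forward + backward + recompute + idle_wait


def training_duration(per_round: float, num_rounds: int) -> float:
    """Total training time when every round is assumed to take the same time."""
    return per_round * num_rounds


per_round = round_time(forward=1.0, backward=2.0, recompute=0.5, idle_wait=0.25)
total_seconds = training_duration(per_round, num_rounds=1000)
```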
[0048] In a second aspect, this application also provides a model training time prediction device, the device comprising:
[0049] The determining unit is used to determine multiple computational operators of the operator flow of the model to be trained based on the model architecture and model parameters of the model to be trained.
[0050] The determining unit is further configured to determine multiple communication operators located in the operator stream based on the parallel strategy of the model to be trained.
[0051] The simulation unit is used to determine the computation time of each computation operator and the communication time of each communication operator in each training round by means of simulation, based on the multiple computation operators, the multiple communication operators, the hardware configuration of the cluster running the model to be trained, and the sample data volume of the model to be trained, thereby obtaining the training time of the model to be trained.
[0052] In one possible design, the determining unit is configured to, based on the parallel strategy of the model to be trained, determine multiple communication operators located in the operator stream as follows: based on the number of model layers of the model to be trained and the pipeline parallelism (PP) degree in the parallel strategy, determine the computation operators corresponding to each of the multiple training stages and the PP communication operators between every two training stages; and / or, based on the data parallelism (DP) degree in the parallel strategy, determine the DP communication operators corresponding to the training stage; and / or, based on the tensor parallelism (TP) degree in the parallel strategy, determine the computation operators executed by each GPU card in the training stage and the TP communication operators between GPU cards.
[0053] In one possible design, before determining the computation time of each computation operator and the communication time of each communication operator in the operator stream of each training round through simulation, the determining unit is further configured to determine the allocation of GPU cards of the cluster in the model training process based on the parallel strategy of the model to be trained and the hardware configuration of the cluster running the model to be trained; and to determine the training rounds of the model training based on the parallel strategy of the model to be trained and the amount of sample data of the model to be trained.
[0054] In one possible design, the simulation unit is used to determine the computation time of each computation operator and the communication time of each communication operator in the operator stream of each training round through simulation: the computation time of the first computation operator is determined based on the computational amount of the first computation operator and the hardware configuration of the GPU card executing the first computation operator, wherein the operator stream in each training round includes the first computation operator.
[0055] In one possible design, the simulation unit is configured to determine the computation time of the first computation operator based on the computational data volume of the first computation operator and the hardware configuration of the GPU card executing the first computation operator: if the first computation operator is a computationally intensive operator, then the maximum value between the memory access time and the computation time of the first computation operator is taken as the computation time of the first computation operator; wherein, the memory access time is determined based on the computational data volume of the first computation operator and the Layer 2 bandwidth of the GPU card executing the first computation operator; the computation time is determined based on the computational data volume of the first computation operator and the hardware computing power of the GPU card executing the first computation operator; if the first computation operator is a memory-intensive operator, then the memory access time of the first computation operator is determined as the computation time of the first computation operator.
[0056] In one possible design, the simulation unit is used to determine the communication time of the first communication operator based on the transmission bandwidth between the first communication operator and the communication object and the communication volume of the first communication operator as follows: if the communication objects are located on the same server, the effective bandwidth corresponding to the first communication operator is determined based on the trend graph between the communication volume and the effective bandwidth; the communication time of the first communication operator is determined based on the communication volume of the first communication operator and the effective bandwidth corresponding to the first communication operator.
[0057] In one possible design, the simulation unit is used to determine the communication time of the first communication operator based on the transmission bandwidth between the first communication operator and the communication object and the communication volume of the first communication operator as follows: if the communication objects are located on different servers, the communication time of the first communication operator is determined based on the communication volume of the first communication operator, the inter-server communication bandwidth, and the transmission delay of the switch.
[0058] In one possible design, the simulation unit is used to determine the computation time of each computation operator and the communication time of each communication operator in the operator stream of each training round through simulation, so as to obtain the training duration of the model to be trained, as follows: determining the computation time of each computation operator and the communication time of each communication operator in the forward phase of the operator stream in each training round through simulation, thus obtaining the forward phase time; determining the computation time of each computation operator and the communication time of each communication operator in the backward phase of the operator stream in each training round through simulation, thus obtaining the backward phase time; determining the computation time of each computation operator and the communication time of each communication operator in the recomputation phase of the operator stream in each training round through simulation, thus obtaining the recomputation phase time; determining the duration of each training round based on the forward phase time, backward phase time, recomputation phase time, and idle waiting time; and determining the training duration of the model to be trained based on the duration of each training round and the number of training rounds.
[0059] Furthermore, the model training time prediction device can implement the method described in the first aspect above through the determination unit and / or the simulation unit, which will not be elaborated further here.
[0060] In a third aspect, this application also provides a model training time estimation device, which includes: a processor and a memory communicatively connected to the processor;
[0061] The memory stores computer-executable instructions;
[0062] The processor executes the computer-executable instructions stored in the memory to implement the method described in the first aspect above.
[0063] In a fourth aspect, this application also provides a computer-readable storage medium comprising a program that, when executed on a device, causes the device to perform the method described in the first aspect above.
[0064] In a fifth aspect, this application also provides a computer program product, the computer program product comprising a computer program that, when executed by a processor, implements the method described in the first aspect above.
Brief Description of the Drawings
[0065] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0066] Figure 1 is a flowchart illustrating a model training time estimation method provided in an embodiment of this application;
[0067] Figure 2 is a schematic diagram of an attention head structure provided in an embodiment of this application;
[0068] Figure 3 is a schematic diagram of the computational operator flow corresponding to a Transformer architecture provided in an embodiment of this application;
[0069] Figure 4 is a schematic diagram of a computational operator and a communication operator for the training phase provided in an embodiment of this application;
[0070] Figure 5 is a schematic diagram of a computational operator and a communication operator for determining a TP strategy provided in an embodiment of this application;
[0071] Figure 6 is a schematic diagram of a GPU card's computation and communication operators provided in an embodiment of this application;
[0072] Figure 7 is a trend chart of communication volume and effective bandwidth provided in an embodiment of this application;
[0073] Figure 8 is a schematic diagram of the interaction between computational modeling, communication modeling and process modeling modules provided in an embodiment of this application;
[0074] Figure 9 is a schematic diagram of the model training time prediction device provided in an embodiment of this application.
[0075] Figure 10 is a schematic diagram of the model training time prediction device provided in the embodiment of this application.
Detailed Description
[0076] To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0077] In the description of the embodiments of this application, unless otherwise stated, " / " means "or". For example, A / B can mean A or B. The "and / or" in the text is merely a description of the relationship between related objects, indicating that there can be three relationships. For example, A and / or B can mean: A exists alone, A and B exist simultaneously, and B exists alone. In addition, in the description of the embodiments of this application, "multiple" means two or more.
[0078] Hereinafter, the terms "first" and "second" are used for descriptive purposes only and should not be construed as implying or suggesting relative importance or implicitly indicating the number of indicated technical features. Thus, a feature defined with "first" or "second" may explicitly or implicitly include one or more of that feature.
[0079] When using the terms "comprising," "having," and "including" as described in this application, another component may be added unless explicitly qualifying terms such as "only," "consisting of," etc. are used. Unless otherwise stated, singular terms may include plural forms and should not be construed as having only one quantity.
[0080] The application scenarios described in this application are for the purpose of more clearly illustrating the technical solutions of this application, and do not constitute a limitation on the technical solutions provided in this application. Those skilled in the art will understand that with the emergence of new application scenarios, the technical solutions provided in this application are also applicable to similar technical problems.
[0081] With the rapid development of artificial intelligence, training models on a single machine is no longer feasible, and distributed parallel training technology has been widely adopted. Distributed parallel training refers to training a model simultaneously on multiple GPU cards to improve the speed and scale of model training.
[0082] Existing distributed parallel training techniques involve setting up multiple sub-computing systems to distribute the model and training data across these systems for multiple rounds of iterative training. The training time is recorded along with the data transfer volume and bandwidth to calculate the training duration. Training time is closely related to the resources and costs required for training. However, this approach requires a real training process to obtain the training time, and existing distributed parallel training techniques typically only support a single parallel strategy, lacking flexibility.
[0083] Being able to estimate model training time before training is crucial for companies to make decisions related to model training.
[0084] Current methods for estimating model training time are mainly for predicting the training time of computer vision (CV) models. CV models are mainly used in server scenarios, such as autonomous driving, medical image analysis, retail management, security monitoring, smart homes, industrial automation and robot navigation, and agricultural inspection and management.
[0085] The methods for estimating model training time in related technologies have two main drawbacks: one is that they rely on GPU cards to actually train the model and record the time consumption, and GPU cards are expensive, so the cost of estimating model training time is high; the other is that they can only support a single parallel strategy, which results in a small scope of application and poor flexibility for model training time estimation methods.
[0086] This application provides an exemplary method for estimating model training time, as shown in Figure 1. The execution entity of this method can be a server, a chip within the server, or a functional module within the server; this application does not limit the specific implementation. The following description uses a server as the execution entity. The method includes the following steps.
[0087] Step 101: Based on the model architecture of the model to be trained, determine the operator flow of the model to be trained. The operator flow includes multiple computation operators; each computation operator has its own computational amount indicated by its parameter configuration.
[0088] In deep learning, a model and a model architecture are two different concepts. A model usually refers to a specific network instance, that is, a network trained on a specific dataset, with specific weights and parameters. For instance, the Generative Pre-trained Transformer (GPT) series of models developed and trained by OpenAI can generate text that closely resembles human conversation and is widely used in fields such as intelligent customer service, content generation, and code writing. A model architecture typically refers to the overall structure of a model, including the number of layers, the type of each layer, and parameter settings. Layer types include convolutional layers, pooling layers, fully connected layers, and so on. The Transformer architecture is a common model architecture. It is a neural network architecture based on an attention mechanism and is generally divided into four parts: input, encoder, decoder, and output. The encoder and the decoder are each composed of several stacked layers that share the same structure but have different parameters. The encoder is mainly responsible for converting the input sequence into a fixed-length vector representation, and the decoder decodes this vector into an output sequence. Furthermore, the self-attention mechanism introduced by the Transformer architecture allows the model to attend to different positions in the input sequence when processing sequential data, which enables the input sequence to be processed in parallel and accelerates model training.
[0089] The exemplary embodiments of this application can predict the training time of mainstream GPT-series models and Large Language and Vision Assistant (LLaVA) series models. The LLaVA series models replicate some of OpenAI's functionality in image dialogue, and are characterized by their ability to improve upon other open-source solutions while using simpler model architectures and less training data. This makes the LLaVA series models faster, cheaper, and more suitable for inference on consumer-grade hardware. The following explanation uses GPT-3 as an example to illustrate the model training time prediction method proposed in this application.
[0090] For example, GPT-3 uses the classic Transformer architecture and introduces a multi-head self-attention mechanism. The largest version of GPT-3 has 96 layers, each containing 12,288 hidden units and 96 attention heads. In the multi-head self-attention mechanism, the input sequence undergoes different linear transformations through the different weight matrices W_Q, W_K, and W_V. The resulting Q, K, and V vectors are shown in Figure 2. The dashed box in Figure 2 is a schematic diagram of the structure of one attention head. GPT-3 has 96 such attention head structures. Finally, the outputs of these 96 attention heads are concatenated to form the final output vector of the multi-head attention.
[0091] For example, the parameters of the model to be trained are shown in Table 1. They mainly include the vocabulary length, the decoder sequence length, the hidden layer size, the feed-forward network (FFN) hidden layer size, and the number of attention heads.
[0092] Table 1
[0093] For example, an operator flow refers to a process consisting of a series of operators during data processing and analysis. The core idea of an operator flow is to decompose a data processing task into a series of independent operation units. Each operator performs a specific data processing task and then passes its result to the next operator, ultimately completing the entire data processing flow. In the field of deep learning, a model architecture can be abstracted into a computational operator flow; once the model architecture is determined, the corresponding computational operator flow is determined as well. Figure 3 shows the computational operator flow corresponding to the Transformer architecture used by GPT-3. The computational operator flow includes multiple computational operators and indicates the execution order among them. The computational operators included in the computational operator flow shown in Figure 3 are, in order, Layer Norm, Wqkv, QK, softmax, QKV, FC, Res add, Layer Norm, h->4h, gelu, 4h->h, and Res add. The execution order of the computational operators is indicated by the arrows in Figure 3. The computational operators included in Figure 3 are listed in Table 2.
[0094] Table 2
[0095] In Table 2, `input0` gives the rows and columns of one input matrix of the computation operator, and the other input matrix is either `input1` or `weight`. `output` gives the rows and columns of the output matrix, which is the computation result of the computation operator. Specifically, `input0` and `input1` undergo matrix addition or multiplication, or `input0` and `weight` undergo matrix addition or multiplication, and the computation result output by the computation operator is `output`. Taking the computation operator QK in Table 2 as an example, its input matrices are `input0` and `input1`. The input matrix corresponding to `input0` has `s` rows and `a` columns, and the input matrix corresponding to `input1` has `a` rows and `s` columns. The output matrix obtained by multiplying `input0` and `input1` is `output`, which has `s` rows and `s` columns. Table 2 thus shows, for each computation operator, its input matrices, the operation or processing applied to the input matrices, and the computation result.
[0096] The computation operators in Table 2 are explained below with reference to Figure 3:
[0097] ① The Layer Norm operator is used to normalize the input data. The result of the Layer Norm operator is one of the inputs of the Wqkv operator.
[0098] ② The computation operator Wqkv refers to the three operators Wq, Wk, and Wv taken as a whole.
[0099] ③ The inputs to the computation operator QK are Wq and Wk. Multiplication is performed on Wq and Wk in the computation operator QK, and the result is the input to the computation operator softmax.
[0100] ④ The softmax operator converts the computation result of the computation operator QK into a normalized probability distribution and inputs it into the computation operator QKV.
[0101] ⑤ One of the inputs to the computation operator QKV is the output of the computation operator softmax, and the other input is Wv. The computation operator QKV performs a multiplication operation on the two input matrices, and the result is one of the inputs to the computation operator FC.
[0102] ⑥ Another input to the computation operator FC is the weight matrix Wo. According to Table 2, the number of rows and columns of the weight matrix Wo is h. The computation operator FC performs a multiplication calculation on the two input matrices, and the result is one of the inputs to the computation operator Res add.
[0103] ⑦ Another input to the computation operator Res add is the initial input value of the computation operator stream shown in Figure 3. The computation operator Res add performs an addition operation on the two input matrices, and the result is one of the inputs to the computation operator Layer Norm.
[0104] ⑧ The input matrices of the computation operator h->4h are the output of the computation operator Layer Norm and the weight matrix Wa. According to Table 2, the number of rows and columns of Wa are h and 4h, respectively. The computation operator h->4h performs a multiplication operation on these two input matrices, and the result is the input of the computation operator gelu.
[0105] ⑨ The computation operator gelu applies a Gaussian-based activation function to weight its input, and the result is one of the inputs to the computation operator 4h->h.
[0106] ⑩ The other input to the computation operator 4h->h is the weight matrix Wb. According to Table 2, the number of rows and columns of Wb are 4h and h, respectively. The computation operator 4h->h performs a multiplication operation on the two input matrices, and the result is one of the inputs to the computation operator Res add.
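As an illustration only, the operator flow described in items ① to ⑩ can be written down as an ordered list in which every operator names the results or weights it consumes. The following Python sketch is not part of the embodiment: the identifiers (including the `_1` and `_2` suffixes that distinguish the two Layer Norm and the two Res add operators) are hypothetical, and the second input of the final Res add operator is an assumption, since the description above does not state it.

```python
# External inputs: the layer input "x" and the weight matrices named in the text.
EXTERNAL_INPUTS = {"x", "Wo", "Wa", "Wb"}

# (operator id, ids of the results or weights it consumes), in execution order.
# The "_1"/"_2" suffixes are hypothetical and only distinguish repeated operators.
OPERATOR_STREAM = [
    ("layer_norm_1", ("x",)),
    ("wqkv", ("layer_norm_1",)),
    ("qk", ("wqkv",)),                 # consumes the Wq and Wk results
    ("softmax", ("qk",)),
    ("qkv", ("softmax", "wqkv")),      # consumes the softmax result and Wv
    ("fc", ("qkv", "Wo")),
    ("res_add_1", ("fc", "x")),
    ("layer_norm_2", ("res_add_1",)),
    ("h_to_4h", ("layer_norm_2", "Wa")),
    ("gelu", ("h_to_4h",)),
    ("four_h_to_h", ("gelu", "Wb")),
    ("res_add_2", ("four_h_to_h", "res_add_1")),  # second input is an assumption
]


def is_valid_execution_order(stream, external_inputs):
    """Return True if every operator only consumes results that already exist."""
    available = set(external_inputs)
    for name, inputs in stream:
        if any(source not in available for source in inputs):
            return False
        available.add(name)
    return True


print(is_valid_execution_order(OPERATOR_STREAM, EXTERNAL_INPUTS))
```

Representing the flow as an ordered list with explicit inputs keeps both pieces of information the description relies on: which computation operators exist and in what order they execute.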
[0107] Assuming the vocabulary length (V) of the model to be trained is 51200, the decoder sequence length (s) is 2048, the hidden layer size (h) is 12288, and the number of multi-heads (a) is 128, then Table 2 can be converted into Table 3:
[0108] Table 3
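As an illustration only, the substitution of concrete parameter values into operator shapes can be sketched as follows for the matrix multiplication operators whose shapes are stated in the description (QK, FC, h->4h, and 4h->h). The sketch makes two assumptions that are not stated in this application: the `input0` matrix of FC and h->4h is taken to be `s` x `h`, and the computational amount of a matrix multiplication is counted as 2·m·k·n operations (one multiply and one add per term).

```python
def matmul_shape_and_flops(a_shape, b_shape):
    """Return the output shape and an operation count for (m x k) @ (k x n)."""
    m, k = a_shape
    k2, n = b_shape
    if k != k2:
        raise ValueError("inner dimensions must match")
    # Assumption: each multiply-add pair is counted as two operations.
    return (m, n), 2 * m * k * n


# Parameter values taken from the description above.
s, h, a = 2048, 12288, 128

# Operator name -> (input0 shape, input1 or weight shape).
MATMUL_OPERATORS = {
    "QK": ((s, a), (a, s)),
    "FC": ((s, h), (h, h)),              # input0 shape is an assumption
    "h->4h": ((s, h), (h, 4 * h)),       # input0 shape is an assumption
    "4h->h": ((s, 4 * h), (4 * h, h)),
}

for name, (lhs, rhs) in MATMUL_OPERATORS.items():
    out_shape, flops = matmul_shape_and_flops(lhs, rhs)
    print(f"{name}: {lhs} x {rhs} -> {out_shape}, {flops} operations")
```

With these values, QK multiplies a 2048 x 128 matrix by a 128 x 2048 matrix and produces a 2048 x 2048 output, which matches the `s` x `s` output shape given for QK.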
[0109] As can be seen from the above, this application only needs to obtain the model architecture and model parameters of the model to be trained to determine the multiple computational operators included in the operator stream of the model to be trained and the execution order among the multiple computational operators. It should be noted that the computational amount corresponding to each computational operator can be determined according to the model parameters in step 101, or the data amount corresponding to each computational operator can be determined according to the model parameters when determining the computation time of each computational operator in subsequent step 103. This application does not limit this.
[0110] Step 102: Based on the parallel strategy of the model to be trained, determine multiple communication operators located in the operator stream; each communication operator has its own communication volume indicated by its parameter configuration.
[0111] For example, this application supports training the model to be trained using a variety of mainstream parallel strategies, such as pipeline parallelism (PP), tensor parallelism (TP), data parallelism (DP), sequence parallelism, context parallelism, and mixture of experts (MoE) parallelism. The PP strategy divides the model into multiple stages by layer, with each stage handled by a different device. The TP strategy partitions the model within a layer, so that the input and parameter matrices are split into blocks and each block is computed separately. The DP strategy divides the sample data into multiple batches, with each batch assigned to a different device for training. Each device holds a complete copy of the model and computes the loss and gradients locally, and the model parameters are updated by synchronizing the gradients across all devices. The sequence parallel strategy is mainly used for long sequences: a long sequence is split into multiple subsequences, and the subsequences are processed on different devices simultaneously. The context parallel strategy can be seen as an enhancement of the sequence parallel strategy: all inputs and all output activations are split along the sequence dimension and computed on different devices, which improves the training efficiency of context-dependent tasks. The MoE parallel strategy splits the model into multiple experts, each responsible for processing a portion of the input data, thereby achieving efficient utilization of computing resources and efficient model training.
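As an illustration only, the gradient synchronization described for the DP strategy can be sketched with a one-parameter model. Everything in the sketch is a hypothetical stand-in: the loss function, the function names, and the averaging used in place of a real gradient all-reduce. It also assumes that the samples divide evenly among the devices.

```python
def local_gradient(w, shard):
    """Mean gradient of 0.5 * (w * x - y) ** 2 with respect to w over one shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for a gradient all-reduce: every device receives the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def data_parallel_step(w, samples, dp_degree, lr):
    """Run one DP update and return the parameter held by each replica."""
    if len(samples) % dp_degree != 0:
        raise ValueError("this sketch assumes equally sized shards")
    shard_size = len(samples) // dp_degree
    shards = [samples[i * shard_size:(i + 1) * shard_size] for i in range(dp_degree)]
    # Each device computes its gradient locally on its own shard.
    grads = [local_gradient(w, shard) for shard in shards]
    # The gradients are synchronized, so every replica applies the same update.
    synced = all_reduce_mean(grads)
    return [w - lr * g for g in synced]


samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
print(data_parallel_step(0.0, samples, dp_degree=2, lr=0.1))
```

Because the shards are equally sized, the averaged gradient equals the gradient over the whole batch, so all replicas hold identical parameters after the step. The synchronization itself is the kind of traffic that a DP communication operator stands for.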
[0112] Step 101 determines multiple computation operators in the operator stream. The following uses a combination of PP, TP, and DP strategies as an example to illustrate the determination of multiple communication operators in the operator stream.
[0113] For example, the training stages of the model to be trained must first be determined. A training stage, also known as a training phase, involves at least one GPU card executing computation operators. Specifically, based on the number of layers in the model and the pipeline parallelism degree (PP parallelism) in the parallel strategy, the computation operators corresponding to each training stage and the PP communication operators between training stages are determined. For instance, the number of training stages is calculated by dividing the number of model layers by the PP parallelism. Assuming the model has 96 layers and the PP parallelism is 6, the model is divided into 96 / 6 = 16 training stages. The computation operators corresponding to any training stage are the computation operators of the model layers that belong to that stage. For example, the first training stage corresponds to layers 1-6 of the model, the second training stage corresponds to layers 7-12, and so on, up to the 16th training stage, which corresponds to layers 91-96. Step 101 has already determined the computation operators of each model layer. Thus, the computation operators of the first training stage are those of layers 1-6, the computation operators of the second training stage are those of layers 7-12, and so on, and the computation operators of the 16th training stage are those of layers 91-96. After each training stage finishes executing its computation operators, the results are transmitted to the next training stage. For example, Figure 4 includes 16 training stages, and each rectangle represents the computation operators corresponding to that stage. After the first training stage finishes executing its computation operators, the results are transmitted to the second training stage. The communication operators between the first and second training stages are shown as circles between them.
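As an illustration only, the division into training stages can be sketched as follows. The sketch follows the arithmetic of the example above (number of training stages = number of model layers / PP parallelism) and assumes that the layers divide evenly. The function names are hypothetical.

```python
def partition_layers(num_layers, pp_parallelism):
    """Return the (first_layer, last_layer) range of each training stage, 1-indexed."""
    if num_layers % pp_parallelism != 0:
        raise ValueError("this sketch assumes the layers divide evenly")
    num_stages = num_layers // pp_parallelism
    layers_per_stage = num_layers // num_stages
    stages = []
    for index in range(num_stages):
        first = index * layers_per_stage + 1
        last = first + layers_per_stage - 1
        stages.append((first, last))
    return stages


def pp_communication_links(stages):
    """Return one (sender, receiver) pair per pair of adjacent training stages."""
    return [(i + 1, i + 2) for i in range(len(stages) - 1)]


stages = partition_layers(num_layers=96, pp_parallelism=6)
links = pp_communication_links(stages)
print(len(stages), stages[0], stages[1], stages[-1])
print(len(links), links[0])
```

For 96 layers and a PP parallelism of 6, the sketch yields 16 training stages, with the first stage covering layers 1-6 and the 16th covering layers 91-96, and 15 links between adjacent stages, each of which would carry a PP communication operator.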
[0114] For example, the computation operators of each GPU card in each training stage are divided based on the TP strategy and the DP strategy. Specifically, based on the tensor parallelism degree (TP parallelism) in the parallel strategy, the computation operators executed by each GPU card in the training stage and the TP communication operators between GPU cards are determined. Based on the data parallelism degree (DP parallelism) in the parallel strategy, the DP communication operators corresponding to the training stage are determined.
[0115] Take as an example the determination, based on TP parallelism, of the computation operators executed by each GPU card and the TP communication operators between GPU cards. Assuming the TP parallelism is 2, Figure 5 shows the computational tasks assigned to two GPU cards. With the dashed line in Figure 5 as the boundary, the part above the dashed line represents the computational tasks assigned to one GPU card (denoted GPU1), and the part below the dashed line represents the computational tasks assigned to the other GPU card (denoted GPU2). The computation result of GPU1 is Z1 in Figure 5, and the computation result of GPU2 is Z2 in Figure 5. At point g in Figure 5, the two computation results are aggregated using All-Reduce to obtain the final output Z. The transmission of Z1 and Z2 is the TP communication operator between GPU1 and GPU2. It can be understood that in the All-Reduce communication operator, for example, the two GPUs communicate through their network cards, exchange their computation results, and perform a reduction such as summation or averaging (not limited in this application); that is, Z is obtained from Z1 and Z2 through the communication operator. All-Reduce is a communication algorithm commonly used in parallel and distributed computing. It is mainly used to aggregate data across multiple computing nodes (e.g., GPUs or CPUs) to improve model training efficiency.
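As an illustration only, the aggregation of Z1 and Z2 into Z can be sketched for a single matrix multiplication with a TP parallelism of 2. The exact partitioning shown in Figure 5 is not reproduced here. The sketch assumes that the input is split by columns and the weight by rows, and that the reduction is a summation. All function names are hypothetical, and plain Python lists stand in for GPU tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def all_reduce_sum(partials):
    """Stand-in for All-Reduce with summation: add the partial results elementwise."""
    rows, cols = len(partials[0]), len(partials[0][0])
    return [[sum(p[r][c] for p in partials) for c in range(cols)] for r in range(rows)]


X = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
W = [[1, 0, 2],
     [0, 1, 1],
     [3, 1, 0],
     [2, 2, 1]]

# Assumed split: GPU1 holds the first two columns of X and the first two rows of W,
# GPU2 holds the remaining columns of X and rows of W.
Z1 = matmul([row[:2] for row in X], W[:2])
Z2 = matmul([row[2:] for row in X], W[2:])

# The TP communication operator combines the two partial results into Z.
Z = all_reduce_sum([Z1, Z2])
print(Z == matmul(X, W))
```

Under this split, each GPU performs half of the multiply-add work, and the only data exchanged is the pair of partial outputs, whose size gives the communication volume of the TP communication operator.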
[0116] Figure 6 shows the computation operators corresponding to each training stage and the communication operators between each training stage. The computation operators executed by each GPU card and the communication operators between GPU cards can be referred to in Figure 6. The rectangles in Figure 6 are the computation operators executed by the GPU cards, and the circles in Figure 6 are the communication operators between GPU cards.
[0117] Step 103: Based on the hardware configuration of the cluster running the model to be trained and the amount of sample data of the model to be trained, the computation time of each computation operator in the operator stream and the communication time of each communication operator in each training round are determined by simulation, so as to obtain the training time of the model to be trained.
[0118] For example, before determining the computation time of each computation operator and the communication time of each communication operator in each training round through simulation, it is necessary to first determine the allocation of GPU cards in the cluster during the model training process based on the parallel strategy of the model to be trained and the hardware configuration of the cluster running the model to be trained; and to determine the training rounds of the model training based on the parallel strategy of the model to be trained and the amount of sample data of the model to be trained; wherein, the hardware configuration of the cluster running the model to be trained includes the number of GPU cards, the hardware computing power of the GPU cards, the memory bandwidth of the GPU cards, the layer 2 bandwidth of the GPU cards, etc.
[0119] For example, for any computational operator in the operator stream, this application determines the computation time of the computational operator based on the computational amount of the computational operator and the hardware configuration of the GPU card executing the computational operator. Furthermore, this application classifies computational operators into computationally intensive operators and memory-intensive operators according to computational intensity, and designs different methods for determining the computation time of these two types of computational operators:
[0120] ① If the computation operator is a computationally intensive operator, then the larger of the operator's memory access time and its raw computation time is taken as the computation time of the computation operator, that is, computation time of the operator = max(memory access time, raw computation time). The memory access time is determined based on the amount of data to be computed by the computation operator and the Layer 2 bandwidth of the GPU card executing the computation operator. The Layer 2 bandwidth of the GPU card can also be called the L2 bandwidth. For example, memory access time = amount of data to be computed / L2 bandwidth. The raw computation time is determined based on the amount of data to be computed by the computation operator and the hardware computing power of the GPU card executing the computation operator. For example, raw computation time = amount of data to be computed / hardware computing power of the GPU.
[0121] ② If the computation operator is a memory-intensive operator, then the memory access time of the computation operator is determined as the computation time of the computation operator.
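As an illustration only, the two cases above can be sketched as a single function. The sketch assumes that the quantity divided by the computing power (`compute_amount`) and the quantity divided by the L2 bandwidth (`data_amount`) are supplied separately and in units consistent with the hardware figures. The parameter names are hypothetical, and how an operator is classified as computationally intensive is left to the caller.

```python
def operator_computation_time(compute_amount, data_amount,
                              computing_power, l2_bandwidth,
                              compute_intensive):
    """Estimate the computation time of one computation operator.

    compute_amount / computing_power gives the raw computation time and
    data_amount / l2_bandwidth gives the memory access time.
    """
    memory_access_time = data_amount / l2_bandwidth
    if not compute_intensive:
        # Memory-intensive operator: the memory access time dominates.
        return memory_access_time
    raw_computation_time = compute_amount / computing_power
    # Computationally intensive operator: take the larger of the two times.
    return max(memory_access_time, raw_computation_time)


# Hypothetical numbers chosen only to make the arithmetic easy to follow.
print(operator_computation_time(200.0, 50.0, 100.0, 100.0, compute_intensive=True))
print(operator_computation_time(200.0, 50.0, 100.0, 100.0, compute_intensive=False))
```

Taking the maximum rather than the sum reflects the idea that, for a computationally intensive operator, memory access and arithmetic can be regarded as overlapping, so the slower of the two bounds the operator's time.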
[0122] For example, for any communication operator in the operator stream, this application determines the communication time of the communication operator based on the transmission bandwidth between the communication operator and the communication object and the communication volume of the communication operator. Furthermore, this application also considers whether the communication objects are located on the same server.
[0123] ① If the communication objects are located on the same server, the effective bandwidth corresponding to the communication operator is first determined based on the trend graph between communication volume and effective bandwidth; the communication time of the communication operator is then determined based on the communication volume and the effective bandwidth corresponding to the communication operator. For example, communication time = communication volume / effective bandwidth. The trend graph between communication volume and effective bandwidth is shown in Figure 7. Different communication volumes correspond to different effective bandwidths. Once the communication volume has increased to a certain value, the effective bandwidth remains essentially stable and no longer increases. Determining the effective bandwidth from the communication volume indicated by the communication operator improves the accuracy of the effective bandwidth, and thereby the accuracy of the communication time.
[0124] ② If the communication objects are located on different servers, the communication time of the communication operator is determined based on the communication volume of the communication operator, the inter-server communication bandwidth, and the transmission delay of the switch. Since the transmission of the communication operator usually needs to rely on the switch when the communication objects are located on different servers, the transmission delay of the switch should also be taken into account when calculating the communication time. For example, communication time = communication volume / inter-server communication bandwidth + transmission delay of the switch.
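As an illustration only, the two communication cases can be sketched as follows. The curve points are invented placeholders rather than values read from Figure 7, and linear interpolation between points is an assumption; this application only states that the effective bandwidth is determined from the trend graph. All names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical (communication volume, effective bandwidth) samples.
# The last two points share one bandwidth to mimic the plateau at large volumes.
BANDWIDTH_CURVE = [(0.0, 10.0), (100.0, 110.0), (1000.0, 110.0)]


def effective_bandwidth(volume, curve):
    """Look up the effective bandwidth by linear interpolation on the curve."""
    volumes = [v for v, _ in curve]
    if volume <= volumes[0]:
        return curve[0][1]
    if volume >= volumes[-1]:
        return curve[-1][1]
    i = bisect_right(volumes, volume)
    (v0, b0), (v1, b1) = curve[i - 1], curve[i]
    return b0 + (b1 - b0) * (volume - v0) / (v1 - v0)


def communication_time(volume, same_server, curve=None,
                       inter_server_bandwidth=None, switch_delay=0.0):
    """Estimate the communication time of one communication operator."""
    if same_server:
        # Same server: communication volume / effective bandwidth.
        return volume / effective_bandwidth(volume, curve)
    # Different servers: volume / inter-server bandwidth + switch transmission delay.
    return volume / inter_server_bandwidth + switch_delay


print(communication_time(50.0, same_server=True, curve=BANDWIDTH_CURVE))
print(communication_time(100.0, same_server=False,
                         inter_server_bandwidth=50.0, switch_delay=0.5))
```

A table lookup of this kind lets small transfers be charged a lower effective bandwidth than large ones, which is the effect the trend graph is meant to capture.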
[0125] Further exemplarily, based on the methods for determining computation time and communication time described above, the computation time of each computation operator and the communication time of each communication operator in the operator stream during the forward phase of each training round are determined by simulation to obtain the forward phase time; the computation time of each computation operator and the communication time of each communication operator in the operator stream during the backward phase of each training round are determined by simulation to obtain the backward phase time; and the computation time of each computation operator and the communication time of each communication operator in the operator stream during the recomputation phase of each training round are determined by simulation to obtain the recomputation phase time.
[0126] The forward propagation phase, also known as the forward pass phase, and the backpropagation phase, also known as the backward pass phase or the backward propagation phase, are the two core processes of model training. The forward propagation phase refers to the process in which data is computed and passed layer by layer through the model, starting from the input layer, until the output result of the model is obtained. During forward propagation, the input data undergoes a linear transformation through the weight matrix or bias matrix of each layer and then a nonlinear transformation through the activation function before being output to the next layer, until the output layer is reached. The forward propagation phase converts the input data into the output result, realizing the classification and prediction of the data.
[0127] The backpropagation phase refers to calculating the gradient backwards based on the difference between the model's predictions and the true labels using the chain rule. This gradient is then propagated from the output layer back to every layer of the network to update the model's parameters. During backpropagation, the error in the output layer is first calculated, then propagated from the output layer to the hidden layers, then to shallower hidden layers, and finally to the input layer. Through backpropagation, gradient information about each parameter with respect to the loss function can be obtained, thereby optimizing and updating the parameters. The forward and backpropagation phases work together, iterating multiple times until the model's performance reaches the expected level.
[0128] Recomputation refers to performing two forward propagations in each training round. Although it increases the time spent in each training round, it significantly reduces the GPU memory usage. In layman's terms, recomputation is a technique that trades time for space.
[0129] After determining the forward phase time, this application does not simply reuse the forward phase time as the backward phase time and the recomputation phase time. Instead, it determines the backward phase time and the recomputation phase time separately, thereby improving the accuracy of both.
[0130] For example, the time consumption of each training round is determined based on the time consumption of the forward phase, the time consumption of the backward phase, the time consumption of the recomputation phase, and the idle waiting time in each training round; for example, the time consumption of the forward phase, the time consumption of the backward phase, the time consumption of the recomputation phase, and the idle waiting time are summed, and the sum is used as the time consumption of each training round.
[0131] For example, the training duration of the model to be trained is finally determined based on the time taken for each training round and the number of training rounds; for example, assuming that the time taken for each training round is t and the number of training rounds is n, then t×n can be used as the training duration of the model to be trained.
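As an illustration only, the final aggregation can be sketched as follows. The summation of the four per-round components and the product t×n follow the description above. Obtaining a phase time by simply adding up operator times is an assumption of this sketch, since this application does not state how operator times combine within a phase. All names and numbers are hypothetical.

```python
def phase_time(computation_times, communication_times):
    """Assumed combination: operators run one after another with no overlap."""
    return sum(computation_times) + sum(communication_times)


def round_time(forward, backward, recomputation, idle_wait):
    """Time of one training round: the sum of the four components."""
    return forward + backward + recomputation + idle_wait


def training_duration(time_per_round, num_rounds):
    """Training duration: t x n."""
    return time_per_round * num_rounds


# Hypothetical per-operator times for each phase of one training round.
forward = phase_time([0.5, 0.25], [0.25])        # 1.0
backward = phase_time([1.0, 0.5], [0.5])         # 2.0
recomputation = phase_time([0.5, 0.25], [0.25])  # 1.0

t = round_time(forward, backward, recomputation, idle_wait=0.5)
print(t, training_duration(t, num_rounds=1000))
```

Keeping the idle waiting time as a separate term makes it possible to account for periods in which a GPU card waits on other training stages without altering the per-operator estimates.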
[0132] Furthermore, the server executing the model training time estimation method proposed in this application includes computational modeling, communication modeling, and process modeling modules. The interaction between these modules is shown in Figure 8. The computational modeling module executes step 101, the communication modeling module executes step 102, and the process modeling module executes step 103. The process modeling module further includes a computational simulation module and a communication simulation module. The computational simulation module determines the computation time of the computational operator through simulation, and the communication simulation module determines the communication time of the communication operator through simulation. More detailed descriptions of the various modules in the server can be directly obtained from the relevant descriptions in the method embodiment shown in Figure 1, and will not be elaborated upon here.
[0133] Figures 9 and 10 are schematic diagrams of possible model training time prediction devices provided in embodiments of this application. These model training time prediction devices can be used to implement the server functions in the above method embodiments, and thus can also achieve the beneficial effects of the above method embodiments.
[0134] As shown in Figure 9, the model training time prediction device 900 includes a determination unit 910 and a simulation unit 920. When the model training time prediction device 900 is used to implement the server function in the method embodiment shown in Figure 1 above:
[0135] The determining unit 910 is used to determine the operator stream of the model to be trained based on the model architecture of the model to be trained. The operator stream includes multiple computation operators. Each computation operator has a computation amount indicated by its own parameter configuration.
[0136] The determining unit 910 is further configured to determine multiple communication operators located in the operator stream based on the parallel strategy of the model to be trained; each communication operator has a communication volume indicated by its own parameter configuration.
[0137] The simulation unit 920 is used to determine the computation time of each computation operator and the communication time of each communication operator in the operator stream in each training round by means of simulation based on the hardware configuration of the cluster running the model to be trained and the amount of sample data of the model to be trained, thereby obtaining the training time of the model to be trained.
[0138] In one possible design, the determining unit 910 is configured to, when determining multiple communication operators located in the operator stream based on the parallel strategy of the model to be trained: determine the computation operators corresponding to multiple training stages and the PP communication operators between training stages based on the number of model layers of the model to be trained and the pipeline parallelism (PP) degree in the parallel strategy; the computation operator corresponding to any training stage is the computation operator of the model layer to which it belongs; and / or, determine the DP communication operator corresponding to the training stage based on the data parallelism (DP) degree in the parallel strategy; and / or, determine the computation operators executed by each GPU card in the training stage and the TP communication operators between GPU cards based on the tensor parallelism (TP) degree in the parallel strategy.
[0139] In one possible design, before determining the computation time of each computation operator and the communication time of each communication operator in the operator stream in each training round through simulation, the determining unit 910 is further configured to determine the allocation of GPU cards in the cluster during the model training process based on the parallel strategy of the model to be trained and the hardware configuration of the cluster running the model to be trained; and to determine the training rounds of the model training based on the parallel strategy of the model to be trained and the amount of sample data of the model to be trained.
[0140] In one possible design, when determining through simulation the computation time of each computation operator and the communication time of each communication operator in the operator stream in each training round, the simulation unit 920 is configured to: for any computation operator in the operator stream, determine the computation time of the computation operator according to the computational amount of the computation operator and the hardware configuration of the GPU card executing the computation operator; and for any communication operator in the operator stream, determine the communication time of the communication operator according to the transmission bandwidth between the communication operator and the communication object and the communication volume of the communication operator.
[0141] In one possible design, when determining the computation time of the computation operator based on the computational amount of the computation operator and the hardware configuration of the GPU card executing the computation operator, the simulation unit 920 is configured to: if the computation operator is a computationally intensive operator, take the maximum of the memory access time and the computation time of the computation operator as the computation time of the computation operator, wherein the memory access time is determined based on the amount of computational data of the computation operator and the Layer 2 bandwidth of the GPU card executing the computation operator, and the computation time is determined based on the amount of computational data of the computation operator and the hardware computing power of the GPU card executing the computation operator; and if the computation operator is a memory-intensive operator, determine the memory access time of the computation operator as the computation time of the computation operator.
[0142] In one possible design, when determining the communication time of the communication operator based on the transmission bandwidth between the communication operator and the communication object and the communication volume of the communication operator, the simulation unit 920 is configured to: if the communication objects are located on the same server, determine the effective bandwidth corresponding to the communication operator based on the trend graph between communication volume and effective bandwidth; and determine the communication time of the communication operator based on the communication volume of the communication operator and the effective bandwidth corresponding to the communication operator.
[0143] In one possible design, when determining the communication time of the communication operator based on the transmission bandwidth between the communication operator and the communication object and the communication volume of the communication operator, the simulation unit 920 is configured to: if the communication objects are located on different servers, determine the communication time of the communication operator based on the communication volume of the communication operator, the inter-server communication bandwidth, and the transmission delay of the switch.
[0144] In one possible design, when determining through simulation the computation time of each computation operator and the communication time of each communication operator in the operator stream in each training round and obtaining the training duration of the model to be trained, the simulation unit 920 is configured to: determine through simulation the computation time of each computation operator and the communication time of each communication operator in the operator stream in the forward phase of each training round, to obtain the forward phase time; determine through simulation the computation time of each computation operator and the communication time of each communication operator in the operator stream in the backward phase of each training round, to obtain the backward phase time; determine through simulation the computation time of each computation operator and the communication time of each communication operator in the operator stream in the recomputation phase of each training round, to obtain the recomputation phase time; determine the duration of each training round based on the forward phase time, the backward phase time, the recomputation phase time, and the idle waiting time in each training round; and determine the training duration of the model to be trained based on the duration of each training round and the number of training rounds.
[0145] A more detailed description of the determination unit 910 and the simulation unit 920 can be obtained directly from the relevant description in the method embodiment shown in Figure 1, and will not be repeated here.
[0146] As shown in Figure 10, the model training time prediction device 1000 includes a processor 1010 and an interface circuit 1020. The processor 1010 and the interface circuit 1020 are coupled to each other. It is understood that the interface circuit 1020 can be a transceiver or an input / output interface. Optionally, the model training time prediction device 1000 may also include a memory 1030 for storing instructions executed by the processor 1010, or storing input data required by the processor 1010 to run instructions, or storing data generated after the processor 1010 runs instructions.
[0147] When the model training time prediction device 1000 is used to implement the method shown in Figure 1, the processor 1010 is used to implement the function of the determination unit 910, and the interface circuit 1020 is used to implement the function of the simulation unit 920.
[0148] The unit division in this embodiment is illustrative and represents only one logical functional division. In actual implementation, other division methods may be used. Furthermore, the functional units in the various embodiments of this application can be integrated into a single processor, exist as separate physical units, or be integrated into a single unit. The integrated units described above can be implemented in hardware or as software functional units.
[0149] Although preferred embodiments of this application have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of this application.
[0150] Obviously, those skilled in the art can make various modifications and variations to this application without departing from the scope of this application. Therefore, if such modifications and variations fall within the scope of the claims of this application and their equivalents, this application also intends to include such modifications and variations.
Claims
1. A method for predicting model training time, comprising: determining, based on the model architecture and model parameters of a model to be trained, multiple computation operators of the operator stream of the model to be trained; determining, based on the parallel strategy of the model to be trained, multiple communication operators located in the operator stream; and determining, by simulation, based on the multiple computation operators, the multiple communication operators, the hardware configuration of the cluster running the model to be trained, and the amount of sample data of the model to be trained, the computation time of each computation operator and the communication time of each communication operator in the operator stream of each training round, thereby obtaining the training time of the model to be trained.
2. The method as described in claim 1, wherein the model parameters include at least one of the following: vocabulary length, decoder sequence length, hidden layer size, feedforward network (FFN) hidden layer size, or number of attention heads in multi-head attention.
3. The method as described in claim 1 or 2, wherein the parallel strategy of the model to be trained includes at least one of the following: pipeline parallelism (PP), tensor parallelism (TP), data parallelism (DP), sequence parallelism, context parallelism, or mixture-of-experts (MoE) parallelism.
4. The method according to any one of claims 1-3, wherein each of the multiple computation operators has a computation amount indicated by its own parameter configuration.
5. The method according to any one of claims 1-4, wherein each of the multiple communication operators has a communication volume indicated by its own parameter configuration.
6. The method according to any one of claims 1-5, wherein determining, based on the parallel strategy of the model to be trained, multiple communication operators located in the operator stream includes: determining, based on the number of model layers in the model to be trained and the pipeline parallelism (PP) degree in the parallel strategy, the computation operators corresponding to each of multiple training stages and the PP communication operators between every two training stages; and/or determining, based on the data parallelism (DP) degree in the parallel strategy, the DP communication operators corresponding to the training stage; and/or determining, based on the tensor parallelism (TP) degree in the parallel strategy, the computation operators executed by each GPU card in the training stage and the TP communication operators between GPU cards.
7. The method according to any one of claims 1-6, wherein, before determining by simulation the computation time of each computation operator and the communication time of each communication operator in the operator stream of each training round, the method further includes: determining, based on the parallel strategy of the model to be trained and the hardware configuration of the cluster running the model to be trained, the allocation of GPU cards in the cluster during model training; and determining, based on the parallel strategy of the model to be trained and the amount of sample data of the model to be trained, the number of training rounds of the model training.
8. The method according to any one of claims 1-7, wherein determining by simulation the computation time of each computation operator and the communication time of each communication operator in the operator stream of each training round includes: determining the computation time of a first computation operator based on the computation amount of the first computation operator and the hardware configuration of the GPU card executing the first computation operator, wherein the operator stream in each training round includes the first computation operator.
9. The method of claim 8, wherein determining the computation time of the first computation operator based on the computation amount of the first computation operator and the hardware configuration of the GPU card executing the first computation operator includes: if the first computation operator is a compute-intensive operator, taking the maximum of the memory access time and the computation time of the first computation operator as the computation time of the first computation operator, wherein the memory access time is determined based on the amount of computational data of the first computation operator and the Layer 2 bandwidth of the GPU card executing the first computation operator, and the computation time is determined based on the amount of computational data of the first computation operator and the hardware computing power of the GPU card executing the first computation operator; and if the first computation operator is a memory-intensive operator, taking the memory access time of the first computation operator as the computation time of the first computation operator.
10. The method according to any one of claims 1-9, wherein determining by simulation the computation time of each computation operator and the communication time of each communication operator in the operator stream of each training round includes: determining the communication time of a first communication operator based on the transmission bandwidth between the first communication operator and its communication object and the communication volume of the first communication operator, wherein the multiple communication operators include the first communication operator.
11. The method of claim 10, wherein determining the communication time of the first communication operator based on the transmission bandwidth between the first communication operator and the communication object and the communication volume of the first communication operator includes: if the communication objects are located on the same server, determining the effective bandwidth corresponding to the first communication operator based on a trend graph relating communication volume to effective bandwidth; and determining the communication time of the first communication operator based on the communication volume of the first communication operator and the effective bandwidth corresponding to the first communication operator.
12. The method of claim 10, wherein determining the communication time of the first communication operator based on the transmission bandwidth between the first communication operator and the communication object and the communication volume of the first communication operator includes: if the communication objects are located on different servers, determining the communication time of the first communication operator based on the communication volume of the first communication operator, the inter-server communication bandwidth, and the transmission delay of the switch.
13. The method according to any one of claims 1-12, wherein determining by simulation the computation time of each computation operator and the communication time of each communication operator in the operator stream of each training round, thereby obtaining the training duration of the model to be trained, includes: determining by simulation the computation time of each computation operator and the communication time of each communication operator in the forward phase of the operator stream of each training round, to obtain the forward phase time; determining by simulation the computation time of each computation operator and the communication time of each communication operator in the backward phase of the operator stream of each training round, to obtain the backward phase time; determining by simulation the computation time of each computation operator and the communication time of each communication operator in the recomputation phase of the operator stream of each training round, to obtain the recomputation phase time; determining the duration of each training round based on the forward phase time, the backward phase time, the recomputation phase time, and the idle waiting time in that training round; and determining the training duration of the model to be trained based on the duration of each training round and the number of training rounds.
14. A model training duration prediction device, comprising: a determining unit, used to determine multiple computation operators of the operator stream of a model to be trained based on the model architecture and model parameters of the model to be trained, the determining unit being further configured to determine multiple communication operators located in the operator stream based on the parallel strategy of the model to be trained; and a simulation unit, used to determine, by simulation, the computation time of each computation operator and the communication time of each communication operator in the operator stream of each training round, based on the multiple computation operators, the multiple communication operators, the hardware configuration of the cluster running the model to be trained, and the amount of sample data of the model to be trained, thereby obtaining the training time of the model to be trained.
15. A model training duration prediction device, comprising: a processor, and a memory communicatively connected to the processor; wherein the memory stores computer-executable instructions; and the processor executes the computer-executable instructions stored in the memory to implement the method as described in any one of claims 1-13.
16. A computer-readable storage medium, wherein the computer-readable storage medium stores computer-executable instructions which, when executed by a processor, are used to implement the method as described in any one of claims 1-13.
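For illustration only, and without limiting the claims, the per-operator timing rules recited in claims 9, 11 and 12 can be sketched in Python as follows. The sketch rests on several assumptions that are not stated in the claims: the memory access time is taken as bytes moved divided by a memory bandwidth (standing in for the Layer 2 bandwidth of the GPU card), the computation time is taken as floating-point operations divided by peak throughput, the trend graph is represented as a sorted list of measured points with linear interpolation between them, and the switch transmission delay is added to the inter-server transfer time. All function and parameter names are hypothetical.

```python
from bisect import bisect_right


def compute_time_s(flops, bytes_moved, peak_flops_per_s,
                   mem_bandwidth_bytes_per_s, compute_intensive):
    """Claim 9 sketch: time of one computation operator on one GPU card."""
    memory_access_time = bytes_moved / mem_bandwidth_bytes_per_s
    if not compute_intensive:
        # Memory-intensive operator: the memory access time is the result.
        return memory_access_time
    arithmetic_time = flops / peak_flops_per_s
    # Compute-intensive operator: the slower of the two bounds the operator.
    return max(memory_access_time, arithmetic_time)


def effective_bandwidth(volume_bytes, curve):
    """Claim 11 sketch: read effective bandwidth off a volume/bandwidth curve.

    `curve` is a list of (volume_bytes, bandwidth_bytes_per_s) points sorted
    by volume. Linear interpolation and clamping at both ends are assumptions.
    """
    volumes = [v for v, _ in curve]
    if volume_bytes <= volumes[0]:
        return curve[0][1]
    if volume_bytes >= volumes[-1]:
        return curve[-1][1]
    i = bisect_right(volumes, volume_bytes)
    (v0, b0), (v1, b1) = curve[i - 1], curve[i]
    return b0 + (b1 - b0) * (volume_bytes - v0) / (v1 - v0)


def communication_time_s(volume_bytes, same_server, intra_server_curve,
                         inter_server_bandwidth_bytes_per_s, switch_delay_s):
    """Claims 11 and 12 sketch: time of one communication operator."""
    if same_server:
        bandwidth = effective_bandwidth(volume_bytes, intra_server_curve)
        return volume_bytes / bandwidth
    # Assumption: the switch delay is added once to the transfer time.
    return volume_bytes / inter_server_bandwidth_bytes_per_s + switch_delay_s


if __name__ == "__main__":
    curve = [(0.0, 10.0), (100.0, 110.0)]  # made-up illustrative points
    print(compute_time_s(8.0, 4.0, 2.0, 4.0, compute_intensive=True))
    print(communication_time_s(40.0, True, curve, 50.0, 0.5))
    print(communication_time_s(100.0, False, curve, 50.0, 0.5))
```

The curve points and numeric arguments in the usage lines are placeholders chosen for easy arithmetic, not measurements from any real GPU or network.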