Gradient acceleration method and device in model training, electronic equipment and storage medium

By asynchronously transmitting activation information during gradient backpropagation, the problem of reduced GPU computing speed caused by activation checkpointing technology is solved, thereby improving model training speed and GPU memory utilization efficiency.

CN122198017A · Pending · Publication Date: 2026-06-12 · TENCENT TECHNOLOGY (SHENZHEN) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
TENCENT TECHNOLOGY (SHENZHEN) CO LTD
Filing Date
2024-12-12
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

In existing technologies, while activation checkpointing reduces GPU memory requirements, it also slows down GPU gradient computation and model training.

Method used

During gradient backpropagation, activation information is transferred from memory to GPU memory via an asynchronous transfer thread, enabling asynchronous parallel processing of gradient calculation and activation information, thereby improving the gradient calculation speed of each network layer.

Benefits of technology

It effectively improves model training speed, flexibly releases GPU memory capacity, avoids the problem of time consumption in activation information transmission, and keeps GPU memory usage flexible and controllable.

Abstract

The application relates to a gradient acceleration method and device in model training, electronic equipment and a storage medium. The method comprises the following steps: in the gradient back propagation process of the model training stage of a to-be-trained model, in response to gradient calculation of a first network layer in the to-be-trained model, an asynchronous transmission thread is called to perform transmission of second activation information from a memory to a video memory; the second activation information is activation information of a second network layer cached in the memory in a forward prediction process of the model training stage; third gradient information corresponding to a third network layer is obtained; under the condition that transmission of first activation information corresponding to the first network layer from the memory to the video memory is completed, a main thread is called to determine first gradient information corresponding to the first network layer according to the first activation information and the third gradient information; the first activation information corresponding to the first network layer is activation information of the first network layer cached in the memory in the forward prediction process. The technical scheme of the application can improve the gradient calculation speed in model training.

Description

Technical Field

[0001] This application relates to the field of model training, and more particularly to a gradient acceleration method, apparatus, electronic device, and storage medium in model training.

Background Technology

[0002] With the widespread use of neural network models, model training has become a major focus, especially the training of large models. Large models typically involve hundreds of billions or even more parameters, and training these large-scale models requires a huge amount of GPU memory. A significant portion of this GPU memory is used to store intermediate output activation information. This activation information is needed in the gradient calculation during the gradient backpropagation process, and therefore needs to be temporarily stored, resulting in a large additional GPU memory usage.

[0003] In related technologies, to reduce the pressure on GPU memory, activation checkpointing is used to transfer activation information that is not currently needed from GPU memory to main memory. When the activation information is needed, it is transferred back from main memory to GPU memory. This reduces the amount of GPU memory occupied by activation information, thus lowering the demand on GPU memory. While activation checkpointing saves GPU memory, it introduces additional data transfer between the CPU's main memory and the GPU's memory. During this transfer, the GPU is forced to pause gradient calculation and wait for the activation information to arrive, which reduces the speed of GPU gradient calculation and slows down model training.

Summary of the Invention

[0004] This application provides a gradient acceleration method, apparatus, electronic device, and storage medium for model training, which can at least improve the gradient calculation speed of each network layer and the overall gradient processing efficiency during model training. The technical solution of this application is as follows:

[0005] According to a first aspect of the embodiments of this application, a gradient acceleration method in model training is provided, comprising:

[0006] During the gradient backpropagation process in the model training phase of the model to be trained, in response to the gradient calculation of the first network layer in the model to be trained, an asynchronous transmission thread is invoked to execute the transmission of the second activation information corresponding to the second network layer from memory to GPU memory; the second activation information corresponding to the second network layer is the activation information cached in memory by the second network layer during the forward prediction process in the model training phase; according to the prediction order in the forward prediction process, the second network layer is located before the first network layer in the model to be trained.

[0007] Obtain the third gradient information corresponding to the third network layer; according to the prediction order in the forward prediction process, the third network layer is located after the first network layer in the model to be trained;

[0008] When the transfer of the first activation information corresponding to the first network layer from the memory to the video memory is completed, the main thread is invoked to determine the first gradient information corresponding to the first network layer based on the first activation information and the third gradient information; the first activation information corresponding to the first network layer is the activation information of the first network layer cached in the memory during the forward prediction process.

[0009] According to a second aspect of the embodiments of this application, a gradient acceleration device for model training is provided, comprising:

[0010] An asynchronous transmission module is used, during the gradient backpropagation process in the model training phase of the model to be trained, in response to the gradient calculation of the first network layer in the model to be trained, to call an asynchronous transmission thread to execute the transmission of the second activation information corresponding to the second network layer from memory to GPU memory; the second activation information corresponding to the second network layer is the activation information of the second network layer cached in memory during the forward prediction process of the model training phase; according to the prediction order in the forward prediction process, the second network layer is located before the first network layer in the model to be trained;

[0011] The acquisition module is used to acquire the third gradient information corresponding to the third network layer; according to the prediction order in the forward prediction process, the third network layer is located after the first network layer in the model to be trained.

[0012] The gradient processing module is used to call the main thread to determine the first gradient information corresponding to the first network layer based on the first activation information and the third gradient information after the transfer of the first activation information corresponding to the first network layer from the memory to the video memory is completed; the first activation information corresponding to the first network layer is the activation information of the first network layer cached in the memory during the forward prediction process.

[0013] According to a third aspect of the embodiments of this application, an electronic device is provided, comprising: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the method as described in any one of the first aspects above.

[0014] According to a fourth aspect of the present application, a computer-readable storage medium is provided, wherein when the instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform any of the methods described in the first aspect of the present application.

[0015] According to a fifth aspect of the embodiments of this application, a computer program product is provided, including computer instructions that, when executed by a processor, cause a computer to perform the method described in any one of the first aspects of the embodiments of this application.

[0016] The technical solutions provided by the embodiments of this application have at least the following beneficial effects:

[0017] During the gradient backpropagation process in the model training phase of the model to be trained, in response to the gradient calculation of the first network layer in the model to be trained, the asynchronous transmission thread is called to execute the transmission of the second activation information corresponding to the second network layer from memory to GPU memory. This allows the gradient calculation of the first network layer and the asynchronous transmission of the activation information of the previous network layer (the second network layer) to be executed in parallel, which greatly reduces the time spent waiting for the activation information of the previous network layer, improves the gradient calculation speed of each network layer in the model to be trained, and thus improves the model training speed. GPU memory capacity can still be released flexibly, while the transfer-time cost that activation checkpointing introduces is largely avoided. GPU memory usage therefore remains flexible and controllable while model training becomes faster.
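The time saving described above can be illustrated with a simple timing model. The following sketch is not taken from the application: the per-layer costs are hypothetical, the function names are illustrative, and it assumes that the transfer for the previous layer overlaps fully with the gradient calculation of the current layer.

```python
def serial_time(transfer, grad):
    """Backward time when every layer first waits for its own transfer."""
    return sum(t + g for t, g in zip(transfer, grad))


def overlapped_time(transfer, grad):
    """Backward time when the next transfer runs during the current gradient.

    Both lists are in backward order (last network layer first). Only the
    first transfer is fully exposed; each later transfer is hidden unless it
    takes longer than the gradient calculation it overlaps with.
    """
    total = transfer[0]
    for i, g in enumerate(grad):
        next_transfer = transfer[i + 1] if i + 1 < len(transfer) else 0.0
        total += max(g, next_transfer)
    return total


# Hypothetical per-layer costs in milliseconds, in backward order.
transfer_ms = [4.0, 4.0, 4.0, 4.0]
grad_ms = [10.0, 10.0, 10.0, 10.0]

print(serial_time(transfer_ms, grad_ms))      # 56.0
print(overlapped_time(transfer_ms, grad_ms))  # 44.0
```

Under these assumed numbers, only the first transfer remains on the critical path. If a transfer takes longer than the gradient calculation it overlaps with, the model shows the remaining wait through the `max` term.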

[0018] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit this application.

Attached Figure Description

[0019] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application, and do not constitute an undue limitation of this application.

[0020] Figure 1 is a schematic diagram illustrating an application environment according to an exemplary embodiment.

[0021] Figure 2 is a flowchart illustrating a gradient acceleration method in model training according to an exemplary embodiment.

[0022] Figure 3 is a schematic diagram illustrating the forward prediction process and gradient backpropagation process in model training according to an exemplary embodiment.

[0023] Figure 4 is a schematic diagram illustrating the timing comparison of gradient calculations for multiple network layers during gradient backpropagation, according to an exemplary embodiment.

[0024] Figure 5 is a schematic diagram illustrating a method for training a model according to an exemplary embodiment.

[0025] Figure 6 is a block diagram illustrating a gradient acceleration device in model training according to an exemplary embodiment.

[0026] Figure 7 is a block diagram illustrating an electronic device for gradient acceleration in model training, according to an exemplary embodiment.

Detailed Implementation

[0027] Various exemplary embodiments, features, and aspects of this application will now be described in detail with reference to the accompanying drawings. The same reference numerals in the drawings denote elements that have the same or similar functions. Although various aspects of the embodiments are shown in the drawings, they are not necessarily drawn to scale unless specifically indicated otherwise.

[0028] The term “exemplary” as used herein means “serving as an example, embodiment, or illustration.” Any embodiment illustrated herein as “exemplary” is not necessarily to be construed as superior to or better than other embodiments.

[0029] In this application embodiment, the terms "module" or "unit" refer to a computer program or part of a computer program that has a predetermined function and works with other related parts to achieve a predetermined goal, and can be implemented wholly or partially using software, hardware (such as processing circuitry or memory), or a combination thereof. Similarly, a processor (or multiple processors or memory) can be used to implement one or more modules or units. Furthermore, each module or unit can be part of an overall module or unit that includes the functionality of that module or unit.

[0030] Furthermore, to better illustrate this application, numerous specific details are provided in the following detailed embodiments. Those skilled in the art should understand that this application can be implemented without certain specific details. In some instances, methods, means, components, and circuits well-known to those skilled in the art have not been described in detail in order to highlight the main points of this application.

[0031] Artificial intelligence (AI) is the theory, methods, technology, and application systems that use digital computers or computer-controlled machines to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to achieve optimal results. AI software technology mainly includes computer vision, speech processing, natural language processing, and machine learning / deep learning.

[0032] In recent years, with the research and progress of artificial intelligence technology, artificial intelligence technology has been widely used in many fields. The solutions provided in the embodiments of this application involve technologies such as machine learning / deep learning, which are specifically illustrated through the following embodiments.

[0033] Referring to Figure 1, Figure 1 is a schematic diagram of an application system according to an embodiment of this application. The application system can be used for the gradient acceleration method in model training according to this application. As shown in Figure 1, the application system may include at least a server 01 and a terminal 02.

[0034] In this embodiment, the server 01 can be used to train the model to be trained, which can include gradient acceleration processing during model training. Specifically, the GPU in the server 01 can perform the model training processing, including the gradient acceleration processing. For example, the model to be trained can include a large language model, and this application does not limit this. The server 01 can include an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms.

[0035] In this embodiment, the terminal 02 can be used to trigger the start of the model training process, for example, the terminal 02 can instruct the training process of the model to be trained. The terminal 02 may include physical devices such as smartphones, desktop computers, tablets, laptops, smart speakers, digital assistants, augmented reality (AR) / virtual reality (VR) devices, and smart wearable devices. The physical device may also include software running on the physical device, such as applications. In this embodiment, the operating system running on the terminal 02 may include, but is not limited to, Android, iOS, Linux, and Windows.

[0036] In addition, it should be noted that Figure 1 shows merely one application environment of the gradient acceleration method in model training provided in this application.

[0037] In the embodiments described in this specification, the terminal 02 and the server 01 can be directly or indirectly connected through wired or wireless communication, and this application does not limit this connection.

[0038] It should be noted that in the specific implementation of this application, user-related data is involved. When the following embodiments of this application are applied to specific products or technologies, user permission or consent is required, and the collection, use and processing of related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.

[0039] Before introducing the method embodiments provided in this application, a brief introduction will be given on the application scenarios, related terms or nouns that may be involved in the method embodiments of this application, so as to facilitate the understanding of those skilled in the art.

[0040] GPU: Graphics Processing Unit, commonly referred to as a graphics card.

[0041] CPU: Central Processing Unit, the main processor of a computer.

[0042] Video memory (VRAM): the memory of the graphics card, which the GPU can read and write.

[0043] Memory: the computer's main memory, which the CPU can read and write.

[0044] C2G(Data[i]): Represents the data transfer process that moves data Data[i] from the CPU's memory to the GPU's video memory.

[0045] G2C(Data[i]): This represents the data transfer process that moves data Data[i] from the GPU's video memory to the CPU's memory.

[0046] AC stands for activation checkpointing, a technique that uses CPU memory to temporarily cache data processed by the GPU, in order to alleviate the problem of insufficient GPU memory during the training of large models.

[0047] Figure 2 is a flowchart illustrating a gradient acceleration method in model training according to an exemplary embodiment. This method can be applied to a graphics processing unit (GPU). As shown in Figure 2, it may include the following steps.

[0048] In step S201, during the gradient backpropagation process in the model training phase of the model to be trained, in response to the gradient calculation of the first network layer in the model to be trained, the asynchronous transmission thread is called to execute the transmission of the second activation information corresponding to the second network layer from memory to GPU memory.

[0049] In the embodiments of this specification, the model to be trained may include a large language model, an image processing model, etc. The training phase may include a forward prediction process and a gradient backpropagation process. The input to the forward prediction process is the training data, which undergoes sequential feature extraction and calculation through multiple network layers in the model to be trained to obtain the prediction result. This prediction result may be a predicted label, a predicted response text, a predicted translated text, etc. The type of the prediction result is consistent with the type of the labeled data in the training data, and this application does not limit this. The gradient backpropagation process refers to the process of calculating the gradient after all training data or a batch of training data has completed the forward prediction process and obtained loss information. The calculated gradient can be used to update the model parameters, thereby completing one iteration of model training. As an example, the training type of the model to be trained may include, but is not limited to, pre-training, fine-tuning, reinforcement learning, etc., and this application does not limit this.

[0050] For example, referring to Figure 3, the model to be trained can include N network layers, P[1] to P[N], where N is a positive integer greater than 1. P[n] can denote the nth network layer, where n ranges over [1, N]. n can represent the depth of the network layer in the model to be trained. This depth can also refer to the prediction order in the forward prediction process or to the level of the network layer in the model to be trained. The greater the depth, the later the prediction order and the larger the level, that is, the larger the layer information.

[0051] Referring to Figure 3, the forward prediction process can refer to passing the training data through P[1] to P[N] in sequence. The gradient backpropagation process can refer to calculating gradients in sequence, after the loss (Loss) is obtained, from P[N] back to P[1], so as to obtain N pieces of gradient information: G[1] to G[N].

[0052] As an example, the activation information of each network layer during the forward prediction process (e.g., the activation shown in Figure 3) can be stored in the GPU's video memory. As shown in Figure 3, the activation information corresponding to each of the N network layers can be represented as GPU[1] to GPU[N] in the video memory. The activation information of each network layer can be the output of that network layer. With activation checkpointing, memory is used to cache GPU[1] to GPU[N] from the video memory. For example, after the activation information of P[1] is stored in the video memory, the GPU caches GPU[1] in the CPU's memory based on activation checkpointing and releases the video memory space of GPU[1]; the activation information of P[1] is then stored in memory as CPU[1]. This caching process can be as shown by w1 and w2 in Figure 3. During the forward prediction process, activation information is cached from video memory to memory in sequence based on activation checkpointing, which yields the in-memory representation of the activation information of each of the N network layers, for example, CPU[1] to CPU[N]. For example, as shown at P[n+1] in Figure 3, the activation information of P[n+1] can be represented as GPU[n+1] in the GPU's video memory and as CPU[n+1] in the CPU's memory; similarly, as shown at P[n] in Figure 3, the activation information of P[n] can be represented as GPU[n] in the GPU's video memory and as CPU[n] in the CPU's memory.
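The forward caching sequence can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation of the application: the two dictionaries merely stand in for video memory and memory, the function name `forward_with_offload` is hypothetical, and the layers are toy functions.

```python
def forward_with_offload(layers, x):
    """Run P[1]..P[N] in order and cache each activation in 'memory'."""
    gpu_memory = {}  # stands in for video memory: layer index -> activation
    cpu_memory = {}  # stands in for memory: layer index -> activation
    for n, layer in enumerate(layers, start=1):
        x = layer(x)
        gpu_memory[n] = x                  # GPU[n]: activation output by P[n]
        cpu_memory[n] = gpu_memory.pop(n)  # G2C: cache as CPU[n], release GPU[n]
    return x, gpu_memory, cpu_memory


# Toy network layers (assumption): any callables would do here.
layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
out, gpu_memory, cpu_memory = forward_with_offload(layers, 5)

print(out)         # 9
print(cpu_memory)  # {1: 6, 2: 12, 3: 9}
print(gpu_memory)  # {}
```

After the forward pass, every activation lives only in the memory stand-in, which corresponds to the state in which CPU[1] to CPU[N] are stored and the video memory space has been released.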

[0053] During gradient backpropagation, the gradient information G[N] can be calculated starting from the last layer P[N]. When calculating the gradient of a certain network layer, for example when starting to calculate the gradient of P[n+1], an asynchronous transfer thread can be called to transfer the activation information of the previous layer P[n] from memory to video memory, as shown by the z1 process in Figure 3. Meanwhile, if the z2 transfer process in Figure 3 has ended, the activation information of P[n+1] has been transferred from memory to video memory. Therefore, the gradient information G[n+1] corresponding to P[n+1] can be calculated based on the gradient information (p) of P[n+2] and the activation information GPU[n+1] of P[n+1]. For example, G[n+1] = GPU[n+1] + (p). By calculating the gradient information of each network layer in reverse order until G[1] is obtained, the gradient information corresponding to each of the N network layers can be obtained: G[1] to G[N]. It should be noted that, in the video memory and memory states shown in Figure 3, darker activation information indicates that it is currently stored in the video memory or memory, and lighter activation information indicates that its storage space in the video memory or memory has been released.
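The interplay of the main thread and the asynchronous transfer thread can be sketched as follows. This is a hedged illustration only: a single-worker thread pool stands in for the asynchronous transfer thread, dictionaries stand in for memory and video memory, and `c2g`, `backward_with_prefetch`, and the toy `grad_fn` are hypothetical names and choices that do not come from the application.

```python
from concurrent.futures import ThreadPoolExecutor


def c2g(cpu_memory, gpu_memory, n):
    """C2G(CPU[n]): copy the cached activation of P[n] into 'video memory'."""
    gpu_memory[n] = cpu_memory[n]


def backward_with_prefetch(cpu_memory, loss_grad, grad_fn):
    num_layers = len(cpu_memory)
    gpu_memory, grads = {}, {}
    upstream = loss_grad  # gradient information from the following layer
    with ThreadPoolExecutor(max_workers=1) as transfer_thread:
        # The last layer's activation has no earlier gradient step to overlap with.
        pending = {num_layers: transfer_thread.submit(
            c2g, cpu_memory, gpu_memory, num_layers)}
        for n in range(num_layers, 0, -1):
            if n > 1:
                # Start the background transfer for the previous layer (z1).
                pending[n - 1] = transfer_thread.submit(
                    c2g, cpu_memory, gpu_memory, n - 1)
            # Continue only once this layer's own transfer has completed (z2).
            pending.pop(n).result()
            # Main thread: gradient from the activation and the upstream gradient.
            grads[n] = grad_fn(gpu_memory.pop(n), upstream)
            upstream = grads[n]
    return grads


cpu_memory = {1: 2, 2: 3, 3: 4}  # CPU[1]..CPU[3], toy values
grads = backward_with_prefetch(
    cpu_memory, loss_grad=1, grad_fn=lambda act, up: act * up)  # toy gradient rule
print(grads)  # {3: 4, 2: 12, 1: 24}
```

On real hardware the background copy would typically be an asynchronous host-to-device transfer rather than a dictionary assignment; the sketch only shows the ordering of the prefetch, the wait, and the gradient step.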

[0054] The second activation information corresponding to the second network layer can be the activation information of the second network layer cached in memory during the forward prediction process of the model training phase. According to the prediction order in the forward prediction process, the second network layer can be located before the first network layer in the model to be trained. For example, the first network layer and the second network layer can be adjacent; as shown in Figure 3, the first network layer can be P[n+1], and correspondingly, the second network layer can be P[n]. Optionally, there can be multiple second network layers, such as two. In this case, if the first network layer is P[n+1], the second network layers can be P[n] and P[n-1] respectively. This application does not limit this.

[0055] In one possible implementation, during the gradient backpropagation process in the model training phase of the model to be trained, in response to the gradient calculation of the first network layer in the model to be trained, an asynchronous transmission thread is invoked to execute the transmission of the second activation information corresponding to the second network layer from memory to GPU memory. For example, as shown in Figure 3, the gradient calculation of P[n+1] can be executed by the main thread; in response to that gradient calculation, the asynchronous transfer thread can perform the background transfer of the activation information corresponding to P[n] (i.e., CPU[n]) from memory to video memory, for example the transfer shown as z1 in Figure 3, so that it is available to the GPU when the gradient G[n] corresponding to P[n] is calculated later.

[0056] In one optional implementation, the above-mentioned response to the gradient calculation of the first network layer in the model to be trained, calling the asynchronous transfer thread to perform the transfer of the second activation information corresponding to the second network layer from memory to GPU memory, may include:

[0057] In response to the gradient calculation of the first network layer, the first layer information corresponding to the first network layer is obtained; the first layer information can characterize the depth of the first network layer in the model to be trained, and the depth can be positively correlated with the prediction order. For example, when the depth and prediction order are represented by positive integers, the deeper the depth or the later the prediction order, the larger the corresponding positive integer can be. Optionally, as an example, the depth and prediction order of the same network layer can have the same corresponding positive integer.

[0058] The network layer corresponding to each of the at least one layer of information with a depth preceding the first layer of information is designated as the second network layer; the at least one layer of information includes the layer information adjacent to the first layer of information;

[0059] Furthermore, an asynchronous transfer thread can be invoked to execute the transfer of the second activation information corresponding to the second network layer from memory to video memory.

[0060] For example, referring to Figure 3, the layer information can be 1 to N. Suppose the first network layer is P[n+1]. In response to the gradient calculation of P[n+1], the first layer information corresponding to P[n+1] can be obtained as n+1. Therefore, the network layer corresponding to at least one piece of layer information with a depth before n+1 can be used as the second network layer; for example, the second network layer is P[n]. Further, an asynchronous transfer thread can be called to execute the transfer of the second activation information corresponding to P[n] from memory to video memory, for example the transfer shown as z1 in Figure 3.

[0061] In one example, the above-mentioned designating the network layer corresponding to each of the at least one layer of information with a depth preceding the first layer of information as the second network layer may include: determining a preset number of pieces of layer information with a depth preceding the first layer of information as a preset number of pieces of second layer information; the preset number is the number indicated by preset asynchronous transmission information, which may be a preconfigured number of layers for asynchronous transmission. Thus, the network layer corresponding to each of the preset number of pieces of second layer information can be designated as the second network layer. For example, the preset number can be 1, and this application does not limit it. Because the number of layers for asynchronous transmission is fixed in advance, the asynchronous transmission process can be invoked quickly, making the initiation of asynchronous transmission more efficient.
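Selecting the second network layers from a preset number can be sketched as below. The function name `select_second_layers` and the clamping at layer 1 are assumptions made for illustration; the application only states that the selected layer information precedes the first layer information and includes the adjacent one.

```python
def select_second_layers(first_layer_info, preset_number=1):
    """Return the layer information to prefetch, nearest layer first.

    first_layer_info is the depth (1..N) of the layer whose gradient is
    being calculated; the result never goes below layer 1.
    """
    if preset_number < 1:
        raise ValueError("preset_number must be at least 1")
    lowest = max(1, first_layer_info - preset_number)
    return list(range(first_layer_info - 1, lowest - 1, -1))


print(select_second_layers(5))                   # [4]
print(select_second_layers(5, preset_number=2))  # [4, 3]
print(select_second_layers(1))                   # []
```

With a preset number of 1, the only layer selected is the one adjacent to the first network layer, matching the P[n+1] and P[n] example above.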

[0062] In another example, the number of network layers whose activation information is transmitted asynchronously can be determined dynamically. For instance, the target number of network layers for asynchronous transmission of activation information can be determined based on the current video memory usage, a preset video memory usage, a preset transmission time, and a preset gradient time. Determining this number dynamically can flexibly meet video memory and time requirements, allows different network layers to use different numbers of asynchronously transmitted layers, and balances the need for gradient calculation acceleration against video memory usage, transmission time, and gradient time. Based on this, the method can also include: obtaining the current video memory usage. Accordingly, the above-mentioned designating the network layer corresponding to each of the at least one layer of information with a depth preceding the first layer of information as the second network layer can include: determining the target number of network layers for asynchronous transmission of activation information based on the current video memory usage, the preset video memory usage, the preset transmission time, and the preset gradient time; the network layers corresponding to the target number of pieces of layer information with a depth preceding the first layer of information can then be used as the second network layers. The target number can be positively correlated with the video memory capacity difference, the preset transmission time, and the preset gradient time. The video memory capacity difference is the difference between the current video memory usage and the preset video memory usage. For example, the larger the video memory capacity difference, the longer the preset transmission time, and the longer the preset gradient time, the larger the corresponding target number. The preset transmission time and the preset gradient time can be set in advance, and this application does not limit them.

[0063] For example, a first quantity corresponding to each of several video memory capacity difference ranges, a second quantity corresponding to the preset transmission time, and a third quantity corresponding to the preset gradient time can be set. Based on this, the video memory capacity difference can be matched against the video memory capacity difference ranges to find the range that contains it, i.e., the matched video memory capacity difference range. The first quantity corresponding to the matched range can then be obtained. Next, the first quantity corresponding to the matched range, the second quantity, and the third quantity can be weighted to obtain the target number mentioned above.
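The range matching and weighting can be sketched as follows. Everything concrete here is an assumption for illustration: the range boundaries, the quantities, the weights, the rounding, and the lower bound of 1 are hypothetical, and the application does not specify them.

```python
import bisect


def match_first_quantity(capacity_difference, range_upper_bounds, first_quantities):
    """Return the first quantity of the range containing the difference.

    range_upper_bounds are the sorted inclusive upper bounds of all ranges
    except the last, which is open-ended.
    """
    index = bisect.bisect_left(range_upper_bounds, capacity_difference)
    return first_quantities[min(index, len(first_quantities) - 1)]


def target_number(capacity_difference, second_quantity, third_quantity,
                  range_upper_bounds, first_quantities,
                  weights=(0.5, 0.25, 0.25)):
    """Weight the three quantities into a target number of layers (at least 1)."""
    first_quantity = match_first_quantity(
        capacity_difference, range_upper_bounds, first_quantities)
    w1, w2, w3 = weights
    weighted = w1 * first_quantity + w2 * second_quantity + w3 * third_quantity
    return max(1, round(weighted))


# Hypothetical ranges in GiB: up to 2, over 2 up to 8, over 8.
bounds = [2, 8]
quantities = [1, 2, 4]

print(target_number(5, 2, 2, bounds, quantities))  # 2
print(target_number(9, 2, 2, bounds, quantities))  # 3
```

A larger video memory capacity difference falls into a range with a larger first quantity and so raises the result, which is consistent with the positive correlation described above.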

[0064] Alternatively, the current transmission time can be used in place of the preset transmission time, and the current gradient calculation time can be used in place of the preset gradient time. The target number of network layers for asynchronous transmission of activation information can then be determined based on the video memory capacity difference, the current transmission time, and the current gradient calculation time. In this case, the target number can be positively correlated with the video memory capacity difference and negatively correlated with the current transmission time and the current gradient calculation time.

[0065] In one alternative implementation, there can be multiple second network layers. Correspondingly, the aforementioned invocation of an asynchronous transfer thread to execute the transfer of the second activation information corresponding to the second network layer from memory to video memory can include:

[0066] Based on the hierarchical distance between the second network layer and the first network layer, the second activation information corresponding to the second network layer is sequentially transferred from memory to video memory. The smaller the hierarchical distance, the earlier the information is transferred in the sequence. For example, the hierarchical distance can be a depth difference or a hierarchical information difference. For instance, the first network layer is P[n+1], and its corresponding first hierarchical information is n+1; multiple second network layers include P[n] and P[n-1], each with second hierarchical information of n and n-1. Thus, the hierarchical distances can be 1 and 2 respectively, allowing the sequential transfer of the second activation information corresponding to P[n] and P[n-1] from memory to video memory. That is, the transfer of the second activation information corresponding to P[n] from memory to video memory is performed first, followed by the transfer of the second activation information corresponding to P[n-1] from memory to video memory, i.e., the transfer of the second activation information corresponding to multiple second network layers from memory to video memory is performed serially.

[0067] Alternatively, the transfer of second activation information corresponding to multiple second network layers from memory to video memory can be performed in parallel. For example, multiple asynchronous transfer threads can be invoked to perform the transfer of second activation information corresponding to multiple second network layers from memory to video memory in parallel in the background.

[0068] When the activation information of multiple network layers is obtained in advance, providing both the serial transmission method and the parallel transmission method described above improves the flexibility of asynchronous transmission. The former obtains the activation information of the second network layer adjacent to the first network layer first, so that the gradient calculation of that adjacent second network layer can start sooner and the overall time consumption is smaller. The latter can also effectively reduce the overall gradient calculation time when the time consumption of all network layers is considered as a whole.
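The serial and parallel transmission methods of [0066] and [0067] can be contrasted with a small sketch. It is a simplified illustration under assumptions: Python dictionaries stand in for memory and video memory, the function names are hypothetical, and a thread pool stands in for the asynchronous transmission thread or threads.

```python
from concurrent.futures import ThreadPoolExecutor


def transfer(layer, host_cache, device_cache):
    """Stand-in for copying one layer's activation from memory to video memory."""
    device_cache[layer] = host_cache[layer]
    return layer


def serial_order(first_layer, second_layers):
    """Order second layers by hierarchical distance, nearest layer first."""
    return sorted(second_layers, key=lambda layer: first_layer - layer)


def transfer_serial(first_layer, second_layers, host_cache, device_cache):
    # A single background worker handles the queue in order: P[n], then P[n-1].
    with ThreadPoolExecutor(max_workers=1) as pool:
        futures = [pool.submit(transfer, layer, host_cache, device_cache)
                   for layer in serial_order(first_layer, second_layers)]
        return [future.result() for future in futures]


def transfer_parallel(second_layers, host_cache, device_cache):
    # One background worker per layer; completion order is not guaranteed.
    with ThreadPoolExecutor(max_workers=len(second_layers)) as pool:
        futures = [pool.submit(transfer, layer, host_cache, device_cache)
                   for layer in second_layers]
        return sorted(future.result() for future in futures)
```

For a first network layer at depth 5 and second network layers at depths 3 and 4, `transfer_serial` submits depth 4 before depth 3, mirroring the P[n] then P[n-1] order in [0066].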

[0069] In an optional implementation, the method may further include: if the current video memory usage reaches a preset video memory usage capacity, suspending the transmission of the second activation information corresponding to the target network layer from memory to video memory. Wherein, if there are multiple second network layers, the target network layer can be a non-adjacent network layer among the multiple second network layers; or, if there is only one second network layer, the target network layer can be that single second network layer. By detecting that the current video memory usage reaches the preset video memory usage capacity, i.e., the current video memory usage is too high, asynchronous transmission or asynchronous transmission of non-adjacent network layers can be suspended, thereby improving the effectiveness of asynchronous transmission or ensuring the effectiveness of asynchronous transmission of adjacent network layers.
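The suspension rule in [0069] can be sketched as a small selection function. This is a sketch under assumptions: the function name and the numeric usage values are hypothetical, and layers are identified by their depth.

```python
def layers_to_suspend(first_layer, second_layers, current_usage, preset_usage):
    """Return the second layers whose memory-to-video-memory transfer is paused."""
    if current_usage < preset_usage:
        # Video memory usage is below the preset capacity: nothing is paused.
        return []
    if len(second_layers) == 1:
        # Only one second network layer: that single layer is the target layer.
        return list(second_layers)
    # Several second network layers: pause only the non-adjacent ones, so the
    # adjacent layer's transfer can still complete.
    return [layer for layer in second_layers if abs(first_layer - layer) > 1]
```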

[0070] In step S203, the third gradient information corresponding to the third network layer is obtained.

[0071] In the embodiments described in this specification, according to the prediction order in the forward prediction process, the third network layer is located after the first network layer in the model to be trained, that is, the depth of the third network layer is greater than the depth of the first network layer. For example, the third network layer can be adjacent to the first network layer. Based on this, for example, if the first network layer is P[n+1], the corresponding third network layer can be P[n+2], and the depth of the third network layer n+2 is greater than the depth of the first network layer n+1.

[0072] In one possible implementation, a third network layer that is located after and adjacent to the first network layer in the model to be trained can be identified, thereby obtaining the gradient information corresponding to the third network layer as the third gradient information.

[0073] In step S205, after the transfer of the first activation information corresponding to the first network layer from memory to video memory is completed, the main thread is called to determine the first gradient information corresponding to the first network layer based on the first activation information and the third gradient information.

[0074] In the embodiments of this specification, the first activation information corresponding to the first network layer can be the activation information of the first network layer cached in memory during the forward prediction process. For example, the transfer of the first activation information from memory to GPU memory can be initiated in response to the gradient calculation of the third network layer and executed by an asynchronous transfer thread. That is, the first activation information of the first network layer is asynchronously transferred in advance when the gradient calculation of the third network layer is initiated. Since the activation information of each layer is transferred asynchronously, it is not necessary to initiate the transfer of a layer's activation information from memory to GPU memory during the gradient calculation of that layer. Instead, it is only necessary to determine whether the asynchronous transfer of that layer's activation information has been completed, i.e., whether it has been stored in GPU memory. Accordingly, if the asynchronous transfer is completed, the gradient calculation of that layer can be executed. That is, if the transfer of the first activation information corresponding to the first network layer from memory to GPU memory is completed, the main thread can be called to determine the first gradient information corresponding to the first network layer based on the first activation information and the third gradient information.

[0075] Referring to Figure 4, take the case of a single second network layer as an example, where the first network layer is P[n+1]. While the GPU uses the main thread to calculate the gradient information of P[n+1], it can also call an asynchronous transfer thread to execute the transfer of the activation information corresponding to P[n] from memory to GPU memory, denoted C2G[n]. After the GPU finishes calculating the gradient information of P[n+1] (denoted C[n+1]), it needs to calculate the gradient information of P[n], i.e., C[n]. At this point P[n] takes the role of the first network layer, and the GPU can wait for the transfer C2G[n] of its activation information from memory to GPU memory. After the C2G[n] transfer is complete, the main thread can be called to determine the gradient information G[n] based on that activation information and the gradient information of the following layer, and the gradient backpropagation process continues until the gradient information G[1] of P[1] is obtained. A comparison of the gradient calculation timing diagrams in Figure 4 shows that, during the gradient backpropagation process, standard activation checkpointing (AC) alternately executes data transmission [C2G] and gradient calculation [C] instructions. As a result, [C2G] consumes a significant amount of time, and each layer's gradient calculation waits longer for its transmission. In contrast, the gradient calculation scheme in this embodiment initiates C2G[n] in the background during the calculation of C[n+1] to fetch the activation information of P[n], which significantly reduces the waiting time for C2G[n] when C[n] is calculated. Referring to Figure 4, when calculating the gradient information of three network layers starting at time t0, standard AC takes t2-t0, while the gradient calculation scheme in this embodiment takes only t1-t0, effectively improving the gradient calculation efficiency.
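The timing comparison in [0075] can be made concrete with a small schedule model. This is a simplified sketch under assumptions, not a measurement: the per-layer durations are hypothetical inputs, lists are ordered in backpropagation order (last layer first), a single transfer thread is assumed, and the first layer's transfer is assumed to start at t0 and cannot be hidden.

```python
def standard_ac_time(transfer_times, compute_times):
    """Transfers and gradient calculations strictly alternate on one timeline."""
    return sum(transfer_times) + sum(compute_times)


def overlapped_time(transfer_times, compute_times):
    """Each layer's transfer is launched when the previous layer's gradient
    calculation starts, so it runs in the background of that calculation."""
    transfer_done = transfer_times[0]  # the first transfer cannot be hidden
    compute_done = 0.0
    for i, compute in enumerate(compute_times):
        # A layer's gradient needs both the upstream gradient and its activation.
        compute_start = max(compute_done, transfer_done)
        if i + 1 < len(transfer_times):
            # Launch the next layer's transfer as this gradient calculation starts.
            transfer_done = compute_start + transfer_times[i + 1]
        compute_done = compute_start + compute
    return compute_done
```

In this model the overlapped schedule never takes longer than the alternating one. When each transfer is shorter than the gradient calculation it overlaps with, only the first transfer remains on the critical path.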

[0076] During the gradient backpropagation process in the model training phase of the model to be trained, in response to the gradient calculation of the first network layer in the model to be trained, the asynchronous transmission thread is called to execute the transmission of the second activation information corresponding to the second network layer from memory to GPU memory. This allows the gradient calculation of the first network layer and the asynchronous transmission of the activation information of the previous network layer (the second network layer) to be executed in parallel, which can greatly reduce the time spent waiting for the activation information of the previous network layer, effectively improve the gradient calculation speed of each network layer in the model to be trained, and thus improve the model training speed. GPU memory capacity can still be released flexibly, while the transmission time cost that activation checkpointing technology introduces for activation information is effectively avoided. The model training speed is thus improved while GPU memory usage remains flexible and controllable.

[0077] In one alternative implementation, the first network layer can be any one of the multiple network layers in the model to be trained. For example, it can be the first layer, such as P[1] in Figure 3, an intermediate network layer, or the last network layer, such as P[N] in Figure 3. It should be noted that, in the case of P[1], there is no preceding network layer, so no activation information transfer or gradient calculation for a preceding layer is required and the asynchronous transmission thread does not need to be called; in the case of P[N], there is no gradient information from a following network layer, and the gradient calculation requires the loss information Loss.

[0078] Referring to Figure 5, training data can be divided into multiple training batches. For example, if the model to be trained is a large language model and the training data is training text, multiple batches of training text can be obtained, and one batch can then be extracted as the current batch of training text. For instance, the training text can be converted into a series of tokens, and these tokens can then be divided into multiple batches of the same size, resulting in multiple batches of training text. Next, a lookup table can be used to map each token to an embedding feature as input to the large language model. Each network layer can then calculate its corresponding activation information in sequence. For example, a Transformer layer of an attention-based model can transform the input activation into a new activation as output through forward prediction. These activations are used again in the subsequent gradient backpropagation. Caching the activations in CPU memory reduces GPU memory usage, making it possible to train larger-scale models. Based on this, it can be determined whether all network layers have completed forward prediction, that is, whether all network layers have output their activation information. If so, the loss can be calculated. The loss can be based on the difference between the output of the last layer of the large language model and the annotated label of the input training text. For example, the loss can be calculated using a preset loss function. This application does not limit this.

[0079] As shown in Figure 5, n can next be initialized to N, meaning gradient information is calculated backwards from the last layer. If the current n is greater than 1, the asynchronous transfer of the activation information of layer n-1 from memory to GPU memory can be initiated. That is, before the gradient information of layer n-1 is calculated, an asynchronous transfer thread is called to transfer the activation information from the CPU memory cache to the GPU's video memory in advance. As shown in Figure 4, the gradient calculation of the nth layer can be performed while the asynchronously transmitted activation information is awaited. For example, if the first network layer is the last layer in the forward prediction process, i.e., P[N], then after the above-mentioned step of calling the asynchronous transmission thread to execute the transmission of the second activation information corresponding to the second network layer from memory to GPU memory, the method can further include: obtaining the loss information corresponding to the current batch of training data, such as the aforementioned Loss. The main thread can then be called to determine the first gradient information corresponding to the first network layer based on the loss information and the first activation information: G[N] = Loss + GPU[N], where GPU[N] is the first activation information corresponding to P[N]. If the first network layer is a network layer other than the last layer, G[n] = G[n+1] + GPU[n]. For example, as shown in Figure 3, GPU[n] represents the first activation information corresponding to P[n], that is, the activation information transmitted to the video memory by asynchronous transmission z1, and G[n+1] represents the third gradient information of the third network layer.

[0080] Furthermore, after calculating the gradient information of the nth layer, it can be determined whether the current n is greater than 1. If it is, it means that the gradient backpropagation has not ended, and n can be updated to n-1. Then, it is determined whether the updated n is greater than 1. If it is, the above gradient calculation process can be repeated. If not, there is no need to call the asynchronous transmission thread, and the gradient information can be calculated directly. Optionally, after calculating the gradient information of the nth layer, if it is determined that the current n is not greater than 1, the model parameters can be updated, and it can be determined whether all batches of training texts have been completed, i.e., all have been trained and learned. If so, it can be determined that the model training is over and a trained large language model is obtained. If not, the training process of the model to be trained based on the next batch of training data can be started.
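The backpropagation loop of [0079] and [0080] can be sketched with a background worker standing in for the asynchronous transmission thread. This is a toy sketch under assumptions, not the implementation of this application: dictionaries stand in for memory and video memory, `backward_with_prefetch`, `c2g` and `grad_fn` are hypothetical names, the gradient rule is supplied by the caller, and the transfer for layer N is assumed to be issued at the start of the loop.

```python
from concurrent.futures import ThreadPoolExecutor


def backward_with_prefetch(num_layers, host_cache, loss, grad_fn):
    """Backpropagate from layer N down to layer 1, prefetching layer n-1
    in the background while the gradient of layer n is calculated.

    host_cache maps a layer index to the activation cached in memory.
    grad_fn(upstream, activation) is a caller-supplied gradient rule.
    """
    device_cache = {}  # stands in for video memory
    grads = {}

    def c2g(layer):
        # Stand-in for the memory-to-video-memory copy C2G[layer].
        device_cache[layer] = host_cache[layer]

    with ThreadPoolExecutor(max_workers=1) as transfer_thread:
        # Assumption: the last layer's activation is requested up front.
        pending = {num_layers: transfer_thread.submit(c2g, num_layers)}
        upstream = loss  # layer N uses the loss information
        for n in range(num_layers, 0, -1):
            if n > 1:
                # Launch the transfer for layer n-1 before computing layer n.
                pending[n - 1] = transfer_thread.submit(c2g, n - 1)
            pending.pop(n).result()  # wait until C2G[n] has completed
            # Assumption: the activation is dropped from video memory once used.
            grads[n] = grad_fn(upstream, device_cache.pop(n))
            upstream = grads[n]  # serves as G[n+1] for the next layer
    return grads
```

The main thread blocks only on the transfer of the layer it is about to process. For layer 1 no further transfer is launched, which corresponds to the branch in [0080] where the asynchronous transmission thread is not called.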

[0081] In one example, when the transfer of the second activation information from the memory to the video memory is completed, the main thread is called to determine the second gradient information corresponding to the second network layer based on the second activation information and the first gradient information. Further, when the second network layer is the first layer in the forward prediction process, the model parameters of the model to be trained are updated based on the target gradient information, and the training process of the model to be trained based on the next batch of training data is entered. Here, the target gradient information refers to the gradient information corresponding to each network layer in the model to be trained. The gradient information corresponding to each network layer includes the first gradient information, the second gradient information and the third gradient information; that is, the target gradient information can include G[1]~G[N]. The training speed (number of tokens processed per second) of models of different sizes was compared. For example, Yi-34B represents a large model with approximately 34 billion parameters, while Llama2-70B represents a large model with approximately 70 billion parameters. On the same single machine with 8 GPUs, the gradient acceleration method of the embodiments in this specification significantly improves the training speed, as shown in the table below.

[0082] Training speed (tokens processed per second):

Model      | Existing AC solution | Gradient acceleration method of the embodiments of this specification
Yi-34B     | 8,612                | 10,792
Llama2-70B | 5,398                | 6,985
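The relative improvement implied by the figures in [0082] can be derived with a short calculation. The script only restates arithmetic on the reported numbers; the names `speeds` and `relative_speedup` are illustrative.

```python
# Tokens processed per second, copied from paragraph [0082]:
# (existing AC solution, gradient acceleration method of the embodiments)
speeds = {
    "Yi-34B": (8612, 10792),
    "Llama2-70B": (5398, 6985),
}


def relative_speedup(baseline, accelerated):
    """Ratio of the accelerated training speed to the baseline speed."""
    return accelerated / baseline


for model, (baseline, accelerated) in speeds.items():
    ratio = relative_speedup(baseline, accelerated)
    print(f"{model}: {ratio:.3f}x ({(ratio - 1) * 100:.1f}% faster)")
# prints:
# Yi-34B: 1.253x (25.3% faster)
# Llama2-70B: 1.294x (29.4% faster)
```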

[0083] By running the gradient calculation of the current network layer in parallel with the background asynchronous transmission of the previous layer's activation information, the asynchronous transmission thread can run in the background without the main thread waiting for it to return. When the gradient of the previous layer is then calculated, the waiting time for its activation information is reduced, thereby shortening the gradient calculation time in model training. For large models, this can greatly save GPU costs and effectively improve model training efficiency.

[0084] Figure 6 is a block diagram illustrating a gradient acceleration device in model training according to an exemplary embodiment. Referring to Figure 6, the device may include:

[0085] The asynchronous transmission module 601 is used, during the gradient backpropagation process in the model training phase of the model to be trained, in response to the gradient calculation of the first network layer in the model to be trained, to call the asynchronous transmission thread to execute the transmission of the second activation information corresponding to the second network layer from memory to GPU memory; the second activation information corresponding to the second network layer is the activation information of the second network layer cached in memory during the forward prediction process of the model training phase; according to the prediction order in the forward prediction process, the second network layer is located before the first network layer in the model to be trained.

[0086] The acquisition module 603 is used to acquire the third gradient information corresponding to the third network layer; according to the prediction order in the forward prediction process, the third network layer is located after the first network layer in the model to be trained.

[0087] The gradient processing module 605 is used to call the main thread to determine the first gradient information corresponding to the first network layer based on the first activation information and the third gradient information after the transfer of the first activation information corresponding to the first network layer from the memory to the video memory is completed; the first activation information corresponding to the first network layer is the activation information of the first network layer cached in the memory during the forward prediction process.

[0088] During the gradient backpropagation process in the model training phase of the model to be trained, in response to the gradient calculation of the first network layer in the model to be trained, the asynchronous transmission thread is called to execute the transmission of the second activation information corresponding to the second network layer from memory to GPU memory. This allows the gradient calculation of the first network layer and the asynchronous transmission of the activation information of the previous network layer (the second network layer) to be executed in parallel, which can greatly reduce the time spent waiting for the activation information of the previous network layer, effectively improve the gradient calculation speed of each network layer in the model to be trained, and thus improve the model training speed. GPU memory capacity can still be released flexibly, while the transmission time cost that activation checkpointing technology introduces for activation information is effectively avoided. The model training speed is thus improved while GPU memory usage remains flexible and controllable.

[0089] In one possible implementation, the asynchronous transmission module 601 may include:

[0090] The hierarchical information acquisition unit is used to acquire the first hierarchical information corresponding to the first network layer in response to the gradient calculation of the first network layer; the first hierarchical information represents the depth of the first network layer in the model to be trained, and the depth is positively correlated with the prediction order;

[0091] A network layer determination unit is configured to designate the network layer corresponding to at least one layer of information with a depth preceding the first layer of information as the second network layer; the at least one layer of information includes layer information adjacent to the first layer of information.

[0092] An asynchronous transmission unit is used to call the asynchronous transmission thread to execute the transmission of the second activation information corresponding to the second network layer from memory to video memory.

[0093] In one possible implementation, the network layer determination unit may include:

[0094] A hierarchy information determination subunit is used to determine a preset number of hierarchy information whose depth precedes the first hierarchy information as a preset number of second hierarchy information; the preset number is the number of preset asynchronous transmission information indications;

[0095] The first network layer determining subunit is used to determine the network layer corresponding to each of the preset number of second-level information as the second network layer.
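The selection performed by the two subunits in [0094] and [0095] can be sketched as follows. This is a sketch under assumptions: level information is modeled as a 1-based integer depth, and the function name is hypothetical.

```python
def select_second_layers(first_level, preset_number):
    """Return up to preset_number levels immediately before first_level,
    nearest level first. Level 1 is the first layer in forward prediction."""
    if preset_number < 1:
        raise ValueError("at least the adjacent layer must be selected")
    lowest = max(1, first_level - preset_number)
    return list(range(first_level - 1, lowest - 1, -1))
```

For a first network layer at depth 5 and a preset number of 2, the selected second network layers are at depths 4 and 3. For depth 1 the result is empty, consistent with P[1] having no preceding network layer.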

[0096] In one possible implementation, the device may further include:

[0097] The video memory usage acquisition module is used to acquire the current video memory usage capacity of the video memory;

[0098] The network layer determination unit may include:

[0099] The target quantity determination subunit is used to determine the target number of network layers corresponding to the asynchronous transmission activation information based on the current video memory usage, the preset video memory usage, the preset transmission time, and the preset gradient time. The target quantity is positively correlated with the video memory capacity difference, the preset transmission time, and the preset gradient time. The video memory capacity difference is the difference between the current video memory usage and the preset video memory usage.

[0100] The second network layer determining subunit is used to take the network layer corresponding to each of the target number of level information whose depth is before the first level information as the second network layer.

[0101] In one possible implementation, the number of second network layers is multiple; the asynchronous transmission module 601 may include:

[0102] A serial execution unit is used to sequentially execute the transmission of the second activation information corresponding to the second network layer from memory to video memory according to the hierarchical distance between the second network layer and the first network layer; the smaller the hierarchical distance, the earlier the execution order is.

[0103] or,

[0104] The parallel execution unit is used to perform the parallel transfer of multiple second activation information corresponding to the second network layer from memory to video memory.

[0105] In one possible implementation, the device may further include:

[0106] The transmission pause module is used to pause the transmission of the second activation information corresponding to the target network layer from the memory to the video memory when the current video memory usage reaches the preset video memory usage.

[0107] Wherein, if there are multiple second network layers, the target network layer is a network layer among the multiple second network layers that is not adjacent to the first network layer; or, if there is only one second network layer, the target network layer is the single second network layer.

[0108] In one possible implementation, where the first network layer is the last layer in the forward prediction process, the apparatus may further include:

[0109] The loss information acquisition module is used to acquire the loss information corresponding to the current batch of training data;

[0110] The gradient calculation module is used to call the main thread to determine the first gradient information corresponding to the first network layer based on the loss information and the first activation information.

[0111] In one possible implementation, the device may further include:

[0112] The gradient determination module is used to call the main thread to determine the second gradient information corresponding to the second network layer based on the second activation information and the first gradient information after the transfer of the second activation information from the memory to the video memory is completed.

[0113] The model parameter update module is used to update the model parameters of the model to be trained based on the target gradient information when the second network layer is the first layer in the forward prediction process, and to enter the training process of the model to be trained based on the next batch of training data.

[0114] The target gradient information refers to the gradient information corresponding to each network layer in the model to be trained, and the gradient information corresponding to each network layer includes the first gradient information, the second gradient information and the third gradient information.

[0115] Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operation has been described in detail in the embodiments related to the method, and will not be elaborated upon here.

[0116] Figure 7 is a block diagram of an electronic device for gradient acceleration in model training, according to an exemplary embodiment. The electronic device may be a server, and its internal structure may be as shown in Figure 7. The electronic device includes a processor, memory, and a network interface connected via a system bus. The processor provides computational and control capabilities. The memory includes a non-volatile storage medium and internal memory. The non-volatile storage medium stores the operating system and computer programs. The internal memory provides an environment for running the operating system and the computer programs stored in the non-volatile storage medium. The network interface is used to communicate with external terminals via a network connection. When the computer program is executed by the processor, it implements a gradient acceleration method in model training.

[0117] Those skilled in the art will understand that the structure shown in Figure 7 is merely a block diagram of the portion of the structure related to the present application and does not constitute a limitation on the electronic device to which the present application is applied. A specific electronic device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.

[0118] In an exemplary embodiment, an electronic device is also provided, including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement a gradient acceleration method in model training as described in the embodiments of this application.

[0119] In an exemplary embodiment, a computer-readable storage medium is also provided; when the instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the gradient acceleration method in model training according to the embodiments of this application. The computer-readable storage medium may be a ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, or optical data storage device, etc.

[0120] In an exemplary embodiment, a computer program product containing instructions is also provided, which, when run on a computer, causes the computer to perform the gradient acceleration method in model training as described in the embodiments of this application.

[0121] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. This computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

[0122] Other embodiments of this application will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of this application that follow the general principles of this application and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this application are indicated by the following claims.

[0123] It should be understood that this application is not limited to the precise structure described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this application is limited only by the appended claims.

Claims

1. A gradient acceleration method for model training, characterized in that the method is applied to a graphics processor and includes:
during the gradient backpropagation process in the model training phase of the model to be trained, in response to the gradient calculation of the first network layer in the model to be trained, an asynchronous transmission thread is invoked to execute the transmission of the second activation information corresponding to the second network layer from memory to video memory; the second activation information corresponding to the second network layer is the activation information of the second network layer cached in memory during the forward prediction process in the model training phase; according to the prediction order in the forward prediction process, the second network layer is located before the first network layer in the model to be trained;
the third gradient information corresponding to the third network layer is obtained; according to the prediction order in the forward prediction process, the third network layer is located after the first network layer in the model to be trained;
when the transfer of the first activation information corresponding to the first network layer from the memory to the video memory is completed, the main thread is invoked to determine the first gradient information corresponding to the first network layer based on the first activation information and the third gradient information; the first activation information corresponding to the first network layer is the activation information of the first network layer cached in the memory during the forward prediction process.

2. The method according to claim 1, characterized in that invoking the asynchronous transmission thread, in response to the gradient calculation for the first network layer in the model to be trained, to transfer the second activation information corresponding to the second network layer from the memory to the video memory comprises: in response to the gradient calculation for the first network layer, obtaining first layer information corresponding to the first network layer, wherein the first layer information represents the depth of the first network layer in the model to be trained, and the depth is positively correlated with the prediction order; taking, as the second network layer, the network layer corresponding to each of at least one piece of layer information whose depth precedes that of the first layer information, wherein the at least one piece of layer information includes the layer information adjacent to the first layer information; and invoking the asynchronous transmission thread to transfer the second activation information corresponding to the second network layer from the memory to the video memory.

3. The method according to claim 2, characterized in that taking, as the second network layer, the network layer corresponding to each of the at least one piece of layer information whose depth precedes that of the first layer information comprises: taking a preset number of pieces of layer information whose depth precedes that of the first layer information as a preset number of pieces of second layer information, wherein the preset number is the number indicated by preset asynchronous transmission information; and taking, as the second network layer, the network layer corresponding to each of the preset number of pieces of second layer information.
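As a minimal sketch of the selection in claims 2 and 3, assume that layer information is an integer depth starting at 0 for the first layer in the forward prediction order. The function name `select_prefetch_layers` and its parameters are hypothetical.

```python
def select_prefetch_layers(first_depth, preset_count):
    """Return the depths of the layers to prefetch, nearest layer first.

    Picks up to preset_count layers whose depth precedes first_depth,
    always starting with the adjacent layer (first_depth - 1).
    """
    lowest = max(0, first_depth - preset_count)
    return list(range(first_depth - 1, lowest - 1, -1))


# While layer 5 computes its gradient, prefetch the two preceding layers.
selected = select_prefetch_layers(first_depth=5, preset_count=2)
```

Near the start of the model fewer than `preset_count` layers remain, so the returned list is simply shorter (and empty for the first layer).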

4. The method according to claim 2, characterized in that the method further comprises: obtaining a current video memory usage; and wherein taking, as the second network layer, the network layer corresponding to each of the at least one piece of layer information whose depth precedes that of the first layer information comprises: determining, based on the current video memory usage, a preset video memory usage, a preset transmission time, and a preset gradient time, a target number of network layers whose activation information is to be transmitted asynchronously, wherein the target number is positively correlated with a video memory capacity difference, the preset transmission time, and the preset gradient time, and the video memory capacity difference is the difference between the current video memory usage and the preset video memory usage; and taking, as the second network layers, the network layers corresponding to the target number of pieces of layer information whose depth precedes that of the first layer information.

5. The method according to claim 1, characterized in that there are multiple second network layers, and invoking the asynchronous transmission thread to transfer the second activation information corresponding to the second network layers from the memory to the video memory comprises: transferring the second activation information corresponding to each second network layer from the memory to the video memory in sequence, based on the hierarchical distance between that second network layer and the first network layer, wherein the smaller the hierarchical distance, the earlier the transfer is executed; or transferring the second activation information corresponding to the multiple second network layers from the memory to the video memory in parallel.
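The two alternatives in claim 5 can be sketched as follows, again assuming integer depths. The callable `copy_fn` is a hypothetical stand-in for one memory-to-video-memory transfer, and the use of a thread pool for the parallel case is an illustrative choice, not something the claim specifies.

```python
from concurrent.futures import ThreadPoolExecutor


def transfer_order(first_depth, second_depths):
    """Order layers by hierarchical distance to the first layer, nearest first."""
    return sorted(second_depths, key=lambda depth: first_depth - depth)


def transfer_sequential(first_depth, second_depths, copy_fn):
    """Transfer one layer at a time, nearest layer first."""
    return [copy_fn(depth) for depth in transfer_order(first_depth, second_depths)]


def transfer_parallel(second_depths, copy_fn, max_workers=4):
    """Transfer all selected layers concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(copy_fn, second_depths))


order = transfer_order(first_depth=5, second_depths=[2, 4, 3])
```

Sequential transfer makes the adjacent layer, which is needed first, available soonest; parallel transfer trades that ordering guarantee for overlap between the copies.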

6. The method according to claim 1 or 5, characterized in that the method further comprises: when a current video memory usage reaches a preset video memory usage, pausing the transfer of the second activation information corresponding to a target network layer from the memory to the video memory; wherein, when there are multiple second network layers, the target network layer is a network layer among the multiple second network layers that is not adjacent to the first network layer; or, when there is a single second network layer, the target network layer is that second network layer.
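The pause rule in claim 6 can be sketched as a small decision function. The name `layers_to_pause` is hypothetical, depths are assumed to be integers, and pausing every non-adjacent layer (rather than only one of them) is an assumption, since the claim only requires that the paused layer not be adjacent to the first network layer.

```python
def layers_to_pause(first_depth, second_depths, current_usage, preset_usage):
    """Return the depths whose transfer should be paused.

    Nothing is paused while usage is below the preset threshold. Above it,
    a single pending layer is paused; with several pending layers, only
    those not adjacent to the first layer are paused.
    """
    if current_usage < preset_usage:
        return []
    if len(second_depths) == 1:
        return list(second_depths)
    return [depth for depth in second_depths if depth != first_depth - 1]


# Usage has reached the threshold: keep fetching layer 4, pause layers 3 and 2.
paused = layers_to_pause(first_depth=5, second_depths=[4, 3, 2],
                         current_usage=90, preset_usage=80)
```

Keeping the adjacent layer's transfer running avoids stalling the next gradient calculation while still bounding how far ahead activations are staged in video memory.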

7. The method according to claim 1, characterized in that, when the first network layer is the last layer in the forward prediction, after invoking the asynchronous transmission thread to transfer the second activation information corresponding to the second network layer from the memory to the video memory, the method further comprises: obtaining loss information corresponding to the current batch of training data; and invoking the main thread to determine the first gradient information corresponding to the first network layer based on the loss information and the first activation information.

8. The method according to claim 1, characterized in that the method further comprises: when the transfer of the second activation information from the memory to the video memory is complete, invoking the main thread to determine second gradient information corresponding to the second network layer based on the second activation information and the first gradient information; and, when the second network layer is the first layer in the forward prediction, updating the model parameters of the model to be trained based on target gradient information, and starting the training of the model to be trained on the next batch of training data; wherein the target gradient information is the gradient information corresponding to each network layer in the model to be trained, and includes the first gradient information, the second gradient information, and the third gradient information.

9. A gradient acceleration device for training large models, characterized in that the device comprises: an asynchronous transmission module configured to, during gradient backpropagation in the model training phase of a model to be trained, in response to gradient calculation for a first network layer in the model to be trained, invoke an asynchronous transmission thread to transfer second activation information corresponding to a second network layer from memory to video memory, wherein the second activation information is the activation information of the second network layer cached in the memory during forward prediction in the model training phase, and, according to the prediction order in the forward prediction, the second network layer precedes the first network layer in the model to be trained; an acquisition module configured to obtain third gradient information corresponding to a third network layer, wherein, according to the prediction order in the forward prediction, the third network layer follows the first network layer in the model to be trained; and a gradient processing module configured to, when the transfer of first activation information corresponding to the first network layer from the memory to the video memory is complete, invoke a main thread to determine first gradient information corresponding to the first network layer based on the first activation information and the third gradient information, wherein the first activation information is the activation information of the first network layer cached in the memory during the forward prediction.

10. An electronic device, characterized in that it comprises: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the gradient acceleration method in model training as described in any one of claims 1 to 8.

11. A computer-readable storage medium, characterized in that, when instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the gradient acceleration method in model training as described in any one of claims 1 to 8.

12. A computer program product, characterized in that it comprises computer instructions which, when executed by a processor, cause a computer to perform the gradient acceleration method in model training as described in any one of claims 1 to 8.