Data processing method and apparatus

By storing the computation results in a multi-level storage resource and prefetching them during neural network training, the problems of high activation value storage overhead and high recomputation cost are solved, thereby improving the efficiency and resource utilization of neural network training.

WO2026138711A1 · PCT designated stage · Publication Date: 2026-07-02 · HUAWEI TECH CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
HUAWEI TECH CO LTD
Filing Date
2025-12-22
Publication Date
2026-07-02


Abstract

Provided in the present application are a data processing method and an apparatus, used for storing computing results of first computing tasks in a multi-level storage resource, and prefetching same during execution of second computing tasks, thereby improving the computing efficiency of second computing tasks. The method comprises: a first computing unit acquires a plurality of computing results, wherein the plurality of computing results are results of computing a plurality of first computing tasks, and each first computing task can obtain one computing result; subsequently, the first computing unit sends to a second storage unit at least one of the plurality of computing results; and then, the first computing unit computes a second computing task on the basis of the obtained first computing result, and during the computing of the current second computing task, can acquire from the second storage unit a first computing result corresponding to a next second computing task and store same in a first storage unit, the second computing task corresponding to the computing result of the at least one first computing task.

Description

A data processing method and apparatus

[0001] This application claims priority to Chinese Patent Application No. 202411968324.4, filed on December 26, 2024, entitled “A Data Processing Method and Apparatus”, the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This application relates to the field of artificial intelligence, and more particularly to a data processing method and apparatus.

Background Technology

[0003] Typically, the output capability of a neural network is related to its size; for example, the larger the neural network, the higher its output capability after training. Alternatively, neural networks with different functions can be integrated into a single neural network to obtain a large-scale neural network.

[0004] With the development of artificial intelligence, large models are being applied to more scenarios. However, as the scale of neural networks increases, the computational load required to process them also gradually increases. Taking the training process of a large model as an example, during the forward propagation of the neural network, a series of activation values are generated. These activation values are needed to calculate the gradient during the backpropagation. The storage overhead for these activation values is very large; for example, when the input data is large, the storage overhead can be tens of times that of the neural network parameters.

[0005] In an existing large-scale model training scheme, activation values do not need to be stored during forward propagation. During backpropagation, the activation values can be recalculated, and the gradient can be calculated using these recalculated values. However, this scheme incurs additional computational costs to recalculate the activation values. Therefore, how to achieve more efficient data processing with lower computational costs has become an urgent problem to be solved.

Summary of the Invention

[0006] This application provides a data processing method and apparatus for storing the calculation results of a first computing task in a multi-level storage resource and prefetching them when executing a second computing task to improve the computing efficiency of the second computing task.

[0007] In view of this, in a first aspect, this application provides a data processing method applied to a first computing unit, wherein the first computing unit, a first storage unit and a second storage unit are deployed in a first system, and the first computing unit is coupled to the first storage unit;

[0008] The method includes: the first computing unit acquires multiple computing results, the multiple computing results being the results of computing multiple first computing tasks, with each first computing task yielding one computing result. The first computing unit then sends at least one first computing result among the multiple computing results to the second storage unit, so that the at least one first computing result is stored in the second storage unit. Subsequently, the first computing unit computes second computing tasks based on the acquired first computing results; while the current second computing task is being computed, the first computing result corresponding to the next second computing task can be retrieved from the second storage unit and stored in the first storage unit. Each second computing task corresponds to the computing result of at least one first computing task. That is, computing the next second computing task requires the computing results of one or more first computing tasks, and any first computing result among them needs to be prefetched from the second storage unit.

[0009] Therefore, in this embodiment of the application, the computing results obtained from the first computing tasks can be stored in the second storage unit, and the computing result required by a second computing task can be prefetched before that task is executed. When the second computing task is computed, there is no need to wait for the computing result to be fetched; the prefetched result can be used directly, thereby achieving more efficient computation and making full use of the storage resources in the system.
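As an illustrative, non-limiting sketch of the flow described above, the following Python code uses two dictionaries to stand in for the first and second storage units and a single background thread to stand in for the transfer path. The names `run_with_prefetch`, `first_storage` and `second_storage` are hypothetical and do not appear in this application; a real system would use device and host memory with asynchronous copies.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_prefetch(first_tasks, second_tasks):
    """first_tasks: list of zero-argument callables, one result each.
    second_tasks: list of (needed_index, fn) pairs; fn consumes one result."""
    second_storage = {}  # stands in for the second storage unit
    first_storage = {}   # stands in for the first storage unit

    # Compute the first tasks and send every result to the second storage unit.
    for idx, task in enumerate(first_tasks):
        second_storage[idx] = task()

    def fetch(idx):
        # Move one result from the second storage unit to the first.
        first_storage[idx] = second_storage[idx]

    outputs = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, second_tasks[0][0]) if second_tasks else None
        for pos, (needed, fn) in enumerate(second_tasks):
            pending.result()  # the result for the current task has arrived
            if pos + 1 < len(second_tasks):
                # Start prefetching for the next task before computing this one.
                pending = pool.submit(fetch, second_tasks[pos + 1][0])
            # Eviction from first_storage is omitted for brevity.
            outputs.append(fn(first_storage[needed]))
    return outputs
```

In this sketch the fetch for the next second computing task is submitted before the current one is computed, so the transfer and the computation can proceed concurrently.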

[0010] In one possible implementation, computing the aforementioned second computing task while obtaining the first computing result corresponding to the next second computing task from the second storage unit and storing it in the first storage unit specifically includes: the first computing unit computes the second computing task during a first duration, and obtains the first computing result corresponding to the next second computing task from the second storage unit and stores it in the first storage unit during a second duration, where the first duration covers the second duration. In other words, within the time period covered by the first duration, the first computing unit can complete both the computation of the current second computing task and the prefetching of the computing result required by the next second computing task. The next second computing task can then be executed directly using the prefetched result, without waiting for it to be obtained. The waiting time between second computing tasks is therefore very short, which can improve the overall data processing efficiency.
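The benefit of the first duration covering the second duration can be illustrated with a simplified timing model. The sketch below is an assumption made for illustration only: it supposes that the fetch for the next second computing task starts at the same moment the current task starts computing, and that only one fetch is in flight at a time. The function name `backward_time` is hypothetical.

```python
def backward_time(compute, fetch, overlap=True):
    """compute[k]: first duration of the k-th second computing task.
    fetch[k]: second duration, i.e. transfer time of the result task k needs."""
    if len(compute) != len(fetch):
        raise ValueError("compute and fetch must have the same length")
    if not compute:
        return 0
    if not overlap:
        # Without prefetching, every fetch is paid for in full.
        return sum(compute) + sum(fetch)
    total = fetch[0]  # the first fetch cannot be hidden behind any computation
    for k in range(len(compute)):
        next_fetch = fetch[k + 1] if k + 1 < len(fetch) else 0
        # The step lasts as long as the slower of computing and prefetching.
        total += max(compute[k], next_fetch)
    return total
```

Under this model, with `compute = [4, 4, 4]` and `fetch = [2, 3, 1]`, every prefetch is covered by the preceding computation, so only the first fetch adds to the total. If one fetch takes longer than the computation it overlaps with, the difference appears as waiting time.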

[0011] In one possible implementation, the aforementioned first system can specifically be a neural network system, that is, a system that deploys a neural network or is used to perform calculations in a neural network. Accordingly, the aforementioned first and second computational tasks can be computational tasks of the neural network. Therefore, the embodiments of this application can be applied to neural network processing systems. For example, in large model processing scenarios, there may be a large number of computational tasks, which will correspondingly generate a large number of computational results. Through the method provided by the embodiments of this application, the storage resources in the system can be fully utilized to improve the overall processing efficiency of the neural network.

[0012] In one possible implementation, the aforementioned neural network includes at least one network layer, and each network layer includes at least one computation module. The first computation task is applied to the forward propagation of the neural network, and the computation result of the first computation task includes activation values, which may include the output values of the computation modules in the neural network. The second computation task is applied to the backpropagation of the neural network and includes gradient calculation tasks in backpropagation. This embodiment can therefore be applied to the training process of a neural network. Typically, backpropagation requires data generated during forward propagation. The method provided by this embodiment makes full use of multi-level storage resources to store data during forward propagation and prefetches that data during backpropagation, reducing the time backpropagation spends waiting for data. This can improve the overall training efficiency of the neural network and increase the utilization of storage resources and of communication resources between units within the system.

[0013] In one possible implementation, the aforementioned neural network includes at least one network layer, each network layer including at least one computing module, and the training of the neural network is divided into multiple training batches.

[0014] Accordingly, the first computation task may include computation tasks of the network layer, computation tasks of the computation module, or computation tasks of the training batch; the second computation task may include gradient computation tasks of the network layer, gradient computation tasks of the computation module, or gradient computation tasks of the training batch. In this embodiment, the calculation can be divided at different granularities, and the granularity adapted to the actual scenario can be adaptively selected to calculate the gradient value. This allows for adaptation to various scenarios and demonstrates strong generalization ability.

[0015] In one possible implementation, the aforementioned first computing unit computing the second computing task during a first duration, and retrieving the first computing result required for the next second computing task from the second storage unit during a second duration and storing it in the first storage unit, may include the following. During training, backpropagation typically calculates gradient values in the direction opposite to forward propagation. Therefore, the at least one first computing result can be retrieved from the second storage unit at the granularity of network layers, in the reverse order of the neural network structure. Where a first network layer is a network layer of the neural network arranged before a second network layer in the forward propagation direction, the first computing unit computes the gradient calculation task of the second network layer during the first duration, and retrieves the first computing result required for the gradient calculation task of the first network layer from the second storage unit during the second duration.

[0016] Therefore, when calculating the gradient of the first network layer, the gradient of the first network layer can be calculated directly based on the pre-fetched calculation results, reducing the gradient calculation interval between network layers, improving the overall efficiency of the first node when performing gradient backpropagation, and making full use of the transmission resources between the first computing unit and the second storage unit.
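The layer-granularity ordering described above can be sketched as a simple schedule. This is a minimal illustration under stated assumptions: layers are numbered in forward order starting at 0, and `offloaded` is a hypothetical set naming the layers whose activation values were sent to the second storage unit. The function name `backward_schedule` is likewise hypothetical.

```python
def backward_schedule(num_layers, offloaded):
    """Return (layer_to_compute, layer_to_prefetch) pairs in backward order.

    While the gradient of layer i is computed, the activation values of
    layer i - 1 are prefetched if they were offloaded; otherwise None.
    """
    steps = []
    for layer in range(num_layers - 1, -1, -1):
        prev = layer - 1
        prefetch = prev if prev >= 0 and prev in offloaded else None
        steps.append((layer, prefetch))
    return steps
```

For a four-layer network with layers 0 and 2 offloaded, the schedule pairs the gradient computation of layer 3 with the prefetch of layer 2, and the gradient computation of layer 1 with the prefetch of layer 0.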

[0017] In one possible implementation, the aforementioned first computing unit computing one of the second computing tasks during the first duration, and retrieving the calculation results required for the next second computing task from the second storage unit during the second duration and storing them in the first storage unit, includes: the first computing unit computes the gradient calculation task of the second computing module during the first duration, and retrieves the calculation results required for the gradient calculation task of the first computing module from the second storage unit during the second duration. In the forward propagation direction of the neural network, the first computing module is arranged before the second computing module, whereas in the backward propagation direction, the gradient values of the second computing module are calculated first, and then the gradient values of the first computing module are calculated. In this embodiment, activation values can be stored and transferred at the granularity of computing modules, that is, at a finer granularity. This allows the storage scheme for the activation values of each module to be obtained by solving an optimization problem, thereby achieving a shorter overall duration and improving the overall model FLOPs utilization (MFU) of model training. Furthermore, the activation values of the computing modules can be prefetched. Thus, when gradients are calculated in backpropagation order, after the gradient of one computing module has been calculated, the prefetched activation values can be used directly to calculate the gradient of the next computing module, reducing the interval between computing modules and improving gradient calculation efficiency.

[0018] In one possible implementation, the aforementioned first computing unit calculates one of the second computing tasks during a first duration, and retrieves the calculation results required for the next second computing task from the second storage unit during a second duration and stores them in the first storage unit. This includes: the first computing unit calculates the gradient calculation task for the first training batch during the first duration, and retrieves the calculation results required for the gradient calculation task for the second training batch from the second storage unit during the second duration. The first training batch and the second training batch are different training batches used to train the neural network using the training set. In this embodiment, activation values can be moved on a batch-by-batch basis, allowing for advance movement of activation values between training batches. This reduces waiting time caused by moving or recalculating activation values between training batches, improves the utilization of computing resources of the first node, and enhances model training efficiency.

[0019] In one possible implementation, the aforementioned first computing unit sends at least one first calculation result from a plurality of calculation results to the second storage unit, comprising: the first computing unit determining at least one first calculation result from the plurality of calculation results; and the first computing unit sending the at least one first calculation result to the second storage unit. In this embodiment, before the first computing unit sends the first calculation result to the second storage unit, the first calculation result can be selected from the plurality of calculation results, thereby selectively storing all or part of the calculation results as the first calculation result in the second storage unit.

[0020] In one possible implementation, the aforementioned method further includes: a first computing unit determining at least one second computing result from multiple computing results; and then the first computing unit storing the at least one second computing result in a first storage unit. In this embodiment, some computing results can also be stored in the first storage unit, which can reduce the time required to transfer these computing results and improve the efficiency of obtaining them.

[0021] In one possible implementation, the aforementioned method further includes: a first computing unit determining at least one third computing result from multiple computing results; the at least one third computing result being discarded; and before the first computing unit executes the second computing task corresponding to the third computing result, the first computing unit re-executes the first computing task to obtain the third computing result. In this embodiment, a third computing result can also be selected from multiple computing results; for example, a computing result with a very large footprint but a very small computational load can be used as the third computing result, thereby replacing a large storage resource consumption with a very low recomputation cost.
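Discarding a computing result and re-executing the first computing task before the corresponding second computing task can be illustrated with a toy two-step function. The sketch below is an assumption for illustration only: it uses the scalar function y = sin(x * x), treats the intermediate value `a = x * x` as the activation value, and uses `None` to model a discarded result. The function names are hypothetical.

```python
import math


def forward(x, keep_activation):
    a = x * x              # first computing task: produces the activation a
    y = math.sin(a)
    saved = a if keep_activation else None  # None models a discarded result
    return y, saved


def backward(x, saved):
    # If the activation was discarded, re-execute the first computing task.
    a = saved if saved is not None else x * x
    return math.cos(a) * 2 * x  # dy/dx by the chain rule
```

Both paths yield the same gradient; the discarded path trades the storage of `a` for one extra multiplication during backpropagation.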

[0022] In one possible implementation, if the aforementioned multiple computing results include a third computing result, the method further includes: the first computing unit recomputes the first computing task during a third duration to obtain the third computing result, where the sum of the third duration and the first duration covers the second duration. In this embodiment, the time spent recomputing the third computing result before executing the second computing task is also taken into account, so that the various possible situations are covered.
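The covering relationship among the three durations can be expressed as a one-line check. The following sketch assumes, for illustration, that the prefetch runs concurrently with both the recomputation and the gradient computation, so that any waiting time is the amount by which the second duration exceeds the sum of the first and third durations. The function name `prefetch_stall` is hypothetical.

```python
def prefetch_stall(first, second, third=0):
    """Waiting time left over when the prefetch (second duration) is not
    fully covered by computing (first) plus recomputing (third)."""
    busy = first + third  # time the first computing unit is occupied anyway
    return max(0, second - busy)
```

A result of zero corresponds to the case in which the sum of the third duration and the first duration covers the second duration.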

[0023] In one possible implementation, the aforementioned first computing unit determining at least one first computing result from multiple computing results includes: the first computing unit determines at least one first computing result, second computing result, or third computing result from the multiple computing results based on attribute information of the computing results, wherein the attribute information includes at least one of the space occupied by a computing result or its computational amount. In this embodiment, the storage scheme for the computing results can be chosen in view of the space they occupy or their computational amount, so that activation values with a larger computational amount (that is, the computational amount of recalculating the activation values) can be stored at the lowest possible transmission cost, thereby achieving a shorter overall end-to-end duration during training.
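One simple way to act on these two attributes is a threshold rule on the recomputation cost per byte. The sketch below is only an assumed heuristic and is not the selection method of this application; the thresholds `cheap` and `costly` and the labels returned are hypothetical.

```python
def classify(size_bytes, recompute_flops, cheap=1.0, costly=100.0):
    """Label one computing result using its size and recomputation cost."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    ratio = recompute_flops / size_bytes
    if ratio <= cheap:
        return "discard"     # third computing result: large, cheap to redo
    if ratio >= costly:
        return "keep_local"  # second computing result: avoid any transfer
    return "offload"         # first computing result: second storage unit
```

Results that are large but cheap to recompute are discarded, results that are expensive to recompute stay in the first storage unit, and the remainder are sent to the second storage unit.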

[0024] In one possible implementation, the aforementioned first computing unit determines at least one first calculation result from multiple calculation results based on the attribute information of the calculation result. This includes: the first computing unit constructing constraints based on the size of the first storage unit and the size of the second storage unit; under the constraints, the first computing unit determines a storage indication of the calculation result based on the attribute information of the calculation result and the duration of the second calculation task. The storage indication is used to indicate that the calculation result is one of the first calculation result, the second calculation result, or the third calculation result. This allows for a more suitable allocation of the storage location of the calculation result based on the available space size of the first and second storage units, thereby fully utilizing the available multi-level storage resources.
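The constrained selection described above can be sketched as a small exhaustive search. This is a minimal illustration under stated assumptions, not the optimization actually used: each computing result is a dictionary with hypothetical keys `size`, `recompute_cost` and `overlap_window` (the duration of the second computing task that can hide its transfer), `bandwidth` is an assumed transfer rate, and the cost being minimized is the total exposed time. Exhaustive search is only practical for a handful of results.

```python
from itertools import product

LOCAL, OFFLOAD, RECOMPUTE = "local", "offload", "recompute"


def plan_storage(results, cap_first, cap_second, bandwidth):
    """Return (plan, cost): one storage indication per result and the
    total exposed time of that plan, or (None, None) if nothing fits."""
    best_plan, best_cost = None, None
    for plan in product((LOCAL, OFFLOAD, RECOMPUTE), repeat=len(results)):
        pairs = list(zip(results, plan))
        used_first = sum(r["size"] for r, p in pairs if p == LOCAL)
        used_second = sum(r["size"] for r, p in pairs if p == OFFLOAD)
        if used_first > cap_first or used_second > cap_second:
            continue  # violates a capacity constraint
        cost = 0
        for r, p in pairs:
            if p == RECOMPUTE:
                cost += r["recompute_cost"]
            elif p == OFFLOAD:
                fetch = r["size"] / bandwidth
                # Only the part of the fetch not hidden by computing counts.
                cost += max(0, fetch - r["overlap_window"])
        if best_cost is None or cost < best_cost:
            best_plan, best_cost = plan, cost
    return best_plan, best_cost
```

A production system would more plausibly hand the same constraints and objective to an integer programming or dynamic programming solver.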

[0025] In one possible implementation, if the storage indication indicates the first storage unit, the first computing unit sends the calculation result to the first storage unit; or, if the storage indication indicates that the calculation result is to be discarded, the first computing unit discards the calculation result.

[0026] In one possible implementation, the aforementioned first computing unit includes a graphics processing unit (GPU), a neural network processing unit (NPU), or a tensor processing unit (TPU).

[0027] In one possible implementation, the aforementioned second storage unit is coupled to a second computing unit. For example, the second storage unit may be mounted on the second computing unit or managed by the second computing unit. The first computing unit sending the first calculation result to the second storage unit includes: the first computing unit sending the first calculation result to the second storage unit through the second computing unit. For example, the first computing unit sends the first calculation result to the second computing unit, and the second computing unit sends it on to the second storage unit; or the first computing unit obtains an available address of the second storage unit from the second computing unit and sends the first calculation result to the second storage unit through that address.

[0028] Optionally, the second computing unit may specifically be a central processing unit (CPU).

[0029] In one possible implementation, the aforementioned second storage unit further includes a disk.

[0030] In a second aspect, embodiments of this application provide a data processing method applied to a second storage unit in a first system, the first system including a first computing unit, a first storage unit and the second storage unit, the first computing unit being coupled to the first storage unit;

[0031] The second storage unit receives at least one first calculation result, which is the result of the first computing unit calculating the first computing task. The first computing unit calculates the second computing task based on the obtained first calculation result. When the second computing task is being calculated, the second storage unit sends the first calculation result corresponding to the next second computing task to the first computing unit.

[0032] For the effects achieved by the second aspect or any optional implementation of the second aspect, reference may be made to the corresponding descriptions of the first aspect or any optional implementation of the first aspect; they are not repeated here.

[0033] In one possible implementation, the aforementioned first system may specifically be a neural network system, that is, a system in which a neural network is deployed or used to perform computations in a neural network. Accordingly, the aforementioned first computation task and second computation task may be computation tasks of the neural network.

[0034] In one possible implementation, the aforementioned first computational task is applied to the forward propagation of the neural network, and the computation result of the first computational task includes activation values. The second computational task is applied to the backward propagation of the neural network, and the second computational task includes gradient computation tasks in the backward propagation.

[0035] In one possible implementation, the aforementioned neural network includes at least one network layer, each network layer including at least one computation module, and the training of the neural network is divided into multiple training batches; accordingly, the first computation task may include the computation task of the network layer, the computation task of the computation module, or the computation task of the training batch; the second computation task may include the gradient computation task of the network layer, the gradient computation task of the computation module, or the gradient computation task of the training batch.

[0036] In one possible implementation, the aforementioned second storage unit sending the first calculation result corresponding to the next second computing task to the first computing unit while the current second computing task is being computed may specifically include: while the first computing unit computes the gradient calculation task of the second network layer during a first duration, the second storage unit sends the first calculation result required for the gradient calculation task of the first network layer to the first computing unit during a second duration.

[0037] In one possible implementation, the aforementioned second storage unit sending the first calculation result corresponding to the next second computing task to the first computing unit while the current second computing task is being computed may specifically include: while the first computing unit computes the gradient calculation task of the second computing module during a first duration, the second storage unit sends the first calculation result required for the gradient calculation task of the first computing module to the first computing unit during a second duration.

[0038] In one possible implementation, the aforementioned second storage unit sending the first calculation result corresponding to the next second computing task to the first computing unit while the current second computing task is being computed may specifically include: while the first computing unit computes the gradient calculation task of the first training batch during a first duration, the second storage unit sends the first calculation result required for the gradient calculation task of the second training batch to the first computing unit during a second duration.

[0039] In one possible implementation, the aforementioned first calculation module is specifically used to: construct constraints based on the size of the first storage unit and the size of the second storage unit; under the constraints, determine a storage indication of the calculation result based on the attribute information of the calculation result and the duration of the second calculation task, wherein the storage indication is used to indicate that the calculation result is one of the first calculation result, the second calculation result, or the third calculation result. This allows for a more suitable allocation of the storage location of the calculation result based on the available space size of the first and second storage units, thereby fully utilizing the available multi-level storage resources.

[0040] In one possible implementation, if the storage indication indicates the first storage unit, the first computing unit sends the calculation result to the first storage unit; or, if the storage indication indicates that the calculation result is to be discarded, the first computing unit discards the calculation result.

[0041] In one possible implementation, the aforementioned first computing unit includes a graphics processing unit (GPU), a neural network processing unit (NPU), or a tensor processing unit (TPU).

[0042] In one possible implementation, the aforementioned second storage unit is coupled to the second computing unit, and the first computing unit sends the calculation result to the second storage unit, including: the first computing unit sending the calculation result to the second storage unit through the second computing unit.

[0043] Optionally, the second computing unit may specifically be a central processing unit (CPU), with the second storage unit mounted within the CPU.

[0044] In a third aspect, this application provides a first computing unit, wherein the first computing unit, a first storage unit and a second storage unit are deployed in a first system, and the first computing unit is coupled to the first storage unit; the first computing unit includes:

[0045] The first calculation module is used to obtain multiple calculation results, which are the results of calculating multiple first calculation tasks.

[0046] The storage module is used to send at least one first calculation result from a plurality of calculation results to the second storage unit;

[0047] The second calculation module is used to calculate the second calculation task based on the obtained first calculation result. When the second calculation task is calculated, the first calculation result corresponding to the next second calculation task is obtained from the second storage unit and stored in the first storage unit. The second calculation task corresponds to the calculation result of at least one first calculation task.

[0048] In one possible implementation, the aforementioned second computing module is specifically used to compute a second computing task during a first duration, and to retrieve the first computing result corresponding to the next second computing task from the second storage unit during a second duration and store it in the first storage unit, with the first duration covering the second duration.

[0049] In one possible implementation, the aforementioned first computational task is applied to the forward propagation of the neural network, and the computation result of the first computational task includes activation values. The second computational task is applied to the backward propagation of the neural network, and the second computational task includes gradient computation tasks in the backward propagation.

[0050] In one possible implementation, the aforementioned first computation task includes a computation task of the network layer, a computation task of the computation module, or a computation task of the training batch; the second computation task includes a gradient computation task of the network layer, a gradient computation task of the computation module, or a gradient computation task of the training batch.

[0051] In one possible implementation, the aforementioned second calculation module is specifically used to calculate the gradient calculation task of the second network layer during a first duration, and to obtain the first calculation result required for the gradient calculation task of the first network layer from the second storage unit during a second duration.

[0052] In one possible implementation, the aforementioned second computing module is specifically used to calculate the gradient calculation task of the second computing module during a first duration, and to obtain the calculation results required by the gradient calculation task of the first computing module from the second storage unit during a second duration.

[0053] In one possible implementation, the aforementioned second calculation module is specifically used to calculate the gradient calculation task of the first training batch during a first duration, and to obtain the calculation results required for the gradient calculation task of the second training batch from the second storage unit during a second duration.

[0054] In one possible implementation, the aforementioned first calculation module is further configured to determine at least one first calculation result from a plurality of calculation results;

[0055] The storage module is specifically used to send at least one first calculation result to the second storage unit.

[0056] In one possible implementation, the aforementioned first calculation module is further configured to determine at least one second calculation result from a plurality of calculation results;

[0057] The storage module is specifically used to store at least one second calculation result in the first storage unit.

[0058] In one possible implementation, the aforementioned first calculation module is further configured to determine at least one third calculation result from multiple calculation results. At least one third calculation result is discarded. Before the first calculation unit executes the second calculation task corresponding to the third calculation result, the first calculation unit re-executes the first calculation task to obtain the third calculation result.

[0059] In one possible implementation, the aforementioned second calculation module is further configured to, if the multiple calculation results include a third calculation result, recalculate the first calculation task during a third duration to obtain the third calculation result, where the sum of the third duration and the first duration covers the second duration.

[0060] In one possible implementation, the aforementioned first calculation module is specifically used to determine at least one first calculation result from multiple calculation results based on the attribute information of the calculation result, wherein the attribute information includes at least one of the space occupied by the calculation result or the amount of calculation.

[0061] In one possible implementation, the aforementioned first computing unit constructs constraints based on the size of the first storage unit and the size of the second storage unit;

[0062] Under these constraints, the first computing unit determines the storage indication of the computing result based on the attribute information of the computing result and the duration of computing the second computing task. The storage indication is used to indicate that the computing result is one of the first computing result, the second computing result, or the third computing result.
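As a rough illustration of how such a storage indication might be derived, the following Python sketch labels each calculation result as a second calculation result (kept in the first storage unit), a first calculation result (sent to the second storage unit), or a third calculation result (discarded and recomputed). The greedy ranking rule, the names `ResultInfo` and `assign_storage`, and the example capacities are assumptions made purely for illustration; the sketch also ignores the duration of the second computing task, which the implementation above takes into account.

```python
from dataclasses import dataclass

@dataclass
class ResultInfo:
    name: str
    size: int              # space occupied by the calculation result (arbitrary units)
    recompute_cost: float  # amount of calculation needed to regenerate it

# Hypothetical labels for the three kinds of calculation result.
KEEP, OFFLOAD, RECOMPUTE = "second", "first", "third"

def assign_storage(results, first_capacity, second_capacity):
    """Greedy, illustrative policy: results that are most expensive to recompute
    per unit of size are kept locally first, then offloaded, otherwise recomputed."""
    indication = {}
    free_first, free_second = first_capacity, second_capacity
    ranked = sorted(results, key=lambda r: r.recompute_cost / r.size, reverse=True)
    for r in ranked:
        if r.size <= free_first:
            indication[r.name] = KEEP
            free_first -= r.size
        elif r.size <= free_second:
            indication[r.name] = OFFLOAD
            free_second -= r.size
        else:
            indication[r.name] = RECOMPUTE
    return indication

results = [
    ResultInfo("act_a", size=4, recompute_cost=8.0),
    ResultInfo("act_b", size=4, recompute_cost=4.0),
    ResultInfo("act_c", size=4, recompute_cost=1.0),
]
plan = assign_storage(results, first_capacity=4, second_capacity=4)
```

With the capacities assumed here, only one result fits in each storage unit, so the cheapest result to regenerate is the one that ends up being recomputed.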

[0063] In one possible implementation, the aforementioned first computing unit includes a graphics processing unit (GPU), a neural network processing unit (NPU), or a tensor processing unit (TPU).

[0064] In one possible implementation, the aforementioned second storage unit is coupled to the second computing unit. For example, the second storage unit may be mounted on the second computing unit or managed by the second computing unit. The storage module is specifically used to send the first calculation result to the second storage unit through the second computing unit.

[0065] Optionally, the second computing unit may specifically be a central processing unit (CPU).

[0066] In one possible implementation, the aforementioned second storage unit further includes a disk.

[0067] Fourthly, this application provides a second storage unit applied to a first system, the first system including a first computing unit, a first storage unit and a second storage unit, the first computing unit being coupled to the first storage unit;

[0068] The second storage unit includes:

[0069] A transceiver module is used to receive at least one first calculation result, which is the result of the first computing unit calculating the first calculation task.

[0070] A storage module is used to store the at least one first calculation result;

[0071] The transceiver module is also used to send, while the first computing unit is computing the current second calculation task based on the first calculation result it has obtained, the first calculation result corresponding to the next second calculation task to the first computing unit.

[0072] In one possible implementation, the aforementioned first system may specifically be a neural network system, that is, a system in which a neural network is deployed or used to perform computations in a neural network. Accordingly, the aforementioned first computation task and second computation task may be computation tasks of the neural network.

[0073] In one possible implementation, the aforementioned first computational task is applied to the forward propagation of the neural network, and the computation result of the first computational task includes activation values. The second computational task is applied to the backward propagation of the neural network, and the second computational task includes gradient computation tasks in the backward propagation.

[0074] In one possible implementation, the aforementioned neural network includes at least one network layer, each network layer including at least one computation module, and the training of the neural network is divided into multiple training batches; accordingly, the first computation task may include the computation task of the network layer, the computation task of the computation module, or the computation task of the training batch; the second computation task may include the gradient computation task of the network layer, the gradient computation task of the computation module, or the gradient computation task of the training batch.

[0075] In one possible implementation, the aforementioned transceiver module is specifically used to: when the first computing unit calculates the gradient calculation task of the second network layer for a first duration, the second storage unit sends the first calculation result required for the gradient calculation task of the first network layer to the first computing unit for a second duration.

[0076] In one possible implementation, the aforementioned transceiver module is specifically used to: when the first computing unit calculates the gradient calculation task of the second computing module for a first duration, the second storage unit sends the first calculation result required by the gradient calculation task of the first computing module to the first computing unit for a second duration.

[0077] In one possible implementation, the aforementioned transceiver module is specifically used to: when the first computing unit calculates the gradient calculation task of the first training batch during the first duration, the second storage unit sends the first calculation result required for the gradient calculation task of the second training batch to the first computing unit during the second duration.
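The overlap described in the preceding implementations, in which the result needed by the next gradient task is transferred while the current gradient task is being computed, can be sketched as follows. This is a minimal Python illustration only: the dictionaries standing in for the first and second storage units, the function name `backward_with_prefetch`, and the use of a thread pool to model the transfer are all assumptions rather than the application's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def backward_with_prefetch(layer_order, second_storage, compute_gradient):
    """Run gradient tasks in `layer_order`, fetching the next task's activation
    from `second_storage` while the current gradient task is being computed."""
    first_storage = {}   # stands in for the accelerator's own memory
    gradients = {}
    with ThreadPoolExecutor(max_workers=1) as pool:
        # The activation for the very first gradient task is fetched up front.
        first_storage[layer_order[0]] = second_storage[layer_order[0]]
        for i, layer in enumerate(layer_order):
            pending = None
            if i + 1 < len(layer_order):
                nxt = layer_order[i + 1]
                # Start the transfer for the next task (the "second duration").
                pending = pool.submit(second_storage.__getitem__, nxt)
            # Compute the current gradient task (the "first duration").
            gradients[layer] = compute_gradient(layer, first_storage.pop(layer))
            if pending is not None:
                first_storage[nxt] = pending.result()
    return gradients

second_storage = {"layer2": [1.0, 2.0], "layer1": [3.0, 4.0]}
grads = backward_with_prefetch(
    ["layer2", "layer1"],
    second_storage,
    compute_gradient=lambda layer, act: sum(act),
)
```

In this sketch the transfer is hidden behind computation whenever it finishes before the current gradient task does; otherwise `pending.result()` blocks until the data arrives.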

[0078] The effects achieved by any of the optional embodiments of the third and fourth aspects can be referred to the description of the first aspect or any optional embodiment of the first aspect, and will not be repeated hereafter.

[0079] Fifthly, embodiments of this application provide a first system, including a first computing unit, a first storage unit, and a second storage unit. The first computing unit is coupled to the first storage unit, and the first computing unit can be used to execute the method steps as described in the first aspect or any optional implementation of the first aspect; the second storage unit can be used to execute the method steps as described in the second aspect or any optional implementation of the second aspect.

[0080] In a sixth aspect, embodiments of this application provide a digital processing chip or chip, the chip including a processing unit and a communication interface, the processing unit obtaining program instructions through the communication interface, the program instructions being executed by the processing unit, the processing unit being used to perform processing-related functions as described in any optional embodiment of the first or second aspect above.

[0081] In a seventh aspect, embodiments of this application provide a chip system including a processor for supporting devices in implementing the functions involved in any of the implementations of the first or second aspect described above, such as processing data and / or information involved in the methods described above.

[0082] In one possible design, the chip system also includes a memory for storing necessary program instructions and data for the electronic device. This chip system can be composed of chips or may include chips and other discrete components.

[0083] Eighthly, embodiments of this application provide a computer-readable storage medium including instructions that, when executed on a computer, cause the computer to perform the method in any of the optional embodiments of the first or second aspect described above.

[0084] Ninthly, embodiments of this application provide a computer program product comprising a computer program / instructions, which, when executed by a processor, causes the processor to perform the method in any of the optional embodiments of the first or second aspect described above.

Brief Description of the Drawings

[0085] Figure 1 is a schematic diagram of the architecture of a first system provided in an embodiment of this application;

[0086] Figure 2 is a schematic diagram of a server cluster structure provided in an embodiment of this application;

[0087] Figure 3 is a schematic diagram of a neural network structure mentioned in an embodiment of this application;

[0088] Figure 4 is a schematic diagram of another neural network structure provided in an embodiment of this application;

[0089] Figure 5 is a schematic diagram of the structure of a cloud server system mentioned in an embodiment of this application;

[0090] Figure 6 is a flowchart illustrating a data processing method mentioned in an embodiment of this application;

[0091] Figure 7 is a schematic diagram of an application scenario of a method provided in an embodiment of this application;

[0092] Figure 8 is a schematic diagram of another neural network structure provided in an embodiment of this application;

[0093] Figure 9 is a schematic diagram of another neural network structure provided in an embodiment of this application;

[0094] Figure 10 is a schematic diagram of an application scenario of another method provided in the embodiments of this application;

[0095] Figure 11 is a schematic diagram of an application scenario of another method provided in the embodiments of this application;

[0096] Figure 12 is a schematic diagram of an application scenario of another method provided in the embodiments of this application;

[0097] Figure 13 is a schematic diagram of an application scenario of another method provided in the embodiments of this application;

[0098] Figure 14 is a schematic diagram of an application scenario of another method provided in the embodiments of this application;

[0099] Figure 15 is a schematic diagram of an application scenario of another method provided in the embodiments of this application;

[0100] Figure 16 is a schematic diagram of an application scenario of another method provided in the embodiments of this application;

[0101] Figure 17 is a schematic diagram of an application scenario of another method provided in the embodiments of this application;

[0102] Figure 18 is a schematic diagram of the structure of a first computing unit provided in an embodiment of this application;

[0103] Figure 19 is a schematic diagram of the structure of a second computing unit provided in an embodiment of this application;

[0104] Figure 20 is a schematic diagram of the structure of a computing device provided in an embodiment of this application;

[0105] Figure 21 is a schematic diagram of the structure of a chip provided in an embodiment of this application.

Detailed Description

[0106] The technical solutions of the embodiments of this application will be described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.

[0107] For ease of understanding, some terms or concepts involved in the embodiments of this application will be introduced.

[0108] (1) Neural Network

[0109] A neural network can be composed of neural units. A neural unit can refer to an arithmetic unit that takes $x_s$ and an intercept of 1 as inputs, and the output of the arithmetic unit can be represented as follows:

$$h_{W,b}(x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

[0110] Where s = 1, 2, ..., n, n is a natural number greater than 1, $W_s$ is the weight of $x_s$, and b is the bias of the neural unit. f is the activation function of the neural unit, used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer; the activation function can be the sigmoid function. A neural network is a network formed by connecting multiple of the above-mentioned individual neural units together; that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field, which can be a region composed of several neural units.
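The weighted sum followed by an activation function described above can be sketched in a few lines of Python. The function names `sigmoid` and `neural_unit` and the example inputs are illustrative assumptions.

```python
import math

def sigmoid(z):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def neural_unit(x, w, b, f=sigmoid):
    """Output f(sum_s w[s] * x[s] + b) of a single neural unit."""
    return f(sum(ws * xs for ws, xs in zip(w, x)) + b)

# The weighted sum is 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0, and sigmoid(0.0) = 0.5.
out = neural_unit(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0)
```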

[0111] (2) Deep Neural Networks

[0112] A deep neural network (DNN), also known as a multilayer neural network, can be understood as a neural network with multiple intermediate layers. Based on the position of these layers, the internal neural network of a DNN can be divided into three categories: input layer, intermediate layers, and output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are considered intermediate layers, or hidden layers.

[0113] Although DNNs appear complex, each layer can be represented as a simple linear relational expression: $\vec{y} = \alpha(W\vec{x} + \vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, also known as the bias parameter; $W$ is the weight matrix (also called coefficients); and $\alpha()$ is the activation function. Each layer simply applies this operation to the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because DNNs have many layers, the number of coefficients $W$ and offset vectors $\vec{b}$ is also quite large. These parameters are defined in a DNN as follows, taking the coefficient $W$ as an example: assuming a three-layer DNN, the linear coefficient from the 4th neuron in the second layer to the 2nd neuron in the third layer is defined as $W^{3}_{24}$. The superscript 3 represents the layer in which the coefficient $W$ is located, while the subscript corresponds to the output index 2 in the third layer and the input index 4 in the second layer.

[0114] In summary, the coefficient from the k-th neuron in layer L-1 to the j-th neuron in layer L is defined as $W^{L}_{jk}$.

[0115] It's important to note that the input layer does not have a W parameter. In deep neural networks, more intermediate layers allow the network to better represent complex real-world situations. Theoretically, the more parameters a model has, the higher its complexity and "capacity," meaning it can perform more complex learning tasks. Training a deep neural network is essentially the process of learning the weight matrix, with the ultimate goal of obtaining the weight matrix of all layers in the trained deep neural network (a weight matrix formed by the vectors W from many layers).
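To make the per-layer operation and the coefficient indexing concrete, the following Python sketch applies an activation to a weighted sum plus offset, layer by layer. The choice of ReLU as the activation, the function names, and the example weights are assumptions for illustration; note that the code uses 0-based indices, so `W[j][k]` corresponds to the coefficient from input neuron k+1 to output neuron j+1 in the 1-based convention of the text.

```python
def relu(v):
    """Element-wise ReLU, used here as an example activation function."""
    return [max(0.0, t) for t in v]

def layer_forward(W, x, b, alpha=relu):
    """Compute alpha(W x + b); W[j][k] links input neuron k to output neuron j."""
    z = [sum(W[j][k] * x[k] for k in range(len(x))) + b[j] for j in range(len(W))]
    return alpha(z)

def dnn_forward(layers, x):
    """Apply each (W, b) pair in turn; the input layer itself has no W."""
    for W, b in layers:
        x = layer_forward(W, x, b)
    return x

layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 1.0]),  # 2 inputs -> 2 hidden neurons
    ([[2.0, 1.0]], [-1.0]),                   # 2 hidden neurons -> 1 output
]
y = dnn_forward(layers, [3.0, 1.0])
```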

[0116] (3) Large Model

[0117] Large models are large-scale models. The "large" in large models can be reflected in many aspects, such as large data scale, large-scale parallel computing capabilities, and larger model structures.

[0118] (4) Language model (LM)

[0119] Language models play a crucial role in Natural Language Processing (NLP), where their task is to predict the probability of a sentence occurring in a language. For example, a language model is typically constructed as a probability distribution p(s) of a string s, where p(s) attempts to reflect the frequency of string s as a sentence. It can be applied to scenarios such as text recognition or machine translation. In the embodiments of this application, the NLP models mentioned below include language models.

[0120] (5) Large Language Model (LLM)

[0121] A large language model (LLM) refers to a language model containing hundreds of billions (or more) parameters trained on massive amounts of text data. It is a natural language processing model based on deep learning. These models can process large amounts of text data to learn the grammatical and semantic rules of natural language. LLMs can be applied to text generation, machine translation, question answering systems, text summarization, and sentiment analysis, offering advantages such as strong generative capabilities, high adaptability, accurate prediction, and strong scalability. For example, in movie recommendation scenarios, a large language model can generate descriptions of movie scenes, including genre, main actors, and plot, enabling the system to better recommend similar movies. Large language models can also generate recommendation reasons; for example, e-commerce websites can use large language models to generate reasons for recommending products, such as product quality, price, and features, allowing users to better understand the value of the product.

[0122] (6) transformer

[0123] A transformer architecture is a feature extraction network that includes both an encoder and a decoder (a category of network distinct from convolutional neural networks). Of course, in some cases, a transformer architecture may not include an encoder but may include a decoder. The aforementioned language models or large language models can be models based on a transformer architecture.

[0124] (7) Attention mechanism

[0125] Attention mechanisms can quickly extract important features from sparse data. They provide an effective modeling approach for capturing global contextual information through QKV (Queries, Keys, Values). Assuming the input is Q (query) and the context is stored as key-value pairs (K, V), the attention mechanism is essentially a mapping function from the query to a series of key-value pairs. Attention assigns a weight coefficient to each element in the sequence, which can also be understood as soft addressing. If each element in the sequence is stored in (K, V) form, then attention performs addressing by calculating the similarity between Q and K. The similarity calculated between Q and K reflects the importance of the corresponding V values, i.e., the weights, and a weighted sum of the V values then gives the final feature value.

[0126] Attention calculation mainly consists of three steps. The first step is to calculate the similarity between the query and each key to obtain weights. Common similarity functions include dot product, concatenation, and perceptron. The second step typically uses a softmax function to normalize these weights; this yields a probability distribution in which all weight coefficients sum to 1 and also highlights the weights of important elements. Finally, the weights and their corresponding values are weighted and summed to obtain the final feature value. The specific calculation formula is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$

[0127] Where d is the dimension of matrices Q and K.
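The three steps above (similarity, softmax normalization, weighted sum) can be sketched in plain Python. The sketch assumes the widely used scaled dot-product form, in which the dot-product similarity is divided by the square root of d before the softmax; the function names and example matrices are illustrative.

```python
import math

def softmax(scores):
    """Normalize scores into weights that sum to 1 (shifted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V are lists of row vectors."""
    d = len(K[0])
    out = []
    for q in Q:
        # Step 1: similarity between the query and each key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        # Step 2: normalize the similarities into weights.
        weights = softmax(scores)
        # Step 3: weighted sum of the values.
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [1.0, 0.0]]
V = [[2.0, 0.0], [4.0, 2.0]]
result = attention(Q, K, V)
```

Because both keys are identical in this example, the query is equally similar to each, and the output is the plain average of the two value rows.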

[0128] Furthermore, attention includes self-attention and cross-attention. Self-attention can be understood as a special type of attention in which the Q, K, and V features are derived from the same input, whereas in cross-attention they are derived from different inputs. Attention integrates the queried features as updated values for the current features, using the similarity between features (e.g., inner product) as weights. Self-attention is attention computed from the feature map itself.

[0129] For convolutional networks, the kernel size limits the receptive field, often requiring multiple layers to focus on the entire feature map. Self-attention, on the other hand, has the advantage of global focus; it can obtain global spatial information about the feature map through simple lookups and assignments.

[0130] (8) Loss Function

[0131] In training deep neural networks, to ensure the output closely approximates the desired predicted value, we compare the network's prediction with the target value and update the weight vector of each layer based on the difference. (Of course, there's usually an initialization process before the first update, pre-configuring parameters for each layer.) For example, if the network's prediction is too high, the weight vector is adjusted to predict a lower value. This adjustment continues until the deep neural network can predict the target value or a value very close to it. Therefore, we need to predefine "how to compare the difference between the predicted and target values," which is the loss function or objective function. These are important equations used to measure the difference between the predicted and target values. Taking the loss function as an example, a higher output value (loss) indicates a greater difference, so training the deep neural network becomes a process of minimizing this loss. Common loss functions include mean squared error, cross-entropy, logarithmic, and exponential loss functions. For example, mean squared error can be used as the loss function, defined as... The specific loss function can be selected based on the actual application scenario.
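For reference, mean squared error is commonly written as shown below. The symbols are notation assumed here rather than taken from this application: N is the number of samples, $y_i$ is the target value of the i-th sample, and $\hat{y}_i$ is the corresponding predicted value.

```latex
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2
```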

[0132] (9) Backpropagation algorithm

[0133] An algorithm for calculating the gradient of model parameters based on a loss function and updating the model parameters, where the gradient is the derivative vector of the loss function with respect to the parameters. Neural networks can use backpropagation (BP) to correct the initial parameter values during training, thus reducing the reconstruction error loss. Specifically, forward propagation (also called the forward pass) of the input signal to the output generates an error loss. Backpropagating this error loss information updates the initial parameters of the neural network model, leading to convergence of the error loss. The backpropagation algorithm is an error-loss-driven backpropagation process aimed at obtaining the optimal parameters of the neural network model, such as the weight matrix. Backpropagation includes gradient backpropagation and parameter updates. Gradient backpropagation, also known as gradient reverse propagation, refers to calculating the gradient values of each parameter in reverse order (i.e., the reverse of forward propagation). Parameter updates involve using the calculated gradient values to further calculate new parameters, which are then used as the model parameters output in the next iteration of training.
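The three phases described above (forward propagation, gradient backpropagation in reverse order, and parameter update) can be sketched for a single neuron. The one-parameter model, the ReLU activation, the squared-error loss, and the learning rate are all assumptions chosen to keep the example small; the point to note is that the intermediate values `z` and `a` produced in the forward pass are needed again in the backward pass.

```python
def train_step(w, b, x, target, lr):
    # Forward propagation: keep the intermediate values needed later.
    z = w * x + b          # pre-activation
    a = max(0.0, z)        # ReLU activation value
    loss = (a - target) ** 2

    # Gradient backpropagation, in the reverse order of the forward pass.
    dloss_da = 2.0 * (a - target)
    da_dz = 1.0 if z > 0 else 0.0
    dz = dloss_da * da_dz
    grad_w = dz * x
    grad_b = dz

    # Parameter update using the computed gradients.
    return w - lr * grad_w, b - lr * grad_b, loss

w, b, loss = train_step(w=1.0, b=0.0, x=2.0, target=1.0, lr=0.25)
```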

[0134] (10) Activation value

[0135] The intermediate values generated during the forward propagation of a neural network are called activation values. In a neural network, the output value of an operation can be called an activation value, which can be a numerical value or a vector.

[0136] (11) Storage-related terms

[0137] Copying data from the CPU (Host) to the XPU Device is called Host to Device, or H2D for short.

[0138] Copying data from the XPU Device to the CPU (Host) is called Device to Host, or D2H for short.

[0139] High bandwidth memory (HBM);

[0140] Double data rate synchronous dynamic random-access memory (DDR SDRAM);

[0141] XPU: Generally refers to a dedicated accelerator for neural networks, such as a graphics processing unit (GPU), an embedded neural network processing unit (NPU), or a tensor processing unit (TPU), etc.

[0142] XPU Memory: Device storage space, also known as video memory, such as HBM;

[0143] CPU Memory: The host's storage space, such as DDR SDRAM memory;

[0144] Swap: A general term for H2D and D2H.

[0145] The method provided in this application can be applied to the model training process and can include the training of various models, such as deep neural networks, large-scale deep neural networks, or large language models.

[0146] In common model training processes, especially in large-scale neural network training, a large number of activation values are typically generated during forward propagation. During backpropagation, these activation values are used to calculate updated parameters. For large-scale neural networks, the number of activation values is enormous, potentially dozens of times the number of neural network parameters, thus requiring a significant amount of storage space. However, the storage space of the primary computing unit is usually limited, for example, the memory of a graphics processing unit (GPU) when model training is performed by the GPU. When GPU memory is limited, one existing solution recalculates activation values during backpropagation; that is, activation values generated during forward propagation are discarded and recalculated when needed during backpropagation. This solution incurs an additional computational overhead of 30% or more, reducing the model FLOPs utilization (MFU) of the computing cluster. Another existing solution stores a portion of the activation values in the GPU's storage space, discards the excess, and recalculates the discarded activation values during backpropagation. However, this solution is limited by the existing capacity constraints and lacks flexibility. Yet another existing solution addresses the memory constraints of training large models by deploying the optimizer on the CPU: forward propagation and gradient calculation are performed on the GPU, the gradients are transmitted to the CPU, the updated parameters are calculated on the CPU, and the calculated parameters are then transmitted back to the GPU. However, this approach incurs significant communication overhead, reducing the overall MFU of the training system. Furthermore, practical testing has shown that its training efficiency is low, and the link between the GPU and CPU remains idle during computation, resulting in low resource utilization.

[0147] Therefore, this application provides a model training method that can combine multiple storage resources to store activation values, improve the utilization rate of storage resources in the device, maximize the use of available resources, and minimize computational overhead.

[0148] First, the method provided in this application can be deployed on server clusters, cloud platforms, terminals, or other systems with computing capabilities. This system may include one or more computing units and multiple storage units. The computing units can be used to perform computing tasks, and the storage units can be used to store data.

[0149] For example, the structure of the system used in the embodiments of this application can be as shown in Figure 1. The system can also be called the first system. The first system 10 may include a first computing unit 101, a second computing unit 102, a first storage unit 103 and one or more second storage units 104, etc.

[0150] The first computing unit 101 is coupled to the first storage unit 103, and the first computing unit can be used to perform computing tasks. The system may include one or more first computing units 101 and one or more first storage units 103; no specific limitation is imposed on their number.

[0151] The second storage unit 104 can be an independently deployed storage device in the system, such as a disk or solid-state drive, or it can be a storage device mounted on a management device in the system, such as memory or a disk mounted on the CPU.

[0152] For example, in one possible scenario, the second computing unit 102 may be coupled to one of the second storage units 104. This second computing unit is optional, and the system may not necessarily include it.

[0153] Typically, for a second storage unit mounted on a second computing unit, the first computing unit needs to access the second storage unit through the second computing unit; however, for a second storage unit deployed independently in the system, such as a disk, the first computing unit can access it directly.

[0154] The aforementioned first storage unit may specifically be a memory deployed in a GPU, TPU, NPU, or other dedicated neural network accelerator, such as the GPU's video memory. The accelerator can be used to execute the steps performed by the first computing unit in the data processing method provided in the embodiments of this application.

[0155] The aforementioned second computing unit may specifically be a CPU or other devices.

[0156] The second storage unit can be a storage unit mounted on the CPU, such as memory mounted on the CPU or a disk managed by the CPU.

[0157] More specifically, the aforementioned system can be applied to a variety of scenarios, such as directly to server clusters or cloud service systems, and can be used to perform large-scale data operations or train neural networks.

[0158] For example, in one possible implementation, the system can be applied to a server cluster. As shown in Figure 2, the server cluster can contain multiple servers, each of which can include one or more first computing units and multiple second storage units. In Figure 2, the first computing unit is a GPU, and the second storage units include CPU-mounted storage units or independently deployed storage units. Taking a neural network training scenario as an example, the neural network training process can be distributed across multiple servers for execution, and the neural network training can be performed in parallel by multiple first computing units on each server.

[0159] For example, as shown in Figure 3, a neural network can include one or more network layers, such as Layer 1 to Layer m, where m is a positive integer. Each network layer can include one or more computation modules, such as computation module 1 to computation module n, as shown in Figure 3. Each module can include one or more operations, and after each operation, an activation value can be output. Combining the system architecture shown in Figures 1 and 2, the system can include one or more first computation units. When there are multiple first computation units, these multiple first computation units can compute the neural network in parallel. Specifically, the network layers or computation modules in the neural network can be split. As shown in the neural network structure in Figure 3, the neural network can be divided horizontally or vertically. For example, each computation unit can perform the computation of one or more network layers, such as assigning Layer 1 to one first computation unit, assigning Layer 2 to another first computation unit, and so on. Or, each computation unit can perform the computation of one or more modules, such as assigning computation module 1 to one first computation unit, assigning computation module 2 to another first computation unit, and so on.
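One simple way to split network layers across several first computing units is to give each unit a contiguous block of layers, as sketched below in Python. The function name `split_layers` and the balancing rule (earlier units receive any leftover layers) are assumptions for illustration; the application does not prescribe a particular assignment, and the same idea applies to computation modules.

```python
def split_layers(num_layers, num_units):
    """Assign contiguous blocks of layers (1-based) to computing units (0-based)."""
    base, extra = divmod(num_layers, num_units)
    assignment, start = {}, 1
    for unit in range(num_units):
        # The first `extra` units each take one additional layer.
        count = base + (1 if unit < extra else 0)
        assignment[unit] = list(range(start, start + count))
        start += count
    return assignment

plan = split_layers(num_layers=5, num_units=2)
```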

[0160] Furthermore, network layers can be connected in series or in parallel. For example, as shown in Figure 4, there may be various scenarios, such as series and non-series connection methods. In the series connection method, the network layers are connected in series. In the non-series scenario, the input of a network layer may contain the outputs of multiple network layers. As shown in the non-series connection method in Figure 4, the output of Layer m-1 is merged with the output of Layer m and then input to Layer m+1.
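The difference between the series and non-series connection methods can be illustrated with a short Python sketch. Here the merge of two layer outputs is modeled as element-wise addition, which is an assumption (concatenation would be another possibility); the function name `run_non_series` and the toy layers are likewise illustrative.

```python
def run_non_series(layers, x, merge_at):
    """Run layers in order; the layer at index `merge_at` receives the sum of the
    outputs of the two preceding layers (a simple stand-in for 'merging')."""
    outputs = []
    for i, layer in enumerate(layers):
        if i == merge_at and len(outputs) >= 2:
            x = [p + q for p, q in zip(outputs[-2], outputs[-1])]
        x = layer(x)
        outputs.append(x)
    return x

double = lambda v: [2 * t for t in v]
inc = lambda v: [t + 1 for t in v]

# Layers 0 and 1 run in series; layer 2 takes the merged outputs of layers 0 and 1.
y = run_non_series([double, inc, double], [1, 2], merge_at=2)
```

One consequence for storage is that, in the non-series case, an earlier layer's output must stay available until every layer that consumes it has run.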

[0161] For example, in one possible scenario, the method provided in this application can be deployed in a cloud server system to provide cloud services to users via a client. Alternatively, the aforementioned system containing computing units can specifically be a cloud platform within a cloud server system.

[0162] For example, Figure 5 shows a schematic diagram of the structure of a cloud service system provided in this application. As shown in Figure 5, the cloud service system 50 may include a data center 11, a cloud platform 12, and a client 13.

[0163] The cloud platform 12 may specifically include a server cluster, the server structure of which can be referred to the description in Figure 2 above. Optionally, the cloud platform 12 can work with other computing devices, such as data storage, routers, load balancers, etc. The cloud platform 12 can use data from the data storage system or call program code in the data storage system to implement the method steps provided in the embodiments of this application.

[0164] Data center 11 can be used to store data for cloud platform 12 to query or write data, etc.

[0165] The cloud platform 12 can provide services to users in the form of a client. Users can operate on the client 13 to interact with the cloud platform 12 on data or to request services from the cloud platform 12. This client can be deployed on personal computers, computer workstations, smartphones, tablets, laptops, and smart cars, etc.

[0166] In one possible implementation, the cloud platform 12 is used to implement the methods provided in the embodiments of this application, and to perform data processing through the methods provided in the embodiments of this application. For example, if the methods provided in the embodiments of this application are applied to a neural network training scenario, the cloud platform 12 can execute the neural network training steps; the trained model can be deployed on the cloud platform 12 or on the client 13.

[0167] It should be noted that client 13 is an optional device, meaning that client deployment is not required. Taking a neural network training scenario as an example, with client 13 deployed, users can provide training sets through the client or deploy the trained model on the client, etc.; without client 13 deployed, the cloud platform can use the training set stored in data center 11 to perform model training.

[0168] The method flow provided in the embodiments of this application will be described below in conjunction with the aforementioned system architecture.

[0169] Referring to Figure 6, a flowchart of a data processing method provided in an embodiment of this application is shown below.

[0170] 601. The first computing unit obtains multiple calculation results, which are the results of calculating multiple first computing tasks.

[0171] For the first computing unit, refer to the description of Figures 1 to 5 above.

[0172] The first computational task can specifically be a task that uses input data to perform calculations and outputs computation results. The input data can include, but is not limited to, images, video data, audio data, text, or sequences obtained based on one or more of the previous items.

[0173] Optionally, the first computational task can be a computation of a neural network. For example, in conjunction with the network structures shown in Figures 3 and 4 above, the first computational task can be a computational task on a network layer, computation module, or training batch in the neural network. Accordingly, the computation result of the first computational task can be the computation result in the neural network. Therefore, the method embodiments provided in this application can be applied to the computation process of a neural network to achieve a more efficient neural network computation process.

[0174] Taking the forward propagation process in neural network training as an example, during the forward propagation of the neural network, samples from the training set can be input into the neural network. The first computing unit performs calculations on multiple computing modules in the neural network. Each module can include one or more operations, and the output of each operation can be used as an activation value. The calculation result of the first computing task can include the activation values of each computing module, each network layer, or each training batch.

[0175] The samples in the training set may include, but are not limited to, one or more of the following: images, text, audio data, or video data. The specific type can be determined based on the actual application scenario, and this application does not limit the type of samples in the training set. Typically, the task performed by the neural network can be determined based on the actual scenario, and the type of training data required may differ depending on the task. For example, if a neural network is to be trained for image processing, the samples in the training set may include images; if a neural network is to be trained for text processing, the samples in the training set may include text or audio data, and so on.

[0176] 602. The first computing unit sends at least one of the multiple calculation results to the second storage unit.

[0177] In this embodiment of the application, the first computing unit may send all or part of the computing results of one or more first computing tasks to the second storage unit so as to store all or part of the computing results of one or more first computing tasks in the second storage unit.

[0178] Based on the system architectures corresponding to Figures 1 to 5, there may be one or more types of second storage units. In one possible architecture, the second storage unit can be an independently deployed storage unit within the system, which the first computing unit can directly access. Therefore, the first computing unit can directly send its computation results to the second storage unit. In another possible architecture, the second storage unit is coupled to the second computing unit. For example, the second storage unit can be mounted on the CPU within the system, and the first computing unit needs to access the second storage unit through the second computing unit. Therefore, the first computing unit can send its computation results to the second storage unit through the second computing unit. For instance, the second computing unit can forward the computation results to the second storage unit, or, under the management of the second computing unit, the first computing unit can send the computation results to the second storage unit based on a storage address provided by the second computing unit. Therefore, in the embodiments of this application, different types of storage resources in the system, including storage resources that the first computing unit can access directly and storage resources that it cannot access directly, can all be used to store computation results, fully utilizing the system's storage resources for data storage.

[0179] For the computation results obtained by the first computing unit from multiple first computing tasks, embodiments of this application can utilize various storage resources for storage to achieve multi-level storage. The storage type of each computation result may include storage in the aforementioned second storage unit or first storage unit, or it may be directly discarded.

[0180] In one possible implementation, after the first computing unit calculates the result of the first computing task, it can directly send all the calculation results to the second storage unit. In other words, the first computing unit can store all the calculation results as the first calculation result in the second storage unit to make full use of the available storage resources in the system.

[0181] In one possible implementation, after the first computing unit obtains the calculation result of the first computing task, it can send the calculation result to the second storage unit when the available storage space of the first storage unit is less than a first threshold. This allows the storage resources of the second storage unit to be fully used to store the calculation result when the available storage space of the first storage unit is insufficient, thereby reducing the number of calculation results that need to be discarded.

[0182] In one possible implementation, before sending at least one first calculation result to the second storage unit, the first computing unit further selects the at least one first calculation result from the plurality of calculation results. Therefore, the first computing unit can selectively store some calculation results in the second storage unit. For example, it can send calculation results with a large computational load but a small space occupation to the second storage unit, thereby replacing a high recomputation cost with a lower communication cost.

[0183] In one possible implementation, the first computing unit further determines at least one second computing result from the multiple computing results and stores the at least one second computing result in the first storage unit. For example, computing results with a large computational load but small storage requirements, or computing results whose transfer time is shorter than their recalculation time, can be classified as first computing results and stored in the second storage unit. This stores computationally expensive results at a small transfer cost, eliminates the need to recalculate them, and greatly reduces the time spent obtaining them. Alternatively, computing results with both a large computational load and large storage requirements can be classified as second computing results and stored in the first storage unit, thereby reducing the time spent transferring or recalculating such results and maximizing the storage resource utilization of the first storage unit.

[0184] Optionally, a computation result to be discarded can be determined from the multiple computation results; for ease of distinction, it is referred to as the third computation result. That is, after the first computing unit completes the computation that uses this third computation result, the result does not need to be stored and can be directly discarded. During the computation of the second computation task, the first computing unit can recalculate the third computation result. For example, when executing the second computation task, the computation result of the previous second computation task can be used as the input of the first computation task corresponding to the current second computation task to obtain the recalculated third computation result. For example, in a neural network training scenario, a computation result whose transfer time is greater than its recalculation time can be used as the third computation result, and this third computation result can be recalculated during backpropagation. Therefore, the third computation result does not need to be stored and can be directly discarded, reducing the storage space occupied.

[0185] In one possible implementation, the first computing unit can also determine the storage method of the calculation result based on the attribute information of the calculation result. For example, the first computing unit can determine a storage indication for the calculation result based on the attribute information of the calculation result, which includes at least one of the space occupied by the calculation result or the amount of computation; the storage indication can be used to indicate the storage method of the calculation result. If the storage indication indicates a second storage unit, the first computing unit sends the calculation result to the second storage unit. Therefore, in this embodiment, it is possible to determine whether to store the calculation result in the second storage unit based on the attribute information of the calculation result, thereby allowing for adaptive selection of the storage method of the calculation result based on the actual scenario, and exhibiting stronger generalization.

[0186] In one possible scenario, if the storage indication indicates the first storage unit, the first computing unit stores the calculation result in the first storage unit; or, if the storage indication indicates that the calculation result is to be discarded, the first computing unit discards the calculation result.
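As an illustration of how a storage indication might be derived from attribute information, the following is a minimal Python sketch. The names `StorageIndication`, `ResultAttributes`, `storage_indication`, and the parameter `keep_threshold_seconds` are hypothetical and do not appear in this application, and the decision rule is only one assumed heuristic that is consistent with the examples given above.

```python
from dataclasses import dataclass
from enum import Enum


class StorageIndication(Enum):
    FIRST_STORAGE = "A"   # keep in the first storage unit
    SECOND_STORAGE = "B"  # send to the second storage unit, fetch back later
    DISCARD = "C"         # drop now, recompute when needed


@dataclass
class ResultAttributes:
    size_bytes: float         # space occupied by the calculation result
    recompute_seconds: float  # assumed proxy for the amount of computation


def storage_indication(attrs, free_first_bytes, bandwidth_bytes_per_s,
                       keep_threshold_seconds=1.0):
    """Assumed heuristic: pick the storage method from size and compute cost."""
    transfer_seconds = attrs.size_bytes / bandwidth_bytes_per_s
    cheapest_restore = min(transfer_seconds, attrs.recompute_seconds)
    # Expensive to move AND expensive to recompute: keep it local if it fits.
    if (cheapest_restore > keep_threshold_seconds
            and attrs.size_bytes <= free_first_bytes):
        return StorageIndication.FIRST_STORAGE
    # Cheaper to move than to recompute: offload to the second storage unit.
    if transfer_seconds < attrs.recompute_seconds:
        return StorageIndication.SECOND_STORAGE
    # Otherwise recomputation is no slower than the transfer: discard.
    return StorageIndication.DISCARD
```

For instance, with an assumed bandwidth of 1e9 bytes per second, a 100 MB result that takes 0.5 s to recompute is indicated for the second storage unit, because its 0.1 s transfer is cheaper than recomputation.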

[0187] When executing the second computation task, the first computing unit can quickly retrieve the computation result stored in the first storage unit without having to move it. For example, computation results with a large amount of computation and a large space occupation can be stored in the first storage unit. This is equivalent to replacing the transfer cost of moving the result from the second storage unit, or the cost of recomputing it, with the storage cost of the first storage unit.

[0188] For discarded calculation results, that is, results calculated in the first computing unit that are not retained, no storage is required; they can be discarded directly. During the execution of the second computing task, these results can be recalculated. For example, calculation results whose transfer time exceeds the recalculation time can be discarded and recalculated during backpropagation. Therefore, the third calculation result does not need to be stored and can be directly discarded, reducing storage space usage.

[0189] Optionally, the first computing unit constructs constraints based on the size of the first storage unit and the size of the second storage unit; under the constraints, the first computing unit determines the storage indication of the computing result based on the attribute information of the computing result and the duration of the second computing task. Therefore, in this embodiment, constraints can be constructed based on the available storage space size, thereby obtaining a computing result storage scheme that is more adapted to the available storage resources under the constraint of the available storage space size.

[0190] Optionally, when the storage indication indicates that the calculation result should be discarded, the attribute information of the calculation result includes the computational amount. This allows the decision on whether to discard a calculation result to be based on its computational amount. For example, a large computational amount means that, if the result is needed later, the computational amount required for recalculation will also be large. Therefore, in this embodiment, the computational amount of each calculation result can be considered when determining whether it should be discarded, thereby minimizing the computational amount of the overall data processing process, reducing the end-to-end processing time, and improving the overall processing efficiency.

[0191] In essence, in this embodiment, the calculation results are categorized into multiple storage types based on their attribute information. For example, type A calculation results are stored in the first storage unit, type B calculation results are stored in the second storage unit, and type C calculation results are discarded directly. Therefore, this embodiment can utilize multi-level storage resources to store calculation results, making full use of available storage resources.

[0192] In one specific implementation, an optimization problem can be constructed using the first storage unit and the second storage unit. This optimization problem can include an objective function and the aforementioned constraints. The objective function is determined based on the attribute information of the computation results. Under the constraints, a better computation result storage scheme can be output based on the objective function. The objective function includes a function that minimizes a time duration. This time duration can include the time taken for the first computing unit to compute the second computation task. The time taken for the first computing unit to retrieve a computation result from the second storage unit can be determined based on the space occupied by the computation result and the effective bandwidth between the first computing unit and the second storage unit; for example, it can be represented as space occupied / effective bandwidth. The optimization problem is then solved to identify the storage method of each computation result, obtaining a storage indication. That is, each computation result is classified as a type A computation result, a type B computation result, or a type C computation result, etc. Furthermore, if the computation results include a type C computation result, the minimized time duration can also include the time taken to recompute the type C computation result.
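The optimization described above can be sketched, under simplifying assumptions, as an exhaustive search over type A / B / C assignments. The function name `plan_storage` and its parameters are hypothetical. The sketch assumes that type A results cost no retrieval time, that type B results cost space occupied / effective bandwidth, and that type C results cost their recomputation time. The gradient computation time itself is the same for every assignment and is therefore omitted, and overlap between transfer and computation is ignored.

```python
from itertools import product


def plan_storage(results, first_capacity, second_capacity, bandwidth):
    """Exhaustively search A/B/C assignments (illustrative only, O(3^n)).

    results: list of (size, recompute_time) tuples, one per computation result.
    Returns (plan, total_time), where plan is a tuple of "A", "B", or "C".
    """
    best_plan, best_time = None, float("inf")
    for plan in product("ABC", repeat=len(results)):
        used_first = sum(s for (s, _), t in zip(results, plan) if t == "A")
        used_second = sum(s for (s, _), t in zip(results, plan) if t == "B")
        # Constraints formed by the sizes of the two storage units.
        if used_first > first_capacity or used_second > second_capacity:
            continue
        # Objective: time to bring every result back for the second task.
        total = sum(
            size / bandwidth if t == "B" else recompute if t == "C" else 0.0
            for (size, recompute), t in zip(results, plan)
        )
        if total < best_time:
            best_plan, best_time = plan, total
    return best_plan, best_time
```

A practical solver would likely use integer programming or a greedy heuristic rather than enumeration; the enumeration is used here only because it makes the objective and the constraints explicit.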

[0193] For example, taking the training process of a neural network as an example, the aforementioned calculation result can be the output value of an operation in the neural network; for ease of understanding, this output value is called an activation value. An optimization problem can be constructed using the first storage unit and the second storage unit. This optimization problem can include an objective function and constraints, and a better storage scheme for the calculation results can be obtained from the objective function under the constraints. The objective function includes a function that minimizes a duration, which can represent the duration of gradient backpropagation for one layer. For example, this duration can specifically include the time taken by the first computing unit to retrieve activation values from the second storage unit, to recompute activation values (in the case of type C activation values), or to calculate gradient values using the activation values. The constraints of the optimization problem include constraints formed by the size of the first storage unit and the size of the second storage unit. Subsequently, the optimization problem is solved to obtain the solution result, which may include a partitioning of the activation values into the aforementioned type A, type B, and type C activation values. That is, by solving the optimization problem to classify multiple activation values, the goal is to obtain an activation value storage scheme with a shorter overall end-to-end training duration.

[0194] For example, by solving the aforementioned optimization problem, activation values that require a large amount of computation but few storage resources can be classified as type B activation values. Activation values whose transfer time is less than their recalculation time can be stored in the second storage unit. This allows computationally intensive activation values to be stored at minimal transfer cost, eliminating the need to recalculate these values and significantly reducing the time spent calculating them. Alternatively, activation values that require a large amount of computation and also consume significant storage resources can be classified as type A activation values and stored in the first storage unit. This reduces the time spent transferring or calculating these activation values and maximizes the utilization of the storage resources in the first storage unit.

[0195] 603. The first computing unit calculates the second computing task based on the obtained first computing result. While the current second computing task is being calculated, the first computing result corresponding to the next second computing task is obtained from the second storage unit and stored in the first storage unit.

[0196] That is, the second calculation task corresponds to one or more of the first calculation tasks in the aforementioned step 601, and the calculation results of one or more of the corresponding first calculation tasks need to be used during the calculation process of the second calculation task.

[0197] Therefore, in this embodiment of the application, when calculating the previous second calculation task, the calculation result required for the next second calculation task can be prefetched from the second storage unit to the first storage unit. Thus, when calculating the next second calculation task, the required calculation result can be directly obtained from the first storage unit, which can reduce the waiting time for obtaining the required calculation result and improve the overall calculation efficiency.

[0198] Specifically, the first computing unit calculates one of the second computing tasks during a first duration, and retrieves the calculation results required for the next second computing task from the second storage unit during a second duration and stores them in the first storage unit. The first duration covers the second duration. Therefore, in this embodiment, the second computing task and the transfer of the calculation results required for the next second computing task are executed within the same time period, so that computing tasks and transfer tasks overlap in the time dimension and the overall execution time of computing tasks and transfer tasks is reduced.
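The overlap between computing the current second computing task and fetching the input of the next one can be illustrated with a small Python sketch. It is not the implementation of this application: `run_with_prefetch`, `fetch`, and `compute` are hypothetical names, a dictionary stands in for the first storage unit, and a background thread stands in for whatever asynchronous transfer mechanism a real system would use.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_prefetch(task_ids, fetch, compute):
    """Run tasks in order; fetch the next task's input while computing.

    fetch(task_id)    -> result read from the (slow) second storage unit
    compute(task_id, x) -> output of the second computing task
    """
    outputs = []
    if not task_ids:
        return outputs
    first_storage = {}  # stand-in for the first storage unit
    with ThreadPoolExecutor(max_workers=1) as pool:
        # The very first input cannot be overlapped with any computation.
        first_storage[task_ids[0]] = fetch(task_ids[0])
        for i, task in enumerate(task_ids):
            future = None
            if i + 1 < len(task_ids):
                nxt = task_ids[i + 1]
                future = pool.submit(fetch, nxt)  # second duration starts
            # First duration: compute with data already in first storage.
            outputs.append(compute(task, first_storage.pop(task)))
            if future is not None:
                first_storage[nxt] = future.result()  # prefetch finished
    return outputs
```

Usage: with `second_storage = {3: 30, 2: 20, 1: 10}`, calling `run_with_prefetch([3, 2, 1], second_storage.__getitem__, lambda t, x: t + x)` processes the tasks in backward order, and each input is already in the first storage stand-in when its task starts.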

[0199] For example, during the computation of the second computation task, the required computation results can come from multiple sources. For instance, type A computation results can be obtained from the first storage unit, type B computation results can be obtained from the second storage unit, and type C computation results can be obtained through recomputation. The computation results required for the next second computation task can be obtained during the computation of the previous second computation task. For example, type B computation results can be moved from the second storage unit to the first storage unit in advance, and type C computation results can be recomputed and stored in the first storage unit in advance. Therefore, in this embodiment, computation results of various types can be obtained in advance, which is equivalent to executing the computation result transfer and the second computation task within the same time period, reducing the overall end-to-end time of data processing by utilizing multi-level storage resources.

[0200] Optionally, in one possible application scenario, the method provided in this application embodiment can be applied to the training process of a neural network, with the first computational task applied to the forward propagation of the neural network and the second computational task applied to the backward propagation of the neural network. Therefore, the method provided in this application embodiment can be applied to the training process of a neural network, which can reduce the overall end-to-end training time of the neural network and improve the training efficiency of the neural network.

[0201] Specifically, the neural network includes at least one network layer, and each network layer includes at least one computation module. The training of the neural network is divided into multiple training batches. The aforementioned first computation task includes the computation task of the network layer, the task of the computation module, or the task of the training batch; the second computation task may include the gradient calculation task of the network layer, the gradient calculation task of the computation module, or the computation task of the training batch.

[0202] For example, the second computational task includes at least one of the following:

[0203] Calculate the gradient value of the m-th network layer, where m is a positive integer and the m-th network layer is the m-th network layer in the neural network arranged in the order of forward propagation;

[0204] Alternatively, calculate the gradient value of the nth computation module, where n is a positive integer and the nth computation module is the nth computation module in the neural network arranged in the order of forward propagation;

[0205] Alternatively, calculate the gradient value of the k-th training batch, where k is a positive integer and the k-th training batch is one of the training batches of the neural network.

[0206] Therefore, in the embodiments of this application, the second computation task can be a variety of gradient computation tasks with different granularities, which can be adapted to a variety of scenarios.

[0207] In one possible implementation, the first computing unit calculates the gradient of the second network layer during a first duration, and retrieves the calculation results required for the gradient calculation of the first network layer from the second storage unit during a second duration. The first network layer is arranged before the second network layer in the forward propagation direction. For example, the first computing unit calculates the gradient value of the m-th network layer during the first duration, and retrieves the activation value of the (m-a)-th network layer from the second storage unit and stores it in the first storage unit during the second duration, where a is a positive integer. Therefore, after calculating the gradient of the previous network layer, the gradient of the next network layer can be directly calculated based on the prefetched activation value, reducing the interval between the gradient calculations of network layers. This improves the overall efficiency of the first computing unit when performing gradient backpropagation and makes full use of the transmission resources between the first computing unit and the second storage unit.
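To make the layer indexing concrete, the following Python sketch lists, for each backward step, which layer's gradient is computed and which layer's activation value is prefetched at a prefetch distance a. The function name `backward_prefetch_schedule` is hypothetical, and the sketch assumes that layers are numbered 1 to `num_layers` in forward order and that every layer's activation value resides in the second storage unit.

```python
def backward_prefetch_schedule(num_layers, a=1):
    """Return (prologue, steps) for layer-granularity prefetching.

    prologue: layers whose activations must be fetched before the first
              gradient step, because no earlier step could prefetch them.
    steps:    (m, m - a) pairs; while the gradient of layer m is computed,
              the activation of layer m - a is prefetched (None if m - a < 1).
    """
    prologue = list(range(num_layers, max(num_layers - a, 0), -1))
    steps = []
    for m in range(num_layers, 0, -1):  # backpropagation order
        target = m - a
        steps.append((m, target if target >= 1 else None))
    return prologue, steps
```

With four layers and a = 2, the prologue is layers 4 and 3, and the steps pair the gradient of layer 4 with the prefetch of layer 2, then the gradient of layer 3 with the prefetch of layer 1. A larger a gives each transfer more time to complete, at the price of holding more activation values in the first storage unit at once.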

[0208] In one possible implementation, in the forward propagation direction of the neural network, the first computation module is arranged before the second computation module. In the backward propagation direction, the gradient values of the second computation module are calculated first, followed by those of the first computation module. The first computing unit performs the gradient calculation task of the second computation module during the first duration, and retrieves the calculation results required for the gradient calculation task of the first computation module from the second storage unit during the second duration. For example, the first computing unit calculates the gradient value of the n-th computation module during the first duration, and retrieves the activation value of the (n-b)-th computation module from the second storage unit and stores it in the first storage unit during the second duration, where b is a positive integer. In this embodiment, activation values can be stored and moved at the granularity of computation modules, that is, at a finer granularity. This allows the optimization problem to be used to obtain a storage scheme for the activation values of each module, thereby achieving a shorter overall training time and improving the overall MFU (model FLOPs utilization) of model training. Furthermore, the activation values of modules can be prefetched, so when calculating gradients in the order of backpropagation, after calculating the gradient of the previous module, the prefetched activation values can be used directly to calculate the gradient of the next module, reducing the interval between the gradient calculations of modules and improving gradient calculation efficiency.

[0209] In one possible implementation, the first computing unit performs the gradient calculation task for the first training batch during a first duration, and retrieves the calculation results required for the gradient calculation task of the second training batch from the second storage unit during a second duration. For example, the first computing unit calculates the gradient value of the k-th training batch during the first duration, and retrieves the first activation value of the (k+c)-th training batch from the second storage unit and stores it in the first storage unit during the second duration, where c is a positive integer. In this embodiment, activation values can be moved on a batch-by-batch basis, allowing activation values to be moved in advance between training batches. This reduces the waiting time caused by moving or recalculating activation values between training batches, improves the utilization of the computing resources of the first computing unit, and enhances model training efficiency.

[0210] This can be understood as follows: when moving the activation values of a training batch, the movement can be done at layer or module granularity to match the computing power of the first computing unit and the communication capability between the first computing unit and the hierarchical storage. For example, when the communication bandwidth between the first computing unit and the second storage unit is large, moving activation values at the layer level can be considered; when the communication bandwidth between the first computing unit and the second storage unit is small, moving activation values at the module level can be considered. The specific granularity of movement can be selected according to the actual application scenario.

[0211] In one possible implementation, if a third calculation result is discarded among the multiple calculation results, the first calculation unit recalculates the first calculation task to obtain the third calculation result during a third duration. The sum of the third duration and the first duration covers the second duration. In this embodiment, the process of recalculating the third calculation result can also be performed before executing the second calculation task, fully considering various possible situations.

[0212] For example, in a neural network training scenario, referring to Figures 3 and 4 above, assume that the method provided in this application embodiment is deployed and that the calculation results are prefetched at the aforementioned network layer granularity. If a serial connection is used, the calculation results can be offloaded layer by layer or module by module during forward propagation, so that the output values are stored in the second storage unit. During gradient backpropagation, the calculation results can also be obtained layer by layer; for example, when calculating the gradient of layer m+1, the activation values required for the gradient calculation task of layer m can be prefetched. However, if a non-serial connection is used, the calculation result of layer m-1 cannot be offloaded immediately during forward propagation; it must wait until layer m+1 has been calculated. During gradient backpropagation, when the gradient of layer m+3 is backpropagated, the activation value of layer m+2 is prefetched, while when the gradient of layer m+2 is backpropagated, both calculation results that feed layer m+1 (the outputs of layer m-1 and layer m in Figure 4) need to be prefetched simultaneously. Of course, the specific neural network architecture can be determined according to the actual application scenario. This application embodiment is merely an illustrative example and is not intended to limit the scope.
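The dependency reasoning for non-serial connections can be sketched as follows. The representation is an assumption made for illustration: `consumers` maps each layer index (in forward order) to the layers that read its output, and the helper names `release_step` and `inputs_to_prefetch` are hypothetical.

```python
def release_step(consumers):
    """For each layer, the forward step after which its output is no longer
    needed locally and can be moved out of the first storage unit.

    consumers: dict mapping layer index -> list of layers that read its output.
    A layer with no readers can be released right after it is computed.
    """
    return {layer: max(readers) if readers else layer
            for layer, readers in consumers.items()}


def inputs_to_prefetch(consumers, layer):
    """Layers whose outputs must be in the first storage unit before the
    gradient of `layer` is computed (all producers feeding that layer)."""
    return sorted(src for src, readers in consumers.items() if layer in readers)
```

Taking m = 2 for the non-serial example of Figure 4, `consumers = {1: [2, 3], 2: [3], 3: [4], 4: []}` gives a release step of 3 for layer 1, meaning that the output of layer m-1 stays in place until layer m+1 has been calculated, and `inputs_to_prefetch(consumers, 3)` returns layers 1 and 2, which must both be available for the gradient of layer m+1.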

[0213] Combining the aforementioned first, second, and third durations, the temporal processing flow of the second computation tasks is shown in Figure 7. The first gradient task, second gradient task, and third gradient task are executed in sequence. While the previous gradient computation task is being calculated, the activation values required for the next gradient computation task can be prefetched. Taking the second gradient computation task as an example, gradient backpropagation (i.e., calculating gradient values) is performed within the first duration. If there are activation values that need to be recalculated, they are recalculated in the third duration. In the second duration, the activation values required for the third gradient computation task can be prefetched. The sum of the first and third durations covers the second duration. Specifically, for the gradient computation task of layer m, when calculating the gradient of layer m+1, the activation values required for the gradient computation task of layer m can be prefetched; when calculating the gradient of layer m, the activation values required for the gradient computation task of layer m-1 can be prefetched, and so on. In the first duration, the gradient of layer m is calculated, and in the second duration, the activation values required for layer m-1 are prefetched. If there are activation values that need to be recalculated when calculating the gradient of layer m, they are recalculated in the third duration. The sum of the first and third durations covers the second duration. Therefore, the activation values required for the next gradient calculation task can be prefetched before the previous gradient calculation task is completed. This allows gradient calculation and activation value prefetching to be executed in parallel, greatly improving the overall training efficiency of the neural network.
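The relationship between the three durations can be checked with a simple timing model, sketched below. It is an assumed model rather than part of this application: each gradient task is described by a hypothetical tuple (first duration for gradient calculation, third duration for recalculation, second duration for prefetching the next task's activation values), and the prefetch is assumed to run fully in parallel with the computation.

```python
def backward_time(steps, overlap=True):
    """Total backward time for a list of (grad, recompute, prefetch) durations.

    With overlap, a step takes max(grad + recompute, prefetch); without it,
    the three durations are simply added up.
    """
    total = 0.0
    for grad, recompute, prefetch in steps:
        busy = grad + recompute  # first duration + third duration
        total += max(busy, prefetch) if overlap else busy + prefetch
    return total


def prefetch_is_covered(steps):
    """True if, in every step, the first and third durations together
    cover the second duration, so prefetching never stalls computation."""
    return all(prefetch <= grad + recompute
               for grad, recompute, prefetch in steps)
```

For example, with steps `[(3.0, 1.0, 2.0), (2.0, 0.0, 2.5), (4.0, 0.5, 0.0)]`, the overlapped total is 11.0 against 15.0 without overlap, and `prefetch_is_covered` returns `False` because the second step's 2.5 prefetch exceeds its 2.0 of computation.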

[0214] Optionally, after the gradient values are obtained, the parameters of the neural network can be updated using the gradient values. For example, the gradient value can be understood as the vector of derivatives of the loss function with respect to the neural network parameters. The updated parameters of the neural network can be calculated using this gradient value to obtain the updated neural network, thereby implementing the training of the neural network.

[0215] Therefore, in the embodiments of this application, during the model training process, after the first computing unit performs forward propagation, it can store activation values in the second storage unit. This makes full use of the storage resources of the second storage unit in combination with the storage resources of the first computing unit, so that activation values are stored in a multi-level storage environment. As a result, fewer activation values need to be discarded, which reduces the recomputation load and the computational overhead of the first computing unit.
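As a simple illustration of updating parameters from gradient values, the sketch below applies plain gradient descent. This application does not specify an update rule, so the rule, the function name `sgd_update`, and the `learning_rate` parameter are assumptions chosen for illustration only.

```python
def sgd_update(params, grads, learning_rate=0.01):
    """Return updated parameters: each parameter moves against its gradient.

    params, grads: equal-length lists of floats (a flattened parameter vector
    and the derivative of the loss with respect to each parameter).
    """
    if len(params) != len(grads):
        raise ValueError("params and grads must have the same length")
    return [p - learning_rate * g for p, g in zip(params, grads)]
```

In practice an optimizer with momentum or adaptive step sizes might be used instead; the data flow is the same, in that gradient values computed during backpropagation are consumed to produce the updated parameters.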

[0216] The foregoing has described the method flow provided by the embodiments of this application. The following is a more detailed description of the method flow provided by the embodiments of this application in conjunction with specific application scenarios.

[0217] The following description uses the training process of a large language model as an example. The large language model mentioned below can be replaced with other models that need to be trained, which will not be elaborated further. The aforementioned calculation results may specifically be activation values, that is, the output values of each computation module in the forward propagation process of the large language model. The aforementioned system 10 can specifically be called a neural network system.

[0218] For ease of understanding, the method provided in this application embodiment is divided into multiple stages, such as initialization, activation value storage, activation value transfer and gradient value calculation, and model update. The different stages are described below.

[0219] I. Initialization

[0220] First, the structure of the large language model is illustrated. As shown in Figure 8, this large language model can include multiple network layers, and each layer can contain multiple computation modules, such as the n modules shown in Figure 8. The storage space occupied by these n modules is represented as m_0 to m_n.

[0221] Typically, the transport time or computational load of different computing modules during gradient backpropagation may vary, which may result in different occupancy times for different computing modules during gradient backpropagation. The transport time or computational load of each computing module can be determined based on historical data, or the transport time or computational load of each computing module can be calculated after running each computing module in the computing device.

[0222] For example, as shown in Figure 9, the specific computation modules may include, but are not limited to, layer-norm, softmax, dropout, linear, TP_allreduce, GeLU and other computation modules. The position, activation value occupancy or computation amount of each computation module may be different.

[0223] The actual transport time and computational load of each computation module can be obtained by running the gradient backpropagation process of the neural network. For example, two rounds of performance profiling can be performed, one using a full-recalculation strategy and one using a full-transport strategy, to obtain the computation / communication time and the transport time of each module. For instance, the output activation value size and computational load of each computation module can be as shown in Table 1. Here, "position" represents the index of the computation module in the neural network, "activation value size" represents the space occupied by the output activation value, "computational load" represents the amount of computation required by the first computing unit to output the activation value, "remarks" describes the computation module, and "computational load / size" represents the ratio between the computational load and the occupied space.
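The "computational load / size" column can be derived from the two profiled quantities. The following is a minimal sketch; the class `ModuleProfile`, the units, and all numeric values are hypothetical and are not taken from Table 1.

```python
from dataclasses import dataclass


@dataclass
class ModuleProfile:
    position: int       # index of the computation module in the network
    name: str           # remark describing the module
    size_mb: float      # space occupied by the output activation value (assumed unit)
    compute_ms: float   # cost of producing the activation value (assumed unit)

    @property
    def ratio(self) -> float:
        """Computational load per unit of occupied space."""
        return self.compute_ms / self.size_mb


# Hypothetical profiling results, for illustration only.
profiles = [
    ModuleProfile(0, "layer-norm", 64.0, 0.2),
    ModuleProfile(1, "linear", 64.0, 3.2),
    ModuleProfile(2, "GeLU", 256.0, 0.4),
]

# Modules with the highest ratio are the most worthwhile to keep or offload.
ranked = sorted(profiles, key=lambda p: p.ratio, reverse=True)
```

Sorting by this ratio puts the activation values that are expensive to recompute but cheap to hold at the front of the list.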

[0224] Table 1

[0225] The required computation or transfer time may vary depending on the size of the activation value. For example, larger activation values require more storage resources and take longer to transfer, while smaller activation values require fewer storage resources and take less time to transfer. Therefore, activation values that require more computation but fewer storage resources can be stored in the second storage unit; activation values that require less computation and fewer storage resources can be stored in the GPU's storage space, thereby reducing the overall time for the GPU to transfer and compute activation values; and activation values that require less computation but more storage resources can be discarded and recomputed during backpropagation, freeing more storage resources at a lower computational cost.

[0226] II. Activation Value Storage

[0227] During model training, the activation values output by the model need to be stored so that the gradients corresponding to each parameter can be calculated during gradient backpropagation.

[0228] Model training can be performed by GPUs or NPUs, and the node that performs model training can be called an XPU, which is the first computing unit mentioned above, while the CPU's storage resources are used as an additional second storage unit.

[0229] In this embodiment, to maximize the resource utilization of the neural network system, activation values can be stored or recalculated in various ways. For example, as shown in Figure 10, activation values can be divided into three classes. Class A activation values are stored in the XPU Memory (storage space), i.e., the first storage unit, maximizing the utilization of the XPU's storage space; Class A activation values are the aforementioned second calculation result. Class B activation values are stored in the CPU Memory, i.e., the aforementioned second storage unit: after the activation values are obtained through forward propagation, they are offloaded from the XPU Memory to the CPU Memory and prefetched from the CPU Memory before the backward propagation calculation; Class B activation values are the aforementioned first calculation result. Class C activation values are discarded and recalculated; the purpose of this classification is to minimize the recalculation overhead; Class C activation values are the aforementioned third calculation result.

[0230] For Class A activation values, the XPU can read them directly from its storage space. Storing Class A activation values in the XPU's storage space maximizes the utilization of that storage space and achieves more efficient activation value reading.

[0231] For Class B activation values, the XPU can store them in the CPU's memory space. Here and in the following embodiments, the CPU's memory space refers to the memory units mounted on or managed by the CPU, which will not be elaborated further. During gradient backpropagation, the XPU needs to move Class B activation values back from the CPU's memory space to calculate the gradient. For example, the processing of Class B activation values can be as shown in Figure 11: after activation values are obtained through forward propagation, they are offloaded from the XPU Memory to the CPU Memory; during gradient backpropagation, the XPU moves the activation values from the CPU Memory to the XPU Memory to calculate the gradient values. Typically, activation values that require a large amount of computation but occupy relatively few resources can be classified as Class B activation values, so that activation values that are expensive to compute are preserved at a low data transfer cost.

[0232] For Class C activation values, they can be discarded directly, meaning there is no need to store them. During gradient backpropagation, Class C activation values can be recalculated, for example by substituting the activation values of adjacent operations into the operations corresponding to the Class C activation values. For instance, activation values that require less computation but occupy a larger amount of space can be used as Class C activation values, thus freeing more storage resources at a lower computational cost.
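The three classes can be illustrated with a simple rule-based classifier. This is a minimal sketch, not the classification procedure of this application: the function name `classify` and both thresholds are hypothetical, and the treatment of activation values that are both expensive to compute and large (assigned to Class B here) is an assumption, since the text does not state it.

```python
def classify(compute, size, compute_threshold, size_threshold):
    """Return "A", "B", or "C" for one activation value.

    A: cheap to compute, small  -> keep in XPU memory
    B: expensive to compute     -> offload to CPU memory and prefetch later
    C: cheap to compute, large  -> discard and recompute
    """
    heavy = compute >= compute_threshold
    large = size >= size_threshold
    if not heavy and not large:
        return "A"
    if not heavy and large:
        return "C"
    # Expensive and small is Class B per the text; expensive and large is
    # also mapped to B here, which is an assumption of this sketch.
    return "B"


labels = [classify(c, s, compute_threshold=1.0, size_threshold=100.0)
          for c, s in [(0.2, 64.0), (3.2, 64.0), (0.4, 256.0)]]
```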

[0233] There are several ways to determine the storage scheme for activation values. For example, the activation value can be directly stored in the CPU's memory space, or an optimization algorithm can be used to find a better storage scheme for the activation value. These will be introduced in detail below.

[0234] 1. Solve the activation value storage scheme using an optimization algorithm.

[0235] Typically, moving Class B activation values too late causes computational delays, while moving them too early increases storage overhead. Therefore, in this embodiment, which activation values are classified as Class B can be determined based on the transfer time or computational load corresponding to each activation value. For example, as shown in Figure 12, for the (l-1)th layer, the solution provided in this embodiment recalculates activation values in the third time period and calculates the gradient backpropagation of this layer in the first time period. The stored activation values required for this gradient backpropagation have already been prefetched while the gradient of the (l)th layer was being calculated. Therefore, once the activation value recalculation is completed, or if there are no activation values that need to be recalculated, the gradient of the (l-1)th layer can be calculated directly, and the activation values required for the (l-2)th layer can be prefetched in the second time period. The sum of the third time period and the first time period covers the second time period. In existing activation value recalculation schemes, the (l-1)th layer requires full recalculation of activation values followed by gradient calculation; the time this takes is referred to as the fourth time period. The time required for full activation value recalculation in existing schemes differs from the time required in this embodiment, where only a portion of the activation values, or none at all, are recalculated. This difference, specifically the difference between the fourth time period and the sum of the third and first time periods, represents the performance gain of the scheme provided in this embodiment. Essentially, it overlaps the gradient backpropagation time of the (l-1)th layer with the prefetching time of the Class B activation values of the (l-1)th layer, achieving just-in-time activation value transfer.

[0236] Therefore, in this embodiment, an optimization algorithm that takes into account the activation value transfer time or computational load of each computation module or network layer can be used to classify each activation value, so as to obtain a shorter end-to-end training process. As shown in Figure 10, compared with full recalculation of activation values, this is equivalent to dividing the activation values that would otherwise all be recalculated into Class A, Class B, and Class C activation values. Class A activation values are stored in the XPU storage space (i.e., the aforementioned second calculation result), Class B activation values are stored in the CPU storage space (i.e., the aforementioned first calculation result), and Class C activation values are discarded and recalculated in the gradient computation task (i.e., the aforementioned third calculation result).

[0237] First, during the initialization phase, the storage usage m_0 of each layer, the activation value usage m_i of each module, the computation time l_i of that module, and the gradient backpropagation time t_BP of a layer can be collected.

[0238] Subsequently, a multivariate, multi-constraint optimization problem is constructed based on the activation value information of each computing module. The decision variables are the granularity of the move, the amount of the move, and the timing of the move. The objective is to minimize the end-to-end time. The problem must satisfy storage capacity constraints and peak storage constraints.

[0239] This optimization problem can be represented as:

[0240] Here, M_XPU and M_CPU represent the storage capacity of the XPU and the CPU, respectively, and β represents the effective bandwidth between the CPU and the XPU. The sets A, B, and C form a partition of all modules, without overlap or omission.
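The formula referred to in [0239] does not appear in this text. As a non-authoritative sketch only, one formulation that is consistent with the symbols defined in [0237] and [0240] and with the timing relation of Figure 12 could read as follows for a single layer. The per-layer objective T, the max(0, ·) stall term, the module index set {1, ..., n}, and the omission of m_0 are assumptions made for illustration; this is not the formula of this application.

```latex
\begin{aligned}
\min_{A,\,B,\,C}\quad & T = t_{BP} + \sum_{i \in C} l_i
  + \max\!\Bigl(0,\ \tfrac{1}{\beta}\sum_{i \in B} m_i - t_{BP} - \sum_{i \in C} l_i\Bigr) \\
\text{s.t.}\quad & \sum_{i \in A} m_i \le M_{XPU}, \qquad \sum_{i \in B} m_i \le M_{CPU}, \\
& A \cup B \cup C = \{1, \dots, n\}, \qquad A \cap B = A \cap C = B \cap C = \emptyset .
\end{aligned}
```

When the stall term is zero, the transfer time of the Class B activation values is fully covered by the sum of the backpropagation time and the recomputation time, which corresponds to the first and third time periods covering the second time period.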

[0241] The solution to the optimization problem can be a greedy algorithm or a dynamic programming algorithm, which can be determined according to the actual application scenario. This application does not impose any restrictions on this.
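A greedy solution can be sketched as follows. This is one possible heuristic under stated assumptions, not the solver of this application: the function name `greedy_partition`, the ordering by l_i / m_i, and the conservative transfer check (Class B transfer time must fit within t_BP alone, ignoring the extra window that recomputation provides) are all choices made for the example.

```python
def greedy_partition(modules, m_xpu, m_cpu, beta, t_bp):
    """Partition modules into sets A (XPU), B (CPU), and C (recompute).

    modules: list of (position, m_i, l_i) with m_i > 0
    m_xpu, m_cpu: storage budgets; beta: effective CPU-XPU bandwidth
    t_bp: gradient backpropagation time available to hide the transfer
    """
    A, B, C = set(), set(), set()
    used_xpu = used_cpu = 0.0
    # Keep the activation values that are most expensive to recompute
    # per unit of space first.
    for pos, m_i, l_i in sorted(modules, key=lambda x: x[2] / x[1], reverse=True):
        if used_xpu + m_i <= m_xpu:
            A.add(pos)
            used_xpu += m_i
        elif used_cpu + m_i <= m_cpu and (used_cpu + m_i) / beta <= t_bp:
            B.add(pos)
            used_cpu += m_i
        else:
            C.add(pos)
    return A, B, C


# Hypothetical modules: (position, m_i, l_i).
A, B, C = greedy_partition(
    [(0, 4.0, 8.0), (1, 4.0, 4.0), (2, 8.0, 1.0), (3, 2.0, 3.0)],
    m_xpu=6.0, m_cpu=100.0, beta=2.0, t_bp=3.0,
)
```

In this example the large, cheap-to-recompute module ends up in C because transferring it could not be hidden behind the backpropagation time.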

[0242] In some scenarios, if the activation value storage scheme is determined at the network layer level, the activation value usage m_i and the computation time l_i of each module can be replaced by the activation value usage and computation time of each network layer. Determining the activation value storage scheme at the module level is only an example and is not intended to be a limitation. The specific granularity can be determined according to the actual application scenario, and this application does not limit it.

[0243] For example, referring to Figure 9 above, the solution result may be expressed as follows: the activation values stored in the XPU Memory are A = {10, 12} (where the numbers represent the positions of the computation modules), the activation values stored in the CPU Memory are B = {0, 2, 3, 7, 8}, and the remaining activation values are recalculated.

[0244] Therefore, in this embodiment, an optimization algorithm can be used to solve for an activation value storage scheme with shorter end-to-end time and better suited to the available resources of each first computing unit. It can fully utilize idle network bandwidth and hierarchical storage resources to achieve adaptive and proactive migration of activation values, supplementing storage with network bandwidth and aiding computation with network bandwidth, thus minimizing recomputation overhead and significantly alleviating the memory and computational limitations of large model training. Furthermore, through refined joint modeling of network, storage, and computation, just-in-time proactive data migration can be further realized, adapting to large model training scenarios with hundreds of models and thousands of variations, and significantly improving end-to-end performance.

[0245] Furthermore, if no out-of-memory (OOM) error occurs in the first computing unit during the storage of activation values, model training can continue. If an OOM error occurs, the activation values stored in the XPU can be offloaded to the CPU storage space via swap to avoid interrupting training.

[0246] Furthermore, if an OOM error occurs, the value of M_XPU or M_CPU in the aforementioned optimization problem can be adjusted and the solution recalculated. Alternatively, a binary search method can be used to obtain an activation value storage scheme that is better suited to the storage space actually available, and the activation values can be stored using the new activation value storage scheme.
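The binary search can be sketched as a search over the storage budget passed to the solver. This is a minimal sketch under stated assumptions: the function name `largest_safe_budget` and the probe `fits` (for example, a trial iteration that reports whether an OOM error occurred) are hypothetical, and the probe is assumed to be monotone in the budget.

```python
def largest_safe_budget(lo, hi, fits):
    """Return the largest integer budget in [lo, hi) for which fits() is True.

    Assumes fits(lo) is True, fits(hi) is False (hi is the budget that
    caused the OOM error), and that fits is monotone in the budget.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid      # mid is safe, search larger budgets
        else:
            hi = mid      # mid overflows, search smaller budgets
    return lo


# Hypothetical probe: any budget above 37 units triggers an OOM error.
budget = largest_safe_budget(0, 64, lambda b: b <= 37)
```

The returned budget can then replace M_XPU (or M_CPU) when the storage scheme is solved again.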

[0247] 2. Stored in the CPU's memory space.

[0248] Typically, the XPU's storage space stores the model parameters of the large language model, such as model weights or structural parameters, which occupy a certain amount of storage space. If the XPU's storage space is insufficient, a portion of the activation values calculated by the XPU can be stored in the XPU's storage space and a portion in the CPU's storage space; or, the activation values calculated by the XPU can be stored directly in the CPU's storage space.

[0249] III. Activation value transfer and gradient value calculation

[0250] During gradient backpropagation, the XPU needs to use activation values to calculate gradient values. The XPU can prefetch the activation values stored in the CPU's storage space, so that the prefetched activation values can be used directly when the gradient values are calculated.
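The prefetch pattern can be illustrated in plain Python, with dictionaries standing in for the two storage units and a worker thread standing in for the transfer engine. This is a minimal sketch under stated assumptions: the function name `run_backward` and its arguments are hypothetical, the task list is assumed to be non-empty, and a real system would use asynchronous device copies rather than a thread.

```python
from concurrent.futures import ThreadPoolExecutor


def run_backward(tasks, second_storage, compute):
    """Run tasks in order; while task k is computed, fetch the input of task k+1.

    tasks: keys in execution order (assumed non-empty)
    second_storage: dict standing in for CPU memory
    compute: function applied to the fetched activation value
    """
    first_storage = {}                        # stands in for XPU memory
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(second_storage.__getitem__, tasks[0])
        for k, key in enumerate(tasks):
            first_storage[key] = pending.result()    # wait for the prefetch
            if k + 1 < len(tasks):                   # start the next prefetch
                pending = pool.submit(second_storage.__getitem__, tasks[k + 1])
            # Compute the current task while the next fetch is in flight,
            # and release the activation value once it has been used.
            results.append(compute(first_storage.pop(key)))
    return results


# Backward order 3, 2, 1 over hypothetical stored activation values.
outputs = run_backward([3, 2, 1], {1: 10, 2: 20, 3: 30}, lambda a: a * 2)
```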

[0251] Specifically, when the XPU moves activation values from the CPU's storage space, it can do so at the granularity of the computation module, the network layer, or the training batch. The granularity can be selected according to the actual application scenario.

[0252] In one possible implementation, the granularity of the XPU's data transfer can be the network layer of the model. Transferring data at the layer level can adapt to various model scenarios. For example, as shown in Figure 13, during forward propagation, the activation values of network layers 1 and 3 can be offloaded to the CPU's storage space. During backpropagation, when the gradient value of layer 4 is being calculated, the activation values of layer 3 can be prefetched, so that they can be used directly for calculation; similarly, when the gradient value of layer 2 is being calculated, the activation values of layer 1 can be prefetched.
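The layer-level example can be expressed as a small schedule. This is a minimal sketch; the function name `prefetch_schedule` is hypothetical, and layers are assumed to be numbered from 1 and visited in descending order during backpropagation.

```python
def prefetch_schedule(num_layers, offloaded):
    """Map each layer whose gradient is being computed to the layer to prefetch.

    While the gradient of layer k is computed, the offloaded activation
    values of layer k-1 (if any) are prefetched from CPU memory.
    """
    return {k: k - 1 for k in range(num_layers, 1, -1) if (k - 1) in offloaded}


# Four layers with the activation values of layers 1 and 3 offloaded.
schedule = prefetch_schedule(4, {1, 3})
```

For this input the schedule pairs layer 4 with a prefetch of layer 3 and layer 2 with a prefetch of layer 1, matching the example described for Figure 13.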

[0253] In another possible implementation, the granularity of the XPU's data transfer can be the computation modules within the model. Transferring data at the finer module level allows more efficient use of XPU storage and host-to-device (H2D) bandwidth and adapts to various parallel strategies. For example, as shown in Figure 14, during forward propagation, the activation values of modules a, b, c, and d can be offloaded to the CPU's storage space. During backpropagation, when the gradient values of modules h, g, f, or e are being calculated, the activation values of modules d, c, b, and a can be transferred in advance.

[0254] In another possible implementation, activation values are transferred at the batch level. This adds a dimension to activation value transfer and maximizes the time window available for it. For example, as shown in Figure 15, while gradient backpropagation is being performed for one batch, the activation values of the next batch can be transferred in advance. Especially when the number of batches is large, this maximizes the available transfer time and achieves more efficient activation value transfer.

[0255] Furthermore, when activation values are moved at the batch level, the granularity within a batch can also be the layer or module level, so that a more suitable granularity for moving activation values can be selected based on the transmission bandwidth between the XPU and the CPU.

[0256] Furthermore, in an alternative approach, combining the steps of determining the activation value storage scheme in the aforementioned second stage, if the backpropagation duration can be calculated based on different granularities when determining the activation value storage, then the activation values ​​can also be moved at the corresponding granularity. For example, if the activation value storage scheme is determined at the granularity of the computation module in the aforementioned second stage, then the activation values ​​can be moved at the granularity of the computation module; if the activation value storage scheme is determined at the granularity of the network layer in the aforementioned second stage, then the activation values ​​can be moved at the granularity of the network layer; in some possible scenarios, the activation value storage scheme may also be determined at the granularity of the training batch, then the activation values ​​can also be moved at the granularity of the training batch.

[0257] IV. Model Update

[0258] After the gradient values are calculated, new parameters can be calculated from the model's current parameters using these gradient values, and the new parameters can be used as the model parameters for the next iteration. Alternatively, if the convergence condition is met, the new parameters can be used as the model parameters output by training. The convergence condition can be that a preset training duration is reached, that a preset number of iterations is reached, that the model output reaches a preset accuracy, or that the change in the model's output values across multiple iterations is less than a preset change value. The specific condition can be determined based on the actual application scenario.
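The convergence conditions listed above can be combined in a single check. This is a minimal sketch: the function name `converged`, its parameters, and the choice to stop when any one condition holds are assumptions, since the text leaves the specific condition to the application scenario.

```python
def converged(iteration, elapsed_s, accuracy, recent_outputs, *,
              max_iterations, max_seconds, target_accuracy, min_change):
    """Return True when any of the preset stopping conditions is met."""
    # Preset number of iterations or preset training duration reached.
    if iteration >= max_iterations or elapsed_s >= max_seconds:
        return True
    # Preset model output accuracy reached.
    if accuracy >= target_accuracy:
        return True
    # Change of the model output across the recorded iterations is small.
    if len(recent_outputs) >= 2:
        return max(recent_outputs) - min(recent_outputs) < min_change
    return False


limits = dict(max_iterations=1000, max_seconds=3600.0,
              target_accuracy=0.95, min_change=1e-3)
still_training = converged(10, 5.0, 0.60, [0.91, 0.52], **limits)
plateaued = converged(10, 5.0, 0.60, [0.5001, 0.5003, 0.5002], **limits)
```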

[0259] For example, the overall process of model update can be as shown in Figure 16. During forward propagation, the activation values corresponding to A = {10, 12} are stored in HBM, and the activation values corresponding to B = {0, 2, 3, 7, 8} are offloaded (i.e., stored) into the CPU's storage space. During backpropagation, the activation values can be prefetched from the CPU's storage space. The gradient values are then calculated based on the prefetched activation values, and the new model parameters are calculated based on the gradient values. The new model parameters are used as the model parameters output in this iteration.

[0260] Therefore, in this embodiment, the storage and communication bandwidth resources of the XPU and CPU can be fully utilized to achieve more efficient model training with higher resource utilization. By fully utilizing idle network bandwidth and hierarchical storage resources, adaptive and proactive migration of activation values is achieved, supplementing storage and computation with network bandwidth, greatly reducing recomputation overhead, and significantly alleviating the memory and computational limitations of large model training. Through refined joint modeling of network, storage, and computation, just-in-time proactive data migration is realized, adapting to a wide variety of large model training scenarios and significantly improving end-to-end performance.

[0261] For example, some large language models include an attention module, whose output activation values usually occupy a small amount of memory but require a large amount of computation. Therefore, these activation values can be stored in a secondary storage space, replacing a large amount of computation with a smaller data transfer cost. As shown in Figure 17, the attention activation values can be moved from HBM to DRAM via device-to-host (D2H) transfer during forward propagation. The HBM tensor storage size (activation value storage size) can be resized to 0 at a specified point in the forward pass, reducing the HBM footprint. At a specified point in the backward pass, the HBM tensor storage size is restored, and the activation values are moved back from DRAM to HBM. Performing H2D / D2H data transfer in parallel with computation and D2D communication reduces recomputation. When model training is performed on the same computing device, the method provided in this embodiment improves MFU by 17.6% compared with the activation value recomputation scheme.

[0262] The foregoing has described the method flow provided in the embodiments of this application. The following describes the structure of the apparatus for performing the foregoing method.

[0263] Referring to Figure 18, an embodiment of this application provides a schematic diagram of the structure of a first computing unit. The first computing unit may specifically be the first computing unit 101 mentioned in Figure 1 above, or the GPU, TPU, NPU or other neural network dedicated accelerators mentioned in Figures 2 to 5 above. The first computing unit may be used to execute the steps performed by the first computing unit in the method flow corresponding to Figures 6 to 17 above.

[0264] The first computing unit, the first storage unit, and the second storage unit are deployed in the first system, with the first computing unit coupled to the first storage unit; the first computing unit includes:

[0265] The first calculation module 1801 is used to obtain multiple calculation results, which are the results of calculating multiple first calculation tasks.

[0266] Storage module 1802 is used to send at least one first calculation result from a plurality of calculation results to a second storage unit;

[0267] The second calculation module 1803 is used to calculate a second calculation task based on the obtained first calculation result. When the second calculation task is calculated, the first calculation result corresponding to the next second calculation task is obtained from the second storage unit and stored in the first storage unit. The second calculation task corresponds to the calculation result of at least one first calculation task.

[0268] In one possible implementation, the aforementioned second computing module 1803 is specifically used to compute a second computing task during a first duration, and to retrieve the first computing result corresponding to the next second computing task from the second storage unit during a second duration and store it in the first storage unit, with the first duration covering the second duration.

[0269] In one possible implementation, the aforementioned first computational task is applied to the forward propagation of the neural network, and the computation result of the first computational task includes activation values. The second computational task is applied to the backward propagation of the neural network, and the second computational task includes gradient computation tasks in the backward propagation.

[0270] In one possible implementation, the aforementioned first computation task includes a computation task of the network layer, a computation task of the computation module, or a computation task of the training batch; the second computation task includes a gradient computation task of the network layer, a gradient computation task of the computation module, or a gradient computation task of the training batch.

[0271] In one possible implementation, the aforementioned second calculation module 1803 is specifically used to calculate the gradient calculation task of the second network layer during a first duration, and to obtain the first calculation result required for the gradient calculation task of the first network layer from the second storage unit during a second duration.

[0272] In one possible implementation, the aforementioned second computing module 1803 is specifically used to calculate the gradient calculation task of the second computing module during a first duration, and to obtain the calculation results required by the gradient calculation task of the first computing module from the second storage unit during a second duration.

[0273] In one possible implementation, the aforementioned second calculation module 1803 is specifically used to calculate the gradient calculation task of the first training batch during a first duration, and to obtain the calculation results required for the gradient calculation task of the second training batch from the second storage unit during a second duration.

[0274] In one possible implementation, the aforementioned first calculation module 1801 is further configured to determine at least one first calculation result from a plurality of calculation results;

[0275] Storage module 1802 is specifically used to send at least one first calculation result to a second storage unit.

[0276] In one possible implementation, the aforementioned first calculation module 1801 is further configured to determine at least one second calculation result from a plurality of calculation results;

[0277] Storage module 1802 is specifically used to store at least one second calculation result in the first storage unit.

[0278] In one possible implementation, the aforementioned first calculation module 1801 is further configured to determine at least one third calculation result from multiple calculation results, wherein at least one third calculation result is discarded, and before the first calculation unit executes the second calculation task corresponding to the third calculation result, the first calculation unit re-executes the first calculation task to obtain the third calculation result.

[0279] In one possible implementation, the aforementioned second calculation module 1803 is further configured to, if the multiple calculation results include the third calculation result, recalculate the first calculation task during a third duration to obtain the third calculation result, wherein the total of the third duration and the first duration covers the second duration.

[0280] In one possible implementation, the aforementioned first calculation module 1801 is specifically used to determine at least one first calculation result from multiple calculation results based on the attribute information of the calculation result, wherein the attribute information includes at least one of the space occupied by the calculation result or the amount of calculation.

[0281] In one possible implementation, the aforementioned first calculation module 1801 constructs constraints based on the size of the first storage unit and the size of the second storage unit; under the constraints, the first calculation module 1801 determines a storage indication of the calculation result based on the attribute information of the calculation result and the duration of the second calculation task, and the storage indication is used to indicate that the calculation result is one of the first calculation result, the second calculation result, or the third calculation result.

[0282] In one possible implementation, the aforementioned first computing unit includes a graphics processing unit (GPU), a neural network processing unit (NPU), or a tensor processing unit (TPU).

[0283] Referring to Figure 19, this application provides a second storage unit applied to a first system. The first system includes a first computing unit, a first storage unit, and a second storage unit. The first computing unit is coupled to the first storage unit. Specifically, the second storage unit may be the second storage unit 104 mentioned in Figures 1 to 5 above, or other storage units that can be used to store data. The second storage unit can be used to perform the actions performed by the second storage unit in Figures 6 to 17 above.

[0284] The second storage unit includes:

[0285] Transceiver module 1901 is used to receive at least one first calculation result, which is the result of the first calculation unit calculating the first calculation task;

[0286] Storage module 1902 is used to store the at least one first calculation result;

[0287] The transceiver module 1901 is also used to, while the first computing unit is calculating the current second calculation task based on the obtained first calculation result, send the first calculation result corresponding to the next second calculation task to the first computing unit.

[0288] In one possible implementation, the aforementioned first system may specifically be a neural network system, that is, a system in which a neural network is deployed or used to perform computations in a neural network. Accordingly, the aforementioned first computation task and second computation task may be computation tasks of the neural network.

[0289] In one possible implementation, the aforementioned first computational task is applied to the forward propagation of the neural network, and the computation result of the first computational task includes activation values. The second computational task is applied to the backward propagation of the neural network, and the second computational task includes gradient computation tasks in the backward propagation.

[0290] In one possible implementation, the aforementioned neural network includes at least one network layer, each network layer including at least one computation module, and the training of the neural network is divided into multiple training batches; accordingly, the first computation task may include the computation task of the network layer, the computation task of the computation module, or the computation task of the training batch; the second computation task may include the gradient computation task of the network layer, the gradient computation task of the computation module, or the gradient computation task of the training batch.

[0291] In one possible implementation, the aforementioned transceiver module 1901 is specifically used so that, while the first computing unit calculates the gradient calculation task of the second network layer during a first duration, the second storage unit sends the first calculation result required for the gradient calculation task of the first network layer to the first computing unit during a second duration.

[0292] In one possible implementation, the aforementioned transceiver module 1901 is specifically used so that, while the first computing unit calculates the gradient calculation task of the second computing module during a first duration, the second storage unit sends the first calculation result required by the gradient calculation task of the first computing module to the first computing unit during a second duration.

[0293] In one possible implementation, the aforementioned transceiver module 1901 is specifically used so that, while the first computing unit calculates the gradient calculation task of the first training batch during a first duration, the second storage unit sends the first calculation result required for the gradient calculation task of the second training batch to the first computing unit during a second duration.
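
As a non-limiting illustration of the overlap described in the preceding implementations, the following Python sketch fetches the activation value needed by the next gradient task while the current gradient task is running. The names `backward_with_prefetch`, `slow_store`, `fast_store`, and `compute_gradient` are hypothetical and do not appear in this application. A dictionary stands in for each storage unit and a worker thread stands in for the transfer path, so the sketch shows only the ordering of operations and is not a device implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def backward_with_prefetch(num_layers, slow_store, compute_gradient):
    """Run gradient tasks from the last layer to the first, prefetching ahead.

    slow_store: dict mapping layer index -> activation (assumed stand-in for
        the second storage unit).
    compute_gradient: callable(layer, activation) -> result (hypothetical).
    """
    fast_store = {}  # assumed stand-in for the first storage unit
    order = list(range(num_layers - 1, -1, -1))  # backward order
    if not order:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as copier:
        # The first gradient task has nothing to overlap with, so fetch now.
        fast_store[order[0]] = slow_store[order[0]]
        for pos, layer in enumerate(order):
            pending = None
            if pos + 1 < len(order):
                nxt = order[pos + 1]
                # Start the transfer for the next task (second duration) ...
                pending = copier.submit(slow_store.__getitem__, nxt)
            # ... while the current gradient task runs (first duration).
            results.append(compute_gradient(layer, fast_store.pop(layer)))
            if pending is not None:
                # Store the prefetched activation in the fast store.
                fast_store[nxt] = pending.result()
    return results
```

In this sketch each activation is removed from `fast_store` as soon as its gradient task consumes it, so at most two activations are resident at any time: the one in use and the one being prefetched.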

[0294] Furthermore, the aforementioned first computing unit or second computing unit can be in the form of a computing device or a chip. The structures of possible computing devices and chips will be described below.

[0295] Figure 20 shows a schematic diagram of the hardware structure of a computing device 200 provided in an embodiment of this application. This computing device 200 can be used to implement the steps executed by the first computing unit or the second computing unit in the methods shown in Figures 6 to 17.

[0296] The computing device 200 shown in Figure 20 may include a processor 2001, a memory 2002, a communication interface 2003, and a bus 2004. The processor 2001, the memory 2002, and the communication interface 2003 can be connected to each other via the bus 2004.

[0297] The processor 2001 is the control center of the computing device 200. It can be a general-purpose central processing unit (CPU) or other general-purpose processors. The general-purpose processor can be a microprocessor or any conventional processor, such as a GPU or NPU, and can be adapted to the actual application scenario.

[0298] As an example, processor 2001 may include one or more CPUs, and may also include other processors, such as the CPU, NPU, or GPU shown in Figure 20. The first computing unit and the second computing unit mentioned in the aforementioned method refer to one type of processor or chip; for example, the first computing unit may be a GPU, NPU, or TPU, and the second computing unit may be a CPU or other storage space.

[0299] The memory 2002 may be a read-only memory (ROM) or other type of static storage device capable of storing static information and instructions, random access memory (RAM) or other type of dynamic storage device capable of storing information and instructions, or electrically erasable programmable read-only memory (EEPROM), disk storage media or other magnetic storage devices, or any other medium capable of carrying or storing desired program code in the form of instructions or data structures and accessible by a computer, but is not limited thereto.

[0300] In one possible implementation, the memory 2002 may exist independently of the processor 2001. The memory 2002 can be connected to the processor 2001 via a bus 2004 and is used to store data, instructions, or program code. When the processor 2001 calls and executes the instructions or program code stored in the memory 2002, it can implement the methods provided in the embodiments of this application, such as the methods shown in Figures 6 to 17.

[0301] In another possible implementation, the memory 2002 can also be integrated with the processor 2001.

[0302] Communication interface 2003 is used for connecting computing device 200 to other devices via a communication network, which may be Ethernet, radio access network (RAN), wireless local area network (WLAN), etc. Communication interface 2003 may include a receiving unit for receiving data and a transmitting unit for transmitting data.

[0303] Bus 2004 can be an industry standard architecture (ISA) bus, a peripheral component interconnect (PCI) bus, or an extended industry standard architecture (EISA) bus. This bus can be divided into address bus, data bus, control bus, etc. For ease of illustration, only one thick line is used in Figure 20, but this does not indicate that there is only one bus or one type of bus.

[0304] It should be noted that the structure shown in Figure 20 does not constitute a limitation on the computing device 200. In addition to the components shown in Figure 20, the computing device 200 may include more or fewer components than shown, or combine certain components, or have different component arrangements.

[0305] For example, please refer to Figure 21, which is a schematic diagram of a chip structure provided in an embodiment of this application. This chip can specifically be a manifestation of the aforementioned first computing unit, such as a GPU, TPU, NPU, or other neural network-specific accelerators, referred to here simply as an XPU. The chip can be represented as an XPU 210, which is mounted as a coprocessor on the host CPU, with tasks assigned by the host CPU. The core of the XPU is the arithmetic circuit 2103, which is controlled by a controller 2104 to extract matrix data from memory and perform multiplication operations.

[0306] In some implementations, the arithmetic circuit 2103 internally includes multiple processing engines (PEs). In some implementations, the arithmetic circuit 2103 is a two-dimensional systolic array. The arithmetic circuit 2103 can also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 2103 is a general-purpose matrix processor.

[0307] For example, suppose we have an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit retrieves the corresponding data of matrix B from the weight memory 2102 and caches it in each PE of the arithmetic circuit. The arithmetic circuit retrieves the data of matrix A from the input memory 2101 and performs matrix operations with matrix B. The partial result or the final result of the obtained matrix is stored in the accumulator 2108.
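
As a non-limiting illustration of how partial results can build up in an accumulator, the following Python sketch computes C = A x B by adding one slice of the inner dimension at a time. The function name `tiled_matmul` and the `tile` parameter are hypothetical. The sketch uses plain nested lists and does not model the PEs, the weight memory 2102, or the input memory 2101.

```python
def tiled_matmul(a, b, tile=2):
    """Compute C = A x B by accumulating partial products tile by tile."""
    rows, inner, cols = len(a), len(b), len(b[0])
    # The accumulator holds partial results until every slice has been added.
    acc = [[0] * cols for _ in range(rows)]
    for k0 in range(0, inner, tile):
        k1 = min(k0 + tile, inner)  # the last slice may be shorter
        for i in range(rows):
            for j in range(cols):
                # Add this slice's contribution to the running partial sum.
                acc[i][j] += sum(a[i][k] * b[k][j] for k in range(k0, k1))
    return acc
```

After the final slice has been processed, the accumulator contents are the final result; before that point they are partial results.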

[0308] Unified memory 2106 is used to store input and output data. Weight data is directly transferred to weight memory 2102 via direct memory access controller (DMAC) 2105. Input data is also transferred to unified memory 2106 via DMAC.

[0309] The bus interface unit (BIU) 2110 is used for interaction among the AXI bus, the DMAC 2105, and the instruction fetch buffer (IFB) 2109.

[0310] The bus interface unit (BIU) 2110 is used by the instruction fetch buffer 2109 to fetch instructions from external memory, and also by the direct memory access controller 2105 to fetch the original data of the input matrix A or the weight matrix B from external memory.

[0311] The DMAC is mainly used to move input data from the external memory (DDR) to the unified memory 2106, to move weight data to the weight memory 2102, or to move input data to the input memory 2101.

[0312] The vector computation unit 2107 includes multiple arithmetic processing units that, when necessary, further process the output of the arithmetic circuit 2103, for example through vector multiplication, vector addition, exponential operations, logarithmic operations, and magnitude comparisons. It is mainly used for computation in non-convolutional / fully connected layers of neural networks, such as batch normalization, pixel-level summation, and upsampling of feature planes.

[0313] In some implementations, vector computation unit 2107 can store the processed output vector in unified memory 2106. For example, vector computation unit 2107 can apply linear and/or nonlinear functions to the output of arithmetic circuit 2103, such as linear interpolation of feature planes extracted by convolutional layers, or, for example, accumulating a vector of values to generate activation values. In some implementations, vector computation unit 2107 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vector can be used as activation input to arithmetic circuit 2103, for example, for use in subsequent layers of the neural network.
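
As a non-limiting illustration of this kind of post-processing, the following Python sketch normalizes one output vector of a matrix operation and then applies a nonlinear function to obtain activation values. The function names `normalize`, `relu`, and `postprocess`, the choice of ReLU as the nonlinear function, and the `eps` constant are assumptions made for the example and are not specified by this application.

```python
import math


def normalize(values, eps=1e-5):
    """Shift and scale a vector to zero mean and (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def relu(values):
    """Element-wise nonlinear function: negative entries become zero."""
    return [max(0.0, v) for v in values]


def postprocess(matmul_row):
    """Normalize one output row, then apply the nonlinearity."""
    return relu(normalize(matmul_row))
```

The resulting vector could then serve as the activation input to the next layer, as the paragraph above describes.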

[0314] The instruction fetch buffer 2109 connected to the controller 2104 is used to store the instructions used by the controller 2104.

[0315] Unified memory 2106, input memory 2101, weight memory 2102, and instruction fetch buffer 2109 are all on-chip memories. The external memory is private to this XPU hardware architecture.

[0316] The operations of each network layer or each computing module in the neural network can be performed by the arithmetic circuit 2103 or the vector computation unit 2107.

[0317] The aforementioned first storage unit may include the unified memory 2106 therein.

[0318] The processor mentioned above can be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits used to control the execution of programs in the methods shown in Figures 6 to 17.

[0319] Through the above description of the embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus necessary general-purpose hardware, or it can be implemented by special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memory, special-purpose components, etc. Generally, any function performed by a computer program can be easily implemented by corresponding hardware, and the specific hardware structure used to implement the same function can also be diverse, such as analog circuits, digital circuits, or special-purpose circuits. However, for this application, software program implementation is more often the preferred implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, mobile hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments of this application.

[0320] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product.

[0321] This application also provides a computer-readable storage medium storing a program for training a model or performing inference tasks, which, when run on a computer, causes the computer to perform all or part of the steps in the methods described in the embodiments shown in Figures 6 to 17 above.

[0322] This application also provides a digital processing chip. This digital processing chip integrates circuitry for implementing the aforementioned processor or processor functions, and one or more interfaces. When the digital processing chip integrates a memory, it can perform the method steps of any one or more of the foregoing embodiments. When the digital processing chip does not integrate a memory, it can be connected to an external memory via a communication interface. The digital processing chip implements the method steps of any one or more of the foregoing embodiments based on the program code stored in the external memory.

[0323] This application also provides a computer program product comprising one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in this application are generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave, etc.) means. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid-state disk (SSD)).

[0324] The first and second computing units provided in the embodiments of this application can be chips, each including a processing unit and a communication unit. The processing unit can be, for example, a processor, and the communication unit can be, for example, an input/output interface, pins, or circuits. The processing unit can execute computer-executable instructions stored in a storage unit to cause the chip within the server to execute the methods described in the embodiments shown in Figures 6 to 17. Optionally, the storage unit can be a storage unit within the chip, such as a register or cache. Alternatively, the storage unit can be a storage unit located outside the chip within a computing device, such as read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, or random access memory (RAM).

[0325] Specifically, the aforementioned processing unit or processor can be a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor can be a microprocessor or any conventional processor.

[0326] It should also be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. In addition, in the device embodiment drawings provided in this application, the connection relationship between modules indicates that they have a communication connection, which can be implemented as one or more communication buses or signal lines.

[0330] The terms "first," "second," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that terms used in this way can be interchanged where appropriate, so that the embodiments described herein can be implemented in a sequence other than that illustrated or described herein. The term "and/or" in this application merely describes an association between related objects and indicates that three relationships can exist. For example, A and/or B can represent: A alone, both A and B, or B alone. Additionally, the character "/" generally indicates that the preceding and following related objects are in an "or" relationship. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not explicitly listed or inherent to such processes, methods, products, or devices. The naming or numbering of steps in this application does not imply that the steps in the method flow must be executed in the temporal or logical order indicated by the naming or numbering. The execution order of the named or numbered process steps can be changed according to the technical purpose to be achieved, as long as the same or similar technical effect can be achieved. The division of modules in this application is a logical division. In actual applications, there may be other division methods. For example, multiple modules may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the coupling, direct coupling, or communication connection between the modules shown or discussed may be through some ports, and the indirect coupling or communication connection between modules may be electrical or in other similar forms, which is not limited in this application. Furthermore, the modules or sub-modules described as separate components may or may not be physically separated, may or may not be physical modules, or may be distributed in multiple circuit modules. Some or all of the modules can be selected to achieve the purpose of the solution in this application according to actual needs.

Claims

1. A data processing method, characterized in that the method is applied to a first computing unit, wherein the first computing unit, a first storage unit, and a second storage unit are deployed in a first system, and the first computing unit is coupled to the first storage unit; the method includes: the first computing unit acquires a plurality of calculation results, which are the results of computing a plurality of first computing tasks; the first computing unit sends at least one first calculation result of the plurality of calculation results to the second storage unit; and the first computing unit calculates a second computing task based on the obtained first calculation result and, while the second computing task is being calculated, retrieves the first calculation result corresponding to the next second computing task from the second storage unit and stores it in the first storage unit, wherein the second computing task corresponds to the calculation result of at least one first computing task.

2. The method according to claim 1, characterized in that retrieving, while the second computing task is being calculated, the first calculation result corresponding to the next second computing task from the second storage unit and storing it in the first storage unit includes: the first computing unit calculates the second computing task in a first duration, and retrieves the first calculation result corresponding to the next second computing task from the second storage unit in a second duration and stores it in the first storage unit, wherein the first duration covers the second duration.

3. The method according to claim 2, characterized in that, The first computational task is applied to the forward propagation of the neural network, and the computation result of the first computational task includes activation values. The second computational task is applied to the backward propagation of the neural network, and the second computational task includes the gradient computation task in the backward propagation.

4. The method according to claim 3, characterized in that, The neural network includes at least one network layer, and each network layer includes at least one computing module. The training of the neural network is divided into multiple training batches. The first computation task includes the computation task of the network layer, the computation task of the computation module, or the computation task of the training batch; The second computation task includes gradient computation tasks of network layers, gradient computation tasks of computation modules, or gradient computation tasks of training batches.

5. The method according to claim 4, characterized in that, The first computing unit calculates the second computing task for a first duration, and retrieves the first computing result required for the next second computing task from the second storage unit and stores it in the first storage unit for a second duration, including: The first computing unit calculates the gradient calculation task of the second network layer during the first duration, and retrieves the first calculation result required for the gradient calculation task of the first network layer from the second storage unit during the second duration.

6. The method according to claim 4, characterized in that, The first computing unit calculates one of the second computing tasks during a first time period, and retrieves the calculation results required for the next second computing task from the second storage unit during a second time period and stores them in the first storage unit, including: The first computing unit calculates the gradient calculation task of the second computing module during the first duration, and retrieves the calculation results required by the gradient calculation task of the first computing module from the second storage unit during the second duration.

7. The method according to any one of claims 4-6, characterized in that, The first computing unit calculates one of the second computing tasks during a first time period, and retrieves the calculation results required for the next second computing task from the second storage unit during a second time period and stores them in the first storage unit, including: The first computing unit calculates the gradient calculation task for the first training batch during the first duration, and retrieves the calculation results required for the gradient calculation task for the second training batch from the second storage unit during the second duration.

8. The method according to any one of claims 3-7, characterized in that, The first calculation unit sends at least one first calculation result from the plurality of calculation results to the second storage unit, including: The first calculation unit determines at least one first calculation result from the plurality of calculation results; The first computing unit sends the at least one first calculation result to the second storage unit.

9. The method according to claim 8, characterized in that, The method further includes: The first calculation unit determines at least one second calculation result from the plurality of calculation results; The first computing unit stores the at least one second computing result in the first storage unit.

10. The method according to claim 8 or 9, characterized in that, The method further includes: The first computing unit determines at least one third computing result from the plurality of computing results. The at least one third computing result is discarded. Before the first computing unit executes the second computing task corresponding to the third computing result, the first computing unit re-executes the first computing task to obtain the third computing result.

11. The method according to claim 10, characterized in that, if the third calculation result is included among the plurality of calculation results, the method further includes: the first computing unit recalculates the first computing task during a third duration to obtain the third calculation result, and the sum of the third duration and the first duration covers the second duration.

12. The method according to any one of claims 8-11, characterized in that, The first calculation unit determines the at least one first calculation result from the plurality of calculation results, including: The first calculation unit determines at least one of the first calculation result, the second calculation result, or the third calculation result from a plurality of calculation results based on the attribute information of the calculation result, wherein the attribute information includes at least one of the space occupied by the calculation result or the amount of computation.

13. The method according to claim 12, characterized in that the first computing unit determining at least one of the first calculation result, the second calculation result, or the third calculation result from the plurality of calculation results based on the attribute information of the calculation result includes: the first computing unit constructs constraints based on the size of the first storage unit and the size of the second storage unit; and, subject to the constraints, the first computing unit determines a storage indication of the calculation result based on the attribute information of the calculation result and the duration of the second computing task, wherein the storage indication is used to indicate that the calculation result is one of the first calculation result, the second calculation result, or the third calculation result.

14. The method according to any one of claims 1-13, characterized in that the first computing unit includes a graphics processing unit (GPU), a neural-network processing unit (NPU), or a tensor processing unit (TPU).

15. The method according to any one of claims 1-14, characterized in that, The first system also includes a second computing unit, the second storage unit is mounted to the second computing unit, or the second storage unit is a disk.

16. A first computing unit, characterized in that the first computing unit, a first storage unit, and a second storage unit are deployed in a first system, with the first computing unit coupled to the first storage unit; the first computing unit includes: a first calculation module configured to obtain a plurality of calculation results, wherein the plurality of calculation results are the results of calculating a plurality of first calculation tasks; a storage module configured to send at least one first calculation result of the plurality of calculation results to the second storage unit; and a second calculation module configured to calculate a second calculation task based on the obtained first calculation result and, while the second calculation task is being calculated, to obtain the first calculation result corresponding to the next second calculation task from the second storage unit and store it in the first storage unit, wherein the second calculation task corresponds to the calculation result of at least one first calculation task.

17. A system, characterized in that, The system includes a first computing unit, a first storage unit, and a second storage unit. The first computing unit is coupled to the first storage unit. The first storage unit and the second storage unit are used for the first computing unit to store data. The first computing unit is used to perform the steps in the method as described in any one of claims 1-15.

18. A computing device, characterized in that, The device includes a memory and a processor; the memory stores code, and the processor is configured to execute the code, wherein when the code is executed, the device performs the steps of the method as described in any one of claims 1-15.

19. A computer storage medium, characterized in that the computer storage medium stores instructions that, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 15.

20. A computer program product, characterized in that, The computer program product stores instructions that, when executed by a computer, cause the computer to perform the method described in any one of claims 1 to 15.
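
As a non-limiting illustration of the three-way determination recited in claims 8 to 13, the following Python sketch labels each calculation result as kept in the first storage unit ("keep", the second calculation result), sent to the second storage unit ("offload", the first calculation result), or discarded and later recomputed ("recompute", the third calculation result). The greedy ordering, the function names, the dictionary fields (`size`, `recompute_time`, `overlap_time`), and the `bandwidth` parameter are all assumptions made for the example. This application does not specify this particular procedure, and an actual implementation could solve the constrained problem differently.

```python
def prefetch_is_hidden(first_duration, second_duration, third_duration=0.0):
    """True if compute time (plus any recompute time) covers the transfer time."""
    return first_duration + third_duration >= second_duration


def assign_storage(results, fast_capacity, slow_capacity, bandwidth):
    """Label each result 'keep', 'offload', or 'recompute' (greedy heuristic).

    Each result is a dict with hypothetical fields:
      name, size, recompute_time, and overlap_time (the duration of the
      second computing task that a transfer of this result could overlap).
    """
    plan = {}
    fast_used = 0
    slow_used = 0
    # Consider the results that are most expensive to recompute per unit size first.
    ranked = sorted(
        results, key=lambda r: r["recompute_time"] / r["size"], reverse=True
    )
    for r in ranked:
        transfer_time = r["size"] / bandwidth
        if fast_used + r["size"] <= fast_capacity:
            plan[r["name"]] = "keep"
            fast_used += r["size"]
        elif (slow_used + r["size"] <= slow_capacity
              and prefetch_is_hidden(r["overlap_time"], transfer_time)):
            plan[r["name"]] = "offload"
            slow_used += r["size"]
        else:
            plan[r["name"]] = "recompute"
    return plan
```

The two capacity checks play the role of the constraints built from the sizes of the two storage units, and `prefetch_is_hidden` expresses the condition that the first duration (optionally together with a third, recompute duration) covers the second duration.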