AI model training method and apparatus, system, and related device

By adjusting the batch size according to the trend of loss value changes during AI model training, the problems of low training efficiency and low accuracy in existing technologies are solved, achieving more efficient and higher accuracy training results.

WO2026123891A1 · PCT designated stage · Publication Date: 2026-06-18 · HUAWEI TECH CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
HUAWEI TECH CO LTD
Filing Date
2025-09-25
Publication Date
2026-06-18

AI Technical Summary

Technical Problem

In the training process of existing AI models, dynamically adjusting the batch size can easily lead to low training efficiency and low inference accuracy.

Method used

By obtaining the trend of loss value changes during iterative training, the hyperparameters of the AI model, including the batch size and learning rate, can be adjusted to match the convergence trend of the model, so that unreasonable batch size values do not degrade training efficiency and accuracy.

Benefits of technology

It improves the training efficiency and inference accuracy of AI models, and avoids local optima and poor generalization performance caused by unreasonable batch size values.

✦ Generated by Eureka AI based on patent content.


Abstract

This application discloses an artificial intelligence (AI) model training method and apparatus, a system, and a related device, relating to the technical field of AI. The method comprises: during iterative training of an AI model by an accelerator in a data processing system, acquiring loss values of the AI model during multi-round iterative training, and adjusting the values of hyperparameters of the AI model on the basis of the change trend of the loss values of the AI model during the multi-round iterative training, wherein the hyperparameters include a batch size; and on the basis of the adjusted values of the hyperparameters, continuing to perform iterative training on the AI model. The change trend of the loss values can truly reflect the convergence trend of the AI model during the multi-round iterative training. Thus, the value of the batch size of the AI model can be adjusted on the basis of that change trend so that the adjusted value of the batch size matches the convergence trend of the AI model, thereby accelerating the training of the AI model and also enabling the inference accuracy of the AI model to reach a high level.

Description

AI model training methods, devices, systems and related equipment

[0001] This application claims priority to the Chinese patent application filed on December 12, 2024, with application number 202411846557.7 and the title "AI Model Training Method, Apparatus, System and Related Equipment", the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This application relates to the field of artificial intelligence technology, and in particular to an AI model training method, apparatus, system and related equipment.

Background Technology

[0003] With the development of artificial intelligence (AI) technology, AI models have been widely used in fields such as natural language processing (NLP), image processing, autonomous driving, and healthcare. Specifically, they can be used to provide corresponding reasoning services by utilizing trained AI models.

[0004] Currently, during the iterative training of AI models, the values of hyperparameters such as batch size can typically be dynamically adjusted to improve training efficiency. Specifically, the gradient g_m of the AI model during the m-th iteration is recorded. After the n-th iteration is performed for the AI model, the cosine value between the gradient g_n of the n-th iteration and the gradient g_m of the m-th iteration is calculated, where m and n are both positive integers and m is less than n. When the cosine value between gradient g_n and gradient g_m is less than a preset value, the batch size of the AI model is increased to accelerate the convergence of the AI model, thereby improving the training efficiency of the AI model.
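The gradient-cosine rule described in this background paragraph can be sketched as follows. This is a minimal illustration rather than any particular existing implementation: the function names, the preset value of 0.5, and the doubling factor are assumptions chosen for the example.

```python
import math


def cosine(g_m, g_n):
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g_m, g_n))
    norm_m = math.sqrt(sum(a * a for a in g_m))
    norm_n = math.sqrt(sum(b * b for b in g_n))
    if norm_m == 0.0 or norm_n == 0.0:
        return 0.0  # assumption: a zero gradient counts as "no agreement"
    return dot / (norm_m * norm_n)


def next_batch_size(batch_size, g_m, g_n, preset=0.5, factor=2):
    """Increase the batch size when cos(g_m, g_n) is below the preset value.

    `preset` and `factor` are hypothetical values, not taken from the text.
    """
    if cosine(g_m, g_n) < preset:
        return batch_size * factor
    return batch_size
```

Note that two gradients that act on different dimensions, such as [1, 0] and [0, 1], have a cosine of 0 and therefore trigger an increase under this rule regardless of how the loss value is behaving.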

[0005] However, training AI models using the above methods can easily lead to low inference accuracy in the trained AI models.

Summary of the Invention

[0006] This application provides an AI model training method to improve the training efficiency of AI models and the inference accuracy of the trained AI models. Furthermore, this application also provides a corresponding AI model training apparatus, computing device, data processing system, computer-readable storage medium, and computer program product.

[0007] In a first aspect, this application provides an AI model training method applied to a data processing system. This system includes an accelerator, such as an NPU or GPU, capable of data acceleration. The accelerator in the data processing system can be used to train an AI model, such as a LLaMA model. Specifically, this method can be executed by a corresponding AI model training device. During iterative training of the AI model, the device can obtain the loss value of the AI model in multiple rounds of iterative training and adjust the hyperparameters of the AI model based on the changing trend of the loss value (such as the average slope of the loss value). These hyperparameters include the batch size; in practical applications, they may also include the learning rate. The AI model training device can then continue iterative training of the AI model based on the adjusted hyperparameter values.

[0008] Since the trend of the loss value truly reflects the convergence trend of the AI ​​model during multiple rounds of iterative training, the AI ​​model training device adjusts the batch size of the AI ​​model based on this trend. This ensures that the adjusted batch size matches the convergence trend of the AI ​​model. For example, if the AI ​​model's loss value oscillates, indicating that the model is not converging during multiple rounds of training, increasing the batch size can promote convergence. Conversely, if the slope of the loss value's decline is steep, indicating that the model is converging too quickly, decreasing the batch size can slow down the convergence and prevent the model from getting trapped in local optima. Therefore, continuing to train the AI ​​model based on a batch size that matches its convergence trend not only accelerates training but also prevents unreasonable batch size settings from affecting the model's inference accuracy, thus enabling the AI ​​model to achieve a higher level of inference accuracy.

[0009] In one possible implementation, when adjusting the hyperparameters of the AI model based on the changing trend of the loss value during multiple rounds of iterative training, the AI model training device first calculates the average slope of the loss value over those rounds. When the average slope is greater than zero, which indicates that the AI model is in an oscillating (non-convergent) state, the device increases the batch size of the AI model. Increasing the batch size in this state can accelerate the convergence of the AI model and thereby improve training efficiency.
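As a rough illustration of this implementation, the sketch below computes an average slope and increases the batch size when it is positive. The text does not define how the average slope is computed, so the mean of successive loss differences is an assumption here, as are the function names and the doubling factor.

```python
def average_slope(losses):
    """Mean change in loss per iteration (assumed definition of 'average slope')."""
    if len(losses) < 2:
        raise ValueError("need at least two loss values")
    steps = [later - earlier for earlier, later in zip(losses, losses[1:])]
    return sum(steps) / len(steps)


def adjust_on_rising_loss(batch_size, losses, factor=2):
    """Increase the batch size when the average slope is greater than zero.

    `factor` is a hypothetical growth factor, not taken from the text.
    """
    if average_slope(losses) > 0:
        return batch_size * factor
    return batch_size
```

For example, the loss sequence 2.0, 2.5, 2.25, 2.75 has an average slope of 0.25, so a batch size of 8 would be raised to 16, whereas a steadily falling sequence leaves the batch size unchanged.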

[0010] In one possible implementation, when obtaining the loss value of the AI model during multiple rounds of iterative training, the AI model training device obtains the loss values for the rounds corresponding to a first observation window and for the rounds corresponding to a second observation window, where the iterative training of the second observation window is executed before that of the first observation window. When adjusting the hyperparameter values based on the changing trend of the loss value, the device calculates the first average slope of the loss value over the first observation window. When the first average slope is less than zero, indicating that the AI model is gradually converging, the device calculates the absolute value of the second average slope of the loss value over the second observation window. When the absolute value of the first average slope is less than that of the second average slope, indicating that the convergence of the AI model is slowing down, the device can increase the batch size of the AI model. When the absolute value of the first average slope is greater than that of the second average slope, indicating that the convergence of the AI model is speeding up, the device can reduce the batch size of the AI model. In this way, by comparing the average slopes of the loss values within the two observation windows, the device can determine the convergence trend of the AI model over multiple rounds of training. Therefore, as the AI model approaches convergence, the device can dynamically adjust the batch size based on the convergence trend, accelerating the training of the AI model (increasing the batch size when it is too small) while keeping the training accuracy of the AI model at a high level (decreasing the batch size when it is too large).
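The two-window comparison can be sketched as below. The definition of the average slope (mean of successive differences) and the string return values are assumptions made for illustration; the case of a non-negative first slope is outside this implementation and is simply reported as "keep".

```python
def average_slope(losses):
    """Mean change in loss per iteration (assumed definition of 'average slope')."""
    steps = [later - earlier for earlier, later in zip(losses, losses[1:])]
    return sum(steps) / len(steps)


def compare_windows(first_window, second_window):
    """Return 'increase', 'decrease', or 'keep' for the batch size.

    `first_window` holds the more recent losses; `second_window` holds the
    losses from the earlier observation window.
    """
    first = average_slope(first_window)
    if first >= 0:
        return "keep"  # not covered by this implementation
    second_abs = abs(average_slope(second_window))
    if abs(first) < second_abs:
        return "increase"  # convergence is slowing down
    if abs(first) > second_abs:
        return "decrease"  # convergence is speeding up
    return "keep"
```

With an earlier window of 3.0, 2.0, 1.0 (slope -1.0) and a recent window of 1.0, 0.75, 0.5 (slope -0.25), the sketch reports "increase" because the loss is falling more slowly than before.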

[0011] In one possible implementation, the AI model training device increases the batch size of the AI model only when the absolute value of the first average slope is less than the absolute value of the second average slope and, in addition, the deviation between the two absolute values satisfies an upward adjustment condition (e.g., the difference between these two absolute values is greater than an upward adjustment threshold). Likewise, the device decreases the batch size of the AI model only when the absolute value of the first average slope is greater than the absolute value of the second average slope and the deviation between the two absolute values satisfies a downward adjustment condition. Thus, the AI model training device adjusts the batch size of the AI model only when the deviation between the average slopes of the loss values in the two observation windows meets certain conditions. This not only reduces the overhead caused by frequently adjusting the batch size value, but also avoids degrading the training efficiency or training accuracy of the AI model through unreasonable batch size adjustments.
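A minimal sketch of the thresholded decision follows. The text gives a threshold on the difference of absolute values only as an example of the upward adjustment condition; using the same form for the downward condition, and the threshold values of 0.05, are assumptions.

```python
def decide(first_slope, second_slope, up_threshold=0.05, down_threshold=0.05):
    """Return 'increase', 'decrease', or 'keep' for the batch size.

    Both thresholds are hypothetical; the downward condition is assumed to
    mirror the upward example given in the text.
    """
    if first_slope >= 0:
        return "keep"  # this sketch only covers a falling loss
    first_abs, second_abs = abs(first_slope), abs(second_slope)
    if first_abs < second_abs and (second_abs - first_abs) > up_threshold:
        return "increase"
    if first_abs > second_abs and (first_abs - second_abs) > down_threshold:
        return "decrease"
    return "keep"  # deviation too small to justify an adjustment
```

Slopes of -0.5 and -0.51 differ by less than the assumed threshold, so the batch size is left unchanged, which is the behavior that avoids frequent adjustments.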

[0012] In one possible implementation, the data processing system includes multiple accelerators, through which the AI ​​model is trained. The system also includes a central processing unit (CPU), and the AI ​​model comprises a first network layer and a second network layer. In this system, during backpropagation of the AI ​​model, the CPU updates the parameter values ​​in the first network layer based on the gradient data calculated by the multiple accelerators for the first network layer. Simultaneously, while the CPU updates the parameter values ​​in the first network layer, the multiple accelerators perform backpropagation calculations for the second network layer. By parallelizing the CPU's parameter update process and the accelerators' backpropagation process, the training efficiency of the AI ​​model in a single training round can be effectively improved.
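The overlap between the CPU's parameter update and the accelerators' back-propagation can be sketched with a worker thread standing in for the CPU. Everything here is a simulation under stated assumptions: `accelerator_backward` returns made-up gradient values, the optimizer is plain gradient descent with a hypothetical learning rate, and Python threads only illustrate the scheduling, not real hardware parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def accelerator_backward(layer):
    """Hypothetical stand-in for the accelerators' backward pass of one layer."""
    return [0.5 for _ in layer["weights"]]  # illustrative gradient values


def cpu_update(layer, grads, lr=0.5):
    """Hypothetical stand-in for the CPU-side parameter update (plain SGD)."""
    layer["weights"] = [w - lr * g for w, g in zip(layer["weights"], grads)]
    return layer["name"]


def backward_with_overlap(layers_in_backward_order):
    """Hand each layer's update to the CPU worker, then move on immediately."""
    with ThreadPoolExecutor(max_workers=1) as cpu:
        pending = []
        for layer in layers_in_backward_order:
            grads = accelerator_backward(layer)
            # The update runs on the worker while the loop proceeds to the
            # backward pass of the next layer.
            pending.append(cpu.submit(cpu_update, layer, grads))
        return [future.result() for future in pending]
```

In this arrangement the update of the first network layer is submitted before the backward computation of the second network layer begins, so the two can proceed concurrently.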

[0013] In one possible implementation, the first network layer includes a first parameter and a second parameter. After receiving gradient data for the first parameter, the CPU can update the value of the first parameter using this gradient data. While performing this update, the CPU can in parallel receive the gradient data for the second parameter from the accelerator. Receiving gradient data and updating parameter values in parallel on the CPU improves the efficiency of updating parameter values.

[0014] Furthermore, while updating the value of the second parameter using the gradient data for that second parameter, the CPU can send the updated value of the first parameter (to the accelerator). In this way, the CPU can execute the process of updating the parameter value and feeding back the updated parameter value in parallel, further improving the overall efficiency of parameter updates.
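The receive, update, and send-back steps of the two preceding paragraphs form a three-stage pipeline, which can be sketched with queues and threads. This is a simulation only: the transfers to and from the accelerator are replaced by queue operations, the update rule is plain gradient descent, and the learning rate is a hypothetical value.

```python
import queue
import threading


def run_pipeline(gradients, params, lr=0.5):
    """Pipeline the receive, update, and send-back stages per parameter.

    `gradients` maps parameter names to gradient values and `params` maps
    the same names to current values. Returns the (name, value) pairs in
    the order they were sent back.
    """
    received = queue.Queue()
    updated = queue.Queue()
    sent = []

    def receive():
        for name, grad in gradients.items():
            received.put((name, grad))  # stand-in for a transfer from the accelerator
        received.put(None)  # end-of-stream marker

    def update():
        while (item := received.get()) is not None:
            name, grad = item
            params[name] -= lr * grad
            updated.put((name, params[name]))
        updated.put(None)

    def send():
        while (item := updated.get()) is not None:
            sent.append(item)  # stand-in for a transfer back to the accelerator

    threads = [threading.Thread(target=stage) for stage in (receive, update, send)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sent
```

Because each stage runs in its own thread, the gradient of the second parameter can be received while the first parameter is being updated, and the updated first parameter can be sent while the second is being updated.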

[0015] In one possible implementation, multiple accelerators can perform gradient overflow detection on the first network layer before the CPU updates the parameter values ​​in the first network layer. Since gradient data is typically tensor data, and accelerators have tensor computing power, having the accelerators use tensor computing power to perform gradient overflow detection on the gradient data can effectively improve the efficiency of gradient overflow detection compared to using the CPU for gradient overflow detection.
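The logic of a gradient overflow check can be sketched as follows. The text does not specify the detection criterion, so treating any infinite or NaN element as an overflow is an assumption, and this plain-Python loop only shows the check itself; in the described system the accelerators would perform it with tensor operations.

```python
import math


def has_overflow(gradient):
    """Return True if any gradient element is infinite or NaN (assumed criterion)."""
    return any(not math.isfinite(value) for value in gradient)


def gradients_to_update(layer_gradients):
    """Keep only layers whose gradients passed the check (hypothetical policy)."""
    return {name: grad for name, grad in layer_gradients.items()
            if not has_overflow(grad)}
```

Skipping the update for a layer whose gradient overflowed is one possible way to use the result; the text itself only states that detection happens before the CPU updates the parameter values.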

[0016] In one possible implementation, when the CPU updates the parameter values in the first network layer based on the gradient data calculated by multiple accelerators for the first network layer, it can specifically execute scaling operators and optimizer operators in parallel based on a single instruction, multiple data (SIMD) architecture, using the gradient data calculated by the multiple accelerators for the first network layer. These scaling operators and optimizer operators are used to update the parameter values in the first network layer. In this way, the CPU's parallel execution of scaling operators and optimizer operators based on the SIMD architecture can improve the efficiency of updating the parameter values in the network layer.
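The element-wise work performed by a scaling operator followed by an optimizer operator can be sketched as below. The text does not define either operator, so this sketch assumes that scaling means dividing a loss-scaled gradient by the loss scale and that the optimizer is plain gradient descent; the loss scale and learning rate are hypothetical values. Pure Python does not use SIMD instructions, but each element is processed identically and independently, which is the property a SIMD implementation would exploit.

```python
def scale_and_step(weights, scaled_grads, loss_scale=1024.0, lr=0.5):
    """Unscale each gradient element and apply one SGD step to its weight.

    Assumed operators: scaling = multiply by 1 / loss_scale,
    optimizer = w - lr * g.
    """
    inverse_scale = 1.0 / loss_scale
    return [w - lr * (g * inverse_scale) for w, g in zip(weights, scaled_grads)]
```

For instance, weights 1.0 and 2.0 with scaled gradients 1024.0 and 2048.0 become 0.5 and 1.0 under these assumed settings.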

[0017] In one possible implementation, the data processing system includes multiple CPUs, the reduction results are stored in memory, and the parameter values ​​in the first network layer are updated by the CPU with the highest memory access efficiency among the multiple CPUs. Thus, utilizing a CPU with higher memory affinity to update the parameter values ​​in the network layer can improve the efficiency of CPU accessing data from memory, thereby accelerating the parameter value update process.

[0018] In one possible implementation, the data processing system further includes a central processing unit (CPU). If the AI ​​model includes a third network layer, the CPU can update the values ​​of the parameters in the third network layer, where the updated values ​​are obtained as values ​​in a first format. Then, the CPU converts the first format values ​​into values ​​in a second format, where the precision of the first format is higher than that of the second format. The CPU then sends the second format values ​​of the parameters in the third network layer to multiple accelerators. In this way, sending low-precision data to the accelerators by the CPU effectively reduces the communication bandwidth consumption between the CPU and the multiple accelerators and reduces the storage space occupied by the data in the accelerators.
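The bandwidth saving from converting to a lower-precision format can be illustrated with Python's `struct` module. The text does not name the two formats; this sketch assumes 32-bit floating point as the first format and IEEE 754 half precision as the second, and the function names are hypothetical.

```python
import struct


def first_format_bytes(values):
    """Pack values as 32-bit floats (assumed first, higher-precision format)."""
    return struct.pack(f"<{len(values)}f", *values)


def second_format_bytes(values):
    """Pack values as 16-bit half-precision floats (assumed second format)."""
    return struct.pack(f"<{len(values)}e", *values)


def unpack_second_format(payload):
    """Decode a half-precision payload, as a receiver might."""
    count = len(payload) // 2
    return list(struct.unpack(f"<{count}e", payload))
```

Under these assumptions the payload sent to the accelerators is half the size of the first-format data, at the cost of reduced precision for values that half precision cannot represent exactly.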

[0019] In a second aspect, this application provides an AI model training device. The device is applied to a data processing system, which includes an accelerator for training an AI model. The device includes: an acquisition module for acquiring the loss value of the AI model during multiple rounds of iterative training; an adjustment module for adjusting the values of the hyperparameters of the AI model, including the batch size, based on the changing trend of the loss value during multiple rounds of iterative training; and a training module for continuing iterative training of the AI model based on the adjusted hyperparameter values.

[0020] In one possible implementation, the adjustment module is used to: calculate the average slope of the loss value of the AI ​​model during multiple rounds of iterative training; and increase the batch size of the AI ​​model when the average slope is greater than zero.

[0021] In one possible implementation, the acquisition module is configured to: acquire the loss value of the AI ​​model during the multi-round iterative training process corresponding to the first observation window, and the loss value of the AI ​​model during the multi-round iterative training process corresponding to the second observation window, wherein the multi-round iterative training process corresponding to the second observation window is executed before the multi-round iterative training process corresponding to the first observation window; the adjustment module is configured to: calculate the first average slope of the loss value of the AI ​​model during the multi-round iterative training process corresponding to the first observation window; when the first average slope is less than zero, calculate the absolute value of the second average slope of the loss value of the AI ​​model during the multi-round iterative training process corresponding to the second observation window; when the absolute value of the first average slope is less than the absolute value of the second average slope, increase the batch size of the AI ​​model; when the absolute value of the first average slope is greater than the absolute value of the second average slope, decrease the batch size of the AI ​​model.

[0022] In one possible implementation, the adjustment module is configured to: increase the batch size of the AI ​​model when the absolute value of the first average slope is less than the absolute value of the second average slope and the deviation between the absolute values ​​of the first average slope and the second average slope meets the upward adjustment condition; and decrease the batch size of the AI ​​model when the absolute value of the first average slope is greater than the absolute value of the second average slope and the deviation between the absolute values ​​of the first average slope and the second average slope meets the downward adjustment condition.

[0023] In one possible implementation, the data processing system includes multiple accelerators, through which the AI ​​model is trained. The data processing system also includes a central processing unit (CPU). The AI ​​model includes a first network layer and a second network layer. The CPU is used to update the parameter values ​​in the first network layer based on the gradient data calculated by the multiple accelerators for the first network layer during the backpropagation process of the AI ​​model. The multiple accelerators are used to perform backpropagation calculations for the second network layer while the CPU is updating the parameter values ​​in the first network layer.

[0024] In one possible implementation, multiple accelerators are also used to perform gradient overflow detection on the first network layer before the CPU updates the parameter values ​​in the first network layer.

[0025] In one possible implementation, the CPU is specifically used to execute scaling operators and optimizer operators in parallel based on a single instruction, multiple data (SIMD) architecture, according to gradient data calculated by multiple accelerators for the first network layer. The scaling operators and optimizer operators are used to update the parameter values in the first network layer.

[0026] In one possible implementation, the data processing system includes multiple CPUs, the reduction result is stored in memory, and the parameter values ​​in the first network layer are updated by the CPU with the highest memory access efficiency among the multiple CPUs.

[0027] In one possible implementation, the data processing system further includes a central processing unit (CPU), and the AI ​​model includes a third network layer; the CPU is also used to update the values ​​of the parameters in the third network layer, wherein the updated values ​​of the parameters in the third network layer are obtained as numerical values ​​in a first format; converting the numerical values ​​in the first format into numerical values ​​in a second format, wherein the data precision of the first format is higher than that of the second format; and sending the second format values ​​of the parameters in the third network layer to multiple accelerators.

[0028] The AI ​​model training device provided in the second aspect corresponds to the AI ​​model training method provided in the first aspect. Therefore, the technical effects of any implementation of the AI ​​model training device provided in the second aspect can be found in the relevant descriptions of the technical effects of the corresponding implementation in the first aspect above, and will not be elaborated further.

[0029] In a third aspect, this application provides a computing device including a processor and a memory. The processor and the memory communicate with each other. The processor executes instructions stored in the memory to cause the computing device to perform the AI model training method described in the first aspect or any implementation thereof. It should be noted that the memory can be integrated into the processor or can be independent of the processor. The computing device may also include a bus. The processor is connected to the memory via the bus. The memory may include readable storage and random access memory.

[0030] In a fourth aspect, this application provides a data processing system, which includes at least one accelerator and an AI model training device, wherein the AI model training device is used to perform the operation steps of the AI model training method described in the first aspect or any implementation thereof, and the at least one accelerator is used to train an AI model.

[0031] In a fifth aspect, this application provides a computer-readable storage medium storing instructions that, when executed on a computing device, cause the computing device to perform the operation steps of the AI model training method described in the first aspect or any implementation thereof.

[0032] In a sixth aspect, this application provides a computer program product containing instructions that, when run on a computing device, cause the computing device to perform the operation steps of the AI model training method described in the first aspect or any implementation thereof.

[0033] The implementations provided in the above aspects of this application can be further combined to provide additional implementations.

Attached Figure Description

[0034] Figure 1 is a schematic diagram of the structure of an exemplary data processing system;

[0035] Figure 2 is a flowchart illustrating an AI model training method provided in this application;

[0036] Figure 3 is a schematic diagram of the process of adjusting the batch size value based on the loss value of the AI ​​model during multiple training rounds;

[0037] Figure 4 illustrates how parallelization reduces the time required for a single round of AI model training;

[0038] Figure 5 is a schematic diagram of the process by which the accelerator 101 and the CPU 201 work together to update parameter values;

[0039] Figure 6 illustrates the benefits of optimizing a single round of training for the LLaMA-13B model;

[0040] Figure 7 is a schematic diagram of the structure of an AI model training device provided in this application;

[0041] Figure 8 is a schematic diagram of the hardware structure of a computing device provided in this application.

Detailed Implementation

[0042] To make the above-mentioned objectives, features, and advantages of this application more apparent and understandable, various non-limiting embodiments of the present application will be described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained based on the embodiments in this application and based on the above content are within the scope of protection of this application.

[0043] Referring to Figure 1, which is a schematic diagram of an exemplary data processing system, the data processing system 10 includes multiple accelerators. Figure 1 illustrates an example including accelerators 101 to 104. The multiple accelerators can communicate with each other via a bus.

[0044] An accelerator is a processor that can accelerate data computation to improve computing performance. For example, it can be a graphics processing unit (GPU), a neural network processing unit (NPU), or a tensor processing unit (TPU), or it can be an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other hardware that can realize parallel data computation. There is no limitation on this.

[0045] As shown in Figure 1, the data processing system 10 may also include one or more central processing units (CPUs). Figure 1 illustrates this with CPU 201 and CPU 202 as an example. Data communication between the CPU and the accelerator, as well as between different CPUs, can be achieved via a bus.

[0046] In practical applications, multiple accelerators in the data processing system 10 can be used for distributed training of the AI ​​model. Different network layers in the AI ​​model can be deployed to different accelerators, allowing each accelerator to train parameters for a portion of the network layers. This avoids limiting the parameter size of the AI ​​model due to the computational power of a single accelerator. The CPUs in the data processing system 10 can also participate in the AI ​​model training process; for example, the process of updating model parameters based on gradient data can be offloaded to CPU 201 or CPU 202.

[0047] During iterative training of AI models, the batch size can typically be dynamically adjusted. Batch size is a hyperparameter used in model training to indicate the number of training samples provided to the AI ​​model in a single training iteration. In practice, batch size can be positive integers such as 1, 2, 4, 8, and 12. Generally, a larger batch size results in more stable gradient updates during training, accelerating the convergence of the AI ​​model (training is considered complete when the model converges to a certain level). A smaller batch size introduces more randomness in each training iteration, preventing the model from getting stuck in local optima. However, this increased randomness can also slow down convergence, leading to a longer overall training time.

[0048] During the iterative training of an AI model, suppose the cosine value is calculated between the gradient g_n of the AI model in the n-th iteration and the gradient g_m in the m-th iteration (where the absolute value of the difference between m and n is greater than 1). When this cosine value is small, it is taken to indicate that the AI model is in an oscillating state (i.e., the AI model has not converged), so the batch size of the AI model is increased to promote convergence. However, since the parameter update of the AI model is performed in a high-dimensional complex space, when gradient g_n and gradient g_m optimize parameter values across different dimensions, the cosine value between gradient g_n and gradient g_m will also be relatively small, even though the AI model may already be approaching convergence (e.g., the loss value of the AI model is decreasing). In this case, a small cosine value between gradient g_n and gradient g_m may lead to the mistaken conclusion that the AI model is in an oscillating state. Increasing the batch size to accelerate convergence can then easily lead to the AI model getting trapped in local optima, resulting in low generalization performance (overfitting the training samples). This significantly reduces the accuracy of the AI model in real-world inference scenarios. Therefore, while the AI model may converge quickly, its inference accuracy is often low.

[0049] Based on this, the data processing system 10 shown in Figure 1 may also include a training device 300, which can configure a suitable value for the batch size of the AI ​​model during the AI ​​model training process.

[0050] In practice, during the iterative training of the AI ​​model across multiple accelerators, the training device 300 can acquire the loss value of the AI ​​model during these multiple training iterations. For example, the training device 300 can record the loss value of the AI ​​model in each iteration. The loss value is a numerical indicator that measures the difference between the AI ​​model's prediction and the actual result at the current stage. It is typically used to calculate the gradient data of the AI ​​model to update the parameters. Then, based on the changing trend of the loss value during the multiple training iterations, the training device 300 adjusts the batch size of the AI ​​model and continues iterative training based on the adjusted batch size.
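The record-then-adjust flow in this paragraph can be sketched as a simple loop. The `train_step` callback, the window length of four iterations, the endpoint-based slope, and the doubling factor are all assumptions introduced for illustration; the text only states that losses are recorded and the batch size is adjusted from their trend.

```python
def train(train_step, iterations, batch_size, window=4, factor=2):
    """Run `iterations` rounds, re-evaluating the batch size once per window.

    `train_step(batch_size)` is a hypothetical callback that performs one
    training round and returns its loss value. Returns the batch size used
    in each round.
    """
    losses = []
    history = []
    for _ in range(iterations):
        losses.append(train_step(batch_size))
        history.append(batch_size)
        if len(losses) == window:
            # Assumed trend measure: average slope across the window.
            slope = (losses[-1] - losses[0]) / (window - 1)
            if slope > 0:
                batch_size *= factor
            losses.clear()
    return history
```

Evaluating the trend over a window of several rounds, rather than a single round, is in line with the point made in the following paragraphs that multiple loss values reduce the influence of randomness in any one value.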

[0051] Since the trend of the AI model's loss value truly reflects its convergence trend during multiple rounds of iterative training, adjusting the AI model's batch size based on this trend ensures that the adjusted batch size matches the model's convergence trend. For example, if the AI model's loss value oscillates, indicating that the model is not converging during training, the training device 300 can increase the batch size to promote convergence. Conversely, if the loss value is declining steeply, indicating excessively rapid convergence, the training device 300 can decrease the batch size to slow convergence and prevent the model from getting trapped in local optima. Thus, training the AI model with a batch size that matches its convergence trend not only accelerates training but also prevents unreasonable batch size settings from affecting inference accuracy, thereby achieving a higher level of inference accuracy.

[0052] In addition, when the training device 300 adjusts the batch size of the AI ​​model based on the loss value of the AI ​​model in multiple training rounds, it can reduce the impact of the randomness of the loss value on setting the batch size of the AI ​​model, thereby helping to improve the rationality of setting the batch size value and enabling the inference accuracy of the trained AI model to reach a higher level.

[0053] For example, the training device 300 described above can be implemented by software or hardware.

[0054] In the first example, when implemented in software, the training device 300 can be, for example, program code running on hardware, such as a process running on CPU 201 or CPU 202.

[0055] In the second example, when implemented in hardware, the training device 300 can be implemented using CPU 201 or CPU 202 in the data processing system 10, or it can be implemented using a separately configured processor in the data processing system 10. This processor can be, for example, any one or any combination of the following: a CPU, NPU, ASIC, programmable logic device (PLD), complex programmable logic device (CPLD), FPGA, generic array logic (GAL), system-on-chip (SoC), software-defined infrastructure (SDI) chip, artificial intelligence (AI) chip, or data processing unit (DPU). Furthermore, the training device 300 can include one or more processors, of one or more types. The specific number and types of processors can be set according to the business requirements of the actual application; this embodiment does not limit this. It should be understood that the training device 300 can also be implemented by one or more of the accelerators 101 to 104 in the data processing system 10, and this embodiment is not limited in this respect.

[0056] It is worth noting that the data processing system 10 shown in Figure 1 is merely an illustrative example and is not intended to be limiting. For example, in other possible data processing systems, the number of accelerators used to train the AI ​​model could be one, or a larger number of accelerators (and CPUs) could be used to train the same AI model. When multiple accelerators are used to train the same AI model, some accelerators and some CPUs can be deployed on the same computing device, while other accelerators and other CPUs can be deployed on another computing device. Data communication between different computing devices can be achieved through a bus or switch.

[0057] For ease of understanding, embodiments of the AI ​​model training method provided in this application are described below with reference to the accompanying drawings.

[0058] Referring to Figure 2, which is a flowchart illustrating an AI model training method provided in an embodiment of this application, this method can be applied to the data processing system 10 shown in Figure 1, or to other applicable data processing systems. For ease of explanation, this embodiment uses the training device 300 applied in the data processing system 10 shown in Figure 1 as an example for illustrative purposes.

[0059] The AI ​​model training method shown in Figure 2 can specifically include:

[0060] S201: During the iterative training of the AI ​​model, the training device 300 obtains the loss value of the AI ​​model during multiple rounds of iterative training.

[0061] For example, the AI ​​model can be a large language model (LLM), a large language model meta AI (LLaMA) model, a bidirectional encoder representations from transformers (BERT) model, or a generative pre-trained Transformer 3 (GPT-3) model, or other types of models, such as AI models that support multimodal input data, etc., and there is no limitation on this.

[0062] In this embodiment, one or more accelerators and one or more CPUs in the data processing system 10 can be used to iteratively train the AI ​​model. For ease of understanding, the following explanation uses the training of the AI ​​model using accelerator 101, accelerator 102, and CPU 201 as an example. In each round of training, the AI ​​model can perform a forward computation process based on the input training samples to obtain the prediction result of the AI ​​model. Then, CPU 201 (or accelerator 101 / accelerator 102) can calculate the difference between the prediction result and the true result in the training samples. This difference can be measured by a loss value, which can be used to calculate the gradient data for updating the model parameters.

[0063] During each round of training, the training device 300 can record the loss value of the AI ​​model, so that the training device 300 can obtain multiple loss values ​​after the AI ​​model has completed multiple rounds of iterative training.

[0064] As one implementation, the training device 300 can be configured with an observation window, the length of which can be the number of iterations of the AI ​​model. This allows the training device 300 to record the loss value of the AI ​​model within each observation window. The length of the observation window can be pre-configured by a technician; alternatively, the data processing system 10 can provide a client application, allowing users (the owners of the AI ​​model) to configure the length of the observation window.
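
The observation window described above can be sketched as a fixed-length buffer of recent loss values. This is a minimal illustration only: the class name `LossWindow` and its methods are hypothetical, and the sketch assumes a sliding window that always holds the most recent loss values, which is one possible reading of the embodiment.

```python
from collections import deque


class LossWindow:
    """Hypothetical observation window holding the most recent loss values."""

    def __init__(self, length):
        if length < 1:
            raise ValueError("window length must be at least 1")
        # A deque with maxlen drops the oldest value once the window is full.
        self._losses = deque(maxlen=length)

    def record(self, loss):
        """Record the loss value of one training iteration."""
        self._losses.append(float(loss))

    def is_full(self):
        return len(self._losses) == self._losses.maxlen

    def values(self):
        """Return the recorded loss values, oldest first."""
        return list(self._losses)


window = LossWindow(length=3)
for loss in [2.5, 2.0, 1.8, 1.7]:
    window.record(loss)
```

After the fourth call to `record`, the window holds only the three most recent loss values.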

[0065] S202: The training device 300 adjusts the values ​​of the hyperparameters of the AI ​​model according to the changing trend of the loss value during multiple rounds of iterative training. The hyperparameters adjusted include batch size.

[0066] In this embodiment, whether the AI ​​model converges during iterative training can be determined based on the changes in the AI ​​model's loss value. In practical applications, when the AI ​​model's loss value gradually stabilizes and reaches a relatively small value, it can be determined that the AI ​​model is in a convergent state. Therefore, the training device 300 can determine the convergence trend of the AI ​​model based on the changing trend of the loss value during multiple rounds of iterative training, so as to determine a suitable batch size value for the AI ​​model.

[0067] As an example, the trend of the loss value can be represented by the slope of the loss value. With the magnitude of the loss value as the vertical axis and the number of iterations as the horizontal axis, the training device 300 can calculate the average slope of the multiple loss values recorded during the multiple iterations of training in the observation window. The training device 300 can then determine whether the average slope is greater than or equal to 0. An average slope greater than or equal to 0 indicates that the loss value of the AI model is oscillating or gradually increasing and that the AI model has not converged. In this case, the training device 300 can increase the batch size of the AI model. For example, if the current batch size is 1, the training device 300 can adjust the batch size to 2.
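
The slope check in the preceding paragraph can be sketched as follows. The embodiment does not fix how the average slope is computed; this sketch assumes it is the mean of the slopes between consecutive loss values, with iterations spaced one unit apart on the horizontal axis. The function names are hypothetical.

```python
def average_slope(losses):
    """Mean slope between consecutive loss values (assumed definition)."""
    if len(losses) < 2:
        raise ValueError("at least two loss values are required")
    steps = [later - earlier for earlier, later in zip(losses, losses[1:])]
    return sum(steps) / len(steps)


def should_increase_batch_size(losses):
    """True when the loss is oscillating or rising (average slope >= 0)."""
    return average_slope(losses) >= 0


falling = [4.0, 3.0, 2.0]          # steadily decreasing loss
oscillating = [1.0, 1.5, 1.2, 1.6]  # loss that is not converging
```

Under this definition, `falling` has an average slope of -1.0, so the batch size is not increased on this basis, whereas `oscillating` has a positive average slope and triggers an increase.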

[0068] In practical applications, the training device 300 can also be pre-configured with a mapping relationship between the average slope and the increase in the batch size value, where the average slope in this mapping relationship is greater than 0. Thus, after calculating the average slope of the loss value of the AI ​​model during multiple iterations of training, the training device 300 can determine the increase in the batch size value mapped by this average slope by looking up this mapping relationship. The training device 300 can then calculate the sum of this increase and the current batch size value of the AI ​​model; this sum is the increased batch size value.
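
One way to picture such a mapping relationship is a small lookup table keyed by slope ranges. The ranges and increase values below are purely hypothetical placeholders chosen for illustration; the embodiment does not specify them.

```python
import bisect

# Hypothetical mapping: each upper bound of an average-slope range
# (all greater than 0) maps to an increase in the batch size value.
SLOPE_UPPER_BOUNDS = [0.01, 0.1, 1.0]
BATCH_SIZE_INCREASES = [1, 2, 4, 8]  # last entry covers slopes above 1.0


def increased_batch_size(current_batch_size, avg_slope):
    """Look up the increase mapped by avg_slope and add it to the current value."""
    if avg_slope <= 0:
        raise ValueError("this mapping only covers average slopes greater than 0")
    index = bisect.bisect_left(SLOPE_UPPER_BOUNDS, avg_slope)
    return current_batch_size + BATCH_SIZE_INCREASES[index]
```

For example, with a current batch size of 8, an average slope of 0.05 falls in the second range and yields a batch size of 10, while an average slope of 5.0 falls in the last range and yields 16.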

[0069] Furthermore, when the average slope is less than 0, it indicates that the loss value of the AI model is gradually decreasing, meaning the AI model is converging. In this case, the training device 300 may leave the batch size of the AI model unchanged, or it may still adjust the batch size (by increasing or decreasing it). The following describes, by way of example, how the training device 300 adjusts the batch size of the AI model when the average slope is less than 0.

[0070] In one possible implementation, the training device 300 can calculate a first average slope of the loss values ​​of the AI ​​model during the multi-round iterative training process corresponding to the first observation window, such as by calculating the first average slope as described above. Furthermore, when the first average slope is less than 0, the training device 300 can determine a second average slope of the loss values ​​of the AI ​​model during the multi-round iterative training process corresponding to the second observation window, where the second observation window is sequentially preceding the first observation window; that is, the multi-round iterative training process corresponding to the second observation window is executed before the multi-round iterative training process corresponding to the first observation window. Specifically, the training device 300 can, as described above, first obtain multiple loss values ​​of the AI ​​model during the multi-round iterative training process corresponding to the second observation window, and calculate the average slope of these multiple loss values ​​to obtain the second average slope.

[0071] Typically, since the first average slope is negative (indicating the AI ​​model is approaching convergence), the training device 300 can compare the absolute value of the first average slope with the absolute value of the second average slope. When the absolute value of the first average slope is less than or equal to the absolute value of the second average slope, it indicates the AI ​​model is approaching convergence, but the convergence speed is slowing down. In this case, the training device 300 can increase the batch size to further improve the convergence speed of the AI ​​model, thereby accelerating its training. Conversely, when the absolute value of the first average slope is greater than the absolute value of the second average slope, it indicates the AI ​​model is approaching convergence, but the convergence speed is accelerating. In this case, the training device 300 can decrease the batch size to reduce the convergence speed of the AI ​​model, preventing it from getting trapped in local optima due to excessively fast convergence.

[0072] In practical applications, when the first average slope is negative and the absolute value of the first average slope is greater than the absolute value of the second average slope, the rate at which the loss value of the AI model decreases may have increased only slightly or by a large amount. Therefore, in a further possible implementation, after determining that the absolute value of the first average slope is greater than the absolute value of the second average slope, the training device 300 can further determine whether the deviation between the absolute values of the first and second average slopes meets the down-adjustment condition. If the deviation meets the down-adjustment condition, the training device 300 can determine to reduce the batch size of the AI model. If the deviation does not meet the down-adjustment condition, the training device 300 may not update the batch size of the AI model.

[0073] For example, when determining whether the deviation between the absolute value of the first average slope and the absolute value of the second average slope meets the down-adjustment condition, the training device 300 may first calculate the difference between the absolute values ​​of the first and second average slopes, and then calculate the ratio between the absolute value of the difference and the absolute value of the second average slope. When the ratio is greater than a first threshold (i.e., the deviation meets the down-adjustment condition), the training device 300 may determine to reduce the batch size of the AI ​​model to avoid the AI ​​model converging too quickly. When the ratio is less than or equal to the first threshold (i.e., the deviation does not meet the down-adjustment condition), the training device 300 may determine not to update the batch size of the AI ​​model (i.e., not to reduce the batch size).

[0074] Furthermore, when the first average slope is negative and the absolute value of the first average slope is less than or equal to the absolute value of the second average slope, the convergence speed of the AI model may be slowing down rapidly (i.e., the absolute value of the average slope of the AI model's loss value decreases rapidly) or slowing down only gradually. Therefore, in a further possible implementation, after determining that the absolute value of the first average slope is less than or equal to the absolute value of the second average slope, the training device 300 may further determine whether the deviation between the absolute values of the first and second average slopes satisfies the adjustment condition. If the deviation satisfies the adjustment condition, the training device 300 may determine to increase the batch size of the AI model. If the deviation does not satisfy the adjustment condition, the training device 300 may not update the batch size of the AI model.

[0075] For example, when determining whether the deviation between the absolute value of the first average slope and the absolute value of the second average slope meets the adjustment condition, the training device 300 may first calculate the difference between the absolute values of the first and second average slopes, and then calculate the ratio between the absolute value of the difference and the absolute value of the second average slope. When the ratio is greater than a second threshold (i.e., the deviation meets the adjustment condition), the training device 300 may determine to increase the batch size of the AI model. When the ratio is less than or equal to the second threshold (i.e., the deviation does not meet the adjustment condition), the training device 300 may determine not to update the batch size of the AI model (i.e., not to increase the batch size).

[0076] To facilitate understanding, examples are provided below for each of the above implementation methods.

[0077] As shown in Figure 3, during each iteration of AI model training, the training device 300 can record the loss value of the AI ​​model. Assuming the current training iteration is the Kth round, where K is a positive integer, after the Kth round of training, the training device 300 can calculate the first average slope of the loss value of the AI ​​model in the most recent N rounds of training (i.e., the training process from round K-N+1 to round K), and determine whether this first average slope is greater than or equal to 0.

[0078] If the first average slope is greater than or equal to 0, the training device 300 increases the batch size of the AI ​​model, as shown in Figure 3.

[0079] If the first average slope is less than 0, the training device 300 can calculate the second average slope of the loss value of the AI model in the previous N rounds of training (i.e., the training process from round K-2N+1 to round K-N). If the absolute value of the first average slope is greater than the absolute value of the second average slope, then when the increase ratio of the absolute value of the first average slope relative to the absolute value of the second average slope exceeds a first threshold (e.g., 0.5), the training device 300 reduces the batch size of the AI model (to slow down the convergence speed of the AI model), and when the increase ratio does not exceed the first threshold, the batch size of the AI model remains unchanged, as shown in Figure 3. If the absolute value of the first average slope is less than or equal to the absolute value of the second average slope, then when the decrease ratio of the absolute value of the first average slope relative to the absolute value of the second average slope exceeds a second threshold (e.g., 0.8), the training device 300 increases the batch size of the AI model (to improve the convergence speed of the AI model), and when the decrease ratio does not exceed the second threshold, the batch size of the AI model remains unchanged, as shown in Figure 3. In particular, when N is greater than or equal to 3, adjusting the batch size of the AI model based on three or more loss values can reduce the impact of the randomness of the loss values on the batch size adjustment by the training device 300, further improving the rationality of the batch size adjustment.
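
The decision flow described with reference to Figure 3 can be sketched as a single function. This is an illustrative sketch under stated assumptions, not the claimed implementation: the average slope is assumed to be the mean of consecutive differences, the names `Action`, `average_slope`, and `decide_adjustment` are hypothetical, the default thresholds reuse the example values 0.5 and 0.8, and the handling of a second average slope of exactly 0 (where the ratio is undefined) is an assumption.

```python
from enum import Enum


class Action(Enum):
    INCREASE = "increase"
    DECREASE = "decrease"
    KEEP = "keep"


def average_slope(losses):
    """Mean slope between consecutive loss values (assumed definition)."""
    steps = [later - earlier for earlier, later in zip(losses, losses[1:])]
    return sum(steps) / len(steps)


def decide_adjustment(losses, n, first_threshold=0.5, second_threshold=0.8):
    """Decide the batch size action after round K from the last 2N loss values."""
    if n < 2 or len(losses) < 2 * n:
        raise ValueError("need N >= 2 and at least 2N recorded loss values")
    first = average_slope(losses[-n:])  # rounds K-N+1 .. K
    if first >= 0:
        return Action.INCREASE          # loss oscillating or rising
    second = average_slope(losses[-2 * n:-n])  # rounds K-2N+1 .. K-N
    if second == 0:
        return Action.KEEP              # ratio undefined; assumed behavior
    ratio = abs(abs(first) - abs(second)) / abs(second)
    if abs(first) > abs(second):
        # Convergence is accelerating: reduce only if the change is large.
        return Action.DECREASE if ratio > first_threshold else Action.KEEP
    # Convergence is slowing: increase only if the change is large.
    return Action.INCREASE if ratio > second_threshold else Action.KEEP
```

For instance, with N = 3 and losses `[10, 9.9, 9.8, 9.0, 8.0, 7.0]`, the slope steepens from about -0.1 to -1.0, so the function returns `Action.DECREASE`; with `[10, 8, 6, 5.9, 5.8, 5.7]`, the slope flattens from -2.0 to about -0.1, so it returns `Action.INCREASE`.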

[0080] S203: The training device 300 continues to iteratively train the AI ​​model based on the adjusted batchsize value.

[0081] In this embodiment, after adjusting the batch size value of the AI ​​model, the training device 300 can continue to perform iterative training of the AI ​​model based on the adjusted batch size value until the AI ​​model meets the termination conditions of iterative training, such as the AI ​​model converging or the number of iterations reaching a preset number. Since adjusting the batch size value according to the changing trend of the AI ​​model's loss value allows for faster convergence when the AI ​​model is not yet converging, and slower convergence when the AI ​​model is converging quickly, continuing to train the AI ​​model based on the adjusted batch size value not only improves the overall training efficiency of the AI ​​model but also enables the trained AI model to achieve a higher level of accuracy.

[0082] In practical applications, as the AI ​​model iterates through training, it gradually converges, which may cause the training device 300 to frequently increase the batch size value. Therefore, in this embodiment, the training device 300 can be configured with an upper limit for the batch size value. Furthermore, during the adjustment of the batch size value, if the batch size has already reached this upper limit, the training device 300 does not need to continue increasing the batch size value. Instead, it can complete the subsequent iterative training process based on the current batch size value until the termination condition of the iterative training is met.
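
The upper-limit behavior can be sketched as follows. The function name is hypothetical, and capping an increase that would overshoot the limit at the limit itself is an assumption; the embodiment only states that the batch size is no longer increased once the limit has been reached.

```python
def next_batch_size(current, increase, upper_limit):
    """Apply an increase to the batch size without exceeding the upper limit."""
    if current >= upper_limit:
        # Limit already reached: keep training with the current batch size.
        return current
    # Assumed behavior: an overshooting increase is capped at the limit.
    return min(current + increase, upper_limit)
```

With an upper limit of 16, a batch size of 8 increased by 4 becomes 12, a batch size of 14 increased by 4 is capped at 16, and a batch size of 16 stays at 16.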

[0083] The upper limit of the batch size value can be configured in advance by technical personnel or by users. For example, before training the AI ​​model, users can configure the upper limit of the batch size value through the client provided by the data processing system 10.

[0084] It should be noted that the above explanation uses adjusting the batch size during model training as an example. In practical applications, the training device 300 can also adjust other hyperparameters in the AI ​​model simultaneously, such as adjusting the batch size and learning rate. For example, when the AI ​​model tends to converge, the training device 300 can also reduce the learning rate. In this embodiment, the hyperparameters whose values ​​are adjusted are not limited.

[0085] In addition, in this embodiment, not only can the training process of the AI ​​model be accelerated by dynamically adjusting the batch size value of the AI ​​model, but the training efficiency of the AI ​​model can also be further improved by having the CPU 201, accelerator 101 and accelerator 102 perform computational operations in parallel during each round of AI model training.

[0086] The following describes how CPU 201, accelerator 101, and accelerator 102 perform a training cycle for the AI ​​model.

[0087] In this embodiment, it is assumed that the AI ​​model includes multiple network layers. Taking network layers 1 to 2 as an example, different components of network layer 1 can be deployed in accelerators 101 and 102. For instance, network layer 1 may include 1024 parameters, of which 512 parameters can be deployed in accelerator 101 and the remaining 512 parameters can be deployed in accelerator 102. Furthermore, accelerators 101 and 102 can train the AI ​​model in a data-parallel manner, that is, accelerators 101 and 102 can use different parts of the same dataset to train the AI ​​model in parallel.

[0088] As shown in Figure 4, during the forward computation phase, accelerators 101 and 102 perform forward computation sequentially in network layers 1 to 2 using their respective input data to obtain corresponding prediction results. Then, accelerators 101 and 102 calculate their local loss values ​​based on their respective prediction results and the corresponding real results. Furthermore, accelerator 101 (or accelerator 102 / CPU 201 / training device 300) can aggregate the loss values ​​calculated by multiple accelerators to obtain a global loss value. Thus, accelerators 101 and 102 can perform the backpropagation phase based on this global loss value.
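
The embodiment does not specify how the local loss values are aggregated into the global loss value. As one possible illustration, the sketch below assumes a sample-weighted mean, which is a common choice in data-parallel training; the function name and the `sample_counts` parameter are hypothetical.

```python
def aggregate_global_loss(local_losses, sample_counts):
    """Sample-weighted mean of per-accelerator loss values (assumed aggregation)."""
    if len(local_losses) != len(sample_counts) or not local_losses:
        raise ValueError("one positive sample count is required per local loss")
    total_samples = sum(sample_counts)
    weighted = sum(loss * count for loss, count in zip(local_losses, sample_counts))
    return weighted / total_samples
```

For example, local losses of 2.0 and 4.0 computed on 1 and 3 samples respectively give a global loss of 3.5.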

[0089] During the backpropagation phase, accelerators 101 and 102 can sequentially calculate the gradient data used to update the parameter values ​​in network layers 2 to 1.

[0090] In a specific implementation, accelerator 101 can first calculate gradient data 2_1 for the parameters in network layer 2 based on the global loss value. This gradient data 2_1 is used to update the parameter values in network layer 2 on accelerator 101. The gradient data 2_1 calculated by accelerator 101 can be stored in the memory of accelerator 101, and this memory can also store the parameter values of network layer 2 deployed on accelerator 101, as shown in Figure 5. For example, the memory of accelerator 101 can be high bandwidth memory (HBM). Simultaneously, accelerator 102 also calculates gradient data 2_2 for the parameters in network layer 2 based on the global loss value, and the gradient data 2_2 and the parameter values of network layer 2 deployed on accelerator 102 are stored in the memory of accelerator 102. In practical applications, because the memory resources of accelerators are limited, the gradient data and parameter values stored in the accelerator can be data in a first format, which refers to a format with lower data precision, such as the FP16 (16-bit floating-point) format, in order to reduce the memory resources occupied by the gradient data and parameter values.

[0091] After calculating the gradient data 2_1 for network layer 2, accelerator 101 can perform gradient overflow detection on network layer 2, that is, detect whether the value in the calculated gradient data 2_1 exceeds the upper limit of the value that the first format can represent, as shown in Figure 4. Since gradient data 2_1 is usually tensor data, when accelerator 101 has tensor computing power (e.g., accelerator 101 includes computing units for tensor calculations), accelerator 101 can use tensor computing power to perform gradient overflow detection on gradient data 2_1. This effectively improves the efficiency of gradient overflow detection compared to using CPU 201. Of course, in other embodiments, accelerator 101 can also send the calculated gradient data 2_1 to CPU 201 for gradient overflow detection; this is not limited. Similarly, after calculating the gradient data 2_2 for network layer 2, accelerator 102 can perform gradient overflow detection for network layer 2, or accelerator 102 can send the gradient data 2_2 to CPU 201 for gradient overflow detection, etc.
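
The overflow check itself can be illustrated with plain Python, taking FP16 as the first format. The largest finite value representable in IEEE 754 half precision is 65504; the function name is hypothetical, and treating non-finite values (infinity or NaN) as overflow is an assumption. An accelerator would perform this check with tensor operations rather than an element-by-element loop.

```python
import math

# Largest finite value representable in IEEE 754 half precision (FP16).
FP16_MAX = 65504.0


def has_gradient_overflow(gradients):
    """True if any gradient value cannot be represented as a finite FP16 value."""
    return any(
        not math.isfinite(value) or abs(value) > FP16_MAX
        for value in gradients
    )
```

For example, `[1.0, -65504.0]` passes the check, while `[70000.0]` or a list containing infinity is reported as overflow.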

[0092] After gradient overflow detection, accelerators 101 and 102 can execute a reduction operator to exchange their calculated gradient data, as shown in Figure 4. In this way, accelerator 101 can aggregate the gradient data for the complete parameters in network layer 2, hereinafter referred to as gradient data 2. Then, accelerator 101 (or accelerator 102) can convert the reduced gradient data 2 from the first format to a second format with higher data precision, thereby helping to improve the accuracy of updating parameter values based on the gradient data. For example, the second format could be the FP32 (32-bit floating-point) format. Next, accelerator 101 can send the gradient data 2 in the second format to CPU 201, as shown in Figures 4 and 5, and the gradient data 2 can be stored in the memory of CPU 201. The memory of CPU 201 could be, for example, double data rate synchronous dynamic random access memory (DDR SDRAM). Of course, in other embodiments, accelerators 101 and 102 may also send their calculated gradient data directly to CPU 201 without executing the reduction operator, so that CPU 201 can also obtain gradient data 2 for the complete parameters in network layer 2. This is not limited.

[0093] Then, CPU 201 can calculate the updated values ​​of each parameter in network layer 2 based on gradient data 2. Specifically, as shown in Figure 5, CPU 201 can read gradient data 2, momentum data, variance data, and the current values ​​of the parameters in network layer 2 from memory, and execute scaling and optimizer operators based on the read data. When executing the scaling operator, CPU 201 can numerically scale the gradient data 2 to scale each value in the gradient data 2. When executing the optimizer operator, CPU 201 can choose an appropriate direction to update the parameter values ​​and can also choose an appropriate learning rate to determine the parameter update magnitude. In practical applications, CPU 201 can first execute the scaling operator on all gradient data 2, and then execute the optimizer operator based on the scaled gradient data 2 to determine the updated values ​​of each parameter in network layer 2. Alternatively, CPU 201 can execute the scaling and optimizer operators in parallel based on a single instruction multiple data (SIMD) architecture. The SIMD architecture allows a single instruction to process multiple data points simultaneously. Therefore, CPU 201 executes scaling and optimizer operators in parallel based on the SIMD architecture, improving the efficiency of updating the values ​​of various parameters in network layer 2. After calculating the updated values ​​of the parameters in network layer 2, CPU 201 can write the updated values ​​back to memory, as shown in Figure 5.
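
The scaling operator followed by the optimizer operator can be sketched as below. The embodiment mentions momentum data and variance data but does not name the optimizer; this sketch assumes an Adam-style update, and all function names, parameter names, and default hyperparameter values are hypothetical.

```python
import math


def scale_gradients(gradients, factor):
    """Scaling operator: multiply every gradient value by the same factor."""
    return [g * factor for g in gradients]


def optimizer_step(params, gradients, momentum, variance, step,
                   lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-style optimizer operator (assumed); returns updated state."""
    new_params, new_momentum, new_variance = [], [], []
    for p, g, m, v in zip(params, gradients, momentum, variance):
        m = beta1 * m + (1 - beta1) * g      # update momentum data
        v = beta2 * v + (1 - beta2) * g * g  # update variance data
        m_hat = m / (1 - beta1 ** step)      # bias correction
        v_hat = v / (1 - beta2 ** step)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_momentum.append(m)
        new_variance.append(v)
    return new_params, new_momentum, new_variance


# Gradients that were scaled up by 1024 during training are scaled back first.
grads = scale_gradients([2048.0], factor=1 / 1024)
params, momentum, variance = optimizer_step([1.0], grads, [0.0], [0.0], step=1)
```

In this example the scaled gradient is 2.0, and the first update moves the parameter from 1.0 to approximately 0.999, that is, by about one learning-rate step.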

[0094] It should be noted that this embodiment uses the CPU 201 updating the parameter values ​​in network layer 2 as an example. In actual application scenarios, the data processing system 10 may include multiple CPUs. In this case, based on the hardware affinity strategy, according to the storage location of the gradient data 2 (and other data) in memory, a CPU with higher memory affinity can be selected from among the multiple CPUs to perform the update process (this CPU is the available CPU with higher performance in accessing memory data among the multiple CPUs). For example, when the data processing system 10 includes multiple CPUs, and these multiple CPUs form a non-uniform memory access (NUMA) architecture, then the CPU in the NUMA architecture that has a higher affinity with the data in memory can perform the update of parameter values. Alternatively, accelerators 101 and 102 can write the gradient data to the memory with the highest affinity to CPU 201 (such as writing it to the memory bound to CPU 201), thereby improving the efficiency of CPU 201 accessing data from memory and speeding up the parameter value update process.

[0095] Next, CPU 201 can send the updated parameter values ​​of network layer 2 stored in memory to accelerators 101 and 102, so that accelerators 101 and 102 can perform the next round of training for the AI ​​model based on the updated parameter values. During this process, CPU 201 can send parameter values ​​in a second format to accelerators 101 and 102; correspondingly, accelerators 101 and 102 can convert the received parameter values ​​from the second format to the first format and save the updated first format parameter values ​​in their memory. Alternatively, before sending the updated parameter values, CPU 201 can first convert the second format parameter values ​​to the first format parameter values, and then send the first format parameter values ​​to accelerators 101 and 102, whereby accelerators 101 and 102 save the received parameter values ​​in their memory, as shown in Figure 5. In this way, the CPU 201 sends low-precision format data to the accelerator 101 and the accelerator 102, which can effectively reduce the communication bandwidth consumption between the CPU 201 and multiple accelerators and reduce the storage space occupied by the data in the accelerator 101 and the accelerator 102.
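
The effect of converting from the second format to the first format before sending can be illustrated with Python's standard `struct` module, whose `e` and `f` format codes encode IEEE 754 half-precision and single-precision values. The sketch takes FP16 and FP32 as the two formats, and the function names are hypothetical.

```python
import struct


def to_fp16_bytes(values):
    """Encode values in the first (FP16) format, 2 bytes per value."""
    return struct.pack(f"<{len(values)}e", *values)


def to_fp32_bytes(values):
    """Encode values in the second (FP32) format, 4 bytes per value."""
    return struct.pack(f"<{len(values)}f", *values)


def from_fp16_bytes(payload):
    """Decode an FP16 payload back into Python floats."""
    return list(struct.unpack(f"<{len(payload) // 2}e", payload))


updated_values = [1.0, 0.5, 0.1]
fp16_payload = to_fp16_bytes(updated_values)
fp32_payload = to_fp32_bytes(updated_values)
received = from_fp16_bytes(fp16_payload)
```

The FP16 payload is half the size of the FP32 payload, which reflects the reduction in communication bandwidth and accelerator storage described above. The cost is precision: values such as 1.0 and 0.5 survive the conversion exactly, while 0.1 is only approximated.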

[0096] As shown in Figure 4, CPU 201 can execute the processes of receiving gradient data, updating the parameter values in network layer 2 based on the received gradient data, and sending the updated parameter values in parallel. Take as an example network layer 2 including parameters A and B. After receiving the gradient data for parameter A, CPU 201 can begin to calculate the updated value a of parameter A using this gradient data. Simultaneously, CPU 201 can execute the process of receiving the gradient data for parameter B in parallel. Assuming that CPU 201 has already received the gradient data for parameter B by the time it finishes calculating value a, then while sending value a to accelerator 101 / accelerator 102, CPU 201 can simultaneously calculate the updated value of parameter B based on that gradient data. In this way, the efficiency of CPU 201 in updating parameter values can be further improved.
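
The overlap of receiving, updating, and sending can be sketched as a three-stage pipeline connected by queues. This is only a conceptual illustration using Python threads; the function `run_update_pipeline`, the `update_fn` callback, and the update rule in the example are hypothetical, and real transfers between CPU 201 and the accelerators are replaced by in-memory queues.

```python
import queue
import threading


def run_update_pipeline(incoming_gradients, update_fn):
    """Receive, update, and send stages run concurrently, one item at a time."""
    received = queue.Queue()
    updated = queue.Queue()
    sent = []        # stands in for values sent back to the accelerators
    done = object()  # sentinel marking the end of the stream

    def receive():
        for item in incoming_gradients:
            received.put(item)
        received.put(done)

    def update():
        while (item := received.get()) is not done:
            name, gradient = item
            updated.put((name, update_fn(name, gradient)))
        updated.put(done)

    def send():
        while (item := updated.get()) is not done:
            sent.append(item)

    threads = [threading.Thread(target=stage) for stage in (receive, update, send)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sent


current = {"A": 1.0, "B": 2.0}
result = run_update_pipeline(
    [("A", 2.0), ("B", 4.0)],
    lambda name, gradient: current[name] - 0.5 * gradient,
)
```

While the update stage is computing the new value of parameter A, the receive stage can already accept the gradient data for parameter B, and while the send stage is delivering value a, the update stage can work on parameter B.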

[0097] While CPU 201 receives the gradient data sent by accelerator 101, updates the parameter values in network layer 2 based on the gradient data, and sends the updated parameter values back to accelerator 101, accelerator 101 can perform reverse computation on network layer 1 based on the calculated gradient data 2_1 for the parameters in network layer 2 to calculate the gradient data 1_1 for the parameters in network layer 1. Simultaneously, accelerator 102 can also perform reverse computation on network layer 1 based on the gradient data 2_2 for the parameters in network layer 2 to calculate the gradient data 1_2 for the parameters in network layer 1, as shown in Figure 4. Thus, the reverse computation of accelerators 101 and 102 on network layer 1 can be executed in parallel with the process of CPU 201 sending and receiving data and updating parameter values. Therefore, after CPU 201 completes sending the updated parameter values of network layer 2 to accelerators 101 and 102, it can begin the process of updating the parameter values in network layer 1. By parallelizing the parameter value update performed by CPU 201 with the reverse computation performed by the accelerators in this way, the training efficiency of the AI model in a single training round can be effectively improved. As shown in Figure 4, based on the above parallelization, a time T can be saved in a single round of training of the AI model.

[0098] In practical applications, the above implementation was tested on the LLaMA-13B model: one training iteration of the LLaMA-13B model was optimized as described above under different batch size values. Compared with the unoptimized single-round training process of the LLaMA-13B model, significant improvements were achieved in training latency, gradient overflow detection, gradient scaling, parameter value updates, parameter backpropagation, and CPU performance, as shown in Figure 6. Furthermore, during iterative training of the LLaMA-13B model, the training time for each batch of samples was reduced by 19% in actual testing; correspondingly, the training speed of an epoch was also improved by 19% for the same dataset. Here, an epoch refers to one complete pass of model training over all samples in the dataset.

[0099] It should be noted that the above-described training process for the AI ​​model is merely an implementation example and is not intended to be limiting. Other training processes can also be used based on the above. Further explanation follows.

[0100] 1. The above training process is illustrated by using accelerators 101 and 102 to train all network layers in the AI ​​model. In actual application scenarios, accelerators 101 and 102 can train some network layers in the AI ​​model in the same way. The remaining network layers in the AI ​​model can be trained by other accelerators in the data processing system 10. There are no restrictions on this.

[0101] 2. During the training process described above, CPU 201 executes the processes of receiving gradient data, updating parameter values, and sending the updated parameter values ​​to the accelerators in parallel. In other implementations, CPU 201 can also execute each process sequentially. That is, CPU 201 can start updating parameter values ​​only after receiving all gradient data, and after completing the update of all parameter values, send the updated parameter values ​​to accelerators 101 and 102.

[0102] 3. In the training process described above, the process of CPU 201 updating the parameter values ​​in network layer 2 is executed in parallel with the process of the accelerator performing reverse computation on network layer 1. In other implementations, the accelerator can also perform reverse computation on network layer 1 after CPU 201 has completed updating the parameter values ​​in network layer 2.

[0103] 4. The above training process is illustrated by accelerators 101 and 102 performing gradient data reduction processing for each network layer separately. In other implementations, accelerators 101 and 102 can also perform gradient data reduction processing once for multiple network layers. For example, when the AI ​​model includes 10 network layers, during the backpropagation phase, accelerators 101 and 102 can perform a reduction operation once for the gradient data of two network layers at a time. For example, accelerators 101 and 102 can each be configured with a bucket, which is a storage space in the accelerator's memory. Accelerator 101 can cache the calculated gradient data into this bucket, and when the calculated gradient data fills the bucket, accelerator 101 can perform a reduction operation with accelerator 102. The size of the bucket can be designed to accommodate the size of the gradient data corresponding to the parameters of multiple network layers on accelerator 101.
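
The bucket-based reduction described in item 4 can be sketched as follows. The two accelerators are simulated by two lists of per-layer gradients, the bucket capacity is counted in gradient elements, and the reduction is assumed to be an element-wise sum, which the embodiment does not specify. The function name is hypothetical.

```python
def bucketed_reduce(grads_101, grads_102, bucket_capacity):
    """Reduce per-layer gradients of two accelerators one full bucket at a time.

    grads_101 and grads_102 list the gradients of each network layer in the
    order they are produced during backpropagation.
    """
    reduced = []           # reduced gradients, one list per layer
    reduce_calls = 0       # number of reduction operations executed
    bucket = []            # indices of layers cached in the bucket
    cached_elements = 0
    for index, layer_grads in enumerate(grads_101):
        bucket.append(index)
        cached_elements += len(layer_grads)
        is_last_layer = index == len(grads_101) - 1
        if cached_elements >= bucket_capacity or is_last_layer:
            # Bucket is full (or backpropagation finished): reduce its contents.
            for i in bucket:
                reduced.append([a + b for a, b in zip(grads_101[i], grads_102[i])])
            reduce_calls += 1
            bucket, cached_elements = [], 0
    return reduced, reduce_calls


layers_101 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
layers_102 = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
reduced, calls = bucketed_reduce(layers_101, layers_102, bucket_capacity=4)
```

With four layers of two gradient elements each and a bucket capacity of four elements, the gradients of two layers are reduced per operation, so only two reduction operations are executed instead of four.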

[0104] 5. The above training process is illustrated by taking the process of updating parameter values ​​by CPU 201 as an example. In other implementations, the process of updating parameter values ​​based on gradient data can also be performed by accelerator 101 and accelerator 102, without the participation of CPU 201.

[0105] It is worth noting that other reasonable combinations of steps that those skilled in the art can conceive of based on the above description also fall within the scope of protection of this application. In addition, those skilled in the art should be aware that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily essential to this application.

[0106] The AI ​​model training method provided in the embodiments of this application has been described above with reference to Figures 1 to 6. Next, the structure of the AI ​​model training device and computing device provided in the embodiments of this application will be described with reference to the accompanying drawings.

[0107] Referring to Figure 7, a schematic diagram of an AI model training device is shown. This AI model training device can be applied to a data processing system including an accelerator, such as the data processing system 10 shown in Figure 1 above, wherein the accelerator in the data processing system is used to train the AI ​​model. As shown in Figure 7, the AI ​​model training device 700 includes:

[0108] An acquisition module 701, used to acquire the loss values of the AI model during multiple rounds of iterative training;

[0109] An adjustment module 702, used to adjust the values of the hyperparameters of the AI model, including the batch size, based on the changing trend of the loss values during the multiple rounds of iterative training;

[0110] A training module 703, used to continue iteratively training the AI model based on the adjusted hyperparameter values.

[0111] In one possible implementation, the adjustment module 702 is used for:

[0112] Calculate the average slope of the loss value of the AI ​​model during multiple rounds of iterative training;

[0113] When the average slope is greater than zero, increase the batch size of the AI ​​model.
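A minimal Python sketch of this rule follows. It assumes that the average slope is the mean of the differences between consecutive loss values; the growth factor and the upper bound on the batch size are hypothetical parameters that this passage does not specify.

```python
from typing import Sequence


def average_slope(losses: Sequence[float]) -> float:
    """Mean of consecutive loss differences (one possible definition)."""
    if len(losses) < 2:
        raise ValueError("need at least two loss values")
    diffs = [b - a for a, b in zip(losses, losses[1:])]
    return sum(diffs) / len(diffs)


def adjust_batch_size(losses: Sequence[float], batch_size: int,
                      growth: int = 2, max_batch_size: int = 4096) -> int:
    """Increase the batch size when the loss is rising on average."""
    if average_slope(losses) > 0:
        # growth and max_batch_size are illustrative assumptions
        return min(batch_size * growth, max_batch_size)
    return batch_size
```

For example, adjust_batch_size([0.5, 0.6, 0.8], 32) returns 64, whereas a falling loss sequence leaves the batch size unchanged under this rule.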

[0114] In one possible implementation, the acquisition module 701 is used for:

[0115] Obtain the loss value of the AI ​​model during the multi-round iterative training process corresponding to the first observation window, and the loss value of the AI ​​model during the multi-round iterative training process corresponding to the second observation window. The multi-round iterative training process corresponding to the second observation window is executed before the multi-round iterative training process corresponding to the first observation window.

[0116] The adjustment module 702 is used for:

[0117] Calculate the first average slope of the loss value of the AI ​​model during the multi-round iterative training process corresponding to the first observation window;

[0118] When the first average slope is less than zero, calculate the absolute value of the second average slope of the loss value of the AI ​​model during the multi-round iterative training process corresponding to the second observation window;

[0119] When the absolute value of the first average slope is less than the absolute value of the second average slope, increase the batch size of the AI ​​model.

[0120] When the absolute value of the first average slope is greater than the absolute value of the second average slope, reduce the batch size of the AI ​​model.
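The two-window comparison can be sketched in Python as follows. The sketch assumes that the average slope is the mean of consecutive loss differences, and that the batch size is doubled or halved within hypothetical bounds; none of these specifics is fixed by the passage above.

```python
from typing import Sequence


def mean_slope(losses: Sequence[float]) -> float:
    """Mean of consecutive loss differences within one observation window."""
    diffs = [b - a for a, b in zip(losses, losses[1:])]
    return sum(diffs) / len(diffs)


def adjust_by_windows(first_window: Sequence[float],
                      second_window: Sequence[float],
                      batch_size: int, factor: int = 2,
                      min_batch_size: int = 1,
                      max_batch_size: int = 4096) -> int:
    """Compare the latest window (first) with the earlier window (second)."""
    first = mean_slope(first_window)
    if first >= 0:
        return batch_size  # this branch only handles a falling loss
    second_abs = abs(mean_slope(second_window))
    if abs(first) < second_abs:
        # the loss is falling more slowly than before: increase
        return min(batch_size * factor, max_batch_size)
    if abs(first) > second_abs:
        # the loss is falling faster than before: reduce
        return max(batch_size // factor, min_batch_size)
    return batch_size
```

Note that the second observation window is the earlier one, so it is passed as the second argument even though its iterations ran first.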

[0121] In one possible implementation, the adjustment module 702 is used for:

[0122] When the absolute value of the first average slope is less than the absolute value of the second average slope, and the deviation between the absolute values ​​of the first average slope and the second average slope meets the adjustment condition, increase the value of the batch size of the AI ​​model.

[0123] When the absolute value of the first average slope is greater than the absolute value of the second average slope, and the deviation between the absolute values ​​of the first average slope and the second average slope meets the adjustment condition, the batch size of the AI ​​model is reduced.
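The passage above does not define the adjustment condition. The following Python sketch assumes, purely for illustration, that the condition is met when the relative deviation between the two absolute slopes reaches a threshold; the threshold value, the doubling factor, and the function names are hypothetical.

```python
def meets_adjustment_condition(first_abs: float, second_abs: float,
                               threshold: float = 0.2) -> bool:
    """Assumed condition: relative deviation of at least `threshold`."""
    if second_abs == 0:
        return first_abs > 0
    return abs(first_abs - second_abs) / second_abs >= threshold


def adjust_with_condition(first_slope: float, second_slope: float,
                          batch_size: int, threshold: float = 0.2,
                          factor: int = 2) -> int:
    """Adjust only when the slopes differ enough to satisfy the condition."""
    if first_slope >= 0:
        return batch_size  # only the falling-loss branch is sketched here
    first_abs, second_abs = abs(first_slope), abs(second_slope)
    if not meets_adjustment_condition(first_abs, second_abs, threshold):
        return batch_size  # deviation too small: keep the current value
    if first_abs < second_abs:
        return batch_size * factor
    if first_abs > second_abs:
        return max(batch_size // factor, 1)
    return batch_size
```

A condition of this kind keeps the batch size stable when the two slopes differ only slightly, so that small fluctuations in the loss do not trigger repeated adjustments.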

[0124] In one possible implementation, the data processing system includes multiple accelerators, the AI ​​model is trained through the multiple accelerators, the data processing system also includes a central processing unit (CPU), and the AI ​​model includes a first network layer and a second network layer.

[0125] The CPU is used to update the parameter values ​​in the first network layer during the backpropagation process of the AI ​​model, based on the gradient data calculated by multiple accelerators for the first network layer.

[0126] Multiple accelerators are used to perform reverse computation on the second network layer while the CPU updates the parameter values ​​in the first network layer.
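The overlap between the CPU-side parameter update and the accelerator-side reverse computation can be sketched in Python with a worker thread standing in for the CPU. The function names, the plain gradient-descent update, and the backward_layer callback are hypothetical and serve only to show the scheduling pattern.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Params = Dict[str, List[float]]


def apply_update(params: Params, name: str,
                 grad: List[float], lr: float) -> None:
    """CPU-side optimizer step (plain SGD, chosen only for illustration)."""
    params[name] = [p - lr * g for p, g in zip(params[name], grad)]


def backward_with_overlap(params: Params,
                          backward_layer: Callable[[str], List[float]],
                          lr: float) -> Params:
    """Run the reverse computation layer by layer while updates run aside."""
    # Backpropagation visits the last layer first.
    layer_order = list(reversed(list(params)))
    with ThreadPoolExecutor(max_workers=1) as cpu:
        pending = []
        for name in layer_order:
            # "Accelerator" work: reverse computation for this layer.
            grad = backward_layer(name)
            # Hand the update to the "CPU" and move on to the next layer
            # without waiting for the update to finish.
            pending.append(cpu.submit(apply_update, params, name, grad, lr))
        for future in pending:
            future.result()  # wait for all updates before the next iteration
    return params
```

The update of one layer and the reverse computation of the next layer touch different parameters, which is why the two can proceed concurrently in this sketch.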

[0127] In one possible implementation, multiple accelerators are also used to perform gradient overflow detection on the first network layer before the CPU updates the parameter values ​​in the first network layer.
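A minimal Python sketch of gradient overflow detection follows. It assumes that an overflow shows up as an infinite or not-a-number gradient element, and that the update is skipped when one is found; the skip behavior and the function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def has_overflow(grad: Sequence[float]) -> bool:
    """Return True if any gradient element is infinite or NaN."""
    return any(not math.isfinite(g) for g in grad)


def update_if_finite(params: List[float], grad: Sequence[float],
                     lr: float) -> Tuple[List[float], bool]:
    """Apply a plain SGD step only when no overflow was detected."""
    if has_overflow(grad):
        return params, False  # skip the update (assumed policy)
    return [p - lr * g for p, g in zip(params, grad)], True
```

Running the check before the gradient data is handed over means that the parameter values are never modified by an overflowed gradient.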

[0128] In one possible implementation, the CPU is specifically used to execute scaling operators and optimizer operators in parallel, based on a single instruction, multiple data (SIMD) architecture and according to the gradient data calculated by the multiple accelerators for the first network layer. The scaling operators and optimizer operators are used to update the parameter values in the first network layer.
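The per-element arithmetic of such an update can be sketched in Python. The sketch assumes that the scaling operator divides each gradient element by a loss-scale factor and that the optimizer operator is a plain gradient-descent step; both are assumptions, and the code shows only the elementwise computation, not SIMD instructions themselves.

```python
from typing import List, Sequence


def scale_and_step(params: Sequence[float], grads: Sequence[float],
                   loss_scale: float, lr: float) -> List[float]:
    """Unscale each gradient element and apply an SGD step in one pass.

    Every element is processed by the same two operations with no
    dependence on its neighbours, which is the property that lets a
    SIMD unit process several elements per instruction.
    """
    inv_scale = 1.0 / loss_scale
    return [p - lr * (g * inv_scale) for p, g in zip(params, grads)]
```

For example, scale_and_step([1.0, 2.0], [4.0, 8.0], 4.0, 0.5) yields [0.5, 1.0].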

[0129] In one possible implementation, the data processing system includes multiple CPUs, the reduction result is stored in memory, and the parameter values ​​in the first network layer are updated by the CPU with the highest memory access efficiency among the multiple CPUs.

[0130] In one possible implementation, the data processing system also includes a central processing unit (CPU), and the AI ​​model includes a third network layer.

[0131] The CPU is also used to update the values ​​of the parameters in the third network layer, wherein the updated values ​​of the parameters in the third network layer are obtained as values ​​in a first format; the values ​​in the first format are converted into values ​​in a second format, wherein the data precision of the first format is higher than that of the second format; and the values ​​in the second format of the parameters in the third network layer are sent to multiple accelerators.
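The format conversion can be sketched in Python with the standard struct module. The sketch assumes that the first format is the Python float (double precision) and that the second format is IEEE 754 half precision; this passage does not name the formats, so both choices are illustrative.

```python
import struct
from typing import List, Sequence


def to_half_precision(values: Sequence[float]) -> List[float]:
    """Round each value to the nearest half-precision number."""
    packed = struct.pack(f"<{len(values)}e", *values)
    return list(struct.unpack(f"<{len(values)}e", packed))


def encode_for_accelerators(values: Sequence[float]) -> bytes:
    """Serialize values as 2-byte half-precision numbers for sending."""
    return struct.pack(f"<{len(values)}e", *values)
```

The CPU keeps the higher-precision values for subsequent updates, while the accelerators receive the lower-precision copy, which occupies fewer bytes per parameter.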

[0132] Since the AI ​​model training device 700 shown in Figure 7 corresponds to the training device 300 in the embodiment shown in Figure 2 above, the specific implementation method and technical effects of the AI ​​model training device 700 shown in Figure 7 can be found in the relevant descriptions in the embodiment shown in Figure 2 above, and will not be repeated here.

[0133] Figure 8 is a schematic diagram of the hardware structure of a computing device 800 provided in this application. The computing device 800 can, for example, implement the training device 300 in the embodiment shown in Figure 2 above.

[0134] As shown in Figure 8, the computing device 800 includes a processor 801, a memory 802, and a communication interface 803. The processor 801, the memory 802, and the communication interface 803 communicate via a bus 804, or via wireless transmission or other means. The memory 802 stores instructions, and the processor 801 executes the instructions stored in the memory 802. Further, the computing device 800 may also include a memory unit 805, which is connected to the processor 801, the memory 802, and the communication interface 803 via the bus 804. The memory 802 stores program code, and the processor 801 can read the program code stored in the memory 802 into the memory unit 805 and execute the program code in the memory unit 805 to perform the following operations:

[0135] During the iterative training of the AI model, obtain the loss values of the AI model in multiple rounds of iterative training;

[0136] Based on the changing trend of the loss values of the AI model during the multiple rounds of iterative training, adjust the values of the hyperparameters of the AI model, wherein the hyperparameters include the batch size;

[0137] Based on the adjusted hyperparameter values, continue to iteratively train the AI model.
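The three operations above can be sketched as a single Python loop. The run_iteration callback, the fixed observation window, the end-to-end slope estimate, and the doubling rule are hypothetical choices made for illustration.

```python
from typing import Callable, List


def train_with_adaptive_batch(run_iteration: Callable[[int, int], float],
                              total_iterations: int, window: int,
                              batch_size: int,
                              max_batch_size: int = 1024) -> List[int]:
    """Return the batch size used in each iteration.

    run_iteration(step, batch_size) trains for one iteration and returns
    the loss value. window must be at least 2.
    """
    history: List[int] = []
    losses: List[float] = []
    for step in range(total_iterations):
        # Operation 1: obtain the loss value of this iteration.
        losses.append(run_iteration(step, batch_size))
        history.append(batch_size)
        if len(losses) == window:
            # Operation 2: adjust the batch size from the loss trend.
            slope = (losses[-1] - losses[0]) / (window - 1)
            if slope > 0:
                batch_size = min(batch_size * 2, max_batch_size)
            losses.clear()
        # Operation 3: the next iteration uses the adjusted value.
    return history
```

With a loss that rises steadily, a window of 3, and an initial batch size of 8, six iterations use the batch sizes 8, 8, 8, 16, 16, 16.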

[0138] It should be understood that in this embodiment, the processor 801 can be a CPU, but it can also be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete device assemblies, etc. A general-purpose processor can be a microprocessor or any conventional processor.

[0139] The memory 802 may include read-only memory and random access memory, and provides instructions and data to the processor 801. The memory 802 may also include non-volatile random access memory.

[0140] The memory 802 can be volatile memory or non-volatile memory, or it can include both. The non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory can be random access memory (RAM), which is used as an external cache. By way of example, but not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous linked dynamic random access memory (SLDRAM), and direct rambus RAM (DR RAM).

[0141] The communication interface 803 is used to communicate with other devices connected to the computing device 800. The bus 804 may include a data bus, a power bus, a control bus, and a status signal bus, etc. However, for clarity, all buses are labeled as bus 804 in the figure.

[0142] It should be understood that the computing device 800 according to the embodiments of this application can correspond to the training device 300 in the embodiment shown in Figure 2 above. Specifically, it can perform the method executed by the training device 300 in the embodiment shown in Figure 2 of this application. The above and other operations and/or functions implemented by the computing device 800 respectively implement the corresponding flows of the method in the embodiment shown in Figure 2. For the sake of brevity, they are not described in detail here.

[0143] This application also provides a computer-readable storage medium. The computer-readable storage medium can be any available medium that a computing device can access, or a data storage device, such as a data center, that includes one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state drive). The computer-readable storage medium includes instructions that instruct the computing device to execute the aforementioned AI model training method.

[0144] This application also provides a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computing device, all or part of the processes or functions described in this application are generated.

[0145] The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, or data center to another website, computer, or data center via wired (e.g., coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave, etc.) means.

[0146] The computer program product can be a software installation package. When any of the aforementioned AI model training methods is required, the computer program product can be downloaded and executed on a computing device.

[0147] The above embodiments can be implemented, in whole or in part, by software, hardware, firmware, or any combination thereof. When implemented using software, the above embodiments can be implemented, in whole or in part, as a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave, etc.) means. The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or data center, that includes one or more sets of available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. A semiconductor medium can be a solid-state drive.

[0148] The terminology used in the above embodiments is for the purpose of describing specific embodiments only and is not intended to be a limitation of this application. As used in the specification and appended claims of this application, the singular expressions “a,” “an,” “the,” and “this” are intended to also include expressions such as “one or more,” unless the context clearly indicates otherwise. It should also be understood that in the embodiments of this application, “one or more” refers to one, two, or more; the character “/” generally indicates that the preceding and following objects are in an “or” relationship. In the embodiments of this application, “simultaneously” means within the same time period, including situations where they are at the same moment. The terms “first,” “second,” etc., in the specification, claims, and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such terms can be used interchangeably where appropriate, and this is merely a way of distinguishing objects with the same attributes in the embodiments of this application.

[0149] References to "one embodiment" or "some embodiments" as described in this specification mean that one or more embodiments of this application include a specific feature, structure, or characteristic described in connection with that embodiment. Therefore, the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in still other embodiments," etc., appearing in different parts of this specification do not necessarily refer to the same embodiment, but rather mean "one or more, but not all, embodiments," unless otherwise specifically emphasized. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless otherwise specifically emphasized.

[0150] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in this application, and these modifications or substitutions should all be covered within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A method for training an artificial intelligence (AI) model, characterized in that the method is applied to a data processing system, the data processing system includes an accelerator, the accelerator is used to train an AI model, and the method comprises: during the iterative training of the AI model, obtaining the loss values of the AI model in multiple rounds of iterative training; adjusting the values of the hyperparameters of the AI model based on the changing trend of the loss values of the AI model during the multiple rounds of iterative training, wherein the hyperparameters include the batch size; and continuing to iteratively train the AI model based on the adjusted hyperparameter values.

2. The method according to claim 1, characterized in that the step of adjusting the values of the hyperparameters of the AI model based on the changing trend of the loss values of the AI model during the multiple rounds of iterative training comprises: calculating the average slope of the loss values of the AI model during the multiple rounds of iterative training; and when the average slope is greater than zero, increasing the batch size of the AI model.

3. The method according to claim 1, characterized in that, The step of obtaining the loss value of the AI ​​model during multiple rounds of iterative training includes: The loss value of the AI ​​model during the multi-round iterative training process corresponding to the first observation window and the loss value of the AI ​​model during the multi-round iterative training process corresponding to the second observation window are obtained. The multi-round iterative training process corresponding to the second observation window is executed before the multi-round iterative training process corresponding to the first observation window. The step of adjusting the hyperparameter values ​​of the AI ​​model based on the changing trend of the loss value during the multi-round iterative training process includes: Calculate the first average slope of the loss value of the AI ​​model during the multi-round iterative training process corresponding to the first observation window; When the first average slope is less than zero, calculate the absolute value of the second average slope of the loss value of the AI ​​model in the multi-round iterative training process corresponding to the second observation window; When the absolute value of the first average slope is less than the absolute value of the second average slope, the batch size of the AI ​​model is increased. When the absolute value of the first average slope is greater than the absolute value of the second average slope, the batch size of the AI ​​model is reduced.

4. The method according to claim 3, characterized in that, The step of increasing the batch size of the AI ​​model when the absolute value of the first average slope is less than the absolute value of the second average slope includes: When the absolute value of the first average slope is less than the absolute value of the second average slope, and the deviation between the absolute values ​​of the first average slope and the second average slope meets the adjustment condition, the batch size of the AI ​​model is increased. The step of reducing the batch size of the AI ​​model when the absolute value of the first average slope is greater than the absolute value of the second average slope includes: When the absolute value of the first average slope is greater than the absolute value of the second average slope, and the deviation between the absolute values ​​of the first average slope and the second average slope meets the adjustment condition, the batch size of the AI ​​model is reduced.

5. The method according to any one of claims 1 to 4, characterized in that, The data processing system includes multiple accelerators, the AI ​​model is trained using the multiple accelerators, the data processing system also includes a central processing unit (CPU), the AI ​​model includes a first network layer and a second network layer, and the method further includes: During the backpropagation process of the AI ​​model, the CPU updates the parameter values ​​in the first network layer based on the gradient data calculated by the multiple accelerators for the first network layer. During the process of the CPU updating the parameter values ​​in the first network layer, the plurality of accelerators perform reverse computation on the second network layer.

6. The method according to claim 5, characterized in that the method further comprises: before the CPU updates the parameter values in the first network layer, performing, by the plurality of accelerators, gradient overflow detection on the first network layer.

7. The method according to claim 5 or 6, characterized in that, The CPU updates the parameter values ​​in the first network layer based on the gradient data calculated by the multiple accelerators for the first network layer, including: The CPU executes scaling operators and optimizer operators in parallel based on the gradient data calculated by the multiple accelerators for the first network layer, using a single instruction multiple data stream (SIMD) architecture. The scaling operators and optimizer operators are used to update the parameter values ​​in the first network layer.

8. The method according to any one of claims 5 to 7, characterized in that, The data processing system includes multiple CPUs, the reduction result is stored in memory, and the parameter values ​​in the first network layer are updated by the CPU with the highest memory access efficiency among the multiple CPUs.

9. The method according to any one of claims 1 to 8, characterized in that, The data processing system further includes a central processing unit (CPU), the AI ​​model includes a third network layer, and the method further includes: The CPU updates the values ​​of the parameters in the third network layer, wherein the updated values ​​of the parameters in the third network layer are obtained as numerical values ​​in the first format. The numerical value in the first format is converted into a numerical value in the second format, where the data precision of the first format is higher than that of the second format. The values ​​of the parameters in the third network layer in the second format are sent to the plurality of accelerators.

10. An AI model training device, characterized in that, The device is applied to a data processing system, the data processing system including an accelerator, the accelerator being used to train an AI model, and the device comprising: The acquisition module is used to acquire the loss value of the AI ​​model during multiple rounds of iterative training during the iterative training process of the AI ​​model; An adjustment module is used to adjust the values ​​of the hyperparameters of the AI ​​model, including the batch size, based on the changing trend of the loss value of the AI ​​model during the multi-round iterative training process. The training module is used to continue iteratively training the AI ​​model based on the adjusted hyperparameter values.

11. The apparatus according to claim 10, characterized in that, The adjustment module is used for: Calculate the average slope of the loss value of the AI ​​model during the multiple rounds of iterative training; When the average slope is greater than zero, increase the batch size of the AI ​​model.

12. The apparatus according to claim 10, characterized in that, The acquisition module is used for: The loss value of the AI ​​model during the multi-round iterative training process corresponding to the first observation window and the loss value of the AI ​​model during the multi-round iterative training process corresponding to the second observation window are obtained. The multi-round iterative training process corresponding to the second observation window is executed before the multi-round iterative training process corresponding to the first observation window. The adjustment module is used for: Calculate the first average slope of the loss value of the AI ​​model during the multi-round iterative training process corresponding to the first observation window; When the first average slope is less than zero, calculate the absolute value of the second average slope of the loss value of the AI ​​model in the multi-round iterative training process corresponding to the second observation window; When the absolute value of the first average slope is less than the absolute value of the second average slope, the batch size of the AI ​​model is increased. When the absolute value of the first average slope is greater than the absolute value of the second average slope, the batch size of the AI ​​model is reduced.

13. The apparatus according to claim 12, characterized in that, The adjustment module is used for: When the absolute value of the first average slope is less than the absolute value of the second average slope, and the deviation between the absolute values ​​of the first average slope and the second average slope meets the adjustment condition, the batch size of the AI ​​model is increased. When the absolute value of the first average slope is greater than the absolute value of the second average slope, and the deviation between the absolute values ​​of the first average slope and the second average slope meets the adjustment condition, the batch size of the AI ​​model is reduced.

14. The apparatus according to any one of claims 10 to 13, characterized in that, The data processing system includes multiple accelerators, the AI ​​model is trained through the multiple accelerators, the data processing system also includes a central processing unit (CPU), and the AI ​​model includes a first network layer and a second network layer. The CPU is used to update the parameter values ​​in the first network layer based on the gradient data calculated by the multiple accelerators for the first network layer during the backpropagation process for the AI ​​model. The plurality of accelerators are used to perform reverse computation on the second network layer during the process of the CPU updating the parameter values ​​in the first network layer.

15. The apparatus according to claim 14, characterized in that, The plurality of accelerators are also used to perform gradient overflow detection on the first network layer before the CPU updates the parameter values ​​in the first network layer.

16. The apparatus according to claim 14 or 15, characterized in that, The CPU is specifically used to execute scaling operators and optimizer operators in parallel based on a single instruction multiple data stream (SIMD) architecture, according to the gradient data calculated by the multiple accelerators for the first network layer. The scaling operators and optimizer operators are used to update the parameter values ​​in the first network layer.

17. The apparatus according to any one of claims 14 to 16, characterized in that, The data processing system includes multiple CPUs, the reduction result is stored in memory, and the parameter values ​​in the first network layer are updated by the CPU with the highest memory access efficiency among the multiple CPUs.

18. The apparatus according to any one of claims 10 to 17, characterized in that, The data processing system also includes a central processing unit (CPU), and the AI ​​model includes a third network layer. The CPU is further configured to update the values ​​of the parameters in the third network layer, wherein the updated values ​​of the parameters in the third network layer are obtained as values ​​in a first format; the first format values ​​are converted into values ​​in a second format, wherein the data precision of the first format is higher than that of the second format; and the second format values ​​of the parameters in the third network layer are sent to the plurality of accelerators.

19. A computing device, characterized in that the computing device includes a processor and a memory, and the processor is configured to execute instructions stored in the memory to cause the computing device to perform the steps of the method according to any one of claims 1 to 9.

20. A data processing system, characterized in that the data processing system includes at least one accelerator and an artificial intelligence (AI) model training device, the AI model training device being used to perform the method according to any one of claims 1 to 9, and the at least one accelerator being used to train the AI model.

21. A computer-readable storage medium, characterized in that it includes instructions that, when executed on a computing device, cause the computing device to perform the steps of the method according to any one of claims 1 to 9.

22. A computer program product containing instructions, characterized in that, when the computer program product is run on at least one computing device, it causes the at least one computing device to perform the steps of the method according to any one of claims 1 to 9.