Training method based on parallel strategy, and related apparatus
By adopting differentiated parallel strategies for the image and language processing modules of multimodal large models, the problem of low model computing power utilization in existing technologies is solved, achieving more efficient utilization of computing resources and model training efficiency.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- HUAWEI TECH CO LTD
- Filing Date
- 2025-12-16
- Publication Date
- 2026-06-25
AI Technical Summary
Existing parallel training methods ignore the differences between different processing modules when training large multimodal models, resulting in low utilization of model computing power.
Different parallel strategies are adopted for the image processing module and the language processing module. The image processing module uses one or more of the following strategies: data parallelism, tensor parallelism, pipeline parallelism and context parallelism. The language processing module uses different parallel strategies. They share computing resources but execute different parallel strategies to improve the utilization of model computing power during model training.
By separating and decoupling the processing modules at the computational logic level and sharing resources at the physical level, the utilization rate of computing resources is improved, the waiting time between adjacent computing nodes is reduced, and the efficiency of model training is increased.
Description
A training method and related apparatus based on a parallel strategy
[0001] This application claims priority to Chinese Patent Application No. 202411906967.6, filed on December 20, 2024, entitled “A Training Method and Related Apparatus Based on Parallel Strategy”, the entire contents of which are incorporated herein by reference.
Technical Field
[0002] This application relates to the field of artificial intelligence, and in particular to a training method and related apparatus based on a parallel strategy.
Background Technology
[0003] Multimodal large language models (MLLMs) are a class of models that combine natural language processing capabilities with the ability to understand and generate information from other modalities (such as images and speech). Therefore, MLLMs can process and understand information from different modalities and fuse this information to accomplish complex tasks, making them widely applicable.
[0004] Existing parallel training methods employ homogeneous parallelism when training large multimodal models. This means that the multiple processing modules within the large multimodal model, which handle information from different modalities, employ the same parallel strategy. This shared strategy can include one or more of the following: data parallelism (DP), tensor parallelism (TP), pipeline parallelism (PP), and context parallelism (CP). However, homogeneous parallelism ignores the differences between the processing modules, resulting in low model FLOPs utilization (MFU) during model training.
Summary of the Invention
[0005] This application provides a training method and related apparatus based on a parallel strategy, which can be applied to the field of artificial intelligence. It can not only improve the utilization rate of model computing power during model training, but also improve the utilization rate of computing resources.
[0006] In a first aspect, embodiments of this application provide a training method based on a parallel strategy, the method comprising:
[0007] Determine a first parallel strategy for the image processing module in a multimodal large model, wherein the first parallel strategy includes one or more of the following: data parallelism (DP), tensor parallelism (TP), pipeline parallelism (PP), and context parallelism (CP).
[0008] Determine a second parallel strategy for the language processing module in a multimodal large model, wherein the second parallel strategy includes one or more of the following strategies: DP strategy, TP strategy, PP strategy, and CP strategy.
[0009] The image processing module uses the same computing resources as the language processing module, but the first and second parallel strategies executed by the computing resources are different. The multimodal large model is used to perform multimodal tasks.
[0010] When training large multimodal models, the above method allows the image processing module and the language processing module to share computing resources while executing different parallel strategies. This avoids the loss of model FLOPs utilization (MFU) caused by ignoring the differences between the processing modules, thereby improving MFU. Moreover, when applied to training scenarios in which the image processing module and the language processing module depend on each other, the above method separates and decouples the processing modules at the computational logic level while sharing computing resources at the physical level, which improves the utilization of computing resources.
[0011] In one alternative implementation, computing resources include accelerators or accelerator clusters for parallel computing, with accelerators including virtual accelerators or physical accelerators.
[0012] In one alternative implementation, the parallel dimension of the first parallel strategy is the same as that of the second parallel strategy. The parallel dimension includes the product of one or more of the parallel dimensions of the DP strategy, the TP strategy, the PP strategy, and the CP strategy.
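The parallel-dimension condition above can be illustrated with a minimal sketch. The class name `ParallelStrategy`, the helper `can_share_resources`, and the example degrees (DP=8 for the image module, TP=2 and PP=4 for the language module) are assumptions made for illustration and do not come from the application.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class ParallelStrategy:
    """Hypothetical container for the degree of each parallel strategy."""
    dp: int = 1  # data parallelism
    tp: int = 1  # tensor parallelism
    pp: int = 1  # pipeline parallelism
    cp: int = 1  # context parallelism

    def parallel_dimension(self) -> int:
        # The parallel dimension is the product of the individual degrees.
        return prod((self.dp, self.tp, self.pp, self.cp))


def can_share_resources(first: ParallelStrategy, second: ParallelStrategy) -> bool:
    """True when the strategies differ but have the same parallel dimension."""
    return first != second and first.parallel_dimension() == second.parallel_dimension()


# Illustrative values only: both strategies occupy 8 computing units.
vit_strategy = ParallelStrategy(dp=8)
llm_strategy = ParallelStrategy(tp=2, pp=4)
```

In this sketch the two modules map onto the same eight computing units, but each unit plays a different role depending on which module is running.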
[0013] In another alternative implementation, the image processing module is a vision transformer (VIT), and the language processing module is a large language model (LLM). The method further includes:
[0014] Multiple feature data are generated based on the first parallel strategy. Each feature data is obtained by VIT forward calculation from the image data of one of the multiple training samples. Each of the multiple training samples includes image data and text data. The text data is a text instruction for obtaining the descriptive text of the image data.
[0015] The text data and feature data in each training sample are fused separately to generate multiple fused data sets.
[0016] Multiple prediction data are generated based on the second parallel strategy, wherein the multiple prediction data are obtained by forward calculation of multiple fused data through LLM respectively;
[0017] Optimize parameters in a multimodal large model based on multiple prediction data.
[0018] In another alternative implementation, each training sample in the multiple training samples includes multiple image-text pairs. Each image-text pair includes a sample image and a sample text corresponding to the sample image. The sample images in the multiple image-text pairs belong to image data, and the sample text in the multiple image-text pairs belong to text data.
[0019] In another alternative implementation, the sum of the text token and the image token for each training sample is the same.
[0020] In the above method, each training sample has the same data length, which helps to reduce the waiting time between adjacent computing nodes when using 1F1B pipeline scheduling, that is, to reduce PP bubbles (pipeline idle time), and thus further improve MFU.
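One way to obtain training samples with an identical token total is to pack image-text pairs into a fixed token budget. The following is a minimal sketch under stated assumptions: the greedy packing order, the use of padding to fill the remainder, and the function name `pack_pairs` are all hypothetical, since the application only states that the totals are equal.

```python
def pack_pairs(pairs, budget):
    """Greedily pack (image_tokens, text_tokens) pairs into samples.

    Assumption: each sample is padded up to `budget` tokens so that every
    sample has the same length; the application does not specify padding.
    """
    samples, current, used = [], [], 0
    for image_tokens, text_tokens in pairs:
        need = image_tokens + text_tokens
        if need > budget:
            raise ValueError("a single image-text pair exceeds the token budget")
        if used + need > budget:
            # Close the current sample and start a new one.
            samples.append({"pairs": current, "pad": budget - used})
            current, used = [], 0
        current.append((image_tokens, text_tokens))
        used += need
    if current:
        samples.append({"pairs": current, "pad": budget - used})
    return samples


def sample_length(sample):
    """Total token count of a packed sample, including padding."""
    return sum(i + t for i, t in sample["pairs"]) + sample["pad"]


# Illustrative token counts only.
packed = pack_pairs([(576, 24), (576, 100), (576, 200), (576, 48)], budget=1400)
```

Because every packed sample has the same length, each micro-batch takes roughly the same time on a pipeline stage, which is what limits the idle time between adjacent stages.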
[0021] In another alternative implementation, the parameters in the multimodal large model are optimized based on multiple prediction data, including:
[0022] Multiple first reverse data are generated, wherein the multiple first reverse data are obtained by backward calculation of the multiple prediction data through the LLM;
[0023] Optimize the parameters in the LLM based on the multiple first reverse data;
[0024] Multiple second reverse data are generated, wherein the multiple second reverse data are obtained by backward calculation of the multiple first reverse data through the VIT;
[0025] Optimize the parameters in the VIT based on the multiple second reverse data.
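The order of the forward and backward steps above can be traced with a deliberately tiny stand-in. In this sketch the VIT and the LLM are each reduced to a single scalar weight, fusion is modeled as addition, and the loss is a squared error; all of these are assumptions chosen so that the gradients can be written by hand, and none of them reflects the actual models.

```python
def train_step(w_vit, w_llm, image, text, target, lr=0.1):
    """One toy training step: VIT forward, fusion, LLM forward, then backward."""
    # Forward pass.
    feature = w_vit * image   # feature data from the (toy) VIT
    fused = feature + text    # fused data (toy fusion by addition)
    pred = w_llm * fused      # prediction data from the (toy) LLM
    loss = 0.5 * (pred - target) ** 2

    # Backward pass through the LLM.
    d_pred = pred - target           # gradient of the loss w.r.t. the prediction
    grad_llm = d_pred * fused        # gradient for the LLM parameter
    first_reverse = d_pred * w_llm   # gradient handed back to the VIT output

    # Backward pass through the VIT.
    grad_vit = first_reverse * image  # gradient for the VIT parameter

    # Parameter updates (plain gradient descent, an illustrative choice).
    return w_vit - lr * grad_vit, w_llm - lr * grad_llm, loss


# Two consecutive steps on one toy sample.
w_vit, w_llm, loss_1 = train_step(1.0, 1.0, image=2.0, text=1.0, target=2.0)
_, _, loss_2 = train_step(w_vit, w_llm, image=2.0, text=1.0, target=2.0)
```

The point of the sketch is the dependency chain: the VIT gradient cannot be computed until the LLM backward pass has produced the gradient with respect to its input.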
[0026] In another alternative implementation, the computing resource includes at least one computing node, and each computing node includes multiple computing units. The number of computing units in the computing resource is equal to the parallel dimension of the first parallel strategy.
[0027] In another alternative implementation, the parallel dimension of the PP strategy in the second parallel strategy is N_pp. In the second parallel strategy, the pipeline scheduling of the PP strategy is one forward computation followed by one backward computation (1F1B), and the computing resources include N_pp computing nodes. The i-th computing node among the N_pp computing nodes includes the i-th processing layer of the language processing module, where i takes the values 1, 2, ..., N_pp.
[0028] When the second parallel strategy includes the PP strategy, the above method adopts a 1F1B pipeline scheduling, which allows the forward and backward computations of the language processing module to be performed alternately, thereby reducing the total time for the language processing module to perform forward and backward computations, and thus improving MFU.
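The alternation of forward and backward computations under 1F1B can be made concrete by listing the operations each pipeline stage executes. This is a minimal sketch that assumes the commonly used non-interleaved 1F1B ordering (warm-up forwards, a steady 1F1B phase, then cool-down backwards); the application does not spell out the warm-up rule, and the function names are hypothetical.

```python
def one_f_one_b(stage, num_stages, num_microbatches):
    """Operation order for one pipeline stage (stage is 0-indexed).

    "F3" means forward of micro-batch 3, "B3" means its backward.
    """
    warmup = min(num_stages - stage - 1, num_microbatches)
    ops = [f"F{m}" for m in range(warmup)]  # warm-up: forwards only
    fwd, bwd = warmup, 0
    for _ in range(num_microbatches - warmup):  # steady state: one F, one B
        ops.append(f"F{fwd}")
        fwd += 1
        ops.append(f"B{bwd}")
        bwd += 1
    while bwd < num_microbatches:  # cool-down: remaining backwards
        ops.append(f"B{bwd}")
        bwd += 1
    return ops


def peak_in_flight(ops):
    """Largest number of micro-batches whose activations are held at once."""
    live = peak = 0
    for op in ops:
        live += 1 if op.startswith("F") else -1
        peak = max(peak, live)
    return peak


# Illustrative configuration: 4 pipeline stages, 8 micro-batches.
schedules = [one_f_one_b(s, num_stages=4, num_microbatches=8) for s in range(4)]
peaks = [peak_in_flight(ops) for ops in schedules]
```

Under this ordering the first stage holds activations for the most micro-batches and the last stage for the fewest, which is the uneven storage pressure that motivates asymmetric data allocation.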
[0029] In another alternative implementation, multiple feature data are generated based on the first parallel strategy, including:
[0030] Based on the parallel dimension of the DP strategy in the first parallel strategy, the image data from the multiple training samples are distributed to the N_pp computing nodes, where the larger i is, the more image data is allocated to the i-th computing node among the N_pp computing nodes;
[0031] Based on VIT, multiple feature data are obtained by processing image data from multiple training samples.
[0032] Where the first parallel strategy includes the DP strategy and the second parallel strategy includes the PP strategy, 1F1B pipeline scheduling produces uneven activation storage pressure between PP groups (i.e., between the multiple computing nodes where the multiple processing layers are located). The above method can therefore allocate image data asymmetrically to the multiple computing nodes based on the parallel dimension of the DP strategy in the first parallel strategy, thereby balancing the storage pressure of the multiple computing nodes throughout the entire model training process.
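The asymmetric allocation can be sketched as follows. The weighting rule (the i-th node receives a share proportional to i) and the function name `allocate_images` are assumptions for illustration; the application only requires that nodes with a larger index i receive more image data.

```python
def allocate_images(num_images, num_nodes):
    """Split `num_images` across nodes so that later nodes get more.

    Assumption: node i (1-indexed) gets a share proportional to i.
    """
    weights = list(range(1, num_nodes + 1))
    total_weight = sum(weights)
    shares = [num_images * w // total_weight for w in weights]
    # Integer division may leave a few images unassigned; hand them to the
    # last nodes, which hold the fewest language-model activations.
    leftover = num_images - sum(shares)
    for k in range(leftover):
        shares[num_nodes - 1 - k] += 1
    return shares


# Illustrative values only: 40 images over 4 pipeline nodes.
shares = allocate_images(40, 4)
```

With 1F1B scheduling, early pipeline stages keep activations for more in-flight micro-batches than late stages, so giving late stages more image data moves their memory use closer to that of the early stages.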
[0033] In yet another alternative implementation, the method further includes:
[0034] A third parallel strategy for the image processing module is determined, wherein the third parallel strategy includes one or more of the DP strategy, TP strategy, PP strategy and CP strategy, and the third parallel strategy is used to train the image processing module.
[0035] A fourth parallel strategy for the language processing module is determined, wherein the fourth parallel strategy includes one or more of the DP strategy, TP strategy, PP strategy and CP strategy. The fourth parallel strategy is used to train the language processing module. The third parallel strategy and the fourth parallel strategy executed by the computing resources are different.
[0036] In yet another alternative implementation, the method further includes:
[0037] Under the training based on the first parallel strategy and the second parallel strategy, determine the first model computing power utilization rate (MFU) of the multimodal large model.
[0038] Determine the second MFU of the multimodal large model under training based on the third and fourth parallel strategies;
[0039] If the first MFU is greater than the second MFU, the first parallel strategy is determined to be the preferred parallel strategy for the image processing module, and the second parallel strategy is determined to be the preferred parallel strategy for the language processing module.
[0040] In the above method, the model training device can also perform model training for the parallel strategy in each combination method to obtain the MFU corresponding to each combination method, and select the preferred parallel strategy corresponding to each processing module according to the MFU to further improve the MFU.
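The MFU-based selection can be sketched as a comparison over candidate strategy combinations. The MFU formula used here (model FLOPs per step divided by step time times aggregate peak FLOPs) is the common definition and is an assumption with respect to this application; every number and name below is hypothetical.

```python
def mfu(model_flops_per_step, step_time_s, num_units, peak_flops_per_unit):
    """Model FLOPs utilization: achieved FLOPs/s over aggregate peak FLOPs/s."""
    return model_flops_per_step / (step_time_s * num_units * peak_flops_per_unit)


def pick_preferred(candidates):
    """Return the candidate combination with the highest measured MFU."""
    return max(candidates, key=lambda c: c["mfu"])


# Hypothetical measurements for two strategy combinations on 8 units.
candidates = [
    {
        "name": "first + second strategy",
        "mfu": mfu(1e15, step_time_s=2.5, num_units=8, peak_flops_per_unit=1e14),
    },
    {
        "name": "third + fourth strategy",
        "mfu": mfu(1e15, step_time_s=4.0, num_units=8, peak_flops_per_unit=1e14),
    },
]
preferred = pick_preferred(candidates)
```

The same model FLOPs per step are used for both candidates because the model is unchanged; only the measured step time differs between combinations.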
[0041] In yet another alternative implementation, the method further includes:
[0042] Send the multimodal large model trained based on the first parallel strategy and the second parallel strategy.
[0043] In a second aspect, embodiments of this application provide a computing device, which includes a first determining module and a second determining module, wherein:
[0044] The first determining module is used to determine the first parallel strategy of the image processing module in the multimodal large model. The first parallel strategy includes one or more of the following: data parallelism (DP), tensor parallelism (TP), pipeline parallelism (PP), and context parallelism (CP). The first parallel strategy is used to train the image processing module.
[0045] The second determining module is used to determine the second parallel strategy of the language processing module in the multimodal large model. The second parallel strategy includes one or more of the DP strategy, TP strategy, PP strategy and CP strategy. The second parallel strategy is used to train the language processing module.
[0046] The image processing module uses the same computing resources as the language processing module, but the first and second parallel strategies executed by the computing resources are different. The multimodal large model is used to perform multimodal tasks.
[0047] In one alternative implementation, computing resources include accelerators or accelerator clusters for parallel computing, with accelerators including virtual accelerators or physical accelerators.
[0048] In one alternative implementation, the parallel dimension of the first parallel strategy is the same as that of the second parallel strategy. The parallel dimension includes the product of one or more of the parallel dimensions of the DP strategy, the TP strategy, the PP strategy, and the CP strategy.
[0049] In another alternative implementation, the image processing module is a vision transformer (VIT), the language processing module is a large language model (LLM), and the computing device also includes a first generation module, a fusion processing module, a second generation module, and an optimization module, wherein:
[0050] The first generation module is used to generate multiple feature data based on the first parallel strategy. Each feature data is obtained by VIT forward calculation of image data in one of the multiple training samples. Each training sample includes image data and text data. The text data is a text instruction for obtaining the descriptive text of the image data.
[0051] The fusion processing module is used to fuse the text data and feature data in each training sample separately to generate multiple fused data.
[0052] The second generation module is used to generate multiple prediction data based on the second parallel strategy, wherein the multiple prediction data are obtained by forward calculation of multiple fused data through LLM respectively;
[0053] The optimization module is used to optimize the parameters in a multimodal large model based on multiple prediction data.
[0054] In another alternative implementation, each training sample in the multiple training samples includes multiple image-text pairs. Each image-text pair includes a sample image and a sample text corresponding to the sample image. The sample images in the multiple image-text pairs belong to image data, and the sample text in the multiple image-text pairs belong to text data.
[0055] In another alternative implementation, the sum of the text token and the image token for each training sample is the same.
[0056] In another alternative implementation, with respect to optimizing the parameters in the multimodal large model based on the multiple prediction data, the optimization module is specifically used to:
[0057] Generate multiple first reverse data, wherein the multiple first reverse data are obtained by backward calculation of the multiple prediction data through the LLM;
[0058] Optimize the parameters in the LLM based on the multiple first reverse data;
[0059] Generate multiple second reverse data, wherein the multiple second reverse data are obtained by backward calculation of the multiple first reverse data through the VIT;
[0060] Optimize the parameters in the VIT based on the multiple second reverse data.
[0061] In another alternative implementation, the computing resource includes at least one computing node, and each computing node includes multiple computing units. The number of computing units in the computing resource is equal to the parallel dimension of the first parallel strategy.
[0062] In another alternative implementation, the parallel dimension of the PP strategy in the second parallel strategy is N_pp. In the second parallel strategy, the pipeline scheduling of the PP strategy is one forward computation followed by one backward computation (1F1B), and the computing resources include N_pp computing nodes. The i-th computing node among the N_pp computing nodes includes the i-th processing layer of the language processing module, where i takes the values 1, 2, ..., N_pp.
[0063] In another alternative implementation, regarding the generation of multiple feature data based on the first parallel strategy, the first generation module is specifically used for:
[0064] Based on the parallel dimension of the DP strategy in the first parallel strategy, the image data from the multiple training samples are distributed to the N_pp computing nodes, where the larger i is, the more image data is allocated to the i-th computing node among the N_pp computing nodes;
[0065] Based on VIT, multiple feature data are obtained by processing image data from multiple training samples.
[0066] In another alternative implementation:
[0067] The first determining module is also used to determine a third parallel strategy for the image processing module, wherein the third parallel strategy includes one or more of the DP strategy, TP strategy, PP strategy and CP strategy, and the third parallel strategy is used to train the image processing module.
[0068] The second determining module is also used to determine a fourth parallel strategy for the language processing module, wherein the fourth parallel strategy includes one or more of the DP strategy, TP strategy, PP strategy and CP strategy. The fourth parallel strategy is used to train the language processing module, and the third parallel strategy and the fourth parallel strategy executed by the computing resources are different.
[0069] In another alternative implementation, the computing device further includes a third determining module, wherein:
[0070] The first determining module is also used to determine the first model computing power utilization rate (MFU) of the multimodal large model during training based on the first parallel strategy and the second parallel strategy.
[0071] The second determination module is also used to determine the second MFU of the multimodal large model when training based on the third and fourth parallel strategies.
[0072] The third determining module is used to determine, when the first MFU is greater than the second MFU, the first parallel strategy is the preferred parallel strategy corresponding to the image processing module, and the second parallel strategy is the preferred parallel strategy corresponding to the language processing module.
[0073] In another alternative implementation, the computing device further includes a transceiver module, wherein:
[0074] The transceiver module is used to send the multimodal large model trained based on the first parallel strategy and the second parallel strategy.
[0075] In a third aspect, embodiments of this application provide a computing device including a processor, which is configured to invoke a computer program to implement the method described in the first aspect or any possible implementation of the first aspect.
[0076] In a fourth aspect, embodiments of this application provide a computing device, which includes logic circuitry and an interface, the logic circuitry and the interface being coupled; the interface is used for inputting and / or outputting information, wherein:
[0077] This logic circuit is used to perform the method described in the first aspect or any possible implementation thereof.
[0078] In a fifth aspect, embodiments of this application provide a server, which includes a processor for invoking a computer program to implement the method described in the first aspect or any possible implementation of the first aspect.
[0079] In a sixth aspect, embodiments of this application provide a chip including logic circuitry and an interface, the logic circuitry and the interface being coupled; the interface is used for inputting and / or outputting information, wherein:
[0080] This logic circuit is used to perform the method described in the first aspect or any possible implementation thereof.
[0081] In a seventh aspect, embodiments of this application provide a computer-readable storage medium for storing a computer program, wherein when the computer program is executed, it is capable of implementing the method of the first aspect or any possible implementation of the first aspect.
[0082] In one alternative implementation, the computer-readable storage medium may be a non-transitory computer-readable storage medium.
[0083] The beneficial effects of the methods and apparatus provided in any of the second to seventh aspects and any of the possible implementations of the second to seventh aspects of this application can be referred to the beneficial effects of the technical solutions provided in the first aspect and any possible implementation of the first aspect, and will not be repeated here.
Attached Figure Description
[0084] The accompanying drawings used in the embodiments of this application are described below.
[0085] Figure 1 is a schematic diagram illustrating the working principle of a model provided in an embodiment of this application;
[0086] Figure 2 is a schematic diagram illustrating the working principle of another model provided in an embodiment of this application;
[0087] Figure 3A is a schematic diagram of the architecture of a model training system provided in an embodiment of this application;
[0088] Figure 3B is a schematic diagram of the structure of a model training device provided in an embodiment of this application;
[0089] Figure 4 is a schematic diagram of the principle of a parallel strategy provided in an embodiment of this application;
[0090] Figure 5 is a schematic diagram illustrating the principle of a parallel strategy-based training method provided in an embodiment of this application;
[0091] Figure 6 is a flowchart illustrating a training method based on a parallel strategy provided in an embodiment of this application;
[0092] Figure 7 is a schematic diagram illustrating the principle of another training method based on a parallel strategy provided in an embodiment of this application;
[0093] Figure 8 is a timing diagram of pipeline scheduling provided in an embodiment of this application;
[0094] Figure 9 is a schematic diagram of the structure of a computing device provided in an embodiment of this application;
[0095] Figure 10 is a schematic diagram of the structure of another computing device provided in an embodiment of this application;
[0096] Figure 11 is a schematic diagram of the structure of another computing device provided in an embodiment of this application.
Detailed Implementation
[0097] The embodiments of this application are described below with reference to the accompanying drawings.
[0098] Multimodal large language models (MLLMs) are a class of models that combine natural language processing capabilities with the ability to understand and generate information from other modalities. Therefore, MLLMs can process and understand information from different modalities and fuse this information to accomplish complex tasks, making them widely applicable. It should be noted that modality refers to the source or form of information; for example, the modality of information can be text, images, video, audio, etc.
[0099] Commonly used large multimodal models include contrastive language-image pre-training (CLIP) models and large language and vision assistant (LLAVA) models. Both CLIP and LLAVA models include a language processing module and an image processing module, which process information from these two modalities respectively to complete multimodal tasks, such as "generating descriptive text from an image" or "generating descriptive text from text instructions and images."
[0100] The working principles of the CLIP model and the LLAVA model are briefly introduced below.
[0101] For example, please refer to Figure 1, which is a schematic diagram of the working principle of a model provided in an embodiment of this application. As shown in Figure 1, the image processing module in CLIP model 10 can be an image encoder 101. Optionally, the image encoder 101 can be a convolutional neural network. The language processing module in CLIP model 10 can be a text encoder 102. Optionally, the text encoder 102 can be a transformer model for processing text.
[0102] When CLIP model 10 performs the multimodal task of "generating descriptive text from an image", the target image serves as the input data for CLIP model 10. The text encoder 102 can process M pre-defined candidate texts into M text feature vectors (such as feature vector T1, feature vector T2, ..., feature vector TM). The image encoder 101 processes the target image into an image feature vector I1. Then, the CLIP model 10 calculates the similarity between the image feature vector I1 and the M text feature vectors, and outputs the candidate text corresponding to the text feature vector with the highest similarity to the image feature vector I1 as the descriptive text of the target image. Optionally, the similarity can be represented by the inner product between vectors; the larger the inner product, the greater the similarity.
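The inner-product selection described above can be sketched in a few lines. The feature vectors and candidate texts below are made-up toy values, and the function name `pick_description` is hypothetical; real encoders produce much higher-dimensional vectors.

```python
def pick_description(image_vec, text_vecs, candidates):
    """Return the candidate text whose feature vector has the largest
    inner product with the image feature vector, plus all scores."""
    scores = [sum(a * b for a, b in zip(image_vec, t)) for t in text_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best], scores


# Toy 3-dimensional features standing in for encoder outputs.
image_feature = [0.9, 0.1, 0.0]
text_features = [
    [0.1, 0.9, 0.0],  # "a photo of a cat"
    [0.8, 0.2, 0.1],  # "a photo of a dog"
    [0.0, 0.1, 0.9],  # "a photo of a car"
]
texts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
description, scores = pick_description(image_feature, text_features, texts)
```

Here the second candidate has the largest inner product with the image feature, so it is returned as the descriptive text.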
[0103] For example, please refer to Figure 2, which is a schematic diagram of the working principle of another model provided in the embodiment of this application. As shown in Figure 2, the image processing module in the LLAVA model 20 can be a vision transformer (VIT) 201, and the language processing module in the LLAVA model 20 can be a large language model (LLM) 202.
[0104] When LLAVA model 20 handles the multimodal task of "generating descriptive text based on text instructions and images," the target image and the text instruction serve as input data. VIT 201 processes the input target image into image feature data. Specifically, the target image can first be segmented into multiple sub-images of a fixed resolution, which are then used as input to VIT 201. The feature data of these sub-images, after VIT processing, are concatenated to form the image feature data. Next, the input text instruction is encoded into text feature data; after the text feature data and the image feature data are dimensionally aligned, they are concatenated to obtain fused data. Optionally, the text instruction can be used to request descriptive text for the target image; for example, the text instruction could be "output the content described in the image" or "what is described in the image?". Further, LLM 202 processes the fused data to output the descriptive text for the target image.
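The segmentation and concatenation steps can be sketched as follows. The sketch assumes the image has already been resized so that both sides are multiples of the tile size, and it places image features before text features in the fused sequence; neither choice is stated in the application, and the function names are hypothetical.

```python
def tile_boxes(width, height, tile):
    """Bounding boxes (left, top, right, bottom) of fixed-size sub-images.

    Assumption: width and height are multiples of `tile`.
    """
    if width % tile or height % tile:
        raise ValueError("image sides must be multiples of the tile size")
    return [
        (x, y, x + tile, y + tile)
        for y in range(0, height, tile)
        for x in range(0, width, tile)
    ]


def fuse(tile_features, text_features):
    """Concatenate per-tile features, then append the text features.

    Assumption: image features come first in the fused sequence.
    """
    image_features = [tok for tile in tile_features for tok in tile]
    return image_features + text_features


# Illustrative sizes: a 672x336 image split into 336x336 sub-images.
boxes = tile_boxes(672, 336, 336)
fused = fuse([["img_a", "img_b"], ["img_c"]], ["txt_1", "txt_2"])
```

In a real model each entry would be a feature vector of the same width after dimensional alignment; strings are used here only to make the ordering visible.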
[0105] As shown in Figures 1 and 2, in the CLIP model, the language processing module and the image processing module can work independently and in parallel. In the LLAVA model, the image processing module and the language processing module work in a specific order, and the input of the language processing module depends on the output of the image processing module.
[0106] Please refer to Figure 3A, which is a schematic diagram of the architecture of a model training system provided in an embodiment of this application. The model training system 30 includes at least a model training device 301, one or more data acquisition devices 302, and one or more model usage devices 303.
[0107] The model training device 301 and the data acquisition device 302 can communicate via wired or wireless means. The data acquisition device 302 can send multiple acquired training samples to the model training device 301. The training samples include image information and language information. Accordingly, the model training device 301 trains a multimodal large model using the received training samples.
[0108] The model training device 301 and the model usage device 303 can communicate via wired or wireless means. Therefore, the model training device 301 can send the trained multimodal large model (or multimodal network) to the model usage device 303; correspondingly, the model usage device 303 uses the received multimodal large model to perform multimodal tasks. The multimodal large model includes an image processing module and a language processing module, which are used to process image-related information and language-related information, respectively.
[0109] Optionally, the data acquisition device 302 may include multiple devices, such as a first data acquisition device and a second data acquisition device. The first data acquisition device is used to acquire language information from the training samples, and the second data acquisition device is used to acquire image information from the training samples. Optionally, the first data acquisition device may be a device capable of speech recognition, and the second data acquisition device may be a device with a camera function. Optionally, both the first and second data acquisition devices may be devices that acquire network data or database data via a network (such as cloud or terminal devices).
[0110] Optionally, there may be only one data acquisition device 302, in which case the data acquisition device 302 may acquire multiple training samples.
[0111] The model training device 301 can acquire a first parallel strategy for training the image processing module and a second parallel strategy for training the language processing module. Optionally, the first and second parallel strategies can be determined by the model training device 301 itself, or they can be determined by other devices (such as the data acquisition device 302) and then sent to the model training device.
[0112] Optionally, the multimodal large model in the model training device 301 is a model built based on the transformer architecture, such as the CLIP model or the LLAVA model.
[0113] Optionally, the model training device 301 can be a computing device with strong computing power, such as a server or a server cluster consisting of multiple servers.
[0114] In one optional implementation, the model training system 30 may further include an application server. During the training phase, the model training device 301 trains a multimodal large model using multiple samples. During the model usage phase, the trained multimodal large model is stored in the application server. Optionally, the application server can send the trained multimodal large model to the model usage device 303, which uses the multimodal large model to process the acquired image-type information and / or language-type information, generating processing results to complete the multimodal task. Alternatively, the model usage device 303 sends the acquired image-type information and / or language-type information to the application server, which uses the multimodal large model to process the acquired information, generating processing results, and then sends the processing results to the model usage device 303, facilitating the model usage device 303 in completing the multimodal task.
[0115] Optionally, the model usage device 303 can be a computing device that needs to perform multimodal tasks (such as "analyzing an image and generating descriptive text" or "generating descriptive text based on text instructions and images"). For example, the model usage device 303 can be a terminal device with computing capabilities, such as a mobile phone, tablet computer, computer with wireless transceiver capabilities, wearable device, vehicle, drone, helicopter, airplane, ship, robot, robotic arm, smart home device, transportation vehicle with wireless communication capabilities, communication module, etc. The embodiments of this application do not limit the device form of the terminal device.
[0116] Optionally, the model usage device 303 can feed back the processing results generated based on the multimodal large model to the model training device 301, so that the model training device 301 can further train the model based on the processing results generated by the model usage device 303. The retrained model can be sent to the model usage device 303 to update the original model.
[0117] It is understood that the model training device 301 and the model usage device 303 mentioned above can be two separate devices or a single device. When the model training device 301 and the model usage device 303 are a single device, the aforementioned communication connection and information exchange process between the two devices does not exist. The model training device 301 and the application server mentioned above can be two separate devices or a single device, and this application does not impose any restrictions on this.
[0118] Furthermore, the model training device 301 includes one or more computing nodes, each computing node including multiple computing units, and the number of multiple computing units in each computing node is the same.
[0119] Optionally, the computing unit in this embodiment can be a physical computing resource, such as a graphics processing unit (GPU), a data processing unit (DPU), a digital signal processing unit (DSP), a neural network processing unit (NPU), an application-specific integrated circuit (ASIC), or a field-programmable gate array (FPGA). Optionally, the computing unit in this embodiment can also be a virtual computing resource, such as a virtual CPU (vCPU), a virtual GPU (vGPU), a virtual DSP (vDSP), a virtual NPU (vNPU), a virtual ASIC (vASIC), or a virtual FPGA (vFPGA).
[0120] In the training process of a multimodal large model, the input data of processing modules (such as image processing and language processing modules) can be represented as a tensor. As shown in Figure 3B, a small cube represents a value in the input data F0. The dimensions of this input data F0 include batch, sequence, and hidden dimensions. The dimensions of input data F0 can be represented in the form [B, S, H], where B is the length of input data F0 in the batch dimension, S is the length of input data F0 in the sequence dimension, and H is the length of input data F0 in the hidden dimension. Figure 3B uses the dimension of input data F0 as [6, 6, 6] for illustration. When training a multimodal large model, B reflects the number of training samples processed in one training cycle, S reflects the number of tokens in a training sample, and H reflects the number of values or vectors used to quantize a token in a training sample. H is related to the weight matrix used in the processing module and can be set according to actual needs.
[0121] It's important to note that in the field of deep learning, a token represents the smallest unit of data in the data processing process. For example, for text data, a text token typically represents a word, a punctuation mark, a letter, or a number. For image data, an image token typically represents a sub-region (also known as a patch) of the image.
[0122] Existing parallel training methods employ a homogeneous parallel strategy when training large multimodal models. This means that multiple processing modules corresponding to different modalities in a large multimodal model use the same parallel strategy. This same parallel strategy can include one or more of the following strategies: data parallelism (DP), tensor parallelism (TP), pipeline parallelism (PP), and context parallelism (CP).
[0123] The following section explains the DP, TP, PP, and CP strategies. For clarity, please refer to Figure 4, which is a schematic diagram illustrating the principle of a parallel strategy provided in an embodiment of this application.
[0124] As shown in Figure 4(A), the DP strategy evenly divides the input data 401 of the processing module (such as the image processing module or the language processing module) along the batch dimension. The number of divisions is equal to the parallel dimension N_DP of the DP strategy; that is, the input data 401 can be divided into N_DP sub-input data (e.g., sub-input data 4011). It can be understood that the input data 401 shown in Figure 4(A) can be considered as a two-dimensional projection of the input data F0 shown in Figure 3B onto the batch-hidden dimension plane. Further, the N_DP sub-input data can be evenly distributed as input across multiple computing units in the model training device, and each of these computing units is configured with the processing module (i.e., the processing module is copied multiple times and distributed across the multiple computing units). It can be understood that the parallel dimension N_DP is no greater than the number of computing units in the model training device.
[0125] As shown in Figure 4(B), the CP strategy evenly divides the input data 401 of the processing module along the sequence dimension. The number of divisions is equal to the parallel dimension N_CP of the CP strategy; that is, the input data 401 can be divided into N_CP sub-input data (e.g., sub-input data 4012). It can be understood that the input data 401 shown in Figure 4(B) can be considered as a two-dimensional projection of the input data F0 shown in Figure 3B onto the sequence-hidden dimension plane. Further, the N_CP sub-input data can be evenly distributed as input across multiple computing units in the model training device, and each of these computing units is configured with the processing module (i.e., the processing module is copied multiple times and distributed across the multiple computing units). It can be understood that the parallel dimension N_CP is no greater than the number of computing units in the model training device.
[0126] In multimodal large models, the image processing and language processing modules typically contain multiple processing layers (such as feedforward neural network (FNN) layers) from a transformer model. Each processing layer contains a parameter matrix (weight). The dimension of the parameter matrix is related to the hidden layer dimension of the input data. The dimension of the parameter matrix includes the hidden layer dimension of the input data and the hidden layer dimension of the parameter matrix. For distinction, the hidden layer dimension of the input data was previously referred to as the hidden dimension, and here the hidden layer dimension of the parameter matrix is referred to as the hidden′ dimension. Therefore, the dimension of the parameter matrix can also be represented as [H, H′], where H′ is the length of the parameter matrix in the hidden′ dimension, which is usually 4 times H.
[0127] As shown in Figure 4(C), the TP strategy evenly splits the parameter matrix 402 in the processing module along the hidden′ dimension. The number of splits is equal to the parallel dimension N_TP of the TP strategy; that is, the parameter matrix 402 can be divided into N_TP sub-parameter matrices (e.g., sub-parameter matrix 4021). Further, the N_TP sub-parameter matrices can be evenly distributed across multiple computing units in the model training device (i.e., each computing unit contains a portion of the parameter matrix from the processing module), and each of these computing units processes the same input data. It can be understood that the parallel dimension N_TP is no greater than the number of computing units in the model training device.
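As a non-limiting illustration of the TP strategy described above, the following minimal sketch splits a small parameter matrix along its second (hidden′) dimension and checks that the partial results, once joined, equal the unsplit computation. The helper names `matmul` and `split_columns`, the matrix values, and the choice of N_TP = 2 are assumptions made for illustration only and do not appear in the embodiments.

```python
def matmul(x, w):
    """Multiply matrix x (rows x H) by matrix w (H x H')."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, n_tp):
    """Split w evenly along its second (hidden') dimension into n_tp shards."""
    if len(w[0]) % n_tp != 0:
        raise ValueError("H' must be divisible by N_TP")
    width = len(w[0]) // n_tp
    return [[row[k * width:(k + 1) * width] for row in w] for k in range(n_tp)]


x = [[1, 2], [3, 4]]               # input data with H = 2 (illustrative values)
w = [[1, 0, 2, 1], [0, 1, 1, 3]]   # parameter matrix [H, H'] with H' = 4
shards = split_columns(w, n_tp=2)  # N_TP = 2 sub-parameter matrices

# Each computing unit multiplies the same input data by its own shard.
partials = [matmul(x, shard) for shard in shards]

# Joining the partial outputs column-wise recovers the unsplit result.
combined = [sum((p[r] for p in partials), []) for r in range(len(x))]
```

The joining step stands in for the communication between computing units; in an actual system this would be a collective operation rather than a list concatenation.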
[0128] As shown in Figure 4(D), the PP strategy evenly divides the multiple processing layers in the processing module based on the parallel dimension N_PP of the PP strategy. The number of divisions is equal to N_PP; that is, N_PP processing layers (such as processing layer 403) are obtained after the multiple processing layers are divided. Further, the N_PP processing layers can be evenly distributed across multiple computing units in the model training device (i.e., each computing unit contains a portion of the processing layers from the processing module), and each of these computing units processes the same input data. It can be understood that the parallel dimension N_PP is no greater than the number of computing units in the model training device.
[0129] However, the homogeneous parallel strategy in existing parallel training methods ignores the differences between different processing modules in multimodal large models, resulting in low model flops utilization (MFU) during model training.
[0130] For example, consider an image processing module that is a VIT and a language processing module that is an LLM, where the parallel strategies of both VIT and LLM include TP8 (i.e., a TP strategy with a parallel dimension N_TP of 8). On the one hand, VIT is deep and narrow, while LLM is deep and wide. Compared with LLM, the parameter matrix of VIT's processing layer is a slender matrix (i.e., H is relatively small). When the parallel strategy TP8 is used, the parameter matrix becomes even more slender after being split, resulting in lower computational efficiency when the slender parameter matrix participates in calculations (such as matrix multiplication), thus leading to a lower MFU. On the other hand, since the input data of VIT is an image, when the image resolution is high, the length S of the input data in the sequence dimension of VIT is larger than that of LLM. When the parallel strategy TP8 is used, the input data needs to be operated on with multiple split parameter matrices on multiple computing units. The calculation results of the multiple computing units also need to be aggregated and output through communication between the computing units, resulting in longer communication time and computation time for VIT, thus leading to a lower MFU.
[0131] Considering that the language processing module and image processing module in the CLIP model can work independently and in parallel, the model training device can allocate different processing modules in the CLIP model to different computing resources to train the CLIP model. Furthermore, different computing resources execute different parallel strategies, which is beneficial to improving MFU during model training.
[0132] For clarity and exemplification, please refer to Figure 5, which is a schematic diagram illustrating the principle of a parallel-strategy-based training method provided in an embodiment of this application. As shown in Figure 5, the model training device 50 includes three computing nodes (computing node 501, computing node 502, and computing node 503), and each computing node includes eight computing units. It can be understood that the model training device 50 may be the model training device 301 in the embodiment shown in Figure 3A.
[0133] The image processing module in the CLIP model uses computing node 501, while the language processing module uses computing nodes 502 and 503. The parallel strategy executed by computing node 501 (i.e., the parallel strategy of the image processing module) is DP4TP2 (i.e., including a DP strategy with a parallel dimension N_DP of 4 and a TP strategy with a parallel dimension N_TP of 2). The parallel strategy executed by computing nodes 502 and 503 (i.e., the parallel strategy of the language processing module) is TP8PP2 (i.e., including a TP strategy with a parallel dimension N_TP of 8 and a PP strategy with a parallel dimension N_PP of 2).
[0134] Therefore, the model training device 50 can process the input data of the image processing module based on the parallel strategy DP4TP2, and allocate the image processing module to the eight computing units of the computing node 501. Based on the explanation of the parallel strategy in Figure 4, it can be understood that the input data 504 of the image processing module is split into four sub-input data (i.e., sub-input data 5041, sub-input data 5042, sub-input data 5043, and sub-input data 5044), and the parameter matrix in the image processing module is split into two sub-parameter matrices (i.e., sub-parameter matrix W1 and sub-parameter matrix W2). Four of the eight computing units in the computing node 501 are assigned to sub-parameter matrix W1, and the other four computing units are assigned to sub-parameter matrix W2. Then, each of the four sub-input data is processed by the two computing units containing sub-parameter matrix W1 and sub-parameter matrix W2.
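As a non-limiting illustration of the DP4TP2 allocation described above, the following minimal sketch maps each of the eight computing units of computing node 501 to one sub-input data and one sub-parameter matrix. The function name `dp_tp_layout` and the particular numbering of units (TP index varying fastest) are assumptions; the embodiment only requires that four units hold W1, four hold W2, and each sub-input data is processed by one unit of each kind.

```python
def dp_tp_layout(n_dp, n_tp):
    """Map a unit index to (sub-input index, sub-parameter-matrix index).

    Assumed numbering: units sharing a sub-input are adjacent.
    """
    return {unit: (unit // n_tp, unit % n_tp) for unit in range(n_dp * n_tp)}


layout = dp_tp_layout(n_dp=4, n_tp=2)

# Units holding sub-parameter matrix W1 (index 0) and W2 (index 1).
units_with_w1 = [u for u, (_, tp) in layout.items() if tp == 0]
units_with_w2 = [u for u, (_, tp) in layout.items() if tp == 1]

# The two units that together process the first sub-input data (e.g., 5041).
units_for_first_sub_input = [u for u, (dp, _) in layout.items() if dp == 0]
```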
[0135] Similarly, the model training device 50 can allocate the language processing module to the 16 computing units of computing nodes 502 and 503 according to the parallel strategy TP8PP2. Based on the explanation of the parallel strategy in Figure 4, it can be understood that the language processing module is divided into two processing layers (i.e., processing layer stage01 and processing layer stage02), and the parameter matrix of each processing layer is split into 8 sub-parameter matrices. For example, the parameter matrix of processing layer stage01 is split into sub-parameter matrices W011, W012, ..., W018. Computing nodes 502 and 503 are responsible for executing the calculations of processing layers stage01 and stage02, respectively, and the 8 computing units in each computing node are assigned the 8 sub-parameter matrices corresponding to the processing layer. The processing result 506 of the input data 505 of the language processing module after processing by computing node 502 will be further processed by all computing units in computing node 503.
[0136] Since the language processing module and image processing module in the CLIP model can work independently and in parallel, the computing node 501 in the model training device 50 can also work independently and in parallel with the two computing nodes 502 and 503.
[0137] As can be seen, in the embodiment shown in Figure 5, different processing modules of the CLIP model can use different computing resources to implement different parallel strategies for different processing modules. This is beneficial for calculating the MFU during model training based on the training results of the CLIP model, and for optimizing the parallel strategies of different processing modules based on the MFU, thereby improving the MFU.
[0138] Furthermore, if the training method for the CLIP model in the embodiment shown in Figure 5 were applied to the LLAVA model, the following problem would arise. Since the language processing module and the image processing module in the LLAVA model have a dependency relationship, when the two processing modules use different computing resources, the computing resources used by the language processing module must wait until the computing resources used by the image processing module output their calculation results before they can start computing, resulting in low utilization of computing resources. Therefore, this application embodiment provides a training method based on a parallel strategy, which can not only improve the MFU during model training but also improve the utilization of computing resources.
[0139] For clarity, please refer to Figure 6. Figure 6 is a flowchart illustrating a training method based on a parallel strategy provided in an embodiment of this application. This method can be implemented based on the architecture shown in Figure 3A, or on other architectures. The method includes, but is not limited to, the following steps:
[0140] Step S601: The model training device determines the first parallel strategy for the image processing module in the multimodal large model.
[0141] The model training device contains computing resources that can be used for model training. The computing resources in this application can be understood as accelerators or accelerator clusters used for parallel computing. Optionally, the computing resources can be physical (hardware) computing resources / accelerators, such as graphics processing units (GPUs), data processing units (DPUs), digital signal processing units (DSPs), neural network processing units (NPUs), application-specific integrated circuits (ASICs), or field-programmable gate arrays (FPGAs). Optionally, the computing resources can also be virtual (logical) computing resources / accelerators obtained based on virtualization technology, such as virtual CPUs (also called vCPUs), virtual GPUs (also called vGPUs), virtual DSPs (also called vDSPs), virtual NPUs (also called vNPUs), virtual ASICs (also called vASICs), or virtual FPGAs (also called vFPGAs). Virtualization technology is a technology that virtualizes physical (hardware) computing resources into multiple virtual (logical) computing resources, which can improve the utilization rate of physical computing resources.
[0142] In one optional implementation, the model training device includes one or more computing nodes, each computing node including multiple computing units, and the number of multiple computing units in each computing node is the same. It is understood that the multiple computing units are computing resources within the model training device. The model training device can be the model training device 301 shown in Figures 3A and 3B, or it can be other computing devices. For an explanation of the computing units, please refer to the description of the corresponding part in the embodiment described in Figure 3A, which will not be repeated here.
[0143] Multimodal large models are used to perform multimodal tasks. For an explanation of multimodal large models, please refer to the descriptions of the corresponding parts in the embodiments shown in Figures 1 and 2, which will not be repeated here. It is understood that the embodiment shown in Figure 6 is more suitable for multimodal large models where the image processing module and the language processing module have dependencies, such as the LLAVA model.
[0144] The first parallel strategy includes one or more of the DP, TP, PP, and CP strategies, used to train the image processing module in the multimodal large model. For explanations of the DP, TP, PP, and CP strategies, please refer to the descriptions of the corresponding parts in the embodiment shown in Figure 4; they will not be repeated here.
[0145] Step S602: The model training device determines the second parallel strategy for the language processing module in the multimodal large model.
[0146] The second parallel strategy includes one or more of the DP, TP, PP, and CP strategies, used to train the language processing module in the multimodal large model. For explanations of the DP, TP, PP, and CP strategies, please refer to the descriptions of the corresponding parts in the embodiment shown in Figure 4; they will not be repeated here.
[0147] The embodiments of this application do not strictly limit the order of steps S601 and S602. That is, step S601 can be performed before step S602, step S601 can be performed after step S602, or step S601 can be performed simultaneously with step S602.
[0148] The image processing module in the model training device uses the same computing resources as the language processing module, but the first parallel strategy and the second parallel strategy executed by these computing resources are different.
[0149] The parallel strategies (such as the first parallel strategy and the second parallel strategy) in the embodiments of this application can be represented by the type of parallel strategy plus the dimension of the parallel strategy. It is understood that if the parallel strategy includes multiple strategies from DP, TP, PP, and CP, then the parallel strategy can be considered a hybrid parallel strategy, and the parallel dimension of the hybrid parallel strategy is equal to the product of the parallel dimensions of the individual strategies. For example, if the parallel strategy is DP4TP2PP2, it indicates that the parallel strategy includes a DP strategy with a parallel dimension N_DP of 4, a TP strategy with a parallel dimension N_TP of 2, and a PP strategy with a parallel dimension N_PP of 2, and the parallel dimension of this parallel strategy is the product of 4, 2, and 2, that is, 16.
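As a non-limiting illustration of this naming convention, the following minimal sketch parses a strategy name into its per-type parallel dimensions and computes the parallel dimension of the hybrid strategy as their product. The function names `parse_strategy` and `parallel_dimension` are assumptions introduced for illustration.

```python
import re
from math import prod


def parse_strategy(name):
    """Parse a name such as 'DP4TP2PP2' into {'DP': 4, 'TP': 2, 'PP': 2}."""
    parts = re.findall(r"(DP|TP|PP|CP)(\d+)", name)
    # Reject names containing anything other than type-plus-dimension pairs.
    if not parts or "".join(kind + size for kind, size in parts) != name:
        raise ValueError(f"unrecognized strategy name: {name!r}")
    return {kind: int(size) for kind, size in parts}


def parallel_dimension(name):
    """Parallel dimension of a (possibly hybrid) strategy: product of its parts."""
    return prod(parse_strategy(name).values())


dims = parse_strategy("DP4TP2PP2")
total = parallel_dimension("DP4TP2PP2")
```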
[0150] After steps S601 and S602, the model training device can allocate the image processing module and the language processing module to computing resources based on a first parallel strategy and a second parallel strategy, and process multiple training samples of the multimodal large model, so that the computing resources can execute the first parallel strategy to train the image processing module and execute the second parallel strategy to train the language processing module. However, it should be understood that the phrase "the computing resources used by the image processing module and the computing resources used by the language processing module in the model training device are the same" means that the image processing module and the language processing module are allocated to the same computing resources.
[0151] In one alternative implementation, the first parallel strategy and the second parallel strategy may be determined based on the number of computing units in the model training device.
[0152] It is understandable that when the parallel dimension of the first parallel strategy and the parallel dimension of the second parallel strategy are equal and both equal to the number of computing units in the model training device, the computing resources for executing the first parallel strategy and the second parallel strategy can be the same. That is, the computing resources used by the image processing module and the computing resources used by the language processing module are all the computing units of the model training device.
[0153] Therefore, from among multiple parallel strategies that satisfy the condition that the parallel dimension equals the number of computing units in the model training device, two different parallel strategies can be arbitrarily selected as the first parallel strategy for the image processing module and the second parallel strategy for the language processing module. For example, when the number of computing units in the model training device is 32, parallel strategies that satisfy the condition that the parallel dimension equals 32 can be DP32, DP16PP2, TP8PP4, DP4TP2PP4, DP2TP4PP2CP2, etc. In other words, there are many ways to combine the parallel strategies of the image processing module and the language processing module; all combinations can be obtained through exhaustive enumeration.
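As a non-limiting illustration of the exhaustive enumeration mentioned above, the following minimal sketch lists every strategy name whose parallel dimensions multiply to the number of computing units, and then forms all ordered pairs of two different strategies. The function name `enumerate_strategies` is an assumption, and the sketch ignores any model-specific constraints (for example, whether the number of processing layers is divisible by N_PP), which an actual implementation would likely need to check.

```python
from itertools import product


def enumerate_strategies(num_units, kinds=("DP", "TP", "PP", "CP")):
    """List every strategy name whose parallel dimensions multiply to num_units."""
    divisors = [d for d in range(1, num_units + 1) if num_units % d == 0]
    names = []
    for sizes in product(divisors, repeat=len(kinds)):
        total = 1
        for size in sizes:
            total *= size
        if total == num_units:
            # A dimension of 1 means that strategy type is unused, so omit it.
            names.append("".join(f"{k}{s}" for k, s in zip(kinds, sizes) if s > 1))
    return names


candidates = enumerate_strategies(32)

# Ordered (image strategy, language strategy) pairs with different strategies.
pairs = [(first, second) for first in candidates for second in candidates
         if first != second]
```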
[0154] For example, in addition to the first and second parallel strategies, the model training device can also determine a third parallel strategy for the image processing module and a fourth parallel strategy for the language processing module in the multimodal large model. The third parallel strategy includes one or more of the DP, TP, PP, and CP strategies, and is used to train the image processing module. The fourth parallel strategy includes one or more of the DP, TP, PP, and CP strategies, and is used to train the language processing module. It is understood that the third and fourth parallel strategies executed by the computational resources in the model training device are not the same.
[0155] Furthermore, the model training device can also perform model training for the parallel strategies in each combination, thereby obtaining the MFU corresponding to each combination, and take the parallel strategies in the combination with the largest MFU as the optimal parallel strategies.
[0156] For example, the model training device determines a first MFU for training the multimodal large model based on the first and second parallel strategies, and a second MFU for training the multimodal large model based on the third and fourth parallel strategies. If the first MFU is greater than the second MFU, the first parallel strategy is determined as the preferred parallel strategy for the image processing module, and the second parallel strategy is determined as the preferred parallel strategy for the language processing module. Conversely, if the first MFU is less than the second MFU, the third parallel strategy is determined as the preferred parallel strategy for the image processing module, and the fourth parallel strategy is determined as the preferred parallel strategy for the language processing module.
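As a non-limiting illustration of this comparison, the following minimal sketch selects the preferred pair of parallel strategies from a table of MFU values. The function name `select_preferred` is an assumption, and the MFU numbers are placeholders chosen for illustration, not measurements from any embodiment.

```python
def select_preferred(mfu_by_combination):
    """Return the (image strategy, language strategy) pair with the highest MFU."""
    return max(mfu_by_combination, key=mfu_by_combination.get)


# Placeholder MFU values keyed by (image strategy, language strategy).
measured = {
    ("DP32", "TP8PP4"): 0.41,        # first and second parallel strategies
    ("DP16PP2", "DP4TP2PP4"): 0.35,  # third and fourth parallel strategies
}

image_strategy, language_strategy = select_preferred(measured)
```

The same function extends to more than two combinations, which corresponds to comparing all combinations obtained by enumeration.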
[0157] In one alternative implementation, the first parallel strategy and the second parallel strategy can be directly determined by the model training device itself. It is understood that the model training device can determine the first parallel strategy and the second parallel strategy based on its own computing resources.
[0158] In one optional implementation, the first parallel strategy and the second parallel strategy may be sent to the model training device by another device, thereby allowing the model training device to indirectly determine the first parallel strategy and the second parallel strategy. It is understood that this other device can manage the model training device, obtain information about the model training device's computing resources, determine the first parallel strategy and the second parallel strategy based on the model training device's computing resources, and then send the first parallel strategy and the second parallel strategy to the model training device.
[0159] In one alternative implementation, the model training device can send the multimodal large model trained based on the first parallel strategy and the second parallel strategy to other devices (such as the model using device).
[0160] Therefore, in the embodiments of this application, when training a multimodal large model, the image processing module and the language processing module share computing resources, and different parallel strategies can be executed for the image processing module and the language processing module respectively. This avoids the differences between different processing modules affecting the MFU during model training, thereby improving the MFU. Furthermore, the embodiments of this application can calculate the MFU during model training based on the training results of the multimodal large model, and select the optimal parallel strategy corresponding to each processing module based on the MFU, thereby further improving the MFU. In addition, when applied to the training scenario of a multimodal large model where the image processing module and the language processing module have a dependency relationship, the processing modules in the multimodal large model are separated and decoupled at the computational logic level, while sharing computing resources at the physical level, which is beneficial to improving the utilization rate of computing resources.
[0161] It is understood that the training method provided in this application is not limited to a multimodal large model that includes an image processing module and a language processing module, but can be widely applied to a multimodal large model that includes multiple processing modules that are used to process different modal information and have dependencies.
[0162] The following example, using "a multimodal large model is an LLAVA model, an image processing module is a VIT, a language processing module is an LLM, the first parallel strategy is DP32, the second parallel strategy is TP8PP4, and the model training device includes four computing nodes, each of which includes eight GPUs," will be used to further explain the working principle of the model training device in the embodiment shown in Figure 6.
[0163] For clarity, please refer to Figure 7, which is a schematic diagram illustrating the principle of another parallel-strategy-based training method provided in this application embodiment. As shown in Figure 7, the model training device 70 includes four computing nodes (i.e., computing node 701, computing node 702, computing node 703, and computing node 704), and each computing node contains 8 GPUs. For example, computing node 701 includes GPU 7011, GPU 7012, ..., GPU 7018.
[0164] After determining the first parallel strategy and the second parallel strategy, the model training device 70 allocates the image processing module and the language processing module to computing resources.
[0165] Specifically, because the parallel dimension N_DP of the DP strategy in the first parallel strategy DP32 is 32, the model training device 70 copies the VIT from the LLAVA model 32 times and distributes these 32 copies of VIT to the 32 GPUs in the model training device 70. This means that each GPU can perform all the computations in the VIT. Figure 7 also illustrates the VIT allocation. Because the parallel dimension N_PP of the PP strategy in the second parallel strategy TP8PP4 is 4, the model training device 70 divides the LLM in the LLAVA model into four processing layers (i.e., processing layer stage0, processing layer stage1, processing layer stage2, and processing layer stage3) according to the LLM computation order. Because the parallel dimension N_TP of the TP strategy in the second parallel strategy TP8PP4 is 8, the model training device 70 divides the parameter matrix of each processing layer into 8 sub-parameter matrices. For example, the parameter matrix of the first processing layer, stage0, is divided into sub-parameter matrices W01, W02, ..., W08. Figure 7 also illustrates the allocation of processing layers and parameter matrices in LLM. For an explanation of the LLAVA model, please refer to the description of the corresponding part of the embodiment shown in Figure 2, which will not be repeated here.
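As a non-limiting illustration of how the same 32 GPUs can hold both allocations, the following minimal sketch records, for each GPU, its VIT replica under DP32 and its LLM processing layer and sub-parameter matrix under TP8PP4. The function name `allocate`, the dictionary layout, and the assumption that each computing node hosts exactly one processing layer (stage) are introduced for illustration; the sub-parameter matrix names for stage0 follow the W01, ..., W08 pattern given above.

```python
NODES, GPUS_PER_NODE = 4, 8


def allocate(nodes=NODES, gpus_per_node=GPUS_PER_NODE):
    """Describe what each GPU holds under DP32 (VIT) and TP8PP4 (LLM)."""
    plan = {}
    for node in range(nodes):
        for gpu in range(gpus_per_node):
            rank = node * gpus_per_node + gpu
            plan[rank] = {
                "vit": f"VIT replica {rank}",      # DP32: one full copy per GPU
                "llm_stage": f"stage{node}",       # PP4: assumed one stage per node
                "llm_shard": f"W{node}{gpu + 1}",  # TP8: one sub-parameter matrix
            }
    return plan


plan = allocate()
```

Because every GPU appears in both allocations, the two modules share the physical resources while following different parallel strategies.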
[0166] After the image processing module and the language processing module are allocated to computing resources, the model training device 70 will perform the following steps to train the LLAVA model:
[0167] Step 1: The model training device 70 generates multiple feature data based on the first parallel strategy DP32.
[0168] Here, the multiple feature data are obtained by performing VIT forward computation on the image data in multiple training samples. Each training sample includes image data and text data, where the text data is an instruction to output descriptive text for the image data. Specifically, each training sample contains multiple image-text pairs, each pair including a sample image and its corresponding sample text. The sample text includes an instruction to output descriptive text for the sample image. For example, the sample text could be "What does this image describe?" Another example could be "What is the person in this image doing?" It is understood that the sample images in the multiple image-text pairs belong to the image data, and the sample texts in the multiple image-text pairs belong to the text data.
[0169] For example, as shown in Figure 7, when training the LLAVA model, the input data of the LLAVA model consists of 32 training samples (i.e., training sample 710, training sample 711, ..., training sample 741). Each training sample contains multiple image-text pairs. For example, training sample 710 includes two image-text pairs (i.e., image-text pair 750 and image-text pair 751). Image-text pair 750 includes sample image 7501 and sample text 7502, and image-text pair 751 includes sample image 7511 and sample text 7512.
[0170] Since VIT can only process images with a preset fixed resolution, and considering that the resolutions of sample images in multiple training samples vary, there may be sample images that do not meet the preset fixed resolution. Therefore, the model training device 70 can preprocess the sample images in all training samples to obtain multiple preprocessed images with a preset fixed resolution.
[0171] For example, as shown in Figure 7, if the preset fixed resolution is 448*448, the resolution of sample image 7501 is 896*896, and the resolution of sample image 7511 is 448*448, then sample image 7501 will be segmented into two preprocessed images with a resolution of 448*448 (i.e., preprocessed image 7601 and preprocessed image 7602), while sample image 7511 will not be segmented and can be considered as preprocessed image 7603. It is understood that the resolution is related to the number of image tokens in the image. Optionally, an image with a resolution of 448*448 can contain 256 image tokens, and an image with a resolution of 896*896 can contain 448 image tokens.
[0172] Next, because the parallel dimension N_DP of the DP strategy in the first parallel strategy DP32 is 32, the model training device 70 divides the multiple preprocessed images into 32 preprocessed image sets of equal size and distributes one set to each computing unit as input. It is understood that VIT forward computation is performed on each preprocessed image to obtain the corresponding sub-feature data.
[0173] For example, as shown in Figure 7, all sample images in the 32 training samples are processed into 96 preprocessed images (i.e., preprocessed image 7601, preprocessed image 7602, ..., preprocessed image 7696). These 96 preprocessed images are divided evenly into 32 parts, and each GPU in the model training device 70 uses VIT to process 3 preprocessed images, yielding 96 sub-feature data in total, one for each preprocessed image. For example, GPU 7011 can use VIT to process preprocessed images 7601, 7602, and 7603 to generate corresponding sub-feature data 7701, 7702, and 7703, respectively.
[0174] Then, the model training device 70 collects the sub-feature data generated by all computing nodes and combines (for example, stitches together) the sub-feature data of the sample images that originally belonged to the same training sample to generate the feature data corresponding to that training sample. That is, the 32 training samples correspond to 32 feature data (i.e., feature data 7800, feature data 7801, ..., feature data 7831). For example, training sample 710 corresponds to feature data 7800, and feature data 7800 is the data obtained by processing sub-feature data 7701, sub-feature data 7702, and sub-feature data 7703.
[0175] Optionally, the computing nodes in the model training device 70 other than the first computing node 701 send all the sub-feature data they generate to the first computing node 701. The first computing node 701 then processes all the sub-feature data and outputs the 32 feature data corresponding to the 32 training samples.
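As a rough illustration of the even distribution and the subsequent stitching, the following sketch splits 96 preprocessed images across 32 GPUs and then regroups the results by training sample. The helper names (split_evenly, stitch_by_sample, fake_vit) are hypothetical, fake_vit is only a placeholder for the VIT forward computation, and the assumption that every training sample yields exactly 3 preprocessed images is made purely to keep the example small.

```python
def split_evenly(items, num_parts):
    """Split items into num_parts contiguous chunks of equal size."""
    if len(items) % num_parts != 0:
        raise ValueError("items cannot be divided evenly")
    size = len(items) // num_parts
    return [items[i * size:(i + 1) * size] for i in range(num_parts)]


def fake_vit(image_name):
    """Placeholder for the VIT forward pass; returns two dummy tokens per image."""
    return [f"{image_name}:tok{t}" for t in range(2)]


def stitch_by_sample(sub_features):
    """Concatenate sub-feature data that belong to the same training sample."""
    by_sample = {}
    for sample_id, feature in sub_features:
        by_sample.setdefault(sample_id, []).extend(feature)
    return by_sample


# 32 training samples, each assumed to yield 3 preprocessed images (96 in total).
images = [(sid, f"img{sid}_{k}") for sid in range(32) for k in range(3)]
per_gpu = split_evenly(images, 32)  # DP32: one chunk per GPU

# Each GPU runs the (placeholder) VIT on its own chunk.
sub_features = [(sid, fake_vit(name)) for chunk in per_gpu for sid, name in chunk]

# One feature data entry per training sample.
features = stitch_by_sample(sub_features)
```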
[0176] It is understandable that, since sub-feature data is data obtained by forward computation of the preprocessed image using VIT, feature data can be regarded as data obtained by forward computation of multiple sample images in the training samples using VIT. It should be noted that both sub-feature data and feature data are data containing the correlation information between multiple tokens in the image. This data can be a matrix or represented in other forms (such as vectors), and this application does not strictly limit it.
[0177] Step 2: The model training device 70 performs fusion processing on the sample text and feature data corresponding to the sample images of each training sample in multiple training samples to generate multiple fused data.
[0178] After generating feature data corresponding to the image data (i.e., multiple sample images) in each training sample, the model training device 70 fuses the feature data with the text data (i.e., multiple sample texts) in the training sample to generate fused data corresponding to that training sample. It can be understood that each training sample corresponds to one set of fused data.
[0179] For example, as shown in Figure 7, the 32 training samples correspond to 32 fused data (i.e., fused data 7900, fused data 7901, ..., fused data 7931). Specifically, as mentioned above, training sample 710 corresponds to feature data 7800, and the sample texts in training sample 710 are sample text 7502 and sample text 7512. The model training device 70 fuses feature data 7800 with sample text 7502 and sample text 7512 to generate fused data 7900. Optionally, sample text 7502 and sample text 7512 are encoded before being fused with feature data 7800. Optionally, during the fusion process, feature data 7800, sample text 7502, and sample text 7512 are dimensionally aligned (e.g., by linear projection) and then concatenated to facilitate subsequent LLM processing. It should be noted that the specific fusion principle of the "fusion process" is not strictly limited in this embodiment.
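One possible reading of "dimensionally aligned (e.g., linearly projected) and then concatenated" is sketched below in plain Python. The embodiment does not fix the fusion principle, so everything here is an assumption for illustration: the projection matrix weight, the tiny vector widths, the choice to project only the image features, and the ordering (image tokens first, text tokens after).

```python
def project(rows, weight):
    """Multiply each row vector by weight, a matrix of shape in_dim x out_dim."""
    return [
        [sum(x * w for x, w in zip(row, col)) for col in zip(*weight)]
        for row in rows
    ]


def fuse(image_features, text_embeddings, weight):
    """Project image features to the text width, then concatenate along the sequence."""
    projected = project(image_features, weight)
    return projected + text_embeddings


image_features = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]      # 2 image tokens, width 3
text_embeddings = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]  # 3 encoded text tokens, width 2
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]            # hypothetical 3 x 2 projection

fused = fuse(image_features, text_embeddings, weight)
```

After projection every token has the same width, so the fused sequence (here 5 tokens of width 2) can be passed to the LLM as a single input.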
[0180] Step 3: The model training device 70 generates multiple prediction data based on the second parallel strategy TP8PP4.
[0181] After the model training device 70 generates multiple fused data sets, it can process these sets based on the second parallel strategy TP8PP4 to generate multiple predicted data sets. These predicted data sets are obtained by performing LLM forward computation on each of the fused data sets. In essence, each fused data set corresponds to one predicted data set.
[0182] For example, as shown in Figure 7, the model training device 70 can divide the 32 fused data into 8 data groups (i.e., data group group0, data group group1, ..., data group group7), with each data group including 4 fused data. For example, data group group0 includes fused data 7900, fused data 7901, fused data 7902, and fused data 7903. It should be noted that the number of data groups can be set according to the actual application, and this embodiment does not impose a strict limitation on this.
[0183] Furthermore, the eight fused data groups are sequentially processed by the LLM processing layers on the four computing nodes in the model training device 70. For example, when the intermediate data obtained after data group 0 is processed by computing node 701 is processed by computing node 702, data group 1 begins to be processed by computing node 701. Moreover, when data group 0 is processed by computing node 701, it is processed in parallel by eight GPUs on computing node 701 that contain the parameter matrix of the LLM processing layer stage0. Through communication between the eight GPUs on computing node 701 (such as all-gather communication), the eight GPUs process the computation results of data group 0 to generate intermediate data. This intermediate data can be understood as the computation result of data group 0 after processing by the LLM processing layer stage0. Similarly, all eight data groups are sequentially processed by the four processing layers in the LLM, ultimately resulting in eight prediction groups (i.e., prediction group 01, prediction group 11, ..., prediction group 71), each prediction group containing the prediction data corresponding to each fused data point in the data group. Understandably, each prediction data includes descriptive text of the sample image predicted by the LLAVA model based on the sample text and sample image in a training sample.
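The grouping and the staggered forward processing described above can be sketched as follows. This is a forward-only view that ignores backward passes and assumes every stage takes exactly one time slot per data group; the function names make_groups and forward_slot are hypothetical.

```python
def make_groups(fused_data, group_size):
    """Split the fused data into consecutive data groups of group_size items."""
    return [fused_data[i:i + group_size] for i in range(0, len(fused_data), group_size)]


def forward_slot(group_index, stage_index):
    """Time slot in which a stage runs the forward pass of a group.

    Assumes one slot per (group, stage) and no backward passes in between.
    """
    return group_index + stage_index


fused = [f"fused_{7900 + i}" for i in range(32)]
groups = make_groups(fused, 4)  # 8 data groups with 4 fused data each

# (group, stage) -> time slot, for the 4 LLM processing layers stage0..stage3
timeline = {(g, s): forward_slot(g, s) for g in range(len(groups)) for s in range(4)}
```

In this view, data group 0 on stage1 (computing node 702) and data group 1 on stage0 (computing node 701) share the same time slot, which matches the overlap described in the paragraph above.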
[0184] It is understandable that steps 1 to 3 are the forward computation (also known as forward calculation) steps of the multimodal large model, that is, the multimodal large model calculates based on the input training samples and the current model parameters to generate prediction data.
[0185] Step 4: The model training device 70 optimizes the parameters in the multimodal large model based on multiple prediction data.
[0186] The model training device 70 can optimize the parameters of a multimodal large model using backpropagation based on multiple prediction data. Specifically, each prediction data point is sequentially backpropagated through the LLM and VIT processes in the LLAVA model. The gradient of the parameters (i.e., the derivative of the loss with respect to the parameters) is obtained based on the loss between the backpropagation result of each computation stage and the activation value of that stage (which can be understood as the forward computation result). These gradients indicate how to adjust the model parameters to reduce the loss. After all gradients are calculated, the LLAVA model uses these gradients to update the parameters to improve the model's end-to-end accuracy. Parameter updates are typically implemented using optimization algorithms such as gradient descent, and this process iterates continuously during training until the model converges to a satisfactory state.
[0187] Specifically, the model training device 70 generates multiple first inverse data based on the multiple prediction data, and optimizes the parameters in the LLM based on the multiple first inverse data. The multiple first inverse data are obtained by backward computation of the multiple prediction data through the LLM. Next, the model training device 70 generates multiple second inverse data based on the multiple first inverse data, and optimizes the parameters in the VIT based on the multiple second inverse data. The multiple second inverse data are obtained by backward computation of the multiple first inverse data through the VIT.
[0188] In an optional implementation, since the pipeline scheduling of the PP strategy in the second parallel strategy is one forward computation followed by one backward computation (1F1B), the model training device 70 can process the eight data groups in step 3 using 1F1B pipeline scheduling. The 1F1B pipeline scheduling allows forward and backward computations between PP groups to be performed alternately; that is, in this embodiment, the forward and backward computations of the LLM can be performed alternately. The 1F1B pipeline scheduling can improve the utilization of computing resources during model training, which helps improve the model computing power utilization rate (MFU).
[0189] For example, please refer to Figure 8, which is a time-based schematic diagram of a pipeline scheduling based on an embodiment of this application. In the case of 1F1B pipeline scheduling, the training steps of the LLAVA model shown in Figure 7 can be composed of several stages as shown in Figure 8 from a temporal perspective.
[0190] First stage: The four computing nodes in the model training device 70 perform forward computation of VIT in parallel to generate multiple feature data.
[0191] Second stage: The model training device 70 reshards the forward computation results of VIT on the four computing nodes to generate multiple fused data.
[0192] Third stage: The model training device 70 uses 1F1B pipeline scheduling to process the 8 data groups, generating multiple prediction data and multiple first inverse data.
[0193] Specifically, in Figure 8, the white grids with numbers correspond to the time periods during which the LLM processing layer in a compute node performs forward computation on a data group. The numbers in the grids distinguish different data groups; that is, numbers 0, 1, ..., 7 represent data group 0, data group 1, ..., data group 7, respectively. For example, the time period corresponding to grid 801 is the time period during which processing layer stage0 in compute node 701 performs forward computation on data group 0. Similarly, the gray grids with numbers in Figure 8 correspond to the time periods during which the LLM processing layer in a compute node performs reverse computation on a data group. The numbers in the grids distinguish different data groups; that is, numbers 0, 1, ..., 7 represent data group 0, data group 1, ..., data group 7, respectively. For example, the time period corresponding to grid 802 is the time period during which processing layer stage3 in compute node 704 performs reverse computation on data group 0.
[0194] Optionally, the multiple training samples acquired by the model training device 70 can be training samples with the same data length. As described in step 1, each training sample includes multiple image-text pairs. Therefore, when packing multiple image-text pairs into a single training sample during the sampling phase, the data length of that training sample can be made to meet a preset length. That is, the sum of the text tokens and image tokens in each training sample is the same, equal to the preset number of tokens. As can be seen from steps 1 and 2, the data length of the data group is related to the data length of the training sample; when the data length of each training sample is the same, the data length of each data group is the same.
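A minimal sketch of packing image-text pairs into training samples of one preset length is given below. The greedy strategy, the function name pack_pairs, the token counts, and in particular the use of padding to reach the preset length are assumptions made for illustration; the embodiment only states that the total number of tokens per training sample equals the preset number.

```python
def pack_pairs(pairs, preset_len):
    """Greedily pack (image_tokens, text_tokens) pairs into samples of preset_len tokens.

    Each sample records its pairs and the padding assumed to fill the remainder.
    """
    samples, current, used = [], [], 0
    for pair in pairs:
        need = pair[0] + pair[1]
        if need > preset_len:
            raise ValueError("a single pair exceeds the preset length")
        if used + need > preset_len:
            # The current sample is full: close it and start a new one.
            samples.append({"pairs": current, "padding": preset_len - used})
            current, used = [], 0
        current.append(pair)
        used += need
    if current:
        samples.append({"pairs": current, "padding": preset_len - used})
    return samples


# Hypothetical (image_tokens, text_tokens) counts for five image-text pairs.
pairs = [(512, 40), (256, 30), (256, 100), (512, 64), (256, 20)]
samples = pack_pairs(pairs, preset_len=1024)
```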
[0195] It is understandable that when the data length of each data group is the same, the time each computing node needs to perform forward / backward computation on each data group is also the same. This reduces the waiting time between adjacent computing nodes under the 1F1B pipeline scheduling, i.e., it reduces PP bubbles (idle time in the pipeline). For example, as shown in grids 803 and 804 in Figure 8, while computing node 702 is performing forward computation on data group 0, computing node 701 is performing forward computation on data group 1. Since the computation time is the same, when computing node 702 completes the forward computation on data group 0, computing node 701 has also completed the forward computation on data group 1. At this point, computing node 702 can immediately process the result output by computing node 701 (i.e., immediately perform forward computation on data group 1) without waiting. If the computation times differ, and computing node 702 completes the forward computation on data group 0 before computing node 701 has completed the forward computation on data group 1, then computing node 702 must wait before it can process the result output by computing node 701. Furthermore, when each data group has the same data length and the second parallel strategy contains a DP strategy, the computational load between DP groups is also balanced.
[0196] Fourth stage: The model training device 70 performs data resharding on the multiple first inverse data.
[0197] Fifth stage: The four computing nodes in the model training device 70 perform VIT backward computation in parallel to generate multiple second inverse data.
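The effect of equal versus unequal computation times on waiting can be illustrated with a small simulation. The sketch models forward passes only and assumes that a data group takes the same time on every stage; the function name forward_idle_time and the example durations are hypothetical.

```python
def forward_idle_time(durations, num_stages):
    """Total time the stages spend waiting between consecutive data groups.

    durations[g] is the time any stage needs for data group g (forward only).
    """
    num_groups = len(durations)
    end = [[0.0] * num_groups for _ in range(num_stages)]
    idle = 0.0
    for s in range(num_stages):
        for g in range(num_groups):
            input_ready = end[s - 1][g] if s > 0 else 0.0  # output of the previous stage
            stage_free = end[s][g - 1] if g > 0 else 0.0   # this stage finished group g-1
            start = max(input_ready, stage_free)
            if g > 0:
                idle += start - stage_free                 # gap while waiting for input
            end[s][g] = start + durations[g]
    return idle


equal_idle = forward_idle_time([1.0] * 8, num_stages=4)
unequal_idle = forward_idle_time([1.0, 3.0, 1.0, 3.0, 1.0, 3.0, 1.0, 3.0], num_stages=4)
```

With equal durations no stage ever waits between two consecutive data groups, whereas alternating short and long data groups leaves downstream stages idle.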
[0198] Combining stages one through five, it can be seen that when training the LLAVA model, the model training device 70 first performs forward computation of VIT, then forward and backward computation of LLM, and finally backward computation of VIT. That is, backward computation of VIT can only be performed after backward computation of LLM is completed. Therefore, all computing units in the model training device 70 store the activation values required for backward computation of VIT (i.e., the forward computation results of VIT in the first stage). Combining with step 1, it can be seen that the model training device 70 can evenly distribute multiple preprocessed images to each computing node based on the first parallel strategy DP32. It is understandable that when the number of preprocessed images processed by each computing node is the same, the number of activation values from the first stage that each computing node needs to store is the same.
[0199] Furthermore, in the third stage, due to the 1F1B pipeline scheduling, compute node 704 immediately performs backward computation on a data group after performing forward computation to optimize model parameters. As described in step 4, activation values are needed when optimizing model parameters, so compute node 704 only needs to store the activation values of one data group (i.e., the forward computation result of the LLM for that data group in the third stage). Meanwhile, the processing layer stage2 in compute node 703 performs backward computation on the previous data group after performing forward computation on a data group to optimize model parameters. For example, as shown in grids 805 and 806 in Figure 8, compute node 703 performs backward computation on data group group0 after performing forward computation on data group group1. That is to say, compute node 703 needs to store the activation values of two data groups in the third stage. Similarly, compute node 702 needs to store the activation values of three data groups in the third stage, and compute node 701 needs to store the activation values of four data groups in the third stage.
[0200] It is understandable that when the second parallel strategy includes the PP strategy and adopts a 1F1B pipeline scheduling, the storage pressure of activation values among the PP groups (i.e., among the four computing nodes where the four processing layers are located) is uneven (usually the computing node where the first processing layer is located in LLM has the greatest storage pressure), that is, the number of activation values of the third stage that the four computing nodes in the model training device need to store decreases sequentially.
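The uneven activation storage under 1F1B can be reproduced with a short sketch. It assumes the commonly used non-interleaved 1F1B schedule (a warm-up of forward passes, a steady phase alternating one forward and one backward pass, then a cool-down of backward passes); the function names are hypothetical and stages are indexed from 0.

```python
def one_f_one_b_ops(stage, num_stages, num_groups):
    """Sequence of 'F'/'B' operations for one stage under non-interleaved 1F1B."""
    warmup = min(num_stages - stage - 1, num_groups)
    ops = ["F"] * warmup                 # warm-up: forward passes only
    for _ in range(num_groups - warmup):
        ops += ["F", "B"]                # steady state: one forward, one backward
    ops += ["B"] * warmup                # cool-down: remaining backward passes
    return ops


def peak_stored_groups(ops):
    """Largest number of data groups whose activations are held at the same time."""
    stored = peak = 0
    for op in ops:
        stored += 1 if op == "F" else -1  # a backward pass frees one group's activations
        peak = max(peak, stored)
    return peak


# 4 processing layers (stage0..stage3) and 8 data groups, as in Figure 8.
peaks = [peak_stored_groups(one_f_one_b_ops(s, 4, 8)) for s in range(4)]
```

Under these assumptions the peaks are 4, 3, 2, and 1 for stage0 through stage3, in line with the counts given above for computing nodes 701 through 704.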
[0201] Based on the above considerations, optionally, in step 1, when the model training device 70 assigns the multiple preprocessed images to the four computing nodes based on the parallel dimension N_DP of the DP strategy in the first parallel strategy DP32, the number of preprocessed images assigned to computing nodes 701, 702, 703, and 704 increases sequentially (i.e., asymmetric assignment). As a result, the number of first-stage activation values that the four computing nodes need to store increases sequentially, while the number of third-stage activation values that they need to store decreases sequentially. This implementation can balance the storage pressure of the four computing nodes throughout the entire model training process.
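One way to obtain such an asymmetric assignment is a proportional split with increasing weights, sketched below. The embodiment does not specify how the per-node counts are chosen, so the weights (1, 2, 3, 4), the largest-remainder rounding, and the function name allocate_by_weight are all assumptions for illustration; the result is a per-node count and says nothing about how a node divides its share among its own GPUs.

```python
def allocate_by_weight(total_images, weights):
    """Largest-remainder split of total_images in proportion to integer weights."""
    weight_sum = sum(weights)
    counts = [total_images * w // weight_sum for w in weights]
    remainders = [total_images * w % weight_sum for w in weights]
    leftover = total_images - sum(counts)
    # Hand the remaining images to the nodes with the largest remainders.
    order = sorted(range(len(weights)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Hypothetical weights for computing nodes 701, 702, 703, 704.
counts = allocate_by_weight(96, [1, 2, 3, 4])
```

The node holding the first LLM processing layer, which stores the most third-stage activation values, thus receives the fewest preprocessed images.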
[0202] Furthermore, in this embodiment, the number of preprocessed images requiring VIT processing by the four computing nodes increases sequentially, meaning the forward computation load of the four computing nodes in the first stage increases sequentially. Optionally, the model training device 70 can also enable VIT recalculation. VIT recalculation means that the four computing nodes no longer store all activation values from the first stage, but recalculate some activation values when performing the reverse VIT computation. Specifically, the VIT recalculation granularity configured by the model training device 70 for the four computing nodes can decrease sequentially. Recalculation granularity is an indicator that measures the number of activation values recalculated; the smaller the recalculation granularity, the fewer activation values are recalculated. In other words, the model training device 70 can reduce the computation load performed by the four computing nodes in the fifth stage sequentially, thereby balancing the computation load of the four computing nodes throughout the entire model training process.
[0203] In summary, for scenarios where the first parallel strategy includes a DP strategy and the second parallel strategy includes a PP strategy, the embodiments of this application can, when training a multimodal large model (such as an LLAVA model), first perform forward computation of the image processing module (such as VIT), then use 1F1B pipeline scheduling to interleave the forward and backward computation of the language processing module (such as LLM), and finally perform backward computation of the image processing module, thereby reducing the total time for the language processing module to perform forward and backward computation, and thus improving MFU.
[0204] Furthermore, in the embodiments of this application, when multiple image-text pairs are packed into a training sample during the sampling stage, the data length of the training sample can be made to meet the preset length. This helps ensure that every training sample has the same data length, thereby reducing the waiting time between adjacent computing nodes under the 1F1B pipeline scheduling, i.e., reducing PP bubbles, and further improving MFU.
[0205] Furthermore, considering the uneven storage pressure of activation values between PP groups (i.e., between multiple computing nodes where multiple processing layers are located) when using 1F1B pipeline scheduling, this application embodiment can implement asymmetric allocation when allocating multiple preprocessed images to multiple computing nodes based on the parallel dimension of the DP strategy in the first parallel strategy, and can configure different VIT recomputation granularities for these multiple computing nodes, thereby balancing the computational load and storage pressure of multiple computing nodes throughout the entire model training process.
[0206] The following describes the computing device provided in the embodiments of this application.
[0207] Figure 9 is a schematic diagram of a computing device provided in an embodiment of this application. The computing device 90 can be the model training device in the above method embodiment or a device in the model training device. The computing device 90 can include a first determining module 901 and a second determining module 902. The detailed description of each unit is as follows:
[0208] The first determining module 901 is used to determine a first parallel strategy for the image processing module in the multimodal large model. The first parallel strategy includes one or more of the following: data parallelism (DP), tensor parallelism (TP), pipeline parallelism (PP), and context parallelism (CP). The first parallel strategy is used to train the image processing module.
[0209] The second determining module 902 is used to determine a second parallel strategy for the language processing module in the multimodal large model, wherein the second parallel strategy includes one or more of the DP strategy, TP strategy, PP strategy and CP strategy, and the second parallel strategy is used to train the language processing module.
[0210] The image processing module uses the same computing resources as the language processing module, but the first and second parallel strategies executed by the computing resources are different. The multimodal large model is used to perform multimodal tasks.
[0211] In one alternative implementation, computing resources include accelerators or accelerator clusters for parallel computing, with accelerators including virtual accelerators or physical accelerators.
[0212] In one alternative implementation, the parallel dimension of the first parallel strategy is the same as that of the second parallel strategy. The parallel dimension includes the product of one or more of the parallel dimensions of the DP strategy, the TP strategy, the PP strategy, and the CP strategy.
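As a small worked example, the parallel dimension of a strategy written in the notation used in this description (such as DP32 or TP8PP4) can be computed as the product of its per-strategy dimensions. The parsing function below is a hypothetical helper, not part of the described apparatus.

```python
import re
from math import prod


def parallel_dimension(strategy):
    """Product of the per-strategy dimensions in a string such as 'TP8PP4'."""
    parts = re.findall(r"(DP|TP|PP|CP)(\d+)", strategy)
    # Reject strings that contain anything other than name/size pairs.
    if not parts or "".join(name + size for name, size in parts) != strategy:
        raise ValueError(f"unrecognized strategy string: {strategy!r}")
    return prod(int(size) for _, size in parts)


first = parallel_dimension("DP32")     # first parallel strategy (image processing module)
second = parallel_dimension("TP8PP4")  # second parallel strategy (language processing module)
```

Both strategies evaluate to 32, so they occupy the same 32 computing units while partitioning the work differently.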
[0213] In another alternative implementation, the image processing module is a visual encoder (VIT), the language processing module is a large language model (LLM), and the computing device 90 further includes a first generation module, a fusion processing module, a second generation module, and an optimization module, wherein:
[0214] The first generation module is used to generate multiple feature data based on the first parallel strategy. Each feature data is obtained by VIT forward calculation of image data in one of the multiple training samples. Each training sample includes image data and text data. The text data is used as a text instruction to indicate the descriptive text of the obtained image data.
[0215] The fusion processing module is used to fuse the text data and feature data in each training sample separately to generate multiple fused data.
[0216] The second generation module is used to generate multiple prediction data based on the second parallel strategy, wherein the multiple prediction data are obtained by forward calculation of multiple fused data through LLM respectively;
[0217] The optimization module is used to optimize the parameters in a multimodal large model based on multiple prediction data.
[0218] In another alternative implementation, each training sample in the multiple training samples includes multiple image-text pairs. Each image-text pair includes a sample image and a sample text corresponding to the sample image. The sample images in the multiple image-text pairs belong to image data, and the sample text in the multiple image-text pairs belong to text data.
[0219] In another alternative implementation, the sum of the text token and the image token for each training sample is the same.
[0220] In another alternative implementation, regarding optimizing the parameters in the multimodal large model based on the multiple prediction data, the optimization module is specifically used to:
[0221] Generate multiple first inverse data, wherein the multiple first inverse data are obtained by backward computation of the multiple prediction data through the LLM;
[0222] Optimize the parameters in the LLM based on the multiple first inverse data;
[0223] Generate multiple second inverse data, wherein the multiple second inverse data are obtained by backward computation of the multiple first inverse data through the VIT;
[0224] Optimize the parameters in the VIT based on the multiple second inverse data.
[0225] In another alternative implementation, the computing resource includes at least one computing node, and each computing node includes multiple computing units. The number of computing units in the computing resource is equal to the parallel dimension of the first parallel strategy.
[0226] In another alternative implementation, the parallel dimension of the PP strategy in the second parallel strategy is N_PP, the pipeline scheduling of the PP strategy in the second parallel strategy is one forward computation followed by one backward computation (1F1B), the computing resources include N_PP computing nodes, and the i-th computing node among the N_PP computing nodes includes the i-th processing layer of the language processing module, where i takes the values 1, 2, ..., N_PP.
[0227] In another alternative implementation, regarding the generation of multiple feature data based on the first parallel strategy, the first generation module is specifically used for:
[0228] Based on the parallel dimension of the DP strategy in the first parallel strategy, the image data from the multiple training samples are distributed to the N_PP computing nodes, where the larger i is, the more image data is allocated to the i-th computing node among the N_PP computing nodes;
[0229] Based on VIT, multiple feature data are obtained by processing image data from multiple training samples.
[0230] In another alternative implementation:
[0231] The first determining module 901 is further configured to determine a third parallel strategy for the image processing module, wherein the third parallel strategy includes one or more of the DP strategy, TP strategy, PP strategy and CP strategy, and the third parallel strategy is used to train the image processing module.
[0232] The second determining module 902 is also used to determine a fourth parallel strategy for the language processing module, wherein the fourth parallel strategy includes one or more of the DP strategy, TP strategy, PP strategy and CP strategy, and the fourth parallel strategy is used to train the language processing module. The third parallel strategy and the fourth parallel strategy executed by the computing resources are different.
[0233] In another alternative implementation, the computing device 90 further includes a third determining module, wherein:
[0234] The first determining module is also used to determine the first model computing power utilization rate (MFU) of the multimodal large model during training based on the first parallel strategy and the second parallel strategy.
[0235] The second determining module is also used to determine the second MFU of the multimodal large model during training based on the third parallel strategy and the fourth parallel strategy.
[0236] The third determining module is used to determine, when the first MFU is greater than the second MFU, the first parallel strategy is the preferred parallel strategy corresponding to the image processing module, and the second parallel strategy is the preferred parallel strategy corresponding to the language processing module.
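The comparison performed by the third determining module can be sketched as a simple selection over candidate strategy pairs. The dictionary layout, the function name choose_preferred, the candidate strategies DP16TP2 and TP4PP8, and the MFU values are all hypothetical placeholders, not measurements from this application.

```python
def choose_preferred(candidates):
    """Return the candidate strategy pair with the highest MFU."""
    return max(candidates, key=lambda c: c["mfu"])


candidates = [
    # First and second parallel strategies, with a made-up first MFU.
    {"vision": "DP32", "language": "TP8PP4", "mfu": 0.41},
    # Third and fourth parallel strategies, with a made-up second MFU.
    {"vision": "DP16TP2", "language": "TP4PP8", "mfu": 0.33},
]

preferred = choose_preferred(candidates)
```

Because the first MFU is greater than the second MFU in this made-up example, the first and second parallel strategies are selected as the preferred strategies for the image processing module and the language processing module, respectively.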
[0237] In another alternative implementation, the computing device 90 further includes a transmitting / receiving module, wherein:
[0238] The transmitting / receiving module is used to send the multimodal large model trained based on the first parallel strategy and the second parallel strategy.
[0239] Figure 10 is a schematic diagram of another computing device provided in an embodiment of this application. As shown in Figure 10, the computing device 100 includes: a bus 1001, a processor 1002, a memory 1003, and a communication interface 1004. The processor 1002, the memory 1003, and the communication interface 1004 communicate with each other via the bus 1001. It should be understood that this application does not limit the number of processors and memories in the computing device 100. Optionally, the computing device 100 may be a server or a server cluster.
[0240] Bus 1001 can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of illustration, only one line is used in Figure 10, but this does not imply that there is only one bus or one type of bus. Bus 1001 can include pathways for transmitting information between various components of computing device 100 (e.g., memory 1003, processor 1002, communication interface 1004).
[0241] The processor 1002 may include any one or more processors such as a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
[0242] The memory 1003 may include volatile memory, such as random access memory (RAM). The memory 1003 may also include non-volatile memory, such as read-only memory (ROM), flash memory, hard disk drive (HDD), or solid state drive (SSD).
[0243] The memory 1003 stores executable program code, as well as input information and processing results.
[0244] The processor 1002 executes the executable program code to implement the functions of the computing device 90 shown in Figure 9, such as determining a first parallel strategy, determining a second parallel strategy, etc.
[0245] The communication interface 1004 uses transceiver modules, such as, but not limited to, network interface cards and transceivers, to enable communication between the computing device 100 and other devices or communication networks.
[0246] Figure 11 is a schematic diagram of another computing device provided in an embodiment of this application. As shown in Figure 11, the computing device 110 includes a logic circuit 1101 and an interface 1102. The logic circuit 1101 can be a chip, processing circuit, integrated circuit, or system-on-chip (SoC) chip, etc., and the interface 1102 can be a communication interface, input / output interface, pins, etc. For example, Figure 11 illustrates the computing device as a chip, which includes the logic circuit 1101 and the interface 1102.
[0247] In this embodiment, the logic circuit and the interface can also be coupled to each other. The specific connection method of the logic circuit and the interface is not limited in this embodiment. For example, the logic circuit 1101 can be used to execute the functions or steps implemented by the processor 1002 shown in FIG. 10, and the interface 1102 can be used to execute the functions or steps implemented by the communication interface 1004 shown in FIG. 10. For a detailed description of the logic circuit 1101 and the interface 1102, please refer to FIG. 10 or the method embodiment shown above, which will not be detailed here.
[0248] Furthermore, embodiments of this application also provide a computer-readable storage medium storing a computer program, which, when run on a processor, implements the method flow shown in FIG. 6 or FIG. 7. Optionally, the computer-readable storage medium may be a non-transitory computer-readable storage medium.
[0249] This application also provides a computer program product, which, when run on a processor, implements the method flow shown in FIG. 6 or FIG. 7.
[0250] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For instance, the division of modules is only a logical functional division, and in actual implementation there may be other division methods: multiple modules or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be an electrical, mechanical, or other form of connection. The modules described as separate components may or may not be physically separated. The components shown as modules may or may not be physical modules, that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules can be selected according to actual needs to achieve the technical effects of the solutions provided in the embodiments of this application.
[0251] Furthermore, the functional modules in the various embodiments of this application can be integrated into one processing module, or each module can exist physically separately, or two or more modules can be integrated into one module. The integrated modules can be implemented in hardware or as software functional modules. If the integrated module is implemented as a software functional module and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a readable storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned readable storage medium includes: USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media capable of storing program code.
[0252] In this application, the terms "exemplarily" or "for example" are used to indicate that something is an example, illustration, or description. Any embodiment or design described as "exemplarily" or "for example" in this application should not be construed as being more preferred or advantageous than other embodiments or designs. Specifically, the use of the terms "exemplarily" or "for example" is intended to present the relevant concepts in a specific manner.
[0253] Furthermore, unless otherwise stated, ordinal numbers such as "first" and "second" in the embodiments of this application are used to distinguish multiple objects, for example the first parallel strategy and the second parallel strategy, and do not limit the order, timing, priority, or importance of those objects.
[0254] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A training method based on a parallel strategy, characterized in that, The method includes: A first parallel strategy is determined for the image processing module in a multimodal large model, wherein the first parallel strategy includes one or more of the following strategies: data parallelism (DP) strategy, tensor parallelism (TP) strategy, pipeline parallelism (PP) strategy, and context parallelism (CP) strategy. A second parallel strategy is determined for the language processing module in the multimodal large model, wherein the second parallel strategy includes one or more of the following strategies: DP strategy, TP strategy, PP strategy, and CP strategy. The image processing module uses the same computing resources as the language processing module, but the first parallel strategy and the second parallel strategy executed by the computing resources are different. The multimodal large model is used to perform multimodal tasks.
2. The method according to claim 1, characterized in that, The parallel dimension of the first parallel strategy is the same as that of the second parallel strategy. The parallel dimension includes the product of one or more of the parallel dimensions of the DP strategy, the TP strategy, the PP strategy, and the CP strategy.
3. The method according to claim 2, characterized in that, The image processing module is a visual encoder (VIT), the language processing module is a large language model (LLM), and the method further includes: Multiple feature data are generated based on the first parallel strategy. Each feature data is obtained by forward calculation of image data from one of the training samples of the multiple training samples through the VIT. Each training sample includes the image data and text data. The text data is used as a text instruction to indicate the descriptive text of the image data. The text data and feature data in each training sample are fused separately to generate multiple fused data; Multiple prediction data are generated based on the second parallel strategy, wherein the multiple prediction data are obtained by forward calculation of the multiple fused data through the LLM respectively; The parameters in the multimodal large model are optimized based on the multiple prediction data.
4. The method according to claim 3, characterized in that, Each training sample includes multiple image-text pairs. Each image-text pair includes a sample image and sample text corresponding to the sample image. The sample images in the multiple image-text pairs belong to the image data, and the sample text in the multiple image-text pairs belong to the text data.
5. The method according to claim 4, characterized in that, The total number of text tokens and image tokens is the same for each training sample.
6. The method according to any one of claims 3-5, characterized in that, The optimization of the parameters in the multimodal large model based on the multiple prediction data includes: Multiple first reverse data are generated, wherein the multiple first reverse data are obtained by reverse calculation of the multiple prediction data through the LLM; The parameters in the LLM are optimized based on the multiple first reverse data; Multiple second reverse data are generated, wherein the multiple second reverse data are obtained by reverse calculation of the multiple first reverse data through the VIT; The parameters in the VIT are optimized based on the multiple second reverse data.
7. The method according to any one of claims 3-6, characterized in that, The computing resources include at least one computing node, and each computing node includes multiple computing units. The number of computing units in the computing resources is equal to the parallel dimension of the first parallel strategy.
8. The method according to claim 7, characterized in that, In the second parallel strategy, the parallel dimension of the PP strategy is $N_{pp}$, and the pipeline scheduling of the PP strategy is one forward computation followed by one backward computation (1F1B). The computing resources include $N_{pp}$ computing nodes, and the i-th computing node among the $N_{pp}$ computing nodes includes the i-th processing layer of the language processing module, where i takes the values 1, 2, ..., $N_{pp}$ sequentially.
9. The method according to claim 8, characterized in that, The generation of multiple feature data based on the first parallel strategy includes: The image data from the multiple training samples are allocated to the $N_{pp}$ computing nodes according to the parallel dimension of the DP strategy in the first parallel strategy, where the larger i is, the more image data is allocated to the i-th computing node among the $N_{pp}$ computing nodes; Based on the VIT, the image data in the multiple training samples are processed to obtain multiple feature data.
10. The method according to any one of claims 1-9, characterized in that, The method further includes: A third parallel strategy for the image processing module is determined, wherein the third parallel strategy includes one or more of the DP strategy, TP strategy, PP strategy and CP strategy, and the third parallel strategy is used to train the image processing module. A fourth parallel strategy for the language processing module is determined, wherein the fourth parallel strategy includes one or more of the DP strategy, TP strategy, PP strategy and CP strategy, and the fourth parallel strategy is used to train the language processing module. The third parallel strategy executed by the computing resources is different from the fourth parallel strategy.
11. The method according to claim 10, characterized in that, The method further includes: Based on training using the first parallel strategy and the second parallel strategy, the first model computing power utilization rate (MFU) of the multimodal large model is determined. Based on training using the third and fourth parallel strategies, the second MFU of the multimodal large model is determined. If the first MFU is greater than the second MFU, the first parallel strategy is determined to be the preferred parallel strategy corresponding to the image processing module, and the second parallel strategy is determined to be the preferred parallel strategy corresponding to the language processing module.
12. The method according to any one of claims 1-11, characterized in that, The method further includes: The multimodal large model trained based on the first parallel strategy and the second parallel strategy is sent.
13. A computing device, characterized in that, The device includes a first determining module and a second determining module, wherein: The first determining module is used to determine a first parallel strategy for the image processing module in the multimodal large model, wherein the first parallel strategy includes one or more of the following: data parallelism (DP) strategy, tensor parallelism (TP) strategy, pipeline parallelism (PP) strategy, and context parallelism (CP) strategy. The second determining module is used to determine a second parallel strategy for the language processing module in the multimodal large model, wherein the second parallel strategy includes one or more of the following strategies: DP strategy, TP strategy, PP strategy and CP strategy. The image processing module uses the same computing resources as the language processing module, but the first parallel strategy and the second parallel strategy executed by the computing resources are different. The multimodal large model is used to perform multimodal tasks.
14. A computing device, characterized in that, The computing device includes a processor configured to cause the computing device to implement the method as described in any one of claims 1-11.
15. A computing device, characterized in that, The computing device includes logic circuitry and an interface, the interface being used for inputting and / or outputting information, and the logic circuitry being used to enable the computing device to implement the method as described in any one of claims 1-11.
16. A server, characterized in that, The server includes a processor configured to cause the server to implement the method as described in any one of claims 1-11.
17. A chip, characterized in that, The chip includes logic circuitry and an interface, the interface being used for inputting and / or outputting information, and the logic circuitry being used to enable the chip to implement the method as described in any one of claims 1-11.
18. A computer-readable storage medium, characterized in that, The computer-readable storage medium is used to store a computer program, which, when executed, performs the method as described in any one of claims 1-11.
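For illustration only, the constraints recited in claims 1, 2, 7 and 11 can be sketched in Python as follows. This is a minimal sketch and not the implementation of this application: the names ParallelStrategy, check_pair and pick_preferred are hypothetical, the parallel dimension is taken to be the product of the four per-strategy degrees, and the MFU values are assumed to be supplied by real training runs.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class ParallelStrategy:
    """Hypothetical container: per-module parallel degrees (1 means unused)."""
    dp: int = 1  # data parallelism degree
    tp: int = 1  # tensor parallelism degree
    pp: int = 1  # pipeline parallelism degree
    cp: int = 1  # context parallelism degree

    @property
    def dimension(self) -> int:
        # Parallel dimension taken as the product of the degrees (claim 2).
        return prod((self.dp, self.tp, self.pp, self.cp))


def check_pair(image: ParallelStrategy, language: ParallelStrategy,
               num_units: int) -> None:
    """Raise ValueError if the pair violates the constraints of claims 1, 2, 7."""
    if image == language:
        raise ValueError("the two modules must use different parallel strategies")
    if image.dimension != language.dimension:
        raise ValueError("both strategies must have the same parallel dimension")
    if image.dimension != num_units:
        raise ValueError("parallel dimension must equal the number of computing units")


def pick_preferred(measured_mfu):
    """Return the (image, language) strategy pair with the highest MFU (claim 11).

    measured_mfu maps a (image_strategy, language_strategy) pair to an MFU
    value that is assumed to come from an actual training run.
    """
    return max(measured_mfu, key=measured_mfu.get)


# Example: 16 computing units shared by both modules.
vit_strategy = ParallelStrategy(dp=16)        # image processing module
llm_strategy = ParallelStrategy(tp=4, pp=4)   # language processing module
check_pair(vit_strategy, llm_strategy, num_units=16)  # passes silently
```

In this sketch the two modules share the same 16 computing units but split them differently, which is the decoupling at the computational logic level that the claims describe. A candidate pair that fails check_pair would simply be excluded before any MFU comparison is made.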