A configuration method and system of an AI model
By identifying the basic operator sequence and the difference operator sequences across parallel configurations, the method addresses the inaccurate operator lists and low efficiency encountered when configuring parallelism for large-scale AI models, enabling efficient and accurate operator sequence configuration and fast response.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HUAWEI TECH CO LTD
- Filing Date
- 2024-12-30
- Publication Date
- 2026-06-30
AI Technical Summary
In existing technologies, when parallelism is configured for large-scale AI models, the derived operator list is inaccurate and obtaining it is inefficient. Furthermore, after the parallel configuration is changed, the model must be rerun to collect the operator sequence, which is also inefficient.
By running the model under multiple preset parallel configurations, the operator sequence of each computing device is obtained, the basic operator sequence and the difference operator sequences are identified, and the operator sequence of the target parallel configuration is generated, thereby improving accuracy and efficiency.
This enables efficient and accurate configuration of operator sequences for large-scale AI models, reduces computing resource requirements, and allows the operator sequence to be regenerated quickly when the parallel configuration changes.
Figure CN122308948A_ABST
Description
Technical Field
[0001] This application relates to the field of artificial intelligence (AI), and more particularly to a method and system for configuring an AI model.
Background Technology
[0002] The training process for large-scale models typically involves extensive parallel computation. When a model is trained on a large-scale cluster, time is spent mainly on communication and computation. Communication time is the time spent communicating between the computing nodes, while computation time is the time spent running the model. The computation time is generally fixed, whereas the communication time usually varies with the parallel configuration.
[0003] When configuring parallelism, in addition to selecting the specific parallelism methods, such as data parallelism (DP), pipeline parallelism, and tensor parallelism, it is also necessary to configure the sequence of operators that runs on each computing device in the AI cluster under that parallel configuration. Currently, the list of operators to be deployed on each computing device, i.e., the arrangement of operators that satisfies the parallel configuration, can be derived theoretically. However, this approach is inefficient and produces an inaccurate operator list. Furthermore, if the parallel configuration needs to be changed after training begins, the AI model must be run again with the new configuration to collect the operator sequence on each computing card, which is also inefficient.
[0004] Therefore, how to obtain a list of operators that conforms to a parallel configuration more efficiently has become an urgent problem to be solved.
Summary of the Invention
[0005] This application provides a method and system for configuring an artificial intelligence (AI) model, which is used to obtain a sequence of operators computed by a computing device that runs the AI model in parallel with a target parallel configuration in a more efficient manner.
[0006] In view of this, in a first aspect, this application provides a method for configuring an artificial intelligence (AI) model, comprising: running a first AI model using M parallel configurations respectively, and obtaining M first operator sequences executed by any computing device in the AI cluster running the first AI model. Typically, N parallel modes can be preset for the first AI model; each of the M parallel configurations excludes at least one of the N parallel modes, and the at least one excluded parallel mode differs from one configuration to another, where M is less than or equal to N. This is equivalent to turning off one or more of the N parallel modes each time, with a different parallel mode turned off each time. Then, the second operator sequence, which is common to the M first operator sequences, is obtained. The first AI model is also run using each of the N parallel modes individually to obtain a third operator sequence corresponding to each of the N parallel modes. Next, the difference operator between the third operator sequence corresponding to each of the N parallel modes and the second operator sequence is calculated, that is, the additional communication-related operators that each parallel mode introduces. Finally, the operator sequence in the computing device corresponding to the target parallel configuration is generated according to the number of each of the N parallel modes in the target parallel configuration, the difference operator of each parallel mode, and the second operator sequence.
[0007] In this embodiment, the operator sequence common to the parallel runs of the first AI model under the various parallel modes, namely the basic operator sequence, is determined, together with the difference operator sequence of each parallel mode, which distinguishes that mode from the other parallel modes. In this way, for any parallel configuration of the first AI model, the corresponding operator sequence can be obtained from the basic operator sequence and the difference operator sequences of the parallel modes in that configuration. This improves both the accuracy and the efficiency of operator sequence prediction.
[0008] In one possible implementation, M equals N, and the number of the at least one excluded parallel mode is 1. That is, in one possible scenario, the parallel modes can be turned off one at a time in sequence to obtain N parallel configurations.
[0009] In one possible implementation, obtaining the same second operator sequence among the M first operator sequences includes: taking the intersection of the M first operator sequences to obtain the second operator sequence. In this embodiment, the basic operator sequence contained in each first operator sequence can be obtained by taking the intersection of each first operator sequence.
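The intersection step can be illustrated with a minimal Python sketch. It rests on assumptions that are not stated in the application: operators are represented as plain strings, the intersection is treated as a multiset intersection (an operator is kept as many times as it appears in every sequence), and the result follows the order of the first sequence. The function name `basic_operator_sequence` and the operator names are hypothetical.

```python
from collections import Counter
from functools import reduce


def basic_operator_sequence(first_sequences):
    """Return the operators shared by all sequences, ordered as in the first one."""
    if not first_sequences:
        return []
    # Counter & Counter keeps each operator min(count) times (multiset intersection).
    common = reduce(lambda a, b: a & b, (Counter(s) for s in first_sequences))
    result = []
    for op in first_sequences[0]:
        if common[op] > 0:
            result.append(op)
            common[op] -= 1
    return result


# Hypothetical traces from one device, each with a different mode turned off.
traces = [
    ["matmul", "all_reduce_tp", "gelu", "matmul", "send_pp"],        # DP off
    ["matmul", "gelu", "matmul", "send_pp", "all_reduce_dp"],        # TP off
    ["matmul", "all_reduce_tp", "gelu", "matmul", "all_reduce_dp"],  # PP off
]
base = basic_operator_sequence(traces)
```

Because each communication operator is missing from at least one trace, only the computation operators survive the intersection in this example.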
[0010] In one possible implementation, the aforementioned calculation of the difference operator between the third operator sequence and the second operator sequence corresponding to each of the N parallelism methods includes: subtracting the second operator sequence from the third operator sequence corresponding to each of the N parallelism methods to obtain the difference operator corresponding to each parallelism method, wherein each of the N parallelism methods uses its minimum degree of parallelism. In this embodiment of the application, the communication-related operators introduced by parallelism under each parallelism method can be obtained by performing a difference operation, thereby quickly determining the additional operator overhead under each parallelism method.
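The difference operation can be sketched in Python as follows. This is an illustration only: it assumes operators are strings and that the subtraction is a multiset difference that removes one occurrence of each basic operator from the per-mode trace. The function name `difference_operators` and the sample operator names are hypothetical.

```python
from collections import Counter


def difference_operators(third_sequence, second_sequence):
    """Return operators in third_sequence that are not accounted for by second_sequence."""
    remaining = Counter(second_sequence)
    diff = []
    for op in third_sequence:
        if remaining[op] > 0:
            # This occurrence belongs to the basic operator sequence.
            remaining[op] -= 1
        else:
            # Extra operator introduced by the parallel mode.
            diff.append(op)
    return diff


base_ops = ["matmul", "gelu", "matmul"]
# Hypothetical trace with only tensor parallelism enabled at its minimum degree.
tp_trace = ["matmul", "all_reduce_tp", "gelu", "matmul", "all_reduce_tp"]
tp_diff = difference_operators(tp_trace, base_ops)
```

Running the same function once per parallel mode would give one difference operator list per mode.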
[0011] In one possible implementation, the aforementioned generation of the operator sequence in the computing device corresponding to the target parallel configuration based on the number of N parallel modes in the target parallel configuration and the difference operator and second operator sequence for each parallel mode can include: combining the difference operator and second operator sequence corresponding to each of the N parallel modes in the target parallel configuration based on the number of each parallel mode to obtain the operator sequence in the computing device corresponding to the target parallel configuration. In this embodiment, the basic operator sequence and the difference operator sequences can be combined using the number corresponding to each parallel mode, so that after the target parallel configuration is received, an operator sequence that meets the requirements can be assembled quickly. Even if the parameters of the target parallel configuration are changed, a complete runtime test of the target parallel configuration in the AI cluster is not necessary, and the operator sequence can be obtained very efficiently.
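A minimal sketch of the combination step is given below. The application does not specify where the difference operators are placed or how their count is derived from the degree of parallelism, so this sketch makes two loudly hypothetical choices: the caller supplies the repeat count for each mode directly, and the difference operators are appended after the basic operator sequence. The function name `generate_operator_sequence` is also hypothetical.

```python
def generate_operator_sequence(base, diff_ops, counts):
    """Combine the basic sequence with per-mode difference operators.

    base:     basic (second) operator sequence
    diff_ops: mapping from parallel mode to its difference operators
    counts:   mapping from parallel mode to how many times its difference
              operators occur in the target configuration (assumed input)
    """
    sequence = list(base)
    for mode, count in counts.items():
        # Assumption: difference operators are appended, not interleaved.
        sequence.extend(diff_ops.get(mode, []) * count)
    return sequence


target = generate_operator_sequence(
    ["matmul", "gelu"],
    {"tp": ["all_reduce_tp"], "dp": ["all_reduce_dp"]},
    {"tp": 2, "dp": 1},
)
```

Changing the target configuration then only changes `counts`, so a new sequence can be assembled without rerunning the model on the cluster.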
[0012] In one possible implementation, the aforementioned method of running the first AI model using M parallel configurations and obtaining the M first operator sequences executed by any computing device in the AI cluster running the first AI model includes: running the first AI model using M parallel configurations in the AI cluster and collecting the M first operator sequences executed by any computing device in the AI cluster running the first AI model. Specifically, in this application, the AI cluster can be configured so that multiple computing devices in the AI cluster run the first AI model, thereby allowing the collection of the operator sequences that any computing device needs to execute when the devices in the AI cluster run the first AI model in parallel.
[0013] In one possible implementation, the aforementioned process of running the first AI model using one of N parallel methods to obtain a third operator sequence corresponding to each of the N parallel methods includes: running the first AI model using one of the N parallel methods through an AI cluster; and collecting the third operator sequence corresponding to each of the N parallel methods from any computing device in the AI cluster. In this embodiment, the cluster can be configured to run the first AI model using each parallel method, thereby collecting the operator sequence that any computing device in the AI cluster needs to execute when running the first AI model using different parallel methods.
[0014] In one possible implementation, the aforementioned M first operator sequences are operator sequences computed by any computing device when the AI cluster runs the second AI model in parallel using the M parallel configurations; the aforementioned third operator sequences corresponding to each of the N parallel methods are operator sequences computed by any computing device when the AI cluster runs the second AI model based on each parallel method.
[0015] The second AI model is obtained by trimming the first AI model by the AI cluster. The types of neural network layers included in the second AI model are the same as those included in the first AI model. The number of neural network layers included in the second AI model is less than the number of neural network layers included in the first AI model. Furthermore, the number of computing devices required to run the second AI model in the AI cluster is less than the number of computing devices required to run the first AI model.
[0016] Therefore, in this embodiment of the application, when it is necessary to run the first AI model through an AI cluster, in order to reduce the computing resources required for the AI cluster to run the AI model in parallel, the AI cluster can trim the first AI model. While maintaining the type of the first AI model, the number of its network layers is reduced, so that the number of computing devices required to run the second AI model in the AI cluster is less than the number of computing devices required to run the first AI model, thereby reducing the computing resources required for the AI cluster to run the AI model in parallel.
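The trimming idea described above can be sketched in Python. The representation is an assumption made purely for illustration: a model is reduced to an ordered list of layer-type names, and trimming keeps at most `keep_per_type` layers of each type so that every layer type is preserved while the layer count drops. The function name `trim_model` and the layer names are hypothetical; the application does not describe how the AI cluster chooses which layers to keep.

```python
def trim_model(layer_types, keep_per_type=1):
    """Keep at most keep_per_type layers of each type, preserving order."""
    kept = []
    seen = {}
    for layer_type in layer_types:
        seen[layer_type] = seen.get(layer_type, 0) + 1
        if seen[layer_type] <= keep_per_type:
            kept.append(layer_type)
    return kept


first_model = ["embedding", "attention", "mlp", "attention", "mlp", "head"]
second_model = trim_model(first_model)
```

The trimmed model contains the same set of layer types as the original but fewer layers, which is the property the paragraphs above rely on.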
[0017] Secondly, this application provides an artificial intelligence (AI) model configuration system, comprising:
[0018] The same operator determination module is used to run the first AI model in M parallel configurations respectively, obtain the M first operator sequences executed by any computing device in the AI cluster running the first AI model, obtain the same second operator sequences among the M first operator sequences, each of the M parallel configurations does not include at least one of the N parallel methods of the first AI model, and the at least one parallel method not included in each of the M parallel configurations is different, where M is less than or equal to N;
[0019] The difference operator determination module is used to run the first AI model using one of the N parallel methods respectively, to obtain the third operator sequence corresponding to each of the N parallel methods, and to calculate the difference operator between the third operator sequence corresponding to each of the N parallel methods and the second operator sequence.
[0020] The configuration generation module is used to generate the operator sequence in the computing device corresponding to the target parallel configuration based on the number of N parallel modes in the target parallel configuration and the difference operator and second operator sequence for each parallel mode.
[0021] For the effects achieved by the second aspect or any optional implementation of the second aspect, refer to the description of the first aspect or any optional implementation of the first aspect; they are not repeated here.
[0022] In one possible implementation, the aforementioned M equals N, and the number of the at least one excluded parallel mode is 1.
[0023] In one possible implementation, the aforementioned identical operator determination module is specifically used to obtain the second operator sequence by taking the intersection of the M first operator sequences.
[0024] In one possible implementation, the aforementioned difference operator determination module is specifically used to subtract the second operator sequence from the third operator sequence corresponding to each of the N parallel methods to obtain the difference operator corresponding to each parallel method, wherein each of the N parallel methods uses its minimum degree of parallelism.
[0025] In one possible implementation, the aforementioned configuration generation module is specifically used to combine the difference operator corresponding to each of the N parallel modes in the target parallel configuration with the second operator sequence, based on the quantity corresponding to each of the N parallel modes, to obtain the operator sequence in the computing device corresponding to the target parallel configuration.
[0026] In one possible implementation, the aforementioned identical operator determination module is specifically used to: run the first AI model in M parallel configurations through the AI cluster, and collect the M first operator sequences executed by any computing device in the AI cluster.
[0027] In one possible implementation, the aforementioned difference operator determination module is specifically used to: run the first AI model using one of N parallel methods through the AI cluster; and collect the third operator sequence corresponding to each of the N parallel methods for running the first AI model on any computing device in the AI cluster.
[0028] In one possible implementation, the aforementioned M first operator sequences are operator sequences computed by any computing device when the AI cluster runs the second AI model in parallel using the M parallel configurations; the third operator sequences corresponding to each of the aforementioned N parallel methods are operator sequences computed by any computing device when the AI cluster runs the second AI model based on each parallel method.
[0029] The second AI model is obtained by trimming the first AI model by the AI cluster. The types of neural network layers included in the second AI model are the same as those included in the first AI model. The number of neural network layers included in the second AI model is less than the number of neural network layers included in the first AI model. Furthermore, the number of computing devices required to run the second AI model in the AI cluster is less than the number of computing devices required to run the first AI model.
[0030] Thirdly, embodiments of this application provide a management device, including a processor and a memory; the processor of the management device is used to execute instructions stored in the memory, so that the management device performs method steps as described in the first aspect and any implementation thereof.
[0031] Optionally, the management device is one of the computing devices in the AI cluster.
[0032] Fourthly, this application provides a cloud server system, including a management device and an AI cluster, wherein the AI cluster includes multiple computing devices; the management device is used to execute the method steps as described in the first aspect and any implementation thereof; the AI cluster is used to run a first AI model under the instruction of the management device.
[0033] Fifthly, embodiments of this application provide a computer program product containing instructions that, when executed by a device, cause the device to perform a method as described in the first aspect or any implementation thereof.
[0034] In a sixth aspect, embodiments of this application provide a computer-readable storage medium including computer program instructions, which, when executed by a device, cause the device to perform a method as described in the first aspect or any implementation thereof.
[0035] In a seventh aspect, embodiments of this application provide a chip including at least one processor and an interface; the at least one processor obtains program instructions or data through the interface; the at least one processor is used to execute the program instructions to implement the method in the first aspect or any implementation thereof.
Attached Figure Description
[0036] Figure 1 is a schematic diagram of a system architecture provided in an embodiment of this application;
[0037] Figure 2 is a schematic diagram of an AI cluster architecture provided in this application;
[0038] Figure 3 is a schematic diagram of a cloud service system architecture provided in this application;
[0039] Figure 4 is a schematic diagram of the structure of a model used in an embodiment of this application;
[0040] Figure 5 is a flowchart of a configuration method for an AI model provided in an embodiment of this application;
[0041] Figure 6 is a flowchart of another AI model configuration method provided in an embodiment of this application;
[0042] Figure 7 is a schematic diagram of a basic operator sequence calculation method provided in an embodiment of this application;
[0043] Figure 8 is a schematic diagram of a method for calculating a difference operator sequence provided in an embodiment of this application;
[0044] Figure 9 is a schematic diagram of the number of difference operator sequences under different degrees of parallelism provided in an embodiment of this application;
[0045] Figure 10 is an end-to-end performance diagram provided in this application;
[0046] Figure 11 is a schematic diagram of an application scenario provided in an embodiment of this application;
[0047] Figure 12 is a schematic diagram of the structure of an AI model configuration system provided in this application;
[0048] Figure 13 is a schematic diagram of the structure of a management device provided in an embodiment of this application;
[0049] Figure 14 is a schematic diagram of the structure of a cloud service system provided in an embodiment of this application.
Detailed Implementation
[0050] The technical solutions of the embodiments of this application will now be described with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of this application without creative effort are within the scope of protection of this application.
[0051] For ease of understanding, some terms or concepts involved in the embodiments of this application will be introduced.
[0052] (1) Large Model
[0053] Large models are large-scale models. The "large" in large models can be reflected in many aspects, such as large data scale, large-scale parallel computing capabilities, and larger model structures.
[0054] (2) Language model (LM)
[0055] Language models play a crucial role in Natural Language Processing (NLP), where their task is to predict the probability of a sentence occurring in a language. For example, a language model is typically constructed as a probability distribution p(s) of a string s, where p(s) attempts to reflect the frequency of string s as a sentence. It can be applied to scenarios such as text recognition or machine translation. In the embodiments of this application, the NLP models mentioned below include language models.
[0056] (3) Large Language Model (LLM)
[0057] A large language model (LLM) refers to a language model containing hundreds of billions (or more) parameters trained on massive amounts of text data. It is a natural language processing model based on deep learning. These models can process large amounts of text data to learn the grammatical and semantic rules of natural language. LLMs can be applied to text generation, machine translation, question answering systems, text summarization, and sentiment analysis, offering advantages such as strong generative capabilities, high adaptability, accurate prediction, and strong scalability. For example, in movie recommendation scenarios, a large language model can generate descriptions of movie scenes, including genre, main actors, and plot, enabling the system to better recommend similar movies. Large language models can also generate recommendation reasons; for example, e-commerce websites can use large language models to generate reasons for recommending products, such as product quality, price, and features, allowing users to better understand the value of the product.
[0058] (4) Large Model Training
[0059] Large-scale model training refers to the process of training a deep learning model (such as a neural network) using a large amount of data and computing resources. This training typically involves millions or even billions of parameters to improve the model's prediction accuracy and generalization ability in various complex scenarios.
[0060] (5) Large Model Parallelism
[0061] Large-scale model parallelism is an innovative and efficient computational strategy employed in modern artificial intelligence, particularly when dealing with complex language understanding and generation tasks. The core idea behind this approach is to distribute the computation of large neural network models across multiple computing resources (such as GPUs, TPUs, or CPUs) to achieve faster training speeds and higher resource utilization.
[0062] (6) Training Operator Prediction
[0063] Training operator prediction refers to the process of optimizing model parameters by predicting training operators during the training of machine learning or deep learning models. In the training of large models, there exists a sequence of operators, which exhibit regularity and are determined by the parallel configuration. Once the parallel configuration is determined, the sequence of operators is also fixed. Specifically, a training operator refers to the specific operation performed during model training, such as parameter updates and loss calculations. The core of training operator prediction lies in improving training efficiency and effectiveness by predicting these operations.
[0064] In the typical training process of first-generation AI models, especially for large models, there is usually a large amount of parallel computation on the data. When training large models in parallel on a large-scale cluster, the training time is divided into communication time and computation time, which can vary with the parallel configuration. For a large cluster, a better parallel configuration can even double the training performance compared to an unoptimized configuration.
[0065] When configuring parallelism for AI models, in addition to selecting the specific parallelism methods, such as data parallelism (DP), pipeline parallelism, and tensor parallelism, it is also necessary to configure the sequence of operators that runs on each computing device in the AI cluster under that parallel configuration. Currently, the list of operators that need to be deployed on each computing device, i.e., the arrangement of operators that satisfies the parallel configuration, can be output through theoretical derivation. However, this approach is inefficient and produces an inaccurate operator list. Furthermore, if the parallel configuration needs to be changed midway through AI model training, the AI model must be run again to collect the operator sequence on each computing card, resulting in extremely low efficiency.
[0066] Therefore, this application provides a configuration method that can achieve better performance in operator sequence prediction during parallel processing of large models without relying on manual configuration.
[0067] First, the method provided in this application can be applied to parallel computing scenarios for large models. Specifically, it can be applied to systems that utilize multiple computing devices for parallel training of large models, or to systems where multiple computing devices process large models in parallel, that is, the computational tasks of each operator of the large model are allocated to different parallel items for parallel execution. The parallel tasks mentioned in the embodiments of this application can be allocated to these multiple computing devices for execution.
[0068] For example, the system architecture of the AI cluster used in the method provided in this application embodiment can be as shown in Figure 1. The AI cluster may include a management device and multiple computing devices. The management device may be an independently deployed device used to manage multiple computing devices within the AI cluster, or it may be one of the multiple computing devices within the cluster.
[0069] Multiple computing devices can be used to perform data processing, such as inference or training of large models.
[0070] The management device connects to one or more computing devices and can be used to communicate with them to manage or instruct them to perform required functions. Alternatively, the management device can be one of the computing devices; the deployment method of the management device can be determined based on the specific application scenario.
[0071] For example, an AI cluster architecture used in the embodiments of this application can be as shown in Figure 2. This system architecture may include one or more servers, and each server may include one or more computing devices. These computing devices may specifically be graphics processing units (GPUs), embedded neural network processing units (NPUs), tensor processing units (TPUs), or other dedicated neural network accelerators, collectively referred to as XPUs. In the example of Figure 2, the XPU is used as the computing device. The XPU can be replaced by a GPU, NPU, TPU, or other neural network accelerator.
[0072] In another possible implementation, the method provided in this application can be applied to a cloud service system. For example, the method provided in the embodiments of this application can be deployed in a cloud platform to provide services to users through a client. It can also be understood that the aforementioned management device can specifically be a cloud platform, or that the functions of the management device can be deployed in a cloud platform.
[0073] For example, one possible system architecture provided in this application can be as shown in Figure 3. As shown in Figure 3, the cloud service system 10 may include a data center 11, a cloud platform 12, and a client 13.
[0074] The cloud platform 12 may specifically include a server cluster or one or more servers, or the servers may be replaced by other devices with computing capabilities. Optionally, the cloud platform 12 can work with other devices, such as data storage, routers, load balancers, etc. The cloud platform 12 can use data in the data center 11 or call program code in the data center 11 to implement the method steps provided in the embodiments of this application.
[0075] Data center 11 can be used to store data for cloud platform 12 to query or write data. Specifically, data center 11 may include a data server deployed separately for storing data, or it may include storage media connected to cloud platform 12. The specific deployment form of data center 11 for storing data can be determined according to the actual application scenario.
[0076] The cloud platform 12 can interact with users through a client to provide services. Users can operate on the client 13 to exchange data with the cloud platform 12 or request services from the cloud platform 12. This client can be deployed on devices with displays, such as personal computers, computer workstations, smartphones, tablets, laptops, and smart cars.
[0077] In one implementation, the cloud platform 12 is used to implement the method provided in the embodiments of this application, and the user can perform interactive operations with the cloud platform 12 through the client 13. For example, before configuring the model in parallel, the cloud platform can issue a parallel mode to the AI cluster, and the AI cluster can feed back an operator sequence to the cloud platform; the user can send a target parallel configuration to the cloud platform through the client, and the AI cluster can feed back an operator sequence to the client. This operator sequence is the sequence of operators calculated by the computing devices in the AI cluster when the AI cluster runs the model in parallel using the target parallel configuration. The user can then determine the configuration of the AI cluster through the operator sequence received by the client.
[0078] Typically, large models are very large in scale, and the amount of computation required is also very large. Both the training and inference phases of large models require a significant amount of computation. Therefore, the training and inference processes of large models can be configured in parallel to achieve parallel operation of the large models.
[0079] For example, for training large models, the training process can be configured as multiple parallel tasks, which can be distributed across multiple computing devices for execution. The model training process can be configured as multiple parallel tasks from one or more dimensions.
[0080] The configuration of parallel tasks for large models can be performed by the aforementioned management device. In one specific scenario, this management device can be one of the computing devices in an AI cluster, which can directly configure the AI cluster and run AI models in parallel within the cluster. In another specific scenario, the management device can be a device deployed independently of the AI cluster, such as the aforementioned cloud platform. In this scenario, the cloud platform can execute the method flow provided in this application embodiment and output the operator sequence corresponding to the target parallel configuration.
[0081] First, for ease of understanding, the model structure involved in the embodiments of this application will be introduced.
[0082] For example, a possible model structure can be as shown in Figure 4. This model can include multiple network layers, such as layers 1 to m shown in Figure 4, and each layer can contain multiple modules, such as modules 1 to n shown in Figure 4. Each module may include one or more operators.
[0083] Typically, the training or inference steps of a model can be distributed across multiple computing devices for parallel execution, such as the servers or XPUs described above with reference to Figures 1 to 3. The specific granularity can be adjusted according to the actual application scenario, and this application does not limit it.
[0084] For model parallelism, multiple parallelism methods can be configured simultaneously. These include layer-based configuration, such as pipelined parallelism where a single computing device computes one or more layers of the model; data parallelism where a group of computing devices in an AI cluster executes one portion of the data; and tensor parallelism, where a single data set is further divided into multiple parts, with each computing device in a group executing a portion of the data. Typically, for large models, each computing device needs to execute the same or similar operators.
[0085] Therefore, embodiments of this application provide a configuration method for an artificial intelligence (AI) model, which can configure the operator sequence of each computing device very efficiently and accurately.
[0086] The method flow provided in the embodiments of this application is described below.
[0087] See Figure 5, which is a flowchart of an AI model configuration method provided in this application.
[0088] 501. Run the first AI model using M parallel configurations respectively, and obtain the M first operator sequences executed by any computing device in the AI cluster running the first AI model.
[0089] The first AI model is one that needs to be configured to run in parallel across multiple computing devices in an AI cluster.
[0090] The first AI model can be parallelized in N ways, where N is a positive integer and M is not greater than N, that is, M is less than or equal to N.
[0091] The first AI model can be run using M parallel configurations, resulting in M sequences of first operators executed by any computing device in the AI cluster running the first AI model. Each parallel configuration corresponds to one sequence of first operators, and each of the M parallel configurations excludes at least one of the N parallel methods, and the excluded parallel methods are different for each of the M parallel configurations. This is equivalent to pre-setting N parallel methods, and then disabling parallelism in one or more of these methods each time, resulting in a different parallel configuration. For example, the first parallel configuration might exclude parallel method A, the second might exclude parallel method B, and so on.
[0092] Specifically, the method provided in this application embodiment can be deployed in the aforementioned management device. The management device can be one of the servers in the AI cluster, or it can be a device deployed independently and connected to the AI cluster. If the management device is one of the computing devices in the cluster, the management device can use the M parallel configurations to run the first AI model in parallel within the AI cluster, and collect the operator sequence executed by one of the computing devices in the AI cluster when running the first AI model under each of the M parallel configurations, thereby obtaining the M first operator sequences.
[0093] Therefore, in this embodiment, multiple parallel configurations can be obtained by disabling parallelism in different ways, and the first AI model can be run in the AI cluster using different parallel configurations. This allows for the acquisition of the basic operator sequence that the computing device needs to compute during the operation of the first AI model in the AI cluster when parallelism is disabled in different dimensions, thus providing a basic operator sequence for subsequent parallel configurations.
[0094] Optionally, M equals N, and the number of excluded parallel methods in each parallel configuration can be 1. For example, for N pre-defined parallel methods, one parallel method can be removed each time, with a different parallel method removed each time, thus obtaining M parallel configurations, that is, N parallel configurations. For example, based on the N parallel methods, the degree of parallelism of one of the parallel methods can be set to 1 each time, which means that the parallel configuration does not include that parallel method. Therefore, in this embodiment, one parallel method can be turned off each time, so that when the computing device runs the first AI model, it only needs to execute, for the disabled parallel method, the basic operators used to implement the model function, without the additional operators introduced by that parallel method.
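The construction of the M = N parallel configurations described above can be sketched in Python as follows. This is an illustration only, not the implementation of this application: the function name `leave_one_out_configs` and the dictionary representation are hypothetical, the method names follow the dimensions listed in this description, and the degree of 2 for the enabled methods is an assumption.

```python
from typing import Dict, List


def leave_one_out_configs(methods: List[str], enabled_degree: int = 2) -> List[Dict[str, int]]:
    """Build one configuration per parallel method, with that method disabled.

    A degree of 1 means the method is excluded (not parallelized).
    enabled_degree is an assumed degree for the methods that stay enabled.
    """
    configs = []
    for disabled in methods:
        # Disable exactly one method; keep every other method enabled.
        config = {m: (1 if m == disabled else enabled_degree) for m in methods}
        configs.append(config)
    return configs


methods = ["data", "sequence", "tensor", "pipeline", "expert"]
configs = leave_one_out_configs(methods)
for config in configs:
    print(config)
```

Each resulting configuration excludes a different parallel method, so the number of configurations equals the number of methods.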
[0095] Optionally, the aforementioned N parallelism methods may include one or more of the following dimensions: data parallelism, sequence parallelism, tensor parallelism, pipeline parallelism, or expert parallelism. Data parallelism indicates the number of parallel partitions of the data input to the large model; sequence parallelism indicates the number of parallel partitions of the sequence input to the large model; tensor parallelism indicates the number of partitions of the input sequence at the tensor granularity; pipeline parallelism indicates the number of partitions of the data input to the large model from the time dimension; and expert parallelism indicates the number of partitions of the model parameters of the large model. Therefore, in the embodiments of this application, parallelism can be configured along multiple dimensions, so that the parallel processing tasks of the large model can be partitioned in a way that better suits the user's actual needs.
[0096] 502. Obtain the second operator sequence that is common to the M first operator sequences.
[0097] The operators commonly included in the M first operator sequences constitute the second operator sequence. Each first operator sequence represents the sequence of operators that the computing nodes need to execute in a configuration where one of the parallel methods is disabled. Therefore, the operators common to the M first operator sequences can represent the basic operators that each computing device needs to execute to implement the function of the first AI model.
[0098] Optionally, the intersection of the M first operator sequences can be directly taken as the second operator sequence. Therefore, in this embodiment, the basic operator sequence to be executed by the computing device can be identified with less computational effort.
[0099] This can be understood as detecting the fundamental operators in the first AI model by disabling different parallel processing methods. Therefore, even if the specific list of operators in the first AI model is unknown, the fundamental operators that each computing device needs to execute can be identified very efficiently by running the first AI model on one or more computing devices.
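As an illustration of taking the intersection, the following Python sketch treats each first operator sequence as a list of operator names and keeps the operators that occur in every sequence. The multiset semantics (an operator that occurs twice in every sequence is kept twice), the ordering taken from the first sequence, and the operator names are all assumptions made for this example; the description itself only states that the intersection is taken.

```python
from collections import Counter
from functools import reduce
from typing import List, Sequence


def common_operators(sequences: Sequence[Sequence[str]]) -> List[str]:
    """Return the operators shared by all sequences, in the order of the first one."""
    if not sequences:
        return []
    # Counter & Counter keeps the minimum count of each operator.
    common = reduce(lambda a, b: a & b, (Counter(s) for s in sequences))
    result = []
    for op in sequences[0]:
        if common[op] > 0:
            result.append(op)
            common[op] -= 1
    return result


# Hypothetical first operator sequences collected under three configurations.
first_sequences = [
    ["matmul", "all_reduce", "softmax", "matmul"],
    ["matmul", "softmax", "all_gather", "matmul"],
    ["matmul", "softmax", "matmul", "send"],
]
second_sequence = common_operators(first_sequences)
print(second_sequence)
```

In this made-up example the communication-related operators each appear in only one sequence and therefore drop out of the result.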
[0100] In one possible implementation, to further reduce the computational resources required for the AI cluster to run the first AI model in parallel, the AI cluster can prune the first AI model to obtain a second AI model. For example, it can keep the types of neural network layers in the first AI model but reduce the number of neural network layers, so that the number of computing devices required to run the second AI model in the AI cluster is less than the number required to run the first AI model. Accordingly, the aforementioned M first operator sequences are the operator sequences executed by any one computing device when the AI cluster runs the second AI model in parallel using the M parallel configurations.
[0101] Therefore, in this embodiment of the application, when it is necessary to run the first AI model through an AI cluster, in order to reduce the computing resources required for the AI cluster to run the AI model in parallel, the AI cluster can trim the first AI model. While maintaining the type of the first AI model, the number of its network layers is reduced, so that the number of computing devices required to run the second AI model in the AI cluster is less than the number of computing devices required to run the first AI model, thereby reducing the computing resources required for the AI cluster to run the AI model in parallel.
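The pruning step can be pictured with the following minimal Python sketch, which assumes the model is described by a configuration dictionary. The keys `layer_type`, `num_layers`, and `hidden_size`, the function name `prune_layers`, and the concrete values are hypothetical and are not taken from this application.

```python
from typing import Any, Dict


def prune_layers(model_config: Dict[str, Any], keep_layers: int) -> Dict[str, Any]:
    """Return a copy of the configuration with fewer layers of the same type."""
    if keep_layers < 1:
        raise ValueError("keep_layers must be at least 1")
    pruned = dict(model_config)  # leave the original configuration untouched
    pruned["num_layers"] = min(model_config["num_layers"], keep_layers)
    return pruned


# Hypothetical first AI model and its pruned counterpart (second AI model).
first_model = {"layer_type": "transformer", "num_layers": 96, "hidden_size": 8192}
second_model = prune_layers(first_model, keep_layers=4)
print(second_model)
```

Only the layer count changes, so each remaining layer is expected to contribute the same kinds of operators as in the unpruned model.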
[0102] 503. Run the first AI model using one of the N parallel methods respectively, and obtain the third operator sequence corresponding to each of the N parallel methods.
[0103] Typically, running a first AI model in parallel across multiple computing devices introduces additional communication-related operators due to the need for communication between the computing nodes. In this embodiment, the first AI model can be run based on each parallelism method, thereby obtaining the sequence of operators that the computing devices need to execute under each parallelism method.
[0104] To reduce the resource consumption of computing devices and improve the efficiency of obtaining the third operator sequence, optionally, each of the N parallel methods uses the minimum degree of parallelism for that method. For example, the degree of parallelism of each parallel method can be set to 2, so that the operator sequence each computing device needs to execute is identified with the lowest computing device overhead.
[0105] In one possible implementation, to further reduce the computing resources required for the AI cluster to run the first AI model in parallel, the AI cluster can trim the first AI model to obtain a second AI model, as described in step 502 above. Accordingly, the third operator sequence corresponding to each of the aforementioned N parallel methods is the operator sequence computed by any computing device when the AI cluster runs the second AI model based on each parallel method.
[0106] Therefore, in this embodiment of the application, when it is necessary to run the first AI model through an AI cluster, in order to reduce the computing resources required for the AI cluster to run the AI model in parallel, the AI cluster can trim the first AI model. While maintaining the type of the first AI model, the number of its network layers is reduced, so that the number of computing devices required to run the second AI model in the AI cluster is less than the number of computing devices required to run the first AI model, thereby reducing the computing resources required for the AI cluster to run the AI model in parallel.
[0107] This application does not limit the execution order of steps 501 and 503. Step 501 can be executed first, or step 503 can be executed first, or steps 501 and 503 can be executed simultaneously. The specific order can be determined according to the actual application scenario, and this application does not limit it.
[0108] 504. Calculate the difference operator between the third operator sequence and the second operator sequence for each of the N parallelization methods.
[0109] By obtaining the difference operator between the third operator sequence and the second operator sequence for each of the N parallelization methods, the difference operator corresponding to each parallelization method is obtained.
[0110] Typically, the second operator sequence can represent the basic operators required to run the first AI model. However, when running the first AI model in parallel based on each parallel method, communication between computing devices will introduce additional operators to the computing devices. Therefore, calculating the difference operator between the third operator sequence and the second operator sequence corresponding to each parallel method can represent the operator sequence that needs to be introduced for each parallel method.
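A minimal Python sketch of the difference computation is given below. It assumes operator sequences are compared as multisets of operator names, which is one possible reading of "difference operator"; the function name `difference_operators` and the operator names are made up for the example.

```python
from collections import Counter
from typing import Sequence


def difference_operators(third_sequence: Sequence[str], second_sequence: Sequence[str]) -> Counter:
    """Operators present in the third sequence beyond those in the basic sequence.

    Counter subtraction drops operators whose resulting count is zero or negative.
    """
    return Counter(third_sequence) - Counter(second_sequence)


# Hypothetical sequences: the basic operators, and one parallel method enabled alone.
second_sequence = ["matmul", "softmax", "matmul"]
third_sequence = ["matmul", "all_reduce", "softmax", "matmul", "all_reduce"]
diff = difference_operators(third_sequence, second_sequence)
print(diff)
```

Here the remainder contains only the communication-related operator that the parallel method added.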
[0111] 505. Generate the operator sequence of the computing device corresponding to the target parallel configuration, based on the degree of parallelism of each of the N parallel methods in the target parallel configuration, the difference operator corresponding to each parallel method, and the second operator sequence.
[0112] The target parallel configuration may specifically include the degree of parallelism corresponding to one or more of the aforementioned N parallel methods. Based on these degrees of parallelism, the difference operator corresponding to each parallel method, and the second operator sequence, the operator sequence of the computing device corresponding to the target parallel configuration is generated, that is, the operator sequence that the computing device needs to execute when running the first AI model according to the target parallel configuration. In this embodiment, the operator sequence common to the various parallel methods of the first AI model, i.e., the basic operator sequence, and the difference operator sequence that each parallel method adds relative to the basic operator sequence can be determined in advance. Therefore, for any parallel configuration of the first AI model running within the AI cluster, the corresponding operator sequence can be obtained from the basic operator sequence and the difference operator sequences of the parallel methods used in that configuration, which improves both the accuracy and the efficiency of operator sequence prediction.
[0113] For example, based on the degree of parallelism of each of the N parallel methods in the target parallel configuration, the difference operator corresponding to each parallel method and the second operator sequence can be combined to obtain the operator sequence of the computing device corresponding to the target parallel configuration. In this embodiment, the basic operator sequence and the difference operator sequences are combined according to the degree of parallelism of each parallel method, so that after the target parallel configuration is received, an operator sequence that meets the requirements can be assembled quickly. Even if the parameters of the target parallel configuration are changed, there is no need to run the target parallel configuration end to end in the AI cluster again, and the operator sequence can be obtained very efficiently.
[0114] Optionally, the aforementioned target parallel configuration can be derived from user input data. For example, user input data can be obtained, and the target parallel configuration can be determined based on the content of the user input data. For instance, if the method provided in this application embodiment is deployed in the cloud and provides services to users in the form of a client, the user can input the target parallel configuration for running the first AI model on the client, the client sends the parallel configuration to the cloud, and the cloud efficiently feeds back the operator sequence to the client through the method provided in this application embodiment. When the user changes their requirements, they can input a new target parallel configuration, and the cloud can also provide feedback very efficiently again, improving the efficiency of the user obtaining a parallel configuration that meets their needs and the corresponding operator list.
[0115] Optionally, if the target parallel configuration is adjusted, a new operator sequence can be generated from the degree of parallelism of each of the N parallel methods in the new target parallel configuration, the difference operator for each parallel method, and the second operator sequence.
[0116] The foregoing has described the steps performed by the management device in the embodiments of this application. In another possible scenario, the method provided in the embodiments of this application can be deployed in a management device connected to an AI cluster, such as the cloud platform shown in Figure 3 described above, which obtains the required operator sequence by configuring the AI cluster and collecting data from the AI cluster.
[0117] The following section provides a more detailed description of the method flow provided in the embodiments of this application, using AI clusters as an example.
[0118] See Figure 6, which is a flowchart of another data processing method provided in this application embodiment.
[0119] 601. The management device disables a different parallel mode each time to obtain M parallel configurations, and configures them for the AI cluster containing multiple computing devices.
[0120] The parallel operation of the first AI model can include various parallel methods, such as data parallelism, sequence parallelism, tensor parallelism, pipeline parallelism, or expert parallelism. One of these parallel methods can be set to non-parallelism, while the others are set to parallelism, resulting in M parallel configurations. These M parallel configurations are then used as the configuration for the AI cluster.
[0121] For example, if the N parallel modes include parallel configuration items for a first parallel mode, a second parallel mode, and a third parallel mode, and these three parallel configuration items are enabled, the management device can respectively disable the parallel configuration item for the first parallel mode and enable the parallel configuration items for the second and third parallel modes; disable the parallel configuration item for the second parallel mode and enable the parallel configuration items for the first and third parallel modes; disable the parallel configuration item for the third parallel mode and enable the parallel configuration items for the first and second parallel modes, etc., to obtain various configurations that disable the parallel configuration items for different parallel modes.
[0122] For example, you can set data parallelism to non-parallelism, and sequence parallelism, tensor parallelism, pipeline parallelism, and expert parallelism to parallelism to obtain one parallel configuration; you can set sequence parallelism to non-parallelism, and data parallelism, tensor parallelism, pipeline parallelism, and expert parallelism to parallelism to obtain another parallel configuration, and so on.
[0123] In addition, to reduce the amount of computation required for the AI cluster to output operator sequences, a minimum degree of parallelism can be set for each parallelism method. For example, the degree of parallelism of each parallelism method can be set to 2. In this case, the first AI model only needs to be run on a number of computing devices equal to the number of parallelism methods multiplied by 2. That is, only a small number of computing devices in the AI cluster need to be started to run the first AI model in parallel, which completes the run of a set of parallel configurations more efficiently and reduces the computational overhead of the AI cluster.
[0124] 602. The management device collects M sequences of first operators executed by any computing device.
[0125] The first AI model can be run in parallel on multiple computing devices in the AI cluster using each parallel configuration. Each parallel configuration yields one first operator sequence, which is the operator sequence executed by any one computing device in the AI cluster. The M parallel configurations correspond to M first operator sequences, which are fed back to the management device or actively collected by the management device.
[0126] Typically, in existing large-scale parallel frameworks, the operator sequence under each parallel configuration contains the same basic operator sequence. To make the method of obtaining this basic operator sequence robust, that is, to make it work well across all parallel dimensions, the embodiments of this application disable one parallel method at a time and take the minimum common subset of the operator sequences of the resulting parallel configurations as the basic operator sequence.
[0127] Furthermore, to further reduce the computational resources required to obtain the operator sequence, when running the first AI model in parallel within the AI cluster, the AI cluster can prune the first AI model. For example, it can maintain the type of neural network layers in the first AI model but reduce the number of neural network layers to obtain the second AI model. Therefore, when running the second AI model in the AI cluster, compared to running the first AI model, each computing device executes the same or similar operator sequence, but requires fewer computing devices, thus obtaining M first operator sequences with fewer computational resources.
[0128] In this embodiment, the required operator sequence can be obtained by starting a small number of computing devices in the AI cluster to run the first AI model in parallel using each parallel configuration. Furthermore, by setting a non-parallel mode for each parallel mode, the basic operators required to implement model computation under each parallel mode are determined, and the management device does not need to know the operator list of the first AI model. Even in the case of rapid model updates, only a small amount of cluster computing overhead is required for the cluster to efficiently determine the basic operator sequence.
[0129] 603. The management device obtains the second operator sequence that is common to the M first operator sequences.
[0130] After receiving M first operator sequences, the management device takes the intersection of the operators in the M first operator sequences to obtain the second operator sequence. This second operator sequence can be used as the basic operator sequence, that is, the operator sequence that each computing device needs to execute.
[0131] For example, suppose the N parallelism options include configuration items for a first, a second, and a third parallelism method. If the first parallelism method is disabled, the neural network's computational tasks do not need to be partitioned along the first parallelism method within the AI cluster, and only need to be partitioned along the second and third parallelism methods. The resulting operator sequence then includes the basic operator sequence under the first parallelism method: because the computational task under the first parallelism method is not parallelized, each computing device only needs to execute the basic operators used for computation under that method, without executing communication-related operators. Similarly, if a configuration enables the first and third parallelism methods but disables the second, the computational tasks do not need to be partitioned along the second parallelism method, and only need to be partitioned along the first and third. In the resulting operator sequence, every computing device executes all computational tasks under the second parallelism method. Because the computational task under the second parallelism method is not parallelized, each computing device only needs to execute the basic operators used for computation under that method, without executing communication-related operators.
[0132] Therefore, by taking the intersection of different sequences of first operators, we can obtain the operators used for model calculation that each computing device needs to execute under the first parallel mode, the second parallel mode, and the third parallel mode, which serve as the basic operator sequence.
[0133] For example, parallelism can be disabled for one of the parallel items (i.e., its degree of parallelism is set to 1), the resulting parallel configuration is applied to the AI cluster, and the operator sequence of any computing device in the AI cluster with that parallel mode disabled is collected, i.e., the aforementioned first operator sequence. After iterating through all parallel modes, disabling each in turn, multiple operator sequences are obtained, as shown in Figure 7. The intersection of these operator sequences is taken as the basic operator sequence. In the subsequent prediction of operator sequences, this basic operator sequence needs to be included for every parallel method; in other words, it is the operator sequence that each computing device needs to execute when the AI cluster runs the first AI model under any parallel method.
[0134] 604. The management device separately enables each one of the N parallel modes, and configures each parallel mode for the AI cluster.
[0135] To identify the operators introduced by different parallel modes, the management device can individually enable one of the N parallel modes and configure each parallel mode separately for the AI cluster.
[0136] For example, if the N parallel modes include parallel configuration items for the first parallel mode, the second parallel mode, and the third parallel mode, the first parallel mode can be enabled alone to generate one parallel configuration, the second parallel mode can be enabled alone to generate another parallel configuration, and so on.
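Step 604 can be sketched in Python as follows, purely as an illustration. The mode names mirror the three-mode example above, the function name `single_mode_configs` is hypothetical, and the degree of 2 for the enabled mode is an assumed minimum degree of parallelism.

```python
from typing import Dict, List


def single_mode_configs(modes: List[str], degree: int = 2) -> Dict[str, Dict[str, int]]:
    """For each mode, build a configuration in which only that mode is enabled.

    A degree of 1 means the mode is not parallelized; degree is an assumed
    minimum degree of parallelism for the single enabled mode.
    """
    return {
        enabled: {m: (degree if m == enabled else 1) for m in modes}
        for enabled in modes
    }


modes = ["first", "second", "third"]
per_mode = single_mode_configs(modes)
for name, config in per_mode.items():
    print(name, config)
```

Running the model once per entry would give one third operator sequence for each parallel mode.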
[0137] This application does not limit the execution order of steps 601 and 604. Step 601 can be executed first, or step 604 can be executed first, depending on the actual application scenario.
[0138] 605. The management device collects the third operator sequence for each parallel mode.
[0139] The first AI model can be run on the computing devices in the AI cluster using each parallel method, and the third operator sequence corresponding to each parallel method can be obtained. That is, the operator sequence executed by any computing device in the AI cluster when the first AI model is run using each parallel method.
[0140] To reduce the overhead of running the first AI model in the AI cluster, the degree of parallelism of each parallel method can be set to the minimum. For example, the degree of parallelism of each parallel method can be set to 2. In that case, only 2 computing devices need to be started to complete the parallel run for one parallel method, so that the third operator sequence is obtained at very low running cost.
[0141] Accordingly, to further reduce the computational resources required to obtain the operator sequence, when running the first AI model in parallel within the AI cluster, the AI cluster can prune the first AI model. For example, it can maintain the type of neural network layers in the first AI model but reduce the number of neural network layers to obtain the second AI model. If the first AI model has been pruned in step 602, then in step 605, each parallel method can be directly used to run the second AI model in parallel. Therefore, when running the second AI model in the AI cluster, compared to running the first AI model, each computing device executes the same or similar operator sequence, but requires fewer computing devices, thus running the AI model with fewer computational resources to obtain the third operator sequence for each parallel method.
[0142] 606. The management device calculates the difference operator between the third operator sequence and the second operator sequence for each parallel mode.
[0143] For the third operator sequence corresponding to each parallel mode, in order to identify the difference between each parallel task and the basic operator sequence, the operator sequence of each parallel task can be obtained by configuring each parallel mode separately and running the first AI model in the AI cluster.
[0144] For example, as shown in Figure 8, each parallelism mode is enabled separately and configured to the AI cluster, and the operator sequence of any computing device in the AI cluster under each parallelism mode is collected. Taking the operator sequence of one computing device under one parallelism mode as an example, the difference between this operator sequence and the basic operator sequence is computed. By traversing all parallelism modes and computing the difference between each mode's operator sequence and the basic operator sequence, the difference operator sequence of each parallelism mode is obtained.
[0145] This can be understood as follows: the basic operator sequence is the sequence of basic operators that each computing device needs to execute when running the model. However, once parallel tasks are introduced, communication between computing devices becomes necessary, and this communication introduces additional operators, namely the difference operators in this embodiment. Therefore, this embodiment explicitly accounts for the communication-related operators in parallel scenarios, which allows the performance impact of communication to be estimated more accurately alongside operator performance.
[0146] 607. The management device obtains the target parallel configuration.
[0147] The target parallel configuration may include the degree of parallelism of each parallelism method, for example data parallelism 4, sequence parallelism 4, tensor parallelism 2, pipeline parallelism 2, and expert parallelism 4. This means that the degree of data parallelism is 4, the degree of sequence parallelism is 4, the degree of tensor parallelism is 2, the degree of pipeline parallelism is 2, and the degree of expert parallelism is 4.
[0148] The target parallel configuration can be input by the user through the client. For example, before training the model, the user can configure the AI cluster that executes the first AI model in parallel. The required operator sequence can be obtained by inputting one or more sets of target parallel configurations. Here, we will introduce one set of target parallel configurations as an example.
[0149] For example, when a user wants a parallel training scheme for a large model, they can input the required parallel configuration. For instance, they can input the number of parallel operations for one or more of the following configurations, such as data parallel configuration, sequence parallel configuration, tensor parallel configuration, pipeline parallel configuration, or expert parallel configuration, depending on the actual situation of the computing device.
[0150] 608. The management device generates the operator sequence corresponding to the target parallel configuration based on the degree of parallelism of each parallel mode in the target parallel configuration, the difference operator for each parallel mode, and the second operator sequence.
[0151] Specifically, the management device can determine the coefficient of the difference operator for each parallel mode based on the degree of parallelism of that parallel mode in the target parallel configuration, and combine the difference operators, weighted by their coefficients, with the second operator sequence to obtain the operator sequence corresponding to the target parallel configuration.
[0152] For example, as shown in Figure 9, take CP decomposition (tensor decomposition) as an example. CP decomposition is a tensor decomposition method that decomposes a high-dimensional tensor into the sum of the outer products of several factor vectors with the same parallelism. When the CP group is not parallelized, CP=1. When CP is enabled with the minimum parallel configuration, CP=2. The difference between these two configurations is the difference operator sequence CP_diff between CP and non-CP. The difference between CP=4 and CP=2 is one more CP_diff. Thus the difference operator sequence between the operator lists of CP=4 and CP=1 is twice that between the operator lists of CP=2 and CP=1, and the two are identical in composition, both consisting of CP_diff.
[0153] For example, if parallel configuration is performed along the dimensions of CP decomposition and EP (expert parallel) decomposition, we obtain CP_DIFF and the difference operator sequence EP_DIFF of the EP parallel operator sequence. Subsequently, given the operator sequences BASE, CP_DIFF, and EP_DIFF, we can deduce that the operator sequence for parallel configuration 2CP_4EP is represented as BASE+CP_DIFF+3*EP_DIFF.
[0154] In this embodiment, the basic operator sequence for implementing all parallel modes can be obtained by individually disabling the parallel configuration under each parallel mode. Furthermore, the operator sequence required for each parallel mode can be obtained by individually configuring the parallel configuration for each parallel mode. The difference between these two operator sequences represents the incremental operator introduced by parallelism. Combining the basic operator sequence with the incremental operator sequence for each parallel mode yields a complete operator sequence capable of parallel processing in all parallel modes. Therefore, only the first AI model needs to be run within the AI cluster to collect the basic operators for parallel operation of the first AI model, as well as the difference operators for each parallel mode. Even by changing the parallel parameters in the target parallel configuration, the corresponding operator sequence can be output very efficiently and accurately.
[0155] Taking the processing of large models as an example, the end-to-end performance of parallel processing of large models can be reflected through various performance metrics. For example, as shown in Figure 10, the end-to-end performance of large-scale parallel processing can be divided into communication performance and operator performance. Communication performance is the time spent communicating between different computing devices. Correspondingly, when computing devices communicate, communication-related operators are introduced compared with non-parallel processing. Operator performance can include the time spent determining the operator sequence and the time spent computing a single operator.
[0156] The computation time of the operators needs to be calculated using the operator sequence in conjunction with the computation time of a single operator. In this embodiment, after obtaining the common basic operator sequence and the difference operator sequences for different parallel methods, a complete operator list for a set of target configurations can be assembled using fusion methods such as combination or multiplication. Furthermore, the computation time corresponding to the operator sequence can be predicted quickly. For example, users can quickly obtain an operator sequence that matches the parallel configuration using the method provided in this embodiment. If the computation time of the operator sequence does not meet the user's needs, the user can adjust the target parallel configuration and quickly obtain a new operator sequence using the method provided in this embodiment. This allows an operator sequence that better suits the user's needs to be obtained more efficiently, and that operator sequence can be used to configure the operator sequence used by the AI cluster to run the first AI model in parallel. That is, there is no need to instrument the model and re-collect the operator profile every time the target parallel configuration is changed, thereby significantly shortening the operator prediction time while ensuring the accuracy of operator performance.
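As a rough sketch of how an operator sequence and single-operator times combine into a predicted computation time: the function name `predict_compute_time`, the operator names, and the per-operator times below are all hypothetical, and the sketch assumes that each operator type has one fixed time, which a real predictor may not assume.

```python
from collections import Counter


def predict_compute_time(op_sequence, op_time_ms):
    """Sum count * single-operator time over every operator in the sequence."""
    return sum(count * op_time_ms[op] for op, count in op_sequence.items())


# Hypothetical composed operator sequence and per-operator times in milliseconds.
target_sequence = Counter({"matmul": 8, "all_to_all": 6})
op_time_ms = {"matmul": 1.5, "all_to_all": 0.5}

total_ms = predict_compute_time(target_sequence, op_time_ms)
print(total_ms)  # 8 * 1.5 + 6 * 0.5 = 15.0
```

If the predicted time is unsatisfactory, only `target_sequence` needs to be recomposed for the adjusted configuration before calling the function again.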
[0157] For example, a complete parallel training scheme for large models can be as shown in Figure 11, where X0 and X1 represent the original data after being split for parallel processing; DP0 and DP1 represent different devices that process different data, corresponding to data parallelism; PP0 and PP1 represent parallelism in the temporal dimension, corresponding to pipeline parallelism; and 0-15 represent computing cards, where a computing card can be a GPU / NPU / TPU, etc., which can be used to process different tensors, corresponding to tensor-dimensional parallelism. That is, the target parallel configuration can specifically include, but is not limited to, data-dimensional parallelism, model parameter-dimensional parallelism, temporal-dimensional parallelism, and tensor-dimensional parallelism. Using the scheme provided in this application's embodiments, different parallelism methods can be disabled in turn to obtain, as the basic operator sequence, the minimal subset of operators shared by the operator sequences of all parallelism methods. Parallel configuration can then be performed separately for each parallelism method to obtain the operator sequence under each parallelism method. Finally, the basic operator sequence is subtracted from the operator sequence under each parallelism method to obtain the difference operator sequence for each parallel task under each parallelism method. By combining the basic operator sequence with the difference operator sequences of each parallel task under each parallel mode, a complete overall operator sequence covering every parallel mode can be obtained.
[0158] The existing Mixtral-MOE solution is a deep learning-based model architecture that dynamically combines multiple expert models (i.e., sub-models for specific domains) to process complex tasks efficiently. However, the Mixtral-MOE solution involves numerous operator deployment tasks. Typically, users need to train a wide variety of models, but the number of human experts capable of performance tuning is limited, and manual tuning is inefficient. For example, it might take two to three days to manually tune one large model, while changing user needs might call for 10 large models. Therefore, parallel configuration for large model training often relies on historical experience, which typically does not fully utilize the performance of the cluster.
[0159] Therefore, the method provided in this application embodiment can obtain the required operator sequence more efficiently, thereby quickly providing users with a parallel processing solution for large model training, i.e., an operator sequence. For a fully automated large model training performance prediction system, it is not necessary to know the operator composition in advance. Users can define their own operators, and after collecting profile files for only a few steps, the combination relationship between the operator sequence and the parallel configuration can be obtained. This allows the operator sequence for all configurations to be inferred. In this way, this application embodiment provides a basic tool for obtaining accurate end-to-end operator performance prediction.
[0160] Verification showed that the method provided in this application embodiment can generate 10 profiles within 15 minutes, complete the prediction of the basic operator sequence, and search 1000 specified configurations. Each configuration search took 0.1 seconds, and the total search time to obtain the optimal configuration was 2 minutes. Comparing the predicted operator sequence against the actual operator sequence showed a quantity error of less than 0.5% and an overall operator time performance error of less than 5%.
[0161] The foregoing has described the method flow provided in the embodiments of this application. The following describes the structure of the apparatus for executing the foregoing method flow.
[0162] Referring to Figure 12, the present application provides a schematic structural diagram of an AI model configuration system, including:
[0163] The identical operator determination module 1201 is used to run the first AI model with M parallel configurations respectively, and obtain M first operator sequences executed by any computing device in the AI cluster running the first AI model, wherein each of the M parallel configurations does not include at least one of the N parallel methods of the first AI model, and the at least one parallel method not included in each of the M parallel configurations is different; and to obtain the identical second operator sequence among the M first operator sequences, where M is less than or equal to N;
[0164] The difference operator determination module 1202 is used to run the first AI model using one of the N parallel methods respectively, to obtain the third operator sequence corresponding to each of the N parallel methods, and to calculate the difference operator between the third operator sequence corresponding to each of the N parallel methods and the second operator sequence.
[0165] The configuration generation module 1203 is used to generate the operator sequence in the computing device corresponding to the target parallel configuration based on the number of N parallel modes in the target parallel configuration and the difference operator and second operator sequence for each parallel mode.
[0166] In one possible implementation, the aforementioned M equals N, and the number of at least one parallel mode is 1.
[0167] In one possible implementation, the aforementioned identical operator determination module 1201 is specifically used to obtain the second operator sequence by taking the intersection of M first operator sequences.
[0168] In one possible implementation, the aforementioned difference operator determination module 1202 is specifically used to subtract the second operator sequence from the third operator sequence corresponding to each of the N parallel modes to obtain the difference operator corresponding to each parallel mode.
[0169] In one possible implementation, the aforementioned configuration generation module 1203 is specifically used to combine the difference operator corresponding to each of the N parallel modes in the target parallel configuration with the second operator sequence according to the quantity corresponding to each of the parallel modes, to obtain the operator sequence in the computing device corresponding to the target parallel configuration.
[0170] In one possible implementation, the aforementioned identical operator determination module 1201 is specifically used to: run the first AI model in M parallel configurations through the AI cluster, and collect the M first operator sequences executed by any computing device in the AI cluster.
[0171] In one possible implementation, the aforementioned difference operator determination module 1202 is specifically used to: run the first AI model using one of N parallel methods through the AI cluster; and collect the third operator sequence corresponding to each of the N parallel methods for running the first AI model on any computing device in the AI cluster.
[0172] In one possible implementation, the aforementioned M first operator sequences are operator sequences computed by any computing device when the AI cluster runs the second AI model in parallel using M parallel configurations; the aforementioned third operator sequences corresponding to each of the N parallel methods are operator sequences computed by any computing device when the AI cluster runs the second AI model based on each parallel method.
[0173] The second AI model is obtained by trimming the first AI model by the AI cluster. The types of neural network layers included in the second AI model are the same as those included in the first AI model. The number of neural network layers included in the second AI model is less than the number of neural network layers included in the first AI model. Furthermore, the number of computing devices required to run the second AI model in the AI cluster is less than the number of computing devices required to run the first AI model.
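The relationship between the first and second AI models (same layer types, fewer layers) can be illustrated with a small sketch. The trimming rule used here, keeping at most a fixed number of layers of each type, is only an assumption for illustration; this application does not specify how the trimming is performed, and the function name `trim_layers` and the layer names are hypothetical.

```python
def trim_layers(layer_types, keep_per_type=1):
    """Keep at most keep_per_type layers of each type, preserving the original order."""
    kept = []
    seen = {}  # layer type -> number of layers of that type already kept
    for layer_type in layer_types:
        if seen.get(layer_type, 0) < keep_per_type:
            kept.append(layer_type)
            seen[layer_type] = seen.get(layer_type, 0) + 1
    return kept


# Hypothetical first AI model: 16 repeated attention + MoE blocks.
first_model = ["embedding"] + ["attention", "moe"] * 16 + ["lm_head"]
second_model = trim_layers(first_model, keep_per_type=2)

print(len(first_model), len(second_model))
```

Because every layer type of the first model still appears in the trimmed model, profiles collected on the smaller model can in principle expose the same kinds of operators while requiring fewer computing devices.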
[0174] This application also provides a chip system including a processor and a power supply circuit. The power supply circuit supplies power to the processor, which executes the processing steps corresponding to the method provided in this application. For brevity, further details are omitted here. The processor can be implemented using a GPU, or it can be implemented using computing devices such as a DPU, NPU, XPU, SoC, offload card, or accelerator card.
[0175] This application also provides a management device 100. For example, as shown in Figure 13, the management device 100 includes a bus 102, a processor 104, a memory 106, and a communication interface 108. The processor 104, the memory 106, and the communication interface 108 communicate with each other via the bus 102. The management device 100 can be a server or a terminal device. It should be understood that this application does not limit the number of processors and memories in the management device 100.
[0176] Bus 102 can be a Peripheral Component Interconnect Express (PCIe) bus, an Extended Industry Standard Architecture (EISA) bus, a Unified Bus (Ubus or UB), a Compute Express Link (CXL), a Cache Coherent Interconnect for Accelerators (CCIX), etc. The Unified Bus is also known as the Lingqu Bus. Buses can be divided into address buses, data buses, control buses, etc. For ease of representation, the bus is shown as a single line in Figure 13, but this does not mean that there is only one bus or one type of bus. Bus 102 may include a path for transmitting information between various components of management device 100 (e.g., memory 106, processor 104, communication interface 108).
[0177] The processor 104 may include any one or more of the following computing devices: central processing unit (CPU), graphics processing unit (GPU), microprocessor (MP) or digital signal processor (DSP), ASIC, FPGA, CPLD, NPU, SoC, offload card, accelerator card, etc.
[0178] Memory 106 may include volatile memory, such as random access memory (RAM). Memory 106 may also include non-volatile memory, such as read-only memory (ROM), flash memory, hard disk drive (HDD), or solid state drive (SSD). Furthermore, memory 106 may also be implemented using storage class memory (SCM), phase change memory (PCM), or other types of storage media.
[0179] It is worth noting that the same type of storage medium can be configured in the same computing device to realize the function of memory 106, or two or more types of storage media can be configured to realize the function of memory 106. This application does not limit this.
[0180] The memory 106 stores executable program code, and the processor 104 executes the executable program code to implement the functions of the modules described above with reference to Figure 12, thereby implementing the processing steps in the method provided in this application. That is, the memory 106 stores instructions for executing the method provided in this application.
[0181] Alternatively, the memory 106 may store executable code, which the processor 104 executes to implement the functions of the computing device, etc., thereby implementing the method provided in this application. That is, the memory 106 stores instructions for executing the method provided in this application.
[0182] The communication interface 108 uses transceiver modules such as, but not limited to, network interface cards and transceivers to enable communication between the management device 100 and other devices or communication networks.
[0183] As one possible implementation, the management device 100 may also include a chip system, which includes a processor and a power supply circuit. The power supply circuit supplies power to the processor, and the processor executes the operation steps corresponding to the method provided in this application. For simplicity, further details are omitted here. The processor can be implemented using a GPU, or it can be implemented using computing devices or AI chips such as a DPU, NPU, XPU, SoC, offloading card, or accelerator card.
[0184] As one possible implementation, the management device 100 may include various types of processors 104, meaning the management device 100 is a heterogeneous device. For example, the management device 100 may include a CPU and a GPU, and at least one of the processors 104 may execute the operation steps corresponding to the method provided in this application. For the sake of brevity, further details are omitted here.
[0185] In one possible scenario, the aforementioned management device may be one of the computing devices in a cluster. In this case, embodiments of this application also provide a computing device cluster. The computing device cluster includes at least one computing device. The computing device may be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device may also be a terminal device such as a desktop computer, a laptop computer, or a smartphone.
[0186] Optionally, the aforementioned management device can be one of the devices in the AI cluster. Alternatively, the aforementioned management device can be connected to the AI cluster to collect the operators calculated by the computing devices in the AI cluster.
[0187] Furthermore, the aforementioned management device 100 can also be connected to a computing device cluster via a network to form the cloud service system provided in this application embodiment. For example, as shown in Figure 14, for the structure of the management device 100, refer to the description corresponding to Figure 13 above. The management device 100 can be used to execute the steps performed by the management device in Figure 6 above, and the AI cluster can perform the steps executed by the AI cluster in Figure 6 above.
[0188] This application also provides a computer program product containing instructions. The computer program product may be a software or program product containing instructions, capable of running on a computing device or stored on any usable medium. When the computer program product is run on at least one computing device, it causes the at least one computing device to perform the method provided in this application.
[0189] This application also provides a computer-readable storage medium. The computer-readable storage medium can be any available medium that a computing device can access, or a data storage device, such as a data center, containing one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state drive). The computer-readable storage medium includes instructions that instruct a computing device to perform the method provided in this application.
[0190] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the protection scope of the technical solutions of the embodiments of the present invention.
Claims
1. A method for configuring an artificial intelligence (AI) model, characterized in that the method includes: running a first AI model using M parallel configurations respectively, and obtaining M first operator sequences executed by any computing device in an AI cluster running the first AI model, wherein each of the M parallel configurations does not include at least one of N parallel methods of the first AI model, the at least one parallel method not included in each of the M parallel configurations is different, and M is less than or equal to N; obtaining the identical second operator sequence among the M first operator sequences; running the first AI model using one of the N parallel methods respectively to obtain a third operator sequence corresponding to each of the N parallel methods; calculating the difference operator between the third operator sequence corresponding to each of the N parallel methods and the second operator sequence; and generating the operator sequence in the computing device corresponding to a target parallel configuration based on the number of the N parallel modes in the target parallel configuration, the difference operator for each parallel mode, and the second operator sequence.
2. The method according to claim 1, characterized in that, The M is equal to the N, and the number of the at least one parallel method is 1.
3. The method according to claim 1 or 2, characterized in that, Obtaining the identical second operator sequence among the M first operator sequences includes: The intersection of the M first operator sequences is used to obtain the second operator sequence.
4. The method according to any one of claims 1-3, characterized in that calculating the difference operator between the third operator sequence corresponding to each of the N parallel methods and the second operator sequence includes: subtracting the second operator sequence from the third operator sequence corresponding to each of the N parallel methods to obtain the difference operator corresponding to each parallel method, wherein each of the N parallel methods uses its minimum parallelism.
5. The method according to any one of claims 1-4, characterized in that, The step of generating the operator sequence in the computing device corresponding to the target parallel configuration based on the number of N parallel modes in the target parallel configuration, the difference operator for each parallel mode, and the second operator sequence includes: Based on the quantity corresponding to each of the N parallel modes in the target parallel configuration, the difference operator corresponding to each parallel mode is combined with the second operator sequence to obtain the operator sequence in the computing device corresponding to the target parallel configuration.
6. The method according to any one of claims 1-5, characterized in that running the first AI model using M parallel configurations respectively and obtaining the M first operator sequences executed by any computing device in the AI cluster running the first AI model includes: running the first AI model in the AI cluster using the M parallel configurations respectively, and collecting the M first operator sequences executed by any computing device in the AI cluster running the first AI model.
7. The method according to any one of claims 1-6, characterized in that, The step of running the first AI model using one of the N parallel methods to obtain a third operator sequence corresponding to each of the N parallel methods includes: The first AI model is run using one of the N parallel methods in the AI cluster. Collect the third operator sequence corresponding to each of the N parallel methods when any computing device in the AI cluster runs it.
8. The method according to any one of claims 1-7, characterized in that the M first operator sequences are operator sequences calculated by any one of the computing devices when some computing devices in the AI cluster run a second AI model in parallel using the M parallel configurations; the third operator sequence corresponding to each of the N parallel methods is the operator sequence calculated by any one of the computing devices when the AI cluster runs the second AI model based on each of the parallel methods; and the second AI model is obtained by trimming the first AI model, wherein the types of neural network layers included in the second AI model are the same as those included in the first AI model, but the number of neural network layers included in the second AI model is less than the number of neural network layers included in the first AI model.
9. An AI model configuration system, characterized in that the system includes: an identical operator determination module, used to run the first AI model in M parallel configurations respectively, and obtain M first operator sequences executed by any computing device running the first AI model, wherein each of the M parallel configurations does not include at least one of the N parallel methods of the first AI model, and the at least one parallel method not included in each of the M parallel configurations is different, and to obtain the identical second operator sequence among the M first operator sequences, where M is less than or equal to N; a difference operator determination module, used to run the first AI model using one of the N parallel methods respectively, to obtain the third operator sequence corresponding to each of the N parallel methods, and to calculate the difference operator between the third operator sequence corresponding to each of the N parallel methods and the second operator sequence; and a configuration generation module, used to generate the operator sequence in the computing device corresponding to the target parallel configuration based on the number of N parallel modes in the target parallel configuration, the difference operator for each parallel mode, and the second operator sequence.
10. The system according to claim 9, characterized in that, The M is equal to the N, and the number of the at least one parallel method is 1.
11. The system according to claim 9 or 10, characterized in that, The identical operator determination module is specifically used to obtain the second operator sequence by taking the intersection of the M first operator sequences.
12. The system according to any one of claims 9-11, characterized in that the difference operator determination module is specifically used to subtract the second operator sequence from the third operator sequence corresponding to each of the N parallel methods to obtain the difference operator corresponding to each parallel method, wherein each of the N parallel methods uses its minimum parallelism.
13. The system according to any one of claims 9-12, characterized in that, The configuration generation module is specifically used to combine the difference operator corresponding to each of the N parallel modes in the target parallel configuration with the second operator sequence according to the quantity corresponding to each of the N parallel modes, to obtain the operator sequence in the computing device corresponding to the target parallel configuration.
14. The system according to claim 13, characterized in that the identical operator determination module is specifically used for: running the first AI model in the AI cluster using the M parallel configurations respectively, and collecting the M first operator sequences executed by any computing device in the AI cluster running the first AI model.
15. The system according to any one of claims 9-14, characterized in that, The difference operator determination module is specifically used for: The first AI model is run using one of the N parallel methods in the AI cluster. Collect the third operator sequence corresponding to each of the N parallel methods when any computing device in the AI cluster runs it.
16. The system according to any one of claims 9-15, characterized in that the M first operator sequences are operator sequences calculated by any one of the computing devices when some computing devices in the AI cluster run a second AI model in parallel using the M parallel configurations; the third operator sequence corresponding to each of the N parallel methods is the operator sequence calculated by any one of the computing devices when the AI cluster runs the second AI model based on each of the parallel methods; and the second AI model is obtained by the AI cluster by trimming the first AI model, wherein the types of neural network layers included in the second AI model are the same as those included in the first AI model, but the number of neural network layers included in the second AI model is less than the number of neural network layers included in the first AI model.
17. A management device, characterized in that, The management device is a computing node in the AI cluster. The computing node includes at least one computing device, and the management device also includes a processor and a memory. The processor is configured to execute instructions stored in the memory to cause the management device to perform the operational steps of the method as described in any one of claims 1 to 8.
18. A service system, characterized in that the service system includes a management device and an AI cluster, wherein the management device is configured to perform the operational steps of the method as described in any one of claims 1 to 8, and the AI cluster is configured to run an AI model under the instruction of the management device.
19. A computer program product containing instructions, characterized in that, When the instruction is executed by the device, the device performs the operational steps of the method as described in any one of claims 1 to 8.
20. A computer-readable storage medium, characterized in that, It includes computer program instructions, which, when executed by the device, cause the device to perform the operational steps of the method as described in any one of claims 1 to 8.