Parallel acceleration method and device for MoE sparse large model inference
By employing a differentiated parallel strategy in the pre-filling and decoding stages of the MoE sparse large model, and classifying and processing sequences in parallel according to their length, the problem of unbalanced load caused by the mixing of long and short sequences is solved, thereby improving hardware resource utilization and inference efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING UNIV OF POSTS & TELECOMM
- Filing Date
- 2026-03-16
- Publication Date
- 2026-06-19
Figure CN122241307A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the fields of artificial intelligence and distributed computing technology, and in particular to parallel acceleration methods and devices for MoE sparse large model inference.
Background Technology
[0002] In recent years, with the rapid development of artificial intelligence technology, the parameter size of large language models has grown exponentially, and the computational resources required for their training and inference have also increased dramatically. To reduce the computational overhead of large models, the Mixture-of-Experts (MoE) sparse architecture has emerged. The MoE model replaces the fully connected layers in traditional dense models with a hybrid expert layer composed of a gating function and multiple expert networks, so that each input character (token) activates only a portion of the experts for computation, and computational overhead therefore grows sublinearly with the number of model parameters.
[0003] In parallel inference of MoE sparse large models, a hybrid parallel strategy is usually adopted: the multi-head attention layer adopts data parallelism, that is, the input sequences in the same batch are distributed across different devices for parallel processing, and each device holds complete attention layer parameters; the hybrid expert layer adopts expert parallelism, that is, different experts are placed on different devices. After the gating function calculates the affinity scores of the input characters, the characters are routed to the device where the corresponding expert is located for calculation through an all-to-all communication operation. The calculation result is then returned to the original device through another all-to-all communication operation so that it can enter the next layer of attention calculation.
[0004] Currently, various solutions have been proposed for optimizing large-scale model inference, such as adjusting parallel strategies or optimizing batch processing methods to improve inference efficiency. However, existing solutions still have limitations when handling mixed requests of long and short sequences. Specifically, in parallel inference of MoE sparse large models, there is a load imbalance problem in the multi-head attention computation stage due to the varying lengths of request sequences. This load imbalance is mainly manifested in the following ways: when a batch contains extremely long sequences, its processing time is much longer than that of a batch containing multiple short sequences; or, devices with more short sequences experience a much faster growth rate in processing load than devices with more long sequences due to the simultaneous processing of multiple sequences. This load imbalance further leads to faster-processing devices waiting for slower-processing devices when global synchronization is required via all-to-all communication operations after the multi-head attention layer completes data parallel computation, resulting in wasted system resources and decreased inference efficiency.
[0005] Therefore, there is an urgent need for a parallel acceleration method that can balance the load of multi-head attention computation in mixed long and short sequence scenarios, so as to improve the end-to-end inference efficiency of MoE sparse large models.
Summary of the Invention
[0006] In view of this, embodiments of this application provide a parallel acceleration method and apparatus for MoE sparse large model inference, in order to eliminate or improve one or more defects existing in the prior art.
[0007] The first aspect of this application provides a parallel acceleration method for MoE sparse large model inference, including: in the pre-filling stage of MoE sparse large model inference, classifying each request sequence within the same batch according to a preset pre-filling stage classification strategy, wherein the pre-filling stage classification strategy includes: classifying request sequences with a length less than a first threshold as first-type sequences, classifying request sequences with a length between the first threshold and a second threshold as second-type sequences, and classifying request sequences with a length greater than the second threshold as third-type sequences, the second threshold being greater than the first threshold; if first-type sequences are obtained by classification, distributing each first-type sequence to multiple devices in a data-parallel manner for parallel multi-head attention computation; if second-type sequences are obtained by classification, processing the second-type sequences in a sequence-parallel manner; and if third-type sequences are obtained by classification, dividing each third-type sequence into sub-sequences, assigning the sub-sequences to multiple iterations, and processing the sub-sequences in a sequence-parallel manner in each iteration.
[0008] In some embodiments of this application, the step of dividing the third type of sequence into sub-sequences and allocating each sub-sequence to multiple iterations includes: determining the number of segments for the request sequence based on the ratio of the length of the request sequence to the first threshold; dividing the request sequence evenly into multiple sub-length sequences according to the number of segments; and assigning the sub-length sequences one-to-one to multiple iterations.
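As a minimal sketch of the segmentation described in [0008], the following Python function derives a segment count from the ratio of the sequence length to the first threshold and then splits the sequence evenly. The function name `split_ultra_long` and the use of the ceiling of the ratio are illustrative assumptions; the text only states that the count is determined based on the ratio.

```python
import math


def split_ultra_long(length, first_threshold):
    """Return one (start, end) token range per iteration for a third-type sequence."""
    # Assumption: segment count = ceil(length / first_threshold).
    num_segments = math.ceil(length / first_threshold)
    base, extra = divmod(length, num_segments)
    ranges, start = [], 0
    for i in range(num_segments):
        # Spread any remainder over the first `extra` segments so sizes differ by at most 1.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


# A 3000-token request with a first threshold of 1024 tokens.
print(split_ultra_long(3000, 1024))
```

Each returned range would then be handled in its own iteration, so no single iteration carries the whole ultra-long sequence.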
[0009] In some embodiments of this application, processing the sub-length sequence in a sequence-parallel manner in each iteration includes: In each iteration, if the second type of sequence exists in the current iteration, the sub-length sequence corresponding to the current iteration and the second type of sequence existing in the current iteration are jointly divided into sub-sequences using the sequence parallel method, so that each device can perform multi-head attention calculation in parallel on the assigned sub-sequences.
[0010] In some embodiments of this application, the sequence parallelism method includes: dividing the sequence to be processed evenly into 2d segments, wherein the sequence to be processed includes the second type of sequence and/or the sub-length sequence corresponding to the third type of sequence in a single iteration, and d is the number of devices; and assigning the i-th segment and the i-th-from-last segment to the i-th device.
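The 2d-segment assignment in [0010] can be sketched as follows. The helper names `paired_partition` and `causal_cost` are hypothetical, devices are indexed from 0 rather than 1, and the cost model (one unit per query-key pair allowed by a causal mask) is an assumption used only to illustrate why pairing an early segment with a late one tends to even out the load.

```python
def paired_partition(seq_len, num_devices):
    """Split [0, seq_len) into 2d segments; device i gets segment i and segment 2d-1-i."""
    num_segments = 2 * num_devices
    bounds = [seq_len * k // num_segments for k in range(num_segments + 1)]
    segments = [(bounds[k], bounds[k + 1]) for k in range(num_segments)]
    return [[segments[i], segments[num_segments - 1 - i]] for i in range(num_devices)]


def causal_cost(ranges):
    """Count (query, key) pairs with key position <= query position (assumed cost model)."""
    return sum(q + 1 for start, end in ranges for q in range(start, end))


assignment = paired_partition(1600, 4)
costs = [causal_cost(ranges) for ranges in assignment]
print(assignment[0], costs)
```

Under this assumed cost model, every device in the example ends up with the same number of query-key pairs when the length is divisible by 2d.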
[0011] In some embodiments of this application, in an iteration where the first type of sequence exists, each of the devices is configured to: perform data transmission required for multi-head attention computation using the sequence parallel method while performing multi-head attention computation using the data parallel method, and after receiving the transmitted data, perform multi-head attention computation in parallel on the sub-sequences allocated in the current iteration.
[0012] In some embodiments of this application, the data transmission required for performing multi-head attention computation using the sequence parallel approach includes: using a ring communication method to transmit key-value cache data between devices, and, when key-value cache data is received from a neighboring device, calculating attention with the local query vector.
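The ring-style exchange in [0012] can be illustrated with a single-process simulation. This is only a sketch under stated assumptions: the function names are hypothetical, the causal mask is omitted for brevity, a single attention head is used, and the running-maximum softmax accumulation is one common way to combine partial results, which the text does not prescribe.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ring_attention(q_blocks, k_blocks, v_blocks):
    """Simulate d devices; device i owns q_blocks[i] and starts with KV block i."""
    d = len(q_blocks)
    scale = 1.0 / math.sqrt(len(q_blocks[0][0]))
    v_dim = len(v_blocks[0][0])
    # Per-query running state: max score, softmax denominator, weighted value sum.
    state = [[[-math.inf, 0.0, [0.0] * v_dim] for _ in blk] for blk in q_blocks]
    for step in range(d):
        for dev in range(d):
            src = (dev - step) % d  # KV block that has reached `dev` after `step` hops
            for qi, q in enumerate(q_blocks[dev]):
                m, denom, acc = state[dev][qi]
                for k, v in zip(k_blocks[src], v_blocks[src]):
                    s = dot(q, k) * scale
                    m_new = max(m, s)
                    rescale = math.exp(m - m_new)  # rescale earlier partial sums
                    w = math.exp(s - m_new)
                    denom = denom * rescale + w
                    acc = [a * rescale + w * x for a, x in zip(acc, v)]
                    m = m_new
                state[dev][qi] = [m, denom, acc]
    return [[[a / denom for a in acc] for _, denom, acc in blk] for blk in state]


def full_attention(qs, ks, vs):
    """Reference: ordinary softmax attention over the whole sequence."""
    scale = 1.0 / math.sqrt(len(qs[0]))
    out = []
    for q in qs:
        scores = [dot(q, k) * scale for k in ks]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, vs)) / total
                    for j in range(len(vs[0]))])
    return out


rng = random.Random(0)
make = lambda: [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)] for _ in range(4)]
q_blocks, k_blocks, v_blocks = make(), make(), make()
flat = lambda blocks: [row for blk in blocks for row in blk]
ring_out = flat(ring_attention(q_blocks, k_blocks, v_blocks))
ref_out = full_attention(flat(q_blocks), flat(k_blocks), flat(v_blocks))
max_err = max(abs(a - b) for r, f in zip(ring_out, ref_out) for a, b in zip(r, f))
print(max_err)
```

In a real deployment the inner loop over devices would run concurrently, with each device forwarding its current key-value block to its neighbor while computing on the block it already holds.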
[0013] In some embodiments of this application, the data transmission required for performing multi-head attention computation using the sequence parallel approach includes: exchanging, through at least one all-to-all communication operation, the query, key, and value tensors partitioned along the sequence dimension between the devices, and performing multi-head attention computation in parallel after the complete sequence data is obtained; wherein the all-to-all communication operation and the multi-head attention computation in the data parallel mode are performed asynchronously.
[0014] The second aspect of this application provides another parallel acceleration method for MoE sparse large model inference, including: In the decoding stage of MoE sparse large model inference, each request sequence within the same batch is classified according to a preset decoding stage classification strategy; wherein, the decoding stage classification strategy includes: classifying request sequences with a length less than a target threshold as short sequences, and classifying request sequences with a length equal to or greater than the target threshold as long sequences; If the short sequences are obtained by classification, each short sequence is distributed to multiple devices in a data-parallel manner for parallel multi-head attention computation; If the long sequence is obtained by classification, the key-value cache of the long sequence is divided into multiple key-value cache segments and stored on each of the devices. The input character of the current decoding step is broadcast to the multiple devices, so that each device uses its locally stored key-value cache segment and the input character to perform self-attention calculation to obtain a local result. The local results obtained by each device are reduced to obtain the output character of the current decoding step. The key-value cache corresponding to the output character is stored on the device whose current key-value cache capacity meets the preset load balancing conditions.
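The long-sequence decoding path in [0014] can be sketched as follows: the query of the current decoding step is made available to every device, each device attends over its own key-value cache segment, and the partial results are reduced. The function names are hypothetical, a single head is used, each segment is assumed to be non-empty, and the max-and-denominator bookkeeping is an assumed way to keep the reduction numerically equal to a softmax over the full cache.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def local_attention(q, keys, values):
    """Partial result on one device: (max score, denominator, unnormalised output)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return m, sum(weights), out


def reduce_partials(partials):
    """Combine per-device partial results into the final attention output."""
    m_global = max(m for m, _, _ in partials)
    denom = 0.0
    out = [0.0] * len(partials[0][2])
    for m, local_denom, local_out in partials:
        r = math.exp(m - m_global)  # bring each partial result onto a common scale
        denom += local_denom * r
        out = [a + r * x for a, x in zip(out, local_out)]
    return [a / denom for a in out]


rng = random.Random(1)
vec = lambda n: [rng.uniform(-1, 1) for _ in range(n)]
query = vec(4)
keys = [vec(4) for _ in range(8)]
values = [vec(4) for _ in range(8)]

# Four devices, each holding a contiguous two-token KV cache segment.
partials = [local_attention(query, keys[i:i + 2], values[i:i + 2]) for i in range(0, 8, 2)]
split_out = reduce_partials(partials)
single_out = reduce_partials([local_attention(query, keys, values)])
max_err = max(abs(a - b) for a, b in zip(split_out, single_out))
print(max_err)
```

The comparison against the single-device result shows that, under these assumptions, splitting the cache does not change the attention output.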
[0015] In some embodiments of this application, the input character of the current decoding step includes: the first character generated in the pre-filling stage of MoE sparse large model inference, or the output character generated in the previous decoding step.
[0016] In some embodiments of this application, the preset load balancing condition includes: the current key-value cache capacity is the lowest among all devices.
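The load balancing condition in [0016] amounts to a least-loaded placement rule, sketched below. The function name `pick_device`, the use of token counts as the capacity measure, and breaking ties in favor of the lowest device index are assumptions for illustration.

```python
def pick_device(kv_tokens_per_device):
    """Return the index of the device currently holding the fewest KV cache entries."""
    return min(range(len(kv_tokens_per_device)), key=kv_tokens_per_device.__getitem__)


# Four devices holding segments of one long sequence; simulate five decoding steps.
kv_tokens = [1024, 1024, 1024, 1023]
placements = []
for _ in range(5):
    target = pick_device(kv_tokens)
    kv_tokens[target] += 1  # store the new token's key-value entry on that device
    placements.append(target)

print(placements, kv_tokens)
```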
[0017] In some embodiments of this application, if the short sequence exists in the decoding step, each of the devices is used to: perform broadcast and reduction operations while performing multi-head attention computation using the data parallel approach.
[0018] In some embodiments of this application, dividing the key-value cache of the long sequence into multiple key-value cache segments and storing them separately on each of the devices includes: dividing the key-value cache of the long sequence into multiple key-value cache segments based on the number of devices, and storing the key-value cache segments one-to-one on the devices.
[0019] The third aspect of this application provides yet another parallel acceleration method for end-to-end inference of MoE sparse large models, including: The parallel acceleration method for MoE sparse large model inference as described in the first aspect above, and the parallel acceleration method for MoE sparse large model inference as described in the second aspect above.
[0020] The fourth aspect of this application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the parallel acceleration method for MoE sparse large model inference described in the first aspect, the parallel acceleration method for MoE sparse large model inference described in the second aspect, or the parallel acceleration method for MoE sparse large model end-to-end inference described in the third aspect.
[0021] The fifth aspect of this application provides an electronic device and a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the parallel acceleration method for MoE sparse large model inference described in the first aspect, the parallel acceleration method for MoE sparse large model inference described in the second aspect, or the parallel acceleration method for MoE sparse large model end-to-end inference described in the third aspect.
[0022] The sixth aspect of this application provides a computer program product, including a computer program that, when executed by a processor, implements the parallel acceleration method for MoE sparse large model inference described in the first aspect, the parallel acceleration method for MoE sparse large model inference described in the second aspect, or the parallel acceleration method for MoE sparse large model end-to-end inference described in the third aspect.
[0023] The parallel acceleration method for MoE sparse large model inference provided in this application addresses the scenario of mixed long and short sequences in MoE sparse large model inference by employing differentiated parallel strategies adapted to the computational characteristics of the pre-filling and/or decoding stages. In the pre-filling stage, the request sequences can be classified into short, long, and ultra-long sequences based on their length. Data parallelism is used for short sequences, sequence parallelism for long sequences, and cross-iteration sequence parallelism for ultra-long sequences. Similarly, in the decoding stage, the request sequences can be classified into short and long sequences based on their length. Data parallelism is used for short sequences, and sequence parallelism based on key-value cache partitioning is used for long sequences. By classifying sequences by length and applying differentiated parallelism in the pre-filling and/or decoding stages, this application can solve the problem of uneven resource utilization caused by the mixture of long and short sequences in MoE sparse large model inference, improving hardware resource utilization and end-to-end inference efficiency.
[0024] Additional advantages, objectives, and features of this application will be set forth in part in the description which follows, and will in part become apparent to those skilled in the art upon review of the following description, or may be learned by practice of the application. The objectives and other advantages of this application can be realized and obtained by means of the structures specifically pointed out in the specification and drawings.
[0025] Those skilled in the art will understand that the purposes and advantages that can be achieved with this application are not limited to those specifically described above, and that the above and other purposes that this application can achieve will be more clearly understood from the following detailed description.
Attached Figure Description
[0026] The accompanying drawings, which are included to provide a further understanding of this application and form part of this application, do not constitute a limitation thereof. The components in the drawings are not drawn to scale but are merely for illustrating the principles of this application. For ease of illustration and description of certain parts of this application, corresponding portions in the drawings may be enlarged, i.e., may appear larger relative to other components in an exemplary device actually manufactured according to this application. In the drawings: Figure 1 is a schematic diagram of the basic architecture of the MoE hybrid expert model.
[0027] Figure 2 is a schematic diagram of the data flow when the MoE sparse large model performs parallel inference on four computing devices.
[0028] Figure 3 is a schematic diagram of an example of self-attention computation based on causal attention masks.
[0029] Figure 4 is a first flowchart of the first parallel acceleration method for MoE sparse large model inference provided in an embodiment of this application.
[0030] Figure 5 is a schematic diagram of how request sequences within the same batch are classified by length.
[0031] Figure 6 is a second flowchart of the first parallel acceleration method for MoE sparse large model inference provided in an embodiment of this application.
[0032] Figure 7 is a schematic diagram of an example of the hybrid parallel strategy combining data parallelism and sequence parallelism in the pre-filling stage of MoE sparse large model inference provided in this application.
[0033] Figure 8 is a schematic diagram of the load-balanced segmentation method for ultra-long sequences in the pre-filling stage of MoE sparse large model inference provided in this application.
[0034] Figure 9 is a schematic diagram of the execution of the multi-head attention layer under the hybrid parallel strategy in the pre-filling stage, corresponding to the example of Figure 7.
[0035] Figure 10 is a flowchart of the second parallel acceleration method for MoE sparse large model inference provided in an embodiment of this application.
[0036] Figure 11 is a schematic diagram of an example of the hybrid parallel strategy combining data parallelism and sequence parallelism in the multi-head attention layer in the decoding stage of MoE sparse large model inference provided in this application.
[0037] Figure 12 is a schematic diagram of the communication-computation hiding (overlap) method in the hybrid parallel strategy in the decoding stage of MoE sparse large model inference provided in this application.
[0038] Figure 13 is a flowchart of the third parallel acceleration method for MoE sparse large model inference provided in an embodiment of this application.
Detailed Implementation
[0039] To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the embodiments and accompanying drawings. Here, the illustrative embodiments and their descriptions are used to explain this application, but are not intended to limit it.
[0040] It should also be noted that, in order to avoid obscuring this application with unnecessary details, only the structures and / or processing steps closely related to the solution according to this application are shown in the accompanying drawings, while other details that are not closely related to this application are omitted.
[0041] It should be emphasized that the term "including / comprises" as used herein refers to the presence of a feature, element, step, or component, but does not exclude the presence or addition of one or more other features, elements, steps, or components.
[0042] It should also be noted that, unless otherwise specified, the term "connection" in this article can refer not only to a direct connection, but also to an indirect connection involving an intermediary.
[0043] In the following description, embodiments of the present application will be illustrated with reference to the accompanying drawings. In the drawings, the same reference numerals represent the same or similar parts, or the same or similar steps.
[0044] First, it should be noted that the basic architecture of the MoE hybrid expert model mainly includes a multi-head attention layer and a hybrid expert layer, as shown in Figure 1. This architecture can be repeated and concatenated multiple times to form a multi-layer deep neural network. Compared to traditional dense large models, the MoE sparse large model replaces the MLP layer in traditional dense models with a hybrid expert layer consisting of a gating function and multiple experts, where FFN_0, FFN_1, ..., FFN_{E-1} represent the hybrid expert layer model parameters corresponding to the different experts; the multi-head attention layer is the same as in traditional dense models. In the MoE sparse large model, after each character (token) passes through the gating function, its affinity score with each expert is calculated, and then the TopK experts with the highest affinity scores are selected for computation. Therefore, in the MoE sparse large model, each character activates only some experts for computation and does not need to be computed by all experts, so that computational cost grows sublinearly with model capacity and the number of model parameters can be expanded rapidly. Compared with traditional dense large models, the MoE sparse large model can significantly reduce computational cost with the same number of parameters. Current mainstream large models such as DeepSeek and Mixtral all adopt the MoE hybrid expert architecture to reduce costs.
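The TopK expert selection described above can be sketched in a few lines of Python. The function name `top_k_gate`, the softmax used to turn gate outputs into affinity scores, and the renormalisation of the selected weights are assumptions for illustration; the text only states that the TopK experts with the highest affinity scores are selected.

```python
import math


def top_k_gate(gate_logits, k):
    """Return [(expert_index, weight)] for the k experts with the highest affinity."""
    top = max(gate_logits)
    exps = [math.exp(x - top) for x in gate_logits]
    total = sum(exps)
    scores = [e / total for e in exps]  # assumed softmax affinity scores
    chosen = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
    norm = sum(scores[e] for e in chosen)
    # Assumption: weights of the selected experts are renormalised to sum to 1.
    return [(e, scores[e] / norm) for e in chosen]


# One token, four experts (FFN_0 .. FFN_3), two experts activated.
routing = top_k_gate([0.1, 2.0, -1.0, 1.5], k=2)
print(routing)
```

Only the experts returned by the gate run their feed-forward computation for this token, which is where the saving relative to a dense layer comes from.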
[0045] In parallel inference of sparse large-scale MoE models, multi-head attention layers typically employ traditional data parallelism, where each device stores the complete multi-head attention layer model parameters, and the entire input sequence is divided and processed in parallel across different devices. Hybrid expert layers, on the other hand, typically employ expert parallelism, where all experts are evenly distributed across different devices to partition the expert parameters, and the characters processed by each expert on each device are determined by a gating function. In the data parallelism of the multi-head attention layer, the input sequence is divided across devices for multi-head attention layer computation. Afterwards, based on the affinity score calculated by the gating function, an all-to-all communication operation routes the characters to the corresponding expert's device for expert-parallel computation of the MoE layer. After the MoE layer computation is complete, another all-to-all communication operation routes the characters back to their original device, thus enabling the data parallel computation of the next multi-head attention layer. The computation process is shown in Figure 2, which takes four devices as an example: the multi-head attention layer uses data parallelism, and the hybrid expert layer uses expert parallelism.
[0046] In the parallel inference scenario of the MoE sparse large model shown in Figure 2, after the multi-head attention layer completes data parallel computation, it calls the All-to-All communication operation to route characters to the corresponding expert's device. This All-to-All operation triggers inter-process synchronization. When processing requests with varying lengths, the multi-head attention layer computation can easily lead to load imbalance. The inter-process synchronization caused by the All-to-All operation results in faster processing devices waiting for slower ones, thus reducing computational efficiency. Optimizing the system for requests with varying lengths is a significant challenge for accelerating parallel inference in the MoE sparse large model.
[0047] Current technologies do not consider the load imbalance problem caused by multi-head attention layer data parallel computation when processing long and short sequence requests in expert parallel scenarios. When processing long and short sequence requests simultaneously, mainstream methods balance the key-value cache (KV Cache) across data parallel devices. However, the self-attention computation cost in the pre-filling stage of multi-head attention computation is on the order of the square of the sequence length. Self-attention computation based on causal attention masks is shown in Figure 3, where the gray squares represent the attention scores that need to be calculated. It is evident that the computational load of self-attention during the pre-filling stage is proportional to the square of the sequence length. Therefore, even if the total sequence length of each batch is the same, the computational load between different batches can vary significantly due to the varying lengths of individual requests within each batch. For example, one batch might contain a 4K request sequence, while another batch contains four 1K request sequences. Although the total length of a 4K sequence and four 1K sequences is the same, the computational load of the 4K sequence is approximately three times that of the four 1K sequences, resulting in an imbalance in the multi-head attention computational load between batches. In the MoE sparse large model expert parallel inference scenario, after the multi-head attention layer completes data parallel computation, it calls an All-to-All communication operation to route characters to the corresponding expert's device. This All-to-All operation triggers inter-process synchronization, meaning that global synchronization between processes is required after the multi-head attention layer completes data parallel computation. Therefore, the imbalance in the multi-head attention computational load between batches caused by varying request lengths leads to a waste of system resources.
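The quadratic growth described above can be checked with a small count of the attention scores kept by a causal mask. This is a simplified sketch: the function name `causal_score_count` is hypothetical, and only score entries are counted. On this narrow measure the gap between one 4K sequence and four 1K sequences comes out close to four to one; the "approximately three times" figure in the text may rest on a fuller cost model (for example, one that also includes terms linear in the sequence length), which is not modeled here.

```python
def causal_score_count(length):
    """Number of attention scores computed under a causal mask: 1 + 2 + ... + length."""
    return length * (length + 1) // 2


one_long = causal_score_count(4096)        # batch with a single 4K request
four_short = 4 * causal_score_count(1024)  # batch with four 1K requests
ratio = one_long / four_short
print(one_long, four_short, round(ratio, 2))
```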
[0048] In existing work, Sarathi-Serve (a system for optimizing the throughput and latency tradeoff of large language model inference) proposed a chunked prefill technique. This technique divides long sequences into chunks, and the prefilling and decoding of each chunk are performed in a mixed batch process. This reduces the interference of long sequence prefilling on decoding computation, thereby achieving the performance effects of reducing latency and improving throughput. However, the main optimization goal of chunked prefilling is local computation under data parallelism, which does not involve inter-process synchronization caused by all-to-all full-switch communication under expert parallelism. Therefore, it cannot solve the performance problem caused by long and short sequence requests in parallel inference of MoE sparse large models.
[0049] Another related work is LoongServe (a long-context large language model service system with an elastic sequence parallelism mechanism), which proposes an elastic sequence parallelism mechanism that can dynamically adjust the degree of parallelism of sequence parallelism based on workload conditions such as sequence length and inference stage, thereby improving resource utilization. However, LoongServe only uses sequence parallelism and adopts the same sequence parallelism strategy within the same batch of iterations, resulting in high communication overhead when processing long and short sequence requests within the same batch.
[0050] Furthermore, existing technologies do not consider the key-value cache imbalance caused by long-short sequence inference during the decoding phase of multi-head attention computation. At the beginning of decoding, each device is allocated a key-value cache of equal capacity to balance the load. However, the key-value cache on devices with more short sequences grows significantly faster than on devices with more long sequences, leading to load imbalance. For example, at the beginning of decoding, device A is responsible for decoding one 4K input sequence, while device B is responsible for decoding four 1K input sequences. Although the key-value cache capacities on both devices are the same at the beginning of decoding, the key-value cache of device B grows four times faster than that of device A, because device A generates one new character (token) per iteration, while device B generates four new characters (tokens) per iteration. As the decoding phase progresses, the key-value cache capacity of device B will be significantly higher than that of device A, resulting in load imbalance. Current technologies either have not optimized well for this situation or have ignored the problem altogether.
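The device A and device B example above can be reproduced with a short calculation. The function name `kv_tokens_after` is hypothetical, and the sketch assumes that every sequence emits exactly one token per decoding step and that no sequence finishes during the simulated steps.

```python
def kv_tokens_after(prompt_lengths, steps):
    """Total KV cache entries on a device after `steps` decoding iterations."""
    # Each resident sequence appends one key-value entry per iteration.
    return sum(prompt_lengths) + len(prompt_lengths) * steps


device_a = [4096]      # one 4K sequence
device_b = [1024] * 4  # four 1K sequences

for steps in (0, 1000):
    print(steps, kv_tokens_after(device_a, steps), kv_tokens_after(device_b, steps))
```

Both devices start with the same number of cached tokens, but after 1000 iterations device B holds several thousand more entries than device A.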
[0051] In other words, in parallel inference of MoE sparse large models, there is an imbalance in attention computation load when processing long and short sequences, whether in the pre-filling stage or the decoding stage of the multi-head attention computation stage.
[0052] Based on this, to solve the aforementioned technical problems, this application proposes solutions for the pre-filling stage, the decoding stage, and the complete multi-head attention computation stage including the pre-filling and decoding stages of large model inference. Specifically, this application provides embodiments of a parallel acceleration method for the first MoE sparse large model inference in the pre-filling stage, a parallel acceleration method for the second MoE sparse large model inference in the decoding stage, and a parallel acceleration method for the third MoE sparse large model inference in both the pre-filling and decoding stages. It also provides other embodiments of corresponding electronic devices, computer storage media, and computer program products, combining a hybrid parallel strategy of data parallelism and sequence parallelism to solve the load imbalance problem when processing long and short sequences in parallel inference of MoE sparse large models, improve hardware resource utilization, and thus improve the inference efficiency of MoE sparse large models. The specific implementation is described in detail through the following embodiments.
[0053] Based on this, embodiments of this application provide a parallel acceleration method for MoE sparse large model inference that can be implemented by a parallel acceleration device for MoE sparse large model inference. Referring to Figure 4, the first parallel acceleration method for MoE sparse large model inference specifically includes the following: Step 100: In the pre-filling stage of MoE sparse large model inference, each request sequence in the same batch is classified according to a preset pre-filling stage classification strategy; wherein, the pre-filling stage classification strategy includes: classifying request sequences with a length less than a first threshold as first-class sequences, classifying request sequences with a length between the first threshold and a second threshold as second-class sequences, and classifying request sequences with a length greater than the second threshold as third-class sequences; wherein, the second threshold is greater than the first threshold.
[0054] It should be noted that in the pre-filling stage of MoE sparse large model inference, multiple request sequences within the same batch are first obtained. Each request sequence consists of several consecutive characters (tokens), such as a question entered by the user or a piece of text. Two length thresholds can be pre-set: a first threshold and a second threshold, with the second threshold being greater than the first threshold. For each request sequence within this batch, its length L (i.e., the number of characters it contains) is obtained, and it is classified according to the following rules: (1) If L < the first threshold, then the request sequence is classified as a first-class sequence (i.e., a short sequence). (2) If the first threshold ≤ L < the second threshold, then the request sequence is classified as a second type of sequence (i.e., a long sequence). (3) If L ≥ the second threshold, then the request sequence is classified as a third type of sequence (i.e., an ultra-long sequence).
[0055] In one specific implementation, the second threshold can be twice the first threshold. These two thresholds can be set according to actual hardware configuration and performance requirements; for example, the first threshold T can be set to 1024 characters, and the second threshold 2T can be set to 2048 characters. Then, based on the pre-set first threshold T, requests within the same batch are divided into three categories, as shown in Figure 5: sequences with a length less than the first threshold T are classified as first-type sequences; sequences with a length greater than or equal to T and less than the second threshold 2T are classified as second-type sequences; and sequences with a length greater than or equal to 2T are classified as third-type sequences.
[0056] For example, assuming the first threshold is 1024 characters and the second threshold is 2048 characters, a request sequence with a length of 512 characters is classified as a first-class sequence, a request sequence with a length of 1500 characters is classified as a second-class sequence, and a request sequence with a length of 3000 characters is classified as a third-class sequence.
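The classification rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of this application: the function name `classify_request` is hypothetical, and the default thresholds (1024 and 2048) are simply the example values given above.

```python
def classify_request(length: int, t1: int = 1024, t2: int = 2048) -> int:
    """Return 1, 2 or 3 for a first-, second- or third-class sequence.

    `t1` and `t2` are the first and second thresholds; the defaults are the
    example values from the text, not fixed parameters of the method.
    """
    if t2 <= t1:
        raise ValueError("the second threshold must exceed the first threshold")
    if length < t1:
        return 1  # short sequence: handled with data parallelism
    if length < t2:
        return 2  # long sequence: handled with sequence parallelism
    return 3      # ultra-long sequence: cross-iteration sequence parallelism


batch_lengths = [512, 1500, 3000]
classes = [classify_request(n) for n in batch_lengths]
print(classes)  # [1, 2, 3]
```

The boundary cases follow the rules (1) to (3) above: a length exactly equal to the first threshold falls into the second class, and a length exactly equal to the second threshold falls into the third class.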
[0057] Then, different parallel strategies are set according to the classification. See steps 200 to 400 below for details.
[0058] Step 200: If the first type of sequence is obtained by classification, then each first type of sequence is distributed to multiple devices in a data-parallel manner to perform parallel multi-head attention computation.
[0059] Specifically, in step 200, processing in a data-parallel manner means evenly distributing all first-class sequences within the same batch according to the number of devices, so that each device is assigned several complete first-class sequences. Each device independently performs multi-head attention calculation on its assigned first-class sequences, and no communication is required between devices at this stage.
[0060] For example, suppose there are 4 computing devices and a total of 8 first-class sequences in the batch. Each device is assigned 2 first-class sequences and computes the multi-head attention for these 2 sequences.
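The data-parallel distribution in this example can be sketched as follows. The round-robin order and the name `distribute_data_parallel` are assumptions made for illustration; the text only requires that whole first-class sequences be spread evenly over the devices.

```python
def distribute_data_parallel(seq_ids: list, num_devices: int) -> list:
    """Assign whole sequences to devices in round-robin order.

    No sequence is split, so each device can compute attention for its
    sequences without communicating with the other devices.
    """
    buckets = [[] for _ in range(num_devices)]
    for index, seq_id in enumerate(seq_ids):
        buckets[index % num_devices].append(seq_id)
    return buckets


# 8 first-class sequences on 4 devices: 2 sequences per device.
assignment = distribute_data_parallel([f"s{i}" for i in range(8)], 4)
print(assignment)
```

Balancing by sequence count is the simplest choice; an implementation could equally balance by total token count per device.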
[0061] Step 300: If the second type of sequence is obtained by classification, then the second type of sequence is processed in a sequence parallel manner.
[0062] Specifically, in step 300, the sequence parallel approach means: dividing a second-type sequence into multiple subsequences along the sequence dimension, and distributing these subsequences to multiple devices, with each device computing the multi-head attention of its assigned subsequence in parallel.
[0063] For example, suppose there is a second-class sequence of length 1500 and a total of 4 computing devices. The sequence can be divided into 4 subsequences (each subsequence of length 375) and distributed to the 4 devices. Each device can then compute its own subsequence in parallel.
[0064] Step 400: If the third type of sequence is obtained by classification, the third type of sequence is divided into sub-length sequences, and the sub-length sequences are assigned to multiple iterations; in each iteration, the corresponding sub-length sequence is processed in the sequence parallel manner.
[0065] Specifically, in step 400, if a third type of sequence is obtained after classification, it is processed using a cross-iteration sequence parallel approach. The cross-iteration sequence parallel approach includes the following sub-steps: (1) Dividing into sub-length sequences: The third type of sequence is divided into multiple sub-length sequences along the sequence dimension. It should be noted that the length of a sub-length sequence is usually controlled between the first threshold and the second threshold to ensure the efficiency of sequence-parallel processing in each iteration. However, in some implementations, the length of a sub-length sequence may be less than the first threshold. In this case, data parallelism or sequence parallelism can be used in the iteration as needed, which is still within the scope of protection of this application.
[0066] (2) Assignment to multiple iterations: The sub-length sequences obtained from the division are assigned one-to-one to multiple iterations for execution. Each iteration processes only one sub-length sequence, and different iterations are executed sequentially.
[0067] (3) Processing within each iteration: In each iteration, the sub-length sequence corresponding to this iteration is taken as the object to be processed and processed in the same sequence parallel method as the second type of sequence: the sub-length sequence is divided into multiple sub-sequences and distributed to multiple devices for parallel multi-head attention calculation.
[0068] For example, in the first iteration, a sub-length sequence of length 1500 is divided into 4 sub-sequences (each of length 375) and distributed to 4 devices for parallel computation; the second iteration proceeds in the same way.
[0069] In this way, the computational load of the third type of sequence is distributed across multiple iterations, which can effectively avoid the problem of unbalanced load caused by excessive computation in a single iteration.
[0070] In other words, the first parallel acceleration method for MoE sparse large model inference provided in this application embodiment sets different parallel strategies according to the classification. Short sequence requests within a batch are evenly distributed to the devices for data parallel computation, and short sequences incur no communication overhead during computation. Long sequence requests within a batch are evenly divided along the sequence dimension and distributed to the devices for sequence parallel computation. The sequence parallel computation can be implemented using a ring attention approach, that is, a ring communication topology is constructed for all devices, and each process passes its local or newly received KV (i.e., Key and Value) to the neighboring process while performing attention computation between the local or newly received KV and the local Query, so that communication and computation hide each other. It can also be implemented through an all-to-all communication operation, which will be described in detail in subsequent embodiments. For ultra-long sequences within a batch, this application proposes a cross-iteration sequence parallel computation method, that is, the pre-filling computation of an ultra-long sequence is spread across multiple iterations to reduce its impact on the computation of other sequences in the same batch, while sequence parallel computation is still used in each iteration.
[0071] As described above, the parallel acceleration method for MoE sparse large model inference provided in this application, during the pre-filling stage of MoE sparse large model inference, distributes the huge computational load across multiple iterations by dividing the third type of sequence (ultra-long sequence) into multiple sub-long sequences and allocating them to multiple iterations for execution, effectively avoiding excessive computation in a single iteration. Simultaneously, a sequence parallel approach is adopted for the second type of sequence (ordinary long sequence), dividing it into multiple devices for parallel computation, further balancing the computational load among devices. By balancing the computational load among devices, attention computation is completed almost simultaneously on each device, minimizing synchronization waiting time and improving hardware resource utilization. For the third type of sequence (ultra-long sequence), cross-iteration sequence parallelism allows its pre-filling computation to be performed in parallel with the iterative computation of other sequences (including short sequences and ordinary long sequences), preventing ultra-long sequences from blocking the entire inference process, significantly reducing the first character generation latency, and improving user experience. Furthermore, the pre-filling stage enables full utilization of computing resources and load balancing, laying a solid foundation for the smooth execution of the subsequent decoding stage. This, in turn, improves the end-to-end inference efficiency of the MoE sparse large model as a whole. 
In other words, the first parallel acceleration method for MoE sparse large model inference provided in this application can effectively solve the problem of unbalanced computing load caused by the mixing of long and short sequences in parallel inference of MoE sparse large models, improve hardware resource utilization, and reduce inference latency, and it therefore represents significant technical progress and has practical value.
[0072] To further address the issue of how to reasonably determine the number of sub-length sequences so as to control the computational load of each iteration and ensure the effective implementation of cross-iteration sequence parallelism, in an embodiment of the first parallel acceleration method for MoE sparse large model inference provided in this application, referring to Figure 6, step 400 specifically includes the following: Step 410: If the third type of sequence is obtained by classification, the number of segments of the request sequence is determined according to the ratio of the length of the request sequence to the first threshold.
[0073] Specifically, if the length of the third type of sequence is L, then the number of segments I is determined by calculating the ratio of L to the first threshold T. In practical applications, rounding down can be used, i.e., $I = \lfloor L / T \rfloor$. Because $I \cdot T \le L < (I + 1) \cdot T$, each sub-length sequence produced by an even split contains at least T characters (tokens) and fewer than 2T characters (tokens), so its length stays between the first threshold and twice the first threshold, thereby guaranteeing that the computational load of each iteration is controllable.
[0074] For example, assuming the first threshold T = 1024 characters (tokens), if the length of a certain third-class sequence L = 5000, then L / T ≈ 4.88, and the number of segments after rounding down is I = 4; if L = 3000, then L / T ≈ 2.93, and the number of segments is I = 2. For cases where L is exactly a multiple of T, such as L = 4096 (L / T = 4 when T = 1024), the number of segments is I = 4.
[0075] It should be noted that rounding down makes the lengths of the sub-length sequences approximately equal, each not less than T and less than 2T, thereby achieving load balancing.
[0076] Step 420: Evenly divide the request sequence into multiple sub-length sequences according to the number of segments.
[0077] Specifically, the average splitting method is implemented by dividing the original sequence evenly according to its length, making the length of each sub-sequence as equal as possible. When L is not divisible by I, some sub-sequences can be fine-tuned so that the length of each sub-sequence differs by no more than 1 character (token) to maintain load balance.
[0078] Continuing with the previous example, for the case of L=5000, T=1024, and I=4, the original sequence is divided into four sub-sequences of equal length, each with a length of 1250 characters (tokens). For the case of L=3000, T=1024, and I=2, it can be divided into two sub-sequences with lengths of 1500 and 1500 respectively.
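Steps 410 and 420 can be reproduced with a short sketch. The function name `split_ultra_long` is hypothetical, and the check `length >= 2 * t1` assumes the specific implementation in which the second threshold is twice the first threshold.

```python
def split_ultra_long(length: int, t1: int = 1024) -> list:
    """Return the lengths of the sub-length sequences of a third-class sequence.

    The number of segments is floor(length / t1) (step 410); the split is as
    even as possible, with lengths differing by at most one token (step 420).
    """
    if length < 2 * t1:
        raise ValueError("expected a third-class sequence (length >= 2 * t1)")
    count = length // t1
    base, extra = divmod(length, count)
    # The first `extra` pieces receive one additional token.
    return [base + (1 if i < extra else 0) for i in range(count)]


print(split_ultra_long(5000))  # [1250, 1250, 1250, 1250]
print(split_ultra_long(3000))  # [1500, 1500]
print(split_ultra_long(4096))  # [1024, 1024, 1024, 1024]
```

In each case the resulting lengths lie between the first threshold and the second threshold, which is the range in which sequence parallelism is applied within an iteration.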
[0079] Step 430: Assign each of the sub-length sequences one-to-one to multiple iterations.
[0080] Specifically, each sub-sequence corresponds to an independent iteration, and different sub-sequences are executed sequentially in time without overlapping.
[0081] For example, the four sub-sequences obtained by the above L=5000 segmentation (denoted as P1, P2, P3, P4): (1) Assign the sub-length sequence P1 to the first iteration for execution; (2) Assign the sub-length sequence P2 to the second iteration for execution; (3) Assign the sub-length sequence P3 to the third iteration for execution; (4) Assign the sub-length sequence P4 to the 4th iteration for execution.
[0082] In the first iteration, only the sub-sequence P1 is processed; after the first iteration, the second iteration processes the sub-sequence P2, and so on. This allocation method distributes the computational load of the original third-class sequence across multiple iterations, avoiding excessive computation in a single iteration.
[0083] As can be seen from the above description, the first parallel acceleration method for MoE sparse large model inference provided in this application determines the number of segments based on the first threshold and performs even division, so that the length of each sub-length sequence is kept close to the first threshold. This ensures that the computational load of each iteration is balanced and controllable, laying the foundation for sequence-parallel processing in subsequent iterations.
[0084] To further address the issue of optimizing computational efficiency within a single iteration of cross-iteration sequence parallelism, in an embodiment of the first parallel acceleration method for MoE sparse large model inference provided in this application, referring to Figure 6, step 400 further includes the following: Step 440: In each iteration, if a second type of sequence exists in the current iteration, the sub-length sequence corresponding to the current iteration and the second type of sequence existing in the current iteration are jointly divided into sub-sequences using the sequence parallel method, so that each device performs multi-head attention calculation in parallel on its assigned sub-sequences.
[0085] Specifically, if the second type of sequence does not exist in this iteration, the sequence parallel method is directly used to divide the sub-length sequence corresponding to this iteration into various sub-sequences, so that each device can perform multi-head attention calculation in parallel on the assigned sub-sequences.
[0086] If the second type of sequence exists in the current iteration, then the sub-length sequence corresponding to the current iteration and all second type sequences existing in the current iteration are treated as objects to be processed, and processed uniformly using a sequence parallel approach. Here, "jointly using the sequence parallel method" means merging these sequences into a single set to be processed and scheduling sequence parallel computation for the whole set in one pass, with each sequence in the set divided across the devices, rather than starting a separate sequence parallel process for each sequence.
[0087] For example, for the sub-long sequence P1 and the second type sequences S1 and S2, they are regarded as three long sequences that need to be processed in parallel in this iteration, and the sequence parallel scheduling and computation are performed uniformly.
[0088] As shown in Figure 7, the third type of sequence (ultra-long sequence) is split into two sub-long sequences P1 and P2, which are processed in two successive iterations; in each iteration, the sub-long sequence is distributed across device 1 and device 2 in a sequence parallel manner. In the first iteration, sub-long sequence P1 is packaged together with the second type of sequence (long sequence) for sequence parallel computation.
[0089] Then, each device computes the multi-head attention for all subsequences it is assigned in parallel. Once all devices have completed their computations, the iteration ends. If there is a next iteration, the next subsequence is processed, and so on, until all subsequences have been processed, at which point the iteration process ends.
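The assignment of sub-long sequences to iterations and the packaging of step 440 can be sketched as a small planning function. The name `plan_iterations` is hypothetical, and placing every second-type sequence in the first iteration is an assumption that mirrors the Figure 7 example; the text does not prescribe which iteration a second-type sequence belongs to.

```python
def plan_iterations(sub_long: list, second_type: list) -> list:
    """Return, for each iteration, the sequences handled with sequence parallelism.

    Assumption (illustrative only): all second-type sequences of the batch are
    packaged into the first iteration, as in the Figure 7 example.
    """
    if not sub_long:
        # No ultra-long sequence: a single iteration handles the long sequences.
        return [list(second_type)] if second_type else []
    plan = []
    for index, piece in enumerate(sub_long):
        packed = [piece]
        if index == 0:
            packed.extend(second_type)  # package with the first sub-long sequence
        plan.append(packed)
    return plan


plan = plan_iterations(["P1", "P2"], ["S1"])
print(plan)  # [['P1', 'S1'], ['P2']]
```

Each inner list is the set of sequences that is divided across the devices in one iteration; the iterations themselves run one after another.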
[0090] As can be seen from the above description, by packaging the sub-long sequence together with the second type of sequence, the first parallel acceleration method for MoE sparse large model inference provided in this application can effectively reduce the number of communications and the scheduling overhead, and improve the utilization of computing resources within the iteration.
[0091] To further address the uneven computational load between devices in sequence parallelism, which arises because the computational load of self-attention varies with sequence position (lighter at the beginning and heavier at the end), the sequence parallel method used in steps 300 and 400 of the first parallel acceleration method for MoE sparse large model inference provided in this application specifically includes the following: (1) Evenly divide the sequence to be processed into 2d segments, wherein the sequence to be processed includes the second type of sequence and / or the sub-length sequence corresponding to the third type of sequence in a single iteration, and d is the number of devices.
[0092] It should be noted that the "and / or" mentioned above indicates that the sequence to be processed can be any one of the categories, or both categories can exist simultaneously (such as the case of package processing in step 440). For example, in a certain iteration, if a sub-length sequence and a second-category sequence need to be processed, then both sequences are sequences to be processed.
[0093] Let the number of devices in the current system be d (d is a positive integer, such as 4 devices). For each sequence to be processed, divide it into 2d segments along the sequence dimension on an average basis, and number them sequentially as 1, 2, ..., 2d.
[0094] The specific method for average segmentation is as follows: Divide the length L of the sequence by 2d to obtain the basic length of each segment. If L is not divisible by 2d, then some segments are fine-tuned to make the lengths of each segment as equal as possible, with a difference of no more than 1 character (token).
[0095] For example, assuming d=4, then 2d=8. If there is a second-class sequence of length 8000, it is divided into 8 segments, each of length 1000. If the length is 8001, it is divided into 8 segments, of which 7 segments have a length of 1000 and 1 segment has a length of 1001 (which can be arbitrarily assigned to any segment, but usually the remainder is assigned to the later segments).
[0096] (2) Assign the i-th segment and the i-th-last segment to the i-th device.
[0097] Specifically, after the segmentation is completed, each segment is assigned to a different device according to the following rules: Device i (i from 1 to d) is responsible for the i-th segment and the i-th segment from the end (i.e., the 2d-i+1-th segment).
[0098] For example, when d=4: (1) Device 1 is responsible for the 1st segment and the last segment (i.e., the 8th segment); (2) Device 2 is responsible for the 2nd segment and the 2nd-to-last segment (i.e., the 7th segment); (3) Device 3 is responsible for the 3rd segment and the 3rd-to-last segment (i.e., the 6th segment); (4) Device 4 is responsible for the 4th segment and the 4th-to-last segment (i.e., the 5th segment).
[0099] In this way, each device receives two segments: one from the first half of the sequence (with less computation) and one from the second half of the sequence (with more computation), thus achieving a balance in the computational load.
[0100] After allocation, each device performs multi-head attention computation in parallel on the two data segments allocated locally. Since the total computational load on each device is roughly equal, the computational load among the devices is balanced, reducing synchronization waits caused by differences in computational load.
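The balancing effect of this pairing can be checked numerically. The sketch below is illustrative: the function names are hypothetical, and the cost model is an assumption, namely that with equal-length segments under a causal mask, segment j attends to j - 1 complete earlier segments plus the lower triangle of its own block, so its cost is proportional to 2j - 1.

```python
def paired_assignment(num_devices: int) -> dict:
    """Map device i (1-based) to segments i and 2d - i + 1 (1-based)."""
    d = num_devices
    return {i: (i, 2 * d - i + 1) for i in range(1, d + 1)}


def contiguous_assignment(num_devices: int) -> dict:
    """Baseline for comparison: device i takes segments 2i - 1 and 2i."""
    return {i: (2 * i - 1, 2 * i) for i in range(1, num_devices + 1)}


def causal_cost(segment: int) -> int:
    """Assumed relative attention cost of segment j under a causal mask: 2j - 1."""
    return 2 * segment - 1


def device_loads(assignment: dict) -> list:
    """Total assumed cost per device, in device order."""
    return [sum(causal_cost(s) for s in segs) for _, segs in sorted(assignment.items())]


print(paired_assignment(4))                        # {1: (1, 8), 2: (2, 7), 3: (3, 6), 4: (4, 5)}
print(device_loads(paired_assignment(4)))          # [16, 16, 16, 16]
print(device_loads(contiguous_assignment(4)))      # [4, 12, 20, 28]
```

Under this cost model the paired assignment gives every device a load of (2i - 1) + (2(2d - i + 1) - 1) = 4d, independent of i, whereas a contiguous split leaves the last device with several times the load of the first.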
[0101] In other words, since self-attention computation uses a causal attention mask (as shown in Figure 3), the computational workload of each segment in the sequence is different and increases from front to back. Therefore, when splitting long sequences and the sub-long sequences of ultra-long sequences for sequence parallelism, the load balance across devices must be considered. This application divides them as follows: assuming that the number of devices is d and the devices are numbered i (i=1, 2, ..., d), each long sequence or sub-long sequence is evenly divided into 2d segments numbered sequentially as (1, 2, ..., 2d), and device i is responsible for the i-th segment and the (2d-i+1)-th segment, that is, the i-th segment and the i-th segment from the end. Figure 8 illustrates the sequence-parallel segmentation of two sub-long sequences P1 and P2 of an ultra-long sequence on two devices, showing that the computational load allocated to each device is the same in each execution iteration.
[0102] As can be seen from the above description, the first parallel acceleration method for MoE sparse large model inference provided in this application embodiment, through symmetrical partitioning and paired allocation of front and rear segments, makes the amount of computation undertaken by each device basically equal, which achieves fine-grained load balancing in sequence parallelism and further improves the overall computing efficiency.
[0103] To further address the device idle waiting caused by the inter-device communication that sequence parallelism requires in hybrid parallel scenarios, and to improve execution efficiency within an iteration by overlapping computation and communication, in an embodiment of the first parallel acceleration method for MoE sparse large model inference provided in this application, in an iteration containing the first type of sequence, each device is used to: carry out the data transmission required for multi-head attention computation in the sequence parallel manner while performing multi-head attention computation in the data parallel manner, and, after receiving the transmitted data, perform multi-head attention computation on the sub-sequences allocated in this iteration.
[0104] Specifically, the device first determines whether a first-type sequence exists in the current iteration. First-type sequences are short sequences with a length less than a first threshold. These short sequences are processed in data parallel, computed independently by each device, without the need for inter-device communication. For example, in a given iteration, in addition to processing the packaged sub-length sequences and second-type sequences (using sequence parallelism), there are multiple first-type sequences in the batch that need to be processed in this iteration.
[0105] In the iterations where the first type of sequence exists, each device is configured to perform both types of operations simultaneously: (1) Perform multi-head attention computation in a data-parallel manner: each device independently performs multi-head attention computation on its assigned first-class sequences. These computations rely solely on local data and model parameters, requiring no communication with other devices. Therefore, the devices can continuously perform computations throughout the entire iteration process without being idle while waiting for communication.
[0106] For example, device 1 is assigned two first-class sequences Q1 and Q2, and device 1 begins to perform multi-head attention computation on Q1 and Q2.
[0107] (2) Carry out, at the same time, the data transmission required for sequence parallelism: while performing the aforementioned data-parallel computation, each device also participates in the data transmission required for sequence parallelism. Here, "data transmission" refers to the exchange of key-value (KV) cache data between devices to achieve sequence parallelism.
[0108] For the packaged sub-sequences and the second type of sequence, sequence parallelism is required. The first step in sequence parallelism is usually the exchange of key-value (KV) cache data held by each device. This communication process overlaps with the data-parallel computation of the devices in time.
[0109] For example, while calculating Q1 and Q2, device 1 is also sending part of the key-value (KV) cache data of its sub-length sequence and second-type sequence segments to device 2, while simultaneously receiving data sent by device 2.
[0110] Once the key-value (KV) cache data required for sequence parallelism has been transmitted, each device has received the necessary data from other devices. At this point, each device continues to perform multi-head attention computation on the sub-sequences allocated in this iteration. Here, sub-sequences refer to the portions of the packaged sub-length sequence and second-type sequence that have been segmented and allocated to each device. For example, after the data transmission is complete, device 1 has received the key-value (KV) cache data sent by device 2. At this point, device 1 begins to perform multi-head attention computation on its allocated sub-sequences of the sub-length sequence and of the second-type sequence S.
[0111] Once all devices have completed the above calculations, the iteration ends. In this way, the communication time required for sequence parallelism is covered by the time for data parallel computation, and the devices remain in a computational state throughout the entire iteration process, without idle waiting.
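The benefit of this overlap can be illustrated with a toy timing model for one device in one iteration. Everything in the sketch is an assumption for illustration: the function name, the three time components, and the sample values, which are arbitrary units rather than measurements.

```python
def iteration_time(dp_compute: float, kv_transfer: float,
                   sp_compute: float, overlap: bool) -> float:
    """Toy model of one iteration on one device (arbitrary time units).

    dp_compute:  attention on first-class sequences (data parallel)
    kv_transfer: KV exchange needed before sequence-parallel attention
    sp_compute:  attention on the sub-sequences allocated to this device
    """
    if overlap:
        # The transfer runs in the background of the data-parallel computation.
        return max(dp_compute, kv_transfer) + sp_compute
    return dp_compute + kv_transfer + sp_compute


serial = iteration_time(4.0, 3.0, 5.0, overlap=False)
hidden = iteration_time(4.0, 3.0, 5.0, overlap=True)
print(serial, hidden)  # 12.0 9.0
```

In this model the transfer is fully hidden only when the data-parallel computation lasts at least as long as the transfer; if the transfer takes longer, the device still waits for the remainder.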
[0112] In one example, Figure 9 shows the execution of the multi-head attention layer under the hybrid parallel strategy during the pre-filling stage for the example given in Figure 7. As shown in Figure 9, in each execution iteration, the pre-filling calculation of the short sequences is first performed in a data parallel manner; then the pre-filling calculation of the sub-long sequence of the ultra-long sequence and of the long sequence is performed in a sequence parallel manner, wherein the KV data transmission between devices is hidden by local calculation; the ultra-long sequence is divided into two sub-long sequences P1 and P2, which are executed in two iterations respectively.
[0113] As can be seen from the above description, the first parallel acceleration method for MoE sparse large model inference provided in this application embodiment can effectively hide communication latency by enabling each device to perform data parallel computation while simultaneously transmitting the data required for sequence parallel computation, thereby keeping the device in a computing state and significantly improving resource utilization and overall execution efficiency within the iteration.
[0114] To further address the issue of efficiently transferring key-value cache (KV Cache) data between devices in sequence parallelism, and to minimize the impact of communication latency on computation, in an embodiment of the first parallel acceleration method for MoE sparse large model inference provided in this application, one implementation of the data transmission required for the multi-head attention computation performed by the device in the sequence parallel manner is as follows: A ring communication method is used to transmit key-value cache data between devices, and when key-value cache data is received from a neighboring device, attention is calculated with the local query vector.
[0115] Specifically, before data transmission, a ring communication topology is constructed for all devices participating in the computation. Devices are connected in a logical order, with each device having adjacent predecessor and successor devices. For example, assuming there are four devices, numbered 0, 1, 2, and 3, the ring topology would be: Device 0 → Device 1 → Device 2 → Device 3 → Device 0, forming a closed loop.
[0116] During iterations involving first-type sequences, each device, while performing multi-head attention computation in a data-parallel manner (processing the first-type sequences), initiates a ring communication process to transfer the key-value cache (KV Cache) data required for sequence parallelism. The specific process of ring communication is as follows: (1) Each device divides the locally held key-value cache (KV Cache) data that needs to be shared with other devices into blocks and prepares to send them to the next device.
[0117] (2) In each round of communication, each device sends its local or recently received key-value cache (KV Cache) data blocks to its successor device, while receiving data blocks from its predecessor device.
[0118] Upon receiving key-value cache data from neighboring devices, each device immediately performs attention calculations with its existing local query vectors, without waiting for all data transmissions to complete. This pipelined overlap of computation and communication further improves efficiency.
[0119] For example, after receiving the first block of key-value cache (KV Cache) data from device 0, device 1 immediately calculates the attention of the data block with the local query vector; at the same time, device 1 continues to send the next block of data to the subsequent device 2, and prepares to receive the next block of data from device 0.
[0120] After multiple rounds of ring communication, every device has received all of the key-value cache data and has progressively completed the attention calculation with its local query vector. Each device thus obtains the local results required for sequence parallelism for subsequent processing.
[0121] As can be seen from the above description, the first parallel acceleration method for MoE sparse large model inference provided in this application embodiment enables each device to start computing as soon as it has received part of the data through ring communication, thereby achieving pipelined overlap between communication and computation, effectively hiding communication delay, and thus improving the execution efficiency of sequence parallelism.
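The ring procedure can be simulated in plain Python to show that processing key-value blocks as they arrive gives the same result as attention over the full sequence. This is a simplified sketch under stated assumptions, not the implementation of this application: queries, keys, and values are scalars, each device holds a single query, no causal mask is applied, devices are simulated one after another in a single process, and the incremental rescaling (often called online softmax) is one common way to accumulate attention block by block.

```python
import math


def full_attention(query, kv_pairs):
    """Reference softmax attention of one scalar query over all (key, value) pairs."""
    scores = [query * k for k, _ in kv_pairs]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    return sum(w * v for w, (_, v) in zip(weights, kv_pairs)) / sum(weights)


def ring_attention(queries, kv_blocks):
    """Simulate d devices; device i starts with kv_blocks[i] and one local query."""
    d = len(queries)
    # Per-device running (max score, softmax denominator, weighted value sum).
    state = [(-math.inf, 0.0, 0.0)] * d
    held = list(range(d))                      # block index currently on each device
    for _ in range(d):                         # after d rounds every device saw every block
        new_state = []
        for dev in range(d):
            top, den, num = state[dev]
            for k, v in kv_blocks[held[dev]]:
                score = queries[dev] * k
                new_top = max(top, score)
                rescale = math.exp(top - new_top)  # 0.0 on the very first update
                weight = math.exp(score - new_top)
                den = den * rescale + weight
                num = num * rescale + weight * v
                top = new_top
            new_state.append((top, den, num))
        state = new_state
        # Each device passes its block to its successor and receives from its predecessor.
        held = [held[(dev - 1) % d] for dev in range(d)]
    return [num / den for _, den, num in state]


queries = [0.5, -1.0, 2.0]
kv_blocks = [[(0.1, 1.0), (0.4, 2.0)], [(-0.3, 3.0)], [(0.9, 4.0), (-0.7, 5.0)]]
all_pairs = [pair for block in kv_blocks for pair in block]
ring_out = ring_attention(queries, kv_blocks)
reference = [full_attention(q, all_pairs) for q in queries]
```

Because each device only needs its running maximum, denominator, and weighted sum, it never has to store the complete key-value data at once; the block it has just processed can be forwarded while the next one is being received.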
[0122] To adapt to different parallel strategy requirements, in an embodiment of the first parallel acceleration method for MoE sparse large model inference provided in this application, another implementation of the data transmission required for the multi-head attention computation performed by the device in the sequence parallel manner is as follows: Through at least one all-to-all communication operation, the query, key, and value tensors partitioned along the sequence dimension are exchanged between the devices, and multi-head attention computation is performed in parallel after complete sequence data is obtained; wherein the all-to-all communication operation and the multi-head attention computation in the data parallel mode are performed asynchronously.
[0123] In other words, in the hybrid parallel strategy of data parallelism and sequence parallelism for the pre-filling stage of MoE sparse large model inference, the sequence parallel computation of long sequences and of the sub-long sequences of ultra-long sequences adopts a ring attention approach. As an alternative, this sequence parallel computation can also be implemented in a manner similar to DeepSpeed Ulysses, that is, the query vector, key vector, and value vector are evenly distributed across multiple devices along the sequence dimension. Then, each device performs one all-to-all communication operation each on the query vector, the key vector, and the value vector, so that each device obtains the query vector, key vector, and value vector of the complete sequence for a different set of attention heads. After that, each device performs attention computation on the complete sequence for its attention heads. The communication overhead of the three all-to-all communication operations can be masked by the local short sequence computation on the device. If the above alternative is used to implement the sequence parallelism in the hybrid parallelism proposed in this application, the specific implementation of sequence parallelism differs, but it still follows the idea of solving the load imbalance problem of attention computation for long and short sequences through a hybrid parallel strategy of data parallelism and sequence parallelism, and should be regarded as an equivalent technical solution of this application.
[0124] Specifically, before data transmission, each device divides its locally held query vector, key vector, and value vector of the sequence to be processed (such as a sub-sequence or a second-type sequence) along the sequence dimension on an average basis. The number of divisions is the same as the number of devices, d. For example, with 4 devices, each device divides its local query vector into 4 blocks, labeled Q0, Q1, Q2, and Q3, and the key vector and value vector are also divided in the same way.
[0125] In the iterations involving first-type sequences, each device asynchronously initiates an all-to-all communication operation while simultaneously performing multi-head attention computation on the first-type sequences in a data-parallel manner. An all-to-all communication operation means that each device sends data to all other devices while simultaneously receiving data from all other devices. Specifically, each device sends its segmented query vector, key vector, and value vector to the corresponding target devices via an all-to-all communication operation. For example, device 0 keeps Q0 for itself, and sends Q1 to device 1, Q2 to device 2, and Q3 to device 3; the key and value vectors are processed similarly.
[0126] After at least one all-to-all communication operation (typically three, one for query, one for key, and one for value), each device collects the corresponding data blocks from all devices, thus obtaining the complete sequence data locally. For example, device 0 ultimately possesses the Q0 block from all devices, which is the 0th block of the query vector for the entire sequence. However, through recombination, each device will actually have the complete sequence data (because each device is responsible for a different attention head or a different data portion, depending on the specific implementation). In the Ulysses approach, typically after an all-to-all communication operation, each device obtains the complete sequence but with a corresponding different attention head.
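As an illustration only, the redistribution performed by the Ulysses-style all-to-all operation can be sketched as a single-process Python simulation. The sketch assumes that the number of attention heads is divisible by the number of devices and that each device ends up owning one contiguous group of heads; the function name `ulysses_all_to_all` and the string placeholders that stand in for query entries are hypothetical and are not taken from this application or from the DeepSpeed API.

```python
def ulysses_all_to_all(shards):
    """Simulate one all-to-all exchange.

    shards[src][head] is the list of per-token entries for the sequence
    chunk held by device `src` before communication (hypothetical layout).
    Returns gathered[dst][head]: the full sequence for each head owned by `dst`.
    """
    d = len(shards)
    num_heads = len(shards[0])
    assert num_heads % d == 0, "sketch assumes heads divide evenly across devices"
    heads_per_dev = num_heads // d
    gathered = []
    for dst in range(d):
        owned = range(dst * heads_per_dev, (dst + 1) * heads_per_dev)
        # Device dst receives, from every source device, the chunk belonging to
        # the heads it owns, and concatenates the chunks in device order.
        gathered.append({
            h: [tok for src in range(d) for tok in shards[src][h]]
            for h in owned
        })
    return gathered


# 4 devices, 4 heads, a sequence of 8 tokens split into chunks of 2 tokens.
d, num_heads, seq_len = 4, 4, 8
chunk = seq_len // d
shards = [
    [[f"q[h{h},t{t}]" for t in range(dev * chunk, (dev + 1) * chunk)]
     for h in range(num_heads)]
    for dev in range(d)
]
gathered = ulysses_all_to_all(shards)
```

In this sketch, device 2 starts with tokens 4 and 5 of every head and ends with all 8 tokens of head 2 only, which matches the statement that each device obtains the complete sequence for a different attention head.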
[0127] Once the complete sequence data is obtained, each device immediately performs multi-head attention calculations on the locally available data. Since the data is now complete, the calculations can be performed independently without waiting for further communication.
[0128] In this embodiment, the all-to-all communication operation and the multi-head attention computation in data-parallel mode are performed asynchronously. That is, after the device initiates the all-to-all communication operation, it does not block and wait for the communication to complete, but continues to perform data-parallel computation (processing the first type of sequence). While the data-parallel computation is in progress, the communication is completed in the background; after the communication is completed, the device switches to sequence-parallel computation. In this way, the communication time is covered by the data-parallel computation.
[0129] As can be seen from the above description, the first parallel acceleration method for MoE sparse large model inference provided in this application embodiment enables each device to obtain complete sequence data and perform attention computation through the all-to-all communication operation, and executes this communication asynchronously with the data-parallel computation, which can effectively hide communication overhead and provides another efficient way to achieve sequence parallelism.
[0130] In other words, the parallel acceleration method for MoE sparse large model inference provided in this application embodiment adopts a hybrid parallel strategy of data parallelism and sequence parallelism for the pre-filling stage of MoE sparse large model inference. First, requests within the same batch are divided into three categories: short sequences, long sequences, and ultra-long sequences. For short sequence requests within a batch, multi-head attention computation is performed using data parallelism. For long sequence requests within a batch, the long sequence is split across multiple devices, and multi-head attention computation is performed using sequence parallelism. For ultra-long sequence requests within a batch, the ultra-long sequence is split across multiple devices, and multi-head attention computation is performed using cross-iteration sequence parallelism.
[0131] This hybrid parallel strategy, combining data parallelism and sequence parallelism for both long and short sequences, achieves load balancing by segmenting long and ultra-long sequences. It also offers the following benefits: short sequences utilize data parallelism, avoiding unnecessary communication overhead; long sequences utilize sequence parallelism, allowing them to allocate more computational resources and complete computation as quickly as possible, reducing initial token latency; and ultra-long sequences employ cross-iteration sequence parallelism, distributing computation across multiple iterations to minimize the impact of ultra-long sequence computation on other sequence computations.
[0132] This application also provides an embodiment of a second parallel acceleration method for MoE sparse large model inference, which can be implemented by a parallel acceleration device for MoE sparse large model inference. Referring to Figure 10, the second parallel acceleration method for MoE sparse large model inference specifically includes the following: Step 10: In the decoding stage of MoE sparse large model inference, each request sequence in the same batch is classified according to a preset decoding stage classification strategy, wherein the decoding stage classification strategy includes: classifying request sequences with a length less than the target threshold as short sequences, and classifying request sequences with a length equal to or greater than the target threshold as long sequences.
[0133] It should be noted that, during the decoding stage of MoE sparse large model inference, each request sequence has already undergone the pre-filling computation and generated its first output character. The decoding stage is a character-by-character generation process that generates a new output character in each iteration until the maximum generation length is reached or a terminating character is encountered. During this stage, each request sequence maintains a key-value cache to store the key and value vectors of the input prompt and the generated characters, avoiding redundant computations.
[0134] In step 10, multiple request sequences within the same batch are first acquired, each already in the decoding state. A target threshold is preset to distinguish between short and long sequences. This threshold can be set according to the actual hardware configuration and performance requirements; for example, it can be set to 1024 characters (tokens).
[0135] For each request sequence within this batch, obtain its current length L (i.e., the number of generated characters plus the length of the input prompt word), and classify it according to the following rules: (1) If L < target threshold, then classify the request sequence as a short sequence; (2) If L ≥ the target threshold, then the request sequence is classified as a long sequence.
[0136] For example, assuming the target threshold is 1024 characters, a request sequence of length 800 is classified as a short sequence, and a request sequence of length 1500 is classified as a long sequence.
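The classification rule of step 10 can be sketched as follows. This is a minimal illustration and not the implementation of this application; the function name `classify_decoding_request` and the default threshold of 1024 (taken from the example above) are assumptions made for the sketch.

```python
def classify_decoding_request(current_length, target_threshold=1024):
    """Classify a request in the decoding stage by its current length L.

    current_length is the prompt length plus the number of generated characters.
    """
    if current_length < target_threshold:
        return "short"
    return "long"  # L >= target threshold


# The two requests from the example: lengths 800 and 1500.
labels = {length: classify_decoding_request(length) for length in (800, 1500)}
```

A request whose length is exactly equal to the threshold falls into the long category, consistent with rule (2).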
[0137] In one or more embodiments of this application, the target threshold may be the same as or different from the first threshold, and can be set according to actual application requirements. Correspondingly, if the target threshold is equal to the first threshold, the short sequence in the decoding stage and the short sequence in the pre-filling stage can both refer to sequences with a length less than T; while the long sequence in the decoding stage specifically refers to a sequence equal to or greater than T, and the long sequence in the pre-filling stage specifically refers to a sequence between T and 2T.
[0138] Step 20: If the short sequences are obtained by classification, each short sequence is distributed to multiple devices in a data-parallel manner for parallel multi-head attention computation.
[0139] Specifically, in step 20, short sequences are processed in a data-parallel manner. All short sequences within the same batch are evenly distributed according to the number of devices, ensuring each computing device is assigned several complete short sequences. Each device independently performs multi-head attention computation on its assigned short sequences; no communication is required between devices at this stage. Each device maintains its own key-value cache for the short sequences it is responsible for, and updates it in each decoding step.
[0140] For example, assuming there are 4 computing devices and a batch of 8 short sequences, each device is assigned 2 short sequences. In each decoding step, device 1 independently calculates the attention of the 2 short sequences it is responsible for, generates the corresponding output characters, and updates the key-value cache (KV cache) of these short sequences.
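One way to realize the even distribution of short sequences is a round-robin assignment, sketched below. The source only requires that short sequences be distributed evenly; the round-robin order, the 1-based device numbering, and the function name `assign_short_sequences` are assumptions of this sketch.

```python
def assign_short_sequences(seq_ids, num_devices):
    """Distribute whole short sequences over devices 1..num_devices in round-robin order."""
    assignment = {dev: [] for dev in range(1, num_devices + 1)}
    for i, seq_id in enumerate(seq_ids):
        # Each sequence goes to exactly one device, so no inter-device
        # communication is needed for its attention computation.
        assignment[1 + i % num_devices].append(seq_id)
    return assignment


# 8 short sequences on 4 devices, as in the example: 2 sequences per device.
assignment = assign_short_sequences([f"S{i}" for i in range(1, 9)], num_devices=4)
```

When the number of short sequences is not a multiple of the number of devices, this scheme leaves a difference of at most one sequence between any two devices.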
[0141] Step 30: If the long sequence is obtained by classification, the key-value cache of the long sequence is divided into multiple key-value cache segments and stored on each of the devices respectively. The input character of the current decoding step is broadcast to the multiple devices so that each device can use its locally stored key-value cache segment and the input character to perform self-attention calculation to obtain a local result. The local results obtained by each device are reduced to obtain the output character of the current decoding step. The key-value cache corresponding to the output character is stored on the device whose current key-value cache capacity meets the preset load balancing conditions.
[0142] It should be noted that the input characters for the current decoding step include: the first character generated during the pre-filling stage of the MoE sparse large model inference, or the output character generated in the previous decoding step. By clearly defining the source of the input characters for the current decoding step (the first character generated during the pre-filling stage or the output character generated in the previous decoding step), it is ensured that those skilled in the art can accurately understand the input source for the decoding stage.
[0143] In a preferred embodiment, step 30, which involves dividing the long key-value cache sequence into multiple key-value cache segments and storing them on each of the devices, may include: dividing the long key-value cache sequence into multiple key-value cache segments on an even basis according to the number of devices, and storing each key-value cache segment one-to-one on each of the devices. By dividing the long key-value cache sequence evenly according to the number of devices and storing it one-to-one on each device, it ensures that each device initially bears a balanced storage load, avoiding localized storage hotspots caused by uneven partitioning, laying a good foundation for subsequent dynamic load balancing, and simplifying the management and access of the key-value cache.
[0144] In a preferred embodiment, the preset load balancing condition includes: the current key-value cache capacity is the lowest among all devices. This provides a simple and effective dynamic storage strategy, enabling newly generated key-value caches to be preferentially allocated to the lightest-loaded devices, thereby maximizing the balance of storage load among devices and preventing some devices from becoming performance bottlenecks due to key-value cache overload.
[0145] In other words, for long sequences within a batch, since the computational and memory access volume during the decoding phase is proportional to the sequence length, the key-value cache (KV Cache) of the long sequence is evenly distributed across multiple devices for sequence-parallel computation. Because the computation result during the decoding phase is independent of the order of the key-value cache (KV Cache), this application employs a flexible sequence-parallel approach for multi-head attention computation of long sequences. The specific computation process is as follows: the input character is broadcast to each device, and then each device performs self-attention computation between the input character and its local key-value cache (KV Cache) segment. Afterwards, a reduction operation is performed across the devices to output a new character. The key-value cache (KV Cache) of the new character is dynamically selected for storage on the device with the lowest key-value cache (KV Cache) capacity. The key-value cache (KV Cache) capacity on each device can be dynamically monitored at runtime. This hybrid parallel strategy ensures that the growth rate of the key-value cache (KV Cache) on each device during the decoding phase is similar, and the reduction amount of the key-value cache (KV Cache) on each device when the long sequence exits is also similar, thus maintaining a dynamic balance of key-value cache (KV Cache) on each device. A schematic diagram of the hybrid parallel strategy of data parallelism and sequence parallelism for the multi-head attention layer in the decoding stage on four devices is shown in Figure 11.
[0146] Specifically, step 30 involves processing long sequences using a sequence-parallel approach based on key-value caching, and includes the following sub-steps: (1) Divide the long-sequence key-value cache into multiple key-value cache segments and store them on different devices respectively: For each long sequence, its generated key-value cache is evenly divided into multiple key-value cache segments along the sequence dimension, with the number of segments being the same as the number of devices, d. These key-value cache segments are then stored on different devices, with each device holding one key-value cache segment for that long sequence.
[0147] For example, assuming there are 4 devices, and a long sequence has already generated a key-value cache (KVCache) of length 2000, it is divided into 4 key-value cache segments, each of length 500, and stored on device 1, device 2, device 3, and device 4 respectively.
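The even division of a long-sequence key-value cache along the sequence dimension can be sketched as a computation of index ranges. The function name `split_kv_cache` is hypothetical, and the handling of a length that is not divisible by the number of devices (the first few segments receive one extra position) is an assumption, since the source only describes the evenly divisible case.

```python
def split_kv_cache(cache_len, num_devices):
    """Return one half-open (start, end) range per device covering [0, cache_len)."""
    base, extra = divmod(cache_len, num_devices)
    segments = []
    start = 0
    for dev in range(num_devices):
        # Assumption: any remainder is spread over the first `extra` devices.
        size = base + (1 if dev < extra else 0)
        segments.append((start, start + size))
        start += size
    return segments


# The example above: a key-value cache of length 2000 on 4 devices.
segments = split_kv_cache(2000, 4)
```

For the example, each of the four segments covers 500 positions, and the first segment corresponds to the first 500 characters of the long sequence.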
[0148] (2) Broadcast the input character of the current decoding step to all devices: In each decoding step, one input character needs to be processed. For the first decoding step, the input character is the first character generated in the pre-filling stage; for subsequent decoding steps, the input character is the output character generated in the previous decoding step.
[0149] Send the input character to all devices participating in the calculation through a Broadcast operation, so that each device can receive the input character.
[0150] For example, if the input character in the current decoding step is "love", then broadcast this character to Device 1, Device 2, Device 3, and Device 4.
[0151] (3) Each device performs self-attention calculation on the input character using the local key-value cache segment to obtain a partial result: After each device receives the broadcast input character, it performs self-attention calculation (Self-Attention) on the input character using the locally stored key-value cache segment. Specifically, the device calculates the attention score using the query vector (Query) of the input character and the key vector (Key) in the local key-value cache segment, and then performs weighted summation of the attention score and the value vector (Value) to obtain the partial result (Partial Result) of the device.
[0152] For example, Device 1 calculates the attention using the query vector of the input character "love" and the local key-value cache segment (the key-value of the first 500 characters of the long sequence) to obtain the partial result R1; Device 2 obtains the partial result R2; Device 3 obtains the partial result R3; Device 4 obtains the partial result R4.
[0153] (4) Perform a reduction operation on the partial results obtained by each device to obtain the output character of the current decoding step: Merge the partial results of all devices through a Reduce operation to obtain the final output character (Output Token) of the current decoding step. The reduction operation can be summation, averaging, or other predefined merging methods, depending on the implementation of the attention mechanism.
[0154] For example, perform summation reduction on R1, R2, R3, and R4 to obtain the final output character "you".
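The source leaves the merging method of the reduction open. One choice that reproduces single-device softmax attention exactly, assumed here rather than taken from this application, is a summation in which each partial result is first rescaled by its local softmax maximum. The sketch below covers only the attention output of one head for one query vector; the later layers that turn this output into the final output character are outside its scope, and all function names are hypothetical.

```python
import math


def partial_attention(q, keys, values):
    """Attention of query q over one non-empty local KV segment.

    Returns (local max score, local sum of exponentials, unnormalized weighted values).
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    acc = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(len(values[0]))]
    return m, sum(weights), acc


def merge_partials(partials):
    """Reduce the partial results of all devices into one attention output."""
    m_global = max(m for m, _, _ in partials)
    denom = 0.0
    acc = [0.0] * len(partials[0][2])
    for m, s, a in partials:
        factor = math.exp(m - m_global)  # rescale to the global maximum
        denom += factor * s
        acc = [x + factor * y for x, y in zip(acc, a)]
    return [x / denom for x in acc]


# 8 cached positions split into 4 segments of 2, one segment per device.
q = [0.5, -1.0]
keys = [[math.sin(i), math.cos(i)] for i in range(8)]
values = [[float(i), float(i * i)] for i in range(8)]
segments = [(keys[i:i + 2], values[i:i + 2]) for i in range(0, 8, 2)]
merged = merge_partials([partial_attention(q, k, v) for k, v in segments])
reference = merge_partials([partial_attention(q, keys, values)])
```

Here `merged` agrees with `reference`, the result computed over the whole cache on a single device, up to floating-point rounding. The merge is also independent of the order of the segments, which is the property that allows new key-value entries to be placed on any device.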
[0155] (5) Store the key-value cache corresponding to the output character on the device whose current key-value cache capacity meets the preset load balancing condition. In a preferred implementation, the preset load balancing condition may be that the current key-value cache capacity is the lowest among all devices. That is, each time a new character is generated, its key-value cache is stored on the device that currently has the lowest occupied capacity.
[0158] For example, if the current key-value cache capacity of device 1 is 1000, device 2 is 800, device 3 is 1200, and device 4 is 900, then device 2, which has the lowest capacity, is selected to store the key-value cache of the new character "you".
[0159] In this way, the newly generated key-value cache is dynamically allocated to the devices with lower load, gradually filling the capacity gap between devices and achieving dynamic balance of the key-value cache.
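The preferred load balancing condition can be sketched as a minimum search over the monitored capacities. The function name `pick_device_for_new_kv` and the tie-breaking rule (the lowest device number wins when capacities are equal) are assumptions of this sketch; the source does not specify how ties are resolved.

```python
def pick_device_for_new_kv(capacity_by_device):
    """Return the device whose current key-value cache occupancy is the lowest."""
    return min(capacity_by_device,
               key=lambda dev: (capacity_by_device[dev], dev))


# Capacities from the example above, keyed by device number.
capacity = {1: 1000, 2: 800, 3: 1200, 4: 900}
target = pick_device_for_new_kv(capacity)
capacity[target] += 1  # the new character's key-value entry is stored there
```

Repeating this selection at every decoding step sends new entries to the least-loaded device until its occupancy catches up with the others, which is how the capacity gap between devices is gradually filled.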
[0160] From the above description, it can be seen that, in the second parallel acceleration method for MoE sparse large model inference provided by the embodiments of this application, the key-value cache of a long sequence is split across multiple devices in the decoding stage, so that each device stores and computes only a part of the long sequence, which effectively disperses the storage and computation pressure of the long sequence. At the same time, broadcasting the input character and reducing the local results achieves parallel decoding computation of the long sequence. More importantly, by dynamically storing the newly generated key-value cache on the device with the lowest current capacity, the growth rate of the key-value cache on each device tends to be balanced, which effectively solves the dynamic load imbalance problem in which the key-value cache grows too fast on devices with more short sequences and too slowly on devices with more long sequences. This dynamic balance mechanism avoids performance bottlenecks caused by key-value cache overload on some devices, improving the utilization rate of hardware resources and the throughput of the decoding stage. In addition, this solution retains the data-parallel processing method for short sequences, so that the efficiency of short sequences is not reduced by unnecessary communication overhead, enabling the entire decoding stage to efficiently process request batches with mixed long and short sequences.
In summary, the second parallel acceleration method for MoE sparse large model inference provided by the embodiments of this application can effectively solve the problem of storage load imbalance caused by the mixture of long and short sequences in the parallel decoding stage of the MoE sparse large model, improve the utilization rate of hardware resources, increase the decoding throughput, and reduce the end-to-end inference latency, and therefore represents a notable advance with practical value.
[0161] When long and short sequences are processed together in the decoding stage, the broadcast and reduction communication operations required by the long sequences can leave devices idle and waiting. To further address this problem, in the embodiments of the second parallel acceleration method for MoE sparse large model inference provided by this application, if a short sequence exists in the decoding step, each of the devices is configured to perform the broadcast and reduction operations while performing the multi-head attention computation in the data-parallel manner.
[0162] In other words, to reduce the communication overhead caused by sequence parallelism of long sequences during the decoding stage, this application proposes a computation hiding method, which hides the communication overhead of the broadcast and reduction operations caused by sequence parallelism behind the local short-sequence computation on the device, as shown in Figure 12.
[0163] Specifically, the general calculation hiding mechanism in the decoding phase is as follows: (1) Determine if a short sequence exists in the current decoding step: At the beginning of each decoding step, first determine if a short sequence exists in the current step. Short sequences are computed independently by each device using multi-head attention, without the need for inter-device communication. For example, in a certain decoding step, while processing a long sequence L (using sequence parallelism based on key-value caching), there are also multiple short sequences S1 and S2 in this batch that need to be processed in this step.
[0164] (2) If a short sequence exists in the current decoding step, each device is configured to perform both types of operations simultaneously: Each device independently performs multi-head attention computation on its assigned short sequences. These computations rely solely on local data and model parameters, requiring no communication with other devices. Therefore, devices can continuously perform computations throughout the entire decoding process. For example, if device 1 is assigned two short sequences S1 and S2, it begins multi-head attention computation on S1 and S2, generates the output characters for these two short sequences in the current decoding step, and prepares to update its key-value cache.
[0165] Furthermore, while performing the aforementioned data-parallel computation, each device also participates in the broadcast and reduction operations required for long sequence processing. The broadcast operation sends the input character of the current decoding step to all devices, ensuring that each device receives the input character and performs self-attention computation with the locally stored long sequence key-value cache segment. The reduction operation merges the local results from all devices into the final output character after each device completes its local self-attention computation and obtains its local result.
[0166] This communication process overlaps in time with the device's data-parallel computation. The device initiates the broadcast and reduction communication asynchronously and continues to execute the short-sequence computation while the communication is in progress.
[0167] For example, while computing short sequences S1 and S2, device 1 asynchronously initiates a broadcast operation to send the input character "love" to devices 2, 3, and 4, and simultaneously receives the broadcast input character from other devices. In a subsequent stage, device 1 asynchronously initiates a reduction operation to send the locally computed partial result to other devices, and simultaneously receives the partial results from other devices for merging.
[0168] (3) Continue sequence-parallel computation after communication is completed: After the broadcast and reduction operations are completed, the device has obtained the complete data required for long sequence processing (the broadcast input character) or has merged the partial results into the final output (the reduction result). At this time, the device continues to perform the sequence-parallel computation based on key-value cache partitioning on the long sequence.
[0169] It should be noted that, due to the overlap between communication and computation, when the device completes short sequence computation, the broadcast and reduction operations may have already been completed or are close to being completed, thereby reducing the idle time of the device due to waiting for communication.
[0170] (4) Complete the current decoding step: After all devices have completed the above computation and communication, the current decoding step ends. In this way, the communication time of the broadcast and reduction operations is covered by the data-parallel computation time of the short sequences, and the devices remain in the computation state throughout the entire decoding step without idle waiting.
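The overlap of communication and computation in steps (1) to (4) can be illustrated with a single-process sketch in which a background thread stands in for the asynchronous broadcast. Everything here is a stand-in chosen for illustration: `fake_broadcast` only sleeps to imitate communication latency, `short_sequence_attention` returns placeholder outputs, and a real system would use the asynchronous collectives of its communication library instead of a thread pool.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_broadcast(token, delay=0.05):
    """Stand-in for the broadcast: waits, then delivers the input character."""
    time.sleep(delay)
    return token


def short_sequence_attention(seq_ids):
    """Stand-in for the local data-parallel attention of the short sequences."""
    return {seq_id: f"output_of_{seq_id}" for seq_id in seq_ids}


def decoding_step(input_token, short_seq_ids):
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start the communication without blocking on it.
        pending = pool.submit(fake_broadcast, input_token)
        # Short-sequence computation runs while the communication is in flight.
        short_outputs = short_sequence_attention(short_seq_ids)
        # Block only if the communication has not finished yet.
        received = pending.result()
    return received, short_outputs


received, short_outputs = decoding_step("love", ["S1", "S2"])
```

The device waits on the communication only at the point where the long-sequence computation actually needs the broadcast character, so any communication time shorter than the short-sequence computation is fully hidden.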
[0171] As can be seen from the above description, the second parallel acceleration method for MoE sparse large model inference provided in this application embodiment can effectively hide communication latency by enabling the device to asynchronously perform the broadcast and reduction operations while performing the data-parallel computation of short sequences, thereby keeping the device in a computing state and improving the resource utilization and overall throughput of the decoding step.
[0172] In other words, the parallel acceleration method for MoE sparse large model inference provided in this application embodiment adopts a hybrid parallel strategy of data parallelism and sequence parallelism in the decoding stage of MoE sparse large model inference. First, sequences within the same batch are divided into short sequences and long sequences. For short sequences within the batch, multi-head attention calculation is performed in a data-parallel manner, avoiding unnecessary communication overhead for short sequences. For long sequences within the batch, the KV Cache of the long sequence is split across multiple devices, and a flexible sequence-parallel approach is used for multi-head attention calculation. The specific calculation process is as follows: the input token is broadcast to each device, and then each device performs self-attention calculation with the input token and its local KV Cache segment. Afterwards, a reduction operation is performed between the devices to output a new token. The KV Cache of the new token is dynamically selected for storage on the device with the lowest KV Cache capacity. The communication overhead of the above broadcast and reduction operations is hidden by the local short-sequence calculation. The advantage of this approach is that the growth rate of the KV Cache on each device is similar, and the reduction amount of the KV Cache on each device when a long sequence exits is also similar, thus maintaining a dynamic balance of the KV Cache on each device.
[0173] To address the imbalance in attention computation load when processing long and short sequences during parallel inference of MoE sparse large models, this application proposes solutions for both the pre-filling stage and the decoding stage of large model inference. Specifically, this application also provides a third parallel acceleration method for MoE sparse large model inference, which can be implemented using a parallel acceleration device for MoE sparse large model inference. Referring to Figure 13, the third parallel acceleration method for MoE sparse large model inference specifically includes the following: Step 01: Execute the first parallel acceleration method for MoE sparse large model inference.
[0174] Step 02: Execute the second parallel acceleration method for MoE sparse large model inference.
[0175] The first parallel acceleration method for MoE sparse large model inference mentioned in the embodiments of the third parallel acceleration method provided in this application can be implemented using the processing flow of the first parallel acceleration method in the above embodiments, and the process is not repeated here. Similarly, the second parallel acceleration method for MoE sparse large model inference mentioned in the embodiments of the third parallel acceleration method provided in this application can be implemented using the processing flow of the second parallel acceleration method in the above embodiments, and the process is not repeated here.
[0176] For multi-head attention computation in the pre-filling stage, this application proposes a hybrid parallel strategy combining data parallelism and sequence parallelism. First, requests within the same batch are categorized into three types: short sequences, long sequences, and ultra-long sequences. For short sequence requests within a batch, data parallelism is used for multi-head attention computation. For long sequence requests within a batch, the long sequence is split across multiple devices, and sequence parallelism is used for multi-head attention computation. For ultra-long sequence requests within a batch, the ultra-long sequence is split across multiple devices, and cross-iteration sequence parallelism is used for multi-head attention computation. This hybrid parallel strategy, combining data parallelism and sequence parallelism for both short and long sequences, achieves load balancing through segmentation of long and ultra-long sequences. It also offers the following benefits: short sequences, using data parallelism, avoid unnecessary communication overhead; long sequences, using sequence parallelism, allow long sequences to receive more computational resources, enabling them to complete computation quickly and reducing first-token latency; ultra-long sequences, using cross-iteration sequence parallelism, distribute the computation of ultra-long sequences across multiple iterations, thereby reducing the impact of ultra-long sequence computation on other sequences within the same batch.
[0177] For multi-head attention computation in the decoding phase, this application also proposes a hybrid parallel strategy combining data parallelism and sequence parallelism, but the execution process differs significantly from the pre-filling phase. In the decoding phase, one token is output at a time, and the newly generated token is used as input to compute the next token. During decoding, both the KV cache from the pre-filling phase and the KV cache for newly generated tokens are cached, thus avoiding redundant computation of already processed inputs when generating new tokens. The computational and memory access volumes in the decoding phase are proportional to the sequence length. For the decoding phase, this application categorizes sequences within the same batch into short sequences and long sequences. For short sequences within the batch, data parallelism is used for multi-head attention computation. For long sequences within a batch, the key-value (KV) cache of the long sequence is split across multiple devices, and multi-head attention computation is performed using a sequence-parallel approach. Specifically, the input token is broadcast to each device, and each device performs self-attention computation between the input token and its local KV cache segment. Afterward, a reduction operation is performed across devices to output a new token. The KV cache of the new token is dynamically selected for storage on the device with the lowest KV cache capacity. The advantage of this approach is that the growth rate of the KV cache on each device is similar, and the decrease in KV cache capacity across devices is also similar when a long sequence exits, thus maintaining a dynamic balance in the KV cache across all devices.
[0178] In summary, the third parallel acceleration method for MoE sparse large model inference provided in this application proposes a hybrid parallel strategy of data parallelism and sequence parallelism for the pre-filling stage of MoE sparse large model inference, which adapts to the varying request sequence lengths encountered during MoE sparse large model inference. Sequences within a batch are divided according to length, and data parallelism, sequence parallelism, and cross-iteration sequence parallelism are applied to short sequences, long sequences, and ultra-long sequences respectively. Data parallelism for short sequences does not incur communication overhead, while sequence parallelism and cross-iteration sequence parallelism for long and ultra-long sequences achieve load balancing for self-attention computation and improve the processing speed of long sequences. Compared to processing methods that do not differentiate between sequence lengths or that use a single parallel strategy, the method proposed in this application can solve the load imbalance problem with lower communication overhead. The third parallel acceleration method for MoE sparse large model inference provided in this application also proposes a hybrid parallel strategy of data parallelism and sequence parallelism for the decoding stage of MoE sparse large model inference. Sequences within the same batch are divided into short sequences and long sequences. For short sequences within a batch, multi-head attention computation is performed using data parallelism, avoiding unnecessary communication overhead. For long sequences within a batch, the KV cache for long sequences is split across multiple devices, and multi-head attention computation is performed using a flexible sequence parallelism approach.
Compared to existing work, the hybrid parallelism strategy of data parallelism and sequence parallelism proposed in this application can make the growth rate of KV cache on each device during the decoding stage similar, and the reduction amount of KV cache on each device when a long sequence exits is also similar, thus maintaining a dynamic balance of KV cache on each device. In addition, compared to LoongServe using only sequence parallelism, the hybrid parallelism strategy proposed in this application can effectively reduce communication volume and hide the communication overhead of sequence parallelism through device-local short sequence computation.
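The length-based division for the pre-filling stage summarised above can be sketched as follows. This is a minimal illustration under stated assumptions: the names `SeqClass`, `classify_prefill`, and `plan_batch` and the threshold values 512 and 8192 are hypothetical, and sequences whose length equals either threshold are assumed to fall into the long class, a boundary case the text leaves open.

```python
from enum import Enum


class SeqClass(Enum):
    SHORT = "data parallelism"
    LONG = "sequence parallelism"
    ULTRA_LONG = "cross-iteration sequence parallelism"


def classify_prefill(length, first_threshold, second_threshold):
    """Map a request length to the parallel strategy used in pre-filling."""
    if not 0 < first_threshold < second_threshold:
        raise ValueError("expected 0 < first_threshold < second_threshold")
    if length < first_threshold:
        return SeqClass.SHORT
    if length <= second_threshold:  # ASSUMPTION: both boundaries count as long
        return SeqClass.LONG
    return SeqClass.ULTRA_LONG


def plan_batch(lengths, first_threshold, second_threshold):
    """Group the request indices of one batch by sequence class."""
    plan = {cls: [] for cls in SeqClass}
    for index, length in enumerate(lengths):
        plan[classify_prefill(length, first_threshold, second_threshold)].append(index)
    return plan


if __name__ == "__main__":
    # Hypothetical thresholds; suitable values depend on the hardware.
    for cls, indices in plan_batch([100, 600, 20000, 50], 512, 8192).items():
        print(cls.name, "->", cls.value, "requests", indices)
```

Grouping the batch first lets the scheduler dispatch the short group to individual devices immediately, while the long and ultra-long groups are handed to the sequence-parallel path.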
[0179] Furthermore, the first, second, or third parallel acceleration method for MoE sparse large model inference performed by the parallel acceleration device in the above embodiments of this application can be executed on either a server or a client device. The specific choice can be made according to the processing capability of the client device and the constraints of the user's usage scenario; this application does not limit this. If all operations are performed on the client device, the client device may further include a processor for carrying out the specific processing of the parallel acceleration of MoE sparse large model inference.
[0180] The aforementioned client device may have a communication module (i.e., a communication unit) that can communicate with a remote server to achieve data transmission with the server. The server may include a server on the task scheduling center side; in other implementation scenarios, it may also include a server on an intermediate platform, such as a server on a third-party server platform that has a communication link with the task scheduling center server. The server may include a single computer device, a server cluster consisting of multiple servers, or a distributed server structure.
[0181] The server and the client device can communicate using any suitable network protocol, including those not yet developed as of the filing date of this application. Such network protocols may include, for example, TCP/IP, UDP/IP, HTTP, and HTTPS. Furthermore, they may also include RPC (Remote Procedure Call) and REST (Representational State Transfer) mechanisms used on top of the aforementioned protocols.
[0182] This application also provides an electronic device, which may include a processor, a memory, a receiver, and a transmitter. The processor is used to execute the first, second, or third parallel acceleration method for MoE sparse large model inference mentioned in the above embodiments. The processor and the memory can be connected via a bus or by other means; a bus connection is taken as an example here. The receiver can be connected to the processor and the memory by wired or wireless means.
[0183] The processor can be a central processing unit (CPU). The processor can also be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or combinations of the above types of chips.
[0184] Memory, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions / modules corresponding to the first, second, or third MoE sparse large model inference parallel acceleration methods in the embodiments of this application. The processor executes various functional applications and data processing by running the non-transitory software programs, instructions, and modules stored in the memory, thereby implementing the first, second, or third MoE sparse large model inference parallel acceleration methods in the above method embodiments.
[0185] The memory may include a program storage area and a data storage area. The program storage area may store the operating system and applications required for at least one function; the data storage area may store data created by the processor, etc. Furthermore, the memory may include high-speed random access memory and non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory may optionally include memory remotely located relative to the processor, which can be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
[0186] One or more modules are stored in the memory and, when executed by the processor, perform the parallel acceleration method for MoE sparse large model inference of the above embodiments.
[0187] In some embodiments of this application, the user equipment may include a processor, a memory, and a transceiver unit. The transceiver unit may include a receiver and a transmitter. The processor, memory, receiver, and transmitter may be connected via a bus system. The memory is used to store computer instructions, and the processor is used to execute the computer instructions stored in the memory to control the transceiver unit to send and receive signals.
[0188] As one implementation method, the functions of the receiver and transmitter in this application can be implemented by transceiver circuits or dedicated transceiver chips, and the processor can be implemented by dedicated processing chips, processing circuits or general-purpose chips.
[0189] As another implementation approach, the server provided in this application embodiment can be implemented using a general-purpose computer. That is, the program code implementing the processor, receiver, and transmitter functions is stored in memory, and the general-purpose processor implements the processor, receiver, and transmitter functions by executing the code in memory.
[0190] This application also provides a computer-readable storage medium storing a computer program thereon. When executed by a processor, the computer program implements the steps of the aforementioned first, second, or third parallel acceleration method for MoE sparse large model inference. The computer-readable storage medium can be a tangible storage medium, such as random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, floppy disks, hard disks, removable storage disks, CD-ROMs, or any other form of storage medium known in the art.
[0191] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the aforementioned first, second, or third parallel acceleration method for MoE sparse large model inference.
[0192] Those skilled in the art will understand that the exemplary components, systems, and methods described in conjunction with the embodiments disclosed herein can be implemented in hardware, software, or a combination of both. Whether implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application. When implemented in hardware, it can be, for example, electronic circuits, application-specific integrated circuits (ASICs), appropriate firmware, plug-ins, function cards, etc. When implemented in software, the elements of this application are programs or code segments used to perform the required tasks. The programs or code segments can be stored on a machine-readable medium or transmitted over a transmission medium or communication link via data signals carried on a carrier wave.
[0193] It should be clarified that this application is not limited to the specific configurations and processes described above and shown in the figures. For the sake of brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method process of this application is not limited to the specific steps described and shown. Those skilled in the art can make various changes, modifications, and additions, or change the order of steps, after understanding the spirit of this application.
[0194] In this application, features described and / or illustrated for one embodiment may be used in the same or similar manner in one or more other embodiments, and / or combined with or in place of features of other embodiments.
[0195] The above description is merely a preferred embodiment of this application and is not intended to limit this application. Various modifications and variations can be made to the embodiments of this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the protection scope of this application.
Claims
1. A parallel acceleration method for MoE sparse large model inference, characterized in that it comprises: in the pre-filling stage of MoE sparse large model inference, classifying each request sequence within the same batch according to a preset pre-filling stage classification strategy, wherein the pre-filling stage classification strategy includes: classifying request sequences with a length less than a first threshold as a first type of sequence, classifying request sequences with a length between the first threshold and a second threshold as a second type of sequence, and classifying request sequences with a length greater than the second threshold as a third type of sequence, the second threshold being greater than the first threshold; if the first type of sequence is obtained by the classification, distributing each first type of sequence to multiple devices in a data parallel manner for parallel multi-head attention computation; if the second type of sequence is obtained by the classification, processing the second type of sequence in a sequence parallel manner; and if the third type of sequence is obtained by the classification, dividing the third type of sequence into sub-length sequences, assigning the sub-length sequences to multiple iterations, and, in each iteration, processing the corresponding sub-length sequence in the sequence parallel manner.
2. The parallel acceleration method for MoE sparse large model inference according to claim 1, characterized in that the step of dividing the third type of sequence into sub-length sequences and assigning the sub-length sequences to multiple iterations includes: determining the number of segments for the request sequence based on the ratio of the length of the request sequence to the first threshold; evenly dividing the request sequence into multiple sub-length sequences according to the number of segments; and assigning the sub-length sequences one-to-one to multiple iterations.
3. The parallel acceleration method for MoE sparse large model inference according to claim 1, characterized in that the step of processing the sub-length sequence in the sequence parallel manner in each iteration includes: in each iteration, if the second type of sequence exists in the current iteration, jointly dividing the sub-length sequence corresponding to the current iteration and the second type of sequence existing in the current iteration into sub-sequences in the sequence parallel manner, so that each device performs multi-head attention computation in parallel on its assigned sub-sequences.
4. The parallel acceleration method for MoE sparse large model inference according to claim 1, characterized in that the sequence parallel manner includes: evenly dividing the sequence to be processed into 2d segments, wherein the sequence to be processed includes the second type of sequence and/or the sub-length sequence corresponding to the third type of sequence in a single iteration, and d is the number of devices; and assigning the i-th segment and the i-th-from-last segment to the i-th device.
5. The parallel acceleration method for MoE sparse large model inference according to claim 3, characterized in that, in an iteration where the first type of sequence exists, each of the devices is configured to: perform the data transmission required for multi-head attention computation in the sequence parallel manner while performing multi-head attention computation in the data parallel manner, and, after receiving the transmitted data, perform multi-head attention computation in parallel on the sub-sequences allocated in the current iteration.
6. The parallel acceleration method for MoE sparse large model inference according to claim 5, characterized in that the data transmission required for multi-head attention computation in the sequence parallel manner includes: transmitting key-value cache data between devices using a ring communication method, and, when key-value cache data is received from a neighboring device, computing attention between the received data and the local query vector.
7. The parallel acceleration method for MoE sparse large model inference according to claim 5, characterized in that the data transmission required for multi-head attention computation in the sequence parallel manner includes: exchanging, through at least one all-to-all communication operation, the query, key, and value tensors partitioned along the sequence dimension between the devices, and performing multi-head attention computation in parallel after the complete sequence data is obtained; wherein the all-to-all communication operation and the multi-head attention computation in the data parallel manner are performed asynchronously.
8. A parallel acceleration method for MoE sparse large model inference, characterized in that it comprises: in the decoding stage of MoE sparse large model inference, classifying each request sequence within the same batch according to a preset decoding stage classification strategy, wherein the decoding stage classification strategy includes: classifying request sequences with a length less than a target threshold as short sequences, and classifying request sequences with a length equal to or greater than the target threshold as long sequences; if the short sequences are obtained by the classification, distributing each short sequence to multiple devices in a data parallel manner for parallel multi-head attention computation; and if the long sequence is obtained by the classification, dividing the key-value cache of the long sequence into multiple key-value cache segments stored separately on each of the devices, broadcasting the input token of the current decoding step to the multiple devices so that each device performs self-attention computation using its locally stored key-value cache segment and the input token to obtain a local result, reducing the local results obtained by the devices to obtain the output token of the current decoding step, and storing the key-value cache corresponding to the output token on the device whose current key-value cache capacity meets a preset load balancing condition.
9. The parallel acceleration method for MoE sparse large model inference according to claim 8, characterized in that the input token of the current decoding step includes: the first token generated in the pre-filling stage of MoE sparse large model inference, or the output token generated in the previous decoding step.
10. The parallel acceleration method for MoE sparse large model inference according to claim 8, characterized in that the preset load balancing condition includes: the current key-value cache capacity is the lowest among all devices.
11. The parallel acceleration method for MoE sparse large model inference according to claim 8, characterized in that, if the short sequence exists in the decoding step, each of the devices is configured to: perform the broadcast and reduction operations while performing multi-head attention computation in the data parallel manner.
12. The parallel acceleration method for MoE sparse large model inference according to claim 8, characterized in that the step of dividing the key-value cache of the long sequence into multiple key-value cache segments and storing them separately on each of the devices includes: dividing the key-value cache of the long sequence into multiple key-value cache segments based on the number of devices, and storing the key-value cache segments one-to-one on the devices.
13. A parallel acceleration method for end-to-end inference of a MoE sparse large model, characterized in that it comprises: the parallel acceleration method for MoE sparse large model inference as described in any one of claims 1 to 7, and the parallel acceleration method for MoE sparse large model inference as described in any one of claims 8 to 12.
14. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the parallel acceleration method for MoE sparse large model inference as described in any one of claims 1 to 7, the parallel acceleration method for MoE sparse large model inference as described in any one of claims 8 to 12, or the parallel acceleration method for end-to-end inference of a MoE sparse large model as described in claim 13.
15. A computer-readable storage medium having a computer program stored thereon, characterized in that, when executed by a processor, the computer program implements the parallel acceleration method for MoE sparse large model inference as described in any one of claims 1 to 7, the parallel acceleration method for MoE sparse large model inference as described in any one of claims 8 to 12, or the parallel acceleration method for end-to-end inference of a MoE sparse large model as described in claim 13.
16. A computer program product, comprising a computer program, characterized in that, when executed by a processor, the computer program implements the parallel acceleration method for MoE sparse large model inference as described in any one of claims 1 to 7, the parallel acceleration method for MoE sparse large model inference as described in any one of claims 8 to 12, or the parallel acceleration method for end-to-end inference of a MoE sparse large model as described in claim 13.
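The partitioning recited in claims 2 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function names (`split_evenly`, `chunk_ultra_long`, `zigzag_assign`, `causal_cost`) are hypothetical, the number of segments in claim 2 is assumed to be the ratio rounded up, and devices are indexed from 0. The `causal_cost` helper counts the query-key pairs computed under a causal mask, which shows why pairing the i-th segment with the i-th-from-last segment gives every device the same attention workload when the sequence length is divisible by 2d.

```python
def split_evenly(length, parts):
    """Split positions 0..length-1 into `parts` contiguous (start, end) spans
    whose sizes differ by at most one."""
    bounds = [length * i // parts for i in range(parts + 1)]
    return list(zip(bounds, bounds[1:]))


def chunk_ultra_long(length, first_threshold):
    """Claim 2: one sub-length sequence per iteration.

    ASSUMPTION: the number of segments is ceil(length / first_threshold).
    """
    num_segments = -(-length // first_threshold)  # integer ceiling division
    return split_evenly(length, num_segments)


def zigzag_assign(length, num_devices):
    """Claim 4: device i gets the i-th and the i-th-from-last of 2d segments."""
    segments = split_evenly(length, 2 * num_devices)
    last = 2 * num_devices - 1
    return [(segments[i], segments[last - i]) for i in range(num_devices)]


def causal_cost(span):
    """Query-key pairs computed for a span under a causal mask:
    the query at position p attends to p + 1 keys."""
    start, end = span
    return sum(p + 1 for p in range(start, end))


if __name__ == "__main__":
    print(chunk_ultra_long(20000, 4096))
    for device, (head, tail) in enumerate(zigzag_assign(16, 2)):
        print(device, head, tail, causal_cost(head) + causal_cost(tail))
```

A plain contiguous split would leave the last device with far more causal attention work than the first; the paired assignment removes that skew without any extra communication.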