Computational optimization method, device, system and apparatus for hybrid expert model

By using a mask matrix to filter valid tokens and generate a tightly packed feature submatrix in a hybrid expert model, the high communication overhead and latency caused by cross-device data transmission are reduced, thereby improving computational efficiency and overall model performance.

CN122240986A · Pending · Publication Date: 2026-06-19 · SHANGHAI BIREN TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Application (China)
Current Assignee / Owner
SHANGHAI BIREN TECH CO LTD
Filing Date
2026-03-09
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In distributed deployment scenarios of hybrid expert models, the data transmission process across devices in existing technologies leads to high communication overhead and latency, affecting model performance.

Method used

Valid tokens are filtered using a mask matrix, and a tightly packed feature submatrix is generated. Local experts then perform calculations only on valid tokens. Finally, the calculation results are integrated through a lightweight aggregation communication, which prevents invalid data from consuming computing resources and removes the need for inverse permutation operations.

Benefits of technology

It significantly improves computational efficiency, reduces communication overhead and computational latency, and enhances the overall performance of the model.

✦ Generated by Eureka AI based on patent content.

Figure CN122240986A_ABST

Abstract

This invention discloses a computational optimization method, apparatus, system, and device for a hybrid expert model. The method includes: obtaining the feature matrix and the mask matrix corresponding to all input tokens, wherein the mask matrix is used to determine the allocation status of each input token among local experts; for each local expert, using the mask matrix to convert the feature matrix into a feature submatrix containing only the corresponding valid tokens, and recording the index information of each valid token among all input tokens; calculating the feature submatrix corresponding to each local expert to obtain the corresponding calculation result; and, based on the index information, aggregating the calculation results corresponding to each local expert into the output result of a shared expert to obtain the global calculation result corresponding to all input tokens. This invention can effectively improve the computational efficiency of the model and significantly reduce communication overhead.

Description

Technical Field

[0001] This invention relates to the field of artificial intelligence technology, and in particular to a computational optimization method, apparatus, system, and electronic device for a hybrid expert model.

Background Technology

[0002] With the rapid development of artificial intelligence technology, deep learning large models are increasingly widely used in fields such as natural language processing and computer vision. Mixture of Experts (MoE) models, with their advantages of "sparse activation and scalability", have become one of the mainstream architectures for building ultra-large-scale models.

[0003] In a model-parallel distributed deployment, the experts of the MoE model are distributed across different computing devices (i.e., AI processors). After the input tokens undergo routing computation through a gating network, existing technologies typically use permutation and unpermutation operators, together with cross-device communication (i.e., all-to-all communication), to exchange tokens between computing devices according to the routing results. However, this cross-device data transmission incurs high communication overhead and latency, which in turn degrades the overall performance of the MoE model.

Summary of the Invention

[0004] The purpose of this invention is to provide a computational optimization method, apparatus, system, and electronic device for a hybrid expert model. Valid tokens are filtered through a mask matrix to generate a tightly arranged feature submatrix, and each local expert performs calculations only on its valid tokens, so that invalid data no longer occupies computational resources and the computational efficiency of the model is effectively improved. Furthermore, the results are integrated through a single lightweight aggregation communication at the end, without the need for inverse permutation operations, which can significantly reduce communication overhead.

[0005] A first aspect of the present invention provides a computational optimization method for a hybrid expert model, applied to an artificial intelligence processor, wherein the artificial intelligence processor is equipped with one or more local experts. The method includes:
obtaining the feature matrix and the mask matrix corresponding to all input tokens, wherein the mask matrix is used to determine the allocation status of each input token among the local experts;
for each local expert, converting the feature matrix, using the mask matrix, into a feature submatrix containing only the corresponding valid tokens, and recording the index information of each valid token among all the input tokens;
calculating the feature submatrix corresponding to each local expert to obtain the corresponding calculation result;
based on the index information, aggregating the calculation results corresponding to each local expert into the output result of the shared expert to obtain the global calculation result corresponding to all the input tokens.

[0006] Optionally, the step of aggregating the calculation results corresponding to each local expert into the output result of the shared expert based on the index information includes: based on the index information, performing weighted accumulation of the calculation results corresponding to each local expert to obtain the intermediate result of local aggregation, and determining the relevant position information of the intermediate result in the global calculation result; and, based on the relevant position information and in conjunction with a reduction operation, aggregating the intermediate result into the output result of the shared expert.

[0007] Optionally, during the process of calculating the kth feature submatrix using the kth local expert, the artificial intelligence processor performs data preprocessing operations of the (k+1)th local expert in parallel; wherein, the data preprocessing operations include: transformation operations of the (k+1)th feature submatrix and recording operations of the corresponding index information; wherein, k≥1.

[0008] Optionally, the mask matrix is generated by the gating network after routing and allocating all the input tokens.

[0009] Optionally, the feedforward neural network corresponding to each local expert includes: an upward projection weight matrix, a non-linear activation function, and a downward projection weight matrix.

[0010] A second aspect of the present invention provides a computational optimization apparatus for a hybrid expert model, applied to an artificial intelligence processor, wherein the artificial intelligence processor is equipped with one or more local experts. The apparatus includes:
a data acquisition module, used to acquire the feature matrix and the mask matrix corresponding to all input tokens, wherein the mask matrix is used to determine the allocation status of each input token among the local experts;
a matrix transformation module, used to transform the feature matrix, for each local expert and using the mask matrix, into a feature submatrix containing only the corresponding valid tokens, and to record the index information of each valid token among all the input tokens;
an expert calculation module, used to calculate the feature submatrix corresponding to each local expert and obtain the corresponding calculation result;
a result aggregation module, used to aggregate the calculation results corresponding to each local expert into the output result of the shared expert based on the index information, so as to obtain the global calculation result corresponding to all the input tokens.

[0011] Furthermore, the result aggregation module includes: a local aggregation unit, used to perform weighted accumulation of the calculation results corresponding to each local expert based on the index information to obtain the intermediate result of local aggregation, and to determine the relevant position information of the intermediate result in the global calculation result; and a global aggregation unit, used to aggregate the intermediate result into the output result of the shared expert based on the relevant position information and in conjunction with a reduction operation.

[0012] Furthermore, during the process of calculating the kth feature submatrix using the kth local expert, the artificial intelligence processor executes the data preprocessing operation of the (k+1)th local expert in parallel; wherein the data preprocessing operation includes: the transformation operation of the (k+1)th feature submatrix and the recording operation of the corresponding index information; wherein k≥1.

[0013] A third aspect of the present invention provides a computational optimization system for a hybrid expert model, comprising multiple artificial intelligence processors operating in parallel; wherein all input tokens acquired by each of the artificial intelligence processors are identical, and are used to execute the computational optimization method for a hybrid expert model as described in any embodiment of the first aspect.

[0014] A fourth aspect of the present invention provides an electronic device including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the computational optimization method of the hybrid expert model as described in any embodiment of the first aspect.

[0015] Compared with existing technologies, embodiments of the present invention provide a computational optimization method, apparatus, system, and electronic device for hybrid expert models. The embodiments first use a mask matrix to filter out the valid tokens corresponding to each expert and generate a tightly arranged feature submatrix, while recording the index information of the valid tokens. Next, each local expert performs calculations only on its feature submatrix, so that invalid tokens do not consume computational resources, thereby significantly improving computational efficiency. Finally, based on the recorded index information, a lightweight aggregation operation integrates the calculation results of each expert into the output result of the shared expert, without requiring additional inverse permutation operations, which can significantly reduce the communication overhead and the computational latency of the final result.

Attached Figure Description

[0016] Figure 1 is a flowchart of an embodiment of the computational optimization method for the hybrid expert model provided by the present invention;
Figure 2 is a flowchart of another embodiment of the computational optimization method for the hybrid expert model provided by the present invention;
Figure 3 is a schematic structural diagram of an embodiment of the computational optimization device for the hybrid expert model provided by the present invention;
Figure 4 is a schematic structural diagram of an embodiment of the artificial intelligence processor provided by the present invention.

[0017] Figure 5 is a schematic structural diagram of an embodiment of the electronic device provided by the present invention.

Detailed Implementation

[0018] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0019] The artificial intelligence processor involved in this invention can be any one of CPU (Central Processing Unit), GPU (Graphics Processing Unit), TPU (Tensor Processing Unit), NPU (Neural Network Processing Unit), DPU (Deep Learning Processing Unit), APU (Accelerated Processing Unit), and GPGPU (General-Purpose computing on Graphics Processing Unit), depending on its application to a specific product or technology in the embodiments of this invention.

[0020] It is worth noting that in the initial scheme proposed by the inventors, a "full token computation" method was adopted for each local expert on the artificial intelligence processor. That is, each local expert directly performed feed-forward network (FFN) computation based on the feature matrix corresponding to the full input tokens, and only after the computation was completed, the computation results of valid tokens were filtered through the mask matrix output by the gating network.

[0021] After research, the inventors discovered that the mask matrix of the MoE model exhibits extremely strong sparsity, meaning that the number of valid tokens corresponding to each expert is relatively small compared to the total number of input tokens. This results in a large number of invalid elements (corresponding to features of invalid tokens) in the left matrix (i.e., the feature matrix) during FFN calculation, requiring matrix multiplication to be performed on a large amount of meaningless data. This not only causes a serious waste of computing resources but also increases computational latency. To address this, the inventors further proposed a computational optimization method, device, system, and equipment for hybrid expert models. The optimization idea is "sparse matrix compact arrangement + calculation of only valid tokens." That is, valid tokens are pre-selected through the mask matrix, transforming the sparse global feature matrix into a compact feature submatrix containing only valid tokens. This allows experts to perform calculations only on valid data, thereby reducing invalid calculations at the source and improving the performance of the FFN operator. In addition, each card independently completes local calculations, and the global results are integrated through only one lightweight reduction operation, without the need for additional reverse permutation operations.
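The difference between the initial "full token computation" scheme and the "compact arrangement" idea can be illustrated with a minimal sketch. It assumes a hypothetical one-layer `toy_expert` (a ReLU after one matrix multiplication) in place of a real FFN, and uses plain Python lists to stay dependency-free; none of these names come from the patent. Because the expert acts on each token row independently, filtering before or after the computation gives the same rows, but the compact path multiplies fewer rows.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def toy_expert(x, w):
    """Hypothetical stand-in for an expert FFN: ReLU(x @ w), row by row."""
    return [[max(v, 0.0) for v in row] for row in matmul(x, w)]


features = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.0]]  # [token_num=4, hidden_size=2]
weights = [[1.0, -1.0], [0.5, 2.0]]                            # [hidden_size, hidden_size]
mask_row = [0.0, 0.7, 0.0, 0.3]                                # one local expert's mask row

# Initial scheme: compute all tokens, then keep only the valid rows.
full = toy_expert(features, weights)
kept_after = [full[t] for t, m in enumerate(mask_row) if m != 0]

# Optimized scheme: keep only the valid rows, then compute.
valid_index = [t for t, m in enumerate(mask_row) if m != 0]
compact = [features[t] for t in valid_index]
kept_before = toy_expert(compact, weights)

rows_computed_full = len(features)     # rows fed to the matmul in the initial scheme
rows_computed_compact = len(compact)   # rows fed to the matmul in the compact scheme
```

In this toy case both paths return the same rows for tokens 1 and 3, while the compact path feeds 2 rows to the matrix multiplication instead of 4.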

[0022] See Figure 1, which is a flowchart of an embodiment of the computational optimization method for hybrid expert models provided by the present invention.

[0023] A first aspect of the present invention provides a computational optimization method for a hybrid expert model, applied to an artificial intelligence processor, wherein the artificial intelligence processor is equipped with one or more local experts. The method includes steps S1 to S4, as follows:
Step S1: Obtain the feature matrix and the mask matrix corresponding to all input tokens, wherein the mask matrix is used to determine the allocation status of each input token among the local experts;
Step S2: For each local expert, use the mask matrix to convert the feature matrix into a feature submatrix containing only the corresponding valid tokens, and record the index information of each valid token among all the input tokens;
Step S3: Calculate the feature submatrix corresponding to each local expert to obtain the corresponding calculation result;
Step S4: Based on the index information, aggregate the calculation results corresponding to each local expert into the output result of the shared expert to obtain the global calculation result corresponding to all the input tokens.

[0024] It should be noted that the embodiments of the present invention adopt an Expert Parallelism (EP) scheme, that is, the expert set of a MoE layer is divided across multiple parallel computing devices (i.e., artificial intelligence processors). For example, if there are 256 experts in the MoE layer, and the EP8 scheme is adopted, these 256 experts need to be distributed across 8 GPUs, with 32 local experts deployed on each GPU, and the above 8 GPUs share the same input token.
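The EP8 example above can be checked with a small sketch. The contiguous assignment of expert IDs to devices and the function name `local_expert_ids` are assumptions made for illustration; the text only states that 256 experts are split evenly across 8 GPUs.

```python
def local_expert_ids(num_experts, ep_size, rank):
    """Return the global expert IDs hosted on device `rank`.

    Assumes experts are split evenly and contiguously across devices,
    which is one possible layout and not stated in the source.
    """
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    per_device = num_experts // ep_size
    start = rank * per_device
    return range(start, start + per_device)


experts_on_gpu0 = local_expert_ids(256, 8, 0)  # experts 0..31
experts_on_gpu7 = local_expert_ids(256, 8, 7)  # experts 224..255
```

Under this layout every device holds 32 local experts, matching the figure quoted in the paragraph above.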

[0025] In step S1, the artificial intelligence processor (such as a GPU) obtains the feature matrix of all input tokens in the current batch, as well as the mask matrix pre-computed by the gating network.

[0026] A token, also called a "lexical unit," is the smallest processing unit obtained after splitting input data. In natural language processing tasks, a token can be a single word or sub-word extracted from text; in computer vision tasks, a token can be a patch of an image. Each token corresponds to a set of feature data (a vector representation in a high-dimensional space), so the feature matrix corresponding to all input tokens has dimensions [token_num, hidden_size], where token_num is the total number of input tokens and hidden_size is the feature dimension of the model.

[0027] The mask matrix is a sparse matrix used to describe the allocation relationship between tokens and local experts. It is generated by the gating network after routing calculations for all input tokens. The mask matrix has dimensions [expert_num, token_num], where expert_num is the number of local experts on the current AI processor. Each element represents the "allocation state" between the corresponding token and the corresponding local expert. If an element is zero, the corresponding token is invalid for that local expert, i.e., the local expert does not need to process this token. If it is non-zero, the corresponding token is a "valid token" for that local expert, and the value is the weight / probability assigned to that local expert by the gating network. Because each AI processor hosts a different set of local experts, the mask matrix obtained by each AI processor is different.
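A minimal sketch of how such a per-processor mask might be derived from routing results is given here. The top-k selection rule, the function name `build_local_mask`, and the toy probabilities are assumptions for illustration; the text only says that the gating network produces the mask.

```python
def build_local_mask(gate_probs, top_k, local_experts):
    """Build the [expert_num, token_num] mask for one processor.

    gate_probs[t][e] is the routing probability of token t for global expert e.
    An entry is the probability if the expert is among the token's top-k
    choices (an assumed routing rule), otherwise 0.
    """
    token_num = len(gate_probs)
    mask = [[0.0] * token_num for _ in local_experts]
    for t, probs in enumerate(gate_probs):
        chosen = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:top_k]
        for row, e in enumerate(local_experts):
            if e in chosen:
                mask[row][t] = probs[e]
    return mask


gate_probs = [
    [0.6, 0.1, 0.2, 0.1],    # token 0 -> experts 0 and 2
    [0.1, 0.5, 0.1, 0.3],    # token 1 -> experts 1 and 3
    [0.25, 0.15, 0.5, 0.1],  # token 2 -> experts 2 and 0
]
mask_dev0 = build_local_mask(gate_probs, top_k=2, local_experts=[0, 1])
mask_dev1 = build_local_mask(gate_probs, top_k=2, local_experts=[2, 3])
```

The two processors see the same tokens but end up with different masks, since each mask only has rows for the experts hosted locally.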

[0028] Local experts are a group of experts deployed within a single AI processor, each responsible for processing the input data assigned to them.

[0029] In step S2, the AI processor uses the mask matrix to filter out the corresponding valid tokens for each local expert and converts the feature matrix into a feature submatrix containing only the corresponding valid tokens. Simultaneously, the AI processor records the position index information of each extracted valid token among all input tokens. This process occurs only within the card, without inter-card interaction or all-to-all communication.

[0030] In step S3, each local expert performs feedforward neural network (FFN) computation only on its own feature submatrix. Since invalid tokens have been removed from the feature submatrix, the matrix multiplication at this point is dense and efficient. In other words, local experts compute only on their own valid tokens and avoid wasted operations on large amounts of zero-valued or meaningless data, thereby significantly reducing the load on the tensor computation core and improving computational efficiency.

[0031] In step S4, the shared expert is also a feedforward neural network used to extract general features from all input tokens and obtain the corresponding output results. This embodiment of the invention aggregates the computation results of each local expert within each AI processor to the output result of the shared expert, obtaining a global computation result. This global computation result improves the generalization performance of MoE. This process does not require the execution of the unpermutation operator; instead, it uses index information and reduction operations to accumulate the computation results of the local experts onto the corresponding data in the output result of the shared expert.

[0032] As can be seen from the above, the embodiments of the present invention significantly reduce the computational load of the tensor computation core by eliminating invalid data, and complete the global result integration with only one lightweight reduction operation without the need for additional reverse permutation operations, which can significantly reduce communication overhead (only one communication interaction).

[0033] In an optional embodiment, the step of aggregating the calculation results corresponding to each local expert into the output result of the shared expert based on the index information includes: based on the index information, performing weighted accumulation of the calculation results corresponding to each local expert to obtain the intermediate result of local aggregation, and determining the relevant position information of the intermediate result in the global calculation result; and, based on the relevant position information and in conjunction with a reduction operation, aggregating the intermediate result into the output result of the shared expert.

[0034] Specifically, for all local experts deployed on the current AI processor (single card), the local aggregation operation within the single card is performed first. That is, based on the valid token index information recorded in step S2, and combined with the weights assigned by the gating network (such as the non-zero values in the mask matrix), the calculation results corresponding to each local expert are weighted and accumulated to obtain the intermediate result of local aggregation. The relevant position information of this intermediate result in the global calculation result is determined (that is, the original indices of the valid tokens corresponding to all local experts within the single card are integrated), providing a precise positioning basis for the subsequent aggregation into the output result of the shared expert.

[0035] After completing local aggregation within a single card, the intermediate results within that card are accumulated to the shared expert's output using a reduction operation (here, a lightweight reduction based on index-based accumulation) based on the location index provided by the relevant location information. Once all intermediate results within all cards have been reduced, the global computation result is obtained. In this process, each card independently completes its local computation, with only one lightweight reduce operation and no additional reverse permutation operations. This improves the parallelism of each card, thereby increasing aggregation efficiency.
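The two-stage aggregation described above (weighted local accumulation, then index-based accumulation onto the shared expert's output) can be sketched for a single card as follows. The function names, the use of a sorted list of token positions as the "relevant position information", and the toy values are all assumptions for illustration.

```python
def local_aggregate(expert_outputs, valid_token_index, valid_token_len, mask,
                    token_num, hidden_size):
    """Weighted accumulation of all local experts' results on one card.

    Returns the intermediate result [token_num, hidden_size] and the sorted
    token positions that received a contribution.
    """
    intermediate = [[0.0] * hidden_size for _ in range(token_num)]
    touched = set()
    for e, out in enumerate(expert_outputs):
        for row, t in enumerate(valid_token_index[e][:valid_token_len[e]]):
            weight = mask[e][t]  # gating weight stored in the mask
            for h in range(hidden_size):
                intermediate[t][h] += weight * out[row][h]
            touched.add(t)
    return intermediate, sorted(touched)


def accumulate_onto_shared(shared_out, intermediate, positions):
    """Index-based accumulation: add intermediate rows at the given positions."""
    result = [list(row) for row in shared_out]
    for t in positions:
        for h, value in enumerate(intermediate[t]):
            result[t][h] += value
    return result


mask = [[0.0, 0.5, 0.0, 0.25], [0.5, 0.0, 0.0, 0.75]]
valid_token_index = [[1, 3, 0, 0], [0, 3, 0, 0]]  # zero-padded rows
valid_token_len = [2, 2]
expert_outputs = [[[2.0, 4.0], [4.0, 8.0]], [[2.0, 2.0], [4.0, 0.0]]]

intermediate, positions = local_aggregate(
    expert_outputs, valid_token_index, valid_token_len, mask, token_num=4, hidden_size=2)
shared_out = [[10.0, 10.0] for _ in range(4)]
combined = accumulate_onto_shared(shared_out, intermediate, positions)
```

Token 3 is routed to both local experts, so its row receives two weighted contributions; token 2 is routed to neither and keeps the shared expert's output unchanged. No unpermutation step is involved, because each result row is written directly to its original token position.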

[0036] In an optional embodiment, the artificial intelligence processor performs a data preprocessing operation of the (k+1)th local expert in parallel while calculating the kth feature submatrix using the kth local expert; wherein the data preprocessing operation includes: a transformation operation of the (k+1)th feature submatrix and a recording operation of the corresponding index information; wherein k≥1.

[0037] Specifically, while performing the feedforward network computation task (including but not limited to matrix multiplication, activation function calculation, etc.) of the kth feature submatrix corresponding to the kth local expert, the data preprocessing operation of the k+1th local expert is started and executed in parallel, such as filtering out the required set of valid tokens from the full set of input tokens and generating the tightly arranged k+1th feature submatrix, and recording the index information of each valid token in the k+1th feature submatrix in the original input tokens.

[0038] By introducing a pipelined parallel mechanism between computation and data preprocessing, this invention effectively hides the data preparation latency behind the computation, further improving the overall computational efficiency of the MoE model. Likewise, during local aggregation, the FFN calculation of the k-th local expert can be performed in parallel with the weighted accumulation of the calculation results of the k-th local expert.
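The overlap of expert k's computation with expert k+1's preprocessing can be sketched as a simple software pipeline. This is only an illustration of the schedule: it uses a Python background thread, a placeholder `compute` in place of the FFN, and hypothetical function names, whereas a real AI processor would overlap these stages on separate hardware units or streams.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess(features, mask_row):
    """Build the compact feature submatrix and the original token indices."""
    index = [t for t, weight in enumerate(mask_row) if weight != 0]
    return [features[t] for t in index], index


def compute(sub):
    """Placeholder for the expert FFN (assumption: simply doubles each value)."""
    return [[2.0 * v for v in row] for row in sub]


def run_pipelined(features, mask):
    """Compute expert k while expert k+1 is being preprocessed in the background."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(preprocess, features, mask[0])
        for k in range(len(mask)):
            sub, index = pending.result()      # wait for expert k's data
            if k + 1 < len(mask):              # start preparing expert k+1
                pending = pool.submit(preprocess, features, mask[k + 1])
            results.append((index, compute(sub)))  # overlaps with the preprocessing
    return results


features = [[1.0], [2.0], [3.0]]
mask = [[0.5, 0.0, 0.5], [0.0, 1.0, 0.0]]
pipelined = run_pipelined(features, mask)
```

The loop keeps exactly one preprocessing task in flight, so the data for expert k+1 is normally ready by the time the computation for expert k finishes.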

[0039] Figure 2 is a flowchart of another embodiment of the computational optimization method for the hybrid expert model provided by this invention. The following explanation focuses on the execution flow of any one AI processor in the EP scheme (assuming 32 local experts deployed on a single card): In step S1, the feature matrix corresponding to all input tokens is obtained (i.e., the "input" in Figure 2), with dimensions [token_num, hidden_size], where token_num is the total number of input tokens and hidden_size is the model feature dimension (taking the DeepSeek model as an example, hidden_size=7168).

[0040] Simultaneously, all input tokens are routed and allocated through a gating network to generate a mask matrix (i.e., the "mask matrix" in Figure 2), with dimensions [expert_num, token_num]=[32, token_num], where expert_num is the number of local experts on the current AI processor and equals 32. A 0 value in the mask matrix indicates that the corresponding token is invalid for that local expert, and a non-zero value (i.e., the assigned probability / weight) indicates that the corresponding token is a valid processing object for that local expert.

[0041] In step S2, the "① Count valid token information" operation is performed first. This involves counting the number of valid tokens corresponding to each local expert (denoted as valid_token_num) based on the mask matrix, and simultaneously recording the index of each valid token among the input tokens. This yields a valid token count matrix (denoted as valid_token_len, with dimension [expert_num]) and an index coordinate matrix (denoted as valid_token_index, with dimension [expert_num, token_num]) for the 32 experts. Since the number of valid tokens differs among experts, unused positions in the index coordinate matrix are filled with 0, and subsequent reads use only the leading entries of each row, up to that expert's valid token count.
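The "① Count valid token information" operation can be sketched as below. The names valid_token_len and valid_token_index follow the text; the function name and the toy mask are assumptions, and a real implementation would run as a device kernel and not as a Python loop.

```python
def count_valid_tokens(mask):
    """Return (valid_token_len, valid_token_index) for a [expert_num, token_num] mask.

    valid_token_index is zero-padded to token_num columns; only the first
    valid_token_len[e] entries of row e are meaningful.
    """
    expert_num, token_num = len(mask), len(mask[0])
    valid_token_len = [0] * expert_num
    valid_token_index = [[0] * token_num for _ in range(expert_num)]
    for e, row in enumerate(mask):
        for t, weight in enumerate(row):
            if weight != 0:
                valid_token_index[e][valid_token_len[e]] = t
                valid_token_len[e] += 1
    return valid_token_len, valid_token_index


mask = [
    [0.0, 0.7, 0.0, 0.3, 0.0],  # expert 0: tokens 1 and 3
    [0.5, 0.0, 0.0, 0.0, 0.0],  # expert 1: token 0
    [0.0, 0.0, 0.0, 0.0, 0.0],  # expert 2: no valid tokens
]
valid_token_len, valid_token_index = count_valid_tokens(mask)
```

Note that rows 1 and 2 of valid_token_index are identical here (all zeros) even though expert 1 really owns token 0; this is why the padded index must always be read together with valid_token_len.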

[0042] Next, the "② Convert the left matrix to a feature submatrix" operation is performed. That is, for each local expert, according to valid_token_len and valid_token_index, the feature vectors of this expert's valid tokens are selected from the full feature matrix and arranged compactly as a feature submatrix. The dimension is compressed from the original [token_num, hidden_size] to [valid_token_num, hidden_size]. Since the valid tokens are sparsely distributed and valid_token_num << token_num, the feature submatrix no longer contains the redundant feature data corresponding to the invalid tokens.
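The row selection in operation ② amounts to a gather over the recorded indices. The sketch below assumes the zero-padded valid_token_index layout described in the text; the function name `gather_submatrix` and the toy data are hypothetical.

```python
def gather_submatrix(features, valid_token_len, valid_token_index, expert):
    """Select one expert's valid token rows from the full feature matrix.

    Input is [token_num, hidden_size]; output is [valid_token_num, hidden_size].
    """
    count = valid_token_len[expert]
    return [list(features[t]) for t in valid_token_index[expert][:count]]


features = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1]]  # [5, 2]
valid_token_len = [2, 1, 0]
valid_token_index = [[1, 3, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]

sub0 = gather_submatrix(features, valid_token_len, valid_token_index, 0)
sub1 = gather_submatrix(features, valid_token_len, valid_token_index, 1)
sub2 = gather_submatrix(features, valid_token_len, valid_token_index, 2)
```

An expert with no valid tokens simply gets an empty submatrix, so the following FFN stage has nothing to compute for it.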

[0043] In step S3, each local expert performs calculations on the compactly arranged feature submatrix based on its own FFN structure (including the up-projection weight matrix, the non-linear activation function, and the down-projection weight matrix). Specifically, this includes:
(1) First-stage matrix multiplication (i.e., "③ mma_up does matrix multiplication"): the feature submatrix ([valid_token_num, hidden_size]) is multiplied by the up-projection weight matrix of the FFN. Only valid tokens are calculated, so no computing power is wasted on invalid tokens;
(2) Non-linear activation function calculation: the "④ swiglu" activation operation is applied to the result of the first-stage matrix multiplication. The operation scale covers only the dimension corresponding to valid_token_num;
(3) Second-stage matrix multiplication (i.e., "⑤ mma_down does matrix multiplication"): the activated result is multiplied by the down-projection weight matrix of the FFN to obtain the calculation result of this local expert for the valid tokens, with dimensions [valid_token_num, hidden_size].
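The three stages above can be sketched in a few lines. The sketch assumes a common SwiGLU layout in which the up-projection produces 2 × inter columns that are split into a gate half and a value half, combined as silu(gate) × value; the patent names "swiglu" but does not specify this layout, and the weights below are toy values.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu(rows):
    """Split each row into gate and value halves, return silu(gate) * value."""
    out = []
    for row in rows:
        half = len(row) // 2
        out.append([silu(g) * v for g, v in zip(row[:half], row[half:])])
    return out


def expert_ffn(sub, w_up, w_down):
    """FFN on a compact submatrix [valid_token_num, hidden_size]."""
    hidden = matmul(sub, w_up)        # (3) mma_up:   [valid_token_num, 2 * inter]
    activated = swiglu(hidden)        # (4) swiglu:   [valid_token_num, inter]
    return matmul(activated, w_down)  # (5) mma_down: [valid_token_num, hidden_size]


sub = [[1.0, 0.0, -1.0], [0.0, 0.0, 0.0]]                  # valid_token_num=2, hidden_size=3
w_up = [[0.5, 1.0, 0.2, 0.1], [0.3, 0.4, 0.6, 0.7], [0.1, 0.2, 0.3, 0.4]]  # [3, 2 * inter]
w_down = [[1.0, 0.5, 0.25], [0.5, 1.0, 2.0]]               # [inter=2, 3]
out = expert_ffn(sub, w_up, w_down)
```

Every stage operates on valid_token_num rows only, so the cost of all three stages scales with the number of valid tokens and not with token_num.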

[0044] In step S4, based on the valid token index information (i.e., valid_token_index) recorded in step S2, the "⑥ reduce" operation (i.e., the reduction operation) accumulates the calculation results of each local expert onto the output result of the shared expert to obtain the final global calculation result of all input tokens, with dimensions [token_num, hidden_size]. The global calculation result is obtained by accumulating the calculation results of the local experts in all artificial intelligence processors together with the output result of the shared expert.

[0045] As can be seen from the above, the computational optimization method for the hybrid expert model provided by the present invention has the following advantages: (1) Reduced computational load on the matrix multiplication Tcore (tensor computation core): taking a scenario with 256 tokens as an example, in the initial scheme each local expert needs to perform 4×n Tcore matrix multiplication operations, whereas the present invention needs only 1×n, which greatly reduces the computing power occupied; where n≥1.

[0046] (2) Reduced computation on the Vcore (vector computation core): since the number of valid tokens corresponding to a local expert is much smaller than the total number of tokens, more computational resources can be allocated to the hidden_size related operations to improve the feature expression capability of the model; (3) Elimination of additional latency: the final global calculation result is obtained by accumulating the calculation results of all local experts onto the output result of the shared expert using the original indices of the valid tokens. There is no need to execute the unpermutation operator, which avoids the latency introduced by that operator.
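The 4×n versus 1×n figure in advantage (1) is not derived in the text. One possible reading, offered here purely as an assumption, is that the tensor core processes the left matrix in fixed-height row tiles, so the number of matrix multiplication passes scales with the number of row tiles. The tile height of 64 rows and the count of 40 valid tokens below are hypothetical values chosen only because they reproduce a 4:1 ratio for 256 tokens.

```python
import math


def row_tiles(rows, tile_rows):
    """Number of row tiles needed to cover `rows` rows of the left matrix."""
    return math.ceil(rows / tile_rows)


TILE_ROWS = 64            # hypothetical tile height, not given in the source
token_num = 256           # from the example in the text
valid_token_num = 40      # hypothetical number of valid tokens for one expert

tiles_full = row_tiles(token_num, TILE_ROWS)           # initial scheme
tiles_compact = row_tiles(valid_token_num, TILE_ROWS)  # compact submatrix
```

Under these assumed numbers the initial scheme needs 4 row tiles per weight-column block and the compact scheme needs 1, which matches the ratio quoted in the text if n is read as the number of weight-column blocks.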

[0047] See Figure 3, which is a schematic structural diagram of an embodiment of the computational optimization device for hybrid expert models provided by the present invention.

[0048] A second aspect of the present invention provides a computational optimization apparatus for a hybrid expert model, applied to an artificial intelligence processor, wherein the artificial intelligence processor is equipped with one or more local experts. The apparatus includes:
a data acquisition module 11, used to acquire the feature matrix and the mask matrix corresponding to all input tokens, wherein the mask matrix is used to determine the allocation status of each input token among the local experts;
a matrix transformation module 12, used to transform the feature matrix, for each local expert and using the mask matrix, into a feature submatrix containing only the corresponding valid tokens, and to record the index information of each valid token among all the input tokens;
an expert calculation module 13, used to calculate the feature submatrix corresponding to each local expert and obtain the corresponding calculation result;
a result aggregation module 14, used to aggregate the calculation results corresponding to each local expert into the output result of the shared expert based on the index information, so as to obtain the global calculation result corresponding to all the input tokens.

[0049] Furthermore, the result aggregation module includes: a local aggregation unit, used to perform weighted accumulation of the calculation results corresponding to each local expert based on the index information to obtain the intermediate result of local aggregation, and to determine the relevant position information of the intermediate result in the global calculation result; and a global aggregation unit, used to aggregate the intermediate result into the output result of the shared expert based on the relevant position information and in conjunction with a reduction operation.

[0050] Furthermore, during the process of calculating the kth feature submatrix using the kth local expert, the artificial intelligence processor executes the data preprocessing operation of the (k+1)th local expert in parallel; wherein the data preprocessing operation includes: the transformation operation of the (k+1)th feature submatrix and the recording operation of the corresponding index information; wherein k≥1.

[0051] It should be noted that the computational optimization apparatus for a hybrid expert model provided in the second aspect of the present invention can implement all the processes of the computational optimization method for a hybrid expert model described in any embodiment of the first aspect of the present invention. The functions and technical effects of each module and unit in the apparatus correspond to those of the method described in any embodiment of the first aspect, and are not repeated here.

[0052] A third aspect of the present invention provides a computational optimization system for a hybrid expert model, comprising multiple artificial intelligence processors operating in parallel, wherein every artificial intelligence processor acquires the same set of all input tokens and is used to execute the computational optimization method for a hybrid expert model described in any embodiment of the first aspect of the present invention.

[0053] It should be noted that the computational optimization system of the hybrid expert model uses an expert-parallel distributed deployment mode with N artificial intelligence processors (i.e., "artificial intelligence processor 0" to "artificial intelligence processor N-1"), where N>1. Each artificial intelligence processor receives exactly the same input tokens and, independently and in parallel, executes the computational optimization method of the hybrid expert model described in any embodiment of the first aspect. Through the reduce operation, the N artificial intelligence processors accumulate the calculation results of their respective local experts onto the output result of the shared expert, finally obtaining the global calculation result corresponding to all input tokens.
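The expert-parallel arrangement can be simulated in a single process as a rough sketch. Several simplifications are assumed here and are not taken from the patent: each "processor" is just a function call, each expert is a toy function applied per token, routing weights are omitted (every routed result is added with weight 1), and the reduce is a plain elementwise sum. The names processor_partial, reduce_sum, and scale are hypothetical.

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]
Expert = Callable[[List[float]], List[float]]


def processor_partial(features: Matrix, mask: List[List[int]], local_experts: Dict[int, Expert]) -> Matrix:
    """One processor: run only its own experts on their valid tokens.

    The partial result has the shape of the global output; rows of tokens
    that none of the local experts handle stay zero.
    """
    partial = [[0.0] * len(row) for row in features]
    for expert_id, expert_fn in local_experts.items():
        indices = [t for t, row in enumerate(mask) if row[expert_id]]
        for t in indices:
            for j, value in enumerate(expert_fn(features[t])):
                partial[t][j] += value
    return partial


def reduce_sum(shared_output: Matrix, partials: List[Matrix]) -> Matrix:
    """Stand-in for the reduce step: add every partial onto the shared expert output."""
    result = [list(row) for row in shared_output]
    for partial in partials:
        for t, row in enumerate(partial):
            for j, value in enumerate(row):
                result[t][j] += value
    return result


def scale(factor: float) -> Expert:
    """Toy expert that multiplies a token's features by a constant."""
    return lambda x: [factor * v for v in x]


features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mask = [[1, 0, 1, 0], [0, 1, 0, 0], [0, 0, 1, 1]]  # 3 tokens x 4 experts
shared = [[100.0, 100.0], [200.0, 200.0], [300.0, 300.0]]

# N = 2 simulated processors; both see the same tokens and the same mask.
placement = [{0: scale(1.0), 1: scale(2.0)}, {2: scale(3.0), 3: scale(4.0)}]
partials = [processor_partial(features, mask, experts) for experts in placement]
global_result = reduce_sum(shared, partials)
print(global_result)
```

Since every processor already holds all tokens, the only cross-processor data movement in this sketch is the final sum of same-shaped partial results.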

[0054] See Figure 4, which is a schematic structural diagram of an embodiment of the artificial intelligence processor provided by the present invention.

[0055] The artificial intelligence processor provided in this embodiment of the invention includes multiple computing units (CUs) and a global memory (GLM). The CUs are the core components that perform the computation tasks of the hybrid expert model (MoE); each CU includes, but is not limited to, a tensor computation core (Tcore), a vector computation core (Vcore), and a group shared memory (GSM). The GLM stores the feature matrix and mask matrix corresponding to all input tokens, as well as the intermediate results of local aggregation. The GSM stores the feature submatrices corresponding to the local experts, the index information corresponding to the valid tokens, and the calculation results of the local experts.

[0056] Each computing unit is responsible for processing one or more local experts. Its Tcore performs the matrix multiplications of the FFN and operates only on the feature submatrix corresponding to each local expert, so no computation is spent on invalid tokens. Its Vcore parses the mask matrix, selects the valid tokens for each local expert, and performs general computational tasks such as evaluating the nonlinear activation function in the FFN and performing the weighted summation. During computation, multiple computing units can work in parallel, each processing its assigned local experts, which achieves expert-level parallel acceleration. The artificial intelligence processor also performs a lightweight reduction operation over the inter-chip interconnect network to integrate the global result, without executing an unpermutation operator, which significantly reduces communication overhead (only one inter-card interaction is required).
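The division of FFN work described above can be sketched on a packed submatrix. The structure (up projection, nonlinear activation, down projection) follows the document; the choice of ReLU as the activation, the weight values, and the names matmul, relu, and expert_ffn are assumptions for illustration, and plain Python lists stand in for device tensors.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Dense matrix product (the kind of work assigned to the Tcore)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m: Matrix) -> Matrix:
    """Elementwise activation (the kind of work assigned to the Vcore).

    ReLU is an assumed choice; the text only says "nonlinear activation".
    """
    return [[max(0.0, v) for v in row] for row in m]


def expert_ffn(submatrix: Matrix, w_up: Matrix, w_down: Matrix) -> Matrix:
    """Up projection, activation, down projection on the packed valid tokens only."""
    hidden = relu(matmul(submatrix, w_up))
    return matmul(hidden, w_down)


# Packed submatrix with 2 valid tokens of width 2 (hypothetical data).
submatrix = [[1.0, 2.0], [5.0, 6.0]]
w_up = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]    # 2 x 3
w_down = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]  # 3 x 2

out = expert_ffn(submatrix, w_up, w_down)
print(out)
```

The number of rows multiplied equals the number of valid tokens, so the cost of both matrix products scales with the tokens actually routed to the expert rather than with the full batch.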

[0057] See Figure 5, which is a schematic structural diagram of an embodiment of the electronic device provided by the present invention.

[0058] A fourth aspect of the present invention provides an electronic device including a processor 21, a memory 22, and a computer program stored in the memory 22 and configured to be executed by the processor 21, wherein the processor 21, when executing the computer program, implements the computational optimization method of the hybrid expert model described in any embodiment of the first aspect of the present invention.

[0059] Preferably, the computer program can be divided into one or more modules/units (such as computer program one, computer program two, ...), and the one or more modules/units are stored in the memory 22 and executed by the processor 21 to implement the present invention. The one or more modules/units can be a series of computer program instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer program in the electronic device.

[0060] The processor 21 can be any one of a CPU (Central Processing Unit), GPU (Graphics Processing Unit), TPU (Tensor Processing Unit), NPU (Neural Network Processing Unit), DPU (Deep Learning Processing Unit), APU (Accelerated Processing Unit), and GPGPU (General-Purpose Computing on Graphics Processing Unit). The processor 21 is the control center of the electronic device, connecting various parts of the electronic device via various interfaces and lines.

[0061] The memory 22 mainly includes a program storage area and a data storage area. The program storage area can store the operating system, applications required for at least one function, etc., and the data storage area can store related data, etc. In addition, the memory 22 can be a high-speed random access memory, or a non-volatile memory, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, and a flash card, etc., or the memory 22 can also be other volatile solid-state storage devices.

[0062] It should be noted that the aforementioned electronic device may include, but is not limited to, the processor and the memory. Those skilled in the art will understand that the structural block diagram shown in Figure 5 is merely a structural example of the electronic device and does not limit its structure; the electronic device may include more or fewer components than shown, combine certain components, or use different components.

[0063] The above description is only a preferred embodiment of the present invention. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the technical principles of the present invention, and these improvements and modifications should also be considered within the scope of protection of the present invention.

Claims

1. A computational optimization method for a hybrid expert model, characterized in that the method is applied to an artificial intelligence processor, wherein one or more local experts are deployed on the artificial intelligence processor; the method includes: obtaining the feature matrix and the mask matrix corresponding to all input tokens, wherein the mask matrix is used to determine the allocation status of each input token among the local experts; for each local expert, using the mask matrix to convert the feature matrix into a feature submatrix containing only the corresponding valid tokens, and recording the index information of each valid token among all the input tokens; performing the computation on the feature submatrix corresponding to each local expert to obtain the corresponding calculation result; and, based on the index information, aggregating the calculation results corresponding to each local expert onto the output result of the shared expert to obtain the global calculation result corresponding to all the input tokens.

2. The computational optimization method for the hybrid expert model as described in claim 1, characterized in that aggregating the calculation results corresponding to each local expert onto the output result of the shared expert based on the index information includes: based on the index information, performing a weighted accumulation of the calculation results corresponding to each local expert to obtain an intermediate result of local aggregation, and determining the relevant position information of the intermediate result in the global calculation result; and, based on the relevant position information and in combination with a reduction operation, aggregating the intermediate result onto the output result of the shared expert.

3. The computational optimization method for the hybrid expert model as described in claim 1, characterized in that, while the kth local expert is computing the kth feature submatrix, the artificial intelligence processor performs the data preprocessing operation for the (k+1)th local expert in parallel; wherein the data preprocessing operation includes the transformation operation that produces the (k+1)th feature submatrix and the recording operation for the corresponding index information; wherein k≥1.

4. The computational optimization method for the hybrid expert model as described in claim 1, characterized in that the mask matrix is generated by a gating network after it routes and allocates all the input tokens.

5. The computational optimization method for the hybrid expert model as described in claim 1, characterized in that the feedforward neural network corresponding to each local expert includes: an up-projection weight matrix, a nonlinear activation function, and a down-projection weight matrix.

6. A computational optimization apparatus for a hybrid expert model, characterized in that the apparatus is applied to an artificial intelligence processor, wherein one or more local experts are deployed on the artificial intelligence processor; the apparatus includes: a data acquisition module, configured to acquire the feature matrix and the mask matrix corresponding to all input tokens, wherein the mask matrix is used to determine the allocation status of each input token among the local experts; a matrix transformation module, configured to, for each local expert, use the mask matrix to convert the feature matrix into a feature submatrix containing only the corresponding valid tokens, and to record the index information of each valid token among all the input tokens; an expert calculation module, configured to perform the computation on the feature submatrix corresponding to each local expert to obtain the corresponding calculation result; and a result aggregation module, configured to aggregate, based on the index information, the calculation results corresponding to each local expert onto the output result of the shared expert, so as to obtain the global calculation result corresponding to all the input tokens.

7. The computational optimization apparatus for a hybrid expert model as described in claim 6, characterized in that the result aggregation module includes: a local aggregation unit, configured to perform, based on the index information, a weighted accumulation of the calculation results corresponding to each local expert to obtain an intermediate result of local aggregation, and to determine the relevant position information of the intermediate result in the global calculation result; and a global aggregation unit, configured to aggregate the intermediate result onto the output result of the shared expert based on the relevant position information, in combination with a reduction operation.

8. The computational optimization apparatus for a hybrid expert model as described in claim 6, characterized in that, while the kth local expert is computing the kth feature submatrix, the artificial intelligence processor performs the data preprocessing operation for the (k+1)th local expert in parallel; wherein the data preprocessing operation includes the transformation operation that produces the (k+1)th feature submatrix and the recording operation for the corresponding index information; wherein k≥1.

9. A computational optimization system for a hybrid expert model, characterized in that the system includes multiple artificial intelligence processors that operate in parallel, wherein every artificial intelligence processor acquires the same set of all input tokens and is used to execute the computational optimization method of the hybrid expert model as described in any one of claims 1 to 5.

10. An electronic device, characterized in that the electronic device includes a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the computational optimization method for a hybrid expert model as described in any one of claims 1 to 5.