Method for token routing and device therefor

The token routing method in MoE models optimizes expert selection by identifying commonly used experts across tokens, addressing memory footprint and inference time issues, ensuring efficient and adaptive performance in large language models.

WO2026127167A1 · PCT designated stage · Publication Date: 2026-06-18 · SEOUL NATIONAL UNIVERSITY R&DB FOUNDATION

Patent Information

Authority / Receiving Office: WO
Patent Type: Applications
Current Assignee / Owner: SEOUL NATIONAL UNIVERSITY R&DB FOUNDATION
Filing Date: 2024-12-27
Publication Date: 2026-06-18

AI Technical Summary

Technical Problem

Large Language Models (LLMs) utilizing Mixture of Experts (MoE) face challenges with increased memory footprint and inference time during batch-based operations, leading to degraded performance due to uneven expert utilization and inefficient memory access.

Method used

A method for token routing in MoE models that identifies expert candidate groups commonly included across multiple tokens, reducing the number of experts used by selecting the most frequently included expert for each token, thereby optimizing memory access and inference time without reducing the total number of parameters.

Benefits of technology

This approach enhances inference performance by minimizing the number of experts used, reducing memory access and inference time, while maintaining accuracy and adaptively adjusting to service conditions, thus balancing speed and energy consumption.

Generated by Eureka AI based on patent content.


Abstract

A mixture of experts (MoE) model of the present disclosure comprises: an expert module which includes a plurality of experts; and a router which receives an input batch including one or more input tokens and performs routing. The router: determines, among the plurality of experts, one or more expert candidates including one or more experts for routing each of the one or more input tokens; determines a candidate map indicating a relationship between the one or more input tokens and the one or more expert candidates; and performs an expert search on the basis of the candidate map, wherein, on the basis of the expert search, input tokens related to at least some of the expert candidates can be routed to an expert that is included in the largest number of those expert candidates.

Description

Method for token routing and device for the same

[0001] The present disclosure relates to a method for routing tokens and an apparatus for the same, and more specifically, to a method for routing tokens in a large language model and an apparatus for the same.

[0002]

[0003] Large Language Models (LLMs) utilizing Mixture of Experts (MoE) are technologies designed to enhance the efficiency and scalability of large-scale language models. MoE performs computations by simultaneously leveraging multiple model components, known as experts, and activating only the most suitable expert for each input. This approach significantly increases the total number of model parameters while keeping the number of parameters activated during actual inference for each input limited, thereby allowing the computational load to be managed at a level comparable to existing LLMs.

[0004] One of the key advantages of MoE LLMs lies in the efficient use of parameters. Compared to traditional LLMs, an MoE model has a much larger total number of parameters, yet only a fraction of them are actually used for each input. This allows the model to dynamically select the combination of experts best suited for each data point or task, thereby reducing unnecessary computations and optimizing memory usage. Consequently, although MoE models have a large total number of parameters, they can achieve similar or better accuracy on specific downstream tasks while activating significantly fewer parameters than non-MoE models with the same number of parameters.

[0005] This structure can offer significant advantages in enhancing the model's generalization and adaptability by enabling different experts to learn distinct characteristics, particularly regarding complex and diverse data. Consequently, each expert is optimized for a specific type of input, and the integrated MoE LLM can effectively handle various types of language comprehension and generation tasks. As such, MoE LLM presents a new paradigm for large-scale language models and is regarded as an innovative method that strikes a good balance between computational efficiency and model performance.

[0006]

[0007] The present disclosure aims to provide a method for token routing and an apparatus for the same.

[0008] The problems to be solved by the present disclosure are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art to which the present disclosure belongs from the description below.

[0009]

[0010] To solve the above technical problem, the present disclosure comprises: an expert module including a plurality of experts; and a router that receives an input batch including at least one input token and performs routing, wherein the router determines at least one expert candidate including at least one expert among the plurality of experts for routing each of the at least one input token, determines a candidate map representing the relationship between the at least one input token and the at least one expert candidate, and performs an expert search based on the candidate map, wherein, based on the expert search, an input token associated with the at least one expert candidate may be routed to an expert included in the at least one expert candidate that has the most experts among the at least one expert candidate.

[0011] Additionally, the present disclosure may indicate that the candidate map represents information about an expert candidate associated with a specific input token for each of the at least one input tokens.

[0012] Additionally, the present disclosure may determine the at least one expert included in the at least one expert candidate based on the descending cumulative sum of the routing scores of each of the at least one experts included in the at least one expert candidate and a preset specific threshold.

[0013] Additionally, the present disclosure allows the router to use the candidate map to search for an expert capable of processing the most input tokens.

[0014] Additionally, the present disclosure states that the expert search excludes the routed token from the candidate map and can be performed repeatedly until all of the at least one input token is routed.

[0015] Additionally, in the present disclosure, an input token for which the highest routing score among the experts exceeds a specific threshold may be routed to the expert with that highest routing score.

[0016] Additionally, in the MoE model of the present disclosure, routing based on the expert search may be performed only for those input tokens, among the at least one input token, for which the highest routing score among the experts is equal to or smaller than a specific threshold.

[0017] Additionally, the present disclosure relates to a method performed by a router of a mixture of experts (MoE) model comprising an expert module including a plurality of experts and a router that receives an input batch including at least one input token and performs routing, wherein the method comprises: determining at least one expert candidate including at least one expert among the plurality of experts for routing each of the at least one input token; determining a candidate map representing the relationship between the at least one input token and the at least one expert candidate; and performing an expert search based on the candidate map, wherein, based on the expert search, the input token associated with the at least one expert candidate can be routed to an expert included in the at least one expert candidate that has the largest number of experts among the at least one expert candidate.

[0018] Additionally, the present disclosure may indicate that the candidate map represents information about an expert candidate associated with a specific input token for each of the at least one input tokens.

[0019] Additionally, the present disclosure may determine the at least one expert included in the at least one expert candidate based on the descending cumulative sum of the routing scores of each of the at least one experts included in the at least one expert candidate and a preset specific threshold.

[0020] Additionally, the present disclosure may further include the step of searching for an expert capable of processing the most input tokens using the candidate map.

[0021] Additionally, the present disclosure states that the expert search excludes the routed token from the candidate map and can be performed repeatedly until all of the at least one input token is routed.

[0022] Additionally, in the present disclosure, routing based on the expert search may not be performed for an input token, among the at least one input token, for which the highest routing score among the experts exceeds a specific threshold.

[0023] Additionally, in the present disclosure, an input token for which the highest routing score among the experts exceeds a specific threshold may be routed to the expert with that highest routing score.

[0024] Additionally, in the present disclosure, routing based on the expert search may be performed only on those input tokens, among the at least one input token, for which the highest routing score among the experts is equal to or smaller than a specific threshold.
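
The threshold rule summarized above can be pictured as a simple partition of the batch. The following is a minimal sketch under stated assumptions, not the disclosed implementation: the function name `split_by_confidence`, the example scores, and the threshold value 0.7 are all hypothetical, and the text does not say here how the "specific threshold" is chosen.

```python
def split_by_confidence(routing_matrix, threshold):
    """Partition tokens into direct Top-1 routes and tokens left for expert search.

    routing_matrix[t][e] is the routing score of expert e for token t.
    `threshold` is a hypothetical value; the source only calls it
    "a specific threshold".
    """
    direct = {}    # token index -> expert index (routed without expert search)
    pending = []   # token indices that go through the expert search
    for t, scores in enumerate(routing_matrix):
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] > threshold:
            direct[t] = best
        else:
            pending.append(t)
    return direct, pending


# Hypothetical routing scores for 3 tokens and 3 experts (each row sums to 1).
routing_matrix = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.75, 0.15],
]
direct, pending = split_by_confidence(routing_matrix, threshold=0.7)
print(direct)   # {0: 0, 2: 1}
print(pending)  # [1]
```

Tokens 0 and 2 have a clearly dominant expert and bypass the search; token 1 does not and is left for the candidate-map-based search.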

[0025]

[0026] According to the present disclosure, token routing can be efficiently performed in a large language model.

[0027] The effects that can be achieved in the present disclosure are not limited to those mentioned above, and other unmentioned effects will be clearly understood by those skilled in the art to which the present disclosure belongs from the description below.

[0028]

[0029] FIGS. 1 and 2 are drawings showing an example of the structure of a switch transformer to aid in understanding the present disclosure.

[0030] FIG. 3 is a figure showing the difference in performance between when routing is performed using the expert with the highest routing score according to one embodiment of the present disclosure and when routing is performed without using the expert with the highest routing score.

[0031] FIGS. 4 and 5 are drawings for illustrating an example of an expert candidate determination process according to one embodiment of the present disclosure.

[0032] FIG. 6 is a diagram illustrating an expert candidate map generation procedure and a route pooling search operation according to one embodiment of the present disclosure.

[0033] FIGS. 7 and 8 are drawings for illustrating an expert search procedure according to one embodiment of the present disclosure.

[0034] FIG. 9 is a figure showing an example of the result of expert search performed according to one embodiment of the present disclosure.

[0035] FIG. 10 is a diagram showing an example of a block diagram of a MoE model according to one embodiment of the present disclosure.

[0036] FIG. 11 is a figure showing an example of a method performed by a router of an MoE model according to one embodiment of the present disclosure.

[0037]

[0038] Embodiments of the present disclosure will be described in detail below with reference to the drawings. However, detailed descriptions of known functions or configurations that may obscure the gist of the present disclosure in the following description and the attached drawings are omitted. Additionally, throughout the present disclosure, the term "comprising" any component means that, unless specifically stated otherwise, it does not exclude other components but may include additional components.

[0039] Additionally, terms such as first, second, etc. may be used to describe various components, but said components should not be limited by said terms. Such terms may be used for the purpose of distinguishing one component from another. For example, without departing from the scope of the rights of the present disclosure, the first component may be named the second component, and similarly, the second component may be named the first component.

[0040] The terms used in this disclosure are used merely to describe specific embodiments and are not intended to limit this disclosure. The singular expression includes the plural expression unless the context clearly indicates otherwise. In this application, terms such as “comprising” or “having” are intended to specify the existence of the described features, numbers, steps, actions, components, parts, or combinations thereof, and should be understood as not precluding the existence or addition of one or more other features, numbers, steps, actions, components, parts, or combinations thereof.

[0041] Unless specifically defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as generally understood by those skilled in the art to which this disclosure pertains. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant technology, and should not be interpreted in an ideal or overly formal sense unless explicitly defined in this application.

[0042] Hereinafter, various embodiments of the present disclosure are described with reference to the accompanying drawings. However, this is not intended to limit the present disclosure to specific embodiments and should be understood to include various modifications, equivalents, and/or alternatives of embodiments of the present disclosure.

[0043]

[0044] A Mixture-of-Experts (MoE) transformer-based language model can have a structure that performs operations by dynamically determining the model parameters to be used based on the input token (a vector corresponding to a single word in the case of a language model). More specifically, the MoE transformer-based language model can configure multiple expert modules by replicating the transformer's Feedforward Network (FFN) and can generate a score probability distribution (routing score) over the experts by passing each input token through a gate network. From the generated score probability distribution (routing score), the MoE model uses a Top-K algorithm to select the K experts with the highest scores, routes the token to these K experts for processing, and subsequently calculates the sum of the experts' output vectors weighted by the routing scores.

[0045] In batch-based inference, multiple experts can be used to process multiple input tokens, each handling a different input token; however, this can lead to an increase in the memory footprint. Since the performance of generative inference tends to be bounded by the memory bandwidth of accelerators (e.g., GPUs (Graphics Processing Units)), an increase in the memory footprint can lead to a degradation in generative inference performance.

[0046] The present disclosure relates to a routing method that, instead of using the K experts with the highest routing scores during routing, searches for a routable expert candidate group for each token, analyzes the resulting expert candidate groups, and performs routing using the expert most commonly included in the expert candidate groups. For example, suppose there is an expert candidate group 1 consisting of Expert 1, Expert 2, and Expert 3; an expert candidate group 2 consisting of Expert 2, Expert 3, and Expert 4; an expert candidate group 3 consisting of Expert 2, Expert 3, and Expert 5; and an expert candidate group 4 consisting of Expert 1, Expert 2, and Expert 4. Then Expert 2, which is included in all of expert candidate groups 1, 2, 3, and 4, may be the expert most commonly included in the expert candidate groups.
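
The four-group example above can be checked with a short count. This is only an illustrative sketch; the variable names are hypothetical and the snippet does nothing more than tally how many candidate groups contain each expert.

```python
from collections import Counter

# Expert candidate groups from the example above (group id -> expert ids).
candidate_groups = {
    1: {1, 2, 3},
    2: {2, 3, 4},
    3: {2, 3, 5},
    4: {1, 2, 4},
}

# Count, for each expert, the number of candidate groups that include it.
counts = Counter(e for group in candidate_groups.values() for e in group)
most_common_expert, n_groups = counts.most_common(1)[0]
print(most_common_expert, n_groups)  # 2 4
```

Expert 2 appears in all four groups, Expert 3 in three, and the remaining experts in two or fewer, which matches the conclusion stated in the example.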

[0047] According to the method of the present disclosure, batch-based MoE Large Language Model (LLM) inference can be performed using a smaller number of experts than with Top-K routing, thereby saving inference time and memory access energy. In particular, as large MoE models are being developed that utilize dozens of MoE layers containing dozens of experts, the method of the present disclosure is expected to contribute to improving inference performance.

[0048] Expert Pruning and Skipping techniques, developed to lighten the computation of MoE models, may permanently remove unnecessary MoE experts or skip the computation itself for experts with low routing scores. However, maintaining a large number of parameters is essential for large language models to learn more knowledge, and not reducing the amount of computation may also be the most ideal approach for achieving high accuracy in language models. This disclosure proposes a method to improve batch-based inference performance by optimizing memory access while maintaining both the number of parameters and the amount of computation.

[0049] The gate network in MoE is a key component responsible for assigning input tokens to each expert. Tokens are provided in the form of vectors, and based on these vector-based tokens, the gate network can be used in the process of selecting a specific expert.

[0050] In the following, the configuration and operation of a gate network (or routing network, router) are described. More specifically, the following description may be an explanation of existing methods to aid in understanding the present disclosure.

[0051] Input tokens are provided in the form of vectors, and their size equals the model's intrinsic embedding size. The gate network receives the input vectors and can itself be organized as a 2D matrix of the form (embedding size x number of experts). This 2D matrix is generated through learning and serves to calculate a score indicating how well each expert fits the corresponding token.

[0052] More specifically, the input token is a vector whose dimension equals the embedding size, and the gate network performs a matrix multiplication of the input token and the gate network matrix, thereby producing a score for each expert in the form of (number of experts x 1). These scores can be converted into a probability distribution by applying a softmax function (a function that converts each value of the input vector into a probability value between 0 and 1 such that the sum of all values is 1). The score converted in this way represents the probability that a specific token will be assigned to each expert and is referred to as the routing score.

[0053] When the routing scores of all tokens within a batch are organized into a single two-dimensional matrix, a probability distribution over the experts (routing matrix) is formed, and this routing matrix can be used to select the K experts with the highest scores for each token through the Top-K algorithm. Through the aforementioned process, each token can be efficiently routed to the appropriate experts. The gate network therefore serves as the core operational structure for allocating tokens to experts.
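
The conventional gate computation described above (matrix multiplication, softmax, then Top-K) can be sketched in plain Python. This is a minimal illustration under stated assumptions: the gate weights and token vectors are made-up numbers, and the helper names `softmax`, `routing_scores`, and `top_k` are hypothetical rather than taken from any particular library.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def routing_scores(token, gate):
    """token: vector of length d; gate: d x E matrix. Returns E routing scores."""
    n_experts = len(gate[0])
    logits = [sum(token[i] * gate[i][e] for i in range(len(token)))
              for e in range(n_experts)]
    return softmax(logits)


def top_k(scores, k):
    """Indices of the k experts with the highest routing scores."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]


# Hypothetical gate matrix: embedding size 3, 4 experts.
gate = [[0.2, -0.1, 0.5, 0.0],
        [0.4, 0.3, -0.2, 0.1],
        [-0.3, 0.6, 0.1, 0.2]]
# Hypothetical batch of two token embeddings.
batch = [[1.0, 0.0, 2.0], [0.5, 1.5, -1.0]]

# Routing matrix: one row of routing scores per token in the batch.
routing_matrix = [routing_scores(tok, gate) for tok in batch]
for t, row in enumerate(routing_matrix):
    print(t, top_k(row, 2))
```

With these made-up numbers the two tokens select disjoint expert pairs, so Top-2 routing would touch all four experts for a batch of only two tokens.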

[0054] Examples of representative MoE models are described with reference to Figures 1 and 2.

[0055] Figure 1 shows a Switch Transformer, which is one of the representative MoE models, and Figure 2 shows Mixtral, which is another representative MoE model.

[0056] Referring to Fig. 1, the Switch Transformer utilizes a Feed Forward Network (FFN) as experts, and models using a minimum of 8 to a maximum of 2048 experts can be used in the Switch Transformer. More specifically, models using 8, 16, 32, 64, 128, and 2048 experts have been developed, and users can freely modify the number of experts according to their needs. The Switch Transformer tends to continuously improve its downstream task performance as the number of experts increases.

[0057] Referring to Figure 2, when an input is received, it can be routed to an appropriate expert by a router. In the case of Mixtral, a model using 8 experts can be used.

[0058] In the case of Switch Transformer models, while a method using only a single expert per token may be employed, there also exists a Top-K MoE approach that activates multiple experts simultaneously to process a single token. Models based on the Top-K MoE approach aim to achieve high accuracy with a small number of experts by performing additional computations within a fixed number of experts. Through the Top-K MoE approach, more robust and flexible model performance can be secured by combining the strengths of multiple experts for each input.

[0059] Below, existing inference methods to aid in understanding the present disclosure are described.

[0060] Batched inference is a method that improves computational efficiency by collecting input data in batches and passing it through an artificial intelligence (AI) model. Batched inference is more efficient because it increases the reuse rate of the AI model parameters. For example, if there are 10 pieces of input data and AI computations are performed sequentially without grouping them into batches, the parameters included in the AI model might need to be read from DRAM 10 times. Since the entire AI model generally does not fit into the computer's cache, parameters may need to be read from DRAM every time data is processed. On the other hand, using batch-based computation allows the AI model to be read from DRAM only once to process all 10 pieces of data. Consequently, this reduces the number of DRAM reads of the AI model parameters from 10 to 1. In LLM services, batching is used in the word generation process to efficiently provide chat services for various user queries. However, when batching is applied to MoE models, the diverse use of internal experts may, depending on the case, lower the reuse rate of the model parameters, ultimately requiring more memory access.
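
The arithmetic in this paragraph can be made concrete with a deliberately simplified cost model. The sketch below is an assumption-laden illustration, not a measurement: it counts one "read" each time a full set of weights is fetched from DRAM, ignores caching and partial reuse, and uses hypothetical function names and routing sets.

```python
def sequential_param_reads(num_inputs):
    """Without batching, the model weights are read once per input."""
    return num_inputs


def batched_param_reads(num_inputs):
    """With batching, the model weights are read once for the whole batch."""
    return 1 if num_inputs > 0 else 0


def moe_expert_reads(routes):
    """For one MoE layer, each distinct expert used by the batch is read once.

    routes: list of sets, one per token, holding the expert ids it is routed to.
    """
    return len(set().union(*routes)) if routes else 0


print(sequential_param_reads(10), batched_param_reads(10))  # 10 1

# Hypothetical Top-2 routes for a batch of 4 tokens.
diverse = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]  # every token uses different experts
shared = [{0, 1}, {0, 1}, {1, 2}, {0, 2}]   # tokens largely share experts
print(moe_expert_reads(diverse), moe_expert_reads(shared))  # 8 3
```

In this toy model the "diverse" batch reads eight experts' weights while the "shared" batch reads three, which is the reuse effect the paragraph describes for batched MoE inference.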

[0061] Expert Skipping is a method that improves overall inference speed by skipping operations for experts with low routing scores. However, current Expert Skipping uses manually hard-coded thresholds as the skipping criteria, making it difficult to adjust these criteria to operate adaptively in various situations. Due to these limitations, there is a risk that inference performance may drop significantly in specific scenarios, or the benefits of improved inference efficiency may be limited.

[0062] Not all experts included in a MoE model are utilized evenly, and the importance of individual experts may differ from task to task. In the case of Memory-efficient NLLB-MoE (NLLB: No Language Left Behind), experts are not utilized evenly in AI-based language translation, and the expert subset used may be fixed depending on the target language pair. Consequently, Expert Pruning can be applied to deactivate experts not used for tasks involving specific language pairs. This naturally reduces the memory footprint in proportion to the number of pruned experts. However, in language model inference service environments where a single model must immediately process a wide variety of tasks, efficiently utilizing a large model may be more appropriate than using pruned models. Therefore, permanently reducing the parameters of an LLM through Pruning presents a problem in that the model's knowledge may eventually reach its limits.

[0063] The present disclosure addresses the problem of increased inference time and memory access volume caused by the increased memory footprint that may occur during batch inference, while maintaining the number of parameters. Expert pruning methods reduce the memory burden by reducing the number of parameters, which may be disadvantageous for acquiring rapidly growing knowledge; in contrast, the method of the present disclosure may be more advantageous for learning a large amount of knowledge because it does not reduce the total number of parameters. In addition, it is distinguished in that it maintains the amount of computation in order to maintain high accuracy.

[0064] Language models are characterized by the fact that there may be multiple correct answers for the final generated result; therefore, even if MoE routing is changed during the word generation process, the final generated string may not differ significantly from the result generated when routing is not manipulated. Furthermore, even if the results differ considerably, a result with similar accuracy may still be generated as the final result. In other words, the present disclosure differs from existing methods, which have sought to increase computational efficiency while keeping the final result as unchanged as possible.

[0065] Below, with reference to Figure 3, the difference in performance between the case where routing is performed using the expert with the highest routing score and the case where routing is performed without using the expert with the highest routing score will be explained.

[0066] FIG. 3 is a figure showing the difference in performance between when routing is performed using the expert with the highest routing score according to one embodiment of the present disclosure and when routing is performed without using the expert with the highest routing score.

[0067] More specifically, Figure 3 shows the performance (ROUGE score) obtained when a Switch Transformer model using Top-1 routing (i.e., routing using the single expert with the highest routing score among experts) and 32 experts in each MoE layer is trained on a task of generating a summary of the content of a given text, and the routing to the single expert with the highest score is then intentionally changed to the expert with the Nth highest score during the routing process. Here, the ROUGE-N score is an indicator of how well the N-grams (sequences of N consecutive items such as words, characters, or phonemes in a given text or speech sample) of the generated text match those of the reference text.

[0068] In Figure 3, row 0 of the y-axis represents the case where the expert operation itself is skipped, row 1 represents Top-1 routing (i.e., routing using the expert with the highest routing score), and each row N>1 represents the ROUGE score when the Top-1 routing decision is manipulated so that the token is routed to the expert with the Nth highest score. The values shown on the x-axis, such as 0.2 and 0.4, represent the probability of performing the routing manipulation. For example, when the index of the y-axis is 2 and the value of the x-axis is 0.4, the entry is the ROUGE score obtained when the Top-1 routing decision is changed, with a 40% probability, to route to the expert with the second highest score. If routing is performed 10 times, Top-1 routing may be performed 6 times and routing to the expert with the second highest score may be performed 4 times. Referring to the table on the left in Figure 3, in this case the ROUGE score decreases by only 0.2 points compared to Top-1 routing. When the index of the y-axis is 2 and the value of the x-axis is 0.2 (i.e., the Top-1 routing decision is changed, with a 20% probability, to route to the expert with the second highest score), the ROUGE score actually increases by 0.2 points compared to Top-1 routing. In other words, according to Figure 3, depending on the situation, changing the routing with an appropriate probability can benefit overall performance more than performing only Top-1 routing.
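
The routing manipulation used for Figure 3 can be pictured with a small sketch. This is not the authors' experimental code: `manipulated_route`, the example scores, and the random seed are hypothetical, and the function only mirrors the description above (row index N, manipulation probability p, with N = 0 meaning the expert operation is skipped).

```python
import random


def manipulated_route(scores, n, p, rng):
    """Return the expert chosen for one token under the described manipulation.

    With probability p the Top-1 decision is replaced: n == 0 skips the expert
    (returns None), n >= 1 routes to the expert with the n-th highest score.
    Otherwise ordinary Top-1 routing is used.
    """
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    if rng.random() < p:
        return None if n == 0 else ranked[n - 1]
    return ranked[0]


# Hypothetical routing scores for one token over 4 experts.
scores = [0.15, 0.50, 0.30, 0.05]
rng = random.Random(0)  # fixed seed so the run is repeatable

# y-axis index 2, x-axis value 0.4: route to the 2nd-best expert 40% of the time.
choices = [manipulated_route(scores, n=2, p=0.4, rng=rng) for _ in range(10_000)]
fraction_second = choices.count(2) / len(choices)
print(round(fraction_second, 2))
```

Over many routing decisions the fraction sent to the second-best expert approaches the chosen probability, which corresponds to the "6 out of 10 versus 4 out of 10" reading of the example above.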

[0069] Based on the above results, the present disclosure proposes a greedy algorithm that, rather than fixing the number of experts to which a single token can be routed at K (Top-K), maintains multiple expert candidate groups in one-to-one correspondence with the multiple tokens and searches, across the entire batch, for the expert most commonly included among the expert candidates. Through this, different tokens within a batch share routing to a specific expert (i.e., multiple different tokens can be routed to the same single expert), and consequently, the number of experts used in the entire batch can be expected to decrease.
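
One way to read the greedy search is as repeated selection of the expert that covers the most not-yet-routed tokens, consistent with the earlier statement that the search excludes routed tokens from the candidate map and repeats until all tokens are routed. The sketch below is written under assumptions and is not the disclosed procedure itself: it assigns exactly one expert per token, breaks ties by the lowest expert id, and uses the hypothetical name `route_pooling`.

```python
def route_pooling(candidates):
    """Greedy expert search over per-token expert candidate sets.

    candidates: dict mapping token id -> set of candidate expert ids.
    Returns a dict mapping token id -> chosen expert id.
    """
    remaining = {t: set(c) for t, c in candidates.items()}
    if any(not c for c in remaining.values()):
        raise ValueError("every token needs at least one expert candidate")

    assignment = {}
    while remaining:
        # For each expert, list the unrouted tokens that accept it.
        coverage = {}
        for t, experts in remaining.items():
            for e in experts:
                coverage.setdefault(e, []).append(t)
        # Pick the expert covering the most unrouted tokens
        # (tie-break by lowest expert id: an assumption of this sketch).
        best = min(coverage, key=lambda e: (-len(coverage[e]), e))
        # Route those tokens and exclude them from further searches.
        for t in coverage[best]:
            assignment[t] = best
            del remaining[t]
    return assignment


# Hypothetical candidate sets for a batch of 5 tokens.
candidates = {0: {0, 1}, 1: {1, 2}, 2: {2, 3}, 3: {3}, 4: {1}}
assignment = route_pooling(candidates)
print(assignment)
print(sorted(set(assignment.values())))  # [1, 3]
```

In this example the first pass picks expert 1 (covering tokens 0, 1, and 4) and the second pass picks expert 3 (covering tokens 2 and 3), so the batch uses two experts although four distinct experts appear among the candidates.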

[0070] For convenience of explanation, the method proposed in this disclosure may be referred to as Route Pooling. The components of the method proposed in this disclosure include (1) routing scores and matrices, (2) candidate thresholds (CT), (3) expert candidates, and (4) expert candidate maps. The above components will be described in detail below.

[0071] (1) Routing score and matrix: The routing score, which is the result of multiplying the gate network (routing network, router) matrix by the embedding vector corresponding to each token and normalizing it through the softmax function, represents the probability that each token will be assigned to each expert. At this time, the routing matrix may be a 2D matrix that organizes the routing scores calculated for all tokens in a batch. For example, in the case of the element (m, n) located at row m and column n of the 2D routing matrix, the (m, n) element may represent the routing score of the n-th expert among all experts associated with the m-th token in the input batch.

[0072] (3) Expert Candidate: Represents a set of experts to which a token can be routed. When determining an expert candidate associated with a specific token, the candidate threshold described below may be used.

[0073] (2) Candidate Threshold (CT): The candidate threshold is a value used to determine which experts are candidates for routing each token. For example, the candidate threshold can be tuned by the AI service provider, on a calibration dataset, to a value between 0.0 and 1.0. As the set candidate threshold approaches 0.0, the method approximates the Top-K algorithm, and as it approaches 1.0, it may exhibit an effect similar to an ensemble of multiple separate non-MoE models.

[0074] (4) Expert Candidate Map: An expert candidate map refers to a two-dimensional matrix that represents, as token-expert relationships, the expert candidates determined based on the candidate threshold.
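
As an illustration of the expert candidate map, the sketch below builds a token-by-expert 0/1 matrix from per-token candidate sets. The layout (rows are tokens, columns are experts), the 0-based expert ids, the function name `build_candidate_map`, and the example sets are assumptions made for this sketch.

```python
def build_candidate_map(candidates, num_experts):
    """Return a 2D list where entry [t][e] is 1 if expert e is a candidate for token t."""
    return [[1 if e in c else 0 for e in range(num_experts)] for c in candidates]


# Hypothetical expert candidates for 3 tokens in a model with 5 experts.
candidates = [{0, 1, 2}, {2, 3}, {2}]
cmap = build_candidate_map(candidates, num_experts=5)
for row in cmap:
    print(row)
# [1, 1, 1, 0, 0]
# [0, 0, 1, 1, 0]
# [0, 0, 1, 0, 0]

# Column sums give, per expert, how many tokens list it as a candidate.
tokens_per_expert = [sum(col) for col in zip(*cmap)]
print(tokens_per_expert)  # [1, 1, 3, 1, 0]
```

The column sums make it easy to see which expert can serve the most tokens in the batch; here that is expert 2, which is a candidate for all three tokens.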

[0075] Based on the concepts of (1) routing score and matrix, (2) candidate threshold (CT), (3) expert candidate, and (4) expert candidate map described above, the method proposed in the present disclosure will be explained in detail below.

[0076] According to one embodiment of the present disclosure, routing scores for experts associated with each token in a batch are calculated, and expert candidates may be determined based on the calculated routing scores. The routing scores for experts associated with each token are normalized through a softmax operation, and the sum of the normalized routing scores of all experts associated with each token may be set to 1. For each of the tokens included in the input batch, the normalized routing scores of all experts associated with the token may be sorted in descending order, and then the cumulative sum may be calculated, and experts up to the point where the calculated cumulative sum exceeds a candidate threshold may be determined as expert candidates associated with the corresponding token.
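
The candidate determination described above (sort in descending order, accumulate, stop once the cumulative sum exceeds the candidate threshold) can be sketched as follows. The scores are hypothetical and the experts are indexed from 0; `expert_candidates` is an illustrative name, not one used in the disclosure.

```python
def expert_candidates(scores, candidate_threshold):
    """Experts, in descending score order, up to the point where the
    cumulative routing score first exceeds the candidate threshold."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    selected, cumulative = [], 0.0
    for e in ranked:
        selected.append(e)
        cumulative += scores[e]
        if cumulative > candidate_threshold:
            break
    return selected


# Hypothetical normalized routing scores for one token over 5 experts.
flat = [0.25, 0.30, 0.28, 0.10, 0.07]    # relatively even distribution
peaked = [0.05, 0.05, 0.85, 0.03, 0.02]  # one dominant expert

print(expert_candidates(flat, 0.8))    # [1, 2, 0]
print(expert_candidates(peaked, 0.8))  # [2]
print(expert_candidates(flat, 0.5))    # [1, 2]
```

With the same threshold, an evenly distributed token ends up with several candidates while a token with a dominant expert ends up with one, and lowering the threshold shrinks the candidate set.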

[0077] FIGS. 4 and 5 are diagrams illustrating an example of an expert candidate determination process according to an embodiment of the present disclosure. FIGS. 4 and 5 assume a MoE model composed of five experts, but this is for convenience of explanation only and the method of the present disclosure is not limited thereto. It goes without saying that the method of the present disclosure can be applied to a MoE model composed of at least one expert.

[0078] Referring to Fig. 4, examples of expert routing score calculations related to tokens are illustrated. In Fig. 4, Low, Medium, and High represent expert sensitivity.

[0079] Based on the example in Fig. 4 where expert sensitivity is Low, routing scores for experts (E1 to E5) associated with a specific token within a batch are calculated and normalized (i.e., the sum of the normalized routing scores of experts E1 to E5 can be set to 1). It can be seen that when expert sensitivity is Low, the normalized routing scores of the experts are distributed relatively evenly. That is, in the example in Fig. 4, when expert sensitivity is Low, it can be seen that the routing score of the expert with the highest normalized routing score is less than 0.4 (approximately 0.3). Subsequently, as shown in Fig. 5, the normalized routing scores of all experts (E1 to E5) associated with the token are sorted in descending order, and a cumulative sum can be calculated. Experts (E2, E3, E1) up to the point where the calculated cumulative sum exceeds the candidate threshold of 0.8 can be determined as expert candidates associated with the corresponding token.

[0080] Next, based on the example in Fig. 4 where the expert sensitivity is Medium, routing scores for experts (E1 to E5) associated with a specific token within a batch are calculated and normalized (i.e., the sum of the normalized routing scores of experts E1 to E5 can be set to 1). It can be seen that when the expert sensitivity is Medium, the normalized routing scores of the experts are distributed less evenly than when it is Low. That is, in the example in Fig. 4, when the expert sensitivity is Medium, the routing score of the expert with the highest normalized routing score corresponds to 0.4. Subsequently, as shown in Fig. 5, the normalized routing scores of all experts (E1 to E5) associated with the token are sorted in descending order, and a cumulative sum can be calculated. Experts (E3, E4) up to the point where the calculated cumulative sum exceeds the candidate threshold of 0.8 can be determined as expert candidates associated with the corresponding token.

[0081] Next, based on the example in Fig. 4 where the expert sensitivity is High, routing scores for experts (E1 to E5) associated with a specific token within a batch are calculated and normalized (i.e., the sum of the normalized routing scores of experts E1 to E5 can be set to 1). It can be seen that when expert sensitivity is High, the normalized routing scores of the experts are biased toward a specific expert. That is, in the example of Fig. 4, when expert sensitivity is High, it can be seen that the routing score of the expert with the highest normalized routing score corresponds to 0.8. Subsequently, as shown in Fig. 5, the normalized routing scores of all experts (E1 to E5) associated with the token are sorted in descending order, and a cumulative sum can be calculated. Only expert E1, at which the calculated cumulative sum already exceeds the candidate threshold of 0.8, can be determined as the expert candidate associated with the token.
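The three cases of Figs. 4 and 5 can be reproduced with a short sketch. The per-expert scores below are hypothetical values chosen only to match the properties stated in the text (a top score of roughly 0.3, 0.4, and 0.8 for Low, Medium, and High); the actual values of the figures are not given here. The helper `candidates` is likewise an assumption, and it treats a cumulative sum that reaches 0.8 as satisfying the threshold.

```python
def candidates(scores, ct=0.8, eps=1e-9):
    """Pick experts in descending score order until the cumulative sum reaches ct."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    picked, running = [], 0.0
    for name, score in ranked:
        picked.append(name)
        running += score
        if running >= ct - eps:  # assumption: reaching the threshold counts
            break
    return picked

# Hypothetical score distributions, one per expert sensitivity level.
profiles = {
    "Low":    {"E1": 0.24, "E2": 0.30, "E3": 0.28, "E4": 0.10, "E5": 0.08},
    "Medium": {"E1": 0.10, "E2": 0.06, "E3": 0.40, "E4": 0.40, "E5": 0.04},
    "High":   {"E1": 0.80, "E2": 0.05, "E3": 0.05, "E4": 0.05, "E5": 0.05},
}
result = {level: candidates(scores) for level, scores in profiles.items()}
```

Under these assumed scores, the candidate sets shrink from three experts (Low) to two (Medium) to one (High), in line with the description of Fig. 5.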

[0082] Returning to the description of the process in which the method of the present disclosure is performed, after the aforementioned expert routing score calculation and candidate determination operations are performed for all tokens in the input batch, the method of the present disclosure may perform an expert candidate map calculation / generation procedure.

[0083] Referring to Fig. 6, the procedure for calculating / generating the expert candidate map is explained.

[0084] FIG. 6 is a diagram illustrating an expert candidate map generation / interval and route pool search operation according to one embodiment of the present disclosure. FIG. 6 relates to a case where six tokens are included in an input batch and the MoE model includes five experts, but this is for convenience of explanation only, and the method of the present disclosure can be extended and applied to cases where an input batch includes at least one token and a MoE model includes at least one expert.

[0085] Referring to FIG. 6, it can be seen that the expert candidate map is information representing the relationship between a token and an expert candidate for each of the tokens included in the input batch. More specifically, in FIG. 6, as a result of calculating the expert routing score and determining the expert candidate for Token 1 (T1), it can be seen that an expert candidate including Expert 1 (E1), Expert 2 (E2), and Expert 3 (E3) is determined for Token 1 (T1). Additionally, as a result of calculating the expert routing score and determining the expert candidate for Token 2 (T2), it can be seen that an expert candidate including only Expert 5 (E5) is determined for Token 2 (T2). In this case, Token 2 may be a token with high expert sensitivity.

[0086] Next, as a result of calculating the expert routing score for Token 3 (T3) and determining the expert candidates, it can be seen that an expert candidate including Expert 3 (E3) and Expert 4 (E4) is determined for Token 3 (T3). Additionally, as a result of calculating the expert routing score for Token 4 (T4) and determining the expert candidates, it can be seen that an expert candidate including Expert 2 (E2) and Expert 3 (E3) is determined for Token 4 (T4). Furthermore, as a result of calculating the expert routing score for Token 5 (T5) and determining the expert candidates, it can be seen that an expert candidate including only Expert 1 (E1) is determined for Token 5 (T5). In this case, Token 5 may be a token with high expert sensitivity. Finally, as a result of calculating the expert routing score for Token 6 (T6) and determining the expert candidates, it can be seen that an expert candidate including only Expert 5 (E5) is determined for Token 6 (T6). In this case, Token 6 may be a token with high expert sensitivity.
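The expert candidate map of FIG. 6 can be written out as a binary token-by-expert matrix. The sketch below uses the candidate sets stated in the text for tokens T1 to T6; the names `candidate_map`, `EXPERTS`, and `CANDIDATES` and the 0/1 encoding are assumptions made for illustration.

```python
EXPERTS = ["E1", "E2", "E3", "E4", "E5"]

# Candidate sets for the six tokens, as described for FIG. 6.
CANDIDATES = {
    "T1": {"E1", "E2", "E3"},
    "T2": {"E5"},
    "T3": {"E3", "E4"},
    "T4": {"E2", "E3"},
    "T5": {"E1"},
    "T6": {"E5"},
}

def candidate_map(candidates, experts):
    """Rows are tokens, columns are experts; 1 marks membership in the candidate set."""
    return [[1 if e in cands else 0 for e in experts]
            for cands in candidates.values()]

cmap = candidate_map(CANDIDATES, EXPERTS)

# Column sums: in how many tokens' candidate sets each expert appears.
column_counts = [sum(col) for col in zip(*cmap)]
```

The column sums give, for each expert, the number of tokens whose expert candidate includes it, which is the quantity the subsequent expert search compares.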

[0087] Returning to the description of the process in which the method of the present disclosure is performed, after the expert candidate map calculation / generation procedure is performed, an expert search procedure (Route-Pool Search) may be performed according to the method of the present disclosure. The expert search procedure may be a process of searching for the expert that is included in the expert candidates of the largest number of tokens, and this process may be repeated until routing for all tokens is completed.

[0088] Referring to FIGS. 6 to 8, an expert search procedure according to one embodiment of the present disclosure will be described.

[0089] FIG. 6 is a diagram illustrating the expert candidate map and expert search procedure as described above, and FIGS. 7 and 8 are diagrams illustrating the expert search procedure according to one embodiment of the present disclosure.

[0090] First, referring to Fig. 6, it can be seen that in the expert candidate map, Expert 3 (E3) is included in the largest number of expert candidates among the expert candidates of the six tokens. More specifically, Expert 3 (E3) is included in the expert candidates of Token 1 (T1), Token 3 (T3), and Token 4 (T4). That is, Expert 3 (E3) is included in the expert candidates associated with a total of three tokens (T1, T3, T4). On the other hand, Expert 1 (E1) is included in the expert candidates of Token 1 (T1) and Token 5 (T5). That is, Expert 1 (E1) is included in the expert candidates associated with a total of two tokens (T1, T5). Additionally, Expert 2 (E2) is included in the expert candidates of Token 1 (T1) and Token 4 (T4). That is, Expert 2 (E2) is included in the expert candidates associated with a total of two tokens (T1, T4). Also, Expert 4 (E4) is included only in the expert candidate of Token 3 (T3). That is, Expert 4 (E4) is included in the expert candidate associated with a total of one token (T3). Finally, Expert 5 (E5) is included in the expert candidates of Token 2 (T2) and Token 6 (T6). That is, Expert 5 (E5) is included in the expert candidates associated with a total of two tokens (T2, T6). Consequently, among the experts E1 to E5, Expert 3 (E3) is included in the largest number of expert candidates; thus, through the expert search, the three tokens (T1, T3, T4) associated with Expert 3 (E3) can be routed to Expert 3 (E3).

[0091] However, even after this process is completed, the expert search can continue to be performed because the experts to which Token 2 (T2), Token 5 (T5), and Token 6 (T6) are to be routed have not yet been determined.

[0092] Referring to Fig. 7, the expert search process will be explained further. Referring to Fig. 7, tokens for which the routing expert has been determined can be excluded from the expert candidate map. In the expert candidate map from which those tokens have been excluded, Expert 1 (E1) is included only in the expert candidate of Token 5 (T5). That is, Expert 1 (E1) is included in the expert candidate associated with a total of one token (T5). Next, Experts 2 through 4 (E2 to E4) are not included in the expert candidate of any token. Finally, Expert 5 (E5) is included in the expert candidates of Token 2 (T2) and Token 6 (T6). That is, Expert 5 (E5) is included in the expert candidates associated with a total of two tokens (T2, T6). As a result, among the experts E1 to E5, Expert 5 (E5) is included in the largest number of expert candidates, so through the expert search, the two tokens (T2, T6) associated with Expert 5 (E5) can be routed to Expert 5 (E5).

[0093] However, even if this process is completed, the expert search can continue as the expert to whom Token 5 should be routed has not yet been determined.

[0094] Referring to Fig. 8, the expert search process is described further. Referring to Fig. 8, the tokens for which the routing expert has been determined can be excluded from the expert candidate map of Fig. 7. In the expert candidate map from which those tokens have been excluded, Expert 1 (E1) is included only in the expert candidate of Token 5 (T5). That is, Expert 1 (E1) is included in the expert candidate associated with a total of one token (T5). Next, Experts 2 through 5 (E2 to E5) are not included in the expert candidate of any token. Consequently, among the experts E1 through E5, Expert 1 (E1) is included in the largest number of expert candidates; thus, through the expert search, the one token (T5) associated with Expert 1 (E1) can be routed to Expert 1 (E1). Through this process, the routing expert for every token has been determined, and thus the expert search process can be completed.

[0095] As a result, routing can be performed using only some of the experts among the total experts, rather than using all 5 experts that make up the MoE model to route 6 tokens.

[0096] Additionally, although not illustrated in FIGS. 6 to 8, there may be multiple experts that are commonly included in the largest number of expert candidates during the expert search process. For example, there may be cases where Expert 1 is included in three expert candidates and Expert 2 is also included in three expert candidates. In this case, since there may be no significant difference in performance regardless of which expert is selected, either of the two experts can be selected as the expert for routing.
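The expert search walked through for FIGS. 6 to 8 can be sketched as a greedy loop. This is an illustrative sketch, not the claimed implementation: the function name `route_pool_search` is hypothetical, and ties are broken by taking the first expert in list order, which is one arbitrary choice among those the text permits.

```python
def route_pool_search(candidates, experts):
    """Greedy search: repeatedly pick the expert covering the most unrouted tokens.

    candidates: dict mapping token -> set of candidate experts
    experts:    list of all experts (its order decides ties)
    Returns a dict mapping each selected expert -> list of tokens routed to it.
    """
    remaining = {tok: set(c) for tok, c in candidates.items()}
    pools = {}
    while remaining:
        counts = {e: sum(e in c for c in remaining.values()) for e in experts}
        best = max(experts, key=lambda e: counts[e])  # ties: first in `experts`
        if counts[best] == 0:
            raise ValueError("a token has an empty expert candidate")
        pool = [tok for tok, c in remaining.items() if best in c]
        pools[best] = pool
        for tok in pool:  # exclude routed tokens from the candidate map
            del remaining[tok]
    return pools

EXPERTS = ["E1", "E2", "E3", "E4", "E5"]
CANDIDATES = {
    "T1": {"E1", "E2", "E3"}, "T2": {"E5"}, "T3": {"E3", "E4"},
    "T4": {"E2", "E3"}, "T5": {"E1"}, "T6": {"E5"},
}
pools = route_pool_search(CANDIDATES, EXPERTS)
```

For the candidate sets of FIG. 6, the loop selects E3, then E5, then E1, so three of the five experts are used for the six tokens.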

[0097] FIG. 9 is a figure illustrating an example of the result of expert search performed according to one embodiment of the present disclosure. More specifically, FIG. 9 is a figure illustrating the result of the expert search in FIG. 6 through FIG. 8. Referring to FIG. 9, at least one token routed to a specific expert may be set as a token pool. More specifically, token 1 (T1), token 3 (T3), and token 4 (T4) that are commonly routed to expert 3 (E3) may be set as token pool 1 (910). Additionally, token 2 (T2) and token 6 (T6) that are commonly routed to expert 5 (E5) may be set as token pool 2 (920). Finally, token 5 (T5) that is routed to expert 1 (E1) may be set as token pool 3 (930).

[0098] Additionally, as illustrated in FIGS. 4 and 5, the routing score distribution of tokens can be classified into low, medium, and high based on expert sensitivity, which indicates sensitivity to route pooling. For example, among the tokens included in the input batch, a specific token may not have a significant impact on performance even if it is routed, according to the method of the present disclosure, to an expert other than the Top-K expert associated with that token. However, for another specific token, the routing score for a specific expert associated with that token is calculated to be very high compared to other experts, so if that token is routed to an expert other than its Top-K expert, the accuracy may be significantly reduced. Therefore, through the candidate threshold, the route pooling method of the present disclosure can minimize rerouting (that is, routing to an expert other than the Top-K expert associated with a token as a result of route pooling) for tokens with high expert sensitivity, and can improve performance by routing tokens with low expert sensitivity to experts shared with other tokens.

[0099] Effects

[0100] The method of the present disclosure has the advantage of being able to balance the trade-off between the accuracy and performance (speed and energy) of the final task by adjusting the candidate threshold. That is, it is possible to adaptively adjust the threshold according to service conditions. For example, in a chat model service such as ChatGPT, when there is a surge in customer queries, the threshold can be raised to slightly lower accuracy and improve speed, thereby satisfying service-level agreements (SLAs) related to speed. Conversely, when there are few queries, the threshold can be lowered to provide a stable and high-accuracy language model service similar to the Top-K method.
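One way to adjust the threshold according to service conditions is sketched below. The present disclosure does not specify an adjustment policy, so everything here is an assumption for illustration: the function name `candidate_threshold_for_load`, the linear interpolation, the load measure (pending queries divided by capacity), and the bounds 0.25 and 0.75.

```python
def candidate_threshold_for_load(pending, capacity, ct_low=0.25, ct_high=0.75):
    """Map current load to a candidate threshold (hypothetical linear policy).

    Higher load -> higher threshold -> larger candidate sets, so more tokens
    can share experts (faster, slightly less accurate). Lower load -> lower
    threshold -> behavior closer to the Top-K method.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    load = min(max(pending / capacity, 0.0), 1.0)  # clamp to [0, 1]
    return ct_low + (ct_high - ct_low) * load

idle_ct = candidate_threshold_for_load(pending=0, capacity=100)
half_ct = candidate_threshold_for_load(pending=50, capacity=100)
surge_ct = candidate_threshold_for_load(pending=400, capacity=100)
```

In practice the bounds would be chosen on a calibration dataset so that accuracy at the upper bound remains acceptable for the service.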

[0101] The method of the present disclosure may be a greedy method that can effectively reduce the total number of experts to be routed at a low cost. Considering the examples of FIGS. 6 to 8, if the method of the present disclosure is not used, in the worst case, five experts must be used, but through the proposed technique, this can be reduced to three. In particular, considering a situation where there are 64 to 2048 experts per MoE layer, the method of the present disclosure can drastically reduce batch-based inference time by reducing the use of many experts and contribute to power saving by reducing memory access.

[0102] According to the present disclosure, there is an effect that the number of experts used in batch-based inference can be reduced. More specifically, by drastically reducing the number of experts actually accessed in MoE models that utilize tens to thousands of experts, the amount of memory access can be reduced, and consequently, the inference time and energy consumption associated with memory access can be reduced.

[0103] Furthermore, according to the present disclosure, additional operations required to make routing decisions can be processed at a low cost because they do not handle data as large as the embedding size of a large artificial intelligence model. Processing can be sufficiently achieved with less computation and memory access than that required to process a single expert.

[0104] FIG. 10 is a diagram showing an example of a block diagram of a MoE model according to one embodiment of the present disclosure.

[0105] Referring to FIG. 10, the MoE model (1000) may include an expert module (1010) that includes a plurality of experts.

[0106] Additionally, the MoE model may include a router (1020) that receives an input batch containing at least one input token and performs routing.

[0107] At this time, the router (1020) determines at least one expert candidate, including at least one expert among the plurality of experts, for routing each of the at least one input token, determines a candidate map representing the relationship between the at least one input token and the at least one expert candidate, and performs an expert search based on the candidate map, wherein, based on the expert search, input tokens associated with at least some expert candidates may be routed to the expert that is included in the largest number of the at least some expert candidates among the at least one expert candidate.

[0108] FIG. 11 is a figure showing an example of a method performed by a router of an MoE model according to one embodiment of the present disclosure.

[0109] First, the router can determine at least one expert candidate, including at least one expert among a plurality of experts, for routing each of at least one input token (1110).

[0110] Next, the router can determine a candidate map representing the relationship between the at least one input token and the at least one expert candidate (1120).

[0111] Next, the router can perform an expert search based on the candidate map (1130). Then, based on the expert search, input tokens associated with at least some expert candidates can be routed to the expert that is included in the largest number of the at least some expert candidates among the at least one expert candidate.

[0112]

[0113] Methods according to the embodiments described in the claims or detailed description of the present disclosure may be implemented in the form of hardware, software, or a combination of hardware and software.

[0114] When implemented in software, a computer-readable storage medium may be provided for storing one or more programs (software modules). One or more programs stored in the computer-readable storage medium are configured for execution by one or more processors within an electronic device. One or more programs include instructions that cause the electronic device to execute methods according to the embodiments described in the claims or specification of this disclosure.

[0115] Such programs (software modules, software) may be stored in random access memory, non-volatile memory including flash memory, read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic disc storage devices, compact disc-ROM (CD-ROM), digital versatile discs (DVDs) or other forms of optical storage devices, or magnetic cassettes. Alternatively, they may be stored in memory composed of a combination of some or all of these. Additionally, a plurality of each constituent memory may be included.

[0116] Additionally, the program may be stored on an attachable storage device that can be accessed via a communication network such as the Internet, Intranet, LAN (local area network), WAN (wide area network), or SAN (storage area network), or a combination thereof. Such a storage device may be connected to a device performing an embodiment of the present disclosure through an external port. Additionally, a separate storage device on a communication network may be connected to a device performing an embodiment of the present disclosure.

[0117] In the specific embodiments of the present disclosure described above, the components included in the disclosure are expressed in a singular or plural form according to the specific embodiments presented. However, the singular or plural expression is selected to suit the situation presented for convenience of explanation, and the present disclosure is not limited to singular or plural components; even if a component is expressed in the plural form, it may be composed of a singular form, and even if a component is expressed in the singular form, it may be composed of a plural form.

[0118] Meanwhile, although specific embodiments have been described in the detailed description of the present disclosure, it is understood that various modifications are possible within the scope of the present disclosure. Therefore, the scope of the present disclosure should not be limited to the described embodiments, but should be defined by the claims set forth below as well as equivalents thereof.

[0119] [National R&D projects that supported this invention]

[0120] [Project ID] 2710008550

[0121] [Assignment No.] II210863

[0122] [Ministry Name] Ministry of Science and ICT

[0123] [Project Management (Specialized) Agency Name] Korea Institute of Information & Communications Technology Planning & Evaluation

[0124] [Research Project Name] Development of Novel PIM Semiconductor Technology

[0125] [Project Title] Development of an Intelligent In-Memory Error Correction Device for High-Reliability Memory

[0126] [Name of Project Performing Organization] Seoul National University Industry-Academic Cooperation Foundation

[0127] [Research Period] 2021.04.01 ~ 2024.12.31

[0128]

[0129] [National R&D projects that supported this invention]

[0130] [Project ID] 2710007826

[0131] [Assignment No.] 00256081

[0132] [Ministry Name] Ministry of Science and ICT

[0133] [Project Management (Specialized) Agency Name] Korea Institute of Information & Communications Technology Planning & Evaluation

[0134] [Research Project Name] Information and Communication Broadcasting Innovation Talent Development (R&D)

[0135] [Research Project Title] Graduate School of Artificial Intelligence and Semiconductors (Seoul National University)

[0136] [Name of Project Performing Organization] Seoul National University Industry-Academic Cooperation Foundation

[0137] [Research Period] July 1, 2023 ~ December 31, 2028

Claims

1. A mixture of experts (MoE) model comprising: an expert module including a plurality of experts; and a router that receives an input batch including at least one input token and performs routing, wherein the router: determines at least one expert candidate, including at least one expert among the plurality of experts, for routing each of the at least one input token; determines a candidate map representing a relationship between the at least one input token and the at least one expert candidate; and performs an expert search based on the candidate map, wherein, based on the expert search, input tokens associated with at least some expert candidates are routed to the expert that is included in the largest number of the at least some expert candidates among the at least one expert candidate.

2. The MoE model of claim 1, wherein the candidate map represents, for each of the at least one input token, information about the expert candidate associated with that input token.

3. The MoE model of claim 2, wherein the at least one expert included in the at least one expert candidate is determined based on a descending cumulative sum of the routing scores of each of the at least one expert included in the at least one expert candidate and a preset specific threshold.

4. The MoE model of claim 1, wherein the router uses the candidate map to search for an expert capable of processing the most input tokens.

5. The MoE model of claim 4, wherein the expert search excludes routed tokens from the candidate map and is performed repeatedly until all of the at least one input token are routed.

6. The MoE model of claim 1, wherein routing based on the expert search is not performed for an input token, among the at least one input token, for which the routing score of the expert with the highest routing score is higher than a specific threshold.

7. The MoE model of claim 6, wherein an input token for which the routing score of the expert with the highest routing score is higher than the specific threshold is routed to the expert with the highest routing score.

8. The MoE model of claim 6, wherein routing based on the expert search is performed only for input tokens, among the at least one input token, for which the routing score of the expert with the highest routing score is equal to or smaller than the specific threshold.

9. A method performed by a router of a mixture of experts (MoE) model, the MoE model comprising an expert module including a plurality of experts and the router, which receives an input batch including at least one input token and performs routing, the method comprising: determining at least one expert candidate, including at least one expert among the plurality of experts, for routing each of the at least one input token; determining a candidate map representing a relationship between the at least one input token and the at least one expert candidate; and performing an expert search based on the candidate map, wherein, based on the expert search, input tokens associated with at least some expert candidates are routed to the expert that is included in the largest number of the at least some expert candidates among the at least one expert candidate.

10. The method of claim 9, wherein the candidate map represents, for each of the at least one input token, information about the expert candidate associated with that input token.

11. The method of claim 10, wherein the at least one expert included in the at least one expert candidate is determined based on a descending cumulative sum of the routing scores of each of the at least one expert included in the at least one expert candidate and a preset specific threshold.

12. The method of claim 9, further comprising searching for an expert capable of processing the most input tokens using the candidate map.

13. The method of claim 12, wherein the expert search excludes routed tokens from the candidate map and is performed repeatedly until all of the at least one input token are routed.

14. The method of claim 9, wherein routing based on the expert search is not performed for an input token, among the at least one input token, for which the routing score of the expert with the highest routing score is higher than a specific threshold.

15. The method of claim 14, wherein an input token for which the routing score of the expert with the highest routing score is higher than the specific threshold is routed to the expert with the highest routing score.

16. The method of claim 14, wherein routing based on the expert search is performed only for input tokens, among the at least one input token, for which the routing score of the expert with the highest routing score is equal to or smaller than the specific threshold.