Inference method and apparatus for artificial-intelligence model, and computer program product
Draft tokens are stored in tiered sets with different storage formats and are retrieved from the fastest tier first. This addresses the slow inference speed of existing approaches and improves both the inference performance of artificial intelligence models and the user experience.
Patent Information
- Authority / Receiving Office: WO
- Patent Type: Applications
- Current Assignee / Owner: HUAWEI TECH CO LTD
- Filing Date: 2025-08-25
- Publication Date: 2026-07-02
AI Technical Summary
Existing AI models use slow inference methods, which lengthens user waiting time and degrades the user experience.
Draft token sets with different access performance are arranged in tiers. Draft tokens are retrieved from the set with the highest access performance first, and the inference operation is then performed in parallel. The storage format and source of draft tokens are adjusted dynamically to improve inference performance.
Optimizing how draft tokens are acquired and stored reduces retrieval time, improves inference speed and performance, and enhances the user experience.
Abstract
Description
Reasoning methods, devices, and computer program products for artificial intelligence models
[0001] This application claims priority to Chinese Patent Application No. 202411919023.2, filed on December 23, 2024, entitled "Reasoning Method, Apparatus, and Computer Program Product for Artificial Intelligence Models", the entire contents of which are incorporated herein by reference.
Technical Field
[0002] This disclosure relates to the field of computer technology, and more particularly to inference methods for artificial intelligence models, inference devices for artificial intelligence models, and computer program products.
Background Technology
[0003] Artificial intelligence (AI) technology is being applied in a growing number of fields, and the demand for inference based on user input is growing with it. Inference methods for AI models are therefore developing rapidly. Current inference methods use a single approach (a standalone draft model, self-speculative decoding, or retrieval-based speculative inference) to generate draft tokens, which are then passed to a large language model (LLM) for verification to obtain the accepted tokens. These methods are slow at obtaining draft tokens, which slows inference overall, lengthens the time users wait for results, and degrades the user experience.
Summary of the Invention
[0004] Embodiments of this disclosure provide an inference method for an artificial intelligence model, an inference apparatus for an artificial intelligence model, an electronic device, a computer-readable storage medium, and a computer program product.
[0005] According to a first aspect of this disclosure, an inference method for an artificial intelligence model is provided. The method includes acquiring inference tokens for an inference operation. The method also includes acquiring draft tokens from at least two draft token sets based on the inference tokens. Each of the at least two draft token sets has a different data storage format. The sets are arranged into at least two tiers, from high to low, according to the access performance of the draft tokens they store, and each set corresponds to one tier. Draft tokens are acquired tier by tier, starting with the set with the highest access performance. The method also includes performing the inference operation in parallel based on the inference tokens and the draft tokens. By arranging draft token sets with different access performance in tiers and acquiring draft tokens from the highest-performing set first, the method helps reduce the time spent retrieving draft tokens.
[0006] In some embodiments of this disclosure, the method further includes, when the inference metric of a first draft token in a first draft token set of the at least two draft token sets meets a preset condition, converting the data storage format of the first draft token from the first data storage format, which corresponds to the first draft token set, to the second data storage format, which corresponds to a second draft token set, and migrating the first draft token to the second draft token set. The first and second draft token sets occupy adjacent tiers, and the access performance of the first draft token set is lower than that of the second draft token set.
[0007] In some further embodiments of this disclosure, the preset condition includes that the inference metric of the first draft token is higher than the inference metric of a specified draft token in the second draft token set. The specified draft token is the draft token with the lowest inference metric in the second draft token set. The method also includes converting the data storage format of the specified draft token from the second data storage format to the first data storage format, and migrating the specified draft token to the first draft token set.
[0008] In some further embodiments of this disclosure, the preset conditions include that the inference metric of the first draft lexicon is higher than a first metric threshold. The first metric threshold may be a fixed value or may be dynamically updated as the inference operation iterates.
[0009] In these embodiments, the method dynamically adjusts the draft tokens in each tier: draft tokens with high inference metrics are migrated to a draft token set with high access performance, and draft tokens with low inference metrics are migrated to a draft token set with low access performance. Because draft tokens with high inference metrics are then retrieved first, this helps improve the inference performance of the LLM that performs the inference operation.
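The exchange between adjacent tiers described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: both tiers are modeled as plain Python dictionaries that map a draft token sequence to its inference metric, the function name `swap_if_better` and the sample values are hypothetical, and the storage format conversion is only marked by a comment.

```python
from typing import Dict, Tuple

Draft = Tuple[str, ...]  # hypothetical representation of one draft token sequence


def swap_if_better(lower: Dict[Draft, float], upper: Dict[Draft, float], draft: Draft) -> bool:
    """Promote `draft` from the slower tier if it beats the weakest entry of the faster tier.

    The weakest entry of the faster tier is demoted in exchange, so both
    tiers keep their sizes. Returns True when a swap happened.
    """
    if draft not in lower or not upper:
        return False
    weakest = min(upper, key=upper.get)  # entry with the lowest inference metric
    if lower[draft] <= upper[weakest]:
        return False
    # A real system would convert storage formats here (for example between
    # a trie and a key-value trie); plain dicts stand in for both formats.
    upper[draft] = lower.pop(draft)
    lower[weakest] = upper.pop(weakest)
    return True


# Hypothetical sample data: metrics are made up for illustration.
draft_memory = {("for", "i", "in"): 0.9, ("def", "main"): 0.4}
draft_cache = {("return", "x"): 0.6, ("import", "os"): 0.8}
swapped = swap_if_better(draft_memory, draft_cache, ("for", "i", "in"))
```

After the call, the promoted sequence sits in the faster tier and the previously weakest entry of that tier sits in the slower one, which mirrors the paired migration in paragraph [0007].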
[0010] In some embodiments of this disclosure, the number of the at least two draft lexicon sets is three. The first draft lexicon set with the lowest access performance includes draft lexicons obtained from a static lexicon library. The second draft lexicon set with intermediate access performance includes draft lexicons obtained from a dynamic lexicon library. The third draft lexicon set with the highest access performance includes draft lexicons obtained from the second draft lexicon set whose inference metrics meet preset conditions.
[0011] In some embodiments of this disclosure, the static lexicon library includes lexicons from a specified scenario.
[0012] In some embodiments of this disclosure, the dynamic lexicon library includes at least one of the following: input tokens of the inference operation, output tokens of the inference operation, search-engine results for the input tokens, tokens obtained by lookahead speculative inference on the input tokens, tokens obtained by skip-level inference on the input tokens, or tokens obtained by autoregressive inference on the input tokens using a small model.
[0013] In these embodiments, the method fuses dynamic draft tokens from the dynamic lexicon library with static draft tokens from the static lexicon library, which diversifies the sources of draft tokens and makes it easier to guess the output of the LLM. Retaining draft tokens with high inference metrics further improves the chance of guessing the LLM output. The method can thus provide the LLM with draft tokens that reflect both local semantics and global domain knowledge.
[0014] In some embodiments of this disclosure, the data storage format of the first draft token set is a suffix array, that of the second draft token set is a trie, and that of the third draft token set is a trie represented by key-value pairs. Retrieval time complexity decreases across the suffix array, the trie, and the key-value trie in that order, so their access performance increases in the same order.
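To make the three storage formats concrete, the sketch below looks up the tokens that follow a prefix in each of them. It rests on assumptions that the source does not spell out: the key-value trie is modeled as a flat hash table keyed by the whole prefix, the corpus and N-grams are toy data, and all function names are hypothetical. It illustrates the lookup styles only (binary search, node-by-node walk, single hash lookup) and does not measure their cost.

```python
from bisect import bisect_left
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy data standing in for stored draft tokens.
CORPUS = ["for", "i", "in", "range", "(", "n", ")", ":", "for", "j", "in", "range"]
NGRAMS = [("for", "i", "in"), ("for", "j", "in")]

# Suffix array: start positions of all suffixes, sorted by suffix content.
SUFFIX_ARRAY = sorted(range(len(CORPUS)), key=lambda i: CORPUS[i:])


def suffix_array_next(prefix: Sequence[str]) -> List[str]:
    """Binary-search the sorted suffixes, then read the token after each match."""
    needle = list(prefix)
    # Suffixes are materialised for readability; a real index compares in place.
    suffixes = [CORPUS[i:] for i in SUFFIX_ARRAY]
    pos = bisect_left(suffixes, needle)
    found: List[str] = []
    while pos < len(suffixes) and suffixes[pos][:len(needle)] == needle:
        if len(suffixes[pos]) > len(needle):
            found.append(suffixes[pos][len(needle)])
        pos += 1
    return found


def build_trie(ngrams: Sequence[Sequence[str]]) -> dict:
    """Nested-dict trie: each node maps a token to its child node."""
    root: dict = {}
    for ngram in ngrams:
        node = root
        for token in ngram:
            node = node.setdefault(token, {})
    return root


def trie_next(root: dict, prefix: Sequence[str]) -> List[str]:
    """Walk one node per prefix token, then list the children."""
    node = root
    for token in prefix:
        if token not in node:
            return []
        node = node[token]
    return list(node)


def build_kv_trie(ngrams: Sequence[Sequence[str]]) -> Dict[Tuple[str, ...], List[str]]:
    """Flat table: every prefix is a key, its possible next tokens are the value."""
    table: Dict[Tuple[str, ...], List[str]] = {}
    for ngram in ngrams:
        for k in range(1, len(ngram)):
            children = table.setdefault(tuple(ngram[:k]), [])
            if ngram[k] not in children:
                children.append(ngram[k])
    return table


TRIE = build_trie(NGRAMS)
KV_TRIE = build_kv_trie(NGRAMS)
```

All three structures return the same continuations for the prefix `["for"]`; they differ in how much work the lookup takes, which is the property the tiering relies on.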
[0015] In some embodiments of this disclosure, the number of the at least two draft lexicon sets is two. The first draft lexicon set, which has low access performance, includes draft lexicons obtained from a static lexicon library. The second draft lexicon set, which has high access performance, includes draft lexicons obtained from a dynamic lexicon library.
[0016] In some embodiments of this disclosure, the number of the at least two draft lexicon sets is two. The first draft lexicon set with low access performance includes draft lexicons obtained from a static lexicon library. The second draft lexicon set with high access performance includes draft lexicons obtained from the first draft lexicon set whose inference metrics meet preset conditions.
[0017] In some embodiments of this disclosure, the number of the at least two draft lexicon sets is two. The first draft lexicon set with low access performance includes draft lexicons obtained from a dynamic lexicon library. The second draft lexicon set with high access performance includes draft lexicons obtained from the first draft lexicon set whose inference metrics meet preset conditions.
[0018] In some embodiments of this disclosure, the capacity of a first draft lexicon set with low access performance is greater than the capacity of a second draft lexicon set with high access performance.
[0019] In some embodiments of this disclosure, the second draft lexical set, which has high access performance, is stored in the static random access memory (SRAM) or high bandwidth memory (HBM) of a graphics processing unit (GPU) or neural network processing unit (NPU), or in the memory of a central processing unit (CPU). The first draft lexical set, which has low access performance, is stored in the CPU's memory or on a hard disk.
[0020] In these embodiments, the high-performance draft lexicon set is stored in hardware devices closer to the computing unit, which improves hardware processing speed and reduces data transfer volume. Such hardware devices typically have small storage capacities, and the small size of the high-performance draft lexicon set allows it to be stored in such devices.
[0021] In some embodiments of this disclosure, the inference metric for each draft word in the at least two draft word sets is determined based on at least one of the following: the frequency of the draft word in the at least two draft word sets, the latest acceptance rate of the draft word, the historical acceptance rate of the draft word, or the source of the draft word.
[0022] In some embodiments of this disclosure, during the process of obtaining draft vocabularies from the first draft vocabulary set of the at least two draft vocabulary sets, it is determined whether the candidate draft vocabulary groups matching the inference vocabularies in the first draft vocabulary set include multiple candidate draft vocabulary groups. If it is determined that the candidate draft vocabulary groups matching the inference vocabularies in the first draft vocabulary set include multiple candidate draft vocabulary groups, then one or more candidate draft vocabulary groups are selected from the multiple candidate draft vocabulary groups according to the inference metric of the multiple candidate draft vocabulary groups. The draft vocabularies in the selected candidate draft vocabulary groups are determined as the obtained draft vocabularies. In this way, the method can select the draft vocabularies with the highest inference metric, which is beneficial to improving the inference performance of the LLM performing the inference operation.
[0023] In some embodiments of this disclosure, the inference metric for a single draft term is calculated as the sum of a first coefficient multiplied by the frequency of the draft term in the at least two draft term sets and the weight of the draft term, plus the product of a second coefficient and the latest acceptance rate of the draft term, minus the cost value of the draft term. The sum of the first and second coefficients is one. The weight of the draft term is associated with whether the draft term is hit in the inference operation. The initial cost value of the draft term is associated with the source of the draft term. The cost value of the draft term is set to a default value after the draft term is hit.
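One reading of the calculation above is metric = a * frequency * weight + b * latest_acceptance - cost, with a + b = 1. The sketch below follows that reading; the grouping of the first term is an interpretation of the wording rather than something the source states unambiguously, and the function name, parameter names, and sample values are hypothetical.

```python
def inference_metric(frequency: float, weight: float, latest_acceptance: float,
                     cost: float, first_coefficient: float = 0.5) -> float:
    """Score one draft token under the assumed reading of paragraph [0023].

    The second coefficient is derived from the first so that the two sum to one.
    """
    second_coefficient = 1.0 - first_coefficient
    return (first_coefficient * frequency * weight
            + second_coefficient * latest_acceptance
            - cost)


# Hypothetical values: a draft token seen 4 times, weight 1.0 (it was hit),
# latest acceptance rate 0.5, and a cost of 0.25.
score = inference_metric(frequency=4, weight=1.0, latest_acceptance=0.5, cost=0.25)
```

With these inputs the three terms are 2.0, 0.25, and 0.25, so a frequently seen and recently accepted draft token scores higher, while a high cost value (for example from an expensive source) pulls the score down until the token is hit.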
[0024] In some embodiments of this disclosure, the method further includes calculating an inference metric for each node in a trie if the number of nodes in that trie exceeds a node count threshold N. The method also includes deleting a first node from the trie if its inference metric is lower than a second metric threshold. If the number of nodes in the trie still exceeds N after the first node is deleted, the method sorts all nodes in the trie in descending order of inference metric and deletes the nodes ranked below the N-th position. The inference metric of each node is determined based on at least one of the following: the frequency with which the node occurs in the at least two draft token sets, the node's latest acceptance rate, the node's historical acceptance rate, or the node's source. In these embodiments, the method dynamically updates the trie, preventing it from growing so large that retrieval slows down.
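The two-stage pruning above can be sketched as follows. As a simplifying assumption, the trie is flattened into a mapping from a node identifier to its inference metric, so the parent and child links that a real trie would have to keep consistent are ignored. The function name, parameter names, and sample metrics are hypothetical.

```python
from typing import Dict


def prune_nodes(metrics: Dict[str, float], max_nodes: int, min_metric: float) -> Dict[str, float]:
    """Return the nodes that survive pruning.

    Nothing is removed while the node count is within `max_nodes`. Otherwise
    nodes below `min_metric` are dropped first, and if the count still exceeds
    `max_nodes`, only the `max_nodes` highest-scoring nodes are kept.
    """
    if len(metrics) <= max_nodes:
        return dict(metrics)
    kept = {node: m for node, m in metrics.items() if m >= min_metric}
    if len(kept) > max_nodes:
        ranked = sorted(kept.items(), key=lambda item: item[1], reverse=True)
        kept = dict(ranked[:max_nodes])
    return kept


# Hypothetical node metrics for a trie that is over its limit of 2 nodes.
node_metrics = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.7, "e": 0.3}
survivors = prune_nodes(node_metrics, max_nodes=2, min_metric=0.2)
```

Here node `b` falls below the threshold and is dropped first; four nodes remain, which is still more than two, so the ranking step keeps only `a` and `d`.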
[0025] In some embodiments of this disclosure, the method further includes determining an inference metric for each node of each trie in a draft token set that comprises tries. The method also includes determining an inference metric for each trie based on the inference metrics of its nodes. If the number of tries in a single draft token set exceeds a tree count threshold M corresponding to that draft token set, the method sorts all tries in the draft token set in descending order of inference metric and removes the tries ranked below the M-th position from the draft token set. The inference metric of each node is determined based on at least one of the following: the frequency with which the node occurs in the at least two draft token sets, the node's latest acceptance rate, the node's historical acceptance rate, or the node's source. In these embodiments, the method dynamically updates the draft token sets, preventing a draft token set from holding so many tries that retrieval slows down.
[0026] In some embodiments of this disclosure, the application scenarios of the method include at least one of the following: knowledge question answering scenarios, code generation scenarios, article summary scenarios, or online chat scenarios.
[0027] According to a second aspect of this disclosure, an inference apparatus for an artificial intelligence model is provided. The apparatus includes a first acquisition module, a second acquisition module, and an execution module. The first acquisition module is configured to acquire inference terms for inference operations. The second acquisition module is configured to acquire draft terms from at least two sets of draft terms based on the inference terms. Each of the at least two sets of draft terms has a different data storage format. The at least two sets of draft terms are arranged into at least two hierarchical levels based on the access performance of the draft terms they store, from high to low. Each set of draft terms corresponds to one level. When acquiring draft terms, draft terms are acquired layer by layer, starting with the set with high access performance. The execution module is configured to execute inference operations in parallel based on the inference terms and draft terms. The apparatus hierarchically sets draft terms with different access performances and preferentially acquires draft terms from the set with high access performance. Therefore, the apparatus can reduce the time for retrieving draft terms and correspondingly improve the speed of executing inference operations.
[0028] In a third aspect of this disclosure, an electronic device is provided. The electronic device includes at least one processor and a memory. The memory is coupled to the at least one processor and has instructions stored thereon. When executed by the at least one processor, the instructions cause the electronic device to perform the method according to a first aspect of this disclosure.
[0029] In a fourth aspect of this disclosure, a computer-readable storage medium is provided on which a computer program is stored. The computer program is executed by a processor to implement the method described according to a first aspect of this disclosure.
[0030] In a fifth aspect of this disclosure, a computer program product is provided, comprising computer-executable instructions. When executed by a processor, the instructions implement some or all of the steps of the method described according to a first aspect of this disclosure.
[0031] It will be understood that the inference apparatus of the second aspect, the electronic device of the third aspect, the computer-readable storage medium of the fourth aspect, and the computer program product of the fifth aspect are all used to execute the method provided in the first aspect. The explanations and descriptions given for the first aspect therefore also apply to the second, third, fourth, and fifth aspects. For the beneficial effects achievable by the second, third, fourth, and fifth aspects, refer to the beneficial effects of the corresponding method; they are not repeated here.
Attached Figure Description
[0032] The above and other objects, features and advantages of this disclosure will become more apparent from the accompanying drawings, in which like reference numerals generally denote like parts.
[0033] Figure 1 illustrates a schematic diagram of an example environment in which the apparatus and / or methods of embodiments of the present disclosure may be implemented.
[0034] Figure 2 shows an exemplary flowchart of a reasoning method for an artificial intelligence model according to an embodiment of the present disclosure.
[0035] Figure 3 illustrates an exemplary schematic diagram of a reasoning method for an artificial intelligence model according to an embodiment of the present disclosure.
[0036] Figure 4 shows an exemplary schematic diagram of a trie represented by key-value pairs.
[0037] Figure 5 shows an exemplary schematic diagram of a trie.
[0038] Figure 6 shows an exemplary schematic diagram of a suffix array.
[0039] Figure 7 shows a schematic diagram of an inference apparatus for an artificial intelligence model according to an embodiment of the present disclosure.
[0040] In the various accompanying figures, the same or corresponding reference numerals indicate the same or corresponding parts. The elements in the accompanying figures are schematic and not drawn to scale.
Detailed Implementation
[0041] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.
[0042] In the description of embodiments of this disclosure, the term "comprising" and similar terms should be understood as open-ended inclusion, i.e., "including but not limited to". The term "based on" should be understood as "at least partially based on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first", "second", etc., may refer to different or the same objects. Other explicit and implicit definitions may also be included below.
[0043] Unless otherwise defined, all terms used herein (including technical and scientific terms) shall have the same meaning as commonly understood by one of ordinary skill in the art to which this subject matter pertains. It will be further understood that terms such as those defined in commonly used dictionaries shall be interpreted as having the meaning consistent with their meaning in the context of the specification and in the relevant art, and shall not be interpreted in an idealized or overly formal form unless otherwise explicitly defined herein. As used herein, the statement of “connecting” or “coupling” two or more parts together shall mean that these parts are directly joined together or joined through at least one intermediate component.
[0044] In the field of artificial intelligence, large language models (LLMs) can be used for inference. LLMs typically use autoregressive decoding, in which the input token for the next iteration depends on the output token of the current iteration, so an LLM decodes serially, one token at a time. In AI, tokens are sometimes also called lexical symbols, identifiers, or tags. Each token represents a discrete element of the input text. Depending on the inference task, the discrete element used as a token can be a character, word, phrase, term, sentence, symbol, punctuation mark, or other meaningful element. Serial token-by-token decoding makes inference slow and inference latency high. Furthermore, generating each token requires transferring the parameters used for inference from memory to the compute units, which makes memory access bandwidth a bottleneck for inference speed.
[0045] One approach to improving inference speed is to generate several groups of draft tokens outside the LLM, each group containing several draft tokens. The LLM then performs parallel decoding and sampling verification on these draft tokens to obtain the accepted tokens, that is, the tokens that are kept at the end of the inference operation. The draft tokens in each group typically have semantic relationships. This process can be viewed as one inference iteration (or one round of inference iteration). After multiple iterations, an inference text composed of multiple accepted tokens is obtained. Because several tokens can be generated in a single inference iteration, this approach reduces the number of inference iterations, reduces memory access bandwidth usage, and improves inference performance. In addition, verification by the LLM ensures that the accuracy of the inference result does not decrease, so parallel decoding with draft tokens is becoming increasingly widely used. In this approach, how the draft tokens are generated becomes a crucial factor affecting inference speed.
[0046] One way to generate draft tokens is autoregressive inference with a small model. Alternatively, a large model can be used for skip-level inference. Lookahead speculative inference or retrieval-based speculative inference can also be used to generate draft tokens. However, none of these methods obtains draft tokens quickly, which lengthens the time users wait for inference results and degrades the user experience.
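The verification step of one inference iteration can be sketched as below. This is a simplified illustration under stated assumptions: acceptance is greedy (a draft token is accepted only if it equals the token the target model would produce), whereas the text above refers more generally to sampling verification; the target model is replaced by a hypothetical toy function `toy_target` that replays a fixed sentence; and the loop checks positions one by one, while a real LLM would score all draft positions in a single parallel forward pass.

```python
from typing import Callable, List, Sequence

SENTENCE = ["the", "cat", "sat", "on", "the", "mat", "<eos>"]  # toy "ground truth"


def toy_target(context: Sequence[str]) -> str:
    """Stand-in for the LLM: returns the next token of a fixed sentence."""
    return SENTENCE[len(context)] if len(context) < len(SENTENCE) else "<eos>"


def verify_draft(context: Sequence[str], draft: Sequence[str],
                 target_next: Callable[[Sequence[str]], str]) -> List[str]:
    """Accept the longest draft prefix the target agrees with, plus one target token."""
    accepted: List[str] = []
    for token in draft:
        expected = target_next(list(context) + accepted)
        if token != expected:
            accepted.append(expected)  # the target's token replaces the first mismatch
            return accepted
        accepted.append(token)
    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(list(context) + accepted))
    return accepted


partial = verify_draft(["the"], ["cat", "sat", "in"], toy_target)
full = verify_draft(["the"], ["cat"], toy_target)
```

In the first call two draft tokens are accepted and the third is replaced, so the iteration still yields three tokens; this is why a good draft source reduces the number of iterations without changing the final text.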
[0047] This disclosure proposes an inference method for artificial intelligence models. The method arranges draft token sets with different access performance in tiers and retrieves draft tokens from the sets with high access performance first. In this method, the inference tokens for the inference operation are acquired. Then, based on the inference tokens, draft tokens are retrieved from at least two draft token sets. Each of the at least two draft token sets has a different data storage format. The sets form at least two tiers, from high to low, according to the access performance of the draft tokens they store, and each set corresponds to one tier. Draft tokens are retrieved tier by tier, starting from the set with the highest access performance. The inference operation is then performed in parallel based on the inference tokens and the draft tokens. Retrieving draft tokens from the highest-performing set first helps reduce the time spent retrieving draft tokens.
[0048] The embodiments of this disclosure will now be described in further detail with reference to the accompanying drawings. Figure 1 shows a schematic diagram of an example environment 10 in which the apparatus and / or methods of the embodiments of this disclosure may be implemented. The apparatus and / or methods of the embodiments of this disclosure may be applied to an artificial intelligence (AI) cluster 12. The AI cluster 12 may include a management node and multiple AI servers 120-1, ..., 120-n. The management node is used to manage the multiple AI servers 120-1, ..., 120-n. Each AI server 120-1, ..., 120-n includes a central processing unit (CPU) 121, one or more graphics processing units (GPUs) / neural network processing units (NPUs) 123-1, 123-2, and a bus 122. The bus 122 includes, for example, a high-speed serial computer expansion bus (PCIe bus). The CPU 121 and the GPU / NPU 123-1, 123-2 can communicate via the bus 122. CPU 121 and GPU / NPU 123-1, 123-2 can also share the same physical memory. CPU 121 and GPU / NPU 123-1, 123-2 can also interconnect via a proprietary protocol. Client 11 can send inference requests to AI service cluster 12. One or more AI servers in AI service cluster 12 can perform inference operations.
[0049] Figure 2 shows an exemplary flowchart of an inference method 200 for an artificial intelligence model according to an embodiment of the present disclosure. Method 200 can be executed by an inference device for an artificial intelligence model. The inference device can be CPU 121 or a cloud computing node, such as AI server 120-1 in Figure 1. Method 200 is described below with AI server 120-1 as the executing entity. Although two GPUs / NPUs are shown in Figure 1, the number of GPUs / NPUs may be larger or smaller.
[0050] At box 202, CPU 121 in AI server 120-1 acquires the inference tokens for the inference operation. In the first inference iteration, the inference tokens are the input tokens from the user. In each subsequent inference iteration, the inference tokens are the ordered combination of the tokens accepted in previous inference iterations. Here, accepted tokens are the tokens accepted after the inference operation.
[0051] At box 204, CPU 121 retrieves draft tokens from at least two draft token sets based on inference tokens. Each of the at least two draft token sets has a different data storage format. The at least two draft token sets are arranged in at least two tiers, from high to low, based on the access performance of the draft tokens they contain. Each draft token set corresponds to one tier. When retrieving draft tokens, draft tokens are retrieved tier by tier, starting with the draft token set with the highest access performance. Here, the retrieved draft tokens may include one or more sets of draft tokens. Each set of draft tokens may include one or more draft tokens. Each draft token set has different access performance due to its different data storage format. Access performance includes, for example, retrieval time complexity.
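The tier-by-tier retrieval at box 204 can be sketched as follows. The sketch makes several assumptions that go beyond the text: each tier is modeled as a dictionary keyed by the last few inference tokens, lookup falls through to slower tiers only until a requested number of draft groups has been collected, and the names `retrieve_drafts`, `suffix_len`, and `max_groups` as well as the sample data are hypothetical. The dictionaries stand in for the different storage formats of the real tiers.

```python
from typing import Dict, List, Sequence, Tuple

Key = Tuple[str, ...]
Tier = Dict[Key, List[List[str]]]  # context suffix -> candidate draft token groups


def retrieve_drafts(context: Sequence[str], tiers: Sequence[Tier],
                    suffix_len: int = 2, max_groups: int = 2) -> List[List[str]]:
    """Collect draft groups tier by tier; `tiers` is ordered fastest first."""
    key = tuple(context[-suffix_len:])
    groups: List[List[str]] = []
    for tier in tiers:
        for group in tier.get(key, []):
            if group not in groups:  # skip duplicates found in slower tiers
                groups.append(group)
            if len(groups) == max_groups:
                return groups  # enough drafts: slower tiers are never touched
    return groups


# Hypothetical tiers, listed from highest to lowest access performance.
draft_cache: Tier = {("in", "range"): [["(", "n", ")"]]}
draft_memory: Tier = {("in", "range"): [["(", "len"]], ("def", "f"): [["(", ")"]]}
draft_database: Tier = {("in", "range"): [["(", "n", ")"], ["(", "0"]]}
tiers = [draft_cache, draft_memory, draft_database]

hit_fast = retrieve_drafts(["for", "i", "in", "range"], tiers)
hit_slow = retrieve_drafts(["def", "f"], tiers)
```

In the first lookup the two fastest tiers already supply enough groups, so the slowest tier is skipped, which is where the retrieval time is saved. The second lookup misses the cache and is served by the next tier down.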
[0052] At box 206, GPUs / NPUs 123-1 and 123-2 in AI server 120-1 perform the inference operation in parallel based on the inference tokens and the draft tokens. In some embodiments of this disclosure, the inference tokens and the draft tokens can be combined, and the LLM can perform the inference operation on the combination in parallel.
[0053] Method 200 arranges draft token sets with different access performance in tiers and retrieves draft tokens from the set with the highest access performance first. This helps reduce the time spent retrieving draft tokens, lower inference latency, and improve inference performance. Method 200 can be applied to inference operations on text, audio, video, images, and so on.
[0054] In some embodiments of this disclosure, the operations at boxes 202 to 206 may be performed repeatedly until the inference termination condition is met. Each repeated execution of the operations at boxes 202 to 206 is referred to as one round of inference iteration. The inference termination condition may be that the inference result reaches a preset word length. The inference termination condition may also be that the inference result includes a terminator. For example, if the inference result reaches the preset word length, the inference process ends. Alternatively, if a terminator appears in the inference result, the inference process ends.
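The repetition of boxes 202 to 206 and the two termination conditions can be sketched as a loop. In this illustration one round of inference iteration is abstracted into a caller-supplied function `step` that returns the tokens accepted in that round; `step`, `run_until_done`, the terminator string, and the scripted toy output are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def run_until_done(prompt: Sequence[str], step: Callable[[List[str]], List[str]],
                   max_new_tokens: int, eos: str = "<eos>") -> Tuple[List[str], int]:
    """Repeat inference iterations until a terminator appears or the length limit is hit.

    Returns the full token list and the number of iterations that ran.
    """
    tokens = list(prompt)
    limit = len(tokens) + max_new_tokens
    rounds = 0
    while len(tokens) < limit:
        accepted = step(tokens)  # one round: acquire tokens, fetch drafts, verify
        if not accepted:
            break  # guard against a step that makes no progress
        rounds += 1
        for token in accepted:
            tokens.append(token)
            if token == eos or len(tokens) >= limit:
                return tokens, rounds
    return tokens, rounds


SCRIPT = ["a", "b", "c", "d", "<eos>"]  # toy output; two tokens accepted per round


def toy_step(tokens: List[str]) -> List[str]:
    start = len(tokens) - 1  # the prompt below has exactly one token
    return SCRIPT[start:start + 2]


by_terminator = run_until_done(["q"], toy_step, max_new_tokens=10)
by_length = run_until_done(["q"], toy_step, max_new_tokens=3)
```

The first run ends because the terminator is produced in the third round; the second run ends in the second round because the preset length is reached first.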
[0055] Figure 3 illustrates an exemplary schematic diagram of a reasoning method for an artificial intelligence model according to an embodiment of the present disclosure. Figure 3 exemplarily shows the at least two draft lexical sets as three draft lexical sets, including a first draft lexical set, a second draft lexical set, and a third draft lexical set. The first draft lexical set may also be referred to as a "draft database" in the context. The second draft lexical set may also be referred to as "draft memory" in the context. The third draft lexical set may also be referred to as a "draft cache" in the context. The draft database, draft memory, and draft cache are arranged in a bottom-up order. The priority and access performance of these three draft lexical sets increase in a bottom-up order, and their capacity decreases in a bottom-up order, forming a "pyramid" shape as shown in the dashed box in Figure 3. In the context, the at least two draft lexical sets may alternatively be referred to as a "pyramid." In embodiments of the present disclosure, the number of layers in the "pyramid" may be two or more. When the number of layers in the "pyramid" is two, the at least two sets of draft vocabularies can be a draft database and a draft memory, or a draft database and a draft cache, or a draft memory and a draft cache.
[0056] The draft database with the lowest access performance includes draft lexicons retrieved from a static lexicon. Draft memory with intermediate access performance includes draft lexicons retrieved from a dynamic lexicon. The draft cache with the highest access performance includes draft lexicons retrieved from draft memory whose inference metrics meet preset criteria. Here, the inference metrics of draft lexicons are correlated with the inference performance of the LLM. The higher the inference metrics of draft lexicons, the better the inference performance of the LLM when using them for parallel inference.
[0057] In some embodiments of this disclosure, the static lexicon library may include a domain knowledge base, a constant lexicon library (common phrases, historical question-and-answer statistics, etc.), etc. The domain knowledge base and the constant lexicon library may include lexicons applicable to multiple scenarios. In one example, the static lexicon library may include lexicons from a specified scenario. In this context, lexicons in the static lexicon library may be referred to as static lexicons. The CPU 121 can perform reasoning scenario recognition based on the input lexicons entered by the user. For example, intention recognition can be performed on the input lexicons using natural language processing technology to determine the reasoning scenario desired by the user. In some embodiments of this disclosure, the reasoning scenario may be a code generation scenario, an article summary scenario, an online chat scenario, a knowledge question-and-answer scenario, or an intelligent question-and-answer scenario. The reasoning scenarios listed herein are merely exemplary, and the scenarios to which the embodiments of this disclosure can be applied are not limited to the above scenarios. In one example, the user may, for example, enter "Please help generate a piece of bubble sort code" as an input lexicon. Through natural language processing technology, the CPU 121 can recognize from the input lexicons that it is currently in a code generation scenario. In another example, the user may, for example, enter an article and "Please help summarize the above article" as input lexicons. Using natural language processing (NLP) technology, CPU 121 can identify the current context of an article summary from the input words. Similarly, in other reasoning scenarios, CPU 121 can also identify the current reasoning context from the input words using NLP technology.
[0058] After identifying the reasoning scenario, CPU 121 can obtain a static lexicon based on the reasoning scenario. The static lexicon is, for example, a static N-gram library. The static N-gram library includes, for example, static N-grams associated with the reasoning scenario, and each static N-gram includes, for example, one or more semantically related static lexicons. In one example, CPU 121 can generate the static N-gram library locally based on the input lexicons. The static N-grams in the library can be obtained, for example, from a local database or knowledge graph. CPU 121 can also receive a static N-gram library associated with the input lexicons from other CPUs 121. The other CPUs 121 can, for example, pre-generate multiple static N-gram libraries and, upon receiving the reasoning scenario associated with the input lexicons, find the static N-gram library for that reasoning scenario and send it to that CPU 121. Because each static N-gram in the static N-gram library is associated with the reasoning scenario, the probability that draft terms selected from the library are accepted remains high even during the cold start phase. Thus, by leveraging the static N-gram library, especially during the cold start phase, the acceptance rate of draft terms can be improved, thereby enhancing inference performance. Since static lexicons are typically large, a large capacity is allocated to them.
[0059] In some embodiments of this disclosure, the dynamic lexicon includes at least one of the following: input lexicons for inference operations, output lexicons for inference operations, search results from a search engine for input lexicons, lexicons obtained through prospective autospeculation inference on input lexicons, lexicons obtained through hierarchical inference on input lexicons, or lexicons obtained through autoregressive inference on input lexicons using a small model. Since dynamic lexicons are typically smaller than static lexicons, a smaller capacity is allocated to them than to static lexicons. In this context, lexicons in a dynamic lexicon may be referred to as dynamic lexicons.
[0060] In the example of Figure 3, CPU 121 first retrieves a draft word that matches the inference word from the draft cache. If a draft word matching the inference word is retrieved from the draft cache, CPU 121 identifies this draft word as the retrieved draft word. If no draft word matching the inference word is retrieved from the draft cache, CPU 121 retrieves a draft word matching the inference word from the draft memory. If a draft word matching the inference word is retrieved from the draft memory, CPU 121 identifies this draft word as the retrieved draft word. If no draft word matching the inference word is retrieved from the draft memory, CPU 121 retrieves a draft word matching the inference word from the draft database. If a draft word matching the inference word is retrieved from the draft database, CPU 121 identifies this draft word as the retrieved draft word. If no draft term matching the inference term is obtained from the draft database, CPU 121 determines that no draft term was obtained in this round of inference iteration and instructs LLM to perform inference operations in the manner of autoregressive inference.
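The layer-by-layer lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name `lookup_draft`, the use of plain dictionaries for all three layers (the disclosure uses a different storage format per layer), and the sample entries are all hypothetical.

```python
from typing import Optional

def lookup_draft(inference_tokens: str,
                 layers: list[tuple[str, dict[str, list[str]]]]
                 ) -> Optional[tuple[str, list[str]]]:
    """Return (layer_name, drafts) from the first layer that has a match, else None."""
    for name, store in layers:           # ordered from highest to lowest access performance
        drafts = store.get(inference_tokens)
        if drafts:
            return name, drafts
    return None                          # caller falls back to autoregressive inference

# Hypothetical contents for the three layers of the "pyramid".
draft_cache = {"I am": ["happy."]}
draft_memory = {"I am": ["here waiting for"], "you are": ["welcome"]}
draft_database = {"thank you": ["very much"]}
layers = [
    ("draft cache", draft_cache),
    ("draft memory", draft_memory),
    ("draft database", draft_database),
]

print(lookup_draft("I am", layers))        # served by the draft cache
print(lookup_draft("you are", layers))     # cache miss, served by the draft memory
print(lookup_draft("good night", layers))  # no layer matches
```

A `None` result corresponds to the case in which no draft term is obtained in the current iteration and the LLM performs autoregressive inference.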
[0061] In some embodiments of this disclosure, the draft cache data storage format is a trie (a trie stored as key-value pairs). Figure 4 shows an exemplary schematic diagram of a trie represented by key-value pairs. In the example of Figure 4, the key is "I am", the first value is "happy.", the second value is "happy to", and the third value is "here waiting for". Nodes included by the key are shaded. Assuming the inference term is "I am", a trie starting with "I am" needs to be found. After finding a trie starting with "I am", the CPU 121 determines whether the candidate draft tuples matching the inference term "I am" in the draft cache include multiple candidate draft tuples. In the example of Figure 4, the candidate draft tuples are "happy.", "happy to", and "here waiting for". If CPU 121 determines that the candidate draft tuples matching the inference tuples in the draft cache include multiple candidate draft tuples, it selects one or more candidate draft tuples from these multiple candidate draft tuples based on their inference metrics. CPU 121 then identifies the draft tuples from the selected candidate draft tuples as the acquired draft tuples. For example, assuming the inference metrics of the candidate draft tuples "happy." and "happy to" are higher than those of the candidate draft tuple "here waiting for", and there are two candidate draft tuples to be selected, then CPU 121 identifies the draft tuples from the candidate draft tuples "happy." and "happy to" as the acquired draft tuples. This allows CPU 121 to select the draft tuple with the highest inference metric, which is beneficial for improving the inference performance of the LLM performing the inference operation.
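As a rough sketch of the Figure 4 example, a trie stored as key-value pairs can be modeled as a hash map from the key to its candidate continuations. The metric values, the function name `select_drafts`, and the pairing of each candidate with a single number are assumptions made for illustration only.

```python
# Key: the inference tokens; values: candidate draft continuations, each paired
# with a hypothetical inference metric (the numbers are made up).
kv_trie = {
    "I am": [("happy.", 0.9), ("happy to", 0.8), ("here waiting for", 0.3)],
}

def select_drafts(kv_trie, inference_tokens, num_to_select):
    """Look up candidates by key and keep the num_to_select highest-metric ones."""
    candidates = kv_trie.get(inference_tokens, [])   # single hash lookup
    if len(candidates) <= 1:
        return [draft for draft, _ in candidates]
    ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
    return [draft for draft, _ in ranked[:num_to_select]]

selected = select_drafts(kv_trie, "I am", 2)
print(selected)
```

With two candidates to be selected, the sketch keeps "happy." and "happy to" and drops "here waiting for", matching the example in the text.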
[0062] In some embodiments of this disclosure, the data storage format of the draft memory is a trie. Figure 5 shows an exemplary schematic diagram of a trie. Comparing Figure 5 and Figure 4, it can be seen that each node in the trie is independent when key-value pairs are not used for representation. During the process of retrieving draft tuples from the draft memory, assuming the inference tuple is "I am", it is only necessary to find a trie that starts with "am". It is not required that the parent node of "am" in that trie is "I", and the trie may not include a parent node of "am" at all. After finding a trie that starts with "am", the CPU 121 determines whether the candidate draft tuples matching the inference tuple "am" in the draft memory include multiple candidate draft tuples. In the example of Figure 5, the candidate draft tuples are "happy.", "happy to", and "here waiting for". If CPU 121 determines that the candidate draft tuples matching the inference tuple in the draft memory include multiple candidate draft tuples, it selects one or more of them based on their inference metrics. CPU 121 then identifies the draft tuples in the selected candidate draft tuples as the acquired draft tuples. For example, assuming the inference metrics of the candidate draft tuples "happy." and "happy to" are higher than those of the candidate draft tuple "here waiting for", and there are two candidate draft tuples to be selected, then CPU 121 identifies the draft tuples from the candidate draft tuples "happy." and "happy to" as the acquired draft tuples. This allows CPU 121 to select the draft tuples with the highest inference metrics, which is beneficial for improving the inference performance of the LLM performing the inference operation.
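The lookup by the last inference token can be sketched with a nested-dictionary trie. This is an illustrative sketch only: indexing tries by their root token, the nested-dict layout, and the helper names are assumptions, and the ranking by inference metric is omitted.

```python
# A trie as nested dicts: each key is a token, each value is the subtree below it.
# Tries are indexed by their root token (an assumption of this sketch).
tries = {
    "am": {"happy": {".": {}, "to": {}}, "here": {"waiting": {"for": {}}}},
}

def enumerate_paths(subtree, prefix=()):
    """Yield every root-to-leaf token path below the given subtree."""
    if not subtree:
        yield prefix
        return
    for token, child in subtree.items():
        yield from enumerate_paths(child, prefix + (token,))

def candidates_for(inference_tokens):
    """Match on the last inference token only; its parent ("I") is not required."""
    subtree = tries.get(inference_tokens[-1])
    if subtree is None:
        return []
    return [" ".join(path) for path in enumerate_paths(subtree) if path]

print(candidates_for(["I", "am"]))
```

Walking from the root costs one step per query token, which is where the O(k) retrieval cost of a plain trie comes from.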
[0063] In some embodiments of this disclosure, the draft database is stored in a suffix array format. Figure 6 shows an exemplary schematic diagram of the suffix array. The suffix array is an array obtained by sorting all the suffixes of a string. Figure 6 shows the "original suffix strings" numbered 0 to 8. These original suffix strings are sorted lexicographically to obtain the "lexicographically sorted suffix strings." The corresponding numbers to these suffix strings are sorted accordingly to obtain the suffix array. The basic principle is to utilize the ordered nature of the suffix array. Since the suffix array stores the sorting information of all suffixes of the text string, binary search can determine the possible intervals where patterns may appear within logarithmic time complexity, and then linear scanning within this interval can accurately match the patterns, greatly improving matching efficiency.
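A minimal character-level sketch of suffix-array search follows. The string "banana" is a hypothetical stand-in for the text of Figure 6, the build step is the simple sort-all-suffixes construction, and two binary searches are used to delimit the matching interval instead of a binary search followed by a linear scan.

```python
def build_suffix_array(text: str) -> list[int]:
    """Sort the start positions of all suffixes lexicographically (simple build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Binary-search the interval of suffixes that start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                       # lower bound of the interval
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                       # upper bound of the interval
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])          # start positions of every match

text = "banana"
sa = build_suffix_array(text)
print(sa)                                 # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, sa, "ana"))  # [1, 3]
```

Each binary search takes a logarithmic number of comparisons in the length of the suffix array, which is the property the draft database relies on.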
[0064] The retrieval time complexity of a trie (prefix tree) represented by key-value pairs is O(1). The retrieval time complexity of a trie is O(k), where k is the length of the query string (e.g., the last k terms in the inference tokens). The retrieval time complexity of a suffix array is O(log N), where N is the length of the suffix array. The retrieval time complexity of the suffix array, the trie, and the trie represented by key-value pairs decreases in that order, and therefore their access performance increases in that order.
[0065] Returning to Figure 3, as shown by arrow ①, when draft words are obtained from the "pyramid," at box 310, CPU 121 masks the draft words. As mentioned above, assuming the obtained draft words are "happy." and "happy to," CPU 121 merges the inference word "I am" with the obtained draft words "happy." and "happy to" into "I am happy to." CPU 121 can obtain "I am happy to" and "I am happy." by applying a mask to "I am happy to." At box 320, GPU / NPU 123-1 and 123-2 perform parallel decoding of "I am happy to" and "I am happy." respectively. By masking, the number of draft words provided to the LLM can be reduced, thus reducing the amount of data transfer between CPU 121 and GPU / NPU 123-1 and 123-2. Here, LLM can use known parallel decoding techniques to decode "I am happy to" and "I am happy.", or it can use parallel decoding techniques developed in the future. The embodiments of this disclosure do not limit the parallel decoding method.
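One way to picture the masking step is a tree-style attention mask over the merged token sequence. The disclosure does not specify the form of the mask, so the parent-pointer layout and the function names below are assumptions; the sketch only shows how two branches can be recovered from one merged sequence.

```python
# Merged sequence: "to" and "." both branch off "happy" (hypothetical layout).
tokens = ["I", "am", "happy", "to", "."]
parents = [-1, 0, 1, 2, 2]   # index of each token's parent; -1 marks the root

def build_tree_mask(parents):
    """mask[i][j] is True when token j is token i itself or one of its ancestors."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask

def visible_sequence(tokens, mask, i):
    """Tokens that position i is allowed to attend to, in order."""
    return [tok for tok, visible in zip(tokens, mask[i]) if visible]

mask = build_tree_mask(parents)
print(visible_sequence(tokens, mask, 3))   # branch ending in "to"
print(visible_sequence(tokens, mask, 4))   # branch ending in "."
```

Because the shared prefix "I am happy" is stored once, five tokens are transferred instead of the eight that two separate sequences would need.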
[0066] After parallel decoding, the LLM samples the decoding results and verifies the samples to determine the accepted terms. The sampling method can be greedy sampling, selecting the K decoded results with the most accepted terms, or randomly selecting K decoded results from the P decoded results with the most accepted terms. In this context, K and P are positive integers, with K less than P. Here, accepted terms refer to terms that consecutively match the LLM's decoding results according to the order of natural language. Once an unaccepted term appears, all subsequent terms are considered unaccepted. Assume a single set of draft terms sequentially includes a first draft term, a second draft term, a third draft term, and a fourth draft term. If the first, second, and fourth draft terms all match the LLM's decoding results, but the third draft term does not, then the accepted terms are the first and second draft terms. Since the third draft lexicon is an unaccepted lexicon, the fourth draft lexicon is also considered an unaccepted lexicon. If the lexicon in the LLM decoding result corresponding to the third draft lexicon is called the substitution lexicon, then the inference lexicon, the first draft lexicon, the second draft lexicon, and the substitution lexicon are combined (concatenated) into the inference result.
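The acceptance rule for a single set of draft terms can be sketched as a prefix comparison. The function name `verify` and the placeholder token strings are hypothetical; `llm_tokens[i]` is assumed to be the token the LLM decodes at the position of `draft_tokens[i]`.

```python
def verify(draft_tokens, llm_tokens):
    """Return (accepted_prefix, substitution_token).

    Draft terms are accepted in order until the first mismatch; the LLM's own
    token at that position becomes the substitution term.
    """
    accepted = []
    for draft, decoded in zip(draft_tokens, llm_tokens):
        if draft != decoded:
            return accepted, decoded     # everything after this point is unaccepted
        accepted.append(draft)
    return accepted, None                # all draft terms accepted

inference_tokens = ["t0"]
draft = ["d1", "d2", "d3", "d4"]
decoded = ["d1", "d2", "x3", "d4"]       # third position disagrees with the draft

accepted, substitution = verify(draft, decoded)
result = inference_tokens + accepted + ([substitution] if substitution is not None else [])
print(result)   # ['t0', 'd1', 'd2', 'x3']
```

The fourth draft term matches the LLM output but is still discarded, because it follows an unaccepted term.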
[0067] In some embodiments of this disclosure, during the verification process, the CPU 121 can treat each group of draft words in the sampling results as target draft words and calculate the acceptance rate for the target draft words. In this process, the accepted draft words in the target draft words are determined sequentially. In an example where the first group of draft words is "is sunny" (three draft words: "is," "sunny," and "day") and the LLM decodes "tomorrow is rainy" rather than "tomorrow is sunny," only one of the three draft words in the first group ("is") is an accepted word, and its acceptance rate is 1 / 3. In an example where the second group of draft words is "is rainy" (three draft words: "is," "rainy," and "day") and the LLM decodes "tomorrow is rainy," all three draft words in the second group are accepted words, and its acceptance rate is 3 / 3 = 100%.
[0068] At box 330, CPU 121 redetermines the inference metrics of draft words in each draft word set in the "pyramid" based on the results of parallel decoding. At box 340, CPU 121 updates each draft word set in the "pyramid" according to the redetermined inference metrics of each draft word, as shown by arrow ②.
[0069] During the update process, as shown by arrow ③ in Figure 3, if the inference metric of a draft term in draft memory meets the preset conditions set for the draft cache, the data storage format of that draft term is changed from the second data storage format corresponding to draft memory to the third data storage format corresponding to draft cache, and the draft term is "upgraded" (migrated) to the draft cache with higher access performance. Similarly, as shown by arrow ④ in Figure 3, if the inference metric of a draft term in draft database meets the preset conditions set for draft memory, the data storage format of that draft term is changed from the first data storage format corresponding to draft database to the second data storage format corresponding to draft memory, and the draft term is "upgraded" (migrated) to draft memory with higher access performance.
[0070] In some embodiments of this disclosure, the preset conditions set for the draft memory include that the inference metric of a first draft term in the draft database is higher than the inference metric of a specified draft term in the draft memory. The specified draft term is the draft term in the draft memory with the lowest inference metric. The first draft term refers to any draft term in the draft database. When the first draft term is migrated to the draft memory, the CPU 121 converts the data storage format of the specified draft term from the second data storage format to the first data storage format and migrates the specified draft term to the draft database.
[0071] Similarly, the preset conditions set for the draft cache include that the inference metric of a first draft term in the draft memory is higher than the inference metric of a specified draft term in the draft cache. The specified draft term is the draft term with the lowest inference metric in the draft cache. The first draft term refers to any draft term in the draft memory. When the first draft term is migrated to the draft cache, the CPU 121 converts the data storage format of the specified draft term from the third data storage format to the second data storage format and migrates the specified draft term to the draft memory.
[0072] In some other embodiments of this disclosure, the preset conditions set for the draft memory include that the inference metric of a first draft term in the draft database is higher than a first metric threshold. The first metric threshold can be a fixed value or can be dynamically updated as the inference operation iterates. When the first draft term is migrated to the draft memory, the CPU 121 converts the data storage format of the specified draft term in the draft memory from the second data storage format to the first data storage format, and migrates the specified draft term to the draft database. The specified draft term is the draft term with the lowest inference metric in the draft memory.
[0073] Similarly, the preset conditions for the draft cache include that the inference metric of the first draft term in the draft memory is higher than a first metric threshold. The first metric threshold can be a fixed value or dynamically updated during the iteration of the inference operation. When the first draft term is migrated to the draft cache, the CPU 121 converts the data storage format of the specified draft term in the draft cache from a third data storage format to a second data storage format and migrates the specified draft term to the draft memory. The specified draft term is the draft term with the lowest inference metric in the draft cache.
[0074] In this way, draft terms in each layer can be dynamically adjusted, migrating draft terms with high inference metrics to a set of draft terms with high access performance, and migrating draft terms with low inference metrics to a set of draft terms with low access performance. This method prioritizes draft terms with high inference metrics, which is beneficial for improving the inference performance of the LLM performing inference operations.
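The relative-comparison variant of the migration can be sketched as a swap between two adjacent layers. Here each layer is modeled as a dictionary from draft term to inference metric, the storage-format conversion is left out, and treating an empty upper layer as always accepting is an assumption of this sketch rather than something the disclosure states.

```python
def promote_if_better(lower: dict, upper: dict, term: str) -> bool:
    """Move `term` up one layer if it beats the weakest entry of the upper layer.

    The weakest entry (the "specified draft term") is moved down in exchange.
    """
    if upper:
        weakest = min(upper, key=upper.get)
        if lower[term] <= upper[weakest]:
            return False                     # preset condition not met
        lower[weakest] = upper.pop(weakest)  # demote the weakest entry
    upper[term] = lower.pop(term)            # promote the candidate
    return True

# Hypothetical metrics for two adjacent layers.
draft_memory = {"happy to": 0.8, "here waiting for": 0.3}
draft_cache = {"happy.": 0.9, "fine": 0.5}

moved = promote_if_better(draft_memory, draft_cache, "happy to")
print(moved, draft_cache, draft_memory)
```

In a fuller implementation, the two moves would also convert each term between the storage formats of the two layers.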
[0075] In some embodiments of this disclosure, the inference metric for each draft word in the at least two draft word sets is determined based on at least one of the following: the frequency of the draft word in the at least two draft word sets, the latest acceptance rate of the draft word, the historical acceptance rate of the draft word, or the source of the draft word.
[0076] In some embodiments of this disclosure, the inference metric for a single draft term is determined according to the following formula:
$$\mathrm{Ord} = \alpha \cdot \mathrm{Freq} \cdot \mathrm{Wt} + (1 - \alpha) \cdot \mathrm{Acpr} - \mathrm{co} \tag{1}$$
[0077] Where Ord represents the inference metric of the draft term, α is a constant (α∈[0,1]), Freq represents the frequency of the draft term in the at least two draft term sets, Wt represents the weight of the draft term, Acpr represents the latest acceptance rate of the draft term, and co represents the cost value of the draft term. The weight of the draft term is related to whether the draft term is hit in the inference operation. The initial cost value of the draft term is related to its source. The sources of draft terms include: draft database, draft memory, and draft cache. In one example, the initial cost value of the draft term from the draft database is 2, and the initial cost value of the draft term from the draft memory is 2. Alternatively, the initial cost value of the draft term can be set according to the actual business scenario. If the sources overlap, the largest cost value is taken as the initial cost value of the node. The cost value of the draft term is set to the default value after the draft term is hit.
[0078] In some embodiments of this disclosure, the weights of static draft terms from the draft database are initialized to 1. The weights of dynamic draft terms from the draft memory are initialized to 2. The draft cache is initially empty, and therefore no initial weight values are set for it.
[0079] With γ representing the weight update coefficient, the weights can be updated in each inference iteration as follows. For a successfully hit (accepted) draft term: $\mathrm{Wt} = (1 + \gamma) \cdot \mathrm{Wt}$. For a draft term that is not hit (not accepted): $\mathrm{Wt} = (1 - \gamma) \cdot \mathrm{Wt}$. In one example, γ = 0.1. The inference metric for a single trie is determined as
$$\mathrm{Tord} = \prod_{i} \mathrm{Ord}_i \tag{2}$$
where i ∈ N indexes the nodes of the trie, $\mathrm{Ord}_i$ represents the inference metric of the i-th node, and N represents the node count threshold for a single trie.
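Equation (1), the weight update, and the per-trie product can be written out directly. The function names and the sample input values (including α = 0.5) are hypothetical; only γ = 0.1 comes from the text.

```python
from math import prod

def inference_metric(freq, weight, acpr, cost, alpha=0.5):
    """Equation (1): Ord = alpha * Freq * Wt + (1 - alpha) * Acpr - co."""
    return alpha * freq * weight + (1 - alpha) * acpr - cost

def update_weight(weight, hit, gamma=0.1):
    """Scale the weight up after a hit and down after a miss."""
    return (1 + gamma) * weight if hit else (1 - gamma) * weight

def trie_metric(node_metrics):
    """Equation (2): the product of the node metrics of one trie."""
    return prod(node_metrics)

ord_value = inference_metric(freq=4, weight=2, acpr=0.5, cost=1)
print(ord_value)                       # 0.5*4*2 + 0.5*0.5 - 1 = 3.25
print(update_weight(2.0, hit=True))    # grows by 10 percent
print(update_weight(2.0, hit=False))   # shrinks by 10 percent
print(trie_metric([2.0, 0.5, 3.0]))    # 3.0
```

Because equation (2) is a product, a single node with a metric near zero pulls the metric of the whole trie toward zero.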
[0080] In some embodiments of this disclosure, the initial value of the acceptance rate for a draft term is 0. If the draft term is accepted in a round of inference, its acceptance rate is calculated using the following formula:
[0081] $$\mathrm{Acpr} = \frac{\text{number of draft words accepted in this round of inference}}{\text{total number of draft words in this round of inference}}, \quad \mathrm{Acpr} \in [0, 1]$$
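The acceptance rate formula can be checked against the "tomorrow is sunny" example from paragraph [0067]. The function name and the token-list representation are assumptions; draft words are counted as accepted only up to the first mismatch.

```python
def acceptance_rate(draft_tokens, llm_tokens):
    """Accepted draft words (counted up to the first mismatch) over all draft words."""
    if not draft_tokens:
        return 0.0
    accepted = 0
    for draft, decoded in zip(draft_tokens, llm_tokens):
        if draft != decoded:
            break
        accepted += 1
    return accepted / len(draft_tokens)

llm_output = ["is", "rainy", "day"]
rate_sunny = acceptance_rate(["is", "sunny", "day"], llm_output)
rate_rainy = acceptance_rate(["is", "rainy", "day"], llm_output)
print(rate_sunny, rate_rainy)
```

The first group yields 1 / 3 and the second yields 1.0, in line with the worked example.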
[0082] In the example in Figure 3, by fusing dynamic draft terms from a dynamic lexicon with static draft terms from a static lexicon library, this method diversifies the sources of draft terms, making it more conducive to guessing the output of the LLM. Retaining draft terms with high inference metrics also improves the chances of guessing the LLM output. Thus, this method can provide the LLM with draft terms that take into account both local semantics and global domain knowledge.
[0083] In some embodiments of this disclosure, the draft cache with the highest access performance is stored in the static random access memory (SRAM) or high-bandwidth memory (HBM) of the graphics processing unit (GPU) or neural network processing unit (NPU). The draft memory with intermediate access performance is stored in the memory of the central processing unit (CPU). The draft database with the lowest access performance is stored in the CPU's memory or on a hard disk. In one example, depending on its size, a portion of the draft database can be loaded into the CPU's memory; if the draft database is large, the remainder is stored on the hard disk. During retrieval and updating of the draft database, draft terms in the database are swapped between CPU memory and the hard disk. This allows the draft term set with high access performance to be stored in hardware closer to the computing unit, improving processing speed and reducing data transfer volume. Such hardware typically has a small storage capacity, which makes it suitable for storing the draft term set with high access performance, since that set is small.
[0084] Some embodiments of this disclosure also propose internal updates for each trie. In these embodiments, if the number of nodes in a single trie exceeds a node count threshold N, the CPU 121 calculates an inference metric for each node in the trie. If the inference metric of a first node in the trie is lower than a second metric threshold, the CPU 121 removes the first node from the trie. After removing the first node, if the number of nodes in the trie still exceeds N, the CPU 121 sorts all nodes in the trie in descending order of inference metric and removes the nodes ranked after the N-th. The inference metric for each node can be determined according to equation (1). In these embodiments, the trie can be dynamically updated, which prevents the trie from growing so large that retrieval speed is affected.
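The two-stage pruning rule can be sketched on a flat mapping from node to inference metric. This simplification ignores the tree structure (for instance, what happens to the descendants of a removed node, which the text does not specify), and the function name and sample metrics are hypothetical.

```python
def prune_nodes(metrics: dict, max_nodes: int, min_metric: float) -> dict:
    """Apply the threshold filter, then a top-N cut, only if the trie is over size."""
    if len(metrics) <= max_nodes:
        return dict(metrics)                       # within the node count threshold N
    # Stage 1: drop nodes below the second metric threshold.
    kept = {node: m for node, m in metrics.items() if m >= min_metric}
    # Stage 2: if still too large, keep only the N highest-metric nodes.
    if len(kept) > max_nodes:
        ranked = sorted(kept.items(), key=lambda item: item[1], reverse=True)
        kept = dict(ranked[:max_nodes])
    return kept

node_metrics = {"happy": 0.9, "to": 0.7, ".": 0.8, "here": 0.2, "waiting": 0.4, "for": 0.1}
pruned = prune_nodes(node_metrics, max_nodes=3, min_metric=0.3)
print(pruned)
```

In this run the threshold removes "here" and "for", and the top-N cut then removes "waiting".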
[0085] Some embodiments of this disclosure also propose updating the tries within a draft lexicon set. In these embodiments, CPU 121 determines the inference metric of each node in each trie of each draft lexicon set that includes tries (the draft memory and the draft cache in Figure 3). CPU 121 then determines the inference metric of each trie based on the inference metrics of its nodes (e.g., according to Equation (2)). If the number of tries in a single draft lexicon set exceeds the tree threshold M corresponding to that draft lexicon set (the tree threshold may differ between draft lexicon sets), CPU 121 sorts all tries in the draft lexicon set in descending order of inference metric and deletes the tries ranked after the M-th from the draft lexicon set. The inference metric of each node can be determined according to Equation (1). In these embodiments, each draft lexicon set can be dynamically updated, which prevents retrieval speed from being affected by too many tries in the draft lexicon set.
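Set-level pruning can be sketched in the same style: each trie is reduced to the list of its node metrics, scored with the product from Equation (2), and only the M best tries are kept. The trie identifiers, metric values, and function name are hypothetical.

```python
from math import prod

def prune_tries(tries: dict, max_tries: int) -> dict:
    """Keep the max_tries tries with the highest product of node metrics."""
    if len(tries) <= max_tries:
        return dict(tries)                         # within the tree threshold M
    ranked = sorted(tries.items(), key=lambda item: prod(item[1]), reverse=True)
    return dict(ranked[:max_tries])

# Each trie is represented only by the inference metrics of its nodes.
draft_set = {
    "I am": [0.9, 0.8],      # product 0.72
    "you are": [0.5, 0.5],   # product 0.25
    "we were": [0.9, 0.2],   # product 0.18
}
remaining = prune_tries(draft_set, max_tries=2)
print(list(remaining))
```

A different `max_tries` value can be passed per draft lexicon set, since the tree threshold may differ between sets.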
[0086] Figure 7 illustrates a schematic diagram of an inference apparatus 700 for an artificial intelligence model according to an embodiment of the present disclosure. The apparatus 700 is, for example, arranged in the AI server cluster 12 shown in Figure 1. The apparatus 700 may include multiple modules for performing corresponding steps in the method 200 discussed in Figure 2. As shown in Figure 7, the apparatus 700 includes a first acquisition module 702, a second acquisition module 704, and an execution module 706. The first acquisition module 702 is configured to acquire inference terms for the inference operation. The second acquisition module 704 is configured to acquire draft terms from at least two sets of draft terms based on the inference terms. Each of the at least two sets of draft terms has a different data storage format. The at least two sets of draft terms constitute at least two hierarchical levels based on the access performance of the draft terms they store, from high to low. Each set of draft terms corresponds to one level. When acquiring draft terms, draft terms are acquired layer by layer, starting with the set of draft terms with high access performance. The execution module 706 is configured to perform inference operations in parallel based on inference terms and draft terms. The device hierarchically sets draft term sets with different access performance, prioritizing the retrieval of draft terms from sets with higher access performance. Therefore, the device can reduce the time spent retrieving draft terms and correspondingly increase the speed of inference operations.
[0087] In some embodiments of this disclosure, the device further includes a hierarchy adjustment module. The hierarchy adjustment module is configured to, when the inference metric of a first draft lexicon in a first draft lexicon set of the at least two draft lexicon sets meets a preset condition, convert the data storage format of the first draft lexicon from a first data storage format corresponding to the first draft lexicon set to a second data storage format corresponding to the second draft lexicon set, and migrate the first draft lexicon to the second draft lexicon set. The first draft lexicon set and the second draft lexicon set are adjacent layers, and the access performance of the first draft lexicon set is lower than that of the second draft lexicon set.
[0088] In some further embodiments of this disclosure, the preset conditions include that the inference metric of the first draft word is higher than the inference metric of a specified draft word in the second draft word set. The specified draft word is the draft word with the lowest inference metric in the second draft word set. The hierarchy adjustment module is also configured to convert the data storage format of the specified draft word from the second data storage format to the first data storage format, and migrate the specified draft word to the first draft word set.
[0089] In some further embodiments of this disclosure, the preset conditions include that the inference metric of the first draft lexicon is higher than a first metric threshold. The first metric threshold may be a fixed value or may be dynamically updated as the inference operation iterates.
[0090] In some embodiments of this disclosure, the number of the at least two draft lexicon sets is three. The first draft lexicon set with the lowest access performance includes draft lexicons obtained from a static lexicon library. The second draft lexicon set with intermediate access performance includes draft lexicons obtained from a dynamic lexicon library. The third draft lexicon set with the highest access performance includes draft lexicons obtained from the second draft lexicon set whose inference metrics meet preset conditions.
[0091] In some embodiments of this disclosure, the static lexicon library includes lexicons from a specified scenario.
[0092] In some embodiments of this disclosure, the dynamic lexicon includes at least one of the following: input lexicons for inference operations, output lexicons for inference operations, search results of the input lexicons from a search engine, lexicons obtained by prospective autospeculation inference on the input lexicons, lexicons obtained by skip-level inference on the input lexicons, or lexicons obtained by autoregressive inference on the input lexicons using a small model.
[0093] In some embodiments of this disclosure, the data storage format for the first draft lexicon set is a suffix array. The data storage format for the second draft lexicon set is a trie. The data storage format for the third draft lexicon set is a trie represented by key-value pairs. The retrieval time complexity of the suffix array, trie, and trie represented by key-value pairs decreases sequentially, and therefore their access performance increases sequentially.
[0094] In some embodiments of this disclosure, the number of the at least two draft lexicon sets is two. The first draft lexicon set, which has low access performance, includes draft lexicons obtained from a static lexicon library. The second draft lexicon set, which has high access performance, includes draft lexicons obtained from a dynamic lexicon library.
[0095] In some embodiments of this disclosure, the number of the at least two draft lexicon sets is two. The first draft lexicon set with low access performance includes draft lexicons obtained from a static lexicon library. The second draft lexicon set with high access performance includes draft lexicons obtained from the first draft lexicon set whose inference metrics meet preset conditions.
[0096] In some embodiments of this disclosure, the number of the at least two draft lexicon sets is two. The first draft lexicon set with low access performance includes draft lexicons obtained from a dynamic lexicon library. The second draft lexicon set with high access performance includes draft lexicons obtained from the first draft lexicon set whose inference metrics meet preset conditions.
[0097] In some embodiments of this disclosure, the capacity of a first draft lexicon set with low access performance is greater than the capacity of a second draft lexicon set with high access performance.
[0098] In some embodiments of this disclosure, the second draft lexicon set, which has high access performance, is stored in the static random access memory (SRAM) or high bandwidth memory (HBM) of a graphics processing unit (GPU) or neural network processing unit (NPU), or in the memory of a central processing unit (CPU). The first draft lexicon set, which has low access performance, is stored in the CPU's memory or on a hard disk.
[0099] In some embodiments of this disclosure, the inference metric for each draft word in the at least two draft word sets is determined based on at least one of the following: the frequency of the draft word in the at least two draft word sets, the latest acceptance rate of the draft word, the historical acceptance rate of the draft word, or the source of the draft word.
[0100] In some embodiments of this disclosure, the second acquisition module includes a first determining module and a selecting module. The first determining module is configured to determine whether the candidate draft lexicon groups in the first draft lexicon set that match the inference lexicons include multiple candidate draft lexicon groups. The selecting module is configured to, if so, select one or more candidate draft lexicon groups from the multiple candidate draft lexicon groups according to their inference metrics. The draft lexicons in the selected candidate draft lexicon groups are determined as the acquired draft lexicons.
[0101] In some embodiments of this disclosure, the inference metric for a single draft term is calculated as the product of a first coefficient, the frequency of the draft term in the at least two draft term sets, and the weight of the draft term, plus the product of a second coefficient and the latest acceptance rate of the draft term, minus the cost value of the draft term. The sum of the first and second coefficients is one. The weight of the draft term is associated with whether the draft term is hit in the inference operation. The initial cost value of the draft term is associated with the source of the draft term. The cost value of the draft term is set to a default value after the draft term is hit.
[0102] In some embodiments of this disclosure, the apparatus further includes a pruning module. The pruning module is configured to calculate an inference metric for each node in the trie if the number of nodes in a single trie exceeds a node count threshold N. The pruning module is also configured to delete a first node from the trie if the inference metric of the first node is lower than a second metric threshold. The pruning module is further configured to, after deleting the first node from the trie, if the number of nodes in the trie still exceeds N, sort all nodes in the trie in descending order of inference metric and delete the nodes ranked after the N-th. The inference metric for each node is determined based on at least one of the following: the frequency of the node's occurrence in the at least two draft lexicon sets, the node's latest acceptance rate, the node's historical acceptance rate, or the node's source.
[0103] In some embodiments of this disclosure, the apparatus further includes a tree pruning module. The tree pruning module is configured to determine an inference metric for each node in each trie within a draft lexicon set that includes tries. The tree pruning module is also configured to determine an inference metric for each trie based on the inference metrics of its nodes. The tree pruning module is further configured to, if the number of tries in a single draft lexicon set exceeds a tree threshold M corresponding to that draft lexicon set, sort all tries in the draft lexicon set in descending order of inference metric, and delete the tries ranked after the M-th from the draft lexicon set. The inference metric for each node is determined based on at least one of the following: the frequency of the node's occurrence in the at least two draft lexicon sets, the node's latest acceptance rate, the node's historical acceptance rate, or the node's source.
[0104] In summary, the inference method for artificial intelligence models according to embodiments of this disclosure arranges draft token sets with different access performance into levels and retrieves draft tokens preferentially from the draft token set with high access performance, which helps reduce the time spent retrieving draft tokens. This can correspondingly improve the speed at which the inference device performs inference operations.
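The level-by-level retrieval summarized above can be sketched as follows. This is a simplified illustration: each level is modeled as a plain dictionary, whereas the disclosure gives each level its own storage format, and stopping at the first level that yields a match is an assumption of this sketch. The names lookup_drafts, tiers, and max_drafts are hypothetical.

```python
def lookup_drafts(tiers, context, max_drafts=4):
    """Return (level, draft_tokens) from the first level that has a match.

    `tiers` is ordered from highest to lowest access performance; each
    level maps a context tuple to a list of candidate draft tokens.
    """
    for level, tier in enumerate(tiers):
        drafts = tier.get(context)
        if drafts:
            # Assumption: stop at the first (fastest) level that hits.
            return level, drafts[:max_drafts]
    return None, []  # no level holds drafts for this context


# Hypothetical three-level arrangement with made-up contents.
fast = {("the", "cat"): ["sat"]}
middle = {("on", "the"): ["mat", "floor"]}
slow = {("once", "upon"): ["a", "time"]}

hit = lookup_drafts([fast, middle, slow], ("on", "the"))
miss = lookup_drafts([fast, middle, slow], ("no", "match"))
```

The returned draft tokens would then be verified together with the inference tokens in a single parallel pass of the model.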
[0105] This disclosure can be a method, apparatus, system, and / or computer program product. A computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for performing various aspects of this disclosure.
[0106] A computer-readable storage medium can be a tangible device capable of holding and storing instructions for use by an instruction execution device. A computer-readable storage medium can be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination thereof. More specific examples (a non-exhaustive list) of computer-readable storage media include: random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), and any suitable combination thereof. The computer-readable storage medium as used herein is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber optic cables), or electrical signals transmitted through wires.
[0107] The computer-readable program instructions described herein can be downloaded from computer-readable storage media to various computing / processing devices, or downloaded via a network, such as the Internet, local area network, wide area network, and / or wireless network, to an external computer or external storage device. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and / or edge servers. A network adapter card or network interface in each computing / processing device receives the computer-readable program instructions from the network and forwards them to the computer-readable storage media in the respective computing / processing device.
[0108] Computer program instructions used to perform the operations of this disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., via the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), is personalized by utilizing state information of the computer-readable program instructions, and the electronic circuitry can execute the computer-readable program instructions to implement various aspects of this disclosure.
[0109] Various aspects of this disclosure are described herein with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this disclosure. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer-readable program instructions.
[0110] These computer-readable program instructions can be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that, when executed by the processing unit of the computer or other programmable data processing apparatus, they create means for implementing the functions / actions specified in one or more blocks of the flowchart and / or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium that causes a computer, programmable data processing apparatus, and / or other device to operate in a particular manner. Thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions / actions specified in one or more blocks of the flowchart and / or block diagram.
[0111] Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions/actions specified in one or more blocks of the flowchart and/or block diagram.
[0112] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of instructions that contains one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions noted in the blocks may occur in a different order than that shown in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, or they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.
[0113] Unless otherwise expressly indicated by the context, the singular form of words used herein and in the appended claims includes the plural form, and vice versa. Thus, when referring to the singular, the plural form of the corresponding term is generally included. Where the term “example” is used herein, particularly when it follows a set of terms, the “example” is merely exemplary and illustrative and should not be considered exclusive or exhaustive.
[0114] Further aspects and areas of applicability will become apparent from the description provided herein. It should be understood that various aspects of this application may be implemented individually or in combination with at least one other aspect. It should also be understood that the descriptions and specific embodiments herein are for illustrative purposes only and are not intended to limit the scope of this application.
[0115] The various embodiments of this disclosure have been described above. These descriptions are exemplary and not exhaustive, and are not limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, their practical application, or their technical improvement over technologies found in the marketplace, or to enable others skilled in the art to understand the embodiments disclosed herein.
Claims
1. An inference method for an artificial intelligence model, characterized in that the method comprises: obtaining inference tokens for an inference operation; obtaining, based on the inference tokens, draft tokens from at least two draft token sets, wherein each of the at least two draft token sets has a different data storage format, the at least two draft token sets are arranged in at least two levels from high to low according to the access performance of the draft tokens they store, each draft token set corresponds to one level, and the draft tokens are obtained level by level starting from the draft token set with the highest access performance; and performing the inference operation in parallel based on the inference tokens and the draft tokens.
2. The method according to claim 1, characterized in that the method further comprises: when an inference metric of a first draft token in a first draft token set of the at least two draft token sets meets a preset condition, converting the data storage format of the first draft token from a first data storage format corresponding to the first draft token set to a second data storage format corresponding to a second draft token set, and migrating the first draft token to the second draft token set, wherein the first draft token set and the second draft token set are at adjacent levels, and the access performance of the first draft token set is lower than that of the second draft token set.
3. The method according to claim 1 or 2, characterized in that the number of the at least two draft token sets is three, wherein a first draft token set with the lowest access performance includes draft tokens obtained from a static lexicon library, a second draft token set with intermediate access performance includes draft tokens obtained from a dynamic lexicon library, and a third draft token set with the highest access performance includes draft tokens that are obtained from the second draft token set and whose inference metrics meet a preset condition.
4. The method according to claim 3, characterized in that the data storage format of the first draft token set is a suffix array, the data storage format of the second draft token set is a trie, and the data storage format of the third draft token set is a trie represented by key-value pairs.
5. The method according to claim 1 or 2, characterized in that the number of the at least two draft token sets is two, wherein: a first draft token set with low access performance includes draft tokens obtained from a static lexicon library, and a second draft token set with high access performance includes draft tokens obtained from a dynamic lexicon library; or the first draft token set with low access performance includes draft tokens obtained from the static lexicon library, and the second draft token set with high access performance includes draft tokens that are obtained from the first draft token set and whose inference metrics meet a preset condition; or the first draft token set with low access performance includes draft tokens obtained from the dynamic lexicon library, and the second draft token set with high access performance includes draft tokens that are obtained from the first draft token set and whose inference metrics meet a preset condition.
6. The method according to claim 1 or 2, characterized in that the capacity of a first draft token set, which has lower access performance, is greater than the capacity of a second draft token set, which has higher access performance.
7. The method according to claim 1 or 2, characterized in that a second draft token set with high access performance is stored in the static random access memory or high-bandwidth memory of a graphics processing unit or neural network processing unit, or in the memory of a central processing unit (CPU), and a first draft token set with low access performance is stored in the memory of the CPU or on a hard disk.
8. The method according to claim 2, characterized in that the inference metric for each draft token in the at least two draft token sets is determined based on at least one of the following: the frequency of the draft token in the at least two draft token sets, the latest acceptance rate of the draft token, the historical acceptance rate of the draft token, or the source of the draft token.
9. An inference device for an artificial intelligence model, comprising: a first acquisition module configured to acquire inference tokens for an inference operation; a second acquisition module configured to acquire draft tokens from at least two draft token sets based on the inference tokens, wherein each of the at least two draft token sets has a different data storage format, the at least two draft token sets are arranged in at least two levels from high to low according to the access performance of the draft tokens stored therein, each draft token set corresponds to one level, and the draft tokens are acquired level by level starting from the draft token set with high access performance; and an execution module configured to perform the inference operation in parallel based on the inference tokens and the draft tokens.
10. A computer program product tangibly stored on a non-transitory computer-readable medium and comprising machine-executable instructions for performing the method according to any one of claims 1-8.