Model post-processing method and apparatus

By filtering and normalizing K elements in the logits output by the model, the problem of low efficiency caused by large data volume in the post-processing stage of the model is solved, the inference efficiency and accuracy are improved, the device load is reduced, and the user experience is enhanced.

WO2026137841A1 · PCT designated stage · Publication Date: 2026-07-02 · HUAWEI TECH CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
HUAWEI TECH CO LTD
Filing Date
2025-07-30
Publication Date
2026-07-02


Abstract

A model post-processing method and apparatus, relating to the technical field of computers. In the method, an inference device acquires first logits (S310), screens the first logits to obtain the top K elements arranged in descending order (S320), performs normalization processing on the K elements (S330) to obtain first confidence scores of K tokens, and accordingly determines, on the basis of the first confidence scores of the K tokens, a first target token corresponding to an inference request (S340). K is greater than or equal to 1, and there is a mapping relationship between one element among the logits and one token. After the top K elements in descending order are screened out from the first logits, the inference device processes only the K elements or the K tokens corresponding to the K elements in subsequent processing steps (processes such as normalization processing and determination of the first target token).

Description

A model post-processing method and apparatus

[0001] This application claims priority to Chinese Patent Application No. 202411929605.9, filed on December 23, 2024, entitled “A Model Post-processing Method and Apparatus”, the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This application relates to the field of computer technology, and in particular to a model post-processing method and apparatus.

Background Technology

[0003] In the field of natural language processing (NLP) within artificial intelligence, model inference requires a series of operations to get from the initial predicted values (logits) to a smallest linguistic unit (token). These operations include filtering, normalizing, and sampling the logits. However, because all of these operations are performed on the complete logits, the amount of data processed during the logits-to-token stage is large, resulting in low model inference efficiency.

Summary of the Invention

[0004] This application provides a model post-processing method and apparatus to solve the problem of large data volume and low model inference efficiency during the model post-processing stage.

[0005] In a first aspect, this application provides a model post-processing method. The method can be applied to a computer system, or to a computing device within the computer system that supports implementing the model post-processing method. For example, the computing device can be called an inference device, which may be a terminal, a server, etc. In one possible example, the model post-processing method includes: the inference device acquiring first logits, selecting the top K elements in descending order from the first logits, normalizing the K elements to obtain the first confidence scores of the K tokens, and then determining the first target token corresponding to the first inference request based on the first confidence scores of the K tokens. Here, the first logits are obtained by the model based on the first inference request, K is greater than or equal to 1, there is a mapping relationship between each element in the logits and a token, and the K tokens include the first target token.

[0006] In this application, after the inference device selects the top K elements in descending order from the first logits, subsequent processing steps (such as normalization and determining the first target token corresponding to the inference request based on the first confidence of the K tokens) are only processed for the K elements or the K tokens corresponding to the K elements. This reduces the amount of data that the inference device needs to process during post-processing, thereby reducing the time required for post-processing and improving the efficiency of model inference.
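The flow described above (select the top K elements, normalize only those K elements, then pick a token) can be sketched as follows. This is a minimal illustration, not the implementation claimed in this application: the function name `top_k_softmax`, the five-word vocabulary, the example logits, and the use of softmax as the normalization are all assumptions made for the example.

```python
import heapq
import math

def top_k_softmax(logits, k):
    """Return (indices, probs) for the k largest logits, in descending order."""
    # Keep only the k largest elements together with their vocabulary indices.
    top = heapq.nlargest(k, enumerate(logits), key=lambda pair: pair[1])
    indices = [i for i, _ in top]
    values = [v for _, v in top]
    # Softmax over k values only; subtracting the max keeps exp() stable.
    m = values[0]
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return indices, [e / total for e in exps]

vocab = ["the", "cat", "sat", "on", "mat"]  # hypothetical vocabulary
logits = [2.0, 0.5, 3.0, -1.0, 1.0]         # hypothetical first logits
indices, probs = top_k_softmax(logits, k=3)
target = vocab[indices[0]]  # greedy choice: token with the highest confidence
```

After the selection step, every later operation touches K values rather than the full vocabulary, which is the source of the claimed saving.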

[0007] In one possible scenario, the attributes of the aforementioned elements include: index, meaning, etc. The index of an element indicates its position in the logits; for example, element 'a' is the first element in the first logits. The meaning of an element represents the model's relative "confidence" in each token, that is, it represents the relative probability value (confidence level) of each token; for example, the larger the value of element 'a' in the first logits, the greater the confidence level of the token corresponding to the index of element 'a'.

[0008] In one possible example, the mapping between an element in the logits and a token means that there is a mapping between the index of the element in the logits and the token.

[0009] For example, under the same vocabulary, the mapping relationship between the index of an element and the token is consistent across different logits. For instance, the first element in the first logits and the first element in the second logits both correspond to the same token in the mapping relationship.

[0010] In one possible example, the first target token is taken from the first Q tokens of the K tokens, where Q is greater than or equal to 1 and the K tokens are arranged in descending order of first confidence level.

[0011] In another possible example, the first target token is determined by the inference device through random sampling of the first confidence levels of K tokens.

[0012] In one possible implementation, the inference device determines the first target token corresponding to the inference request based on the first confidence scores of K tokens. This includes: the inference device sampling the first confidence scores of the K tokens to obtain the target confidence score among the first confidence scores of the K tokens, and then determining the first target token corresponding to the element indicated by the target confidence score from the mapping relationship between elements and tokens in the first logits. The target confidence score includes one or more first confidence scores, and the first target token includes one or more tokens.

[0013] In this application, since the mapping relationship between elements in logits and tokens accurately shows the correspondence between the position of elements in logits and tokens, the inference device can quickly and accurately determine the first target token corresponding to the element indicated by the target confidence from the mapping relationship between elements in logits and tokens, thereby improving the accuracy and efficiency of model inference.

[0014] In one possible scenario, the target confidence level is the first confidence level of the first Q tokens arranged in descending order from the first confidence levels of the K tokens.

[0015] In this application, the inference device takes the first confidence scores of the first Q tokens arranged in descending order as the target confidence scores. Because the first target token is therefore among the Q highest-confidence tokens of the K tokens, the first target token obtained by the inference device is of higher quality, which improves the accuracy of obtaining the first target token.

[0016] In another possible scenario, the target confidence level is obtained by randomly sampling the first confidence level of the K tokens by the inference device.

[0017] In this application, because the target confidence of the first target token is obtained by random sampling, the first target token obtained by the inference device has strong randomness, which improves the diversity and flexibility of the first target token and prevents the selection from getting trapped in a local optimum.
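The two ways of obtaining the target confidence (taking the first Q entries in descending order, or random sampling) can be contrasted in a short sketch. The helper names `pick_top_q` and `pick_random`, the example indices and confidences, and the choice of confidence-weighted sampling are assumptions for illustration; the application does not specify the sampling distribution.

```python
import random

def pick_top_q(sorted_indices, q):
    """Deterministic choice: the first q indices of a descending-sorted list."""
    return sorted_indices[:q]

def pick_random(indices, probs, rng):
    """Random choice: draw one index, weighted by its first confidence."""
    return rng.choices(indices, weights=probs, k=1)[0]

indices = [2, 0, 4]      # vocabulary indices of the K tokens (hypothetical)
probs = [0.6, 0.3, 0.1]  # their first confidences, already in descending order
greedy = pick_top_q(indices, q=1)
sampled = pick_random(indices, probs, random.Random(0))
```

The returned index is then looked up in the index-to-token mapping to obtain the first target token.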

[0018] In one possible implementation, the inference device determines the first target token corresponding to the first inference request based on the first confidence scores of K tokens. This includes: the inference device selecting P elements from the K elements based on the first confidence scores of the K tokens, and then normalizing the P elements to obtain the second confidence scores of the P tokens. The inference device then determines the first target token corresponding to the first inference request based on the second confidence scores of the P tokens. Wherein, P is greater than or equal to 1, P is less than K, the first confidence scores of the tokens corresponding to the P elements are greater than or equal to a first threshold, or the first confidence scores of the tokens corresponding to the P elements are the first P tokens arranged in descending order, and the sum of the first confidence scores of the P tokens is greater than or equal to the second threshold.

[0019] In this application, the inference device selects P elements from K elements, and then in subsequent processing steps (such as normalization processing, determining the first target token, etc.), it only processes the P elements or the P tokens corresponding to the P elements, which reduces the amount of data that the inference device needs to process during post-processing, thereby reducing the time required for post-processing and improving the model inference efficiency.

[0020] In one possible example, the first threshold is M times the maximum value of the first confidence scores of the K tokens, where 0 < M < 1.

[0021] In one possible example, the second threshold mentioned above is configured by the user.
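The two selection rules for the P elements (a first threshold relative to the maximum confidence, or a cumulative sum reaching the second threshold) might look like the sketch below. The function names and the example values are hypothetical. Dividing the kept first confidences by their sum is used here because it gives the same result as applying softmax to the P underlying elements.

```python
def select_by_ratio(probs, m):
    """Keep positions whose confidence is >= m * max(probs) (first threshold)."""
    threshold = m * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]

def select_by_cumulative(probs, second_threshold):
    """Keep the shortest descending prefix whose sum reaches the threshold."""
    kept, total = [], 0.0
    for i, p in enumerate(probs):  # probs are assumed sorted in descending order
        kept.append(i)
        total += p
        if total >= second_threshold:
            break
    return kept

def renormalize(probs, kept):
    """Second confidences: the kept first confidences rescaled to sum to 1."""
    total = sum(probs[i] for i in kept)
    return [probs[i] / total for i in kept]

first = [0.6, 0.3, 0.1]                      # first confidences of K = 3 tokens
by_ratio = select_by_ratio(first, m=0.4)     # threshold is 0.4 * 0.6
by_sum = select_by_cumulative(first, 0.8)    # second threshold of 0.8
second = renormalize(first, by_ratio)
```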

[0022] In one possible implementation, the inference device determines the first target token corresponding to the first inference request based on the first confidence scores of the K tokens. This includes: the inference device setting N elements out of the K elements to negative infinity based on the first confidence scores of the K tokens; normalizing the K elements, including the N negative infinity values, to obtain the third confidence scores of the K tokens; and then determining the first target token corresponding to the first inference request based on the third confidence scores of the K tokens. Here, N is greater than or equal to 0, and N is less than K.

[0023] In this application, the inference device determines the first target token based on the third confidence scores (non-zero values) of the K-N tokens and the third confidence scores (zero values) of the N tokens. Since the third confidence scores of the K tokens include N zero values, the computational load in the subsequent process of determining the first target token can be reduced, thereby reducing the time required for post-processing and improving the efficiency of model inference.

[0024] In one possible scenario, the inference device selects the element with a first confidence level less than a first threshold from among the first confidence levels of the K tokens as one of the N elements.

[0025] In this application, the inference device reduces the computational load of subsequent steps (determining the first target token corresponding to the first inference request based on the first confidence of K tokens) by filtering out elements corresponding to the first confidence that are less than the first threshold and setting the filtered elements to negative infinity. This reduces the time required for post-processing and improves the efficiency of model inference.

[0026] In another possible scenario, the first confidence level of the tokens corresponding to the N elements is the sum of the first confidence levels of the N tokens in descending order, and the sum of the first confidence levels of the tokens corresponding to the N elements is less than the second threshold.

[0027] In this application, the inference device filters out the N elements whose sum of the first confidence scores of the tokens corresponding to the N elements is less than or equal to the second threshold and is arranged in descending order, and sets the filtered elements to negative infinity. This reduces the computational load of subsequent steps (determining the first target token corresponding to the first inference request based on the second confidence scores of the K tokens), thereby reducing the time required for post-processing and improving the efficiency of model inference.
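The masking variant keeps the vector at length K and sets the N low-confidence elements to negative infinity, so that normalization assigns them a confidence of exactly zero. The sketch below uses the first-threshold rule. The function names, the `ratio` parameter (playing the role of M), and the input values are assumptions for the example.

```python
import math

def softmax(values):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def mask_and_renormalize(top_values, ratio):
    """Set elements below ratio * max confidence to -inf, then normalize again."""
    first = softmax(top_values)              # first confidences
    threshold = ratio * max(first)           # first threshold
    masked = [v if p >= threshold else -math.inf
              for v, p in zip(top_values, first)]
    return softmax(masked)                   # third confidences, still length K

third = mask_and_renormalize([3.0, 2.0, 0.0], ratio=0.2)
```

Because the ratio is below 1, the largest element is never masked, which matches the requirement that N is less than K.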

[0028] In one possible implementation, the above-mentioned model post-processing method further includes: the inference device acquiring second logits and selecting the top L elements in descending order from the second logits. The inference device normalizes O elements, which include the L elements and negative infinity values, to obtain the fourth confidence scores of the O tokens, and then determines the second target token corresponding to the second inference request based on the fourth confidence scores of the O tokens. Here, the second logits are obtained by the model based on the second inference request, the model performs parallel inference on the first and second inference requests, and the O tokens include the second target token.

[0029] In this application, the inference device can perform parallel inference on the first inference request and the second inference request, reducing the waiting time for inference on the first inference request, thereby improving the overall inference efficiency of the inference device and reducing the time the user waits to obtain the target token. Furthermore, the inference device aligns the elements selected for the first inference request with those selected for the second inference request (the number of K elements equals the number of L elements plus the negative infinity values), enabling the inference device to use the same set of parameters to perform subsequent steps on the selected data (such as normalization, and determining the target token corresponding to each inference request based on the confidence scores), thereby improving the model inference efficiency.
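The alignment of two parallel requests can be pictured as padding the shorter row with negative infinity so that both rows have the same width and can go through one normalization routine. This is a sketch under assumed names (`pad_row`, `softmax`) and made-up values, with K = 4 and L = 2.

```python
import math

def softmax(values):
    """Numerically stable softmax; padded -inf entries come out as 0.0."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def pad_row(values, width):
    """Right-pad a row with -inf so every row in the batch has `width` entries."""
    return list(values) + [-math.inf] * (width - len(values))

first_row = [4.0, 3.0, 2.0, 1.0]  # K = 4 elements kept for the first request
second_row = [5.0, 1.0]           # L = 2 elements kept for the second request
batch = [pad_row(row, len(first_row)) for row in (first_row, second_row)]
confidences = [softmax(row) for row in batch]  # same routine for both rows
```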

[0030] In one possible example, the inference device determines the second target token corresponding to the second inference request based on the fourth confidence scores of the O tokens. This includes: the inference device sampling the fourth confidence scores of the O tokens to obtain a target confidence score from the fourth confidence scores of the O tokens; and the inference device determining the second target token corresponding to the element indicated by the target confidence score from the mapping relationship between elements and tokens in the first logits. The second target token includes one or more tokens.

[0031] In one possible implementation, after the inference device selects the top K elements in descending order from the logits, the model post-processing method further includes: the inference device receiving a first request that instructs outputting the post-processed logits, and outputting the post-processed logits. The post-processed logits include the K elements, and negative infinity values at all positions other than those of the K elements.

[0032] In this application, the inference device outputs post-processed logits according to the first request, which satisfies the user's need for post-processed logits and improves the user experience.

[0033] In one possible example, the inference device outputs the post-processed logits as follows: the inference device obtains a first vector consisting entirely of negative infinity values, where the number of negative infinity values in the first vector is the same as the number of elements in the first logits; the inference device then replaces the negative infinity values in the first vector that have the same indices as the K elements with the K elements, and obtains and outputs the post-processed logits.

[0034] For example, if K is 3, element 1 is located at the 1st position in the first logits, element 2 is located at the 3rd position in the first logits, and element 3 is located at the 130th position in the first logits. The inference device replaces the negative infinity value at the 1st position in the first vector with element 1, replaces the negative infinity value at the 3rd position in the first vector with element 2, and replaces the negative infinity value at the 130th position in the first vector with element 3.
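The K = 3 example above can be written out as a small scatter routine. The vocabulary size of 200 and the element values are invented for illustration, and the indices are zero-based, so positions 1, 3, and 130 become indices 0, 2, and 129.

```python
import math

def scatter_logits(vocab_size, indices, values):
    """Build a full-width vector of -inf and write the kept elements back."""
    out = [-math.inf] * vocab_size   # the "first vector" of negative infinity
    for i, v in zip(indices, values):
        out[i] = v                   # same index as in the original logits
    return out

post = scatter_logits(200, indices=[0, 2, 129], values=[3.5, 2.25, 1.0])
finite_count = sum(1 for v in post if v != -math.inf)
```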

[0035] In another possible example, the inference device outputs the post-processed logits as follows: the inference device obtains a first vector consisting entirely of negative infinity values, and removes the negative infinity values from the K elements to obtain P elements. The inference device then replaces the negative infinity values in the first vector that have the same indices as the P elements with the P elements, and obtains and outputs the post-processed logits.

[0036] In one possible implementation, after the inference device selects the top K elements in descending order from the logits, the model post-processing method further includes: obtaining a second request that instructs outputting the confidence levels of the tokens, and outputting the confidence levels of the tokens. The confidence levels of the tokens include the first confidence levels of the K tokens, and zero values at all positions other than those of the K tokens.

[0037] In this application, the inference device outputs the confidence level of the token based on the second request, which satisfies the user's need for the confidence level of the token and improves the user experience.

[0038] In one possible example, the inference device outputs the confidence score of the token by normalizing the post-processed logits to obtain the confidence score of the token.

[0039] In another possible example, the inference device outputs the confidence score of the token by: obtaining a second vector consisting entirely of zero values, the second vector containing the same number of zero values as the first logits contains elements; replacing the zero values in the second vector that have the same indices as the K elements with the first confidence scores of the aforementioned K tokens; and then obtaining and outputting the confidence score of the token.

[0040] For example, if K is 3, element 1 is located at position 1 in the first logits, element 2 is located at position 3 in the first logits, and element 3 is located at position 130 in the first logits. The inference device replaces the zero value at position 1 in the second vector with the first confidence level of the token corresponding to position 1, replaces the zero value at position 3 in the second vector with the first confidence level of the token corresponding to position 3, and replaces the zero value at position 130 in the second vector with the first confidence level of the token corresponding to position 130.
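The two routes for producing the full-width confidence output, normalizing the post-processed logits as in [0038] or writing the K first confidences into a zero vector as in [0039], should agree. The sketch below checks that on a toy vector of width 8, assuming softmax is the normalization; the helper names and values are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax; -inf entries map to 0.0."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def scatter(size, indices, values, fill):
    """Vector of `fill` with `values` written at `indices`."""
    out = [fill] * size
    for i, v in zip(indices, values):
        out[i] = v
    return out

indices, top_values = [0, 2, 5], [3.0, 2.0, 1.0]
# Route A: normalize the -inf-filled post-processed logits.
route_a = softmax(scatter(8, indices, top_values, -math.inf))
# Route B: normalize the K elements, then write them into a zero vector.
route_b = scatter(8, indices, softmax(top_values), 0.0)
```

Route B normalizes only K values, so it stays closer to the goal of keeping post-processing proportional to K rather than to the vocabulary size.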

[0041] In another possible example, the inference device outputs the confidence score of the token by: obtaining a second vector consisting entirely of zero values; removing the zero values from the K tokens to obtain P tokens; and replacing the zero values in the second vector that have the same indices as the P elements with the first confidence values of the P tokens, thus obtaining and outputting the confidence score of the token.

[0042] In one possible scenario, the first confidence level of the first token among the aforementioned K tokens is: the probability value of the first token in the probability distribution or logarithmic probability distribution of the K tokens.
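The difference between a probability distribution and a logarithmic probability distribution over the K tokens can be shown with a log-softmax sketch. The function name `log_softmax` and the three input values are assumptions for the example.

```python
import math

def log_softmax(values):
    """Log-probabilities over the given values, computed stably."""
    m = max(values)
    log_total = math.log(sum(math.exp(v - m) for v in values))
    return [v - m - log_total for v in values]

log_probs = log_softmax([3.0, 2.0, 1.0])        # logarithmic probability form
probs = [math.exp(lp) for lp in log_probs]      # ordinary probability form
```

Both forms preserve the ordering of the K tokens, so either can serve as the first confidence when selecting by rank.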

[0043] In one possible implementation, the inference device obtaining the first logits includes: obtaining a first inference request; determining one or more tokens included in the first inference request from the mapping relationship between tokens and text; and inputting the one or more tokens into the model and outputting the first logits.

[0044] In one possible scenario, the first inference request may include: data in one modality or multiple modalities. For example, the first inference request may consist of only text.

[0045] For example, the first inference request includes: text and image, text and video, text and voice, voice and image, or voice and video, etc.

[0046] In one possible scenario, the first target token corresponds to text in the vocabulary, such as characters, numbers, mathematical symbols, punctuation marks, etc.

[0047] Secondly, this application provides a model post-processing apparatus. This model post-processing apparatus is applied to a computer system or to a computing device that supports the computer system in implementing a model post-processing method. The computing device may be referred to as an inference device. The model post-processing apparatus includes modules for executing the model post-processing method in the first aspect or any optional implementation of the first aspect. In one possible example, the model post-processing apparatus includes: an acquisition module, a filtering module, a first processing module, and a determination module.

[0048] The acquisition module is used to obtain the first logits; the first logits are obtained by the model based on the first inference request.

[0049] The filtering module is used to filter the first K elements in the first logits in descending order, where K is greater than or equal to 1; there is a mapping relationship between an element in the logits and a minimal language unit token.

[0050] The first processing module is used to normalize the K elements to obtain the first confidence level of the K tokens.

[0051] The determination module is used to determine the first target token corresponding to the first inference request based on the first confidence level of K tokens, where the K tokens include the first target token.

[0052] In one possible implementation, a determining module is specifically used to sample the first confidence of K tokens to obtain the target confidence among the first confidence of K tokens, and to determine the first target token corresponding to the element indicated by the target confidence from the mapping relationship between elements and tokens in logits; the target confidence includes one or more first confidences, and the first target token includes one or more tokens.

[0053] In one possible scenario, the target confidence level is the first confidence level of the first Q tokens in descending order of their first confidence levels, where Q is greater than or equal to 1.

[0054] In one possible scenario, the target confidence level is obtained by randomly sampling the first confidence level of the K tokens.

[0055] In one possible implementation, the determining module is specifically used to determine the first target token corresponding to the first inference request based on the first confidence scores of the K tokens. This includes: the inference device selecting P elements from the K elements based on the first confidence scores of the K tokens, and then normalizing the P elements to obtain the second confidence scores of the P tokens. The inference device determines the first target token corresponding to the first inference request based on the second confidence scores of the P tokens. Wherein, P is greater than or equal to 1, P is less than K, the first confidence scores of the tokens corresponding to the P elements are greater than or equal to a first threshold, or the first confidence scores of the tokens corresponding to the P elements are the first P tokens arranged in descending order, and the sum of the first confidence scores of the P tokens is greater than or equal to the second threshold.

[0056] In one possible implementation, the determining module is specifically used to, based on the first confidence level of the K tokens, set N elements out of the K elements to negative infinity, and normalize the K elements, including the N negative infinity values, to obtain the third confidence level of the K tokens; and, based on the third confidence level of the K tokens, determine the first target token corresponding to the first inference request. Wherein, N is greater than or equal to 0, N is less than K, the first confidence level of the tokens corresponding to the N elements is less than a first threshold, or the second confidence level of the tokens corresponding to the N elements is the sum of the first confidence levels of the N tokens arranged in descending order, and the sum of the first confidence levels of the tokens corresponding to the N elements is less than a second threshold.

[0057] In one possible implementation, the first threshold is M times the maximum value of the first confidence scores of the K tokens, where 0 < M < 1.

[0058] In one possible implementation, the model post-processing device further includes a second processing module. The second processing module is used to obtain second logits, which are obtained by the model based on the second inference request; filter the top L elements in descending order from the second logits; normalize the O elements to obtain the fourth confidence scores of the O tokens; and determine the second target token corresponding to the second inference request based on the fourth confidence scores of the O tokens. The model performs parallel inference on the first and second inference requests; L is greater than or equal to 1, L is less than K; O equals K; the O elements include the L elements and negative infinity; and the O tokens include the second target token.

[0059] In one possible implementation, the model post-processing apparatus further includes a first output module. The first output module is configured to acquire a first request and output post-processed logits; the first request instructs outputting the post-processed logits; the post-processed logits include the K elements, and negative infinity values at all positions other than those of the K elements.

[0060] In one possible implementation, the model post-processing apparatus further includes a second output module. The second output module is used to obtain a second request and output the confidence levels of the tokens; the second request instructs outputting the confidence levels of the tokens; the confidence levels of the tokens include the first confidence levels of the K tokens, and zero values at all positions other than those of the K tokens.

[0061] In one possible implementation, the first confidence of the first token among the first confidence of K tokens is: the probability value of the first token in the probability distribution or logarithmic probability distribution of the K tokens.

[0062] For more detailed implementation information regarding the model post-processing device, please refer to the description of any of the implementation methods in the first aspect above, as well as the content of the specific implementation methods below, which will not be repeated here.

[0063] Thirdly, this application provides a chip. The chip includes: a processor and an interface circuit; the interface circuit is used to acquire first data, and the processor is used to execute the method in the first aspect or any possible implementation of the first aspect based on the first data.

[0064] Fourthly, this application provides a computing device cluster. The computing device cluster includes at least one computing device, which includes a memory and a processor. The memory stores computer instructions; when the processor executes the computer instructions, it implements the method described in the first aspect or any possible implementation of the first aspect.

[0065] Fifthly, this application provides a computer-readable storage medium. The storage medium stores a computer program or instructions that, when executed by a processing device, implement the method described in the first aspect or any possible implementation thereof.

[0066] Sixthly, this application provides a computer program product. The computer program product includes a computer program or instructions that, when executed by a processing device, implement the method described in the first aspect or any possible implementation thereof.

[0067] For the beneficial effects of the second through sixth aspects above, refer to the first aspect or any possible implementation of the first aspect; they are not elaborated here. The implementations provided in the above aspects can also be further combined to provide more implementations.

Attached Figure Description

[0068] Figure 1 is a schematic diagram of the architecture of a computer system provided in this application;

[0069] Figure 2 is a schematic diagram of a model inference process provided in this application;

[0070] Figure 3 is a flowchart illustrating a model post-processing method provided in this application;

[0071] Figure 4 is a schematic flowchart of a model post-processing method provided in this application.

[0072] Figure 5 is a flowchart illustrating a model post-processing method provided in this application.

[0073] Figure 6 is a flowchart illustrating a logits generation method provided in this application;

[0074] Figure 7a is a schematic diagram of a model post-processing device provided in this application;

[0075] Figure 7b is a schematic diagram of the structure of a model post-processing device provided in this application;

[0076] Figure 8 is a schematic diagram of the structure of a computing device provided in this application;

[0077] Figure 9 is a schematic diagram of the structure of a computing device cluster provided in this application;

[0078] Figure 10 is a schematic diagram of the connection between computing devices provided in this application.

Detailed Implementation

[0079] To facilitate understanding, the technical terms used in this application will be introduced first.

[0080] Large language models (LLMs) leverage their powerful computing capabilities and sophisticated algorithms to effectively process massive amounts of data, providing users with efficient and accurate information processing and analysis services. LLMs not only excel in understanding and generating human language but also demonstrate immense potential in solving complex problems and tasks. For example, LLMs are widely used in automated question-answering systems, text summarization, machine translation, and language generation, significantly improving efficiency and accuracy. Particularly when dealing with large-scale datasets, LLMs can uncover deep patterns and connections, supporting decision-making. Furthermore, the self-learning ability of LLMs allows them to continuously evolve, constantly improving their performance and intelligence by learning from new data.

[0081] Commonly used LLMs include transformer-based unidirectional or bidirectional encoder representation models. For example, a unidirectional encoder representation model can be a generative pre-trained transformer (GPT), and a bidirectional encoder representation model can be BERT (bidirectional encoder representations from transformers). LLMs can utilize unlabeled text to pre-train deep representations by co-adjusting left and right contexts across all layers. In some possible cases, LLMs can also be simply referred to as large models. The large models provided in this application can refer not only to LLMs but also to models with parameters greater than or equal to a threshold. Depending on the domain in which the large model is applied, it can also refer to models that include various functions such as image processing, human-computer interaction, semantic search, semantic query, or dialogue. This application does not limit the domain in which large models can be applied or their specific names. In this document, for the sake of simplicity, the term "large model" is used throughout, but this should not be construed as a limitation of this application, and will not be elaborated further hereafter.

[0082] Logits represent the output tensor of the model, and their dimension is equal to the vocabulary size or the batch size. The aforementioned models include both large models and models with fewer than a threshold number of parameters.

[0083] Probabilities (probs) represent the probability value of each token obtained by the model. The dimension of probs is the vocabulary size or the batch size. Normalizing the logits typically yields the probability values of the tokens.

[0084] Post-processing refers to the process of converting the model's output logits into tokens, which indicate specific words in the vocabulary.

[0085] To address the issue of large data volumes and low model inference efficiency during the model post-processing stage, this application provides a model post-processing method. This method includes: an inference device acquiring a first logits; filtering the top K elements in descending order from the first logits; normalizing these K elements to obtain the first confidence scores of the K tokens; and determining the first target token corresponding to the inference request based on the first confidence scores of the K tokens. Here, the first logits are obtained by the model based on the first inference request, K is greater than or equal to 1, and there is a mapping relationship between each element in the logits and each token.

[0086] In this application, after the inference device selects the top K elements in descending order from the first logits, the subsequent processing steps (such as normalization and determining the first target token corresponding to the inference request based on the first confidence scores of the K tokens) are performed only on the K elements or on the K tokens corresponding to the K elements. This reduces the amount of data that the inference device needs to process during post-processing, thereby reducing the time required for post-processing and improving the efficiency of model inference.

[0087] Furthermore, by reducing the amount of data that the inference device needs to process during post-processing, the load on the inference device during the inference process is reduced.
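The flow summarized in [0085] to [0087] can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the claimed implementation: the function name `post_process`, the use of `heapq.nlargest` for the filtering step, the toy logits, and the greedy choice of the most confident token are all assumptions made for the example.

```python
import heapq
import math


def post_process(logits, k):
    """Return (index, confidence) of the token chosen from the top-k logits only."""
    # Keep only the k largest elements, together with their original indices.
    top = heapq.nlargest(k, enumerate(logits), key=lambda pair: pair[1])
    # Normalize (softmax) over the k survivors instead of the whole vector.
    exps = [math.exp(value) for _, value in top]
    total = sum(exps)
    confidences = [e / total for e in exps]
    # Greedy selection (an assumption): take the most confident of the k tokens.
    best = max(range(len(top)), key=lambda i: confidences[i])
    return top[best][0], confidences[best]


# Hypothetical toy logits; index 3 holds the largest value.
toy_logits = [0.05, 0.01, 0.1, 0.4, 0.01, 0.05, 0.02, 0.2, 0.001]
token_index, confidence = post_process(toy_logits, k=6)
```

Only the `heapq.nlargest` call touches every element of the logits; the normalization and the selection operate on K values regardless of the vocabulary size.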

[0088] Next, the model post-processing method provided in this application will be described in detail with reference to the accompanying drawings.

[0089] First, refer to Figure 1, which is a schematic diagram of the architecture of a computer system provided in this application. As shown in Figure 1, the computer system includes an inference device 110, a database 120, a terminal device 130, and a data acquisition device 140.

[0090] The inference device 110 can be a server or terminal, such as a mobile phone, tablet, laptop, virtual reality (VR) device, augmented reality (AR) device, mixed reality (MR) device, extended reality (ER) device, camera, or vehicle terminal, etc. It can also be an edge device, intelligent robot (e.g., a box carrying a chip with processing capabilities), etc. Among them, intelligent robots include inspection robots, interactive robots, etc. The application scenario of the vehicle terminal is autonomous driving, and the aforementioned server is a computing device in cloud services.

[0091] As one possible embodiment, the inference device 110 can be different processors deployed on different physical devices (such as servers or servers in a cluster). For example, the inference device 110 can be a graphics processing unit (GPU), a central processing unit (CPU), a neural network processing unit (NPU), other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and other acceleration cards. The general-purpose processor can be a microprocessor or any conventional processor.

[0092] Database 120 is used to store training data or inference data.

[0093] Data acquisition device 140 is used to collect inference data or training data and store the training data or inference data in database 120. Data acquisition device 140 and inference device 110 can be the same or different devices. Training data or inference data includes data in at least one modality, such as video, image, speech, and text. For example, inference data includes text, text and video, text and image, or text and speech.

[0094] It should be noted that in practical applications, the training data or inference data maintained in database 120 may not all come from data acquisition device 140, but may also be received from other devices.

[0095] In some embodiments, the inference device 110 may deploy model 101 on itself and use model 101 to perform inference to obtain logits. Further, the inference device 110 post-processes the logits to obtain a target token, and then converts the target token into text in a vocabulary. For example, the inference device 110 performs various types of multimodal inference tasks, such as engaging in dialogue with the user based on user input text/speech, processing images, and performing speech processing.

[0096] In other embodiments, the inference device 110 can configure the model 101 on other computing devices to perform inference on other devices and obtain logits. The inference device 110 performs post-processing on the logits to obtain the target token, and then converts the target token into text in the vocabulary.

[0097] In one possible scenario, the step of converting the target token into text in the vocabulary can be performed by other computing devices, i.e., the inference device 110 only performs post-processing.

[0098] Furthermore, based on the functions performed by the inference device 110, the inference device 110 can be further subdivided, as shown in Figure 1. The inference device 110 is configured with a computing module 111, an I/O interface 112, and a preprocessing module 113. In one possible example, the computing module 111 can also be referred to as the inference engine, which is used to process the inference data using the model 101 to obtain the target token.

[0099] I/O interface 112 is used for data interaction with external devices. Users can send inference data to I/O interface 112 through terminal device 130. In addition, inference data can also come from database 120.

[0100] The preprocessing module 113 is used to preprocess the inference data received from the I/O interface 112. For example, the preprocessing module 113 can be used to improve the clarity of the speech or image received from the I/O interface 112, or the preprocessing module 113 can convert the received text into a token according to the vocabulary, and then input the token into the model 101 to obtain the logits of this inference.

[0101] During the preprocessing of inference data by the inference device 110, or during the computation and other related processing performed by the computing module 111 of the inference device 110, the inference device 110 can call data, code, etc. in the data storage system 150 for corresponding processing, or store the data and instructions obtained from the corresponding processing into the database 120.

[0102] For example, after receiving inference data, the inference device 110 encodes the inference data using the preprocessing module 113 to obtain the token corresponding to the inference data. The computing module 111 processes the aforementioned token using the model 101 to obtain the logits for this inference.

[0103] Furthermore, the computing module 111 performs post-processing on the above logits to obtain the corresponding token.

[0104] Furthermore, the computing module 111 determines the text corresponding to the token from the vocabulary.

[0105] Finally, I/O interface 112 returns the token or text to terminal device 130, thereby providing it to the user so that the user can view the token or text.

[0106] It is worth noting that the terminal device 130 and the inference device 110 can also be the same physical device.

[0107] In the scenario shown in Figure 1, the user can manually provide inference data, which can be done through the interface provided by I/O interface 112. Alternatively, the terminal device 130 can automatically send inference data to I/O interface 112. If user authorization is required for the terminal device 130 to automatically send inference data, the user can set the corresponding permissions in the terminal device 130. The user can view the inference results (token and/or text) output by the inference device 110 on the terminal device 130, which can be presented in various ways such as display, sound, or animation. The terminal device 130 can also act as a data acquisition terminal, collecting the inference data input to I/O interface 112 and the inference results output to I/O interface 112 as new training data and storing them in database 120. Alternatively, data can be collected without going through the terminal device 130; instead, I/O interface 112 can collect the inference data input to I/O interface 112 and the inference results output to I/O interface 112 as new training data and store them in database 120.

[0108] Figure 1 is merely a schematic diagram of the architecture of a computer system provided in an embodiment of this application. The positional relationships and quantities of the devices, components, modules, etc., shown in Figure 1 do not constitute any limitation. For example, Figure 1 may also include more computing modules 111 or more inference devices 110. When the computer system includes more inference devices 110, each of the plurality of inference devices 110 can execute at least one network layer in the inference model, or execute at least one sub-step of a plurality of sub-steps included in post-processing. The sub-steps include: filtering logits, normalizing, sampling, etc.

[0109] Based on the content shown in Figure 1, the overall process of model inference is described below. Figure 2 is a schematic diagram of a model inference process provided in this application. In this embodiment, the inference device 110 performs model inference, and logits 1 is taken as an example of the first logits. The content shown in Figure 2 includes the following steps ①-⑥.

[0110] Step ①: The inference device 110 obtains the inference request.

[0111] The inference request contains inference data.

[0112] In one possible scenario, the inference data includes text.

[0113] In another possible scenario, the inference data includes at least two of multiple modalities such as text, video, images, and voice.

[0114] In one possible implementation, the inference device 110 acquires an inference request, including: the inference device 110 acquires an inference request input by a user through a command-line tool.

[0115] In another possible implementation, the inference device 110 obtains the inference request as follows: the terminal device 130 accesses the inference device 110 through the application programming interface (API) provided by the inference device 110 and displays a user interface. The user then triggers control components on the user interface through input devices (keyboard, mouse, touchscreen, etc.) connected to the terminal device 130. Based on the triggering operation, the terminal device 130 obtains the inference request configured by the user and sends the inference request to the inference device 110.

[0116] Two possible examples are provided below for the specific implementation of the triggering operation.

[0117] Example 1: The trigger operation can be a user's confirmation of the control component via the keyboard. For example, the user presses the Enter key to confirm the content of the user-configured inference request, such as the specific inference data.

[0118] Example 2: The trigger operation can be a user clicking on the control component with the mouse.

[0119] For example, the user interface displays multiple control components corresponding to candidate options. The user can click on the aforementioned control components with the mouse to determine whether the candidate option corresponding to the clicked control component is inference data, etc.

[0120] For example, the multiple options mentioned above could be text, images, videos, or audio from the inference data. After the user selects text as the option, they can enter text in the text box to obtain the inference data.

[0121] Step ②: The inference device 110 encodes the inference data to obtain the data to be processed.

[0122] In one possible scenario, the data to be processed includes a token to be processed or a vector to be processed.

[0123] In one possible implementation, if the inference data includes text, the inference device 110 encodes the inference data to obtain a token to be processed, including: the inference device 110 determines the token to be processed corresponding to the text from a vocabulary indicating the mapping relationship between the token and the text.

[0124] In one possible implementation, if the inference data includes non-textual modal data such as video, images, and voice, the inference device 110 encodes the inference data to obtain a vector to be processed, including: the inference device 110 inputs the non-textual modal data into a preprocessing model to obtain the vector to be processed.

[0125] The aforementioned preprocessing model is used to produce embedded representations of non-textual modal data such as video, images, and audio. For example, the preprocessing model can be a transformer, a residual network (ResNet), a 3D convolutional neural network (3D-CNN), or a Visual Geometry Group (VGG) network. VGG is a type of deep convolutional neural network.

[0126] Step ③: The inference device 110 uses model 101 to process the data to be processed and obtain logits.

[0127] For example, the inference device 110 inputs the data to be processed into the model 101 and obtains logits.

[0128] In one possible scenario, model 101 is used to perform various types of multimodal inference tasks, such as dialogue with the user, image processing, and speech processing, based on at least one of the user's input: video, image, speech, and text. This inference task can be a multimodal understanding task, a text generation task, etc., and the final output of the inference task is text.

[0129] For example, Model 101 is any one or more of the following: transformer, BERT, convolutional neural network, generative adversarial network, etc.

[0130] In one possible implementation, if the data to be processed includes a token or a vector to be processed, the inference device 110 inputs the data to be processed into the model 101 and outputs logits, including: the inference device 110 inputs the token or vector to be processed into the model 101 and outputs logits.

[0131] In one possible implementation, if the data to be processed includes a token to be processed and a vector to be processed, the inference device 110 inputs the data to be processed into the model 101 and outputs logits, including: the inference device 110 inputs the token to be processed into the embedding layer in the model 101 to obtain the vector corresponding to the token to be processed, and then merges the vector corresponding to the token to be processed with the vector to be processed (such as concatenation, fusion, etc.), and then processes it using the subsequent network layers in the model 101 to output logits.
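The merging described in [0131] can be sketched as follows for the concatenation case. This is only an illustration: the 2-dimensional embedding table, the function name `merge_inputs`, and the choice to place the modality vectors before the token vectors are assumptions, and the text equally allows other forms of merging such as fusion.

```python
# Hypothetical 2-dimensional embedding table: token id -> vector.
EMBEDDING = {0: [0.1, 0.2], 1: [0.3, 0.4]}


def merge_inputs(token_ids, modality_vectors, embedding=EMBEDDING):
    """Embed the tokens, then concatenate along the sequence axis."""
    token_vectors = [embedding[token] for token in token_ids]
    # Concatenation order (modality first) is an assumption of this sketch.
    return modality_vectors + token_vectors


# One precomputed vector (e.g. from an image) followed by two text tokens.
sequence = merge_inputs([0, 1], [[0.9, 0.8]])
```

The resulting sequence of vectors is what the subsequent network layers of the model would consume.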

[0132] Step ④: The inference device 110 performs post-processing on the logits to obtain the target token.

[0133] In one possible implementation, the inference device 110 performs post-processing on logits 1 to obtain the target token, including: the inference device 110 filters the top K elements in logits 1 in descending order, and then normalizes the K elements to obtain the first confidence scores of the K tokens. Based on the first confidence scores of the K tokens, the inference device 110 determines the first target token from the K tokens.

[0134] For details on step ④, please refer to Figures 3-5 below, which will not be repeated here.

[0135] Step ⑤: The inference device 110 decodes the target token to obtain the text.

[0136] In one possible implementation, the inference device 110 decodes the target token to obtain text, including: the inference device 110 determines the text corresponding to the target token from a vocabulary indicating the mapping relationship between tokens and text.
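Steps ② and ⑤ both rely on the same vocabulary mapping between tokens and text, applied in opposite directions. A minimal sketch follows; the three-entry vocabulary, the whitespace splitting, and the `<unk>` fallback are assumptions chosen for illustration, and real tokenizers are considerably more involved.

```python
# Hypothetical toy vocabulary: the list position is the token id.
VOCAB = ["<unk>", "hello", "world"]
TEXT_TO_TOKEN = {text: token for token, text in enumerate(VOCAB)}


def encode(text):
    """Step 2 for text input: look up each whitespace-separated piece."""
    # Unknown pieces fall back to token 0 ("<unk>"), an assumption of this sketch.
    return [TEXT_TO_TOKEN.get(piece, 0) for piece in text.split()]


def decode(tokens):
    """Step 5: map each token back to its vocabulary entry."""
    return " ".join(VOCAB[token] for token in tokens)


tokens = encode("hello world")
text = decode(tokens)
```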

[0137] Step ⑥: The inference device 110 outputs text.

[0138] In one possible implementation, the inference device 110 outputs text, including: the inference device 110 displays text.

[0139] In another possible implementation, the inference device 110 outputs text, including: the inference device 110 sends text to the terminal device 130, and then the terminal device 130 displays the text on the user interface.

[0140] It is worth noting that steps ③-⑥ in Figure 2 constitute only one round of inference performed by the inference device 110 after obtaining the inference request. In practice, the inference device 110 performs multiple rounds of inference for a single inference request.

[0141] In one possible scenario, the text derived in an earlier iteration will be used as input for subsequent iterations. For example, the inference device 110 may encode the text derived in an earlier iteration and then input it into the model 101 along with the token to be processed.

[0142] In another possible scenario, the target token inferred in an earlier round will be used as input for inference in a later round. For example, inference device 110 inputs the target token inferred in an earlier round and the token to be processed into model 101 together.
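The multi-round behaviour of [0140] to [0142] can be sketched as a loop in which each target token is appended to the input of the next round. The stopping rule (`max_rounds`, `eos_token`), the greedy selection, and `toy_model` are hypothetical stand-ins; the text does not specify how the rounds terminate.

```python
def generate(model, prompt_tokens, max_rounds, eos_token):
    """Run several inference rounds, feeding each target token back as input."""
    tokens = list(prompt_tokens)
    for _ in range(max_rounds):
        logits = model(tokens)  # one forward pass of the model
        # Greedy stand-in for post-processing: index of the largest logit.
        target = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(target)  # reused as input in the next round
        if target == eos_token:
            break
    return tokens


def toy_model(tokens):
    # Hypothetical stand-in for model 101: favours (last token + 1) mod 4.
    logits = [0.0, 0.0, 0.0, 0.0]
    logits[(tokens[-1] + 1) % 4] = 1.0
    return logits


output = generate(toy_model, [0], max_rounds=5, eos_token=3)
```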

[0143] The following content, based on the content shown in Figure 1, provides a detailed description of step ④ in Figure 2. Figure 3 is a flowchart illustrating a model post-processing method provided in this application. The method shown in Figure 3 is executed by the inference device 110. Logits 1 can be referred to as the first logits, inference request a as the first inference request, confidence level A as the first confidence level, and target token a as the first target token. As shown in Figure 3, the model post-processing method provided in this embodiment includes the following steps S310-S340.

[0144] S310: The inference device 110 obtains logits 1.

[0145] Logits 1 is obtained by model 101 based on the inference data in inference request a.

[0146] Each element in logits 1 represents the "confidence" of model 101 in the token corresponding to the index of that element in logits 1, that is, the confidence level of the token.

[0147] In one possible scenario, the indices of elements in logits 1 are mapped to tokens. For example, the index of element a indicates that element a is the first element in logits 1 (from left to right), and the index of element a corresponds to 123 in the vocabulary; the index of element b indicates that element b is the second element in logits 1, and the index of element b corresponds to 22 in the vocabulary; the index of element c indicates that element c is the third element in logits 1, and the index of element c corresponds to 43 in the vocabulary, and so on.

[0148] In one possible scenario, the dimension of logits 1 is the same as the number of strings included in the vocabulary. For example, if the dimension of logits 1 is 32001, the vocabulary has 32001 strings (such as 123, 22, and 43 mentioned above). This application does not limit the dimension of logits 1.

[0149] As shown in Figure 3, the elements from position 0 to position 32000 in logits 1 are: 0.05, 0.01, 0.1, 0.4, 0.01, 0.05, ..., 0.02, 0.2, 0.001.

[0150] In one possible implementation, the inference device 110 acquires logits 1, including: the inference device 110 acquires the logits 1 output by the model 101.

[0151] S320: The inference device 110 filters the top K elements in logits 1 in descending order.

[0152] Wherein, K is greater than or equal to 1. The specific value of K is configured by the user, or is a default value (such as 5 or 6, etc., which is not limited in this application).

[0153] In one possible example, the specific value of K mentioned above is carried in the inference request a.

[0154] In one possible implementation, the inference device 110 filters the top K elements in logits 1 in descending order, including: the inference device 110 performs Top-k sampling to filter the top K elements in logits 1 in descending order, and then prunes the K elements from logits 1.

[0155] In one possible example, the inference device 110 uses the Top-k operator to sort the elements in logits 1 in descending order, thereby selecting the top K elements.

[0156] For example, if K is 6, then the K elements selected by the inference device 110 from logits 1 are: 0.4, 0.2, 0.1, 0.05, 0.05, 0.02.

[0157] In one possible scenario, during the process of filtering out the top K elements, the indices of the K elements will also be filtered out.

[0158] For example, since the index is an attribute of the element, the inference device 110 will carry the index of each of the K elements when filtering the first K elements.

[0159] For example, when the inference device 110 sorts the elements in logits 1, the attributes (such as index) of each element are sorted along with the elements. Thus, the inference device 110 can quickly determine that 0.4 is located in the 3rd position in logits 1, 0.2 is located in the 31999th position in logits 1, 0.1 is located in the 2nd position in logits 1, 0.05 is located in the 5th position in logits 1, 0.05 is located in the 0th position in logits 1, and 0.02 is located in the 31998th position in logits 1.
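The filtering of S320, with each element carrying its index, can be sketched as below. The values at positions 0 to 5 and 31998 to 32000 are those given for logits 1; every elided position is filled with a hypothetical value of 0.0, and the function name `top_k_with_indices` and the use of `heapq.nlargest` in place of a hardware Top-k operator are assumptions.

```python
import heapq

VOCAB_SIZE = 32001
# Values stated in the text; elided positions get a hypothetical filler of 0.0.
logits_1 = [0.0] * VOCAB_SIZE
logits_1[:6] = [0.05, 0.01, 0.1, 0.4, 0.01, 0.05]
logits_1[-3:] = [0.02, 0.2, 0.001]


def top_k_with_indices(logits, k):
    """Return the k largest (index, value) pairs, largest value first."""
    return heapq.nlargest(k, enumerate(logits), key=lambda pair: pair[1])


top = top_k_with_indices(logits_1, 6)
indices = [index for index, _ in top]
values = [value for _, value in top]
```

Here `values` is 0.4, 0.2, 0.1, 0.05, 0.05, 0.02 and `indices` contains 3, 31999, 2, 0, 5 and 31998. The relative order of the two tied 0.05 entries depends on the selection routine, so it should not be relied upon.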

[0160] S330: The inference device 110 normalizes the K elements to obtain the confidence level A of the K tokens.

[0161] In one possible implementation, the inference device 110 normalizes the K elements to obtain the confidence levels A of the K tokens, including: the inference device 110 performs softmax normalization on the K elements to obtain the probability distribution of the K tokens. The probability distribution of the K tokens includes the confidence level A of each of the K tokens.

[0162] In one possible example, the inference device 110 uses the following softmax formula to determine the confidence level A of each of the K tokens:

$$A_i = \frac{e^{x_i}}{\sum_{j=0}^{K-1} e^{x_j}}$$

[0163] Where $x_i$ is element i of the K elements, i takes values from 0 to K-1, $\sum_{j=0}^{K-1} e^{x_j}$ represents the sum of the exponentials of the K elements, and $e^{x_i}$ represents the exponential of element i.

[0164] For example, the inference device 110 uses the above formula to perform softmax normalization on the K elements (0.4, 0.2, 0.1, 0.05, 0.05, 0.02) to obtain the confidence levels A of the K tokens corresponding to the K elements as follows: 0.215, 0.176, 0.159, 0.151, 0.151, 0.147.
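The softmax example in [0164] can be reproduced with a short sketch. Subtracting the maximum before exponentiating is a common numerical-stability step that does not change the result; it is an addition of this sketch rather than something stated in the text, as is the function name `softmax`.

```python
import math


def softmax(values):
    """Softmax over a short list; subtracting the max keeps exp() stable."""
    peak = max(values)
    exps = [math.exp(value - peak) for value in values]
    total = sum(exps)
    return [e / total for e in exps]


k_elements = [0.4, 0.2, 0.1, 0.05, 0.05, 0.02]
confidence_a = [round(p, 3) for p in softmax(k_elements)]
# confidence_a is [0.215, 0.176, 0.159, 0.151, 0.151, 0.147]
```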

[0165] In one possible implementation, the inference device 110 normalizes the K elements to obtain the confidence levels A of the K tokens, including: the inference device 110 performs logsoftmax normalization on the K elements to obtain the log probability distribution of the K tokens. The log probability distribution of the K tokens includes the confidence level A of each of the K tokens.
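A logsoftmax normalization of this kind is usually computed as each element minus the log of the sum of exponentials, which avoids forming the probabilities first. The sketch below assumes that formulation; the function name `log_softmax` and the max-subtraction step are illustrative choices.

```python
import math


def log_softmax(values):
    """log(softmax(values)), computed as x_i - logsumexp(x)."""
    peak = max(values)
    log_total = peak + math.log(sum(math.exp(value - peak) for value in values))
    return [value - log_total for value in values]


log_confidence_a = log_softmax([0.4, 0.2, 0.1, 0.05, 0.05, 0.02])
```

Because the logarithm is monotonic, the log probabilities rank the K tokens in the same order as the softmax probabilities, so either form can drive the selection in S340.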

[0166] In one possible example, the inference device 110 performs logsoftmax normalization on the above K elements (0.4, 0.2, 0.1, 0.05, 0.05, 0.02) to obtain the confidence levels A of the K tokens corresponding to the K elements as follows: -1.5375, -1.7375, -1.8375, -1.8875, -1.8875, -1.9175.

[0167] S340: The inference device 110 determines the target token a corresponding to the inference request a based on the confidence levels A of the K tokens.

[0168] In one possible implementation, the inference device 110 determines the target token a corresponding to the inference request a based on the confidence levels A of the K tokens, including: the inference device 110 determines, from the confidence levels A of the K tokens, the top Q confidence levels A arranged in descending order, and then takes the tokens corresponding to the indices of the elements indicated by the top Q confidence levels A as the target token a.

[0169] The target confidence level is the top Q confidence levels A arranged in descending order, where Q is greater than or equal to 1. Accordingly, the target token a includes Q of the K tokens, where K is greater than or equal to Q.

[0170] In one possible example, the inference device 110 takes the tokens corresponding to the indices of the elements indicated by the top Q confidence levels A as the target token a, which includes: the inference device 110 determines, from the mapping relationship between element indices in logits 1 and tokens, the target token a corresponding to the indices of the elements indicated by the top Q confidence levels A.

[0171] The inference device 110 carries the indices of the K elements through both the filtering of the K elements and the normalization that yields the confidence levels A of the K tokens. In other words, the inference device 110 establishes a mapping relationship between the indices of the K elements and the confidence levels A of the K tokens. Using this mapping relationship together with the mapping relationship between element indices in logits 1 and tokens, the inference device 110 can accurately determine the target token a corresponding to the indices of the elements indicated by the top Q confidence levels A.

[0172] Put differently, the confidence level A determined by the inference device 110 also carries the index attribute. Based on the index, a mapping relationship can be established among the confidence level A, the element, and the token. The inference device 110 can then determine the target token a corresponding to the indices of the elements indicated by the top Q confidence levels A.

[0173] Taking the maximum value among the K elements as an example, the maximum value is 0.4. This maximum value is located at the 3rd position in logits 1 (i.e., the index is the 3rd position in logits 1). The confidence level A of the token corresponding to the maximum value is 0.215 (softmax) or -1.5375 (logsoftmax). Since this is the largest of the confidence levels A of the K tokens, the top Q confidence levels A arranged in descending order include 0.215 or -1.5375. Based on the mapping relationship between the indices of the K elements and the confidence levels A of the K tokens, the inference device 110 determines that the index of the element corresponding to the confidence level A of 0.215 or -1.5375 is the 3rd position in logits 1. Then, from the mapping relationship between element indices and tokens, it determines that the token corresponding to the 3rd position in logits 1 (token 3) is one of the target tokens a.
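The selection of the top Q confidence levels A and the lookup back to tokens can be sketched as follows. The indices and softmax confidence levels are those of the worked example; the vocabulary, which simply names each token after its position, and the function name `select_top_q` are hypothetical.

```python
def select_top_q(indices, confidences, vocab, q):
    """Pick the q most confident candidates and map their indices to tokens."""
    ranked = sorted(zip(indices, confidences), key=lambda pair: pair[1], reverse=True)
    return [vocab[index] for index, _ in ranked[:q]]


# Positions in logits 1 of the K elements and their confidence levels A.
indices = [3, 31999, 2, 5, 0, 31998]
confidence_a = [0.215, 0.176, 0.159, 0.151, 0.151, 0.147]
# Hypothetical vocabulary: position in logits 1 -> token name.
vocab = {index: f"token {index}" for index in indices}

target_tokens = select_top_q(indices, confidence_a, vocab, q=1)
```

With Q equal to 1 the result is the single entry `token 3`, matching the example in [0173]; a larger Q returns further tokens in descending order of confidence.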

[0174] In another possible implementation, the inference device 110 determines the target token a corresponding to the inference request a based on the confidence levels A of the K tokens. This includes: the inference device 110 randomly samples the confidence levels A of the K tokens to obtain the target confidence level among the confidence levels A of the K tokens, and then determines the target token a corresponding to the index of the element indicated by the target confidence level from the mapping relationship between the indexes of elements in logits and tokens. Here, the target confidence level includes one or more confidence levels from the confidence levels A of the K tokens, and the target token a includes one or more tokens from the K tokens.
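One way to realize the random sampling of [0174] is to draw an index with probability proportional to its confidence level A. The text does not state the sampling distribution, so the weighting below is an assumption, as are the function name `sample_token` and the fixed seed. The weights must be non-negative, so this sketch applies to softmax confidence levels rather than log probabilities.

```python
import random


def sample_token(indices, confidences, rng):
    """Draw one logits index with probability proportional to its confidence."""
    return rng.choices(indices, weights=confidences, k=1)[0]


indices = [3, 31999, 2, 5, 0, 31998]
confidence_a = [0.215, 0.176, 0.159, 0.151, 0.151, 0.147]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
draws = [sample_token(indices, confidence_a, rng) for _ in range(1000)]
```

Every draw is one of the K indices, so the subsequent lookup from index to token touches only the K candidates.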

[0175] For a detailed description of how the inference device 110 determines, from the mapping relationship between element indices in logits and tokens, the target token a corresponding to the index of the element indicated by the target confidence level, refer to the above embodiment. Details are not repeated here.

[0176] Because the amount of data that the inference device needs to process during post-processing is reduced (e.g., operations that would otherwise cover the entire logits are applied to only K elements), the method shown in this application does not place high requirements on the computational performance of the inference device, so post-processing can be implemented even on inference devices with limited computational performance.

[0177] The following content, based on the content shown in Figure 1, provides a detailed description of step ④ in Figure 2. The difference between the following embodiment and the content shown in Figure 3 is that this embodiment further filters out P elements from the K elements, and then performs subsequent processing on the P elements (normalization, determining the first target token, etc.), reducing the amount of data processed and thus improving inference efficiency. In this embodiment, confidence level B can be called the second confidence level, threshold a can be called the first threshold, and threshold b can be called the second threshold. The content of this embodiment is as follows:

[0178] The inference device 110 obtains logits 1, filters the top K elements in descending order from logits 1, and then normalizes these K elements to obtain the confidence scores A of the K tokens. Based on the confidence scores A of the K tokens, the inference device 110 filters P elements from the K elements and normalizes these P elements to obtain the confidence scores B of the P tokens. Then, based on the confidence scores B of the P tokens, the inference device 110 determines the target token a corresponding to the inference request a.

[0179] For how the inference device 110 obtains logits 1, filters the top K elements in logits 1 in descending order, and normalizes the K elements to obtain the confidence scores A of the K tokens, refer to the description of S310-S330 in Figure 3 above, which will not be repeated here.

[0180] In one possible implementation, the inference device 110 filters P elements from the K elements based on the confidence level A of the K tokens, including: the inference device 110 determines the confidence level A greater than or equal to the threshold a from the confidence level A of the K tokens, and then prunes the P elements corresponding to the confidence level A greater than or equal to the threshold a from the K elements.

[0181] That is, the confidence level A of the tokens corresponding to the P elements is greater than or equal to the threshold a.

[0182] For example, the threshold a is M times the maximum confidence score A among the K tokens, where 0 < M < 1.

[0183] For example, if M is 0.7, then the threshold a is 0.215 * 0.7, which equals 0.1505. If the confidence levels A of the K tokens are 0.215, 0.176, 0.159, 0.151, 0.151, and 0.147, then the inference device 110 determines that the confidence levels A greater than or equal to the threshold a are 0.215, 0.176, 0.159, 0.151, and 0.151. The inference device 110 determines that the elements corresponding to confidence levels A of 0.215, 0.176, 0.159, 0.151, and 0.151 are in the 0th, 1st, 2nd, 3rd, and 4th positions among the K elements. Furthermore, the inference device 110 prunes the elements in the 0th, 1st, 2nd, 3rd, and 4th positions among the K elements, obtaining P elements, namely 0.4, 0.2, 0.1, 0.05, and 0.05.
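The threshold-a filtering of [0180] to [0183], followed by the second normalization, can be sketched as below. The function names are hypothetical, and using softmax for the second normalization is an assumption based on the normalization described for S330.

```python
import math


def filter_by_relative_threshold(elements, confidences, m):
    """Keep elements whose confidence A is at least m times the largest one."""
    threshold_a = m * max(confidences)
    return [e for e, c in zip(elements, confidences) if c >= threshold_a]


def softmax(values):
    exps = [math.exp(value) for value in values]
    total = sum(exps)
    return [e / total for e in exps]


k_elements = [0.4, 0.2, 0.1, 0.05, 0.05, 0.02]
confidence_a = [0.215, 0.176, 0.159, 0.151, 0.151, 0.147]
p_elements = filter_by_relative_threshold(k_elements, confidence_a, m=0.7)
confidence_b = softmax(p_elements)  # renormalize over the P survivors
```

With M equal to 0.7 the threshold is 0.1505, so the element 0.02 (confidence 0.147) is dropped and the remaining five elements are renormalized to give the confidence levels B.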

[0184] For example, the threshold 'a' is configured by the user.

[0185] In one possible implementation, the inference device 110 selects P elements from the K elements based on the confidence A of the K tokens, including:

[0186] The inference device 110 sorts the confidence scores A of the K tokens in descending order, determines the P elements corresponding to the confidence scores A of the top P tokens, and then prunes these P elements from the K elements. The sum of the confidence scores A of the tokens corresponding to these P elements is greater than or equal to a threshold b.

[0187] For example, the threshold b is configured by the user. For instance, the threshold b is carried in the inference request a.

[0188] For example, if the threshold b is 0.3, and the confidence levels A of the K tokens are 0.215, 0.176, 0.159, 0.151, 0.151, and 0.147, then the inference device 110 determines that the sum of the confidence levels A (0.215, 0.176) of the tokens corresponding to the first P elements is 0.391, which is greater than the threshold b. The inference device 110 therefore determines that the elements with confidence levels A of 0.215 and 0.176 are in the 0th and 1st positions, respectively, among the K elements. The inference device 110 then extracts the elements in the 0th and 1st positions from the K elements to obtain the P elements, namely 0.4 and 0.2.
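The cumulative-threshold selection in this example can be sketched as below. The sketch assumes that the shortest qualifying prefix is chosen; the function name is hypothetical, and the sixth logit (0.02) is an assumed value that does not appear in the text.

```python
def smallest_prefix_reaching(conf_desc, threshold_b):
    """Return the smallest P whose first P confidences sum to >= threshold_b."""
    running = 0.0
    for p, c in enumerate(conf_desc, start=1):
        running += c
        if running >= threshold_b:
            return p
    return len(conf_desc)  # threshold never reached: keep all K elements

conf_a = [0.215, 0.176, 0.159, 0.151, 0.151, 0.147]  # already in descending order
top_k = [0.4, 0.2, 0.1, 0.05, 0.05, 0.02]            # last value is an assumption

p = smallest_prefix_reaching(conf_a, threshold_b=0.3)
p_elements = top_k[:p]
print(p, p_elements)
```

For the quoted values the first confidence level alone (0.215) is below 0.3, while the first two sum to 0.391, so the shortest qualifying prefix contains the elements 0.4 and 0.2.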

[0189] The inference device 110 normalizes the P elements to obtain the confidence levels B of the P tokens, and determines the target token a corresponding to the inference request a based on the confidence levels B of the P tokens. For details, refer to the descriptions of S330 and S340 in Figure 3 above or of S450 and S460 in Figure 4 below; they are not repeated here.

[0190] It is worth noting that when multiple values of P satisfy the condition that the sum of the confidence levels A of the tokens corresponding to the P elements is greater than or equal to the threshold b, the inference device 110 takes the minimum of these values as P. For example, in the example above, P equal to 2, 3, or 4 each satisfies the condition that the sum of the confidence levels A of the tokens corresponding to the P elements is greater than or equal to the threshold b. Therefore, the inference device 110 takes 2 as the value of P.

[0191] It is worth noting that the above content is merely an example and should not be construed as limiting this application. In other embodiments of this application, P may also be 4. In other words, when multiple values of P satisfy the condition that the sum of the confidence levels A of the tokens corresponding to the P elements is greater than or equal to the threshold b, P can be any of these values.

[0192] The following elaborates on step ④ in Figure 2, based on the content shown in Figure 1. Figure 4 is a flowchart of a model post-processing method provided in this application. Figure 4 differs from Figure 3 in that the inference device 110 further selects N elements from the K elements and sets these N elements to negative infinity (-inf). The inference device 110 then normalizes the K elements, including the N negative infinity values, to obtain the confidence levels C of the K tokens, and determines the first target token based on the confidence levels C of the K tokens. In this embodiment, the confidence level C may be referred to as the third confidence level, and the model post-processing method includes the following steps S410-S460.

[0193] S410, inference device 110 obtains logits 1.

[0194] S420, inference device 110 filters out the top K elements of logits 1, arranged in descending order.

[0195] For a detailed description of S410 and S420, please refer to the content of S310 and S320 above, which will not be repeated here.

[0196] S430, the inference device 110 normalizes the K elements to obtain the confidence levels A of the K tokens.

[0197] For a detailed description of S430, please refer to the content of S330 above, which will not be repeated here.

[0198] S440, the inference device 110 sets N elements out of the K elements to negative infinity based on the confidence level A of the K tokens.

[0199] Where N is greater than or equal to 0, and N is less than K.

[0200] In one possible implementation, the inference device 110 sets N elements out of the K elements to negative infinity based on the confidence A of the K tokens, including: the inference device 110 determines the confidence A less than a threshold a from the confidence A of the K tokens, and then sets the N elements of the tokens whose confidence A is less than the threshold a to negative infinity.

[0201] That is, the confidence level A of the tokens corresponding to N elements is less than the threshold a.

[0202] For example, if M is 0.7, then the threshold a is 0.215 * 0.7, which equals 0.1505. If the confidence levels A of the K tokens are 0.215, 0.176, 0.159, 0.151, 0.151, and 0.147, then the inference device 110 determines that the confidence level A less than the threshold a is 0.147. The inference device 110 determines that the element corresponding to the confidence level A of 0.147 is the 5th element among the K elements, and then sets the 5th element among the K elements to a negative infinity value, thus obtaining the K elements including N negative infinity values: 0.4, 0.2, 0.1, 0.05, 0.05, and -inf.
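A minimal sketch of this masking step is shown below. Unlike the selection variant, the list keeps its length K and the rejected positions are overwritten with negative infinity. The function name is hypothetical and the sixth logit (0.02) is an assumed value.

```python
import math

def mask_below_relative_threshold(top_k, conf_a, m_ratio):
    """Replace elements whose confidence is < m_ratio * max(conf_a) with -inf."""
    threshold_a = m_ratio * max(conf_a)
    return [x if c >= threshold_a else -math.inf
            for x, c in zip(top_k, conf_a)]

conf_a = [0.215, 0.176, 0.159, 0.151, 0.151, 0.147]
top_k = [0.4, 0.2, 0.1, 0.05, 0.05, 0.02]  # last value is an assumption

masked = mask_below_relative_threshold(top_k, conf_a, m_ratio=0.7)
print(masked)
```

Keeping the length fixed means every request in a batch retains the same shape, which is one plausible reason to mask rather than remove elements.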

[0203] In another possible implementation, the inference device 110 sets N elements out of the K elements to negative infinity based on the confidence levels A of the K tokens as follows: the inference device 110 sorts the confidence levels A of the K tokens in descending order, determines the last N confidence levels A in the sorted order, and sets the N elements corresponding to them to negative infinity. The sum of the confidence levels A of the tokens corresponding to these N elements is less than a threshold b.

[0204] For example, if the threshold b is 0.3, and the confidence levels A of the K tokens are 0.215, 0.176, 0.159, 0.151, 0.151, and 0.147, then the inference device 110 determines that the sum of the confidence levels A (0.151, 0.147) of the tokens corresponding to the last N elements is 0.298, which is less than the threshold b. Therefore, the inference device 110 determines that the elements corresponding to confidence levels A of 0.151 and 0.147 are in the 4th and 5th positions, respectively, among the K elements, and then sets the elements in the 4th and 5th positions among the K elements to negative infinity, thus obtaining the K elements including N negative infinity values: 0.4, 0.2, 0.1, 0.05, -inf, -inf.

[0205] It is worth noting that when multiple values of N satisfy the condition that the sum of the confidence levels A of the tokens corresponding to the N elements is less than the threshold b, the inference device 110 takes the maximum of these values as N. For example, in the example above, N equal to 1 or 2 satisfies the condition that the sum of the confidence levels A of the tokens corresponding to the N elements is less than the threshold b. Therefore, the inference device 110 takes 2 as the value of N.
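The rule of taking the largest qualifying N can be sketched as follows, assuming the confidence levels are already sorted in descending order. The helper names are hypothetical and the sixth logit (0.02) is an assumed value.

```python
import math

def largest_suffix_below(conf_desc, threshold_b):
    """Largest N (with N < K) whose last N confidences sum to < threshold_b."""
    running, n = 0.0, 0
    for c in reversed(conf_desc):
        if running + c >= threshold_b:
            break
        running += c
        n += 1
    return min(n, len(conf_desc) - 1)  # the text requires N < K

def mask_tail(top_k, n):
    """Set the last n of the K elements to negative infinity."""
    keep = len(top_k) - n
    return top_k[:keep] + [-math.inf] * n

conf_a = [0.215, 0.176, 0.159, 0.151, 0.151, 0.147]
top_k = [0.4, 0.2, 0.1, 0.05, 0.05, 0.02]  # last value is an assumption

n = largest_suffix_below(conf_a, threshold_b=0.3)
masked = mask_tail(top_k, n)
print(n, masked)
```

Since confidence levels are non-negative, the suffix sums grow monotonically, so scanning from the tail and stopping at the first violation finds the maximum N.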

[0206] It is worth noting that the above content is merely an example and should not be construed as limiting this application. In other embodiments of this application, N may also be 1. In other words, when multiple values of N satisfy the condition that the sum of the confidence levels A of the tokens corresponding to the N elements is less than the threshold b, N can be any of these values.

[0207] It is worth noting that when N is 0, step S440 is a redundant step, and the content shown in Figure 4 is generally consistent with the content shown in Figure 3.

[0208] S450, the inference device 110 normalizes the K elements including the N negative infinity values to obtain the confidence levels C of the K tokens.

[0209] For example, the inference device 110 performs softmax normalization on 0.4, 0.2, 0.1, 0.05, 0.05, and -inf, resulting in confidence levels C for the K tokens as: 0.252, 0.206, 0.186, 0.177, 0.177, and 0.

[0210] For example, the inference device 110 performs logsoftmax normalization on 0.4, 0.2, 0.1, 0.05, 0.05, and -inf, resulting in confidence levels C for the K tokens as: -1.3735, -1.5735, -1.6735, -1.7235, -1.7235, and -inf.
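The two normalizations in the examples above can be sketched with the standard log-sum-exp formulation. This is an illustrative sketch, not the device's actual kernel; it relies on the condition N < K, so that at least one element is finite.

```python
import math

def log_softmax(xs):
    """Log-softmax that tolerates -inf entries (at least one entry must be finite)."""
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))  # exp(-inf) evaluates to 0.0
    return [x - lse for x in xs]

def softmax(xs):
    """Softmax obtained by exponentiating the log-softmax."""
    return [math.exp(v) for v in log_softmax(xs)]

masked = [0.4, 0.2, 0.1, 0.05, 0.05, -math.inf]
conf_c = softmax(masked)
log_conf_c = log_softmax(masked)
print([round(c, 3) for c in conf_c])
print([round(c, 4) for c in log_conf_c])
```

The masked position receives probability 0 under softmax and -inf under logsoftmax, so it can never be selected as the target token. The computed values agree with the figures quoted above to within about 0.01.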

[0211] For a detailed description of step S450, please refer to the content of S330 above, which will not be repeated here.

[0212] S460, the inference device 110 determines the target token a corresponding to the inference request a based on the confidence C of the K tokens.

[0213] For a detailed description of step S460, please refer to the content of S340 above, which will not be repeated here.

[0214] The following content, based on the content shown in Figure 1, elaborates on step ④ in Figure 2. Figure 5 is a flowchart illustrating a model post-processing method provided in this application. The difference between Figure 5 and Figure 3 is that this embodiment performs parallel inference on multiple inference requests. In this embodiment, logits 2 can be referred to as the second logits, inference request b can be referred to as the second inference request, confidence level D can be referred to as the fourth confidence level, target token b can be referred to as the second target token, and the model post-processing method includes the following steps S510-S540.

[0215] S510, the inference device 110 obtains logits 1 and logits 2.

[0216] Here, logits 2 is obtained by model 101 based on inference request b, and model 101 performs parallel inference on inference request a and inference request b.

[0217] The dimensions of logits 1 and logits 2 are the same; for example, both logits 1 and logits 2 contain 32001 elements.

[0218] For a detailed description of logits 2, please refer to the description of logits 1 above, and it will not be repeated here. For a detailed description of step S510, please refer to the content of S310 above, and it will not be repeated here.

[0219] S520, the inference device 110 filters out the top K elements of logits 1 in descending order and the top L elements of logits 2 in descending order.

[0220] Where L is greater than or equal to 1, and L is less than K. The specific value of L is configured by the user, or it is a default value.

[0221] In one possible example, the specific value of L mentioned above is carried in the inference request b.

[0222] Since model 101 performs parallel inference for inference request a and inference request b, the inference device 110 requires the data sizes within the same batch to be consistent. For example, after filtering, batch 1 for inference request a includes K elements, whereas batch 2 for inference request b includes L elements. Therefore, the inference device 110 needs to pad the top L elements of logits 2, arranged in descending order, to K elements.

[0223] In one possible scenario, the element that the inference device 110 adds during the above padding process is -inf.

[0224] As shown in Figure 5, the elements from position 0 to position 32000 in logits 2 are: 0.1, 0.6, 0.01, 0.2, 0.8, 0.3, ..., 0.9, 0.01, 0.5. If L is 5, then the top L elements of logits 2 arranged in descending order are: 0.9, 0.8, 0.6, 0.5, 0.3. The inference device 110 pads these L elements to O elements: 0.9, 0.8, 0.6, 0.5, 0.3, -inf, where O equals K.
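The selection and padding in this example can be sketched as below. To keep the sketch small, the elided middle of the 32001-element vector is dropped, so the last three values sit at positions 6 to 8 instead of 31998 to 32000; the function names are hypothetical.

```python
import math

def top_n_with_indices(logits, n):
    """Return the n largest values in descending order and their original positions."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:n]
    return [logits[i] for i in order], order

def pad_to_length(values, target_len, fill=-math.inf):
    """Right-pad values with fill so that the row has target_len entries."""
    return values + [fill] * (target_len - len(values))

# Toy stand-in for logits 2 (elided entries omitted).
logits_2 = [0.1, 0.6, 0.01, 0.2, 0.8, 0.3, 0.9, 0.01, 0.5]

top_l, idx = top_n_with_indices(logits_2, n=5)  # L = 5
padded = pad_to_length(top_l, target_len=6)     # O = K = 6
print(top_l, idx, padded)
```

Keeping the original positions alongside the values is what later allows the selected elements, or their confidence levels, to be written back at the right indices of a full-length vector.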

[0225] For a detailed description of step S520, please refer to the content of S320 above, which will not be repeated here.

[0226] S530, the inference device 110 normalizes the K elements to obtain the confidence levels A of the K tokens, and normalizes the O elements to obtain the confidence levels D of the O tokens.

[0227] For example, the inference device 110 performs softmax normalization on the O elements and obtains the confidence levels D of the O tokens as follows: 0.2583, 0.2347, 0.1915, 0.1731, 0.1410, 0.

[0228] For example, the inference device 110 performs logsoftmax normalization on O elements and obtains the confidence levels D of the O tokens as: -1.3513, -1.4513, -1.6513, -1.7513, -1.9513, and -inf.

[0229] For a detailed description of step S530, please refer to the content of S330 above, which will not be repeated here.

[0230] S540, the inference device 110 determines the target token a corresponding to inference request a based on the confidence level A of K tokens, and determines the target token b corresponding to inference request b based on the confidence level D of O tokens.
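Steps S530 and S540 for a batch of two requests can be sketched as follows. The sketch assumes greedy selection (the token with the highest confidence level), which is only one of the strategies that S340 might cover. The vocabulary positions 3 and 31999 for logits 1, and 31998, 4, 1, 32000, and 5 for logits 2, follow the examples in this description; the remaining positions, the sixth logit of request a, and the function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax; -inf entries receive probability 0."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def pick_targets(batch_values, batch_indices):
    """Greedy choice per request: vocabulary index of the most confident token."""
    targets = []
    for values, indices in zip(batch_values, batch_indices):
        conf = softmax(values)
        best = max(range(len(conf)), key=lambda j: conf[j])
        targets.append(indices[best])
    return targets

batch_values = [
    [0.4, 0.2, 0.1, 0.05, 0.05, 0.02],     # request a: K = 6 (last value assumed)
    [0.9, 0.8, 0.6, 0.5, 0.3, -math.inf],  # request b: L = 5 padded to O = 6
]
batch_indices = [
    [3, 31999, 10, 11, 12, 13],            # positions 10-13 are hypothetical
    [31998, 4, 1, 32000, 5, -1],           # -1 marks the padded slot
]

targets = pick_targets(batch_values, batch_indices)
print(targets)
```

Both rows have the same length, so they can be processed as one batch, and the padded slot can never win because its confidence level is 0.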

[0231] Among them, O tokens include the target token b.

[0232] For a detailed description of step S540, please refer to the content of S340 above, which will not be repeated here.

[0233] In one possible embodiment, based on the content shown in Figure 5, the inference device 110 can further filter the top K elements of logits 1 and the top L elements of logits 2, both arranged in descending order, so as to further reduce the computational load of the subsequent steps (normalization and determination of the target token), thereby improving model inference efficiency.

[0234] In one possible example, the inference device 110 performs Top-P sampling on the aforementioned K elements and L elements to determine P1 elements among the K elements and P2 elements among the L elements, and then extracts the P1 elements from the K elements and the P2 elements from the L elements.

[0235] In another possible example, the inference device 110 performs Top-P sampling on the aforementioned K elements and L elements to determine P1 elements among the K elements and P2 elements among the L elements, and then sets all of the K elements other than the P1 elements to zero values and all of the L elements other than the P2 elements to zero values.

[0236] For a detailed description of this embodiment, please refer to the content shown in Figure 4 above, which will not be repeated here.

[0237] Based on the content shown in Figures 2 to 5 above, this application also provides two possible embodiments. These two embodiments illustrate how, during post-processing, the inference device 110 outputs the complete logits or the complete token confidence levels when the user requires them. The embodiments take as an example the output of the complete logits 1 or of the confidence levels of the tokens corresponding to logits 1. For the output, according to the user's requirements, of the complete logits 2 or of the confidence levels of the tokens corresponding to logits 2, refer to the following description.

[0238] In a first possible embodiment, the inference device 110 obtains a first request, which indicates that post-processed logits are to be output, and then outputs the post-processed logits 1. The post-processed logits 1 include the aforementioned K elements, with negative infinity values at all positions other than those of the K elements.

[0239] Regarding the content of the first possible embodiment described above, three possible implementation methods are provided below.

[0240] In a first possible implementation, the inference device 110 obtains a first vector consisting entirely of negative infinity values, and then uses the top K elements of logits 1, arranged in descending order, to replace the negative infinity values in the first vector that have the same indices as the aforementioned K elements, thus obtaining the post-processed logits 1. Finally, the inference device 110 outputs the post-processed logits 1.

[0241] The number of negative infinity values in the first vector is the same as the number of elements in logits 1, for example 32001.

[0242] For example, if K is 3, the index of element 1 is the 1st position in logits 1, the index of element 2 is the 3rd position in logits 1, and the index of element 3 is the 130th position in logits 1. The inference device 110 replaces the negative infinity value at the 1st position in the first vector with element 1, replaces the negative infinity value at the 3rd position in the first vector with element 2, and replaces the negative infinity value at the 130th position in the first vector with element 3.
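The replacement described in this example is a scatter into a vector of negative infinity values. A minimal sketch follows; the element values 0.4, 0.2, and 0.1 and the function name are hypothetical, since the text gives only the positions.

```python
import math

def scatter_into_full(vocab_size, indices, values, fill=-math.inf):
    """Build a full-length vector of fill and write values at the given indices."""
    full = [fill] * vocab_size
    for i, v in zip(indices, values):
        full[i] = v
    return full

# K = 3: elements 1, 2, 3 sit at positions 1, 3, 130 of logits 1.
post_logits_1 = scatter_into_full(32001, indices=[1, 3, 130], values=[0.4, 0.2, 0.1])

finite = [(i, v) for i, v in enumerate(post_logits_1) if v != -math.inf]
print(finite)
```

The output has the same length as logits 1, so a consumer that expects full-vocabulary logits can use it unchanged.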

[0243] In the second possible implementation, the inference device 110 obtains a first vector consisting entirely of negative infinity values, removes the N negative infinity values from the K elements obtained through S440, and obtains K-N elements (that is, P elements). The inference device 110 then replaces the negative infinity values in the first vector that have the same indices as the P elements with the P elements, and obtains and outputs the post-processed logits 1.

[0244] In the third possible implementation, as shown in Figure 6, which is a flowchart of a logits generation method provided by this application, the method includes the following steps S610-S640.

[0245] S610, the inference device 110 acquires first vectors for two batches; each first vector consists entirely of negative infinity values.

[0246] Of these, the first vector of one batch (batch 1) corresponds to logits 1, and the first vector of the other batch (batch 2) corresponds to logits 2.

[0247] S620, the inference device 110 performs Top-P sampling on the K elements in descending order of logits 1, the L elements in descending order of logits 2, and the negative infinity value to obtain I elements in logits 1 and T elements in logits 2.

[0248] In logits 1, the first K elements in descending order include I elements, and in logits 2, the first L elements in descending order include T elements, where I is greater than or equal to 1 and T is greater than or equal to 1.

[0249] For the values of I and T, refer to the descriptions of Figure 4 or Figure 5 above; they are not repeated here. Figure 6 uses I equal to 2 and T equal to 3 as an example for illustration.

[0250] In one possible example, the inference device 110 performs Top-P sampling on the top K elements in descending order of the aforementioned logits 1 to determine I elements (e.g., 0.4, 0.2) from the K elements, and sets the remaining elements in the K elements other than the aforementioned I elements to -inf. The inference device 110 then filters out the -inf elements using a mask to obtain the I elements.

[0251] Furthermore, the inference device 110 performs Top-P sampling on the L elements in descending order of the aforementioned logits 2 and the negative infinity value to determine T elements (e.g., 0.9, 0.8, 0.6) out of the L elements, and sets all elements other than the aforementioned T elements in the L elements to -inf. The inference device 110 filters out -inf using a mask to obtain the T elements.

[0252] In one possible scenario, when the inference device 110 performs Top-P sampling on L elements and negative infinity, the negative infinity will be retained and not included in the T elements.

[0253] S630, the inference device 110 replaces the negative infinity values that have the same indices as the I elements in the first vector corresponding to logits 1 with the I elements of logits 1 to obtain post-processed logits 1, and replaces the negative infinity values that have the same indices as the T elements in the first vector corresponding to logits 2 with the T elements of logits 2 to obtain post-processed logits 2.

[0254] Since the index of 0.4 in the I elements is the 3rd position in logits 1 and the index of 0.2 is the 31999th position in logits 1, the inference device 110 replaces the negative infinity value of the 3rd position in the first vector corresponding to logits 1 with 0.4 and replaces the negative infinity value of the 31999th position in the first vector corresponding to logits 1 with 0.2, thereby obtaining the post-processed logits 1.

[0255] Furthermore, since the index of 0.9 in the T elements is the 31998th position in logits 2, the index of 0.8 is the 4th position in logits 2, and the index of 0.6 is the 1st position in logits 2, the inference device 110 replaces the negative infinity value at the 31998th position in the first vector corresponding to logits 2 with 0.9, replaces the negative infinity value at the 4th position in the first vector corresponding to logits 2 with 0.8, and replaces the negative infinity value at the 1st position in the first vector corresponding to logits 2 with 0.6, thereby obtaining the post-processed logits 2.
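Steps S620 and S630 for the two batches can be sketched as below. The sketch takes I = 2 and T = 3 as given rather than re-deriving them by Top-P sampling. The positions of 0.4, 0.2, 0.9, 0.8, 0.6, 0.5, and 0.3 follow the examples in this description, while positions 10 to 13, the sixth logit of logits 1, and the function name are hypothetical.

```python
import math

def postprocess_row(vocab_size, values, indices, keep):
    """Mask all but the first keep entries to -inf, drop the masked entries,
    and scatter the survivors into a full-length vector of -inf."""
    masked = [v if j < keep else -math.inf for j, v in enumerate(values)]
    alive = [v != -math.inf for v in masked]  # boolean mask filtering out -inf
    full = [-math.inf] * vocab_size           # the "first vector"
    for v, i, ok in zip(masked, indices, alive):
        if ok:
            full[i] = v
    return full

VOCAB = 32001
post_1 = postprocess_row(VOCAB, [0.4, 0.2, 0.1, 0.05, 0.05, 0.02],
                         [3, 31999, 10, 11, 12, 13], keep=2)   # I = 2
post_2 = postprocess_row(VOCAB, [0.9, 0.8, 0.6, 0.5, 0.3, -math.inf],
                         [31998, 4, 1, 32000, 5, -1], keep=3)  # T = 3

print([(i, v) for i, v in enumerate(post_1) if v != -math.inf])
print([(i, v) for i, v in enumerate(post_2) if v != -math.inf])
```

The padded -inf entry of the second row is discarded by the same mask as the elements rejected by sampling, so its placeholder index is never used.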

[0256] S640, inference device 110 outputs post-processed logits 1 and post-processed logits 2.

[0257] In one possible example, inference device 110 displays post-processed logits 1 and post-processed logits 2.

[0258] In another possible example, inference device 110 sends post-processed logits 1 and post-processed logits 2 to terminal device 130, and terminal device 130 displays post-processed logits 1 and post-processed logits 2.

[0259] For example, the inference device 110 can send the post-processed logits 1 and post-processed logits 2 to different terminal devices 130 for display.

[0260] In a second possible embodiment, the inference device 110 obtains a second request, which indicates that the confidence levels of tokens are to be output, and then outputs the confidence levels of the tokens. The output includes the confidence levels A of the K tokens, with zero values or negative infinity values at all positions other than those of the K tokens.

[0261] In one possible scenario, all elements except for the confidence level A of the K tokens are identical, meaning all elements except for the confidence level A of the K tokens are zero, or all elements except for the confidence level A of the K tokens are negative infinity.

[0262] In one possible scenario, the confidence level A of the first token among the K tokens is: the probability value of the first token in the probability distribution or logarithmic probability distribution of the K tokens.

[0263] Regarding the content of the second possible embodiment described above, three possible implementation methods are provided below.

[0264] In a first possible implementation, as shown in Figure 6, the inference device 110 normalizes the post-processed logits to obtain the complete confidence score of the token and outputs the complete confidence score of the token. The post-processed logits include the post-processed logits 1 and post-processed logits 2 mentioned above.

[0265] The normalization described above can be either softmax normalization or logsoftmax normalization. Specifically, if the inference device 110 performs softmax normalization on the post-processed logits, the confidence level obtained for the token corresponding to each element of the logits is that token's probability value in the probability distribution over the tokens. The tokens corresponding to the elements include the aforementioned K tokens.

[0266] If the inference device 110 performs logsoftmax normalization on the post-processed logits, the confidence level of the token corresponding to each element in the logits is obtained as the probability value in the log probability distribution of the token corresponding to each element.

[0267] In the second possible implementation, taking the output of the confidence levels of the tokens corresponding to logits 1 as an example, the inference device 110 obtains a second vector consisting entirely of zero values, and then replaces the zero values in the second vector that have the same indices as the K elements with the confidence levels A of the aforementioned K tokens, thereby obtaining the complete confidence levels of the tokens, which it then outputs.

[0268] For example, if K is 3, the index of element 1 is the 1st position in logits 1, the index of element 2 is the 3rd position in logits 1, and the index of element 3 is the 130th position in logits 1. The inference device 110 replaces the zero value of the 1st position in the second vector with the confidence level A of the token corresponding to the 1st position, replaces the zero value of the 3rd position in the second vector with the confidence level A of the token corresponding to the 3rd position, and replaces the zero value of the 130th position in the second vector with the confidence level A of the token corresponding to the 130th position.
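The zero-vector replacement in this example can be sketched as below, together with a check that, under softmax normalization, it matches the first implementation (normalizing the post-processed logits directly). The element values 0.4, 0.2, and 0.1 and the helper names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax; -inf entries receive probability 0."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scatter(vocab_size, indices, values, fill):
    """Build a full-length vector of fill and write values at the given indices."""
    full = [fill] * vocab_size
    for i, v in zip(indices, values):
        full[i] = v
    return full

VOCAB = 32001
indices = [1, 3, 130]    # positions from the example above
top_k = [0.4, 0.2, 0.1]  # hypothetical element values
conf_a = softmax(top_k)

# Second implementation: write the K confidence levels into an all-zero vector.
full_conf = scatter(VOCAB, indices, conf_a, fill=0.0)

# First implementation, for comparison: softmax over the post-processed logits.
post_logits = scatter(VOCAB, indices, top_k, fill=-math.inf)
full_conf_via_logits = softmax(post_logits)

print(max(abs(a - b) for a, b in zip(full_conf, full_conf_via_logits)))
```

The scatter route normalizes only K values instead of the full vocabulary. For logsoftmax output the fill value would be negative infinity rather than zero, consistent with paragraph [0260].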

[0269] In a third possible implementation, the inference device 110 obtains a second vector consisting entirely of zero values, and then removes the zero values from the confidence levels C of the K tokens obtained through S450, resulting in the confidence levels of K-N tokens (that is, the confidence levels B of the P tokens). The inference device 110 then replaces the zero values in the second vector that have the same indices as the P elements with the confidence levels B of the P tokens to obtain the complete confidence levels of the tokens, and outputs the complete confidence levels of the tokens.

[0270] It is understood that, in order to achieve the functions in the above embodiments, the inference device 110 includes hardware structures and / or software modules corresponding to the execution of each function. Those skilled in the art should readily recognize that, based on the units and method steps of the various examples described in conjunction with the embodiments disclosed in this application, this application can be implemented in hardware or a combination of hardware and computer software. Whether a function is executed in hardware or by computer software driving hardware depends on the specific application scenario and design constraints of the technical solution.

[0271] The model post-processing method provided by this application has been described in detail above with reference to Figures 2 to 6. The model post-processing device provided by this application will now be described with reference to Figure 7a, which is a schematic diagram of the structure of a model post-processing device provided by this application. The model post-processing device 700 can be used to implement the function of the inference device 110 in the above method embodiments, and therefore can also achieve the beneficial effects of the above method embodiments.

[0272] As shown in Figure 7a, the model post-processing device 700 includes an acquisition module 710, a filtering module 720, a first processing module 730, and a determination module 740. This model post-processing device 700 is used to implement the functions of the inference device 110 in the method embodiments corresponding to Figures 2 to 6. In one possible example, the specific process by which the model post-processing device 700 implements the above-described model post-processing method includes the following steps:

[0273] The acquisition module 710 is used to acquire the first logits; the first logits are obtained by the model based on the first inference request.

[0274] The filtering module 720 is used to filter the first K elements in the first logits, sorted in descending order, where K is greater than or equal to 1. There is a mapping relationship between each element in the logits and a minimal linguistic unit token.

[0275] The first processing module 730 is used to normalize the K elements to obtain the first confidence level of the K tokens.

[0276] The determination module 740 is used to determine the first target token corresponding to the first inference request based on the first confidence level of the K tokens, wherein the K tokens include the first target token.

[0277] To further realize the functions of the method embodiments shown in Figures 2 to 6 above, this application also provides a model post-processing device, as shown in Figure 7b. Figure 7b is a schematic diagram of the structure of a model post-processing device provided by this application. The model post-processing device 700 further includes: a second processing module 750, a first output module 760, and a second output module 770.

[0278] The second processing module 750 is used to obtain the second logits, which are obtained by the model based on the second inference request. The module filters the top L elements in descending order from the second logits, normalizes the O elements to obtain the fourth confidence scores of the O tokens, and determines the second target token corresponding to the second inference request based on the fourth confidence scores of the O tokens. The model performs parallel inference on the first and second inference requests; L is greater than or equal to 1, L is less than K; O equals K; the O elements include the L elements and negative infinity; and the O tokens include the second target token.

[0279] The first output module 760 is used to obtain a first request; the first request is used to indicate the output of post-processed logits; the output post-processed logits include the K elements and negative infinity values other than the K elements.

[0280] The second output module 770 is used to obtain a second request; the second request is used to indicate the output of the confidence levels of tokens; the output confidence levels of the tokens include the first confidence levels of the K tokens, and zero values other than the first confidence levels of the K tokens.

[0281] For more information on the functions of the acquisition module 710, the filtering module 720, the first processing module 730, and the determination module 740, please refer to the description of the model post-processing method above; it will not be repeated here.

[0282] The acquisition module 710, filtering module 720, first processing module 730, and determination module 740 can all be implemented in software or in hardware. For example, the implementation of the acquisition module 710 will be described below. Similarly, the implementation of the filtering module 720, first processing module 730, and determination module 740 can refer to the implementation of the acquisition module 710.

[0283] As an example of a software functional unit, module 710 may include code running on a computing instance. The computing instance may include at least one of a physical host (computing device), a virtual machine, and a container. Furthermore, the aforementioned computing instance may be one or more. For example, module 710 may include code running on multiple hosts / virtual machines / containers.

[0284] It should be noted that the multiple hosts / virtual machines / containers used to run this code can be distributed within the same region or in different regions. Furthermore, the multiple hosts / virtual machines / containers used to run this code can be distributed within the same availability zone (AZ) or in different AZs, each AZ comprising one or more geographically proximate data centers. Typically, a region can include multiple AZs.

[0285] Similarly, the multiple hosts / virtual machines / containers used to run this code can be distributed within the same virtual private cloud (VPC) or across multiple VPCs. Typically, a VPC is set up within a region. Communication between two VPCs within the same region, as well as between VPCs in different regions, requires a communication gateway to be set up within each VPC to enable interconnection between VPCs.

[0286] As an example of a hardware functional unit, the acquisition module 710 may include at least one computing device, such as a server. Alternatively, the acquisition module 710 may also be a device implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be implemented using a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.

[0287] The multiple computing devices included in the acquisition module 710 can be distributed in the same region or in different regions. Similarly, the multiple computing devices included in the acquisition module 710 can be distributed in the same Availability Zone (AZ) or in different AZs. Likewise, the multiple computing devices included in the acquisition module 710 can be distributed in the same Virtual Private Cloud (VPC) or in multiple VPCs. These multiple computing devices can be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.

[0288] It should be noted that, in other embodiments, the acquisition module 710 can be used to execute any step in the model post-processing method, the filtering module 720 can be used to execute any step in the model post-processing method, the first processing module 730 can be used to execute any step in the model post-processing method, and the determination module 740 can be used to execute any step in the model post-processing method. The steps implemented by the acquisition module 710, filtering module 720, first processing module 730, and determination module 740 can be specified as needed. The acquisition module 710, filtering module 720, first processing module 730, and determination module 740 respectively implement different steps in the model post-processing method to realize all the functions of the inference device 110.

[0289] It is worth noting that the inference device 110 in the aforementioned embodiments can correspond to the model post-processing device 700, and can correspond to the entity that executes the methods shown in Figures 2 to 6 according to the embodiments of this application. The operations and/or functions of the modules in the model post-processing device 700 respectively implement the corresponding processes of the methods in the embodiments of Figures 2 to 6. For brevity, they are not described again here.

[0290] Alternatively, the model post-processing device 700 shown in Figure 7a or Figure 7b can also be implemented using a communication device, which can refer to the inference device 110 in the aforementioned embodiments. When the communication device is a chip or chip system applied to the processing device, the model post-processing device 700 can also be implemented using a chip or chip system.

[0291] An embodiment of this application also provides a chip system, which includes a control circuit and an interface circuit. The interface circuit is used to obtain a first request, and the control circuit is used to implement the functions of the inference device 110 in the above method according to the first request.

[0292] In one possible design, the chip system also includes a memory for storing program instructions and / or data. This chip system can be composed of chips or may include chips and other discrete components.

[0293] This application also provides a computing device. Please refer to Figure 8, which is a schematic diagram of the structure of a computing device provided in this application. The computing device 800 includes a bus 802, a processor 804, a memory 806, and a communication interface 808. The processor 804, the memory 806, and the communication interface 808 are interconnected via the bus 802. The computing device 800 can be a server or a terminal device. It should be understood that this application does not limit the number of processors and memories in the computing device 800. For example, the computing device 800 can be the aforementioned inference device 110.

[0294] Bus 802 can be a PCIe bus or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of representation, only one line is used in Figure 8, but this does not imply that there is only one bus or one type of bus. Bus 802 can include pathways for transmitting information between various components of computing device 800 (e.g., processor 804, memory 806, communication interface 808).

[0295] Processor 804 may include any one or more processors such as CPU, GPU, FPGA, microprocessor (MP) or DSP.

[0296] The memory 806 may include volatile memory, such as random access memory (RAM). The memory 806 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).

[0297] The memory 806 stores executable program code, which the processor 804 executes to implement the functions of the aforementioned acquisition module 710, filtering module 720, first processing module 730, and determination module 740, thereby realizing the model post-processing method. In other words, the memory 806 stores instructions for executing the model post-processing method.

[0298] The communication interface 808 uses transceiver modules, such as, but not limited to, network interface cards and transceivers, to enable communication between the computing device 800 and other devices or communication networks. The computing device 800 can be a computer (e.g., a server) in a cloud data center, a computer in an edge data center, or a terminal.

[0299] This application also provides a computing device cluster. The computing device cluster includes at least one computing device, which can be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device can also be a terminal device such as a desktop computer, a laptop computer, or a smartphone. For example, the computing device cluster can be the aforementioned inference device 110.

[0300] Figure 9 is a schematic diagram of a computing device cluster provided in this application. The computing device cluster includes at least one computing device 800. The memories 806 of one or more computing devices 800 in the computing device cluster may store the same instructions for executing the model post-processing method.

[0301] In some possible implementations, the memory 806 of one or more computing devices 800 in the computing device cluster may also store partial instructions for executing the model post-processing method. In other words, a combination of one or more computing devices 800 can jointly execute the instructions for executing the model post-processing method.

[0302] It should be noted that the memory 806 in different computing devices 800 within the computing device cluster can store different instructions, each used to execute a portion of the model post-processing method's functions. That is, the instructions stored in the memory 806 of different computing devices 800 can implement the functions of one or more modules among the acquisition module 710, the filtering module 720, the first processing module 730, and the determination module 740.

[0303] In some possible implementations, one or more computing devices in a computing device cluster can be connected via a network. This network can be a wide area network (WAN) or a local area network (LAN). Figure 10 is a schematic diagram of one possible connection between computing devices according to this application. As shown in Figure 10, two computing devices 800A and 800B are connected via a network. Specifically, each computing device is connected to the network through its communication interface. In this implementation, the memory 806 in computing device 800A stores instructions for executing the functions of the acquisition module 710, and the memory 806 in computing device 800B stores instructions for executing the functions of the filtering module 720, the first processing module 730, and the determination module 740.

[0304] It should be understood that the functions of computing device 800A shown in Figure 10 can also be performed by multiple computing devices 800. Similarly, the functions of computing device 800B can also be performed by multiple computing devices 800.

[0305] This application also provides a computer program product containing instructions. This computer program product may be a software or program product containing instructions, capable of running on a computing device or being stored on any usable medium. When the computer program product is run on at least one computing device, it causes the at least one computing device to perform the above-described model post-processing method.

[0306] This application also provides a computer-readable storage medium. The computer-readable storage medium can be any available medium on which a computing device can store data, or a data storage device such as a data center containing one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state drive). The computer-readable storage medium includes instructions that instruct the computing device to perform the model post-processing method.

[0307] In the above embodiments, implementation can be achieved entirely or partially through software, hardware, firmware, or any combination thereof. When implemented using software, it can be implemented entirely or partially in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are performed entirely or partially. The computer can be a general-purpose computer, a special-purpose computer, a computer network, a network device, a user equipment, or other programmable device. The computer program or instructions can be stored in a computer-readable storage medium or transferred from one computer-readable storage medium to another. For example, the computer program or instructions can be transferred from one website, computer, server, or data center to another website, computer, server, or data center via wired or wireless means. The computer-readable storage medium can be any available medium that a computer can access or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium, such as a floppy disk, hard disk, or magnetic tape; it can also be an optical medium, such as a digital video disc (DVD); or it can be a semiconductor medium, such as an SSD.

[0308] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in this application, and these modifications or substitutions should all be covered within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A model post-processing method, characterized in that the method includes: obtaining first raw prediction values (logits), where the first logits are obtained by a model based on a first inference request; filtering the first logits to obtain the top K elements arranged in descending order, where K is greater than or equal to 1, and there is a mapping relationship between one element in the logits and one minimal language unit (token); normalizing the K elements to obtain first confidence levels of K tokens; and determining, based on the first confidence levels of the K tokens, a first target token corresponding to the first inference request, where the K tokens include the first target token.

2. The method of claim 1, wherein the step of determining the first target token corresponding to the first inference request based on the first confidence levels of the K tokens includes: sampling the first confidence levels of the K tokens to obtain a target confidence level among the first confidence levels of the K tokens, where the target confidence level includes one or more first confidence levels; and determining, from the mapping relationship between elements in the logits and tokens, the first target token corresponding to the element indicated by the target confidence level, where the first target token includes one or more tokens.

3. The method according to claim 1 or 2, characterized in that the step of determining the first target token corresponding to the first inference request based on the first confidence levels of the K tokens includes: selecting P elements from the K elements based on the first confidence levels of the K tokens, where P is greater than or equal to 1, P is less than K, and either the first confidence level of each token corresponding to the P elements is greater than or equal to a first threshold, or the tokens corresponding to the P elements are the first P tokens when the first confidence levels are arranged in descending order and the sum of the first confidence levels of the P tokens is greater than or equal to a second threshold; normalizing the P elements to obtain second confidence levels of the P tokens; and determining, based on the second confidence levels of the P tokens, the first target token corresponding to the first inference request.

4. The method according to claim 1 or 2, characterized in that the step of determining the first target token corresponding to the first inference request based on the first confidence levels of the K tokens includes: setting N elements among the K elements to negative infinity based on the first confidence levels of the K tokens, where N is greater than or equal to 0, N is less than K, and either the first confidence level of each token corresponding to the N elements is less than a first threshold, or the tokens corresponding to the N elements are N tokens in the descending-order arrangement of the first confidence levels and the sum of the first confidence levels of the tokens corresponding to the N elements is less than a second threshold; normalizing the K elements, including the N negative infinity values, to obtain third confidence levels of the K tokens; and determining, based on the third confidence levels of the K tokens, the first target token corresponding to the first inference request.

5. The method according to any one of claims 1 to 4, characterized in that the method further includes: obtaining second logits, where the second logits are obtained by the model based on a second inference request, and the model performs inference on the first inference request and the second inference request in parallel; filtering the second logits to obtain the top L elements arranged in descending order, where L is greater than or equal to 1 and L is less than K; normalizing O elements to obtain fourth confidence levels of O tokens, where O equals K, and the O elements include the L elements and negative infinity values; and determining, based on the fourth confidence levels of the O tokens, a second target token corresponding to the second inference request, where the O tokens include the second target token.

6. The method according to any one of claims 1 to 5, characterized in that, after filtering the first logits to obtain the top K elements arranged in descending order, the method further includes: obtaining a first request, where the first request is used to indicate output of post-processed logits; and outputting the post-processed logits, where the post-processed logits include the K elements and negative infinity values other than the K elements.

7. The method according to any one of claims 1 to 6, characterized in that, after normalizing the K elements to obtain the first confidence levels of the K tokens, the method further includes: obtaining a second request, where the second request is used to indicate output of the confidence levels of tokens; and outputting the confidence levels of the tokens, where the confidence levels of the tokens include the first confidence levels of the K tokens, and zero values other than the first confidence levels of the K tokens.

8. The method of claim 7, wherein, The first confidence level of the first token among the K tokens is: the probability value of the first token in the probability distribution or logarithmic probability distribution of the K tokens.

9. A model post-processing apparatus, characterized by, The device includes: The acquisition module is used to acquire the first logits; the first logits are obtained by the model based on the first inference request. The filtering module is used to filter the first K elements in the first logits in descending order, where K is greater than or equal to 1; there is a mapping relationship between an element in the logits and a token. The first processing module is used to normalize the K elements to obtain the first confidence level of the K tokens; The determination module is configured to determine the first target token corresponding to the first inference request based on the first confidence level of the K tokens, wherein the K tokens include the first target token.

10. The apparatus of claim 9, wherein, The determining module is specifically used to sample the first confidence scores of the K tokens to obtain the target confidence scores among the first confidence scores of the K tokens, and to determine the first target token corresponding to the element indicated by the target confidence score from the mapping relationship between elements and tokens in logits; the target confidence score includes one or more first confidence scores, and the first target token includes one or more tokens.

11. The apparatus according to claim 9 or 10, characterized in that, The determining module is specifically used to filter P elements from the K elements based on the first confidence level of the K tokens, and to normalize the P elements to obtain the second confidence level of the P tokens. Based on the second confidence levels of the P tokens, the first target token corresponding to the first inference request is determined; wherein, P is greater than or equal to 1, P is less than K, the first confidence level of the tokens corresponding to the P elements is greater than or equal to a first threshold, or the first confidence level of the tokens corresponding to the P elements is the first P tokens arranged in descending order, and the sum of the first confidence levels of the P tokens is greater than or equal to the second threshold.

12. The apparatus of claim 9 or 10, wherein, The determining module is specifically configured to, based on the first confidence level of the K tokens, set N elements of the K elements to negative infinity, and normalize the K elements including the N negative infinity values to obtain the third confidence level of the K tokens; and determine the first target token corresponding to the first inference request based on the third confidence level of the K tokens; wherein, N is greater than or equal to 0, N is less than K, the first confidence level of the tokens corresponding to the N elements is less than a first threshold, or the first confidence level of the tokens corresponding to the N elements is the N tokens arranged in descending order, and the sum of the first confidence levels of the tokens corresponding to the N elements is less than a second threshold.

13. The apparatus of any one of claims 9-12, wherein, The device also includes a second processing module; The second processing module is used to obtain a second logits, which are obtained by the model based on the second inference request; filter the L elements in the second logits in descending order, normalize the O elements to obtain the fourth confidence scores of the O tokens, and determine the second target token corresponding to the second inference request based on the fourth confidence scores of the O tokens; wherein the model performs parallel inference on the first inference request and the second inference request; L is greater than or equal to 1, L is less than K; O is equal to K, the O elements include the L elements and negative infinity, and the O tokens include the second target token.

14. The apparatus of any one of claims 9-13, wherein the apparatus further includes a first output module; the first output module is used to obtain a first request, where the first request is used to indicate output of post-processed logits, and to output the post-processed logits; the post-processed logits include the K elements and negative infinity values other than the K elements.

15. The apparatus of any one of claims 9 to 14, wherein the apparatus further includes a second output module; the second output module is used to obtain a second request, where the second request is used to indicate output of the confidence levels of tokens, and to output the confidence levels of the tokens; the confidence levels of the tokens include the first confidence levels of the K tokens, and zero values other than the first confidence levels of the K tokens.

16. The apparatus of claim 15, wherein, The first confidence level of the first token among the K tokens is: the probability value of the first token in the probability distribution or logarithmic probability distribution of the K tokens.

17. A cluster of computing devices, characterized in that, It includes at least one computing device, each computing device including a processor and memory; The processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device to cause the cluster of computing devices to perform the method as described in any one of claims 1 to 8.

18. A computer-readable storage medium, characterized in that, The storage medium stores a computer program or instructions, which, when executed by a computing device, implement the method of any one of claims 1 to 8.

19. A computer program product comprising computer programs or instructions, characterized in that, When the computer program or instructions are executed by a computing device, the method of any one of claims 1 to 8 is implemented.
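
For illustration only, the flow recited in claims 1, 2 and 4 (filter the top K elements, normalize them, optionally set low-confidence elements to negative infinity and normalize again, then sample a target token) might be sketched as follows. This is a minimal sketch and not the applicant's implementation: the function names `top_k`, `softmax`, `mask_low_confidence` and `post_process`, the use of softmax as the normalization, and the rule of always keeping the largest element are all assumptions introduced here.

```python
import heapq
import math
import random

NEG_INF = float("-inf")


def top_k(logits, k):
    """Return the K largest (token_id, logit) pairs in descending order of logit."""
    return heapq.nlargest(k, enumerate(logits), key=lambda pair: pair[1])


def softmax(values):
    """Normalize logits into probabilities; -inf entries map to probability 0."""
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mask_low_confidence(elements, confidences, threshold):
    """Set elements whose confidence is below `threshold` to -inf.

    Assumption: the largest element (index 0) is always kept, so fewer than
    K elements are masked and the later normalization stays well defined.
    """
    return [
        (token_id, value if i == 0 or conf >= threshold else NEG_INF)
        for i, ((token_id, value), conf) in enumerate(zip(elements, confidences))
    ]


def post_process(logits, k, threshold=0.0, rng=random):
    """Hypothetical end-to-end helper: filter, normalize, mask, sample."""
    kept = top_k(logits, k)                              # top K elements
    first_conf = softmax([value for _, value in kept])   # first confidence levels
    masked = mask_low_confidence(kept, first_conf, threshold)
    final_conf = softmax([value for _, value in masked]) # renormalized over K
    token_ids = [token_id for token_id, _ in masked]
    target = rng.choices(token_ids, weights=final_conf, k=1)[0]
    return target, dict(zip(token_ids, final_conf))
```

With the illustrative input `post_process([1.0, 3.0, 2.0, 0.5], k=2, threshold=0.5)`, only the elements for tokens 1 and 2 survive the top-K step. Token 2's first confidence level (about 0.27) falls below the threshold, so its element is set to negative infinity and token 1 is always selected. Every step after the filtering operates on K values rather than on the full vocabulary.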