Text recognition method, device and electronic equipment of large language model

By inserting dummy tokens into the large language model and using a sliding window mechanism to filter key tokens, the problem of GPU memory limitations was solved, enabling accurate identification and continuous generation of long text sequences.

CN119830900B · Active · Publication Date: 2026-06-19 · CHINA TELECOM CLOUD TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHINA TELECOM CLOUD TECH CO LTD
Filing Date
2024-11-21
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing large language models are limited by GPU memory capacity during text recognition. This limits the length of the input text sequence and causes the initial tokens to be discarded, so the recognition results deviate from the text and accuracy suffers.

Method used

A dummy token is inserted at the beginning of the text sequence to absorb the attention score bias. Key tokens are calculated and filtered step by step through a sliding window mechanism. The newly added tokens are combined to generate prediction results until the entire sequence is identified.

Benefits of technology

Without changing the GPU memory size, the method adapts to text sequences of different lengths, maintains the accuracy and continuity of the recognition results, and improves text recognition performance.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention provides a text recognition method, apparatus, and electronic device using a large language model, relating to the field of natural language processing technology. The method includes: acquiring a token sequence of a text sequence; determining a dummy token that stores attention score biases, and inputting a sliding window portion of the tokens into a large language model to output a first predicted token; calculating the attention score of each sliding window portion token relative to the first predicted token, and selecting multiple key tokens; selecting a new token; inputting the dummy token, key tokens, and new token as a token combination into the large language model to output a second predicted token; updating the token combination until the token sequence is fully input; and outputting the recognition result of the text sequence. This invention can improve the accuracy of text recognition.

Description

Technical Field

[0001] This invention relates to the field of natural language processing technology, and in particular to a text recognition method, a device, and an electronic device for large language models.

Background Technology

[0002] Large language models are large neural network models in deep learning specifically designed for understanding, generating, and processing natural language. They typically contain billions or even hundreds of billions of parameters and, trained on massive amounts of text data, acquire rich linguistic knowledge and complex semantic relationships, enabling them to generate coherent text, answer questions, translate languages, and more. Attention mechanisms are a core component of large language models; they help the model selectively focus on specific parts of the text based on the varying importance of the context.

[0003] The necessity of attention mechanisms for large language models lies in their ability to capture global context and long-distance dependencies, enabling the model to focus on key content within a sequence when processing long texts. Simultaneously, attention mechanisms support parallel computation, improving model efficiency and enhancing its language understanding and generation performance across multiple tasks. This allows large language models to generate more coherent and semantically consistent outputs, demonstrating superior performance in complex semantic tasks.

[0004] However, during text recognition, existing large language models are constrained by finite GPU memory, and the memory required depends on the length of the input text. The input text sequence therefore cannot be made very long, and the initial tokens (i.e., word units) are discarded. As recognition proceeds, the model's results deviate significantly from the text, which lowers recognition accuracy and limits the recognition performance of large language models.

Summary of the Invention

[0005] This invention provides a text recognition method, electronic device, and computer-readable storage medium for large language models. It addresses the following problem: during text recognition, existing large language models are constrained by finite GPU memory, and the memory required depends on the length of the input text. The input text sequence therefore cannot be made very long, and the initial tokens (word units) are discarded. As recognition proceeds, the model's results deviate significantly from the text, which lowers recognition accuracy and limits the recognition performance of large language models.

[0006] This invention also discloses a text recognition method for a large language model, the method comprising:

[0007] S1: Get the token sequence from the text sequence;

[0008] S2: Determine the dummy token at the head of the token sequence to store the attention score bias, and input the sliding window portion of the optimized token sequence with the dummy token into the large language model to output the first predicted token;

[0009] S3: Calculate the attention score of each sliding window part token relative to the first predicted token, and select a first preset number of key tokens whose attention scores are greater than the preset attention scores;

[0010] S4: Select a second preset number of new tokens from the token sequence to input into the large language model;

[0011] S5: Input the dummy token, each key token and each newly added token as a token combination sequence into the large language model, and output the second predicted token;

[0012] S6: Use the second predicted token as the first predicted token, and the token combination as the sliding window part token. Return to step S3 until all the tokens in the token sequence have been input into the large language model.

[0013] S7: Based on the obtained first and second prediction tokens, output the recognition result of the text sequence.

[0014] This invention also discloses a text recognition device based on a large language model, the device comprising:

[0015] The acquisition module is used to obtain the token sequence of the text sequence;

[0016] The determination module is used to determine the dummy token at the head of the token sequence to store the attention score bias, and input the sliding window portion of the optimized token sequence with the dummy token into the large language model to output the first predicted token;

[0017] The calculation module is used to calculate the attention score of each sliding window part token relative to the first predicted token, and select a first preset number of key tokens whose attention scores are greater than the preset attention score;

[0018] The selection module is used to select a second preset number of new tokens from the token sequence to be input into the large language model;

[0019] The first input module is used to input the dummy token, each key token and each newly added token as token combinations into the large language model, and output the second predicted token.

[0020] The second input module is used to use the second predicted token as the first predicted token, and the token combination as the sliding window partial token, and return to step S3 until all the tokens in the token sequence are input into the large language model.

[0021] The output module is used to output the recognition result of the text sequence based on the obtained first and second prediction tokens.

[0022] This invention also discloses an electronic device, including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;

[0023] Memory, used to store computer programs;

[0024] A processor, when executing a program stored in memory, implements the method as described in the embodiments of the present invention.

[0025] The embodiments of the present invention have the following advantages:

[0026] In this invention, a dummy token for storing attention score bias is first determined at the head of the token sequence of the text sequence to be recognized. Then, a first predicted token is output based on the sliding window portion tokens. Next, the attention score of each sliding window portion token relative to the first predicted token is calculated, and key tokens with higher scores are selected. Several new tokens are then selected from the token sequence to be input into the large language model. The obtained dummy token, key tokens, and new tokens then replace the sliding window portion tokens to generate a new second prediction result. This process is repeated until the sliding window has traversed the entire text sequence. In this process, the number of tokens input to the large language model can be adapted to different GPU memory sizes, giving greater adaptability. Furthermore, the input retains the initial tokens of the text sequence, avoiding deviations from the text in the recognition results and improving the accuracy of the text recognition results. Each input includes the key tokens that had a significant impact on the previous prediction result, preserving important semantic information of the text. Each input also includes new tokens from the text sequence, maintaining the continuity of the recognition results. In summary, this solution can process text sequences of different lengths without changing the GPU memory size, readily obtain the global and local features of the text sequence, and maintain the accuracy of the text sequence recognition results.

Attached Figure Description

[0027] Figure 1 is a flowchart illustrating the steps of a text recognition method based on a large language model provided in an embodiment of the present invention;

[0028] Figure 2 is a block diagram of a text recognition device for a large language model provided in an embodiment of the present invention;

[0029] Figure 3 is a block diagram of an electronic device provided in an embodiment of the present invention;

[0030] Figure 4 is a schematic diagram of a computer-readable medium provided in an embodiment of the present invention.

Detailed Implementation

[0031] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0032] Referring to Figure 1, a flowchart of the steps of a text recognition method based on a large language model provided in an embodiment of the present invention is shown.

[0033] Specifically, this may include the following steps:

[0034] It should be noted that, for the sake of simplicity, the method embodiments are all described as a series of actions. However, those skilled in the art should understand that the embodiments of the present invention are not limited to the described order of actions, because according to the embodiments of the present invention, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily essential to the embodiments of the present invention.

[0035] The methods include:

[0036] S1: Get the token sequence from the text sequence.

[0037] In this context, a text sequence refers to the raw natural language text to be identified or processed, existing in the form of sentences, paragraphs, or entire texts. A token sequence refers to a sequence of smaller units obtained from a text sequence through word segmentation or decomposition. In large language models, text needs to be broken down into tokens when processing it because the model cannot directly understand and process continuous text strings. Token sequences allow the model to use each token as a separate input unit for attention calculation, semantic analysis, and feature extraction.
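As an illustration of step S1, the following is a minimal sketch that assumes a simple word-and-punctuation splitter. The patent does not specify a tokenizer, so the helper name `tokenize` and the splitting rule are hypothetical; a real large language model would use its own subword tokenizer.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split raw text into word and punctuation tokens (illustrative only)."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = tokenize("Large models read tokens, not strings.")
```

Each element of `tokens` would then serve as one input unit for the later steps.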

[0038] S2: Determine the dummy token at the head of the token sequence to store the attention score bias, and input the sliding window portion of the optimized token sequence with the dummy token into the large language model to output the first predicted token.

[0039] The attention score bias represents the gradually accumulating offset caused by multiple softmax operations, resulting in the attention score deviating from its initial ideal distribution. This bias affects the model's ability to focus on long sequences of content. Specifically, the attention score bias represents the deviation of the attention score from its distribution due to softmax operations during training or inference.

[0040] The dummy token is a special "placeholder" token inserted at the beginning of the token sequence, primarily used to store the attention score bias. This bias arises because, after multiple softmax operations, the attention score gradually deviates from the ideal distribution, affecting the model's focus on important content. The dummy token absorbs and stabilizes these biases, allowing the model to maintain the accuracy of the attention distribution when processing long text sequences. The optimized token sequence refers to the token sequence after adding the dummy token and adjusting it. The sliding window portion of the tokens refers to a small subset of tokens extracted from the optimized token sequence, using a sliding window mechanism to control the sequence length. The sliding window mechanism allows the model to process only a small range of tokens at a time in long sequences, reducing the GPU memory burden. The sliding window gradually slides along the entire sequence so that the model covers the entire text sequence during processing. The first predicted token is the first prediction generated by the model after inputting the sliding window portion of the tokens.

[0041] It should be noted that a dummy token is added to the beginning of the token sequence to absorb the attention score bias. Then, the sliding window portion of the optimized token sequence containing the dummy token is input into the model to obtain the first prediction result (the first predicted token). This process ensures the stability of the attention distribution in long sequence processing, reduces the GPU memory burden, and improves prediction accuracy.
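The construction of the optimized token sequence and the first sliding window portion might be sketched as follows. The placeholder string `"<dummy>"`, the helper name `first_window`, and the choice to count the dummy token toward the window size are assumptions made for illustration, not details given in the text.

```python
DUMMY = "<dummy>"  # hypothetical placeholder symbol for the dummy token


def first_window(tokens: list[str], window_size: int) -> list[str]:
    """Prepend the dummy token and return the first sliding-window slice."""
    optimized = [DUMMY] + tokens      # the "optimized token sequence"
    return optimized[:window_size]    # only this slice is fed to the model


window = first_window(["a", "b", "c", "d", "e"], window_size=4)
```

Only `window`, rather than the full sequence, would be passed to the model to obtain the first predicted token.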

[0042] In one possible implementation, determining a dummy token at the head of the token sequence in S2 to store the attention score bias specifically includes:

[0043] S201: Determine the operation process of the large language model, which includes the training process and the inference process.

[0044] The training process is the stage where the model learns and optimizes. During training, the model updates its parameters using a large amount of labeled or unlabeled data (such as text corpora) to learn the structure and semantic relationships of the language. The inference process is the application stage where the model generates output on new data, i.e., the actual prediction stage. During inference, the model no longer adjusts its parameters but makes predictions or generates based on the knowledge it has already trained on.

[0045] S202: Determine the dummy token based on the running process.

[0046] Specifically, the setting of dummy tokens is determined by judging whether the large language model is currently running in training or inference mode, thus identifying the current running state and setting appropriate dummy tokens according to different states. This process ensures that the role of dummy tokens can effectively absorb attention score bias in both training and inference, thereby optimizing the attention distribution of the model at different running stages.

[0047] In one possible implementation, S202 specifically includes:

[0048] S2021: When the process is a training process, add a dummy token to the head of the token sequence.

[0049] S2022: When the process is an inference process, starting from the beginning of the token sequence, select the third preset number of tokens as dummy tokens.

[0050] Specifically, in S2021, during the training process, a dummy token is directly added to the beginning of the token sequence. In S2022, during the inference process, a certain number (a third preset number) of tokens are selected sequentially from the beginning of the sequence as dummy tokens. This method flexibly sets dummy tokens according to different states, ensuring that attention score bias can be effectively absorbed during both the training and inference phases.

[0051] It should be noted that those skilled in the art can set the size of the third preset quantity according to actual needs, and this invention does not limit it.

[0052] In one possible implementation, the added dummy token counts as one token.

[0053] In one possible implementation, the third preset quantity is less than or equal to 5.
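Steps S2021 and S2022 can be sketched as a single helper that branches on the running process. The function name `split_dummy`, the mode strings, the placeholder `"<dummy>"`, and the default of 4 initial tokens are all hypothetical; the upper bound of 5 follows the possible implementation mentioned above.

```python
def split_dummy(tokens: list[str], mode: str, n_initial: int = 4,
                dummy: str = "<dummy>") -> tuple[list[str], list[str]]:
    """Return (dummy_tokens, remaining_tokens) for the given running process."""
    if mode == "training":
        # S2021: insert one extra dummy token ahead of the whole sequence.
        return [dummy], list(tokens)
    if mode == "inference":
        # S2022: reuse the first few real tokens as dummy tokens.
        if not 1 <= n_initial <= 5:
            raise ValueError("n_initial is assumed to lie in 1..5")
        return list(tokens[:n_initial]), list(tokens[n_initial:])
    raise ValueError(f"unknown mode: {mode!r}")


train_dummy, train_rest = split_dummy(list("abcdef"), "training")
infer_dummy, infer_rest = split_dummy(list("abcdef"), "inference", n_initial=2)
```

In the training branch no real token is consumed, whereas in the inference branch the first `n_initial` tokens of the text take on the dummy role.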

[0054] S3: Calculate the attention score of each sliding window part token relative to the first predicted token, and select a first preset number of key tokens whose attention scores are greater than the preset attention scores.

[0055] Attention score is a numerical metric used by the model to measure the correlation between each sliding window token and the first predicted token. A higher score indicates a greater influence of that token on generating the first predicted token. In large language models, attention scores are typically calculated using a self-attention mechanism, employing a softmax function to generate a probability distribution, thus guiding the model to focus on high-scoring tokens. Key tokens are those with attention scores above a preset threshold. The model identifies these key tokens because they play a crucial role in understanding the current context or subsequent generation. By filtering key tokens, the model can focus more intently on important information in the text, improving the accuracy of text recognition and generation.

[0056] It should be noted that the model calculates the attention score of each sliding window token relative to the first predicted token, and filters out key tokens with scores higher than a preset value. These key tokens help the model focus on important information in long sequence processing, improving prediction accuracy.

[0057] It should be noted that those skilled in the art can set the preset attention score and the size of the first preset quantity according to actual needs, and this invention does not limit them.
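A minimal sketch of step S3 follows, assuming single-head scaled dot-product attention between the predicted token's query vector and each window token's key vector. The text does not say how scores are aggregated across layers or heads, so the vectors, the helper names `attention_scores` and `select_key_tokens`, and the example threshold are illustrative assumptions.

```python
import math


def attention_scores(query: list[float], keys: list[list[float]]) -> list[float]:
    """Softmax over scaled dot products between one query and each key."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(logits)                       # subtract the max for stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_key_tokens(scores: list[float], threshold: float,
                      max_keys: int) -> list[int]:
    """Indices of at most max_keys tokens scoring above threshold, in sequence order."""
    above = [i for i, s in enumerate(scores) if s > threshold]
    top = sorted(above, key=lambda i: scores[i], reverse=True)[:max_keys]
    return sorted(top)


scores = attention_scores([1.0, 0.0],
                          [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0], [3.0, 0.0]])
key_idx = select_key_tokens(scores, threshold=0.2, max_keys=2)
```

Here `threshold` plays the role of the preset attention score and `max_keys` the first preset number; returning the indices in sequence order keeps the key tokens in their original relative positions.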

[0058] S4: Select a second preset number of new tokens from the token sequence to input into the large language model.

[0059] In this context, "new tokens" refers to the next tokens selected from the original token sequence that have not yet been input into the model. These new tokens lie beyond the end of the portion processed so far, representing the unprocessed text that follows the current sliding window. Introducing new tokens provides the model with up-to-date contextual information, ensuring semantic coherence during long text recognition and effectively connecting processed and new content.

[0060] It should be noted that those skilled in the art can set the size of the second preset quantity according to actual needs, and the present invention does not limit this.

[0061] In one possible implementation, the total number of newly added tokens is the same as the number of tokens in the sliding window section.

[0062] It's important to note that the total number of newly added tokens is set to be equal to the number of tokens in the sliding window section. This means that each time the sliding window moves, the newly introduced tokens will completely replace the old tokens in the sliding window. This design ensures that the model processes the same number of tokens each time it handles long text, maintaining the stability of memory usage and guaranteeing semantic coherence.
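Step S4 can be sketched with a cursor that marks how much of the token sequence has already been input. The helper name `take_new_tokens` and the cursor convention are assumptions for illustration.

```python
def take_new_tokens(tokens: list[str], cursor: int,
                    count: int) -> tuple[list[str], int]:
    """Return the next `count` unseen tokens and the advanced cursor."""
    new = tokens[cursor:cursor + count]   # may be shorter near the end
    return new, cursor + len(new)


sequence = list("abcdefg")
new_tokens, cursor = take_new_tokens(sequence, cursor=4, count=2)
last_tokens, end_cursor = take_new_tokens(sequence, cursor=6, count=2)
```

Here `count` corresponds to the second preset number. Near the end of the sequence fewer than `count` tokens remain, so the final batch is simply shorter.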

[0063] In one possible implementation, after S4, the following is also included:

[0064] If the total number of newly added tokens in the sliding window portion input to the large language model is less than the sum of the number of dummy tokens, the first preset number, and the second preset number, the newly added tokens in the sliding window portion are used directly as the token combination sequence.

[0065] It should be noted that when the number of newly added tokens in the sliding window is less than the sum of the number of dummy tokens, the first preset number of key tokens, and the second preset number of newly added tokens, these newly added tokens are input into the model as a complete token combination sequence. This setting ensures that the text sequence is recognized as completely as possible, improving the accuracy of text sequence recognition.

[0066] S5: Input the dummy token, each key token, and each newly added token as token combinations into the large language model, and output the second predicted token.

[0067] It's important to note that the model integrates the dummy token, key token, and newly added token into a complete token combination input to the large language model. These tokens represent initial bias control, important information in the current context, and the latest context, respectively. Through this combined input, the model can consider both global and up-to-date information in the long sequence when outputting the second predicted token, achieving stable and coherent text generation or recognition. Furthermore, the length of the input text sequence can be freely adjusted during this process. This token selection mechanism improves the accuracy and reliability of recognition results and reduces recognition bias without exceeding the available GPU memory.

[0068] Understandably, the solution involves three parts: adding a dummy token at the beginning of the text, retaining the tokens with the highest attention scores in each iteration, and adding new tokens for each new input. The first part avoids the problem of error accumulation; the second part increases the accuracy of text recognition while overcoming the text length constraint; and the third part increases the continuity of the predicted text.
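The token combination of step S5 might be assembled as in the sketch below. The helper name `build_combination` and the sample tokens are hypothetical; the point shown is only that the input length equals the sum of the three group sizes, regardless of how long the full text is.

```python
def build_combination(dummy: list[str], keys: list[str],
                      new: list[str]) -> list[str]:
    """Concatenate dummy, key, and new tokens in their original order."""
    return dummy + keys + new


combo = build_combination(["<dummy>"],
                          ["contract", "penalty"],
                          ["shall", "apply", "."])
```

Because the three groups are taken from increasingly late positions in the text, simple concatenation preserves the original reading order.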

[0069] S6: Use the second predicted token as the first predicted token, and the token combination as the sliding window partial token. Return to step S3 until all tokens in the token sequence have been input into the large language model.

[0070] It should be noted that the model uses the second predicted token as the new first predicted token, and sets the current token combination as the new sliding window partial token, before re-entering step S3 for the next round of processing. By continuously repeating this step, the model can progressively input and recognize the entire text sequence until all tokens have been processed, achieving continuous recognition of long texts.
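The loop formed by steps S3 to S6 can be sketched end to end with a stand-in model. Everything here is an assumption made for illustration: `toy_model` is not a language model (it scores tokens by their length and echoes the last token), `recognize` follows the inference-style variant in which the first tokens act as dummy tokens, and the parameter names `n_dummy`, `k`, and `w` are hypothetical.

```python
def toy_model(window: list[str]) -> tuple[str, list[float]]:
    """Stand-in for the LLM: returns a fake prediction and fake attention scores."""
    total = sum(len(t) for t in window)
    scores = [len(t) / total for t in window]   # longer token = higher score
    return "pred:" + window[-1], scores


def recognize(tokens: list[str], model, n_dummy: int = 1, k: int = 2,
              w: int = 2, threshold: float = 0.0) -> list[str]:
    """Run the dummy + key + new token loop and collect the predictions."""
    dummy = tokens[:n_dummy]
    window = tokens[:n_dummy + k + w]           # first sliding window portion
    cursor = len(window)                        # tokens consumed so far
    predictions = []
    while True:
        predicted, scores = model(window)       # S2 / S5: predict from window
        predictions.append(predicted)
        if cursor >= len(tokens):               # S6 stop condition
            break
        # S3: keep up to k non-dummy tokens whose score exceeds the threshold.
        candidates = [i for i in range(n_dummy, len(window))
                      if scores[i] > threshold]
        top = sorted(candidates, key=lambda i: scores[i], reverse=True)[:k]
        keys = [window[i] for i in sorted(top)]
        # S4: take the next w unseen tokens.
        new = tokens[cursor:cursor + w]
        cursor += len(new)
        # S6: the combination becomes the next sliding window portion.
        window = dummy + keys + new
    return predictions


text = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dogs", "!"]
result = recognize(text, toy_model)
```

In this sketch the window never holds more than `n_dummy + k + w` tokens, which is what keeps memory use bounded while the cursor advances through the whole sequence.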

[0071] S7: Based on the obtained first and second prediction tokens, output the recognition result of the text sequence.

[0072] It should be noted that the model generates the final text recognition result based on the output sequence of the accumulated first and second predicted tokens. By progressively predicting and combining these tokens, the model can form a complete and coherent output that accurately reflects the semantic content of the entire input text, thus completing the text sequence recognition task.

[0073] In one possible implementation, S7 specifically includes:

[0074] S701: Arrange the obtained first and second prediction tokens in the output order.

[0075] S702: Output the permutation result as the recognition result of the text sequence.

[0076] It's important to note that the entire scheme, including both the training and inference processes, can be summarized as follows: Select the first few tokens from the text sequence. Select the tokens with high attention scores from the middle of the text sequence. Select the most recently added token in the sequence. Generate subsequent tokens by simultaneously incorporating these three sets of tokens into the calculation.

[0077] During training, the input token sequence is first obtained, then a dummy token is added, the Top-K high-attention tokens are selected, and the most recent token is obtained. Attention is then calculated to generate new tokens. The rolling window is then updated to redetermine the dummy token, high-attention tokens, and most recent tokens. This process is repeated to generate multiple new tokens until the entire sequence is traversed. The result, i.e., the generated token, is then output.

[0078] During the inference process, the input token sequence is first obtained. Then, a few initial tokens are selected as dummy tokens, the Top-K high-attention tokens are selected, and the most recent token is obtained. Attention is then calculated to generate new tokens. The rolling window is then updated to redetermine the dummy token, high-attention token, and most recent token. This process is repeated to generate multiple new tokens until the entire sequence is traversed. Finally, the result, which is the generated token, is output.

[0079] The initial tokens in the text sequence are important for the following reasons: During language model inference, multiple softmax operations are performed. This operation is intended to reshape the data distribution and aid gradient descent. However, because every term of the softmax output is greater than 0, some tokens that should have zero or negative attention scores end up with positive scores. As the number of softmax operations increases, these scores accumulate and drift away from the original distribution. This explains the behavior observed with the window attention mechanism: once the initial few tokens of the sequence slip out of the window, the perplexity of the generated sequence increases. Furthermore, the tokens with high attention scores in the middle of the sequence are retained because, whether during training or inference, key tokens always receive high attention scores and significantly influence subsequent outputs. These tokens typically carry important semantic information from the sequence. The most recently added tokens matter for generating subsequent tokens mainly because the semantics in a sequence are usually continuous, so these tokens are necessary for the semantic coherence of what is generated next. By retaining only these three groups of tokens for subsequent token generation, the language model avoids retaining all tokens, which reduces the GPU memory required for the attention computation. It retains the initial tokens to absorb the attention score distribution bias caused by softmax, retains tokens with high attention scores in the sequence to prevent the loss of key semantic information, and retains the latest tokens in the sequence to maintain semantic coherence. Through this process, a long-sequence adaptive attention mechanism for large language models can be implemented.
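The point about softmax can be illustrated numerically. The sketch below uses made-up logits: the value 2.0 assigned to the leading token is purely an assumption chosen to show the effect, not a figure from the text.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax; every output is strictly positive."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Three tokens that are all irrelevant (equally low logits): softmax must
# still distribute a total weight of 1 among them.
without_dummy = softmax([-2.0, -2.0, -2.0])

# The same three tokens preceded by a leading token with a higher logit:
# most of the forced weight now lands on the leading token instead.
with_dummy = softmax([2.0, -2.0, -2.0, -2.0])
```

Under these assumed numbers, the irrelevant tokens each receive a third of the weight when no leading token is present, and only a small share once it is, which is the sense in which the leading token "absorbs" the surplus.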

[0080] More specifically, during training, firstly, a sequence of tokens is input. Then, a dummy token is added to the beginning of the text sequence; this token stores the amount by which the attention scores deviate from their distribution due to subsequent softmax operations. Next, the top-K tokens with the highest attention scores in the text sequence are selected. Then, the most recent tokens in the sequence are obtained. Finally, all three groups of tokens are used together in the attention calculation to predict the next token.

[0081] For example, to train a large language model, firstly, a dummy token is added to the beginning of the sequence. Secondly, during training, K tokens with high attention scores are retained. Thirdly, the W most recently added tokens are selected for subsequent calculations. At this point, the number of attention tokens is 1 + K + W. When the length of the input sequence is less than or equal to 1 + K + W, all tokens in the entire sequence are included in the attention score calculation. When the length of the sequence is greater than 1 + K + W, the number of tokens involved in the calculation is 1 + K + W. This process is repeated until the entire training sequence is input. When training is complete, during inference, the large language model can perform inference as long as K and W remain consistent with the training process and a dummy token is added to the beginning of the sequence.
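The token budget described in this example reduces to a one-line rule, sketched below. The function name `attended_count` is hypothetical; `k` and `w` correspond to K and W, and `n_dummy` defaults to the single dummy token used in training.

```python
def attended_count(seq_len: int, k: int, w: int, n_dummy: int = 1) -> int:
    """Number of tokens taking part in attention for a sequence of seq_len."""
    budget = n_dummy + k + w
    # Short sequences are attended in full; longer ones are capped at budget.
    return min(seq_len, budget)


short = attended_count(seq_len=6, k=4, w=3)
long = attended_count(seq_len=1000, k=4, w=3)
```

With K = 4 and W = 3 the cap is 8 tokens, so a 6-token sequence is attended in full while a 1000-token sequence still uses only 8 tokens per step.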

[0082] More specifically, in the inference process, firstly, a sequence of tokens is input. Secondly, initial tokens are extracted from the beginning of the text sequence; there may be 1, 2, or 3 of them, and typically no more than 5. These tokens are used to store the deviations in attention scores caused by subsequent softmax operations. Next, the top-K tokens with the highest attention scores are obtained from the subsequent sequence. Then, the most recent tokens in the sequence are obtained. These three groups of tokens work together in the attention calculation to predict the next token.

[0083] For example, even when a large language model has not been trained with this scheme, the adaptive attention mechanism can still be used. The specific steps are as follows: First, select the initial N tokens as dummy tokens to absorb the bias in the attention score. Second, retain the K tokens with higher attention scores. Third, select the W most recently added tokens for subsequent calculations. At this point, the number of attention tokens is N+K+W. When the length of the input sequence is less than or equal to N+K+W, all tokens in the entire sequence are included in the attention score calculation. When the length of the sequence is greater than N+K+W, the number of tokens involved in the calculation is N+K+W. This process is repeated until the entire sequence is input.
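The N+K+W selection for an already trained model might be sketched as an index computation over per-token attention scores. The function name `retained_indices` and the sample scores are hypothetical, and how the scores themselves are obtained from the model is left open here.

```python
def retained_indices(scores: list[float], n: int, k: int, w: int) -> list[int]:
    """Indices kept for attention: first n, k best in the middle, last w."""
    length = len(scores)
    if length <= n + k + w:
        return list(range(length))              # short sequence: keep all
    head = list(range(n))                       # initial tokens as dummy tokens
    tail = list(range(length - w, length))      # most recently added tokens
    middle = range(n, length - w)
    top = sorted(middle, key=lambda i: scores[i], reverse=True)[:k]
    return head + sorted(top) + tail


sample_scores = [0.10, 0.05, 0.30, 0.02, 0.20, 0.08, 0.15, 0.10]
kept = retained_indices(sample_scores, n=1, k=2, w=2)
```

With these sample scores, position 0 is kept as the dummy token, positions 2 and 4 as the highest-scoring middle tokens, and positions 6 and 7 as the most recent tokens.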

[0084] It should be noted that since the training process of the language model is different from that proposed in this invention, the above N, K, and W are all hyperparameters of the model. Experiments are needed to determine the values of N, K, and W. Different models may choose different values.

[0085] Understandably, this solution enables long sequence output from large language models with limited GPU memory, and it can handle extremely long sequences. By introducing dummy tokens to stabilize attention, the model is no longer limited by the length of the pre-training sequence and can handle inputs far exceeding this length. Secondly, it significantly enhances the model's robustness and inference ability with only a small amount of additional computation, achieving high computational efficiency. Furthermore, it combines global and local high-attention tokens, integrating global and local information. During pre-training, dummy tokens can be added to absorb the attention score bias caused by the softmax operation, achieving the above process without consuming tokens. For already trained large language models, the initial few tokens of the sequence can still be selected to achieve stable output.

[0086] In practical applications, accurate recognition of long texts is achieved by progressively processing the tokens of the text sequence. First, the text sequence is decomposed into a token sequence, and a dummy token is inserted at the beginning to absorb the attention bias. Then, a sliding window mechanism is used to input the tokens into the model batch by batch, calculating the attention score of each token relative to the prediction result and selecting the important key tokens. At the same time, new tokens are introduced so that the model continuously acquires the latest contextual information. At each step, the dummy tokens, key tokens, and new tokens are combined and input into the model to generate coherent prediction results until the entire sequence is processed. Finally, the prediction results are arranged in order to output the complete text recognition result. This process ensures stable processing of long sequences within limited GPU memory and improves the accuracy and continuity of recognition.
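The step-by-step procedure above can be sketched as a loop. This is a hedged sketch under stated assumptions, not the patented implementation: `model_step` is a hypothetical interface that returns a predicted token and one attention score per input token, `toy_model_step` is a stand-in for a real large language model, the first N tokens are used as dummy tokens (the inference case), and key tokens are drawn only from the non-dummy part of the current window.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: window tokens in, (predicted token, per-token scores) out.
ModelStep = Callable[[Sequence[int]], Tuple[int, List[float]]]


def recognize(tokens: Sequence[int], model_step: ModelStep,
              n: int, k: int, w: int) -> List[int]:
    """Feed tokens to the model window by window and collect the predictions."""
    if w < 1:
        raise ValueError("w must be at least 1 so that the window can advance")
    dummy = list(tokens[:n])            # dummy tokens at the head of the sequence
    window = list(tokens[:n + k + w])   # first sliding window portion
    consumed = len(window)
    predictions: List[int] = []
    while True:
        predicted, scores = model_step(window)
        predictions.append(predicted)
        if consumed >= len(tokens):
            break  # every token of the sequence has been input
        # Key tokens: the k best-scoring non-dummy tokens, kept in window order.
        candidates = range(len(dummy), len(window))
        best = sorted(candidates, key=lambda i: (-scores[i], i))[:k]
        key_tokens = [window[i] for i in sorted(best)]
        # New tokens: the next w tokens that have not been input yet.
        new_tokens = list(tokens[consumed:consumed + w])
        consumed += len(new_tokens)
        window = dummy + key_tokens + new_tokens
    return predictions


def toy_model_step(window: Sequence[int]) -> Tuple[int, List[float]]:
    """Stand-in model: predicts sum(window) % 100 and scores tokens by value."""
    return sum(window) % 100, [float(t) for t in window]
```

For example, `recognize(list(range(10)), toy_model_step, 1, 2, 3)` calls the model three times, and no call receives more than N+K+W = 6 tokens.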

[0087] The embodiments of the present invention have the following advantages:

[0088] In this invention, a dummy token for storing the attention score bias is first determined at the head of the token sequence of the text sequence to be recognized. A first predicted token is then output based on the sliding window portion tokens. Next, the attention score of each sliding window portion token relative to the first predicted token is calculated, and the key tokens with higher scores are selected. Several new tokens are then selected from the token sequence to be input into the large language model. The dummy token, key tokens, and new tokens together replace the sliding window portion tokens and are used to generate a second predicted token. This process is repeated until the sliding window has traversed the entire text sequence. Because the number of tokens input to the large language model can be adapted to different GPU memory sizes, the method is more adaptable. Furthermore, the input retains the initial tokens of the text sequence, which prevents the recognition results from deviating from the text and improves their accuracy. Each input includes the key tokens that had a significant impact on the previous prediction result, preserving important semantic information of the text. Each input also includes new tokens from the text sequence, maintaining the continuity of the recognition results. In summary, this solution can process text sequences of different lengths without changing the original GPU memory size, readily obtain the global and local features of the text sequence, and maintain the accuracy of the text sequence recognition results.

[0089] Additionally, refer to Figure 2, which shows a block diagram of a text recognition device based on a large language model provided in an embodiment of the present invention.

[0090] This invention also discloses a text recognition device 20 based on a large language model, comprising:

[0091] The acquisition module 201 is used to obtain the token sequence of the text sequence;

[0092] The determination module 202 is used to determine the dummy token for storing the attention score bias at the head of the token sequence, input the sliding window portion tokens of the optimized token sequence containing the dummy token into the large language model, and output the first predicted token;

[0093] The calculation module 203 is used to calculate the attention score of each sliding window part token relative to the first predicted token, and select a first preset number of key tokens whose attention scores are greater than the preset attention scores;

[0094] The selection module 204 is used to select a second preset number of new tokens from the token sequence to be input into the large language model;

[0095] The first input module 205 is used to input the dummy token, each key token and each newly added token as token combinations into the large language model and output the second predicted token.

[0096] The second input module 206 is used to use the second predicted token as the first predicted token, the token combination as the sliding window partial token, and return to step S3 until all the tokens in the token sequence are input into the large language model.

[0097] The output module 207 is used to output the recognition result of the text sequence based on the obtained first prediction token and second prediction token.
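The selection rule of the calculation module 203 (a first preset number of key tokens whose attention scores exceed a preset attention score) can be sketched as below. The function name, the argument names, and the choice to return the key tokens in their original window order are assumptions made for illustration; the text does not specify them.

```python
from typing import List, Sequence


def select_key_tokens(window: Sequence[int], scores: Sequence[float],
                      first_preset_number: int, preset_score: float) -> List[int]:
    """Pick up to first_preset_number tokens whose score exceeds preset_score."""
    if len(window) != len(scores):
        raise ValueError("window and scores must have the same length")
    # Keep only the positions whose attention score is above the threshold.
    eligible = [i for i, s in enumerate(scores) if s > preset_score]
    # Highest score first; ties are broken by earlier position (assumption).
    best = sorted(eligible, key=lambda i: (-scores[i], i))[:first_preset_number]
    # Return the key tokens in their original order within the window.
    return [window[i] for i in sorted(best)]
```

If fewer tokens than the first preset number exceed the threshold, this sketch simply returns the ones that do; how the device handles that case is not stated in the text.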

[0098] In one possible implementation, the determination module 202 determines a dummy token at the head of the token sequence for storing the attention score bias, specifically as follows:

[0099] The operation process of a large language model is determined, which includes the training process and the inference process.

[0100] The dummy token is determined based on the running process.

[0101] In one possible implementation, the operation process of the large language model is determined, wherein the operation process includes a training process and an inference process, specifically as follows:

[0102] When the process is a training process, a dummy token is added to the beginning of the token sequence;

[0103] When the process is an inference process, starting from the beginning of the token sequence, a third preset number of tokens are selected as dummy tokens.
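The two cases above can be sketched as follows. This is an illustrative sketch only: `DUMMY_TOKEN_ID`, the function name, the string values `"training"` and `"inference"`, and the default of 4 are hypothetical, and the upper bound of 5 follows the one possible implementation mentioned in the text rather than a general requirement.

```python
from typing import List, Sequence, Tuple

DUMMY_TOKEN_ID = -1  # hypothetical id reserved for the added dummy token


def determine_dummy_tokens(tokens: Sequence[int], process: str,
                           third_preset_number: int = 4) -> Tuple[List[int], List[int]]:
    """Return (dummy tokens, token sequence to feed the model)."""
    if process == "training":
        # Training: add one dedicated dummy token at the beginning of the sequence.
        return [DUMMY_TOKEN_ID], [DUMMY_TOKEN_ID] + list(tokens)
    if process == "inference":
        # Inference: reuse the first few tokens of the sequence as dummy tokens.
        if not 1 <= third_preset_number <= 5:
            raise ValueError("third_preset_number is expected to be between 1 and 5")
        return list(tokens[:third_preset_number]), list(tokens)
    raise ValueError("process must be 'training' or 'inference'")
```

In the training case the sequence grows by one token; in the inference case the sequence is unchanged and the dummy tokens are simply its first few tokens.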

[0104] In one possible implementation, the added dummy token is equivalent to a token.

[0105] In one possible implementation, the third preset quantity is less than or equal to 5.

[0106] In one possible implementation, the total number of newly added tokens is the same as the number of tokens in the sliding window section.

[0107] In one possible implementation, the device further includes:

[0108] If the total number of newly added tokens in the sliding window portion input into the large language model is less than the sum of the number of dummy tokens, the first preset number, and the second preset number, the newly added tokens in the sliding window portion are treated as the token combination sequence.

[0109] In one possible implementation, the output module 207 is specifically used for:

[0110] Arrange the first and second predicted tokens in the output order;

[0111] The permutation result is output as the recognition result of the text sequence.

[0112] The embodiments of the present invention have the following advantages:

[0113] In this invention, a dummy token for storing the attention score bias is first determined at the head of the token sequence of the text sequence to be recognized. A first predicted token is then output based on the sliding window portion tokens. Next, the attention score of each sliding window portion token relative to the first predicted token is calculated, and the key tokens with higher scores are selected. Several new tokens are then selected from the token sequence to be input into the large language model. The dummy token, key tokens, and new tokens together replace the sliding window portion tokens and are used to generate a second predicted token. This process is repeated until the sliding window has traversed the entire text sequence. Because the number of tokens input to the large language model can be adapted to different GPU memory sizes, the method is more adaptable. Furthermore, the input retains the initial tokens of the text sequence, which prevents the recognition results from deviating from the text and improves their accuracy. Each input includes the key tokens that had a significant impact on the previous prediction result, preserving important semantic information of the text. Each input also includes new tokens from the text sequence, maintaining the continuity of the recognition results. In summary, this solution can process text sequences of different lengths without changing the original GPU memory size, readily obtain the global and local features of the text sequence, and maintain the accuracy of the text sequence recognition results.

[0114] In addition, embodiments of the present invention also provide an electronic device which, as shown in Figure 3, includes a processor 1301, a communication interface 1302, a memory 1303, and a communication bus 1304. The processor 1301, the communication interface 1302, and the memory 1303 communicate with each other through the communication bus 1304.

[0115] Memory 1303 is used to store computer programs;

[0116] The processor 1301, when executing a program stored in the memory 1303, implements a text recognition method for a large language model as described in the method embodiment.

[0117] The communication bus mentioned above can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. This communication bus can be divided into address bus, data bus, control bus, etc. For ease of illustration, only one thick line is used to represent it in the diagram, but this does not mean that there is only one bus or one type of bus.

[0118] The communication interface is used for communication between the aforementioned terminal and other devices.

[0119] The memory may include random access memory (RAM) or non-volatile memory, such as at least one disk storage device. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

[0120] The processors mentioned above can be general-purpose processors, including central processing units (CPUs), network processors (NPs), etc.; they can also be digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.

[0121] In this invention, a dummy token for storing the attention score bias is first determined at the head of the token sequence of the text sequence to be recognized. A first predicted token is then output based on the sliding window portion tokens. Next, the attention score of each sliding window portion token relative to the first predicted token is calculated, and the key tokens with higher scores are selected. Several new tokens are then selected from the token sequence to be input into the large language model. The dummy token, key tokens, and new tokens together replace the sliding window portion tokens and are used to generate a second predicted token. This process is repeated until the sliding window has traversed the entire text sequence. Because the number of tokens input to the large language model can be adapted to different GPU memory sizes, the method is more adaptable. Furthermore, the input retains the initial tokens of the text sequence, which prevents the recognition results from deviating from the text and improves their accuracy. Each input includes the key tokens that had a significant impact on the previous prediction result, preserving important semantic information of the text. Each input also includes new tokens from the text sequence, maintaining the continuity of the recognition results. In summary, this solution can process text sequences of different lengths without changing the original GPU memory size, readily obtain the global and local features of the text sequence, and maintain the accuracy of the text sequence recognition results.

[0122] As shown in Figure 4, in another embodiment of the present invention, a computer-readable storage medium 1401 is also provided, which stores instructions that, when executed on a computer, cause the computer to perform the text recognition method of the large language model described in the above embodiments.

[0123] In another embodiment of the present invention, a computer program product containing instructions is also provided, which, when run on a computer, causes the computer to execute the text recognition method of the large language model described in the above embodiments.

[0124] In the above embodiments, implementation can be achieved entirely or partially through software, hardware, firmware, or any combination thereof. When implemented using software, it can be implemented entirely or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present invention are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium can be any available medium that a computer can access or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)).

[0125] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0126] The various embodiments in this specification are described in a related manner. Similar or identical parts between embodiments can be referred to mutually. Each embodiment focuses on describing the differences from other embodiments. In particular, the system embodiments are basically similar to the method embodiments, so the description is relatively simple; relevant parts can be referred to the descriptions of the method embodiments.

[0127] The above description is merely a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention are included within the scope of protection of the present invention.

Claims

1. A text recognition method for a large language model, characterized in that the method includes:
S1: obtaining the token sequence of the text sequence;
S2: determining a dummy token at the head of the token sequence to store the attention score bias, inputting the sliding window portion tokens of the optimized token sequence containing the dummy token into the large language model, and outputting the first predicted token;
S3: calculating the attention score of each sliding window portion token relative to the first predicted token, and selecting a first preset number of key tokens whose attention scores are greater than the preset attention score;
S4: selecting a second preset number of new tokens from the token sequence to be input into the large language model;
S5: inputting the dummy token, each key token, and each new token as a token combination into the large language model, and outputting the second predicted token;
S6: using the second predicted token as the first predicted token and the token combination as the sliding window portion tokens, and returning to step S3 until all the tokens in the token sequence have been input into the large language model;
S7: outputting the recognition result of the text sequence based on the obtained first predicted tokens and second predicted tokens.

2. The text recognition method based on a large language model according to claim 1, characterized in that, in step S2, determining a dummy token at the head of the token sequence to store the attention score bias specifically includes:
S201: determining the operation process of the large language model, wherein the operation process includes a training process and an inference process;
S202: determining the dummy token according to the operation process.

3. The text recognition method based on a large language model according to claim 2, characterized in that S202 specifically includes:
S2021: when the operation process is a training process, adding the dummy token to the beginning of the token sequence;
S2022: when the operation process is an inference process, selecting, starting from the beginning of the token sequence, a third preset number of tokens as the dummy token.

4. The text recognition method based on a large language model according to claim 3, characterized in that the added dummy token is equivalent to a token.

5. The text recognition method based on a large language model according to claim 3, characterized in that the third preset number is less than or equal to 5.

6. The text recognition method based on a large language model according to claim 1, characterized in that the total number of new tokens is the same as the number of tokens in the sliding window portion.

7. The text recognition method based on a large language model according to claim 1, characterized in that, following S4, the method further includes: if the total number of new tokens in the sliding window portion input into the large language model is less than the sum of the number of dummy tokens, the first preset number, and the second preset number, using the new tokens in the sliding window portion as the token combination sequence.

8. The text recognition method based on a large language model according to claim 1, characterized in that S7 specifically includes:
S701: arranging the obtained first predicted tokens and second predicted tokens in the output order;
S702: outputting the arranged result as the recognition result of the text sequence.

9. A text recognition device for a large language model, characterized in that the device includes:
an acquisition module, used to obtain the token sequence of the text sequence;
a determination module, used to determine a dummy token for storing the attention score bias at the head of the token sequence, input the sliding window portion tokens of the optimized token sequence containing the dummy token into the large language model, and output the first predicted token;
a calculation module, used to calculate the attention score of each sliding window portion token relative to the first predicted token, and select a first preset number of key tokens whose attention scores are greater than the preset attention score;
a selection module, used to select a second preset number of new tokens from the token sequence to be input into the large language model;
a first input module, used to input the dummy token, each key token, and each new token as a token combination into the large language model, and output the second predicted token;
a second input module, used to use the second predicted token as the first predicted token and the token combination as the sliding window portion tokens, and return to step S3 until all the tokens in the token sequence have been input into the large language model;
an output module, used to output the recognition result of the text sequence based on the obtained first predicted tokens and second predicted tokens.

10. An electronic device, characterized in that it includes a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus; the memory is used to store computer programs; and when the processor executes the program stored in the memory, it implements the text recognition method of the large language model as described in any one of claims 1-8.