Long text parallel inference method and device based on linear attention

By decomposing total attention computation into local and global attention computation, the complexity and time issues of traditional linear attention mechanism models in long text processing under distributed architecture are solved, thus improving inference efficiency.

CN122242722A · Pending · Publication Date: 2026-06-19 · SHANGHAI XIYU JIZHI TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: SHANGHAI XIYU JIZHI TECH CO LTD
Filing Date: 2026-02-13
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Traditional linear attention mechanisms suffer from high computational complexity and long computation time when processing long texts in a distributed architecture, which affects inference efficiency.

Method used

The total attention computation is decomposed into local and global attention computations. By constructing a local attention decay matrix and a global attention decay factor, the local and global attention matrices are computed in parallel, reducing the complexity and time of the total attention matrix.

Benefits of technology

It improves the inference efficiency of deep learning models in long text scenarios and reduces the complexity and time of attention computation.

Figure CN122242722A_ABST

Abstract

This application relates to a method and apparatus for parallel reasoning of long texts based on linear attention. Applied to any parallel computing device in a distributed architecture, the method includes: converting a long text instruction to be processed into at least one embedded feature vector; constructing a local attention decay matrix and a global attention decay factor based on the at least one embedded feature vector and task attribute information of the long text instruction; mapping the at least one embedded feature vector to at least one core matrix based on a weight matrix; calculating the local attention matrix and the global attention matrix; constructing a total attention matrix for at least one embedded feature vector based on the local attention matrix, the global attention matrix, and linear attention; and calculating the unprocessed tokens or output text data corresponding to the long text based on task attribute information and the total attention matrix. This method can reduce the complexity of attention calculation in the long text reasoning process and shorten the computation time.

Description

Technical Field

[0001] This application relates to the field of deep learning technology, and in particular to a method and apparatus for parallel reasoning of long texts based on linear attention.

Background Technology

[0002] In artificial intelligence applications such as natural language processing, speech recognition, and video understanding, the demand for processing ultra-long sequences is becoming increasingly prominent as the length of model input sequences continues to increase. Limited by the computing power and storage resources of a single device, a distributed computing architecture is typically required, using multiple parallel devices to collaboratively complete the processing tasks of ultra-long sequences.

[0003] In traditional technologies, for linear attention mechanisms, to achieve parallel computing in a distributed architecture, long sequences are typically divided into segments of a certain length, and each segment is allocated to different parallel devices. Since global linear attention computation requires global key and value information, to ensure global consistency of the attention computation results, parallel devices usually need to communicate in a specific order, sending and receiving the required key-value information sequentially. Especially in long text reasoning, later parallel computing devices need to wait for all preceding parallel computing devices to complete the KV matrix calculation and transmit the calculated KV matrix sequentially to the target parallel computing device before they can perform computation, resulting in a long waiting time. Furthermore, due to the limitations of the attention model, the complexity of attention computation is exponentially proportional to the length of the context. Therefore, the later the parallel computing device is, the larger the amount of data it receives; that is, the computational complexity of later parallel computing devices increases rapidly in an exponential manner, affecting computational overhead, inference time, and inference accuracy, thus impacting the response efficiency of text reasoning.

[0004] Therefore, traditional techniques suffer from high computational complexity and long computation time in attention calculations.

Summary of the Invention

[0005] Therefore, it is necessary to provide a method and apparatus for parallel long text reasoning based on linear attention, which can reduce the complexity of attention calculation in the long text reasoning process based on linear attention mechanism model and shorten the calculation time, in order to address the above-mentioned technical problems.

[0006] Firstly, this application provides a long-text parallel inference method based on linear attention, applicable to any parallel computing device in a distributed architecture, including:

[0007] Obtain the long text instruction to be processed, and convert the long text instruction to be processed into at least one embedded feature vector;

[0008] Based on at least one of the embedded feature vectors and the task attribute information of the long text instruction to be processed, a local attention decay matrix and a global attention decay factor are constructed.

[0009] Obtain at least one weight matrix, and map at least one of the embedded feature vectors to at least one core matrix based on the weight matrix;

[0010] Based on linear attention, the local attention matrix is calculated according to the core matrix and the local attention decay matrix;

[0011] Based on linear attention, the global attention matrix is calculated according to the core matrix and the global attention decay factor;

[0012] Construct a total attention matrix for the at least one embedded feature vector based on the local attention matrix, the global attention matrix, and linear attention;

[0013] Based on the task attribute information and the total attention matrix, infer the unprocessed lexical units and / or task output results corresponding to the long text instruction to be processed.

[0014] Secondly, this application also provides a long-text parallel inference device based on linear attention, applicable to any parallel computing device in a distributed architecture, including:

[0015] The acquisition module is used to acquire the long text instruction to be processed and convert the long text instruction to be processed into at least one embedded feature vector.

[0016] The construction module is used to construct a local attention decay matrix and a global attention decay factor based on at least one of the embedded feature vectors and the task attribute information of the long text instruction to be processed;

[0017] A mapping module is used to obtain at least one weight matrix and map at least one embedded feature vector to at least one core matrix based on the weight matrix;

[0018] The first calculation module is used to calculate the local attention matrix based on linear attention, according to the core matrix and the local attention decay matrix;

[0019] The second calculation module is used to calculate the global attention matrix based on linear attention, according to the core matrix and the global attention decay factor;

[0020] The construction module is used to construct at least one total attention matrix for the embedded feature vector based on the local attention matrix, the global attention matrix, and the linear attention.

[0021] The third calculation module infers the unprocessed lexical units and / or task output results corresponding to the long text instruction to be processed based on the task attribute information and the total attention matrix.

[0022] The aforementioned parallel inference method and apparatus for long text based on linear attention, leveraging the computational characteristics of linear attention, decomposes the total attention calculation into local attention calculation and global attention calculation. For each parallel device, the long text instruction to be processed is converted into at least one embedded feature vector. Based on this embedded feature vector and the task attribute information of the long text instruction, a local attention decay matrix and a global attention decay factor are constructed. The embedded feature vector is then mapped to at least one core matrix based on a weight matrix. Thus, based on linear attention, each parallel computing device can compute the local attention matrix in parallel according to the core matrix and the local attention decay matrix. Based on linear attention, the global attention matrix of the overall instruction is calculated according to the core matrix and the global attention decay factor. A total attention matrix for the at least one embedded feature vector is constructed based on the local attention matrix, the global attention matrix, and linear attention. Finally, based on the task attribute information of the long text instruction and the total attention matrix, the corresponding unprocessed lexical units and / or task output results are calculated. By computing local attention in parallel, the complexity and time of the total attention matrix calculation are reduced, thereby improving the inference efficiency of deep learning models based on linear attention in long text scenarios.

Attached Figure Description

[0023] To more clearly illustrate the technical solutions in the embodiments of this application or related technologies, the drawings used in the description of the embodiments of this application or related technologies will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0024] Figure 1 is a flowchart illustrating a long text parallel inference method based on linear attention in one embodiment;

[0025] Figure 2 is a flowchart illustrating a long text parallel inference method based on linear attention in another embodiment;

[0026] Figure 3 is a flowchart illustrating a long text parallel inference method based on linear attention in another embodiment;

[0027] Figure 4 is a block diagram of a long text parallel inference device based on linear attention in one embodiment;

[0028] Figure 5 is an internal structural diagram of a computer device in one embodiment.

Detailed Implementation

[0029] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0030] It should be noted that the terms "first," "second," etc., used in this application can be used to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish the first element from the second element. The terms "comprising" and "having," and any variations thereof, used in this application, are intended to cover non-exclusive inclusion. The term "multiple" used in this application refers to two or more. The term "and / or" used in this application refers to one of the embodiments, or any combination of multiple embodiments.

[0031] In one embodiment, as shown in Figure 1, a parallel inference method for long texts based on linear attention is provided. This embodiment illustrates the method by applying it to any parallel computing device in a distributed architecture. In this embodiment, the method includes the following steps:

[0032] S201, Obtain the long text instruction to be processed, and convert the long text instruction to be processed into at least one embedded feature vector.

[0033] In this embodiment, the long text instruction to be processed refers to a continuous input sequence for which the GPU memory and computing power of a single computing device cannot efficiently complete a full attention operation when the deep learning model performs linear attention calculations. The length of the long text instruction and the key-value cache (KV cache) used to infer it exceed the single-pass processing threshold of a single device for core operations such as linear attention KV aggregation and feature mapping. The definition of long text length changes with the development of artificial intelligence technology. It typically refers to a text length that is close to or exceeds the context window length set during training of the artificial intelligence model. For example, in 2025, text with a token count greater than or equal to 32k was generally considered long text. In extreme cases, the length of long text can reach 128k, 200k, or even 1M tokens or more.

[0034] For example, this embodiment converts the original long text instruction to be processed into a continuous vector of a specific dimension that can be efficiently processed by any parallel computing device in a distributed architecture. In other words, the embedded feature vector can be obtained by performing an embedding transformation on the long text instruction to be processed. Preferably, the long text instruction to be processed is first segmented into a sequence of lexical units, and the embedded feature vector is then obtained by performing an embedding transformation on that sequence. The instruction and / or the individual lexical units can be expressed in natural language.

[0035] As an example, the embedded feature vector $X$ has dimensions $(L, H, d)$, where $L$ indicates the sequence length of the long text instruction to be processed, $H$ indicates the number of attention heads, and $d$ represents the dimension of each attention head. $X$ can be transposed so that its shape is adjusted from $(L, H, d)$ to $(H, L, d)$, moving the attention head dimension to the front. Within a parallel computing framework, placing the dimension that needs to be processed independently at the front allows it to be managed as a batch dimension. The computation of each attention head can then be performed in parallel naturally, and the embedded feature vector exhibits high spatial locality, which improves the efficiency of data access. Different frameworks provide different operations for this transpose. For instance, in PyTorch, the `permute()` method can be used to perform dimension permutation; in TensorFlow, `tf.transpose()` can be used, specifying the dimension order through the `perm` parameter; and in NumPy, the `transpose()` method can be used to perform dimension permutation.
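The heads-first reshaping described above can be sketched in NumPy as follows. This is only an illustration: the layout (sequence length, number of heads, head dimension), the sizes, and the helper name `heads_first` are assumptions made for the example and are not taken from the application.

```python
import numpy as np

def heads_first(x):
    """Move the attention-head axis to the front: (L, H, d) -> (H, L, d).

    np.transpose only changes strides; np.ascontiguousarray then copies the
    data so that each head occupies one contiguous block of memory.
    """
    return np.ascontiguousarray(np.transpose(x, (1, 0, 2)))

# Illustrative sizes (assumed): 6 positions, 2 heads, head dimension 4.
L, H, d = 6, 2, 4
x = np.arange(L * H * d, dtype=np.float32).reshape(L, H, d)
x_heads = heads_first(x)
```

After the transpose, `x_heads[h]` holds the whole sequence for head `h`, so the head axis can be treated like a batch axis.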

[0036] Furthermore, as an optional implementation, the embedded feature vector can be divided along the sequence-length direction into N segments of a fixed segment length, and the segments are then distributed sequentially across N parallel computing devices, where N is the number of parallel devices and the i-th segment is the input allocated to the i-th parallel computing device. It should be noted that if the last block is shorter than the segment length, it is padded to the segment length to ensure that all blocks have a uniform computational scale. Here, the padding operation does not introduce additional values that require computational overhead; for example, the padding values can all be 0.
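As a minimal sketch of the segmentation and zero padding described above, the following NumPy helper splits a sequence into equal-length blocks. The function name `split_into_blocks` and the example sizes are hypothetical; the distribution of blocks to devices is not modelled here.

```python
import numpy as np

def split_into_blocks(x, block_len):
    """Split x along axis 0 into blocks of block_len rows, zero-padding the last block."""
    seq_len = x.shape[0]
    n_blocks = -(-seq_len // block_len)          # ceiling division
    pad_rows = n_blocks * block_len - seq_len
    pad_width = [(0, pad_rows)] + [(0, 0)] * (x.ndim - 1)
    padded = np.pad(x, pad_width)                # default mode pads with zeros
    return padded.reshape(n_blocks, block_len, *x.shape[1:])

# Assumed example: 10 positions with feature size 4, segment length 4.
x = np.ones((10, 4))
blocks = split_into_blocks(x, block_len=4)       # blocks[i] would go to device i
```

With 10 positions and a segment length of 4, three blocks are produced and the last two rows of the final block are zeros.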

[0037] S202, based on at least one embedded feature vector and task attribute information of the long text instruction to be processed, construct a local attention decay matrix and a global attention decay factor.

[0038] For example, the task attribute information of the long text instruction to be processed can include causal tasks and non-causal tasks. Causal tasks refer to tasks that follow a strict unidirectional time sequence or logical sequence. In these tasks, the deduction of the current state depends only on historical information and the current input, and it is strictly forbidden to obtain future prediction information; the core characteristic is unidirectional dependency. Non-causal tasks refer to tasks whose processing is not restricted by linear time sequence, allowing the use of global preceding and following context information when calculating the representation of the current position; their core characteristic is bidirectional correlation. As an optional implementation, the task attribute information of the long text instruction to be processed is related to the instruction objective. For example, generative instructions typically correspond to causal tasks, while discriminative instructions typically correspond to non-causal tasks.

[0039] As an optional implementation, the process of constructing the local attention decay matrix may include:

[0040] A local attention decay matrix is constructed based on the distance between the position of the first lexical unit currently being calculated and the position of the second lexical unit being attended to. Specifically, if the task attribute information of the long text instruction to be processed is a causal task, then the second lexical unit is located before the first lexical unit. As an example, if the task attribute information is a causal task, the local attention decay matrix $M$ is a lower triangular matrix with dimension [value missing]: a mask is applied, and decay is performed with weight $\lambda$ according to the distance between the position $i$ of the first lexical unit and the position $j$ of the second lexical unit. When $i \ge j$, $M_{ij} = \lambda^{i-j}$; when $i < j$, $M_{ij} = 0$. If the task attribute information is a non-causal task, $M$ only performs distance-based decay with weight $\lambda$, without the mask. That is, for $i \ge j$, $M_{ij} = \lambda^{i-j}$; for $i < j$, $M_{ij} = \lambda^{j-i}$. In other words, local attention refers to the attention results among the lexical units within each parallel computing device's own segment. If the weight is 1, attention does not decrease as the distance increases, and all positions correspond to the same weight. The closer the weight is to 1, the more information the deep learning model can remember, either forwards or backwards, resulting in stronger memory capability. Additionally, it should be noted that when [symbol missing] is not equal to 1, it is usually slightly larger [value missing].
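The construction of the local attention decay matrix can be illustrated with a short NumPy sketch. It assumes that the decay weight is raised to the power of the distance between the two positions, which is one natural reading of the description above; the function name `local_decay_matrix` and the example weight 0.5 are chosen only for illustration.

```python
import numpy as np

def local_decay_matrix(block_len, weight, causal):
    """Distance-based decay matrix for one block.

    Entry (i, j) is weight ** |i - j|. For causal tasks, entries with j > i
    (positions after the one being calculated) are masked to 0, which gives
    a lower triangular matrix.
    """
    idx = np.arange(block_len)
    dist = idx[:, None] - idx[None, :]            # i - j
    m = float(weight) ** np.abs(dist)
    if causal:
        m = np.where(dist >= 0, m, 0.0)
    return m

m_causal = local_decay_matrix(3, 0.5, causal=True)
m_noncausal = local_decay_matrix(3, 0.5, causal=False)
```

Setting the weight to 1 in the non-causal case yields an all-ones matrix, matching the remark that attention then no longer decreases with distance.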

[0041] As an optional implementation, the process of constructing the global attention decay factor may include:

[0042] If the task attribute information of the long text instruction to be processed is a causal task, then the global attention decay factor is determined based on the first decay coefficient and the exponential function value of the first value; the first value is the position index of the lexical unit currently being calculated plus 1. If the task attribute information of the long text instruction to be processed is a non-causal task, then the forward global attention decay factor is determined based on the second decay coefficient and the exponential function of the second value, and the backward global attention decay factor is determined based on the second decay coefficient and the exponential function of the third value; the second value is the position index of the lexical unit currently being calculated plus 1, and the third value is the difference between the number of all parallel computing devices and the position index of the lexical unit currently being calculated. For example, whether the task is causal only affects how the decay factor is calculated. In the causal case, the global attention decay factor is computed from the first decay coefficient and the position of the lexical unit currently being calculated [formula missing]. In the non-causal case, the factor is divided into a forward global attention decay factor and a backward global attention decay factor, each computed from the second decay coefficient, the position of the lexical unit being calculated, and the total number of parallel computing devices [formulas missing]. The decay coefficients above are all preset hyperparameters. If the above decay coefficients are 1, the attenuation matrix is an all-ones matrix, which means that attenuation no longer occurs with distance.
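Because the exact formulas are not legible here, the following is only one possible reading of the description above: the decay coefficient is raised to the power of the "first", "second", or "third" value. The function names and this power form are assumptions for illustration, not the formulas of the application.

```python
def causal_decay_factor(coeff, index):
    """Assumed form: first decay coefficient raised to (position index + 1)."""
    return coeff ** (index + 1)

def noncausal_decay_factors(coeff, index, n_devices):
    """Assumed forms of the forward and backward factors.

    forward  uses the second value: index + 1
    backward uses the third value:  n_devices - index
    """
    forward = coeff ** (index + 1)
    backward = coeff ** (n_devices - index)
    return forward, backward

factor = causal_decay_factor(0.9, 0)
fwd, bwd = noncausal_decay_factors(0.9, 2, 8)
```

Under this reading, a coefficient of 1 gives a factor of 1 for every index, so no attenuation with distance occurs.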

[0043] It is important to understand that a complete long text instruction to be processed is treated as a single, complete task. This task applies either the non-causal setting to both local and global attention, or the causal setting to both. Furthermore, the type of attention used is predetermined based on the task type; local and global attention calculations cannot be applied to causal and non-causal scenarios respectively. If the long text instruction to be processed is causal, the decay coefficient used when calculating local attention and the decay coefficient used when calculating global attention must be equal; if the long text instruction to be processed is non-causal, the decay coefficient used when calculating local attention and the decay coefficient used when calculating global attention must likewise be equal. If they are not equal, local attention and global attention for the same task may become disconnected.

[0044] It should be noted that even in non-causal scenarios, the values of the above decay coefficients are usually not 1; that is, they all lie within (0,1). Preferably, their range is set to [range missing], which helps to achieve a balance between global context awareness and local semantic refinement. A value of 1 represents the most basic linear attention pattern, but it is not the optimal solution: it means that all content has the same weight regardless of distance, which may make it difficult for the model to capture the key points of attention calculation, and the amount of data to be calculated becomes too large, which may lead to data overflow.

[0045] S203, obtain at least one weight matrix, and map at least one embedded feature vector to at least one core matrix based on the weight matrix.

[0046] For example, the at least one weight matrix may include a query weight matrix $W_Q$, a key weight matrix $W_K$, and a value weight matrix $W_V$, all with dimensions $D_{model} \times D_{model}$. Optionally, during the training phase, three $D_{model} \times D_{model}$ weight matrices $W_Q$, $W_K$, and $W_V$ can be initialized, and their final values are then obtained through the training of the deep learning model. For example, in each round of training, the embedded feature vectors are multiplied by the current weights to obtain the output, the loss for this round is calculated based on the output, the gradients of the loss with respect to $W_Q$, $W_K$, and $W_V$ are calculated, and the optimizer then updates the weights based on the gradients until the requirements for training completion are met.

[0047] As an example, the process of mapping at least one embedded feature vector to at least one core matrix based on at least one weight matrix may include:

[0048] 1) Based on the product of the query weight matrix $W_Q$ and the embedded feature vector $X$, the query matrix is obtained: $Q = XW_Q$;

[0049] 2) Based on the product of the key weight matrix $W_K$ and the embedded feature vector $X$, the key matrix is obtained: $K = XW_K$;

[0050] 3) Based on the product of the value weight matrix $W_V$ and the embedded feature vector $X$, the value matrix is obtained: $V = XW_V$.

[0051] It should be noted that the query matrix, the key matrix, and the value matrix all have dimensions [value missing], which can also be expressed as [value missing]. The weight matrices typically do not need to be sharded according to the number of parallel devices N, because every parallel device uses the same weight matrices.
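The mapping of the embedded feature vectors to the three core matrices can be sketched as three matrix products. The random weights, the sizes, and the helper name `project` are illustrative assumptions; in practice the weights would come from training, and every device would hold the same copies.

```python
import numpy as np

def project(x, w_q, w_k, w_v):
    """Map one block of embedded feature vectors to query, key and value matrices."""
    return x @ w_q, x @ w_k, x @ w_v

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                        # assumed illustrative sizes
x = rng.standard_normal((seq_len, d_model))    # one device's block of embeddings
# Square (d_model x d_model) weights, shared unchanged by every device.
w_q, w_k, w_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))
q, k, v = project(x, w_q, w_k, w_v)
```

Because the weights are square, each of `q`, `k`, and `v` has the same shape as the input block.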

[0052] S204, based on linear attention, calculates the local attention matrix according to the core matrix and the local attention decay matrix.

[0053] As an example, as shown in Figure 2, the above S204 may include:

[0054] S301, based on the product of the query matrix and the transpose of the key matrix, the first matrix is obtained.

[0055] In this embodiment, the first matrix can be obtained as the product of the query matrix $Q$ and the transpose of the key matrix $K$, i.e., the first matrix can be $QK^{\top}$.

[0056] S302, multiply the first matrix element-wise with the local attention decay matrix to obtain the second matrix.

[0057] In this embodiment, the second matrix can be obtained by element-wise multiplication of the first matrix and the local attention decay matrix $M$, i.e., the second matrix can be $(QK^{\top}) \odot M$.

[0058] S303, the product of the second matrix and the value matrix is determined as the local attention matrix.

[0059] For example, in this embodiment, the product of the above second matrix and the value matrix $V$ can be determined as the local attention matrix, i.e., the local attention matrix is $\left((QK^{\top}) \odot M\right)V$.
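Steps S301 to S303 can be written as a three-line NumPy function. This is a minimal single-head sketch for one device's block; the function name `local_attention`, the sizes, and the use of a plain lower triangular mask (decay weight 1) in the example are assumptions for illustration.

```python
import numpy as np

def local_attention(q, k, v, decay):
    """Local attention for one block, following S301 to S303."""
    first = q @ k.T             # S301: query times transposed key
    second = first * decay      # S302: element-wise decay / mask
    return second @ v           # S303: times value matrix

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 3)) for _ in range(3))
causal_mask = np.tril(np.ones((4, 4)))    # causal masking with decay weight 1
out = local_attention(q, k, v, causal_mask)
```

With an all-ones decay matrix the result coincides with plain linear attention inside the block, `q @ (k.T @ v)`, which is a quick sanity check on the ordering of the products.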

[0060] S205, based on linear attention, calculates the global attention matrix according to the core matrix and the global attention decay factor.

[0061] As an example, as shown in Figure 3, the above S205 may include:

[0062] S401, calculate the first prefix sum matrix based on task attribute information, key matrix, value matrix and global attention decay factor.

[0063] As an example, in this embodiment, the product of the transposed key matrix and the value matrix, $K^{\top}V$, is determined as the third matrix, and the product of the third matrix and the global attention decay factor is determined as the fourth matrix. If the task attribute information of the long text instruction to be processed is a causal task, then the sum of the fourth matrices corresponding to all parallel computing devices preceding the parallel computing device used in this embodiment and the fourth matrix of the parallel computing device used in this embodiment is determined as the first prefix sum matrix. Otherwise, the fourth matrix of the parallel computing device used in this embodiment is used as the first prefix sum matrix. That is, for causal tasks, the first prefix sum matrix is the cumulative local prefix sum matrix. The first prefix sum matrix has dimension [value missing]. It should be noted that the third matrix is calculated using the outer product, and that when calculating the first prefix sum matrix, the third matrix is calculated first and then multiplied by the decay factor. In causal tasks, a chained approach using local prefix sums can be employed; the complexity of chained multiplication is O(n log n). For non-causal tasks, only the local prefix sum needs to be calculated.
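The first prefix sum matrices can be sketched by simulating all devices in one process. The helper name `first_prefix_sums`, the per-device decay factors, and the single-head shapes are assumptions for the example; a real deployment would compute each device's matrix on that device.

```python
import numpy as np

def first_prefix_sums(ks, vs, factors, causal):
    """First prefix sum matrix of every device.

    ks, vs  : lists with one (block_len, d) key / value block per device
    factors : one global decay factor per device (assumed scalar)
    """
    # Fourth matrix: decay factor times the third matrix K^T V.
    fourth = [f * (k.T @ v) for k, v, f in zip(ks, vs, factors)]
    if not causal:
        return fourth                        # non-causal: the device's own matrix
    # Causal: running sum over this device and all preceding devices.
    return list(np.cumsum(np.stack(fourth), axis=0))

rng = np.random.default_rng(0)
ks = [rng.standard_normal((2, 3)) for _ in range(3)]
vs = [rng.standard_normal((2, 3)) for _ in range(3)]
factors = [0.9, 0.81, 0.729]                 # illustrative values only
causal_sums = first_prefix_sums(ks, vs, factors, causal=True)
noncausal_sums = first_prefix_sums(ks, vs, factors, causal=False)
```

Each result is a small square matrix whose size depends only on the feature dimension, not on the block length.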

[0064] S402, for each parallel computing device, broadcast the first prefix sum matrix to other parallel computing devices in the distributed architecture, and determine the second prefix sum matrix based on the first prefix sum matrix sent by the other parallel computing devices and the task attribute information.

[0065] For example, each parallel computing device can perform a single communication with all parallel computing devices, broadcasting its local first prefix sum matrix to all other parallel computing devices. As an example, this communication can be an AllGather or a logarithmic collective communication, with AllGather taking [time value missing].

[0066] The determination of the second prefix sum matrix can include: if the task attribute information is a causal task, the sum of the first prefix sum matrices corresponding to all parallel computing devices preceding the current parallel computing device is determined as the second prefix sum matrix; otherwise, the sum of the first prefix sum matrices corresponding to all other parallel computing devices is determined as the second prefix sum matrix. As an example, depending on whether the task attribute information corresponding to the current long text instruction is a causal or non-causal task, each parallel device obtains the first prefix sum matrices required to compute the global attention corresponding to that parallel computing device. For causal tasks, the i-th parallel computing device obtains the first prefix sum matrices output by the 1st to (i-1)-th parallel computing devices to obtain the second prefix sum matrix. For example, if there are a total of 8 parallel computing devices, the 5th parallel computing device already stores the first prefix sum matrix corresponding to the 5th parallel computing device. Therefore, it is only necessary to obtain the first prefix sum matrices calculated by the 1st to 4th parallel computing devices. The sum of the first prefix sums calculated by the 1st to 5th parallel computing devices is the second prefix sum of the 5th parallel computing device. For non-causal tasks, each parallel device obtains the first prefix sums output by all other parallel computing devices to obtain the second prefix sum.
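The selection rule for the second prefix sum matrix can be sketched by treating the result of the all-to-all exchange as a plain Python list. The function name `second_prefix_sum`, the zero-based `rank`, and the `include_self` flag are hypothetical; the flag exists only because the general rule above sums over preceding devices while the 8-device example also adds the device's own matrix.

```python
import numpy as np

def second_prefix_sum(gathered, rank, causal, include_self=False):
    """Combine gathered first prefix sum matrices for the device at `rank`.

    gathered : list of matrices, one per device, as seen after an AllGather
    rank     : zero-based index of the current device
    """
    if causal:
        end = rank + 1 if include_self else rank
        picked = gathered[:end]                      # preceding devices only
    else:
        picked = [m for j, m in enumerate(gathered)
                  if include_self or j != rank]      # all other devices
    total = np.zeros_like(gathered[0])
    for m in picked:
        total = total + m
    return total

# Simulated exchange among 8 devices; device j contributes a matrix filled with j + 1.
gathered = [np.full((2, 2), j + 1.0) for j in range(8)]
s_causal = second_prefix_sum(gathered, rank=4, causal=True)
s_noncausal = second_prefix_sum(gathered, rank=4, causal=False)
```

For the 5th device (`rank=4`), the causal result sums the contributions of devices 1 to 4, and the non-causal result sums every device except the 5th.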

[0067] S403, calculate the global attention matrix based on task attribute information, global attention decay factor, query vector, and second prefix sum matrix.

[0068] For example, in this embodiment, if the task attribute information of the long text instruction to be processed is a causal task, then the global attention matrix is calculated based on the product of the global attention decay factor, the query vector, and the second prefix sum matrix, i.e., the global attention matrix can be calculated based on the formula $O_{global} = \Lambda \, Q \, S$, where $O_{global}$ represents the global attention matrix, $\Lambda$ represents the global attention decay factor, $Q$ represents the query vector, and $S$ represents the second prefix sum matrix. If the task attribute information of the long text instruction to be processed is a non-causal task, then the global attention matrix is calculated based on the sum of the first attention matrix and the second attention matrix. The first attention matrix is the product of the forward global attention decay factor, the query vector, and the second prefix sum matrix; the second attention matrix is the product of the backward global attention decay factor, the query vector, and the second prefix sum matrix. That is, the global attention matrix can be calculated based on the formula $O_{global} = \Lambda_{f} \, Q \, S + \Lambda_{b} \, Q \, S$, where $O_{global}$ represents the global attention matrix, $\Lambda_{f}$ represents the forward global attention decay factor, $Q$ represents the query vector, $S$ represents the second prefix sum matrix, and $\Lambda_{b}$ represents the backward global attention decay factor.
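The two cases of the global attention calculation can be sketched as follows. The function names are hypothetical, and the decay factors are assumed to be scalars applied to the whole block, which the description does not state explicitly.

```python
import numpy as np

def global_attention_causal(q, second_prefix, factor):
    """Causal case: decay factor times query times second prefix sum matrix."""
    return factor * (q @ second_prefix)

def global_attention_noncausal(q, second_prefix, fwd_factor, bwd_factor):
    """Non-causal case: sum of the forward and backward attention matrices."""
    first = fwd_factor * (q @ second_prefix)     # first attention matrix
    second = bwd_factor * (q @ second_prefix)    # second attention matrix
    return first + second

q = np.ones((2, 3))
s = np.eye(3)                                    # stand-in second prefix sum matrix
o_causal = global_attention_causal(q, s, 0.5)
o_noncausal = global_attention_noncausal(q, s, 0.5, 0.25)
```

Note that the query block multiplies a small feature-by-feature matrix, so the cost per device does not grow with the length of the text held by other devices.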

[0069] S206, construct at least one total attention matrix for embedding feature vectors based on local attention matrix, global attention matrix and linear attention.

[0070] In this embodiment, since the method proposed in this application can be applied to all linear attention models, as an optional implementation, the sum of the local attention matrix and the global attention matrix can be used as the total attention matrix. The matrices involved all have the same dimensions [value missing].
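The decomposition into local and global parts can be checked numerically in the simplest setting. The sketch below simulates several devices in one process and sets every decay weight to 1 (an assumption made purely so that the result can be compared against undecomposed linear attention); the function name `block_outputs` and the sizes are illustrative.

```python
import numpy as np

def block_outputs(q, k, v, n_devices, causal):
    """Per-device total attention = local + global, with all decay weights equal to 1."""
    qs, ks, vs = (np.split(a, n_devices) for a in (q, k, v))
    kv = [kb.T @ vb for kb, vb in zip(ks, vs)]        # per-device K^T V
    outs = []
    for i in range(n_devices):
        scores = qs[i] @ ks[i].T
        if causal:
            scores = np.tril(scores)                  # mask later positions in the block
        local = scores @ vs[i]
        # Causal: preceding devices only. Non-causal: all other devices.
        others = kv[:i] if causal else kv[:i] + kv[i + 1:]
        state = sum(others, np.zeros_like(kv[0]))
        outs.append(local + qs[i] @ state)            # total attention for block i
    return np.concatenate(outs)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
total_noncausal = block_outputs(q, k, v, n_devices=4, causal=False)
total_causal = block_outputs(q, k, v, n_devices=4, causal=True)
```

In this no-decay setting the non-causal result equals `q @ (k.T @ v)` and the causal result equals `np.tril(q @ k.T) @ v`, i.e. the block-wise sum reproduces linear attention over the full sequence.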

[0071] S207, based on task attribute information and total attention matrix, infer the unprocessed lexical units and / or task output results corresponding to the long text instruction to be processed.

[0072] In this embodiment, for example, after the total attention matrix of the embedded feature vectors input to the deep learning model has been determined, the following steps can be performed: 1) symmetrically to the embedded feature vector shaping step, the attention calculation result is adjusted back to its original form; 2) the data is passed into the residual connection layer: to ensure that the data is not lost or attenuated during transmission and calculation, the original block-divided data is combined with the attention result of each block; 3) the calculation results of the residual connection layer are fed into a neural network structure that includes, in sequence, at least one fully connected layer and at least one linear layer, and the probability distribution produced by the linear layer is mapped to the output. For causal tasks, the output is the probability distribution of the next position to be predicted, or the task output result corresponding to the long text instruction to be processed. For non-causal tasks, the output is the entire content indicated by the long text instruction to be processed, that is, the task output result corresponding to the long text instruction to be processed.
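
Steps 2) and 3) above can be sketched as follows for the causal case. This is a simplified sketch under stated assumptions: the residual connection is taken to be an element-wise addition, the fully connected layer is assumed to use a ReLU activation, and the probability distribution is assumed to come from a softmax. The names `next_token_distribution`, `w_fc`, and `w_out` are hypothetical.

```python
import math


def linear(x, w):
    """Matrix product x @ w for matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def next_token_distribution(x, attn, w_fc, w_out):
    # Step 2: residual connection (assumed to be an element-wise addition).
    h = [[a + b for a, b in zip(rx, ra)] for rx, ra in zip(x, attn)]
    # Step 3: fully connected layer (ReLU assumed), then a linear layer.
    hidden = [[max(0.0, v) for v in row] for row in linear(h, w_fc)]
    logits = linear(hidden, w_out)
    # Causal task: distribution for the next position, read from the last row.
    return softmax(logits[-1])
```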

[0073] The aforementioned parallel inference method for long texts based on linear attention leverages the computational characteristics of linear attention by decomposing the total attention computation into a local attention computation and a global attention computation. On each parallel computing device, the long text instruction to be processed is converted into at least one embedded feature vector. Based on the at least one embedded feature vector and the task attribute information of the long text instruction, a local attention decay matrix and a global attention decay factor are constructed. The at least one embedded feature vector is then mapped to at least one core matrix based on a weight matrix. Based on linear attention, each parallel computing device can thus compute its local attention matrix in parallel according to the core matrix and the local attention decay matrix, and the global attention matrix of the overall instruction is calculated according to the core matrix and the global attention decay factor. Finally, a total attention matrix for the at least one embedded feature vector is constructed based on the local attention matrix, the global attention matrix, and linear attention, and the corresponding unprocessed lexical units and/or task output results are inferred based on the task attribute information of the long text instruction and the total attention matrix. By parallelizing the local attention computation, the complexity and time of computing the total attention matrix are reduced, thereby improving the inference efficiency of deep learning models based on linear attention in long text scenarios.

[0074] It should be understood that although the steps in the flowcharts of the above embodiments are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the above embodiments may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages in other steps. It is understood that the steps in different embodiments can be freely combined as needed, and all non-contradictory solutions formed by such combinations are within the scope of protection of this application.

[0075] Based on the same inventive concept, this application also provides a linear attention-based long text parallel reasoning apparatus for implementing the above-described linear attention-based long text parallel reasoning method. The solution provided by this apparatus is similar to the implementation described in the above method; therefore, the specific limitations of one or more linear attention-based long text parallel reasoning apparatus embodiments provided below can be found in the limitations of the linear attention-based long text parallel reasoning method described above, and will not be repeated here.

[0076] In one exemplary embodiment, as shown in Figure 4, a long text parallel inference device based on linear attention is provided, applicable to any parallel computing device in a distributed architecture, including: an acquisition module 10, a construction module 11, a mapping module 12, a first calculation module 13, a second calculation module 14, a construction module 15, and a third calculation module 16, wherein:

[0077] The acquisition module 10 is used to acquire the long text instruction to be processed and convert the long text instruction to be processed into at least one embedded feature vector.

[0078] The construction module 11 is used to construct a local attention decay matrix and a global attention decay factor based on the at least one embedded feature vector and the task attribute information of the long text instruction to be processed.

[0079] The mapping module 12 is used to obtain at least one weight matrix and map at least one embedded feature vector to at least one core matrix based on the weight matrix.

[0080] The first calculation module 13 is used to calculate the local attention matrix based on linear attention, according to the core matrix and the local attention decay matrix.

[0081] The second calculation module 14 is used to calculate the global attention matrix based on linear attention, according to the core matrix and the global attention decay factor.

[0082] The construction module 15 is used to construct a total attention matrix for the at least one embedded feature vector based on the local attention matrix, the global attention matrix, and linear attention.

[0083] The third calculation module 16 is used to infer the unprocessed lexical units and/or task output results corresponding to the long text instruction to be processed based on the task attribute information and the total attention matrix.

[0084] The long text parallel inference device based on linear attention provided in this embodiment can execute the above method embodiment. Its implementation principle and technical effect are similar, and will not be repeated here.

[0085] Based on the above embodiments, optionally, the weight matrix includes a query weight matrix, a key weight matrix, and a value weight matrix; the mapping module 12 includes: a first mapping unit, a second mapping unit, and a third mapping unit, wherein:

[0086] The first mapping unit is used to obtain the query matrix based on the product of the query weight matrix and the embedded feature vector.

[0087] The second mapping unit is used to obtain the key matrix based on the product of the key weight matrix and the embedded feature vector.

[0088] The third mapping unit is used to obtain the value matrix based on the product of the value weight matrix and the embedded feature vector.
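
The three mapping units described above can be sketched as follows. This is an illustrative sketch that assumes a row-vector convention, in which the embedded feature vectors are stacked as the rows of `x` and multiplied on the right by each weight matrix; the name `project_qkv` is hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def project_qkv(x, w_q, w_k, w_v):
    """Map the embedded feature vectors to query, key, and value matrices."""
    q = matmul(x, w_q)  # first mapping unit: query matrix
    k = matmul(x, w_k)  # second mapping unit: key matrix
    v = matmul(x, w_v)  # third mapping unit: value matrix
    return q, k, v
```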

[0089] The long text parallel inference device based on linear attention provided in this embodiment can execute the above method embodiment. Its implementation principle and technical effect are similar, and will not be repeated here.

[0090] Based on the above embodiments, optionally, the first calculation module 13 includes: a first calculation unit, a second calculation unit, and a determination unit, wherein:

[0091] The first calculation unit is used to obtain the first matrix based on the product of the query matrix and the transpose of the key matrix.

[0092] The second calculation unit is used to multiply the first matrix element-wise with the local attention decay matrix to obtain the second matrix.

[0093] The determination unit is used to determine the product of the second matrix and the value matrix as the local attention matrix.
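
The three units above correspond to the computation (Q K^T ⊙ D) V, where D is the local attention decay matrix. A minimal sketch follows; the function name `local_attention` is hypothetical, and matrices are represented as lists of rows for illustration.

```python
def matmul(a, b):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def local_attention(q, k, v, decay):
    k_t = [list(col) for col in zip(*k)]           # transpose of the key matrix
    first = matmul(q, k_t)                         # first matrix: Q K^T
    second = [[s * d for s, d in zip(rs, rd)]      # second matrix: element-wise
              for rs, rd in zip(first, decay)]     # product with the decay matrix
    return matmul(second, v)                       # local attention matrix
```

Because the decay matrix is applied element-wise before the multiplication by the value matrix, a zero entry in it removes the corresponding position from the result.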

[0094] The long text parallel inference device based on linear attention provided in this embodiment can execute the above method embodiment. Its implementation principle and technical effect are similar, and will not be repeated here.

[0095] Based on the above embodiments, optionally, the above-mentioned construction module 11 includes: a first construction unit, wherein:

[0096] The first construction unit is used to construct the local attention decay matrix based on the distance between the position of the first processing word, which is the word currently being calculated, and the position of the second processing word, which is the word being attended to;

[0097] If the task attribute information is a causal task, the position of the second processing word precedes the position of the first processing word.
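
One way such a distance-based decay matrix could be built is sketched below. The exponential form `decay ** distance` and the choice to let a position attend to itself in the causal case are assumptions made for illustration; the name `local_decay_matrix` is hypothetical.

```python
def local_decay_matrix(block_size, decay, causal):
    """Build a block_size x block_size decay matrix from position distances."""
    matrix = []
    for i in range(block_size):        # i: position currently being calculated
        row = []
        for j in range(block_size):    # j: position being attended to
            if causal and j > i:
                row.append(0.0)        # causal task: later positions are masked
            else:
                row.append(decay ** abs(i - j))
        matrix.append(row)
    return matrix
```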

[0098] The long text parallel inference device based on linear attention provided in this embodiment can execute the above method embodiment. Its implementation principle and technical effect are similar, and will not be repeated here.

[0099] Based on the above embodiments, optionally, the second calculation module 14 includes: a third calculation unit, a determination unit, and a fourth calculation unit, wherein:

[0100] The third calculation unit is used to calculate the first prefix sum matrix based on the task attribute information, the key matrix, the value matrix, and the global attention decay factor.

[0101] The determination unit is used to broadcast the first prefix sum matrix of each parallel computing device to the other parallel computing devices in the distributed architecture, and to determine the second prefix sum matrix based on the task attribute information and the first prefix sum matrices sent by the other parallel computing devices.

[0102] The fourth calculation unit is used to calculate the global attention matrix based on the task attribute information, the global attention decay factor, the query vector, and the second prefix sum matrix.

[0103] The long text parallel inference device based on linear attention provided in this embodiment can execute the above method embodiment. Its implementation principle and technical effect are similar, and will not be repeated here.

[0104] Based on the above embodiments, optionally, the third calculation unit is specifically used to determine the product of the key matrix and the value matrix as the third matrix;

[0105] The product of the third matrix and the global attention decay factor is used to determine the fourth matrix;

[0106] If the task attribute information is a causal task, then the sum of the fourth matrices corresponding to all other parallel computing devices located before the parallel computing device is determined as the first prefix sum matrix.

[0107] Otherwise, the fourth matrix is determined as the first prefix sum matrix.
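
The steps above can be sketched as follows. This is a sketch under stated assumptions: the product of the key matrix and the value matrix is taken to be K^T V, the global attention decay factor is treated as a scalar, and the causal branch follows the wording of the paragraph above by summing only the fourth matrices of preceding devices. All function names are hypothetical.

```python
def third_matrix(k, v):
    """K^T V for key and value matrices given as lists of rows."""
    return [[sum(kr[a] * vr[b] for kr, vr in zip(k, v)) for b in range(len(v[0]))]
            for a in range(len(k[0]))]


def fourth_matrix(k, v, decay):
    """Third matrix scaled by the (assumed scalar) global decay factor."""
    return [[decay * x for x in row] for row in third_matrix(k, v)]


def first_prefix_sum(fourth_matrices, index, causal):
    """fourth_matrices[r] is the fourth matrix of device r; index is this device."""
    if not causal:
        return fourth_matrices[index]
    rows, cols = len(fourth_matrices[0]), len(fourth_matrices[0][0])
    total = [[0.0] * cols for _ in range(rows)]
    for m in fourth_matrices[:index]:  # only devices located before this one
        total = [[x + y for x, y in zip(rt, rm)] for rt, rm in zip(total, m)]
    return total
```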

[0108] The long text parallel inference device based on linear attention provided in this embodiment can execute the above method embodiment. Its implementation principle and technical effect are similar, and will not be repeated here.

[0109] Based on the above embodiments, optionally, the above determination unit is specifically used to determine, if the task attribute information is a causal task, the sum of the first prefix sum matrices corresponding to all other parallel computing devices located before the parallel computing device as the second prefix sum matrix;

[0110] Otherwise, the sum of the first prefix sum matrices corresponding to all other parallel computing devices is determined as the second prefix sum matrix.
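
The exchange and reduction described above can be simulated in a single process as follows. No real communication library is used: the list `first_sums` merely stands in for the matrices that each device would broadcast, and the name `second_prefix_sums` is hypothetical.

```python
def second_prefix_sums(first_sums, causal):
    """Return the second prefix sum matrix for every device rank.

    first_sums[r] stands in for the first prefix sum matrix broadcast by device r.
    """
    results = []
    for rank in range(len(first_sums)):
        rows, cols = len(first_sums[rank]), len(first_sums[rank][0])
        total = [[0.0] * cols for _ in range(rows)]
        for sender, matrix in enumerate(first_sums):
            if sender == rank:
                continue  # only matrices received from other devices are summed
            if causal and sender > rank:
                continue  # causal task: devices located after this one are ignored
            total = [[x + y for x, y in zip(rt, rm)] for rt, rm in zip(total, matrix)]
        results.append(total)
    return results
```

In the causal case the device with rank 0 receives nothing from earlier devices, so its second prefix sum matrix is all zeros in this sketch.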

[0111] The long text parallel inference device based on linear attention provided in this embodiment can execute the above method embodiment. Its implementation principle and technical effect are similar, and will not be repeated here.

[0112] Based on the above embodiments, optionally, the above-mentioned construction module 11 includes: a second construction unit and a third construction unit, wherein:

[0113] The second construction unit is used to determine, if the task attribute information is a causal task, the global attention decay factor based on the first decay coefficient and the exponential function value of the first value; the first value is the position index of the word currently being calculated plus 1.

[0114] The third construction unit is used to determine, if the task attribute information is a non-causal task, the forward global attention decay factor based on the second decay coefficient and the exponential function of the second value, and the backward global attention decay factor based on the second decay coefficient and the exponential function of the third value; the second value is the position index of the word currently being calculated plus 1, and the third value is the difference between the number of all parallel computing devices and the position index of the word currently being calculated.
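
A possible reading of these decay factors is sketched below. It assumes that "exponential function" means the decay coefficient raised to the stated value as a power; this interpretation, the function names, and the parameter name `total` (standing for the count from which the position index is subtracted) are assumptions for illustration.

```python
def causal_decay_factor(coefficient, index):
    """Causal task: coefficient raised to (position index + 1)."""
    return coefficient ** (index + 1)


def noncausal_decay_factors(coefficient, index, total):
    """Non-causal task: forward and backward decay factors."""
    forward = coefficient ** (index + 1)       # second value: index + 1
    backward = coefficient ** (total - index)  # third value: total - index
    return forward, backward
```

With a coefficient between 0 and 1, the forward factor shrinks as the index grows while the backward factor grows, so the two terms weight opposite ends of the sequence.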

[0115] The long text parallel inference device based on linear attention provided in this embodiment can execute the above method embodiment. Its implementation principle and technical effect are similar, and will not be repeated here.

[0116] Based on the above embodiments, optionally, the fourth calculation unit is specifically used to calculate the global attention matrix based on the product matrix of the global attention decay factor, the query vector, and the second prefix sum matrix if the task attribute information is a causal task; and to calculate the global attention matrix based on the sum of the first attention matrix and the second attention matrix if the task attribute information is a non-causal task. The first attention matrix is the product matrix of the forward global attention decay factor, the query vector, and the second prefix sum matrix, and the second attention matrix is the product matrix of the backward global attention decay factor, the query vector, and the second prefix sum matrix.

[0117] The long text parallel inference device based on linear attention provided in this embodiment can execute the above method embodiment. Its implementation principle and technical effect are similar, and will not be repeated here.

[0118] Based on the above embodiments, optionally, the above-mentioned construction module 15 includes: a construction unit, wherein:

[0119] The construction unit is used to sum the local attention matrix and the global attention matrix to form the total attention matrix.

[0120] The modules in the aforementioned long-text parallel inference device based on linear attention can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device in software form, so that the processor can call and execute the operations corresponding to each module.

[0121] In one exemplary embodiment, a computer device is provided, the internal structure of which can be as shown in Figure 5. The computer device includes a processor, memory, input/output (I/O) interfaces, and a communication interface. The processor, memory, and I/O interfaces are connected via a system bus, and the communication interface is also connected to the system bus via the I/O interfaces. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and a database. The internal memory provides the environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The database stores at least one weight matrix. The I/O interfaces are used for exchanging information between the processor and external devices. The communication interface is used for communicating with external terminals via a network connection. When executed by the processor, the computer program implements a long-text parallel inference method based on linear attention.

[0122] Those skilled in the art will understand that the structure shown in Figure 5 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0123] In one embodiment, a computer device is also provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps in the above method embodiments.

[0124] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon that, when executed by a processor, implements the steps in the above method embodiments.

[0125] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps in the above method embodiments.

[0126] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile memory and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, artificial intelligence (AI) processors, etc., and are not limited to these.

[0127] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this application.

[0128] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. A parallel reasoning method for long texts based on linear attention, characterized in that the method is applicable to any parallel computing device in a distributed architecture and includes: obtaining the long text instruction to be processed, and converting the long text instruction to be processed into at least one embedded feature vector; constructing a local attention decay matrix and a global attention decay factor based on the at least one embedded feature vector and the task attribute information of the long text instruction to be processed; obtaining at least one weight matrix, and mapping the at least one embedded feature vector to at least one core matrix based on the weight matrix; calculating, based on linear attention, the local attention matrix according to the core matrix and the local attention decay matrix; calculating, based on linear attention, the global attention matrix according to the core matrix and the global attention decay factor; constructing a total attention matrix for the at least one embedded feature vector based on the local attention matrix, the global attention matrix, and the linear attention matrix; and inferring, based on the task attribute information and the total attention matrix, the unprocessed lexical units and/or task output results corresponding to the long text instruction to be processed.

2. The method according to claim 1, characterized in that the weight matrix includes a query weight matrix, a key weight matrix, and a value weight matrix, and mapping the at least one embedded feature vector to at least one core matrix based on the weight matrix includes: obtaining the query matrix based on the product of the query weight matrix and the embedded feature vector; obtaining the key matrix based on the product of the key weight matrix and the embedded feature vector; and obtaining the value matrix based on the product of the value weight matrix and the embedded feature vector.

3. The method according to claim 2, characterized in that calculating, based on linear attention, the local attention matrix according to the core matrix and the local attention decay matrix includes: obtaining the first matrix based on the product of the query matrix and the transpose of the key matrix; multiplying the first matrix element-wise with the local attention decay matrix to obtain the second matrix; and determining the product of the second matrix and the value matrix as the local attention matrix.

4. The method according to claim 3, characterized in that constructing the local attention decay matrix includes: constructing the local attention decay matrix based on the distance between the position of the first processing word, which is the word currently being calculated, and the position of the second processing word, which is the word being attended to; wherein, if the task attribute information is a causal task, the position of the second processing word precedes the position of the first processing word.

5. The method according to claim 2, characterized in that calculating, based on linear attention, the global attention matrix according to the core matrix and the global attention decay factor includes: calculating the first prefix sum matrix based on the task attribute information, the key matrix, the value matrix, and the global attention decay factor; for each parallel computing device, broadcasting the first prefix sum matrix to the other parallel computing devices in the distributed architecture, and determining a second prefix sum matrix based on the task attribute information and the first prefix sum matrices sent by the other parallel computing devices; and calculating the global attention matrix based on the task attribute information, the global attention decay factor, the query vector, and the second prefix sum matrix.

6. The method according to claim 5, characterized in that calculating the first prefix sum matrix based on the task attribute information, the key matrix, the value matrix, and the global attention decay factor includes: determining the product of the key matrix and the value matrix as the third matrix; determining the product of the third matrix and the global attention decay factor as the fourth matrix; if the task attribute information is a causal task, determining the sum of the fourth matrix and the fourth matrices corresponding to all other parallel computing devices located before the parallel computing device as the first prefix sum matrix; and otherwise, determining the fourth matrix as the first prefix sum matrix.

7. The method according to claim 6, characterized in that determining the second prefix sum matrix includes: if the task attribute information is a causal task, determining the sum of the first prefix sum matrices corresponding to all other parallel computing devices located before the parallel computing device as the second prefix sum matrix; and otherwise, determining the sum of the first prefix sum matrices corresponding to all other parallel computing devices as the second prefix sum matrix.

8. The method according to any one of claims 5 to 7, characterized in that constructing the global attention decay factor includes: if the task attribute information is a causal task, determining the global attention decay factor based on the first decay coefficient and the exponential function value of the first value, the first value being the position index of the word currently being calculated plus 1; and if the task attribute information is a non-causal task, determining the forward global attention decay factor based on the second decay coefficient and the exponential function of the second value, and determining the backward global attention decay factor based on the second decay coefficient and the exponential function of the third value, the second value being the position index of the word currently being calculated plus 1, and the third value being the difference between the number of all parallel computing devices and the position index of the word currently being calculated.

9. The method according to claim 8, characterized in that calculating the global attention matrix based on the task attribute information, the global attention decay factor, the query vector, and the second prefix sum matrix includes: if the task attribute information is a causal task, calculating the global attention matrix based on the product matrix of the global attention decay factor, the query vector, and the second prefix sum matrix; and if the task attribute information is a non-causal task, calculating the global attention matrix based on the sum of the first attention matrix and the second attention matrix, the first attention matrix being the product matrix of the forward global attention decay factor, the query vector, and the second prefix sum matrix, and the second attention matrix being the product matrix of the backward global attention decay factor, the query vector, and the second prefix sum matrix.

10. The method according to claim 1, characterized in that constructing a total attention matrix for the at least one embedded feature vector based on the local attention matrix, the global attention matrix, and the linear attention matrix includes: using the sum of the local attention matrix and the global attention matrix as the total attention matrix.

11. A long text parallel reasoning device based on linear attention, characterized in that the device is applicable to any parallel computing device in a distributed architecture and comprises: an acquisition module, used to acquire the long text instruction to be processed and convert the long text instruction to be processed into at least one embedded feature vector; a construction module, used to construct a local attention decay matrix and a global attention decay factor based on the at least one embedded feature vector and the task attribute information of the long text instruction to be processed; a mapping module, used to obtain at least one weight matrix and map the at least one embedded feature vector to at least one core matrix based on the weight matrix; a first calculation module, used to calculate, based on linear attention, the local attention matrix according to the core matrix and the local attention decay matrix; a second calculation module, used to calculate, based on linear attention, the global attention matrix according to the core matrix and the global attention decay factor; a construction module, used to construct a total attention matrix for the at least one embedded feature vector based on the local attention matrix, the global attention matrix, and the linear attention; and a third calculation module, used to infer the unprocessed lexical units and/or task output results corresponding to the long text instruction to be processed based on the task attribute information and the total attention matrix.