Methods of processing tensors, processing data, related apparatuses, and computer program products

By splitting the tensor output by the linear attention layer into computational blocks for parallel processing, the problem of high computational complexity in long sequences of traditional attention mechanisms is solved, achieving efficient tensor processing and accelerated model inference.

CN122242660A · Pending · Publication Date: 2026-06-19 · BEIJING BAIDU NETCOM SCI & TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: BEIJING BAIDU NETCOM SCI & TECH CO LTD
Filing Date: 2026-03-18
Publication Date: 2026-06-19

Patent Timeline

18 Mar 2026: Application
19 Jun 2026: Publication (CN122242660A)

IPC: G06N3/10; G06N3/0455; G06N5/04; G06F9/50

AI Tagging

Application Domain

Resource allocation; Inference methods

Technology Topics

Algorithm; Theoretical computer science

AI Technical Summary

Technical Problem

Traditional attention mechanisms are computationally complex and memory-intensive when processing long sequences, which limits the application of deep learning models on long texts, time series, or high-dimensional feature data.

Method used

The tensor output by the linear attention layer is split into a set of computational blocks, each of which is a strictly lower triangular matrix. Iterative algorithms are executed in parallel in each computational domain to generate computational results, which are then written to global storage for use by the normalization layer.

Benefits of technology

It improves the efficiency of tensor processing, reduces the frequency of memory access, lowers storage requirements, and enhances the model's inference acceleration capabilities.

✦ Generated by Eureka AI based on patent content.

Abstract

This disclosure provides methods, related apparatuses, and computer program products for processing tensors and data, relating to artificial intelligence technologies such as data processing, deep learning, and inference acceleration. One specific implementation of the tensor processing method includes: splitting the target tensor output by the linear attention layer into a set of computational blocks according to the target block size, where each computational block in the set is a strictly lower triangular matrix; forming a computational domain corresponding to each computational block by loading corresponding storage for each computational block; performing, in parallel, a target round of iteration on the computational blocks in each computational domain using an iterative algorithm, generating computational results corresponding to each computational domain; generating tensor processing results corresponding to the target tensor based on each computational result; and writing the tensor processing results to global storage that can be used by the normalization layer.

Description

Technical Field

[0001] This disclosure relates to the field of computer technology, specifically to artificial intelligence technologies such as data processing, deep learning, and inference acceleration, and particularly to methods, apparatuses, electronic devices, computer-readable storage media, and computer program products for processing tensors and data.

Background Technology

[0002] In existing deep learning models, especially those based on the Transformer architecture, the attention mechanism is a key component for sequence modeling and contextual information capture. Traditional attention mechanisms typically employ full computation, causing their computational complexity to increase quadratically with sequence length. This results in high computational cost and memory consumption when processing long sequences, limiting the application of traditional attention mechanisms on long texts, time series data, or high-dimensional feature data.

[0003] To alleviate this problem, an improved structure for linear attention layers has been proposed. Linear attention layers can reduce computational complexity to a level linearly related to the sequence length by performing linear approximation or decomposition of attention calculations, thereby reducing both computational load and memory usage.

[0004] Against this backdrop, how to further improve the capability and efficiency of processing the tensors output by the linear attention layer, and thereby accelerate the inference process of the model, is a pressing concern.

Summary of the Invention

[0005] This disclosure provides a method, apparatus, electronic device, computer-readable storage medium, and computer program product for processing tensors and data.

[0006] In a first aspect, embodiments of this disclosure propose a method for processing tensors, comprising: splitting the target tensor output by a linear attention layer into a set of computational blocks according to the target block size, wherein each computational block in the set of computational blocks is a strictly lower triangular matrix; forming a computational domain corresponding to each computational block by loading corresponding storage for each computational block; performing a target round of iteration on the computational blocks in each computational domain in parallel using an iterative algorithm to generate computational results corresponding to each computational domain; generating tensor processing results corresponding to the target tensor based on each computational result; and writing the tensor processing results to global storage that can be used by a normalization layer.

[0007] In a second aspect, embodiments of this disclosure propose a method for processing data, comprising: acquiring data to be processed, wherein the data to be processed is at least one of text, image, and audio; invoking a data processing model to process the data to be processed, and obtaining a processing result corresponding to the data processing model, wherein the data processing model includes a linear attention layer, an inference acceleration layer, and a normalization layer, and the inference acceleration layer is capable of executing the tensor processing method described in the first aspect.

[0008] In a third aspect, embodiments of this disclosure propose an apparatus for processing tensors, comprising: a tensor splitting unit configured to split a target tensor output by a linear attention layer into a set of computational blocks according to the target block size, wherein each computational block in the set of computational blocks is a strictly lower triangular matrix; a computational domain forming unit configured to form a computational domain corresponding to each computational block by loading corresponding storage for each computational block; a computation execution unit configured to perform a target round of iteration on the computational blocks in each computational domain in parallel using an iterative algorithm to generate computation results corresponding to each computational domain; a result generation unit configured to generate tensor processing results corresponding to the target tensor based on each computation result; and a result writing unit configured to write the tensor processing results to global storage that can be used by a normalization layer.

[0009] In a fourth aspect, embodiments of this disclosure propose an apparatus for processing data, comprising: a data acquisition unit configured to acquire data to be processed, wherein the data to be processed is at least one of text, image, and audio; and a data processing unit configured to invoke a data processing model to process the data to be processed and obtain a processing result corresponding to the data processing model, wherein the data processing model includes a linear attention layer, an inference acceleration layer, and a normalization layer, and the inference acceleration layer is capable of deploying the tensor processing apparatus described in the third aspect.

[0010] In a fifth aspect, embodiments of this disclosure provide an electronic device comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to implement a method for processing tensors as described in any implementation of the first aspect, and/or a method for processing data as described in any implementation of the second aspect.

[0011] In a sixth aspect, embodiments of this disclosure provide a non-transitory computer-readable storage medium storing computer instructions that, when executed, enable a computer to implement a method for processing tensors as described in any implementation of the first aspect, and/or a method for processing data as described in any implementation of the second aspect.

[0012] In a seventh aspect, embodiments of this disclosure provide a computer program product including a computer program that, when executed by a processor, can implement a method for processing tensors as described in any implementation of the first aspect, and/or a method for processing data as described in any implementation of the second aspect.

[0013] The present disclosure provides methods, apparatuses, electronic devices, computer-readable storage media, and computer program products for processing tensors and data. In the method for processing tensors, first, the target tensor output by the linear attention layer is split into a set of computational blocks according to the target block size, wherein each computational block in the set is a strictly lower triangular matrix; then, by loading corresponding storage for each computational block, a computational domain corresponding to each computational block is formed; next, a target round of iteration is performed in parallel on the computational blocks in each computational domain using an iterative algorithm, generating computational results corresponding to each computational domain; then, based on each computational result, a tensor processing result corresponding to the target tensor is generated; finally, the tensor processing result is written to global storage that can be used by the normalization layer.

[0014] In the data processing method, first, the data to be processed is obtained, wherein the data to be processed is at least one of text, image, and audio; then, the data processing model is invoked to process the data to be processed, and the processing result corresponding to the data processing model is obtained, wherein the data processing model includes a linear attention layer, an inference acceleration layer, and a normalization layer, and the inference acceleration layer can execute the tensor processing method described in any of the above implementations.

[0015] This disclosure, by splitting tensors into computational blocks and forming corresponding independent computational domains, utilizes each computational domain to achieve parallel computation, which not only improves the processing efficiency of tensors but also reduces the frequency of storage calls during processing and lowers storage requirements.

[0016] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.

Attached Figure Description

[0017] Other features, objects, and advantages of this disclosure will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:

Figure 1 is an exemplary system architecture to which this disclosure can be applied;
Figure 2 is a flowchart of a process for processing tensors provided in an embodiment of this disclosure;
Figure 3 is a flowchart of a process of writing tensor processing results to global storage provided in an embodiment of this disclosure;
Figure 4 is a flowchart of a data processing procedure provided in an embodiment of this disclosure;
Figure 5 is a flowchart of the process of processing tensors in a specific application scenario provided in an embodiment of this disclosure;
Figure 6 is a structural block diagram of a tensor processing apparatus provided in an embodiment of this disclosure;
Figure 7 is a structural block diagram of a data processing apparatus provided in an embodiment of this disclosure;
Figure 8 is a schematic structural diagram of an electronic device suitable for performing the methods of processing tensors and processing data provided in an embodiment of this disclosure.

Detailed Implementation

[0018] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding; these should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description. It should be noted that, unless otherwise specified, the embodiments and features described in this disclosure can be combined with each other.

[0019] Furthermore, the acquisition, storage, use, processing, transportation, provision, and disclosure of any type of information involved in the technical solutions disclosed herein, such as user personal information (for example, in some scenarios and embodiments, the data to be processed in this disclosure may include, for example, images of user facial objects), all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.

[0020] Figure 1 shows an exemplary system architecture 100 to which embodiments of the methods, apparatuses, electronic devices, and computer-readable storage media for processing tensors and processing data of this disclosure can be applied.

[0021] As shown in Figure 1, system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. Network 104 serves as the medium for providing communication links between terminal devices 101, 102, and 103 and server 105. Network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

[0022] Users can use terminal devices 101, 102, and 103 to interact with server 105 via network 104 to receive or send messages, etc. Various applications for enabling information communication between the two can be installed on terminal devices 101, 102, and 103 and on server 105, such as inference acceleration applications, data processing applications, and instant messaging applications.

[0023] Terminal devices 101, 102, and 103 and server 105 can be either hardware or software. When terminal devices 101, 102, and 103 are hardware, they can be various electronic devices with displays, including but not limited to smartphones, tablets, laptops, and desktop computers. When terminal devices 101, 102, and 103 are software, they can be installed in the aforementioned electronic devices, and can be implemented as multiple software programs or software modules, or as a single software program or software module; no specific limitation is made here. When server 105 is hardware, it can be implemented as a distributed server cluster composed of multiple servers, or as a single server. When server 105 is software, it can be implemented as multiple software programs or software modules, or as a single software program or software module; no specific limitation is made here.

[0024] Server 105 can provide various services through built-in applications. For example, it can process the tensors output by the linear attention layer to accelerate model inference. When running such an inference acceleration application, server 105 can achieve the following: first, it obtains, from terminal devices 101, 102, and 103 via network 104, the target tensor output by the linear attention layer of the model deployed on those terminal devices; then, server 105 splits the target tensor into a set of computational blocks according to the target block size, where each computational block is a strictly lower triangular matrix; then, server 105 loads corresponding storage for each computational block to form a computational domain for each block; next, server 105 performs, in parallel, a target round of iteration on the computational blocks in each computational domain using an iterative algorithm to generate the computational results corresponding to each domain; then, based on the computational results, server 105 generates tensor processing results corresponding to the target tensor; finally, server 105 writes the tensor processing results to global storage that can be used by the normalization layer.

[0025] It should be noted that, in addition to being obtained from terminal devices 101, 102, and 103 via network 104, the target tensor can also be pre-stored locally on server 105 through various means. Therefore, when server 105 detects that this data is already stored locally (e.g., the model and its linear attention layer are deployed locally on server 105), it can choose to directly obtain this data locally. In this case, the exemplary system architecture 100 may also exclude terminal devices 101, 102, and 103 and network 104.

[0026] Since models are often deployed on servers with abundant computing resources and stronger computing capabilities, the methods for processing tensors and data provided in the subsequent embodiments of this disclosure are generally executed by server 105, which has strong computing power and abundant computing resources. Correspondingly, the devices for processing tensors and data are generally also located in server 105. However, it should also be noted that when terminal devices 101, 102, and 103 also have sufficient computing power and resources, they can also complete the aforementioned calculations performed by server 105 through inference acceleration applications installed on them, and thus output the same results as server 105. Especially when multiple terminal devices with different computing capabilities exist simultaneously, but the inference acceleration application determines that the terminal device it is on has strong computing power and abundant remaining computing resources, it can allow the terminal device to perform the aforementioned calculations, thereby appropriately reducing the computing pressure on server 105. Correspondingly, the devices for processing tensors and data can also be located in terminal devices 101, 102, and 103. In this case, the exemplary system architecture 100 may also exclude the server 105 and the network 104.

[0027] Similarly, the process of processing tensors and data, as well as the deployment of the devices for processing tensors and data, can also be implemented by combining terminal devices 101, 102, 103 and server 105 (for example, one device is deployed in terminal devices 101, 102, 103, and the other device is deployed in server 105).

[0028] It should be understood that the numbers of terminal devices, networks, and servers shown in Figure 1 are merely illustrative. Depending on implementation needs, any number of terminal devices, networks, and servers can be included.

[0029] First, the process of processing tensors is described with reference to Figure 2. Figure 2 is a flowchart of a process for processing tensors provided in an embodiment of this disclosure, shown as process 200.

[0030] Process 200 specifically includes the following steps:

Step 201: Split the target tensor output by the linear attention layer into a set of computational blocks according to the target block size.

In embodiments of this disclosure, this step is performed by the execution entity of the method for processing tensors (e.g., server 105 shown in Figure 1), which, after obtaining the target tensor output by the linear attention layer of the model (or neural network), splits it according to the target block size to form a set of computational blocks.

[0031] In practice, the model can be, for example, a Large Language Model (LLM) or a Transformer, which can include and deploy linear attention layers.

[0032] Accordingly, in the embodiments of this disclosure, the model also includes a normalization layer located after the linear attention layer, so that the model can perform actions such as text semantic parsing, object recognition in an image, audio semantic parsing, etc. on the tensor provided by the linear attention layer (or the tensor processing result of the tensor) through the normalization layer.

[0033] For example, when the "model" is an LLM used to parse text semantics, the LLM can obtain the data to be processed in text form, process the data using the linear attention layer of the LLM to obtain a tensor, and then process the tensor using the normalization layer of the LLM to generate semantic recognition results.

[0034] Similarly, when the "model" is an LLM used to identify objects included in an image, the LLM can acquire data to be processed in the form of an image, process the data using the linear attention layer of the LLM to obtain a tensor, and then process the tensor using the normalization layer of the LLM to determine whether the image contains an "object" and, if so, the location of the "object" in the image.

[0035] The target tensor A output by the linear attention layer can be expressed by formula (1), in which B is the batch size, for example, the number of inputs processed by the linear attention layer; T is the sequence length, for example, the number of time steps or tokens in the input sequence; H is the number of attention heads involved in the linear attention layer; and the remaining dimension is the target block size described above. Furthermore, in the embodiments of this disclosure, for each attention head in each batch, the corresponding submatrix should be a strictly lower triangular matrix, that is, a matrix in which all elements on and above the main diagonal are 0.

[0036] Typically, the target block size can be preset based on the desired computing speed; for example, the target block size can be preset to 64.

[0037] Accordingly, in this step, after obtaining the target tensor, the executing entity can organize the computation into blocks along the sequence dimension of the target tensor A according to the target block size, so that each computation block corresponds to a strictly lower triangular submatrix. This completes the decomposition of the target tensor A and yields a set of computational blocks. Accordingly, each computational block in the set is a strictly lower triangular matrix.
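The splitting in step 201 can be sketched in NumPy as follows. This is only an illustration: the layout `(B, T, H, C)` with `C` as the target block size, the row ordering inside each block, and the helper names `split_into_blocks` and `is_strictly_lower` are assumptions, since the source text does not reproduce formula (1).

```python
import numpy as np

def split_into_blocks(a: np.ndarray, block_size: int) -> np.ndarray:
    """Split a (B, T, H, C) tensor into (B, T // C, H, C, C) blocks.

    Assumed layout (not confirmed by the source): position t along the
    sequence axis holds row (t mod C) of the block that t belongs to.
    """
    b, t, h, c = a.shape
    if c != block_size or t % block_size != 0:
        raise ValueError("last axis must equal C and T must be a multiple of C")
    # (B, T, H, C) -> (B, N, C_row, H, C_col) -> (B, N, H, C_row, C_col)
    return a.reshape(b, t // c, c, h, c).transpose(0, 1, 3, 2, 4)

def is_strictly_lower(blocks: np.ndarray) -> bool:
    """True if every block is zero on and above its main diagonal."""
    # np.triu acts on the last two axes of a stacked array.
    return bool(np.all(np.triu(blocks) == 0))

rng = np.random.default_rng(0)
B, T, H, C = 2, 8, 3, 4
# Build a target tensor whose per-block submatrices are strictly lower triangular.
raw = np.tril(rng.standard_normal((B, T // C, H, C, C)), k=-1)
A = raw.transpose(0, 1, 3, 2, 4).reshape(B, T, H, C)

blocks = split_into_blocks(A, C)
print(blocks.shape, is_strictly_lower(blocks))
```

Each of the `B * (T // C) * H` blocks is an independent `C x C` matrix, which is what allows the later steps to process them in separate computational domains.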

[0038] Step 202: Form a computational domain corresponding to each computational block by loading corresponding storage for each computational block.

In the embodiments of this disclosure, after obtaining the set of computational blocks in step 201 above, the execution entity can load corresponding storage for each computational block to form an independent computational domain for that block.

[0039] For example, the execution entity can independently load each computation block into on-chip high-speed storage, such as Static Random-Access Memory (SRAM) or other equivalent storage units, to form a local computation domain corresponding to the computation block.

[0040] In some embodiments, since storage is used to form the computation domain corresponding to the computation block in this step, during the execution of step 201, the execution entity may, alternatively or additionally, determine the target block size based on the storage capacity (e.g., the size of storage that can be allocated to each computation block). This improves storage utilization while avoiding storage bottlenecks and overflows.

[0041] Specifically, during the execution of step 201 above, the executing entity may read the capacity of the storage and then estimate, based on this capacity, the size of the computation blocks that can be supported. For example, the executing entity may read the currently available capacity of each storage unit and base the estimate on the smallest available capacity.

[0042] In some embodiments, the executing entity may determine the minimum and maximum allowed computation blocks based on storage capacity and storage security thresholds.

[0043] Then, based on the sequence length of the tensor and the range from the minimum to the maximum value allowed by the storage capacity, a multiple that can be aligned with the storage is determined, and this multiple is used to determine the largest target block size that the storage can hold.

[0044] Then, the executing entity can split the target tensor output by the linear attention layer into a set of computational blocks according to the target block size.
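The capacity-driven choice of block size described in paragraphs [0040] to [0043] could look roughly like the sketch below. Everything about the sizing model is an assumption made for illustration: the function name `choose_block_size`, the safety fraction, the alignment multiple, and the idea that a `C x C` block needs three buffers (the block, the iterate, and one temporary).

```python
import math

def choose_block_size(capacity_bytes: int, seq_len: int, bytes_per_elem: int = 2,
                      safety_fraction: float = 0.8, align: int = 16,
                      working_copies: int = 3) -> int:
    """Pick the largest aligned block size C whose working set fits in storage.

    Hypothetical sizing model: working_copies * C * C * bytes_per_elem must
    not exceed safety_fraction * capacity_bytes, and C must not exceed the
    sequence length.
    """
    usable = capacity_bytes * safety_fraction
    # Largest C with working_copies * C^2 * bytes_per_elem <= usable.
    max_c = math.isqrt(int(usable // (working_copies * bytes_per_elem)))
    max_c = min(max_c, seq_len)
    # Round down to a multiple of the alignment unit.
    c = (max_c // align) * align
    if c < align:
        raise ValueError("storage too small for even one aligned block")
    return c

# 32 KiB of per-block storage, 2-byte elements, long sequence.
print(choose_block_size(32 * 1024, seq_len=4096))
# A short sequence caps the block size before the storage does.
print(choose_block_size(32 * 1024, seq_len=40))
```

Taking the minimum over several storage units, as paragraph [0041] suggests, would amount to calling this with the smallest available capacity.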

[0045] Step 203: In parallel, perform a target round of iteration on the computational blocks in each computational domain using an iterative algorithm to generate the computation results corresponding to each computational domain.

In the embodiments of this disclosure, after forming the computational domains corresponding to each computational block in step 202 above, the execution entity can run the computational domains in parallel during this step. In each computational domain, the execution entity can iterate on the computational block using an iterative algorithm (e.g., Gaussian elimination, adjoint matrix method) for a target number of rounds to generate the computation result corresponding to that computational domain.

[0046] In some embodiments, the iterative algorithm can be Newton's iteration method. For example, iterative formula (2) can be constructed to generate the calculation results corresponding to each computational domain. In formula (2), M is the computation block, the iterate indexed by k is the computation result corresponding to the computational domain after the k-th iteration, and I is an identity matrix.

[0047] Accordingly, the executing entity can control each computational domain to exit once k+1 reaches the target round, and use the calculation result obtained in that round as the final output calculation result corresponding to the computational domain.

[0048] Therefore, by implementing the iterative operations in the computational domain using Newton's iteration method, the processing and computation of tensors can be transformed from a serially dependent computation into a parallel one, thereby improving the efficiency of tensor processing and inference.
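As a sketch of how such an iteration can run over all blocks at once, assume that formula (2) is the classical Newton-Schulz update for a matrix inverse, X_{k+1} = X_k (2I - M X_k); the source does not reproduce the formula, so this form is an assumption. The example also assumes the iterated matrix is I + N with N strictly lower triangular (in line with the identity-matrix update mentioned in paragraph [0055]) and starts from X_0 = I. The name `newton_inverse` is hypothetical.

```python
import numpy as np

def newton_inverse(m: np.ndarray, rounds: int) -> np.ndarray:
    """Approximate the inverse of every matrix in a batch of shape (..., n, n).

    Runs a fixed number of Newton-Schulz rounds X <- X (2I - M X) from X = I.
    The batched matmul treats every block independently, standing in for the
    per-domain parallelism described in the text.
    """
    n = m.shape[-1]
    eye = np.eye(n, dtype=m.dtype)
    x = np.broadcast_to(eye, m.shape).copy()
    for _ in range(rounds):
        x = x @ (2.0 * eye - m @ x)
    return x

rng = np.random.default_rng(0)
n = 8
# Four independent blocks of the assumed form I + N (N strictly lower triangular).
blocks = np.eye(n) + np.tril(rng.uniform(-0.5, 0.5, size=(4, n, n)), k=-1)

inv = newton_inverse(blocks, rounds=3)  # 3 rounds, since 2**3 >= n
print(np.max(np.abs(blocks @ inv - np.eye(n))))
```

Each round consists only of matrix products, so there is no row-by-row dependency inside a block. The start X = I is suited to this assumed structure only; a general matrix would need a scaled initial guess for the iteration to converge.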

[0049] When using Newton's iteration method, the target round can be determined from convergence constraints, so that each computational domain exits once it has iterated to the round at which it converges. However, to improve the stability of the iteration and make the target round predictable, in some embodiments clipping to a numerical range can also be added during the iteration. This limits the amplification of errors and stabilizes the iteration results at the same target round, so that the same target round can be used to complete the iteration under different conditions.

[0050] In this case, a numerical range for the calculation result can be predetermined, for example, (-1, 1). Then, if during the execution of the iterative algorithm for the target round, the executing entity detects that the value of the calculation result obtained in the current round is greater than the upper limit of the numerical range, for example, greater than 1, the executing entity can respond by adjusting the calculation result of the current round to the upper limit, i.e., 1.

[0051] Similarly, if, during the execution of the iterative algorithm for the target round, the value of the calculation result in the current round is less than the lower limit of the numerical range, for example, less than -1, the executing entity can also respond by adjusting the calculation result of the current round to that lower limit, for example, -1. Thus, this numerical range makes the iteration process of the executing entity theoretically bounded, for example, [-1, 1], and allows each computational domain to suppress numerical amplification and maintain stability during the calculation and iteration process, enabling convergence at a fixed "target round" for different situations. Furthermore, this approach also improves compatibility with low-precision hardware.

[0052] In practice, this numerical range can usually be determined based on the specific circumstances of the linear attention layer and its theoretically bounded results. For example, by performing L2 normalization on the query and key and restricting the input matrix to a lower triangular shape, a linear attention layer structure that satisfies the constraints of DeltaNet can be constructed to achieve the theoretical bound and thus determine the numerical range.

[0053] For ease of understanding, taking the numerical range (-1, 1) as an example and referring to formula (2), the clipping applied in the k-th iteration can be represented by formula (3).

As discussed above, besides determining the target round based on the criterion of convergence, the target round can also be preset based on statistics of typical convergence behavior. For example, in embodiments that clip to a numerical range, the same fixed target round can achieve a near-convergence effect under different circumstances. Therefore, a fixed target round can be set based on such statistics, which allows a static computation graph to be constructed, improving array utilization and eliminating row-level synchronization dependencies.
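A minimal sketch of a clipped iteration with a fixed round count, under stated assumptions: formula (3) is taken to clip each round's result element-wise to the numerical range, the underlying update is assumed to be Newton-Schulz, X <- X (2I - M X), and the iterated matrix is assumed to be I + N with N strictly lower triangular. None of these forms is reproduced in the source, and `clipped_newton_inverse` is a hypothetical name.

```python
import numpy as np

def clipped_newton_inverse(m: np.ndarray, rounds: int,
                           lo: float = -1.0, hi: float = 1.0) -> np.ndarray:
    """Fixed-round Newton-Schulz iteration with element-wise clipping.

    After every round the iterate is clamped to [lo, hi], so its entries
    stay bounded regardless of the input.
    """
    n = m.shape[-1]
    eye = np.eye(n, dtype=m.dtype)
    x = np.broadcast_to(eye, m.shape).copy()
    for _ in range(rounds):
        x = np.clip(x @ (2.0 * eye - m @ x), lo, hi)
    return x

n = 8
# Well-behaved block: the true inverse already lies inside [-1, 1].
small = np.eye(n) + np.tril(np.full((n, n), 0.1), k=-1)
x_small = clipped_newton_inverse(small, rounds=3)

# Badly scaled block: clipping keeps every entry bounded.
large = np.eye(n) + np.tril(np.full((n, n), 3.0), k=-1)
x_large = clipped_newton_inverse(large, rounds=3)

print(np.allclose(small @ x_small, np.eye(n)), np.max(np.abs(x_large)))
```

In this sketch the clipping is a no-op whenever the iterates stay within the range, which is the situation the text attributes to the theoretical bound. For inputs outside that regime it guarantees bounded values, not a correct inverse.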

[0054] In some embodiments, the target number of rounds can be set based on the number of rows in the computation block. Specifically, the target number of rounds is less than the number of rows in the computation block, so that the computation is more efficient and uses fewer computational resources than serial, row-by-row processing.

[0055] In some embodiments, to ensure the stability of the strictly lower triangular matrix, the execution entity may choose to update the computational blocks using the identity matrix before iterating in parallel in each computational domain, thus obtaining an updated computational block corresponding to each computational block. For example, the computational block M in formula (2) can be updated using formula (4), and the updated block M' can be substituted for M in formula (2) to carry out the iterative process based on formula (2). In formula (4), I is also an identity matrix.
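Why a round count well below the row count can suffice may be seen from a short derivation. It rests on assumptions the source does not confirm: formula (2) is taken to be the Newton-Schulz update X_{k+1} = X_k (2I - M' X_k), formula (4) is taken to be M' = I + M with M a strictly lower triangular n x n block, and the iteration is taken to start from X_0 = I. The residual E_k is introduced here only for the argument.

```latex
\begin{aligned}
E_k &:= I - M' X_k, \\
E_{k+1} &= I - M' X_k \,(2I - M' X_k) = (I - M' X_k)^2 = E_k^{2}, \\
E_0 &= I - (I + M) = -M \;\Longrightarrow\; E_k = M^{2^{k}} \quad (k \ge 1), \\
M^{n} &= 0 \;\Longrightarrow\; E_k = 0 \ \text{once } 2^{k} \ge n .
\end{aligned}
```

Under these assumptions the iteration is exact after ceil(log2 n) rounds, because a strictly lower triangular matrix is nilpotent. For a block with 64 rows that is 6 rounds, compared with 63 dependent steps for a row-by-row substitution.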

[0056] In some embodiments, the computation implemented in each computation domain (e.g., the iterative process that completes the target round based on Newton's iteration method) is intended to eliminate row-wise data dependencies and to avoid completing the inversion with a serial recursive approach. In such cases, the executing entity may also choose to constrain M (or, where M’ is used, M’) in advance in terms of its matrix form, so as to achieve structure-aware initialization. This ensures that the initial matrix keeps the same lower triangular structure as the target matrix, thereby guaranteeing computational quality.

[0057] Taking the use of M’ as an example, the executing entity can align the matrix form of the updated computation block with the matrix form of the computation result corresponding to the computation domain (e.g., the inverse matrix of the computation block) to obtain the aligned updated computation block.

[0058] Accordingly, when iterating in parallel across the computational domains to reach the target round and generate the computational result for each domain, the execution entity can choose to iterate over the aligned updated computational blocks rather than the updated computational blocks. Through this alignment, each computational block is iterated within a stable structure that matches the structure of the inversion result, so that no structural errors arise during the iteration, which improves the computational quality within each computational domain.
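
One possible reading of this alignment, sketched in Python with NumPy: the starting matrix is masked to lower triangular form, and because products and sums of lower triangular matrices remain lower triangular, every later iterate keeps the form of the inverse. The masking via `np.tril`, the helper names, the Newton update, the assumed block I + M, and the dense initial guess are all assumptions of the sketch rather than the disclosed procedure.

```python
import numpy as np

def align_to_lower_triangular(x):
    """Zero every entry above the diagonal so x has the same form as the inverse."""
    return np.tril(x)

def newton_rounds(a, x0, rounds):
    """Run a fixed number of Newton rounds for inv(a), starting from x0."""
    eye = np.eye(a.shape[0])
    x = x0.copy()
    for _ in range(rounds):
        x = x @ (2.0 * eye - a @ x)        # assumed Newton update
    return x

m = np.tril(np.full((4, 4), 0.25), k=-1)   # hypothetical strictly lower triangular block
a = np.eye(4) + m                          # assumed updated block (unit lower triangular)
dense_guess = np.eye(4) + 0.01             # hypothetical guess with nonzero upper entries
x0 = align_to_lower_triangular(dense_guess)
x = newton_rounds(a, x0, rounds=6)
is_lower = np.array_equal(x, np.tril(x))   # True: the iterates stay lower triangular
```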

[0059] Accordingly, after generating the computation results corresponding to each computation domain, the execution entity can cache them (e.g., in the storage corresponding to each computation domain) and then combine them in subsequent step 204 to generate the tensor processing result corresponding to the target tensor.

[0060] Step 204: Based on the calculation results, generate the tensor processing result corresponding to the target tensor. In the embodiments of this disclosure, after obtaining each calculation result in step 203 above, the execution entity can combine the calculation results by reversing the operations used to split the computation blocks, as discussed above, to generate a tensor processing result corresponding to the target tensor (for example, the tensor processing result can serve as the result of performing the inversion operation on the tensor).

[0061] Step 205: Write the tensor processing result to global storage that can be used by the normalization layer.

[0062] In the embodiments of this disclosure, after generating the tensor processing result based on step 204 above, the execution entity can write it to a global storage that can be used by the subsequent normalization layer located after the linear attention layer, so that the normalization layer can use the global storage to obtain the tensor processing result and complete the subsequent processing actions in the model based on the tensor processing result.

[0063] The tensor processing method provided in this disclosure firstly splits the target tensor output by the linear attention layer into a set of computational blocks according to the target block size, wherein each computational block in the set is a strictly lower triangular matrix; then, by loading corresponding storage for each computational block, a computational domain corresponding to each computational block is formed; next, the computational blocks in each computational domain are iterated in parallel using an iterative algorithm to perform the target round of iteration, generating the computational results corresponding to each computational domain; next, based on each computational result, a tensor processing result corresponding to the target tensor is generated; finally, the tensor processing result is written to global storage that can be used by the normalization layer. This approach, by splitting the tensor into computational blocks and forming corresponding independent computational domains, utilizes each computational domain to achieve parallel computation, which not only improves the tensor processing efficiency but also reduces the frequency of storage calls during processing, thus lowering storage requirements.
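
The five steps summarized above can be sketched end to end in Python with NumPy. The tensor layout (blocks stacked row-wise as a `(num_blocks * B, B)` array), the identity update I + M, the Newton update, the use of a batched matrix product as a stand-in for independent computational domains, and a plain dictionary as a stand-in for global storage are all assumptions of the sketch; the function names are hypothetical.

```python
import numpy as np

def split_into_blocks(target, block_size):
    """Split a (num_blocks * B, B) tensor into (num_blocks, B, B) blocks."""
    rows, cols = target.shape
    if cols != block_size or rows % block_size != 0:
        raise ValueError("target shape must be (num_blocks * block_size, block_size)")
    blocks = target.reshape(rows // block_size, block_size, block_size)
    if not np.array_equal(blocks, np.tril(blocks, k=-1)):
        raise ValueError("every block must be strictly lower triangular")
    return blocks

def iterate_domains(blocks, rounds):
    """Run the same fixed number of Newton rounds on every block at once."""
    eye = np.eye(blocks.shape[-1])
    a = eye + blocks                           # assumed identity update, applied per block
    x = np.broadcast_to(eye, a.shape).copy()   # one private working copy per domain
    for _ in range(rounds):
        x = x @ (2.0 * eye - a @ x)            # batched matmul: blocks never interact
    return x

def process_tensor(target, block_size, rounds, global_storage):
    """Split, iterate per domain, merge, and write the merged result once."""
    blocks = split_into_blocks(target, block_size)
    results = iterate_domains(blocks, rounds)
    merged = results.reshape(target.shape)     # reverse of the split
    global_storage["tensor_processing_result"] = merged
    return merged

rng = np.random.default_rng(0)
example_blocks = np.tril(rng.uniform(-0.2, 0.2, size=(3, 4, 4)), k=-1)
storage = {}
out = process_tensor(example_blocks.reshape(12, 4), block_size=4, rounds=2,
                     global_storage=storage)
```

In this sketch the intermediate iterates live only in the per-block working array, and the dictionary is written a single time after merging, which mirrors the reduced frequency of storage access described above.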

[0064] In some embodiments, taking process 200 as an example, it can be understood as a process of processing "tensors" to accelerate the inference process of the model. Therefore, in such a case, the executing entity can also be more figuratively understood as an "inference acceleration operator" or "inference acceleration layer" in the model, so that by deploying this "inference acceleration layer" or "inference acceleration operator" in the model, the model can have such accelerated inference capabilities during the inference process. For example, this inference acceleration layer can be located after the linear attention layer in the model, such as after the L2 norm layer (L2Norm), convolutional layer (Conv), and linear layer (Linear) of the linear attention layer, and before the normalization layer (Root Mean Square Layer Normalization, abbreviated as RMSNorm).

[0065] However, in other embodiments, for models that have already been trained and deployed, the aforementioned tensor processing for accelerating inference can be introduced by having the aforementioned execution entity modify the operators involved in the relevant operations, thereby transforming these models so that they support accelerated inference.

[0066] In this scenario, the execution entity can adapt the process of writing tensor processing results to global storage that can be used by the normalization layer, based on the specific situation and structure of the model. For ease of understanding, refer to Figure 3 for an explanation.

[0067] Figure 3 is a flowchart of a process for writing tensor processing results to global storage, as provided in an embodiment of the present disclosure, and includes process 300. In some embodiments, process 300 may be an alternative or optional implementation of step 205 described above.

[0068] Process 300 specifically includes the following steps: Step 301: In response to the existence of an inversion operator connected to and located after the linear attention layer, disable the inversion operator. Specifically, as discussed above, the purpose of process 200 is to accelerate inference by replacing the "inversion" process with parallel iteration. In such a case, the executing entity can first detect whether the model itself has deployed an inversion operator (or inversion layer) to implement the "inversion" process after the linear attention layer.

[0069] Accordingly, if such an inversion operator exists, the execution entity can respond by first disabling the inversion operator to avoid redundant calculations, and then using the aforementioned "tensor processing result" to replace the result of the inversion operator, thereby accelerating inference.

[0070] Typically, after disabling the inversion operator, the executing entity can directly use the tensor processing result as the result of the inversion operator, and write the tensor processing result to global storage that can be used by the normalization layer to complete the "replacement" process, thereby accelerating model inference.

[0071] In some embodiments, the execution entity may further detect whether the inversion operator and the normalization layer are directly connected. If they are not directly connected and other target operators lie between the inversion operator and the normalization layer, the calculation logic of the target operators is fused in, so as to reduce read operations on global storage, improve memory access efficiency, and improve computational efficiency.

[0072] Accordingly, in process 300, the executing entity may further choose to execute step 302 after step 301 to detect whether the inversion operator and the normalization layer are directly connected.

[0073] Step 302: Check whether the inversion operator is directly connected to the normalization layer. If the inversion operator is directly connected to the normalization layer, the execution entity can respond by executing step 303, directly using the tensor processing result as the result of the inversion operator and writing the tensor processing result to global storage that can be used by the normalization layer, so as to directly complete the above-mentioned substitution and inference acceleration process.

[0074] If they are not directly connected, that is, if there is a target operator between the inversion operator and the normalization layer, the execution entity can respond by executing step 304, using the target operator to process the tensor processing result and obtain the operator processing result.

[0075] Step 303: Using the tensor processing result as the result of the inversion operator, write the tensor processing result to global storage that can be used by the normalization layer.

Step 304: Process the tensor processing result using the target operator to obtain the operator processing result.

Specifically, as discussed above, if at least one target operator (e.g., depending on the model and scenario, an operator that transposes a matrix or multiplies by a target matrix) is included between the inversion operator and the normalization layer, the execution entity can first use the target operator to process the tensor processing result and perform the corresponding operations, instead of directly writing the tensor processing result to global storage that can be used by the normalization layer.

[0076] Accordingly, after obtaining the operator processing result, the execution entity can continue to execute step 305 to actually use the operator processing result as the final content to be stored (i.e., as a further processing result of the tensor processing result, replacing the tensor processing result to be written to the global storage), and write it to the global storage that can be used by the normalization layer.

[0077] It should be understood that if there are at least two target operators, the execution entity can similarly use these target operators to obtain the final result and then write the final "operator processing result" to reduce the frequency of accessing global storage and improve computational efficiency.

[0078] Step 305: Replace the tensor processing result with the operator processing result and write it to global storage that can be used by the normalization layer.
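
The branching in steps 301 to 305 can be sketched as a small Python function. The model description (a list of `(name, function)` pairs for the operators between the linear attention layer and the normalization layer), the marker name `"inverse"`, the storage key, and the function name `write_for_normalization` are hypothetical conventions chosen for the sketch.

```python
import numpy as np

def write_for_normalization(tensor_result, ops_after_attention, global_storage):
    """Write the tensor processing result (or its further-processed form) once.

    ops_after_attention lists the (name, function) pairs located between the linear
    attention layer and the normalization layer; "inverse" marks the inversion operator.
    """
    names = [name for name, _ in ops_after_attention]
    payload = tensor_result
    if "inverse" in names:
        idx = names.index("inverse")
        # Step 301: the inversion operator is disabled, i.e. its function is never called;
        # tensor_result takes the place of its output.
        # Step 302: anything listed after it is a target operator (not directly connected).
        for _, target_op in ops_after_attention[idx + 1:]:
            payload = target_op(payload)                 # Step 304: apply each target operator
    global_storage["normalization_input"] = payload      # Step 303 or 305: a single write
    return payload

result = np.array([[1.0, 0.0], [2.0, 1.0]])
store = {}
written = write_for_normalization(
    result,
    [("inverse", np.linalg.inv), ("transpose", np.transpose)],
    store,
)
```

Writing only the final payload, rather than writing the tensor processing result and reading it back for each target operator, is what keeps global storage access to a single write in this sketch.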

[0079] Next, the process of processing data, which incorporates the aforementioned tensor processing, is discussed.

[0080] Please refer to Figure 4, which is a flowchart of a data processing procedure provided in an embodiment of this disclosure and includes process 400.

[0081] Process 400 specifically includes the following steps: Step 401: Obtain the data to be processed. In embodiments of this disclosure, this step is intended for the entity executing the data processing method (e.g., the server 105 shown in Figure 1) to obtain the data to be processed, such as data provided by the user through terminal devices 101, 102, and 103.

[0082] As discussed above, depending on the specific implementation and processing purpose, the data to be processed can be at least one of text, image, and audio, or "multimodal data" including at least two of them.

[0083] Step 402: Call the data processing model to process the data to be processed and obtain the processing result corresponding to the data processing model.

[0084] In embodiments of this disclosure, after step 401 above, the executing entity in this step may correspondingly invoke a data processing model for processing the data to be processed. The data processing model, i.e., the "model" discussed above, may include at least a linear attention layer, an inference acceleration layer, and a normalization layer. The inference acceleration layer is capable of performing, for example, the tensor processing process discussed in process 200 above.

[0085] The data processing method provided in this embodiment obtains data to be processed, wherein the data to be processed is at least one of text, image, and audio. Then, a data processing model is called to process the data to be processed to obtain a processing result corresponding to the data processing model. The data processing model includes a linear attention layer, an inference acceleration layer, and a normalization layer. The inference acceleration layer is capable of performing the above-mentioned tensor processing process.

[0086] Therefore, by using a data processing model that includes an inference acceleration layer, the data processing process can be completed more efficiently and with higher quality.

[0087] To enhance understanding, this disclosure also provides a specific implementation scheme based on a particular application scenario. Please refer to Figure 5, which is a flowchart of a process for processing tensors implemented in a specific application scenario, as provided in an embodiment of this disclosure, and includes process 500.

[0088] For example, process 500 can also be implemented by "server 105" as the execution subject.

[0089] In process 500, for a model (not shown in the figure) including a linear attention layer 510 and a normalization layer 517, the executing entity can obtain the tensor 513 after the linear attention layer 510 outputs tensor 513 by executing S501.

[0090] Then, the execution entity executes S502 to split the tensor 513 output by the linear attention layer 510 into computation blocks 521, 522...52N (where N is a positive integer) according to the target block size.

[0091] Next, the executing entity can allocate storage to each of the split computing blocks by executing S503, thus forming a "computation domain" corresponding to each computing block. For example, for computing block 521, the executing entity can allocate storage 531 by executing S503 to form computing domain 551; for computing block 522, the executing entity can allocate storage 532 by executing S503 to form computing domain 552... For computing block 52N, the executing entity can allocate storage 53N by executing S503 to form computing domain 55N.

[0092] Then, the execution entity can execute S504 to run the computational domains in parallel, performing the target round of iterations in each computational domain (e.g., iterations based on the "Newton iteration algorithm") to obtain the corresponding computational result for each computational domain. For example, computational result 561 is obtained for computational domain 551, computational result 562 for computational domain 552, and computational result 56N for computational domain 55N.

[0093] Next, the executing entity can generate the final tensor calculation result 570 based on the calculation results 561, 562...56N by executing S505.

[0094] Finally, the execution entity can execute S506 to store the tensor calculation result 570 to the global storage 515 that can be used by the normalization layer 517, so that the normalization layer 517 can obtain and use the tensor calculation result 570 by accessing the global storage 515.

[0095] With further reference to Figure 6, as an implementation of the methods shown in the above figures, this disclosure provides an embodiment of an apparatus for processing tensors. This apparatus embodiment corresponds to the method embodiment shown in Figure 2, and the apparatus can be specifically applied to various electronic devices.

[0096] As shown in Figure 6, the tensor processing apparatus 600 of this embodiment may include: a tensor splitting unit 601, a computational domain forming unit 602, a computation execution unit 603, a result generation unit 604, and a result writing unit 605. The tensor splitting unit 601 is configured to split the target tensor output by the linear attention layer into a set of computational blocks according to the target block size, wherein each computational block in the set is a strictly lower triangular matrix; the computational domain forming unit 602 is configured to form a computational domain corresponding to each computational block by loading corresponding storage for each computational block; the computation execution unit 603 is configured to iterate the computational blocks in parallel in each computational domain using an iterative algorithm for the target round, generating computational results corresponding to each computational domain; the result generation unit 604 is configured to generate a tensor processing result corresponding to the target tensor based on each computational result; and the result writing unit 605 is configured to write the tensor processing result to global storage that can be used by the normalization layer.

[0097] In this embodiment, the specific processing and technical effects of the tensor splitting unit 601, computational domain forming unit 602, computation execution unit 603, result generation unit 604, and result writing unit 605 in the tensor processing apparatus 600 can be found in the relevant descriptions of steps 201-205 in the embodiment corresponding to Figure 2, and will not be repeated here.

[0098] In some optional implementations of this embodiment, the apparatus 600 further includes: an updated computation block calculation unit configured to update the computation blocks using an identity matrix to obtain the updated computation block corresponding to each computation block; and the computation execution unit 603 is further configured to iterate the updated computation blocks in parallel in each computation domain using an iterative algorithm for the target round, generating the computation results corresponding to each computation domain.

[0099] In some optional implementations of this embodiment, the apparatus 600 further includes: an updated computation block alignment unit configured to align the matrix form of the updated computation block with the matrix form of the computation result to obtain an aligned updated computation block; and the computation execution unit 603 is further configured to iterate the aligned updated computation blocks in parallel in each computation domain using an iterative algorithm for the target round, generating the computation results corresponding to each computation domain.

[0100] In some optional implementations of this embodiment, the iterative algorithm includes: Newton's iteration algorithm.

[0101] In some optional implementations of this embodiment, the apparatus 600 further includes: a calculation result adjustment unit, configured to, during the execution of the iterative algorithm for the target round, adjust the calculation result to the upper limit value of the numerical range in response to the value of the calculation result obtained in the current round being greater than the upper limit value, and adjust the calculation result to the lower limit value of the numerical range in response to the value of the calculation result obtained in the current round being less than the lower limit value.

[0102] In some optional implementations of this embodiment, the result writing unit 605 includes: an operator prohibition unit, configured to disable the inversion operator in response to the existence of an inversion operator connected to and located after the linear attention layer; and a result writing subunit, configured to use the tensor processing result as the operation result of the inversion operator and write the tensor processing result to global storage that can be used by the normalization layer.

[0103] In some optional implementations of this embodiment, the result writing subunit is further configured to, in response to the inversion operator being directly connected to the normalization layer, use the tensor processing result as the operation result of the inversion operator, and write the tensor processing result to global storage that can be used by the normalization layer.

[0104] In some optional implementations of this embodiment, the apparatus 600 further includes: an operator execution unit configured to, in response to the inclusion of a target operator between the inversion operator and the normalization layer, process the tensor processing result using the target operator to obtain the operator processing result; and a replacement writing unit configured to replace the tensor processing result with the operator processing result and write it to global storage that can be used by the normalization layer.

[0105] In some optional implementations of this embodiment, the tensor splitting unit 601 includes: a capacity reading subunit configured to read the capacity size of the storage; a block size determination unit configured to determine the target block size based on the capacity size; and a tensor splitting subunit configured to split the target tensor output by the linear attention layer into a set of computation blocks according to the target block size.
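
How a target block size might be derived from the storage capacity can be sketched as follows. The sizing rule (a fixed number of B × B working buffers per computational domain), the candidate sizes, the element width, and the function name `choose_block_size` are assumptions made for illustration; the actual rule depends on the hardware and the kernel.

```python
def choose_block_size(capacity_bytes, bytes_per_element=4, buffers=3,
                      candidates=(16, 32, 64, 128)):
    """Return the largest candidate B whose working set fits in the storage.

    The working set is assumed to be `buffers` square B x B arrays, for example
    the block, the current iterate, and one scratch product.
    """
    fitting = [b for b in candidates
               if buffers * b * b * bytes_per_element <= capacity_bytes]
    if not fitting:
        raise ValueError("storage too small for any candidate block size")
    return max(fitting)

# With 48 KiB of per-domain storage and 32-bit elements, three 64 x 64 buffers
# need 3 * 64 * 64 * 4 = 49152 bytes, which fits exactly.
block_size = choose_block_size(48 * 1024)
```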

[0106] In some optional implementations of this embodiment, the number of target rounds is less than the number of rows in the computation block.

[0107] This embodiment exists as a device embodiment corresponding to the above method embodiment. The tensor processing device provided in this embodiment, after splitting the tensor into computational blocks and forming corresponding independent computational domains, utilizes each computational domain to achieve parallel computation, which can not only improve the processing efficiency of tensors, but also reduce the frequency of storage calls during the processing and reduce storage requirements.

[0108] With further reference to Figure 7, as an implementation of the methods shown in the above figures, this disclosure provides an embodiment of an apparatus for processing data. This apparatus embodiment corresponds to the method embodiment shown in Figure 4, and the apparatus can be specifically applied to various electronic devices.

[0109] As shown in Figure 7, the data processing apparatus 700 of this embodiment may include a data acquisition unit 701 and a data processing unit 702. The data acquisition unit 701 is configured to acquire data to be processed, wherein the data to be processed is at least one of text, image, and audio. The data processing unit 702 is configured to invoke a data processing model to process the data to be processed and obtain a processing result corresponding to the data processing model. The data processing model includes a linear attention layer, an inference acceleration layer, and a normalization layer, wherein the apparatus 600 can be deployed in the inference acceleration layer.

[0110] In this embodiment, the specific processing of the data acquisition unit 701 and the data processing unit 702 in the data processing apparatus 700, and the resulting technical effects, can be found in the relevant descriptions of steps 401-402 in the embodiment corresponding to Figure 4, and will not be repeated here.

[0111] This embodiment exists as a device embodiment corresponding to the above method embodiment. The data processing device provided in this embodiment, through a data processing model including an inference acceleration layer, enables the data processing process to be completed more efficiently and with higher quality.

[0112] According to embodiments of this disclosure, this disclosure also provides an electronic device, a readable storage medium, and a computer program product.

[0113] Figure 8 shows a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and / or claimed herein.

[0114] As shown in Figure 8, device 800 includes a computing unit 801, which can perform various appropriate actions and processes based on a computer program stored in read-only memory (ROM) 802 or a computer program loaded from storage unit 808 into random access memory (RAM) 803. RAM 803 may also store various programs and data required for the operation of device 800. The computing unit 801, ROM 802, and RAM 803 are interconnected via bus 804. Input / output (I / O) interface 805 is also connected to bus 804.

[0115] Multiple components in device 800 are connected to I / O interface 805, including: input unit 806, such as keyboard, mouse, etc.; output unit 807, such as various types of monitors, speakers, etc.; storage unit 808, such as disk, optical disk, etc.; and communication unit 809, such as network card, modem, wireless transceiver, etc. Communication unit 809 allows device 800 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.

[0116] The computing unit 801 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the various methods and processes described above, such as methods for processing tensors and processing data. For example, in some embodiments, methods for processing tensors and processing data may be implemented as computer software programs tangibly contained in a machine-readable medium, such as storage unit 808. In some embodiments, part or all of the computer program may be loaded and / or installed on device 800 via ROM 802 and / or communication unit 809. When the computer program is loaded into RAM 803 and executed by the computing unit 801, one or more steps of the methods for processing tensors and processing data described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured in any other suitable manner (e.g., by means of firmware) to perform methods of processing tensors and processing data.

[0117] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.

[0118] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.

[0119] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0120] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).

[0121] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.

[0122] Computer systems can include clients and servers. Clients and servers are generally remote from each other and typically interact via communication networks. The client-server relationship is established by computer programs running on the respective computers and having a client-server relationship with each other. Servers can be cloud servers, also known as cloud computing servers or cloud hosts, which are hosting products within the cloud computing service ecosystem that address the management difficulties and weak business scalability inherent in traditional physical hosts and Virtual Private Server (VPS) services. Servers can also be servers of a distributed system or servers incorporating blockchain technology.

[0123] According to the technical solution of the present disclosure, after splitting the tensor into computational blocks and forming corresponding independent computational domains, parallel computation is achieved by utilizing each computational domain. This not only improves the processing efficiency of the tensor, but also reduces the frequency of storage calls during the processing and lowers storage requirements.
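The flow summarized above can be sketched in plain Python. This is a minimal illustration under several assumptions that this passage does not state: the target tensor is laid out as square strictly lower triangular blocks stacked row-wise; the update with the identity matrix is M = I + A; the iterative algorithm is the Newton-Schulz form of Newton's iteration, X ← X(2I − MX), started from X = I; and a thread pool stands in for the independent computation domains. All function names (`split_into_blocks`, `target_rounds`, `invert_block`, `process_tensor`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    """Return the n-by-n identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def split_into_blocks(rows, block_size):
    """Group consecutive rows into square blocks (assumed tensor layout)."""
    if len(rows) % block_size != 0:
        raise ValueError("row count must be a multiple of block_size")
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]


def target_rounds(block_size):
    """Smallest k with 2**k >= block_size."""
    return (block_size - 1).bit_length()


def invert_block(block):
    """Newton-Schulz inverse of M = I + A for a strictly lower triangular A."""
    n = len(block)
    eye = identity(n)
    m = [[block[i][j] + eye[i][j] for j in range(n)] for i in range(n)]
    x = eye  # initial guess; the residual I - M X is then nilpotent
    for _ in range(target_rounds(n)):
        mx = matmul(m, x)
        correction = [[2.0 * eye[i][j] - mx[i][j] for j in range(n)]
                      for i in range(n)]
        x = matmul(x, correction)
    return x


def process_tensor(rows, block_size):
    """Split the tensor, process every block concurrently, collect results."""
    blocks = split_into_blocks(rows, block_size)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(invert_block, blocks))


# Two stacked 3x3 strictly lower triangular blocks (illustrative values).
tensor_rows = [
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [3.0, 4.0, 0.0],
    [0.0, 0.0, 0.0],
    [-1.0, 0.0, 0.0],
    [0.5, 2.0, 0.0],
]
results = process_tensor(tensor_rows, block_size=3)
```

Under these assumptions the residual I − MX squares on every round, so after k rounds it equals the 2^k-th power of the strictly lower triangular part and vanishes once 2^k reaches the block size. The number of rounds therefore grows only logarithmically with the block size and stays below the number of rows.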

[0124] It should be understood that steps can be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution provided in this disclosure can be achieved; no limitation is imposed herein.

[0125] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.

Claims

1. A method for processing tensors, comprising: splitting, according to a target block size, a target tensor output by a linear attention layer into a set of computation blocks, wherein each computation block in the set of computation blocks is a strictly lower triangular matrix; forming a computation domain corresponding to each computation block by loading corresponding storage for that computation block; executing, in parallel in the computation domains, a target number of rounds of an iterative algorithm on the computation blocks to generate a computation result corresponding to each computation domain; generating, based on the computation results, a tensor processing result corresponding to the target tensor; and writing the tensor processing result to global storage usable by a normalization layer.

2. The method according to claim 1, further comprising: updating each computation block using an identity matrix to obtain an updated computation block corresponding to that computation block; wherein executing, in parallel in the computation domains, the target number of rounds of the iterative algorithm on the computation blocks to generate the computation result corresponding to each computation domain comprises: executing, in parallel in the computation domains, the target number of rounds of the iterative algorithm on the updated computation blocks to generate the computation result corresponding to each computation domain.

3. The method according to claim 2, further comprising: aligning the matrix form of each updated computation block with the matrix form of the computation result to obtain an aligned updated computation block; wherein executing, in parallel in the computation domains, the target number of rounds of the iterative algorithm on the updated computation blocks to generate the computation result corresponding to each computation domain comprises: executing, in parallel in the computation domains, the target number of rounds of the iterative algorithm on the aligned updated computation blocks to generate the computation result corresponding to each computation domain.

4. The method according to any one of claims 1-3, wherein the iterative algorithm comprises Newton's iteration algorithm.

5. The method according to claim 1, further comprising: in response to a value of a round computation result obtained in a current round being greater than an upper limit of a value range during execution of the target number of rounds of the iterative algorithm, adjusting the round computation result to the upper limit; and in response to the value of the round computation result obtained in the current round being less than a lower limit of the value range during execution of the target number of rounds of the iterative algorithm, adjusting the round computation result to the lower limit.

6. The method according to claim 1, wherein writing the tensor processing result to the global storage usable by the normalization layer comprises: in response to an inversion operator being connected after the linear attention layer, disabling the inversion operator; and using the tensor processing result as the operation result of the inversion operator, and writing the tensor processing result to the global storage usable by the normalization layer.

7. The method according to claim 6, wherein using the tensor processing result as the operation result of the inversion operator and writing the tensor processing result to the global storage usable by the normalization layer comprises: in response to the inversion operator being directly connected to the normalization layer, using the tensor processing result as the operation result of the inversion operator, and writing the tensor processing result to the global storage usable by the normalization layer.

8. The method according to claim 7, further comprising: in response to a target operator being present between the inversion operator and the normalization layer, processing the tensor processing result using the target operator to obtain an operator processing result; and writing the operator processing result, in place of the tensor processing result, to the global storage usable by the normalization layer.

9. The method according to claim 1, wherein splitting, according to the target block size, the target tensor output by the linear attention layer into the set of computation blocks comprises: reading a capacity of the storage; determining the target block size based on the capacity; and splitting, according to the target block size, the target tensor output by the linear attention layer into the set of computation blocks.

10. The method according to claim 1, wherein the target number of rounds is less than the number of rows of the computation block.

11. A method for processing data, comprising: acquiring data to be processed, wherein the data to be processed is at least one of text, image, and audio; and processing the data to be processed by invoking a data processing model to obtain a processing result corresponding to the data processing model; wherein the data processing model comprises a linear attention layer, an inference acceleration layer, and a normalization layer, and the inference acceleration layer is capable of executing the method for processing tensors according to any one of claims 1-10.

12. An apparatus for processing tensors, comprising: a tensor splitting unit configured to split, according to a target block size, a target tensor output by a linear attention layer into a set of computation blocks, wherein each computation block in the set of computation blocks is a strictly lower triangular matrix; a computation domain forming unit configured to form a computation domain corresponding to each computation block by loading corresponding storage for that computation block; a computation execution unit configured to execute, in each computation domain, a target number of rounds of an iterative algorithm on the computation block to generate a computation result corresponding to each computation domain; a result generation unit configured to generate, based on the computation results, a tensor processing result corresponding to the target tensor; and a result writing unit configured to write the tensor processing result to global storage usable by a normalization layer.

13. An apparatus for processing data, comprising: a data acquisition unit configured to acquire data to be processed, wherein the data to be processed is at least one of text, image, and audio; and a data processing unit configured to invoke a data processing model to process the data to be processed and obtain a processing result corresponding to the data processing model; wherein the data processing model comprises a linear attention layer, an inference acceleration layer, and a normalization layer, and the inference acceleration layer can deploy the apparatus for processing tensors according to claim 12.

14. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method for processing tensors according to any one of claims 1-10 and/or the method for processing data according to claim 11.

15. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method for processing tensors according to any one of claims 1-10 and/or the method for processing data according to claim 11.

16. A computer program product comprising a computer program that, when executed by a processor, implements the method for processing tensors according to any one of claims 1-10 and/or the method for processing data according to claim 11.
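Two of the refinements recited above, determining the target block size from the storage capacity and limiting each round computation result to a value range, can be illustrated as follows. This is a minimal sketch under assumptions that are not taken from the claims: the storage of one computation domain must hold a fixed number of square working buffers, the value range is that of IEEE 754 half precision (largest finite value 65504), and the limits are applied element by element. The names `pick_block_size` and `clamp_round_result` and their parameters are hypothetical.

```python
import math

FP16_MAX = 65504.0  # largest finite half-precision value (assumed value range)


def pick_block_size(capacity_bytes, bytes_per_element=2, buffers=3):
    """Largest n such that `buffers` n-by-n matrices fit in the capacity."""
    elements_per_buffer = capacity_bytes // (bytes_per_element * buffers)
    return math.isqrt(elements_per_buffer)


def clamp_round_result(matrix, lower=-FP16_MAX, upper=FP16_MAX):
    """Pull every entry of a round result back into [lower, upper]."""
    return [[min(max(value, lower), upper) for value in row] for row in matrix]


# Illustrative values: 48 KiB of per-domain storage, one out-of-range result.
block_size = pick_block_size(capacity_bytes=48 * 1024)
clamped = clamp_round_result([[1.5, 70000.0], [-70000.0, 0.0]])
```

In this sketch the block size is simply the largest one that fits; a real kernel would likely also round it to a size that suits the hardware, which is left out here.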