Data processing method and apparatus thereof

By moving data between memory and cache, performing fast attention FA operations, reusing Q vectors, and expanding the calculation range, the problem of high memory peaks in large-scale model and long-text calculations is solved, thus improving the algorithm's execution efficiency.

WO2026045239A9 · PCT designated stage · Publication Date: 2026-06-18 · HUAWEI TECH CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: HUAWEI TECH CO LTD
Filing Date: 2025-03-21
Publication Date: 2026-06-18

AI Technical Summary

⚠Technical Problem

When large models process long texts, peak memory usage during computation is high, which reduces the efficiency of the algorithm. In particular, the space complexity of the intermediate matrices S and P grows with the square of the sequence length, which makes data transfer inefficient.

⚗Method used

Data is moved from memory into a cache, where fast attention (FA) operations are performed. Q vectors are reused to reduce the amount of data moved, and the computation range of each stage is expanded to multiple row blocks to reduce the number of input matrix transfers.

🎯Benefits of technology

Reduces the amount of data transferred, lowers peak memory usage, and improves computational efficiency and the overall execution efficiency of the algorithm.

✦ Generated by Eureka AI based on patent content.


Abstract

A data processing method, the method comprising: transferring first data from a memory to a second memory, wherein the first data comprises data required for a computing unit to perform fast attention (FA) operations on a same batch, and the first data comprises a first Q vector, a second Q vector, and a first KV vector; by means of the computing unit, performing the FA operation on the first Q vector and the first KV vector in the second memory to obtain a first computation result; and by means of the computing unit, performing the FA operation on the second Q vector and the first KV vector in a first cache to obtain a second computation result. The present application can reduce an amount of data transferred.


Description

A data processing method and apparatus

[0001] This application claims priority to Chinese Patent Application No. 202411192549.5, filed with the State Intellectual Property Office of China on August 27, 2024, entitled “A Data Processing Method and Apparatus”, the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This application relates to the field of artificial intelligence, and more particularly to a data processing method and apparatus thereof.

Background Technology

[0003] Large models, as one of the most important research directions in the field of artificial intelligence, have accelerated the technological advancement of neural networks in both academia and industry. They are a crucial step in realizing artificial intelligence applications, especially deep learning applications. Large models can help us extract valuable information from complex models and solve a wide variety of problems, such as image recognition, speech recognition, and natural language processing.

[0004] The ability to extract information from long texts is a crucial performance indicator for large models. As large models develop, context lengths therefore keep growing. Both the sequence length of the datasets used in the training phase and the length of the prompts in the full inference phase have reached hundreds of thousands or even millions. Compared to the size of the input / output matrices Q, K, V, O, which is O(Nd), the space complexity of the intermediate matrices S, P is quadratic in the sequence length, O(N²). Therefore, some calculations can lead to extremely high memory peaks during algorithm execution, and the overall efficiency of the algorithm will be greatly reduced due to data movement.

Summary of the Invention
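The gap between O(Nd) and O(N²) described in paragraph [0004] can be made concrete with a little arithmetic. The sketch below is illustrative only: the head dimension of 128, the 2-byte element size, and the function name attention_memory_bytes are assumptions, not values taken from this application.

```python
def attention_memory_bytes(seq_len: int, head_dim: int, bytes_per_elem: int = 2):
    """Return (io_bytes, intermediate_bytes) for one attention head.

    io_bytes covers Q, K, V, O (four N x d matrices).
    intermediate_bytes covers S and P (two N x N matrices) if both
    were materialised in full.
    """
    io_bytes = 4 * seq_len * head_dim * bytes_per_elem
    intermediate_bytes = 2 * seq_len * seq_len * bytes_per_elem
    return io_bytes, intermediate_bytes


# Assumed sequence lengths, from a few thousand up to about a million.
for n in (4_096, 131_072, 1_048_576):
    io_b, mid_b = attention_memory_bytes(n, head_dim=128)
    print(f"N={n:>9}: Q,K,V,O = {io_b / 2**20:10.1f} MiB, "
          f"S,P = {mid_b / 2**30:10.1f} GiB, ratio = {mid_b / io_b:.0f}x")
```

Under these assumptions the ratio of intermediate to input / output storage is N / (2d), so it grows linearly with the sequence length.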

[0005] In a first aspect, this application provides a data processing method, the method comprising: transferring first data from memory to second memory, the first data including data required by a computing unit for performing fast attention FA operations in the same batch, the first data including a first Q vector, a second Q vector, and a first KV vector; performing the fast attention FA operation on the first Q vector and the first KV vector in the second memory through the computing unit to obtain a first operation result; and performing the fast attention FA operation on the second Q vector and the first KV vector in the first cache through the computing unit to obtain a second operation result.

[0006] In this embodiment of the application, when performing fast attention operations, the data retrieved from memory in each batch includes different Q vectors corresponding to the same KV vector. This allows the Q vectors to be reused during fast attention operations (the same Q vector can be used for different KV vectors), thereby reducing the amount of data transfer.
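The data flow of the first aspect can be sketched as follows: the first data (two Q vectors and one KV vector) is moved once, and both FA operations then read from the cache. This is a toy model under stated assumptions, not the implementation of this application: the Cache class, the dictionary standing in for memory, and the plain softmax attention used for the FA operation are all hypothetical.

```python
import math


def softmax_attention(q, keys, values):
    """Attention output for one query vector against a block of K/V rows."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)                      # subtract the max for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


class Cache:
    """Toy on-chip cache that counts how many tensors were moved in."""

    def __init__(self):
        self.data = {}
        self.transfers = 0

    def load(self, memory, names):
        for name in names:
            self.data[name] = memory[name]
            self.transfers += 1


# Hypothetical contents of the off-chip memory.
memory = {
    "Q1": [1.0, 0.0],
    "Q2": [0.0, 1.0],
    "K1": [[1.0, 0.0], [0.0, 1.0]],
    "V1": [[1.0, 2.0], [3.0, 4.0]],
}

cache = Cache()
cache.load(memory, ["Q1", "Q2", "K1", "V1"])   # one move of the "first data"

# Both FA operations read the same K1/V1 from the cache; nothing is re-moved.
first_result = softmax_attention(cache.data["Q1"], cache.data["K1"], cache.data["V1"])
second_result = softmax_attention(cache.data["Q2"], cache.data["K1"], cache.data["V1"])
print(first_result, second_result, cache.transfers)
```

A scheme that fetched K1 and V1 separately for each Q vector would count six transfers in this model instead of four.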

[0007] In one possible implementation, the first data further includes a second KV vector, and the method further includes: performing the fast attention FA operation on the first Q vector, the second KV vector and the first operation result in the first cache through the computing unit to obtain a third operation result; wherein the third operation result includes an update result of the first operation result.

[0008] In one possible implementation, the first data further includes a third KV vector, and the method further includes: performing the fast attention FA operation on the second Q vector, the third KV vector, and the second operation result in the first cache through the computing unit to obtain a fourth operation result; wherein the fourth operation result includes an update result of the second operation result.
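Paragraphs [0007] and [0008] describe an FA operation that takes an earlier operation result as input and produces an updated result. The application does not spell out the update rule here; the sketch below assumes the online-softmax accumulation commonly used in block-wise attention, and the names fa_step and finalize are hypothetical.

```python
import math


def fa_step(q, keys, values, state=None):
    """Fold one K/V block into the running attention state for query q.

    state is (row_max, denom, acc), where acc is the unnormalised output.
    Passing state=None starts a fresh accumulation.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    dim = len(values[0])
    if state is None:
        row_max, denom, acc = float("-inf"), 0.0, [0.0] * dim
    else:
        row_max, denom, acc = state
    new_max = max(row_max, max(scores))
    correction = math.exp(row_max - new_max)   # rescales the earlier result
    weights = [math.exp(s - new_max) for s in scores]
    denom = denom * correction + sum(weights)
    acc = [a * correction + sum(w * v[j] for w, v in zip(weights, values))
           for j, a in enumerate(acc)]
    return new_max, denom, acc


def finalize(state):
    """Normalise the accumulated output by the running softmax denominator."""
    _, denom, acc = state
    return [a / denom for a in acc]


# Hypothetical data: one Q vector and two K/V blocks given as (keys, values).
q1 = [0.5, -1.0]
kv1 = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
kv2 = ([[2.0, 1.0]], [[5.0, 6.0]])

first = fa_step(q1, *kv1)            # first Q vector with the first KV vector
third = fa_step(q1, *kv2, first)     # second KV vector updates the first result
print(finalize(third))
```

With this update rule, the finalized third result equals attention computed over the concatenation of both KV blocks, up to floating-point rounding, which is why the first result can be updated in place rather than recomputed.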

[0009] In one possible implementation, the memory is dedicated memory (GM), and the first cache is a level 1 cache (L1).

[0010] The key to fully enabling the L1 cache is that a single input data transfer can handle as many computational tasks as possible corresponding to the input data. Therefore, in the forward FA operator, the computation of a stage can be expanded from within one row block to within several row blocks. Each input matrix transfer stage transfers several Q, K, V row blocks at once, instead of one Q row block and several K, V row blocks, and processes their corresponding computational tasks, without needing to transfer the input matrix again during the process.
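The effect of widening a stage from one Q row block to several can be estimated with a simple count of input block transfers. The model below is an idealised assumption: it counts only Q and KV row blocks moved per stage, treats K and V as one unit, and ignores output and running-state traffic. The function name and the block counts are hypothetical.

```python
def count_block_transfers(num_q_blocks, num_kv_blocks, q_per_stage, kv_per_stage):
    """Count input row blocks moved when each stage loads a tile of
    q_per_stage Q blocks and kv_per_stage KV blocks."""
    transfers = 0
    for q_start in range(0, num_q_blocks, q_per_stage):
        q_count = min(q_per_stage, num_q_blocks - q_start)
        for kv_start in range(0, num_kv_blocks, kv_per_stage):
            kv_count = min(kv_per_stage, num_kv_blocks - kv_start)
            transfers += q_count + kv_count   # blocks moved for this stage
    return transfers


# Assumed problem size: 8 Q row blocks and 8 KV row blocks.
one_q_per_stage = count_block_transfers(8, 8, q_per_stage=1, kv_per_stage=8)
several_q_per_stage = count_block_transfers(8, 8, q_per_stage=4, kv_per_stage=8)
print(one_q_per_stage, several_q_per_stage)
```

In this model, a stage that holds one Q row block reloads every KV row block once per Q block, whereas a stage that holds four Q row blocks reloads them only once per group of four.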

[0011] In one possible implementation, the method further includes: after the computing unit completes the fast attention operation on the first data, deleting the first KV vector from the first cache.

[0012] In one possible implementation, the size of the first data is smaller than the capacity of the first memory.

[0013] In one possible implementation, the intermediate result obtained from performing a fast attention (FA) operation on the first data is stored in a second cache (L2), and the size of the intermediate result is smaller than the capacity of the second cache (L2).
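The two capacity conditions in paragraphs [0012] and [0013] bound how many row blocks one transfer can carry. The sketch below picks the largest number of Q row blocks that satisfies both. Everything numeric is an assumption for illustration: the block shape, the element sizes, the capacities, and in particular the composition of the intermediate result (an unnormalised output row plus a running maximum and running sum per Q row), which this application does not specify.

```python
def block_bytes(rows, head_dim, bytes_per_elem=2):
    """Size of one row block of Q, K, or V."""
    return rows * head_dim * bytes_per_elem


def fits_in_caches(q_blocks, kv_blocks, rows_per_block, head_dim,
                   first_capacity, second_capacity,
                   bytes_per_elem=2, acc_bytes=4):
    """Check both size conditions with strict 'smaller than' comparisons."""
    # First data: Q row blocks plus one K and one V row block per KV block.
    first_data = (q_blocks + 2 * kv_blocks) * block_bytes(
        rows_per_block, head_dim, bytes_per_elem)
    # Assumed intermediate result: per Q row, head_dim output values
    # plus a running max and a running sum.
    intermediate = q_blocks * rows_per_block * (head_dim + 2) * acc_bytes
    return first_data < first_capacity and intermediate < second_capacity


def largest_q_group(kv_blocks, rows_per_block, head_dim,
                    first_capacity, second_capacity):
    """Largest number of Q row blocks per transfer that still fits."""
    q_blocks = 0
    while fits_in_caches(q_blocks + 1, kv_blocks, rows_per_block, head_dim,
                         first_capacity, second_capacity):
        q_blocks += 1
    return q_blocks


# Hypothetical capacities: 1 MiB for the first data, 8 MiB for intermediates.
best = largest_q_group(kv_blocks=4, rows_per_block=128, head_dim=128,
                       first_capacity=1 << 20, second_capacity=8 << 20)
print(best)
```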

[0014] In one possible implementation, the method further includes:

[0015] The second data is moved from the memory to the second memory. The second data includes the data required by the computing unit when performing fast attention gradient (FAG) calculations in the same batch. The second data includes the fourth Q vector, the first gradient, the fourth KV vector, and the fifth KV vector.

[0016] The computing unit performs the fast attention gradient (FAG) operation on the fourth Q vector, the first gradient, and the fourth KV vector in the second memory to obtain a fifth calculation result.

[0017] Furthermore, the computation unit performs the Fast Attention Gradient (FAG) operation on the fourth Q vector, the first gradient, and the fifth KV vector in the first cache to obtain the sixth computation result.
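The reuse pattern in paragraphs [0015] to [0017], where one Q vector and one gradient are applied to two KV vectors in turn, can be sketched with the standard attention backward pass. This is an assumption about what the FAG operation computes, not a statement of this application's method: the sketch treats a single query row, takes the log-sum-exp and the dot product of the gradient with the forward output as given from the forward pass, and all function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def forward(q, keys, values):
    """Return (output, logsumexp) for one query row over all keys."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * dot(q, k) for k in keys]
    top = max(scores)
    lse = top + math.log(sum(math.exp(s - top) for s in scores))
    probs = [math.exp(s - lse) for s in scores]
    out = [sum(p * v[j] for p, v in zip(probs, values))
           for j in range(len(values[0]))]
    return out, lse


def fag_step(q, d_out, keys, values, lse, delta):
    """Gradients contributed by one K/V block: (dq_part, dk_block, dv_block)."""
    scale = 1.0 / math.sqrt(len(q))
    dq_part = [0.0] * len(q)
    dk_block, dv_block = [], []
    for k, v in zip(keys, values):
        p = math.exp(scale * dot(q, k) - lse)   # softmax weight, recomputed
        ds = p * (dot(d_out, v) - delta)        # gradient w.r.t. the score
        dv_block.append([p * g for g in d_out])
        dk_block.append([scale * ds * x for x in q])
        dq_part = [acc + scale * ds * x for acc, x in zip(dq_part, k)]
    return dq_part, dk_block, dv_block


# Hypothetical data: one Q vector, one upstream gradient, two K/V blocks.
q4 = [0.3, -0.7]
grad = [1.0, -2.0]
kv4 = ([[1.0, 0.5], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
kv5 = ([[-1.0, 2.0]], [[5.0, 6.0]])

out, lse = forward(q4, kv4[0] + kv5[0], kv4[1] + kv5[1])
delta = dot(grad, out)

fifth = fag_step(q4, grad, *kv4, lse, delta)   # Q, gradient, fourth KV vector
sixth = fag_step(q4, grad, *kv5, lse, delta)   # same Q and gradient, fifth KV vector
dq = [a + b for a, b in zip(fifth[0], sixth[0])]
print(dq)
```

Because q4 and grad are shared by both steps, they only need to be present in the cache once while the KV blocks are processed one after another.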

[0018] In one possible implementation, the size of the second data is smaller than the capacity of the first memory.

[0019] In a second aspect, this application provides a data processing apparatus, the apparatus comprising:

[0020] The data management module is used to move first data from memory to second memory. The first data includes the data required by the computing unit when performing fast attention (FA) operations in the same batch. The first data includes a first Q vector, a second Q vector, and a first KV vector.

[0021] The computing module of the computing unit is configured to perform the fast attention FA operation on the first Q vector and the first KV vector in the second memory to obtain a first operation result; and to perform the fast attention FA operation on the second Q vector and the first KV vector in the first cache to obtain a second operation result.

[0022] In one possible implementation, the first data further includes a second KV vector, and the calculation module is further configured to perform the fast attention FA operation on the first Q vector, the second KV vector and the first operation result in the first cache to obtain a third operation result; wherein the third operation result includes an update result of the first operation result.

[0023] In one possible implementation, the first data further includes a third KV vector, and the calculation module is further configured to perform the fast attention FA operation on the second Q vector, the third KV vector, and the second operation result in the first cache to obtain a fourth operation result; wherein the fourth operation result includes an update result of the second operation result.

[0024] In one possible implementation, the memory is dedicated memory (GM), and the first cache is a level 1 cache (L1).

[0025] In one possible implementation, the data management module is further configured to delete the first KV vector from the first cache after the computing unit completes the fast attention operation on the first data.

[0026] In one possible implementation, the size of the first data is smaller than the capacity of the first memory.

[0027] In one possible implementation, the intermediate result obtained from performing a fast attention (FA) operation on the first data is stored in a second cache (L2), and the size of the intermediate result is smaller than the capacity of the second cache (L2).

[0028] In one possible implementation, the data management module is further configured to:

[0029] Move second data from the memory to the second memory. The second data includes the data required by the computing unit when performing fast attention gradient (FAG) calculations in the same batch. The second data includes the fourth Q vector, the first gradient, the fourth KV vector, and the fifth KV vector.

[0030] The calculation module is further configured to perform the fast attention gradient (FAG) operation on the fourth Q vector, the first gradient, and the fourth KV vector in the second memory to obtain a fifth calculation result;

[0031] Furthermore, the Fast Attention Gradient (FAG) operation is performed on the fourth Q vector, the first gradient, and the fifth KV vector in the first cache to obtain the sixth operation result.

[0032] In one possible implementation, the size of the second data is smaller than the capacity of the first memory.

[0033] In a third aspect, embodiments of this application provide a data processing apparatus, which may include a memory, a processor, and a bus system, wherein the memory is used to store a program, and the processor is used to execute the program in the memory to perform the method described in the first aspect above and any of its optional implementations.

[0034] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing a computer program that, when run on a computer, causes the computer to perform the method described in the first aspect and any of its optional implementations.

[0035] In a fifth aspect, embodiments of this application provide a computer program that, when run on a computer, causes the computer to perform the method described in the first aspect and any of its optional implementations.

[0036] In a sixth aspect, this application provides a chip system including a processor for supporting a data processing device in implementing the functions involved in the foregoing aspects, such as sending or processing the data or information involved in the foregoing methods. In one possible design, the chip system further includes a memory for storing program instructions and data necessary for the execution device or training device. This chip system may be composed of chips or may include chips and other discrete devices.

Attached Figure Description

[0037] Figure 1A is a schematic diagram of a structural framework for artificial intelligence.

[0038] Figures 1B and 1C are schematic diagrams of the application system framework of the present invention;

[0039] Figure 1D is a schematic diagram of an optional hardware structure for the terminal;

[0040] Figure 2 is a schematic diagram of a server structure;

[0041] Figure 3 is a schematic diagram of a system architecture according to this application;

[0042] Figure 4 is a schematic diagram of a system architecture according to this application;

[0043] Figure 5 is a flowchart illustrating a data processing method provided in an embodiment of this application;

[0044] Figures 6A and 6B are schematic diagrams of the system architecture;

[0045] Figures 6C and 6D are schematic diagrams of the computational logic of the operator;

[0046] Figure 7 is a schematic diagram of a data processing method provided in an embodiment of this application;

[0047] Figure 8 is a schematic diagram of a data processing method provided in an embodiment of this application;

[0048] Figures 9A to 9D illustrate a data processing procedure provided in an embodiment of this application.

[0049] Figure 10 is a schematic diagram of a data processing device provided in an embodiment of this application;

[0050] Figure 11 is a schematic diagram of an execution device provided in an embodiment of this application;

[0051] Figure 12 is a schematic diagram of a training device provided in an embodiment of this application;

[0052] Figure 13 is a schematic diagram of a chip structure provided in an embodiment of this application.

Detailed Implementation

[0053] The embodiments of the present invention will now be described with reference to the accompanying drawings. The terminology used in the embodiments section is for illustrative purposes only and is not intended to limit the scope of the invention.

[0054] The embodiments of this application will now be described with reference to the accompanying drawings. Those skilled in the art will recognize that, with technological advancements and the emergence of new scenarios, the technical solutions provided in the embodiments of this application are equally applicable to similar technical problems.

[0055] The terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such terms are interchangeable where appropriate; this is merely a way of distinguishing objects with the same attributes in the embodiments of this application. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion, so that a process, method, system, product, or apparatus that comprises a series of elements is not necessarily limited to those elements, but may include other elements not explicitly listed or inherent to those processes, methods, products, or apparatuses.

[0056] The terms “substantially,” “about,” and similar terms used herein are used as approximations rather than as terms of degree, and are intended to take into account the inherent biases of measurements or calculations known to those skilled in the art. Furthermore, the use of “may” in describing embodiments of the invention refers to “one or more possible embodiments.” The terms “use,” “using,” and “used” used herein are to be considered synonymous with the terms “utilize,” “utilizing,” and “utilized,” respectively. Additionally, the term “exemplary” is intended to refer to an instance or illustration.

[0057] First, the overall workflow of an artificial intelligence system is described, as shown in Figure 1A. Figure 1A is a structural diagram of the main framework of artificial intelligence. The framework is described below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects a series of processes from data acquisition to processing. For example, it could be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, data is progressively condensed from data to information to knowledge to wisdom. The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (the technologies that provide and process it) to the industrial ecosystem of the system.

[0058] (1) Infrastructure

[0059] Infrastructure provides computing power to support artificial intelligence systems, enabling communication with the external world and providing support through a basic platform. This communication occurs through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); and the basic platform includes distributed computing frameworks and related platform guarantees and support, which may include cloud storage and computing, interconnected networks, etc. For example, sensors communicate with the outside world to acquire data, and this data is provided to intelligent chips in the distributed computing system provided by the basic platform for computation.

[0060] (2) Data

[0061] The data at the next layer of infrastructure is used to represent the data sources in the field of artificial intelligence. The data involves graphics, images, voice, text, and IoT data from traditional devices, including business data from existing systems and sensor data such as force, displacement, liquid level, temperature, and humidity.

[0062] (3) Data processing

[0063] Data processing typically includes methods such as data training, machine learning, deep learning, search, reasoning, and decision-making.

[0064] Among them, machine learning and deep learning can perform intelligent information modeling, extraction, preprocessing, and training on data, including symbolization and formalization.

[0065] Reasoning refers to the process in which, in a computer or intelligent system, the machine thinks and solves problems by simulating human intelligent reasoning, based on reasoning control strategies and using formalized information. Typical functions include search and matching.

[0066] Decision-making refers to the process of making decisions based on intelligent information after reasoning, and it typically provides functions such as classification, sorting, and prediction.

[0067] (4) General ability

[0068] After the data processing mentioned above, the results of the data processing can be used to form some general capabilities, such as algorithms or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.

[0069] (5) Smart Products and Industry Applications

[0070] Intelligent products and industry applications refer to products and applications of artificial intelligence systems in various fields. They are the encapsulation of overall artificial intelligence solutions, productizing intelligent information decision-making and realizing practical applications. Their application areas mainly include: intelligent terminals, intelligent transportation, intelligent healthcare, autonomous driving, smart cities, etc.

[0071] This application can be applied to the field of natural language processing in the field of artificial intelligence. The following will introduce several application scenarios that have been implemented in products, taking natural language processing as an example.

[0072] First, the application scenarios of this application are introduced. This application can be applied to, but is not limited to, applications with generative artificial intelligence (AIGC) functionality (hereinafter referred to as synthetic applications) or cloud services provided by cloud-side servers, each of which is introduced below:

[0073] I. Synthesis Applications

[0074] The product form of this application embodiment can be a synthetic application. Synthetic applications can run on terminal devices or cloud-based servers.

[0075] In one possible implementation, a synthesis application can perform data generation tasks in response to input data (e.g., images, text, audio, video, etc.) to obtain generated data.

[0076] For example, the task of generating the above data can be, but is not limited to:

[0077] Text generation task: It can generate various types of text content, including news reports, blog posts, product descriptions, social media posts, etc. It can generate logical and coherent text based on given themes and requirements.

[0078] Image generation task: This task can generate images, including illustrations, artworks, design drafts, etc. It can generate image content related to given descriptions or keywords.

[0079] Audio generation task: This task can generate speech content, including text readings and responses from voice assistants. It can simulate human speech characteristics and intonation, making the generated speech sound more natural.

[0080] Content summarization and conclusion tasks: It can read large amounts of text content and generate summaries or conclusions. It can extract key information from the text and present it to the user in a concise manner.

[0081] Language translation task: This function performs language translation, converting text from one language to another. It can handle multiple language pairs and provide accurate translation results.

[0082] Automated replies and customer service: This can be used to automatically answer user questions and provide customer service. It can understand the user's intent and provide accurate answers or suggestions.

[0083] In one possible implementation, a user can open a synthesis application installed on a terminal device and input data (such as images, text, audio, video, etc.). The synthesis application can generate data from the input data using the method provided in the embodiments of this application and present the generated data to the user (the presentation method may include, but is not limited to, displaying, saving, uploading to the cloud, etc.).

[0084] In one possible implementation, a user can open a synthesis application installed on a terminal device and input data. The synthesis application can then send the input data to a cloud-based server. The cloud-based server uses the method provided in this application to generate data from the input data and sends the generated data back to the terminal device. The terminal device can then present the generated data to the user (the presentation method may include, but is not limited to, displaying, saving, or uploading to the cloud).

[0085] The following sections will describe the synthetic application in this application from the perspectives of functional architecture and product architecture that implements the functions.

[0086] Referring to Figure 1B, which is a schematic diagram of the functional architecture of the synthetic application in an embodiment of this application:

[0087] In one possible implementation, as shown in Figure 1B, the synthetic application 102 may receive input parameters 101 (e.g., containing input data) and produce generated data 103. The synthetic application 102 may execute on, for example, at least one computer system and includes computer code that, when executed by one or more computers, causes the computers to execute a natural language model trained by the methods provided in the embodiments of this application.

[0088] Referring to Figure 1C, which is a schematic diagram of the entity architecture for running a synthetic application in an embodiment of this application:

[0089] Referring to Figure 1C, which illustrates a system architecture, the system may include a terminal 100 and a server 200. The server 200 may include one or more servers (Figure 1C illustrates this using one server as an example), and the server 200 may provide synthesis function services to one or more terminals.

[0090] The terminal 100 may have a synthesis application installed or a webpage related to the synthesis function open. The application and webpage can provide an interface. The terminal 100 can receive relevant parameters input by the user on the synthesis function interface and send the parameters to the server 200. The server 200 can obtain the processing result based on the received parameters and return the processing result to the terminal 100.

[0091] It should be understood that in some optional implementations, the terminal 100 can also complete the action of obtaining the processing result based on the received parameters on its own, without the need for the server to cooperate. This application embodiment is not limited to this.

[0092] The product form of terminal 100 in Figure 1C will be described next;

[0093] The terminal 100 in this application embodiment can be a mobile phone, tablet computer, wearable device, vehicle device, augmented reality (AR) / virtual reality (VR) device, laptop computer, ultra-mobile personal computer (UMPC), netbook, personal digital assistant (PDA), etc., and this application embodiment does not impose any restrictions on it.

[0094] Figure 1D shows a schematic diagram of an optional hardware structure for terminal 100.

[0095] Referring to Figure 1D, terminal 100 may include components such as a radio frequency unit 110, a memory 120, an input unit 130, a display unit 140, a camera 150 (optional), an audio circuit 160 (optional), a speaker 161 (optional), a microphone 162 (optional), a processor 170, an external interface 180, and a power supply 190. Those skilled in the art will understand that Figure 1D is merely an example of a terminal or multi-functional device and does not constitute a limitation on the terminal or multi-functional device; it may include more or fewer components than illustrated, or combine certain components, or use different components.

[0096] The input unit 130 can be used to receive input numerical or character information, and to generate key signal inputs related to user settings and function control of the portable multi-functional device. Specifically, the input unit 130 may include a touchscreen 131 (optional) and / or other input devices 132. The touchscreen 131 can collect touch operations performed by the user on or near it (such as operations performed by the user using fingers, knuckles, styluses, or any suitable object on or near the touchscreen), and drive the corresponding connection devices according to a pre-set program. The touchscreen can detect the user's touch actions, convert the touch actions into touch signals and send them to the processor 170, and can receive and execute commands sent by the processor 170; the touch signal includes at least touch point coordinate information. The touchscreen 131 can provide an input interface and an output interface between the terminal 100 and the user. In addition, various types of touchscreens, such as resistive, capacitive, infrared, and surface acoustic wave, can be used to implement the touchscreen. Besides the touchscreen 131, the input unit 130 may also include other input devices. Specifically, other input devices 132 may include, but are not limited to, one or more of the following: physical keyboard, function keys (such as volume control buttons 132, power buttons 133, etc.), trackball, mouse, joystick, etc.

[0097] The other input devices 132 can, for example, receive the input data.

[0098] The display unit 140 can be used to display information input by the user or information provided to the user, various menus of the terminal 100, interactive interfaces, file display, and / or playback of any multimedia file. In this embodiment, the display unit 140 can be used to display the interface of a synthesis application, generated data, etc.

[0099] The memory 120 can be used to store instructions and data. The memory 120 may primarily include an instruction storage area and a data storage area. The data storage area can store various types of data, such as multimedia files and text. The instruction storage area can store software units such as operating systems, applications, and instructions required for at least one function, or subsets or extended sets thereof. It may also include non-volatile random access memory. It provides the processor 170 with hardware, software, and data resources for managing the computing device, supporting control software and applications. It is also used for storing multimedia files, as well as storing running programs and applications.

[0100] The processor 170 is the control center of the terminal 100. It connects various parts of the terminal 100 via various interfaces and lines. By running or executing instructions stored in the memory 120 and calling data stored in the memory 120, it performs various functions and processes data of the terminal 100, thereby controlling the terminal device as a whole. Optionally, the processor 170 may include one or more processing units; preferably, the processor 170 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interface, and applications, and the modem processor mainly handles wireless communication. It is understood that the modem processor may not be integrated into the processor 170. In some embodiments, the processor and memory can be implemented on a single chip; in some embodiments, they can also be implemented separately on independent chips. The processor 170 can also be used to generate corresponding operation control signals, send them to the corresponding components of the computing processing device, read and process data in the software, especially read and process data and programs in the memory 120, so that the various functional modules therein perform corresponding functions, thereby controlling the corresponding components to act according to the instructions.

[0101] The memory 120 can be used to store software code related to the data processing method, and the processor 170 can execute the steps of the chip's data processing method, and can also schedule other units (such as the above-mentioned input unit 130 and display unit 140) to achieve the corresponding functions.

[0102] The radio frequency unit 110 (optional) can be used for receiving and transmitting signals during information transmission or calls. For example, it can receive downlink information from the base station and process it for the processor 170; additionally, it can transmit uplink data to the base station. Typically, the RF circuit includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low-noise amplifier (LNA), a duplexer, etc. Furthermore, the radio frequency unit 110 can also communicate wirelessly with network devices and other devices. This wireless communication can use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), etc.

[0103] In this embodiment of the application, the radio frequency unit 110 can send input data to the server 200 and receive generated data sent by the server 200.

[0104] It should be understood that the radio frequency unit 110 is optional and can be replaced with other communication interfaces, such as a network port.

[0105] The terminal 100 also includes a power supply 190 (such as a battery) that supplies power to various components. Preferably, the power supply can be logically connected to the processor 170 through a power management system, thereby enabling functions such as charging, discharging, and power consumption management through the power management system.

[0106] Terminal 100 also includes an external interface 180, which can be a standard Micro USB interface or a multi-pin connector, which can be used to connect terminal 100 to other devices for communication or to connect a charger to charge terminal 100.

[0107] Although not shown, terminal 100 may also include a flash, a wireless fidelity (WiFi) module, a Bluetooth module, and sensors with various functions, which will not be described in detail here. Some or all of the methods described below can be applied to terminal 100 as shown in Figure 1D.

[0108] The product form of server 200 in Figure 1C is described below.

[0109] Figure 2 provides a schematic diagram of the structure of a server 200. As shown in Figure 2, the server 200 includes a bus 201, a processor 202, a communication interface 203, and a memory 204. The processor 202, the memory 204, and the communication interface 203 communicate with each other via the bus 201.

[0110] Bus 201 can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of representation, only one thick line is used in Figure 2, but this does not indicate that there is only one bus or one type of bus.

[0111] The processor 202 can be any one or more of the following processors: central processing unit (CPU), graphics processing unit (GPU), microprocessor (MP), or digital signal processor (DSP).

[0112] Memory 204 may include volatile memory, such as random access memory (RAM). Memory 204 may also include non-volatile memory, such as read-only memory (ROM), flash memory, hard disk drive (HDD), or solid state drive (SSD).

[0113] The memory 204 can be used to store software code related to the data processing method, and the processor 202 can execute the steps of the chip's data processing method, and can also schedule other units to achieve corresponding functions.

[0114] It should be understood that the aforementioned terminal 100 and server 200 can be centralized or distributed devices. The processors (e.g., processor 170 and processor 202) in the aforementioned terminal 100 and server 200 can be hardware circuits (such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), general-purpose processors, digital signal processors (DSPs), microprocessors or microcontrollers, etc.) or combinations of these hardware circuits. For example, the processor can be a hardware system with instruction execution capabilities, such as a CPU or DSP, or a hardware system without instruction execution capabilities, such as an ASIC or FPGA, or a combination of the aforementioned hardware systems without instruction execution capabilities and hardware systems with instruction execution capabilities.

[0115] It should be understood that the steps related to the model inference process in the embodiments of this application involve AI-related operations. When performing AI operations, the instruction execution architecture of the terminal device and the server is not limited to the processor-memory architecture described above. The system architecture provided in the embodiments of this application will be described in detail below with reference to Figure 5.

[0116] Figure 5 is a schematic diagram of the system architecture provided in an embodiment of this application. As shown in Figure 5, the system architecture 500 includes an execution device 510, a training device 520, a database 530, a client device 540, a data storage system 550, and a data acquisition device 560.

[0117] The execution device 510 includes a calculation module 511, an I / O interface 512, a preprocessing module 513, and a preprocessing module 514. The calculation module 511 may include a target model / rule 501, while the preprocessing modules 513 and 514 are optional.

[0118] The execution device 510 can be a terminal device or a server that runs the aforementioned synthetic application.

[0119] The data acquisition device 560 is used to collect training samples. Training samples can be program files (including program code and program input data), etc. After collecting the training samples, the data acquisition device 560 stores these training samples in the database 530.

[0120] The training device 520 can train the neural network to be trained based on the training samples maintained in the database 530, thereby obtaining the target model / rule 501.

[0121] It should be noted that in practical applications, the training samples maintained in database 530 may not all come from the data acquisition device 560; they may also be received from other devices. Furthermore, it should be noted that training device 520 may not necessarily train the target model / rule 501 entirely based on the training samples maintained in database 530; it may also obtain training samples from the cloud or other sources for model training. The above description should not be construed as limiting the embodiments of this application.

[0122] The target model / rule 501 trained by the training device 520 can be applied to different systems or devices, such as the execution device 510 shown in Figure 5. The execution device 510 can be a terminal, such as a mobile terminal, tablet computer, laptop computer, augmented reality (AR) / virtual reality (VR) device, vehicle terminal, etc., or it can be a server, etc.

[0123] Specifically, the training device 520 can transfer the trained model to the execution device 510.

[0124] In Figure 5, the execution device 510 is configured with an input / output (I / O) interface 512 for data interaction with external devices. Users can input data to the I / O interface 512 through the client device 540 (e.g., input data in the embodiment of this application).

[0125] Preprocessing modules 513 and 514 are used to preprocess the input data received from the I / O interface 512. It should be understood that preprocessing modules 513 and 514 may be absent, or only one preprocessing module may be used. When preprocessing modules 513 and 514 are absent, the calculation module 511 can be used directly to process the input data.

[0126] During the preprocessing of input data by the execution device 510, or during the calculation module 511 of the execution device 510 performing calculations and other related processes, the execution device 510 can call data, code, etc. in the data storage system 550 for corresponding processing, or store the data, instructions, etc. obtained from the corresponding processing into the data storage system 550.

[0127] Finally, the I / O interface 512 provides the processing results (such as generated data) to the client device 540, thereby providing them to the user.

[0128] In the scenario shown in Figure 5, the user can manually provide input data through the interface provided by the I / O interface 512. Alternatively, the client device 540 can automatically send input data to the I / O interface 512. If user authorization is required for the client device 540 to automatically send input data, the user can set the corresponding permissions in the client device 540. The user can view the output results of the execution device 510 on the client device 540, which can be presented in various forms such as display, sound, or animation. The client device 540 can also act as a data acquisition terminal, collecting the input data fed to the I / O interface 512 and the output results returned by the I / O interface 512, as shown in the figure, and storing them as new sample data in database 530. Alternatively, without going through the client device 540, the I / O interface 512 can directly store the input data it receives and the output results it returns, as shown in the figure, as new sample data in database 530.

[0129] It is worth noting that Figure 5 is merely a schematic diagram of a system architecture provided in an embodiment of this application. The positional relationships between the devices, components, modules, etc., shown in the figure do not constitute any limitation. For example, in Figure 5, the data storage system 550 is an external memory relative to the execution device 510. In other cases, the data storage system 550 can also be placed in the execution device 510. It should be understood that the aforementioned execution device 510 can be deployed in the client device 540.

[0130] From the inference side of the model:

[0131] In this embodiment, the computing module 511 of the execution device 510 can obtain the code stored in the data storage system 550 to implement the steps related to the model reasoning process in this embodiment.

[0132] In this embodiment of the application, the computing module 511 of the execution device 510 may include hardware circuits (such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), general-purpose processors, digital signal processors (DSPs), microprocessors or microcontrollers, etc.) or combinations of these hardware circuits. For example, the computing module 511 may be a hardware system with instruction execution capabilities, such as a CPU or DSP, or a hardware system without instruction execution capabilities, such as an ASIC or FPGA, or a combination of the aforementioned hardware systems without instruction execution capabilities and hardware systems with instruction execution capabilities.

[0133] Specifically, the computing module 511 of the execution device 510 can be a hardware system with the function of executing instructions. The steps related to the model inference process provided in this application embodiment can be software code stored in the memory. The computing module 511 of the execution device 510 can obtain the software code from the memory and execute the obtained software code to implement the steps related to the model inference process provided in this application embodiment.

[0134] It should be understood that the computing module 511 of the execution device 510 can be a combination of a hardware system without the function of executing instructions and a hardware system with the function of executing instructions. Some steps related to the model reasoning process provided in the embodiments of this application can also be implemented by the hardware system in the computing module 511 of the execution device 510 without the function of executing instructions, which is not limited here.

[0135] From the training side of the model:

[0136] In this embodiment, the training device 520 can obtain the code stored in the memory (not shown in Figure 5, which can be integrated into the training device 520 or deployed separately from the training device 520) to implement the steps related to model training in this embodiment.

[0137] In this embodiment of the application, the training device 520 may include hardware circuits (such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), general-purpose processors, digital signal processors (DSPs), microprocessors or microcontrollers, etc.) or combinations of these hardware circuits. For example, the training device 520 may be a hardware system with instruction execution capabilities, such as a CPU or DSP, or a hardware system without instruction execution capabilities, such as an ASIC or FPGA, or a combination of the aforementioned hardware systems without instruction execution capabilities and hardware systems with instruction execution capabilities.

[0138] It should be understood that the training device 520 can be a combination of a hardware system without the function of executing instructions and a hardware system with the function of executing instructions. Some steps related to model training provided in the embodiments of this application can also be implemented by the hardware system in the training device 520 without the function of executing instructions, which is not limited here.

[0139] II. Cloud services providing synthesis functionality provided by the server:

[0140] In one possible implementation, the server can provide composition services to the client side through an application programming interface (API).

[0141] In this process, the terminal device can send relevant parameters (such as input data) to the server through the API provided by the cloud. The server can obtain the processing result (such as generated data) based on the received parameters and return the processing result to the terminal.

[0142] The description of the terminal and server can be found in the above embodiments, and will not be repeated here.

[0143] Figure 6A illustrates the process of using a synthetic function cloud service provided by a cloud platform.

[0144] 1. Activate and purchase content moderation services.

[0145] 2. Users can download the software development kit (SDK) corresponding to the content moderation service. Cloud platforms usually provide multiple development versions of the SDK for users to choose from according to their development environment needs, such as JAVA version SDK, Python version SDK, PHP version SDK, Android version SDK, etc.

[0146] 3. After downloading the corresponding version of the SDK to their local machine according to their needs, users can import the SDK project into their local development environment, configure and debug it in the local development environment, and develop other functions in the local development environment, thus forming an application that integrates the capabilities of composite functional classes.

[0147] 4. When a composition application is used, it can trigger an API call for the composition function when the composition function is required. When the application triggers the composition function, it sends an API request to the running instance of the composition function service in the cloud environment. The API request carries the input data, which is processed by the running instance in the cloud environment to obtain the processing result (such as generated data).

[0148] 5. The cloud environment returns the processing result to the application, thus completing a synthesis function service call.

[0149] In addition to applications and cloud services, the implementation of this application can also be in a large model inference acceleration library or a large model application SDK.

[0150] To better understand the solutions of the embodiments of this application, the following uses text generation as an example and combines Figures 2 to 4 to briefly introduce the possible application scenarios of the embodiments of this application.

[0151] Figure 3 illustrates a natural language processing (NLP) system, which includes user devices and data processing devices. The user devices include smart terminals such as mobile phones, personal computers, or information processing centers. The user devices are the initiators of natural language data processing, acting as the initiators of requests such as language question answering or queries; typically, users initiate requests through their user devices.

[0152] The aforementioned data processing equipment can be cloud servers, network servers, application servers, management servers, or other devices or servers with data processing capabilities. The data processing equipment receives queries / voice / text from smart terminals via an interactive interface, then performs language data processing through a storage device and a data processing processor, employing methods such as machine learning, deep learning, search, reasoning, and decision-making. The processing results are then fed back to the user device. The storage device in the data processing equipment can be a general term, including local storage and a database storing historical data. The database can be located on the data processing equipment or on other network servers.

[0153] In the natural language processing system shown in Figure 3, the user device can receive instructions from the user. For example, the user device can receive a piece of text input by the user and then send a request to the data processing device, so that the data processing device can perform natural language processing applications (such as natural language generation, text classification, text reasoning, named entity recognition, translation, etc.) on the piece of text obtained by the user device, thereby obtaining the processing results of the corresponding natural language processing applications on the piece of text (such as prediction results, classification results, reasoning results, named entity recognition results, translation results, etc.).

[0154] In this embodiment of the application, the user equipment can receive instructions from the user. For example, the user equipment can receive a piece of text input by the user (e.g., input data) and then send a request to the data processing device, so that the data processing device performs a natural language processing application (e.g., text synthesis) on the piece of text obtained by the user equipment, thereby obtaining the processing result (e.g., generated data) of the corresponding natural language processing application on the piece of text.

[0155] In Figure 3, the data processing device can process the above-mentioned text data using the method provided in the embodiments of this application.

[0156] Figure 4 illustrates another natural language processing system. In Figure 4, the user equipment directly acts as a data processing device. This user equipment can directly receive input from the user and process it directly by the hardware of the user equipment itself. The specific process is similar to that in Figure 3, and can be referred to the description above, so it will not be repeated here.

[0157] Figure 4 is a schematic diagram of the natural language processing related devices provided in the embodiments of this application.

[0158] The processors in Figures 3 and 4 can perform data training / machine learning / deep learning using neural network models or other models, and use the models finally trained or learned from the data (such as the natural language models in the embodiments of this application) to perform natural language processing applications (such as program synthesis, etc.) on text data (such as the input data text described in the embodiments of this application) to obtain the corresponding processing results.

[0159] Since the embodiments of this application involve a large number of neural network applications, for ease of understanding, the relevant terms and concepts such as neural networks involved in the embodiments of this application will be introduced below.

[0160] (1) Neural Network

[0161] A neural network can be composed of neural units. A neural unit can be defined as a computational unit that takes $x_s$ (i.e., input data) and an intercept of 1 as input. The output of this computational unit can be: $h_{W,b}(x) = f(W^{T}x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$

[0162] Where s = 1, 2, ..., n, where n is a natural number greater than 1, $W_s$ is the weight of $x_s$, and b is the bias of the neural unit. f is the activation function of the neural unit, used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer, and the activation function can be the sigmoid function. A neural network is a network formed by connecting multiple of the above-mentioned individual neural units together, that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field, which can be a region composed of several neural units.
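As an illustration only, the output of a single neural unit can be sketched in Python, assuming the usual form f(sum over s of W_s * x_s + b) with a sigmoid activation. The function names `sigmoid` and `neural_unit` are hypothetical and do not come from this application.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, one possible choice for the activation function f."""
    return 1.0 / (1.0 + math.exp(-z))

def neural_unit(xs, ws, b, f=sigmoid):
    """Output of one neural unit: f(sum_s W_s * x_s + b)."""
    if len(xs) != len(ws):
        raise ValueError("xs and ws must have the same length")
    z = sum(w * x for w, x in zip(ws, xs)) + b
    return f(z)

# Weighted sum is 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0, so the output is sigmoid(0).
output = neural_unit([1.0, 2.0], [0.5, -0.25], 0.0)
```

Feeding the output of one such unit into another unit as one of its inputs gives the layered structure described above.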

[0163] (2) Transformer layer

[0164] The neural network includes an embedding layer and at least one transformer layer. The at least one transformer layer can be N transformer layers (N being an integer greater than 0). Each transformer layer includes sequentially adjacent attention layers, add and normalize layers, feed-forward layers, and add and normalize layers. In the embedding layer, the current input is embedded to obtain multiple embedding vectors. In the attention layer, P input vectors are obtained from the layer above the first transformer layer. Using any first input vector among the P input vectors as the center, intermediate vectors corresponding to the first input vector are obtained based on the correlation between each input vector within a preset attention window and the first input vector. This process determines P intermediate vectors corresponding to the P input vectors. In the pooling layer, the P intermediate vectors are merged into Q output vectors, where the multiple output vectors obtained from the last transformer layer are used as feature representations of the current input.

[0165] (3) Attention mechanism

[0166] Attention mechanisms mimic the internal processes of biological observation, aligning internal experience with external senses to increase the precision of observation in specific areas. They enable the rapid sifting of high-value information from a large volume of data using limited attentional resources. Attention mechanisms can quickly extract important features from sparse data and are therefore widely used in natural language processing tasks, particularly machine translation. Self-attention mechanisms, an improvement on attention mechanisms, reduce reliance on external information and are better at capturing the internal correlations of data or features. The core idea of attention mechanisms can be rewritten as follows: $\mathrm{Attention}(\mathrm{Query}, \mathrm{Source}) = \sum_{i=1}^{L_x} \mathrm{Similarity}(\mathrm{Query}, \mathrm{Key}_i) \cdot \mathrm{Value}_i$

[0167] In this formula, $L_x$ = ||Source|| represents the length of the Source. The meaning is that the elements in the Source are imagined as a series of <Key, Value> data pairs. Given a Query element in the Target, the similarity or relevance between the Query and each Key is calculated to obtain the weight coefficient of the Value corresponding to each Key. Then, the Values are weighted and summed to obtain the final Attention value. Therefore, the Attention mechanism essentially performs a weighted sum of the Values of the elements in the Source, while the Query and Key are used to calculate the weight coefficients of their corresponding Values. Conceptually, Attention can be understood as selectively filtering a small amount of important information from a large amount of information and focusing on this important information, ignoring most of the unimportant information. The focusing process is reflected in the calculation of the weight coefficients; the larger the weight, the more focused it is on its corresponding Value. That is, the weight represents the importance of the information, and the Value is the corresponding information. Self-attention can be understood as intra attention. The attention mechanism occurs between the elements of the Target (Query) and all elements of the Source. Self-attention refers to the attention mechanism that occurs between elements within the Source or between elements within the Target. It can also be understood as the attention calculation mechanism in the special case where Target = Source. The specific calculation process is the same, only the calculation object changes.
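The weighted-sum view of attention can be sketched as follows. This is a minimal illustration, assuming the similarity is a dot product normalized with Softmax (the text leaves the similarity function open); the function name `attention` is hypothetical.

```python
import math

def attention(query, keys, values):
    """Weighted sum of Values; the weights come from Query-Key similarity.

    Assumption: similarity is a dot product followed by Softmax normalization.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)                                  # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]              # one weight per <Key, Value> pair
    dim = len(values[0])
    out = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(dim)]
    return weights, out

# Two identical Keys receive equal weights, so the result is the mean of the Values.
weights, out = attention([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [0.0, 2.0]])
```

A larger weight means the corresponding Value contributes more to the result, which is the "focusing" behavior described above.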

[0168] (4) Natural Language Processing (NLP)

[0169] Natural language is human language, and Natural Language Processing (NLP) is the processing of human language. NLP is a systematic process of analyzing, understanding, and extracting information from text data in an intelligent and efficient manner. By using NLP and its components, we can manage very large amounts of text data, perform numerous automated tasks, and solve a wide variety of problems, such as automatic summarization, machine translation (MT), named entity recognition (NER), relation extraction (RE), information extraction (IE), sentiment analysis, speech recognition, question answering systems, and topic segmentation, among others.

[0170] (5) Pre-trained language model

[0171] A pre-trained language model is a natural language sequence encoder that encodes each word in a natural language sequence into a vector representation for prediction tasks. Its training consists of two phases. In the pre-training phase, the model is trained on a large-scale unsupervised text environment to learn word representations. In the fine-tuning phase, the model is initialized using the parameters learned in the pre-training phase and then trained on downstream tasks such as text classification and sequence labeling with fewer steps, successfully transferring the semantic information obtained in pre-training to these downstream tasks.

[0172] (6) Inference / Deployment: The forward computation process of a neural network.

[0173] (7) Large Language Model (LLM): A large language model is a natural language processing model trained on large-scale data, typically with billions or tens of billions of parameters. These models learn to capture the general features of language by learning from a large amount of text data during the pre-training stage, and can then be fine-tuned on downstream tasks to adapt to the needs of specific tasks.

[0174] (8) Matrix processing unit (cube unit): The unit in the NPU used to calculate matrix multiplication. It can only calculate matrix multiplication of data of the same type, such as integer matrix multiplication by integer matrix, or floating-point matrix multiplication by floating-point matrix. In general NPU chips, the cube unit has strong computing power.

[0175] (9) Vector processing unit: The unit used by the NPU to perform vector operations. It can perform a variety of vector operations, but its computing power is weaker than that of the Cube unit.

[0176] (10) Global memory (GM): Memory used to store data. It has a large space. The data involved in the calculation of Cube and Vector units need to be read from GM, but the bandwidth (read speed) of GM is low.

[0177] (11) L2: L2 cache (secondary cache) is used to store data. It has a smaller space than GM and a higher read and write speed than GM. At the beginning of the calculation process, the data is located in GM and needs to enter L2 first and then enter Cube / UB; while the data coming out of Cube / UB can directly enter L2.

[0178] (12) L1: Level 1 cache, i.e., the cache in the Cube unit. The space is relatively smaller than L2, and the read and write speed is relatively higher than L2.

[0179] (13) L0: Level 0 cache, which is a cache in the Cube unit specifically used to store computational data. It has the smallest space and the highest read and write speed.

[0180] (14) UB: The cache in the Vector unit. The computation on the Vector side obtains data from UB.

[0181] (15) MTE3: The data path from L1 / UB cache to GM / L2, denoted as MTE3_cube / MTE3_vec respectively.

[0182] (16) MTE2: Data transfer path from L2 cache to L1 / UB cache, denoted as MTE2_cube / MTE2_vec respectively.

[0183] (17)MTE1: Data transfer path from L1 cache to L0 cache.

[0184] (18) Fixp: Data transfer path from L0 cache to GM / L2.

[0185] Large language models have become a hot research area in recent years and are an important component of general artificial intelligence, with wide applications in chatbots, search systems, and assisted programming. These models are based on the Transformer architecture, and the space complexity of the attention mechanism, a crucial component of large language models, grows quadratically with the sequence length used in training. Moving the intermediate variables significantly increases computation time, leading to excessively long overall training times for large language models. Flash Attention (forward) / Flash Attention Gradient (backward), proposed by Tri Dao, reduces the peak memory usage during forward / backward calculations by decoupling the Softmax computation, which reduces data movement and thus significantly improves the training / inference speed of large language models.

[0186] Attention Mechanism is widely used in deep learning and performs well on sequence tasks. It achieves efficient feature extraction and enhances model performance by calculating different attention allocations at each stage of the sequence task. Specifically, Attention Mechanism involves three input-related matrices: Q (Query), K (Key), and V (Value), all with dimensions (N, d), where N represents the sequence length and d represents the feature dimension. The forward calculation formula for Attention Mechanism is as follows: 1. $S = QK^{T}$; 2. $P = \mathrm{Softmax}(S)$; 3. $O = PV$;

[0187] The Softmax operation is performed on each row vector of the matrix to normalize the vectors into weights.

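[0188] $x = [x_1, x_2, \ldots, x_d]$, $m = \max_i x_i$, $\mathrm{Softmax}(x) = \dfrac{[e^{x_1 - m}, e^{x_2 - m}, \ldots, e^{x_d - m}]}{\sum_{i=1}^{d} e^{x_i - m}}$;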
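The forward calculation can be sketched in plain Python as a reference, assuming the three steps S = QK^T, P = Softmax(S) applied row by row, and O = PV. The helper names (`softmax`, `matmul`, `transpose`, `attention_forward`) are hypothetical, and matrices are plain lists of rows.

```python
import math

def softmax(x):
    """Row Softmax with the row maximum m subtracted before exponentiation."""
    m = max(x)
    f = [math.exp(xi - m) for xi in x]
    total = sum(f)
    return [fi / total for fi in f]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention_forward(q, k, v):
    s = matmul(q, transpose(k))        # step 1: S = Q K^T, shape (N, N)
    p = [softmax(row) for row in s]    # step 2: P = Softmax(S), row by row
    o = matmul(p, v)                   # step 3: O = P V, shape (N, d)
    return s, p, o

q = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
k = [[0.2, 0.1], [-0.3, 0.4], [0.5, 0.5]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
s, p, o = attention_forward(q, k, v)
```

Note that S and P both hold N × N entries here, which is the quadratic intermediate storage that the blocked formulation avoids.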

[0189] The corresponding reverse calculation process is as follows: 1. $dP = dO \times V^{T}$; 2. $dS = P \odot (dP - \mathrm{rowsum}(P \odot dP))$; 3. $dQ = dS \times K$, $dK = (dS)^{T} \times Q$, $dV = P^{T} \times dO$;

[0190] The second step is to calculate the gradient of the Softmax operation. ⊙ represents the Hadamard product of the matrix (pointwise multiplication), and rowsum(·) represents calculating the row sum of each row of the matrix.
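A minimal unblocked sketch of the reverse calculation is given below for reference. It assumes the gradient with respect to Q is dS × K and recomputes P from Q and K. All function names are hypothetical.

```python
import math

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(s):
    out = []
    for row in s:
        m = max(row)
        e = [math.exp(x - m) for x in row]
        total = sum(e)
        out.append([x / total for x in e])
    return out

def attention_output(q, k, v):
    """Forward pass O = Softmax(Q K^T) V, used here to recompute P and for checks."""
    return matmul(softmax_rows(matmul(q, transpose(k))), v)

def attention_backward(q, k, v, d_o):
    p = softmax_rows(matmul(q, transpose(k)))
    d_p = matmul(d_o, transpose(v))                       # step 1: dP = dO V^T
    d_s = []
    for p_row, dp_row in zip(p, d_p):
        r = sum(a * b for a, b in zip(p_row, dp_row))     # rowsum(P ⊙ dP) for this row
        d_s.append([a * (b - r) for a, b in zip(p_row, dp_row)])  # step 2
    d_q = matmul(d_s, k)                                  # step 3: dQ = dS K
    d_k = matmul(transpose(d_s), q)                       #         dK = dS^T Q
    d_v = matmul(transpose(p), d_o)                       #         dV = P^T dO
    return d_q, d_k, d_v
```

One way to gain confidence in such a sketch is to compare an entry of dQ against a central finite difference of the scalar sum(O ⊙ dO) with respect to the same entry of Q.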

[0191] The ability to extract information from long texts is a crucial performance indicator for large models. Therefore, with the development of large models, the context length keeps increasing. Both the sequence length of the dataset used in the training phase and the length of the prompt in the full inference phase have reached hundreds of thousands or even millions of tokens. Compared to the size of the input / output matrices Q, K, V, O, which is $O(Nd)$, the space complexity of the intermediate matrices S and P has a quadratic relationship with the sequence length, namely $O(N^{2})$. Therefore, these calculations can lead to extremely high memory peaks during algorithm execution, and the overall efficiency of the algorithm will be greatly reduced due to data movement.

[0192] The Softmax operation in the Attention Mechanism requires normalization of the row vectors of the Attention matrix S. This strong coupling necessitates the explicit generation of an entire row of S. Flash Attention technology decouples this operation, breaking down the overall Attention Mechanism computation into several sub-blocks. It utilizes the stored row maximum value and exponential row sum to achieve online incremental updates, avoiding the explicit generation of intermediate matrices and significantly reducing memory usage spikes and data movement. Specifically, consider two vectors x and y as an example:

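[0193] $\mathrm{Softmax}([x, y]) = \dfrac{[e^{m(x) - m} f(x), \; e^{m(y) - m} f(y)]}{e^{m(x) - m} l(x) + e^{m(y) - m} l(y)}$; where $m(x) = \max_i x_i$, $m(y) = \max_i y_i$, $m = \max(m(x), m(y))$, $f(x) = [e^{x_1 - m(x)}, e^{x_2 - m(x)}, \ldots]$ and $l(x) = \sum_i e^{x_i - m(x)}$, with $f(y)$ and $l(y)$ defined in the same way;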
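The two-vector case can be checked with a short sketch. It assumes each vector keeps only its own maximum and exponential sum, and that the two partial results are rescaled to the shared maximum m = max(m(x), m(y)) before being combined. The function names are hypothetical.

```python
import math

def softmax_stats(x):
    """Return the local maximum m(x), the shifted exponentials f(x), and their sum l(x)."""
    m = max(x)
    f = [math.exp(xi - m) for xi in x]
    return m, f, sum(f)

def merged_softmax(x, y):
    """Softmax of the concatenation [x, y], built only from per-vector statistics."""
    mx, fx, lx = softmax_stats(x)
    my, fy, ly = softmax_stats(y)
    m = max(mx, my)
    ax = math.exp(mx - m)          # rescales x's exponentials to the shared maximum
    ay = math.exp(my - m)          # rescales y's exponentials to the shared maximum
    l = ax * lx + ay * ly          # merged exponential sum
    return [ax * value / l for value in fx] + [ay * value / l for value in fy]

def direct_softmax(z):
    """Reference: Softmax computed over the full vector at once."""
    _, f, l = softmax_stats(z)
    return [value / l for value in f]

x, y = [1.0, 2.0, 3.0], [0.5, 4.0]
merged = merged_softmax(x, y)
reference = direct_softmax(x + y)
```

The merged result agrees with the direct computation up to floating-point rounding, which is what allows the row of S to be processed one block at a time.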

[0194] Therefore, by tracking the maximum row value m and the exponential row sum l during the algorithm process, the overall calculation result can be updated online incrementally using the Attention Mechanism results of each local area. Taking the forward calculation of two adjacent blocks as an example, see Figure 6B:

[0195] As shown in Figure 6B, the Flash Attention in the forward phase achieves online incremental updates by tracking variables m and l in the computation of each sub-matrix block. Therefore, it is not necessary to generate the overall intermediate matrices S and P; the local intermediate matrices can be used directly, ensuring that the peak memory usage of the algorithm is O(Nd). For the reverse phase, the Flash Attention Gradient process requires calculating the Softmax gradient: dS = P⊙(dP - rowsum(P⊙dP));

[0196] This requires summing each entire row of P⊙dP, and the space complexity of both P and dP is O(N^2). To reduce overall memory peaks, Flash Attention Gradient uses the following identity: rowsum(P⊙dP)=rowsum(O⊙dO);
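The identity can be checked numerically. The sketch below assumes O = P×V and dP = dO×V^T, as in the MM1/MM2 definitions used elsewhere in this description, and omits any scaling factor; the matrix sizes and random inputs are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 4
Q, K, V, dO = (rng.standard_normal((N, d)) for _ in range(4))

S = Q @ K.T
P = np.exp(S - S.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)     # row-wise softmax
O = P @ V
dP = dO @ V.T

lhs = (P * dP).sum(axis=1)   # needs the N x N matrices P and dP
rhs = (O * dO).sum(axis=1)   # needs only the N x d matrices O and dO
```

Both sides are length-N vectors; the right-hand side never touches an N×N intermediate.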

[0197] Therefore, the overall Flash Attention Gradient process can also be divided into blocks, with the intermediate variables obtained through recomputation. For the inputs Q_i, O_i, dO_i, K_j, V_j: 1. dP_ij = dO_i × (V_j)^T; 2. S_ij = Q_i × (K_j)^T; 3. P_ij = simpleSoftmax(S_ij); 4. dS_ij = P_ij ⊙ (dP_ij - rowsum(O_i ⊙ dO_i)); 5. dQ_i ← dQ_i + dS_ij × K_j; 6. dK_j ← dK_j + (dS_ij)^T × Q_i; 7. dV_j ← dV_j + (P_ij)^T × dO_i;

[0198] In the third step, simpleSoftmax(·) denotes a fast calculation using the row maximum m and the exponential row sum l obtained from the forward pass: P_ij = exp(S_ij - m_i) / l_i;

[0199] In fact, rowsum(O_i ⊙ dO_i) in step four is simply an N-dimensional vector overall, and this calculation can be performed directly before the algorithm begins. Therefore, the input to each sub-block of FAG essentially need not include O_i.
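The block-wise gradient procedure described above can be sketched in NumPy as follows. This is an illustrative reading, not the operator implementation of this application: it assumes simpleSoftmax means exp(S - m) / l with the row statistics saved from the forward pass, omits any scaling factor, and uses the hypothetical names `fag_blocked`, `bq`, and `bk` for the function and block sizes.

```python
import numpy as np

def fag_blocked(Q, K, V, dO, O, m, l, bq, bk):
    """Block-wise attention backward using the saved row max m and row sum l."""
    dQ, dK, dV = np.zeros_like(Q), np.zeros_like(K), np.zeros_like(V)
    D = (O * dO).sum(axis=1)                   # rowsum(O ⊙ dO), computed once up front
    for j in range(0, K.shape[0], bk):         # one K_j, V_j pair at a time
        kj = slice(j, j + bk)
        for i in range(0, Q.shape[0], bq):     # paired with every Q_i, dO_i
            qi = slice(i, i + bq)
            dP = dO[qi] @ V[kj].T
            S = Q[qi] @ K[kj].T
            P = np.exp(S - m[qi, None]) / l[qi, None]   # assumed simpleSoftmax
            dS = P * (dP - D[qi, None])
            dQ[qi] += dS @ K[kj]
            dK[kj] += dS.T @ Q[qi]
            dV[kj] += P.T @ dO[qi]
    return dQ, dK, dV

rng = np.random.default_rng(1)
N, d = 8, 4
Q, K, V, dO = (rng.standard_normal((N, d)) for _ in range(4))

# Forward pass on the full matrices, only to obtain O, m, l and a reference.
S = Q @ K.T
m = S.max(axis=1)
E = np.exp(S - m[:, None])
l = E.sum(axis=1)
P = E / l[:, None]
O = P @ V

# Reference gradients from the full N x N matrices.
dP_full = dO @ V.T
dS_full = P * (dP_full - (P * dP_full).sum(axis=1, keepdims=True))
ref = (dS_full @ K, dS_full.T @ Q, P.T @ dO)

out = fag_blocked(Q, K, V, dO, O, m, l, bq=2, bk=4)
```

Because m and l are global results of the forward pass, each basic block is independent of the others, and the three outputs are plain accumulations.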

[0200] The existing implementation schemes for the FA and FAG operators are the same on GPUs and NPUs, using row-by-row and column-by-column computation, respectively. In the forward pass, the computation logic of the FA operator is shown in Figure 6C. (The rectangular blocks in this and subsequent figures represent blocks of the attention matrix S, where each sub-block, which can be called a basic block, corresponds to a local Attention Mechanism computation. Different sub-blocks within the same row need to track the row maximum value and the exponential row sum for online incremental updates. The ultimate goal is to complete the Attention Mechanism computation for all sub-blocks in the figure, performing online incremental updates along the way.)

[0201] As shown in Figure 6C, each FA operator calculates an entire row block. That is, the same Q is sequentially used to perform Attention Mechanism calculations with all K and V and is incrementally updated online. During this process, only the update of the output O row block is involved.

[0202] In the reverse phase, the computational logic of the FAG operator can be shown in Figure 6D:

[0203] The three outputs of the FAG operator, dQ, dK, and dV, are all calculated and updated cumulatively in blocks. Furthermore, the global results m and l of the forward FA operator can be used for simpleSoftmax calculation, thus avoiding inter-block variable updates. Considering output transport, since each column updates the same row block of dK and dV, the current scheme calculates column-by-column. That is, the same pair of K and V is calculated using the reverse Attention Mechanism with all Q and dO, and the corresponding output is updated. This process only involves updating the row block of dK and dV, and updating all row blocks of dQ.

[0204] Existing technologies are proposed based on reducing data movement, without taking into account the architectural characteristics of the NPU for specific adaptation, and without making full use of L1 Cache and L2 Cache.

[0205] For the forward FA operator, the computation can be divided into four parts: MM1 (S = Q×K^T); VEC1 (S→P); MM2 (O = P×V); VEC2 (online incremental update of O). For the reverse FAG operator, the computation can be divided into three parts: MM1 (dP = dO×V^T), MM2 (S = Q×K^T); VEC1 (S→P, dP→dS); MM3 (dQ = dS×K), MM4 (dK = (dS)^T×Q), MM5 (dV = P^T×dO). Since both the FA and FAG operators alternate between matrix multiplication and non-matrix-multiplication computation, data needs to be repeatedly transferred between the Cube unit side and the Vector unit side.

[0206] In other words, existing implementations of the FA and FAG operators do not fully exploit L1 reuse. In each stage of the forward computation, one row block Q_i of Q and several row blocks K_j, K_{j+1}, …, K_{j+b-1} of K are first moved from the L2 Cache to the L1 Cache (MTE2_cube). The pairs (Q_i, K_j), (Q_i, K_{j+1}), …, (Q_i, K_{j+b-1}) are then moved in turn into L0 (MTE1) to compute MM1, yielding S_{i,j}, S_{i,j+1}, …, S_{i,j+b-1}. To fully utilize computing power, the result of each MM1 block is transferred from L0 to the L2 Cache (Fixp) as soon as it is computed, and then moved from the L2 Cache to the UB (MTE2_vec) for the VEC1 computation, yielding P_{i,j}, P_{i,j+1}, …, P_{i,j+b-1}. These are passed back to the L2 Cache (MTE3_vec) and, together with the corresponding V row blocks V_j, V_{j+1}, …, V_{j+b-1}, moved from the L2 Cache to the L1 Cache for the MM2 computation, yielding O_{i,j}, O_{i,j+1}, …, O_{i,j+b-1}. Finally, the online incremental update is performed in the vector unit (this step is generally less time-consuming). It can be observed that in the existing technique only Q is reused in the L1 cache (i.e., L1 reuse). This is because each computation over the attention matrix covers exactly one row block, and during this process the Q row block is never moved out of the L1 cache. Each row block of Q can therefore complete all of its corresponding computations with a single move. Each row block of K and V, however, completes only one corresponding computation per move, with no L1 reuse.

[0207] The reverse FAG operator has the same problem. In each stage of the computation, one row block pair K_j, V_j of K and V and several row blocks Q_i, dO_i, Q_{i+1}, dO_{i+1}, …, Q_{i+b-1}, dO_{i+b-1} are first moved from the L2 Cache to the L1 Cache (MTE2_cube). The groups (dO_i, V_j, Q_i, K_j), (dO_{i+1}, V_j, Q_{i+1}, K_j), …, (dO_{i+b-1}, V_j, Q_{i+b-1}, K_j) are then moved in turn into L0 (MTE1) to compute MM1 and MM2, yielding dP_{i,j}, S_{i,j}, dP_{i+1,j}, S_{i+1,j}, …, dP_{i+b-1,j}, S_{i+b-1,j}. As in the forward FA operator, the intermediate variables must be transmitted to the UB in real time for vector-unit-side computation, yielding P_{i,j}, dS_{i,j}, P_{i+1,j}, dS_{i+1,j}, …, P_{i+b-1,j}, dS_{i+b-1,j}. These are then transmitted back to the L1 Cache in real time for cube-unit-side computation, updating the corresponding dQ, dK, and dV row blocks. Since the operator proceeds column by column, K and V do achieve L1 reuse, because a single transfer from the L2 Cache to the L1 Cache serves all of their corresponding operations, whereas Q and dO achieve no L1 reuse.

[0208] For the forward FA operator, the MM1 and MM2 operations are performed in the cube unit, while VEC1 and VEC2 must be executed in the vector unit. The intermediate variables must therefore enter the L2 cache after MM1 and then be passed to the UB for vector-unit computation, and the output of the vector unit must be passed back to the L2 cache and then to the L1 cache for cube-unit computation. Because the L2 cache on the NPU has a large capacity, computing only one row block at a time does not make full use of it: the intermediate variables then comprise the results of just one row block.

[0209] To address the aforementioned problems, embodiments of this application provide a data processing method. The data processing method of this application embodiment will be described in detail below with reference to the accompanying drawings.

[0210] Referring to Figure 7, which is a flowchart of a data processing method provided in an embodiment of this application, as shown in Figure 7, the data processing method provided in an embodiment of this application may include steps 701 to 703, which will be described in detail below.

[0211] 701. Move first data from memory to second memory, the first data including data required by the computing unit during the same batch of fast attention FA operations, the first data including a first Q vector, a second Q vector and a first KV vector;

[0212] In one possible implementation, the memory is dedicated memory (GM), and the first cache is a level 1 cache (L1).

[0213] In this context, "the same batch" can be understood as calculations performed in the same phase.

[0214] Referring to the right side of Figure 8, the first Q vector and the second Q vector can correspond to the Q vectors of basic blocks located in the same column but on different rows. For example, the first Q vector and the second Q vector can correspond to the two basic blocks labeled 1 and 2 in the same column.

[0215] The key to fully exploiting the Level 1 cache (L1) is that a single transfer of input data should serve as many of the computational tasks involving that data as possible. Therefore, in the forward FA operator, the computation of a stage can be expanded from one row block to several row blocks. Each input-matrix transfer stage moves several Q, K, V row blocks at once, instead of one Q row block and several K, V row blocks, and processes all of their corresponding computational tasks without needing to transfer the input matrices again in the meantime. The specific process is shown in Figure 8 (the left figure shows the solution of the prior art, and the right figure shows the solution of the embodiment of this application):

[0216] As can be observed from the right side of Figure 8, the computational logic of this embodiment changes from a double loop (outer loop over rows, inner loop over columns) to a quadruple loop (first level over groups of row blocks, second level over groups of column blocks, third level over the rows within the current group, and fourth level over the columns within the current group). A single transfer of a Q row block still serves all of its computations, while a single transfer of a K, V row block now serves multiple computations instead of one. This improves the L1 reuse of K, V and thus reduces transfer time.
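The effect of the loop restructuring on transfers into L1 can be illustrated with a simple counting model. The sketch assumes that a group of `a` Q row blocks stays in L1 while all column groups are processed and that each group of `b` K, V row blocks is loaded once per row group; these assumptions, and the function names, are illustrative and are not stated in this form in the application.

```python
def count_loads_row_by_row(n_q, n_kv):
    """Double loop: each Q row block is loaded once, K/V blocks once per Q row block."""
    return n_q, n_q * n_kv

def count_loads_grouped(n_q, n_kv, a, b):
    """Quadruple loop: returns (Q loads, K/V loads, basic blocks computed)."""
    q_loads = kv_loads = blocks = 0
    for i0 in range(0, n_q, a):                 # level 1: groups of a row blocks
        rows = range(i0, min(i0 + a, n_q))
        q_loads += len(rows)                    # this Q group enters L1 once
        for j0 in range(0, n_kv, b):            # level 2: groups of b column blocks
            cols = range(j0, min(j0 + b, n_kv))
            kv_loads += len(cols)               # this K/V group enters L1 once
            for i in rows:                      # level 3: rows within the group
                for j in cols:                  # level 4: columns within the group
                    blocks += 1                 # basic block (i, j) uses data already in L1
    return q_loads, kv_loads, blocks

baseline = count_loads_row_by_row(8, 8)
grouped = count_loads_grouped(8, 8, a=4, b=2)
```

Under this model every basic block is still computed exactly once, while the number of K, V loads shrinks by roughly a factor of `a`.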

[0217] This allows for full utilization of the L1 cache, improving the reuse rate of the input matrix and significantly reducing overall data movement.

[0218] In this embodiment of the application, when performing fast attention operations, the data retrieved from memory in each batch includes different Q vectors corresponding to the same KV vector. This allows the Q vectors to be reused during fast attention operations (the same Q vector can be used for different KV vectors), thereby reducing the amount of data transfer.

[0219] 702. The fast attention (FA) operation is performed on the first Q vector and the first KV vector in the second memory through the computing unit to obtain the first operation result;

[0220] In one possible implementation, the first data further includes a second KV vector. The fast attention (FA) operation can be performed on the first Q vector, the second KV vector, and the first operation result in the first cache by the computing unit to obtain a third operation result. The third operation result includes an update result of the first operation result.

[0221] 703. The fast attention (FA) operation is performed on the second Q vector and the first KV vector in the first cache by the computing unit to obtain the second operation result.

[0222] In one possible implementation, the first data further includes a third KV vector, and the fast attention FA operation can be performed on the second Q vector, the third KV vector, and the second operation result in the first cache by the computing unit to obtain a fourth operation result; wherein the fourth operation result includes an update result of the second operation result.

[0223] In one possible implementation, after the computing unit completes the fast attention operation on the first data, the first KV vector is deleted from the first cache.

[0224] In one possible implementation, the size of the first data is smaller than the capacity of the first memory.

[0225] In one possible implementation, the intermediate result obtained from performing a fast attention (FA) operation on the first data is stored in a second cache (L2), and the size of the intermediate result is smaller than the capacity of the second cache (L2).

[0226] For example, the overall calculation logic of FA can be shown in Figure 9A:

[0227] Assume Q is partitioned into N_Q row blocks and K, V are partitioned into N_KV row blocks; the computational logic can be expressed in pseudocode:
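A minimal NumPy sketch of such a grouped forward computation is given here under stated assumptions; it is not the pseudocode of this application. It assumes square-free scaling (no softmax scale factor), sequence lengths divisible by the block sizes, and hypothetical names `fa_forward_grouped`, `bm`, `bn`, `a`, and `b`. The comments map each line to the MM1 / VEC1 / MM2 / VEC2 parts named earlier.

```python
import numpy as np

def fa_forward_grouped(Q, K, V, bm, bn, a, b):
    """Forward attention over basic blocks, visited in groups of a x b blocks."""
    N = Q.shape[0]
    O = np.zeros((N, V.shape[1]))
    m = np.full(N, -np.inf)                  # running row maximum
    l = np.zeros(N)                          # running exponential row sum
    nq, nkv = N // bm, K.shape[0] // bn
    for i0 in range(0, nq, a):               # level 1: groups of Q row blocks
        for j0 in range(0, nkv, b):          # level 2: groups of K/V row blocks
            for i in range(i0, min(i0 + a, nq)):          # level 3
                r = slice(i * bm, (i + 1) * bm)
                for j in range(j0, min(j0 + b, nkv)):     # level 4
                    c = slice(j * bn, (j + 1) * bn)
                    S = Q[r] @ K[c].T                     # MM1
                    m_new = np.maximum(m[r], S.max(axis=1))
                    P = np.exp(S - m_new[:, None])        # VEC1 (unnormalized)
                    scale = np.exp(m[r] - m_new)          # rescale earlier partial results
                    l[r] = scale * l[r] + P.sum(axis=1)
                    O[r] = scale[:, None] * O[r] + P @ V[c]   # MM2 + VEC2
                    m[r] = m_new
    return O / l[:, None], m, l

rng = np.random.default_rng(2)
N, d = 8, 4
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

O_blocked, m_out, l_out = fa_forward_grouped(Q, K, V, bm=2, bn=2, a=2, b=2)

S = Q @ K.T
P = np.exp(S - S.max(axis=1, keepdims=True))
O_ref = (P / P.sum(axis=1, keepdims=True)) @ V
```

Note that the partial O, m, and l of all `a` row blocks in a group must be kept between column groups, which is the retained intermediate state the L2 constraint accounts for.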

[0228] The existing technique needs to transfer N_Q Q basic blocks and N_Q·N_KV K, V basic blocks, whereas the method of this embodiment needs to transfer only N_Q Q basic blocks and fewer K, V basic blocks. This improves the L1 reuse of K and V and reduces the overall MTE2 transfer volume.

[0229] The amount of data moved in each batch (i.e., the size of the first data) involves two parameters: the number of Q row blocks *a* and the number of K, V row blocks *b*. Considering the characteristics of the NPU architecture, the values of *a* and *b* can be determined by modeling with the sizes of the L1 Cache and L2 Cache. Let the size of a basic Q row block be (baseM, baseK), the size of a basic K or V row block be (baseN, baseK), and the sizes of the L1 Cache and L2 Cache be M1 and M2 (bytes), respectively. The constraints can then be expressed as:

[0230] L1 Cache constraints: 2*a*baseM*baseK+4*b*baseN*baseK <M1-T1

[0231] L2 Cache constraints: 2*a*baseM*baseN <M2-T2

[0232] Where T1 and T2 represent the additional space occupied in the L1 Cache and L2 Cache during the process. The L1 Cache space accounted for is the space required for the Q, K, and V row blocks moved in a single transfer. The L2 Cache space accounted for arises from the need to retain the intermediate state of the different O row blocks during the process, so that subsequent online incremental updates can reach the final result. The overall implementation flow of the algorithm can be shown in Figure 9B.
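The two inequalities can be turned into a small feasibility check for candidate (a, b) pairs. The block sizes and cache capacities below are hypothetical values chosen only for illustration, and the function name `fa_fits` is likewise an assumption; the application does not specify how a and b are finally selected among the feasible pairs.

```python
def fa_fits(a, b, base_m, base_n, base_k, m1, m2, t1=0, t2=0):
    """Check the L1 and L2 constraints from the text for one (a, b) candidate."""
    l1_bytes = 2 * a * base_m * base_k + 4 * b * base_n * base_k
    l2_bytes = 2 * a * base_m * base_n
    return l1_bytes < m1 - t1 and l2_bytes < m2 - t2

# Hypothetical sizes, not taken from any real chip.
BASE_M = BASE_N = BASE_K = 128
M1 = 512 * 1024          # assumed L1 capacity in bytes
M2 = 32 * 1024 * 1024    # assumed L2 capacity in bytes

feasible = [(a, b) for a in range(1, 17) for b in range(1, 17)
            if fa_fits(a, b, BASE_M, BASE_N, BASE_K, M1, M2)]
```

With these assumed sizes the L1 inequality reduces to a + 2b < 16, so (4, 2) is feasible while (8, 4) is not.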

[0233] In one possible implementation, during the fast attention gradient computation, second data can be moved from the first memory to the second memory. The second data includes data required by the computing unit during the same batch of fast attention gradient (FAG) computations, and includes a fourth Q vector, a first gradient, a fourth KV vector, and a fifth KV vector. The computing unit performs the fast attention gradient (FAG) computation on the fourth Q vector, the first gradient, and the fourth KV vector in the second memory to obtain a fifth computation result. Furthermore, the computing unit performs the fast attention gradient (FAG) computation on the fourth Q vector, the first gradient, and the fifth KV vector in the first cache to obtain a sixth computation result.

[0234] In one possible implementation, the size of the second data is smaller than the capacity of the first memory.

[0235] For the reverse FAG operator, a similar approach can be used to improve the L1 reuse of Q and dO, as shown in Figure 9C (the left figure shows the prior art, and the right figure shows the solution of the embodiment of this application).

[0236] Similar to the forward FA operator logic, a single transport of a K,V row block still corresponds to all its computations, while a single transport of a Q,dO row block corresponds to multiple computations instead of a single one. This improves the L1 reuse of Q,dO and reduces transport time.

[0237] To fully utilize the chip's L1 and L2 caches, data movement is modeled mathematically, and the granularity of each movement stage is determined from the chip's parameters. For the forward FA operator, the model considers both the L1 cache space required for a single movement and the L2 cache space required for the O row blocks that must be retained for online incremental updates during computation, and constructs constraint inequalities to determine the implementation granularity. For the reverse FAG operator, no L2 cache constraint is needed; only the L1 cache space required for a single movement is considered.

[0238] For example, the overall computational logic diagram of FAG is shown in Figure 9D:

[0239] Assume Q is partitioned into N_Q row blocks and K, V are partitioned into N_KV row blocks; the computational logic can be expressed in pseudocode:

[0240] The existing technique needs to transfer N_Q·N_KV Q, dO basic blocks and N_KV K, V basic blocks, whereas the method of this embodiment needs to transfer only fewer Q, dO basic blocks and N_KV K, V basic blocks. The method involves two parameters: the number of Q row blocks moved at one time, *a*, and the number of K, V row blocks moved at one time, *b*. Considering the characteristics of the NPU architecture, the values of *a* and *b* can be determined by modeling with the size of the L1 cache. Let the size of a basic Q row block be (baseM, baseK), the size of a basic K or V row block be (baseN, baseK), and the size of the L1 cache be M1 (bytes). The constraint can then be expressed as:

[0241] L1 Cache constraints: 4*a*baseM*baseK+4*b*baseN*baseK <M1-T1

[0242] T1 represents the additional space occupied in the L1 Cache during the process. The L1 Cache space accounted for is the space required for the Q, dO, K, and V row blocks moved in a single transfer. In the reverse phase, no online incremental update is required, so there is no need to retain intermediate results whose computation is not yet complete. Therefore, this scheme places no strong constraint on the size of the L2 Cache.
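Because only the L1 inequality applies in the reverse phase, the largest admissible a for a given b follows directly from it. The sketch below solves the strict inequality in integers; the function names and the numeric sizes are hypothetical and serve only as an illustration.

```python
def fag_l1_bytes(a, b, base_m, base_n, base_k):
    """Left-hand side of the FAG L1 constraint from the text."""
    return 4 * a * base_m * base_k + 4 * b * base_n * base_k

def max_a_for_fag(b, base_m, base_n, base_k, m1, t1=0):
    """Largest integer a with 4*a*baseM*baseK + 4*b*baseN*baseK < M1 - T1 (0 if none)."""
    budget = m1 - t1 - 4 * b * base_n * base_k
    per_a = 4 * base_m * base_k
    # (budget - 1) // per_a enforces the strict "<" for integer byte counts.
    return max((budget - 1) // per_a, 0)

# Hypothetical sizes: 128 x 128 basic blocks and a 512 KiB L1 cache.
best_a = max_a_for_fag(b=1, base_m=128, base_n=128, base_k=128, m1=512 * 1024)
```

With these assumed numbers `best_a` is 6: a = 7 would fill the cache exactly, which the strict inequality excludes.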

[0243] Referring to Figure 10, which is a schematic diagram of the structure of a data processing apparatus provided in an embodiment of this application, as shown in Figure 10, the data processing apparatus 1000 provided in this embodiment includes:

[0244] Data management module 1001 is used to move first data from memory to second memory. The first data includes data required by the computing unit when performing fast attention FA operations in the same batch. The first data includes a first Q vector, a second Q vector, and a first KV vector.

[0245] The computing module 1002 of the computing unit is used to perform the fast attention FA operation on the first Q vector and the first KV vector in the second memory to obtain a first operation result; and to perform the fast attention FA operation on the second Q vector and the first KV vector in the first cache to obtain a second operation result.

[0246] In one possible implementation, the first data further includes a second KV vector, and the calculation module is further configured to perform the fast attention FA operation on the first Q vector, the second KV vector and the first operation result in the first cache to obtain a third operation result; wherein the third operation result includes an update result of the first operation result.

[0247] In one possible implementation, the first data further includes a third KV vector, and the calculation module is further configured to perform the fast attention FA operation on the second Q vector, the third KV vector, and the second operation result in the first cache to obtain a fourth operation result; wherein the fourth operation result includes an update result of the second operation result.

[0248] In one possible implementation, the memory is dedicated memory (GM), and the first cache is a level 1 cache (L1).

[0249] In one possible implementation, the data management module 1001 is further configured to delete the first KV vector from the first cache after the computing unit completes the fast attention operation on the first data.

[0250] In one possible implementation, the size of the first data is smaller than the capacity of the first memory.

[0251] In one possible implementation, the intermediate result obtained from performing a fast attention (FA) operation on the first data is stored in a second cache (L2), and the size of the intermediate result is smaller than the capacity of the second cache (L2).

[0252] In one possible implementation, the data management module 1001 is further configured to:

[0253] The second data is moved from the memory to the second memory. The second data includes the data required by the computing unit when performing fast attention gradient (FAG) calculations in the same batch, and includes the fourth Q vector, the first gradient, the fourth KV vector, and the fifth KV vector.

[0254] The calculation module 1002 is further configured to perform the fast attention gradient (FAG) operation on the fourth Q vector, the first gradient, and the fourth KV vector in the second memory to obtain a fifth operation result;

[0255] Furthermore, the Fast Attention Gradient (FAG) operation is performed on the fourth Q vector, the first gradient, and the fifth KV vector in the first cache to obtain the sixth operation result.

[0256] In one possible implementation, the size of the second data is smaller than the capacity of the first memory.

[0257] The following describes a terminal device provided in an embodiment of this application. Please refer to Figure 11, which is a structural schematic diagram of a terminal device provided in an embodiment of this application. The terminal device 1100 can specifically be a virtual reality (VR) device, a mobile phone, a tablet, a laptop computer, a smart wearable device, etc., and is not limited here. Specifically, the terminal device 1100 includes: a receiver 1101, a transmitter 1102, a processor 1103, and a memory 1104 (the number of processors 1103 in the terminal device 1100 can be one or more; Figure 11 shows one processor as an example). The processor 1103 may include an application processor 11031 and a communication processor 11032. In some embodiments of this application, the receiver 1101, transmitter 1102, processor 1103, and memory 1104 can be connected via a bus or other means.

[0258] Memory 1104 may include read-only memory and random access memory, and provides instructions and data to processor 1103. A portion of memory 1104 may also include non-volatile random access memory (NVRAM). Memory 1104 stores processor and operation instructions, executable modules, or data structures, or subsets thereof, or extended sets thereof, wherein the operation instructions may include various operation instructions for implementing various operations.

[0259] Processor 1103 controls the operation of the execution device. In specific applications, the various components of the execution device are coupled together through a bus system, which may include not only the data bus, but also power buses, control buses, and status signal buses. However, for clarity, all buses are referred to as the bus system in the diagram.

[0260] The methods disclosed in the embodiments of this application can be applied to or implemented by the processor 1103. The processor 1103 can be an integrated circuit chip with signal processing capabilities. During implementation, each step of the above method can be completed by the integrated logic circuitry in the hardware of the processor 1103 or by instructions in software form. The processor 1103 can be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. The processor 1103 can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application can be directly embodied in the execution of a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software module can reside in a mature storage medium in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. This storage medium is located in memory 1104. Processor 1103 reads the information in memory 1104 and, in conjunction with its hardware, completes the steps involved in the model training or model inference process described above.

[0261] Receiver 1101 can be used to receive input digital or character information, and to generate signal inputs related to the settings and function control of the execution device. Transmitter 1102 can be used to output digital or character information through the first interface; transmitter 1102 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group; transmitter 1102 may also include a display device such as a display screen.

[0262] This application embodiment also provides a server. Referring to Figure 12, Figure 12 is a schematic diagram of a server structure provided in this application embodiment. The server 1200 can vary significantly due to different configurations or performance. It may include one or more central processing units (CPUs) 1212 (e.g., one or more processors) and a memory 1232, and one or more storage media 1230 (e.g., one or more mass storage devices) for storing application programs 1242 or data 1244. The memory 1232 and storage media 1230 can be temporary or persistent storage. The program stored in the storage media 1230 may include one or more modules (not shown in the figure), each module may include a series of instruction operations on the server. Furthermore, the CPU 1212 may be configured to communicate with the storage media 1230 and execute the series of instruction operations in the storage media 1230 on the server 1200.

[0263] Server 1200 may also include one or more power supplies 1226, one or more wired or wireless network interfaces 1250, one or more input / output interfaces 1258; or, one or more operating systems 1241, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.

[0264] In this embodiment, the central processing unit 1212 is used to perform actions related to model training or model inference in the above embodiments.

[0265] This application also provides a computer program product that, when run on a computer, causes the computer to perform steps as performed by the aforementioned execution device, or causes the computer to perform steps as performed by the aforementioned training device.

[0266] This application also provides a computer-readable storage medium storing a program for signal processing, which, when run on a computer, causes the computer to perform steps as performed by the aforementioned execution device, or causes the computer to perform steps as performed by the aforementioned training device.

[0267] The execution device, training device, or terminal device provided in this application embodiment can specifically be a chip. The chip includes a processing unit and a communication unit. The processing unit can be, for example, a processor, and the communication unit can be, for example, an input / output interface, pins, or circuits. The processing unit can execute computer execution instructions stored in the storage unit to cause the chip within the execution device to execute the data processing method described in the above embodiments, or to cause the chip within the training device to execute the data processing method described in the above embodiments. Optionally, the storage unit can be a storage unit within the chip, such as a register or cache. Alternatively, the storage unit can be a storage unit located outside the chip within the wireless access device, such as a read-only memory (ROM) or other types of static storage devices capable of storing static information and instructions, such as random access memory (RAM).

[0268] Specifically, please refer to Figure 13, which is a schematic diagram of a chip structure provided in an embodiment of this application. The chip can be represented as a neural network processor (NPU) 1300. The NPU 1300 is mounted as a coprocessor on the host CPU, and tasks are assigned by the host CPU. The core part of the NPU is the arithmetic circuit 1303, which is controlled by the controller 1304 to extract matrix data from the memory and perform multiplication operations.

[0269] In some implementations, the arithmetic circuit 1303 internally includes multiple processing engines (PEs). In some implementations, the arithmetic circuit 1303 is a two-dimensional systolic array. The arithmetic circuit 1303 can also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1303 is a general-purpose matrix processor.

[0270] For example, suppose we have an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit retrieves the corresponding data of matrix B from the weight memory 1302 and caches it in each PE of the arithmetic circuit. The arithmetic circuit retrieves the data of matrix A from the input memory 1301 and performs matrix operations with matrix B. The partial result or the final result of the obtained matrix is stored in the accumulator 1308.
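The role of the accumulator can be illustrated with a tiled matrix multiplication in NumPy. This is only a functional analogy under assumptions: the tiling along the inner dimension and the name `matmul_with_accumulator` are hypothetical, and the sketch does not model the PE array or the memories of the circuit.

```python
import numpy as np

def matmul_with_accumulator(A, B, tile_k):
    """Compute C = A x B by accumulating partial products over tiles of the inner dimension."""
    acc = np.zeros((A.shape[0], B.shape[1]))      # plays the role of the accumulator
    for k0 in range(0, A.shape[1], tile_k):
        a_tile = A[:, k0:k0 + tile_k]             # slice of the input matrix A
        b_tile = B[k0:k0 + tile_k, :]             # matching slice of the weight matrix B
        acc += a_tile @ b_tile                    # partial result added to the running sum
    return acc

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((6, 3))
C = matmul_with_accumulator(A, B, tile_k=2)
```

After the last tile, the accumulated value equals the full product A×B.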

[0271] Unified memory 1306 is used to store input and output data. Weight data is directly transferred to weight memory 1302 via Direct Memory Access Controller (DMAC) 1305. Input data is also transferred to unified memory 1306 via DMAC.

[0272] BIU stands for Bus Interface Unit, which is used for interaction between the AXI bus and the DMAC and the Instruction Fetch Buffer (IFB) 1309.

[0273] The Bus Interface Unit (BIU) 1310 is used by the instruction fetch buffer 1309 to fetch instructions from external memory, and also by the direct memory access controller 1305 to fetch the original data of the input matrix A or the weight matrix B from external memory.

[0274] The DMAC is mainly used to move input data from the external memory (DDR) to unified memory 1306, to move weight data to weight memory 1302, or to move input data to input memory 1301.

[0275] The vector computation unit 1307 includes multiple processing units that further process the output of the computation circuit 1303 when needed, such as vector multiplication, vector addition, exponential operations, logarithmic operations, size comparisons, etc. It is mainly used for computation in non-convolutional / fully connected layers of neural networks, such as Batch Normalization, pixel-level summation, and upsampling of feature planes.

[0276] In some implementations, the vector computation unit 1307 can store the processed output vector in the unified memory 1306. For example, the vector computation unit 1307 can apply a linear function, or a nonlinear function, to the output of the computation circuit 1303, such as performing linear interpolation on the feature planes extracted by the convolutional layer, or, for example, accumulating a vector of values to generate activation values. In some implementations, the vector computation unit 1307 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vector can be used as an activation input to the computation circuit 1303, for example, for use in subsequent layers of the neural network.

[0277] The instruction fetch buffer 1309, connected to the controller 1304, is used to store the instructions used by the controller 1304.

[0278] Unified memory 1306, input memory 1301, weight memory 1302, and instruction fetch buffer 1309 are all on-chip memories. External memory is proprietary to this NPU hardware architecture.

[0279] The processor mentioned above can be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits used to control the execution of the above program.

[0280] It should also be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. In addition, in the device embodiment drawings provided in this application, the connection relationship between modules indicates that they have a communication connection, which can be implemented as one or more communication buses or signal lines.

[0281] Through the above description of the embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus necessary general-purpose hardware, or it can be implemented by special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memory, special-purpose components, etc. Generally, any function performed by a computer program can be easily implemented by corresponding hardware, and the specific hardware structure used to implement the same function can also be diverse, such as analog circuits, digital circuits, or special-purpose circuits. However, for this application, software program implementation is more often the preferred implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, mobile hard disk, ROM, RAM, magnetic disk, or optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, training equipment, or network device, etc.) to execute the methods described in the various embodiments of this application.

[0282] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product.

[0283] The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wired means (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a training device or data center, that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid-state drives (SSDs)).

Claims

1. A data processing method, characterized in that the method comprises: moving first data from a memory to a second memory, wherein the first data comprises data required by a computing unit to perform fast attention (FA) operations on a same batch, and the first data comprises a first Q vector, a second Q vector, and a first KV vector; performing, by the computing unit, the FA operation on the first Q vector and the first KV vector in the second memory to obtain a first operation result; and performing, by the computing unit, the FA operation on the second Q vector and the first KV vector in a first cache to obtain a second operation result.

2. The method according to claim 1, characterized in that the first data further comprises a second KV vector, and the method further comprises: performing, by the computing unit, the FA operation on the first Q vector, the second KV vector, and the first operation result in the first cache to obtain a third operation result, wherein the third operation result comprises an update result of the first operation result.

3. The method according to claim 1 or 2, characterized in that the first data further comprises a third KV vector, and the method further comprises: performing, by the computing unit, the FA operation on the second Q vector, the third KV vector, and the second operation result in the first cache to obtain a fourth operation result, wherein the fourth operation result comprises an update result of the second operation result.

4. The method according to any one of claims 1 to 3, characterized in that an intermediate result obtained by performing the FA operation on the first data is stored in a second cache (L2), and the size of the intermediate result is smaller than the capacity of the second cache (L2).

5. The method according to any one of claims 1 to 4, characterized in that the memory is dedicated memory (GM), and the first cache is a level 1 cache (L1).

6. The method according to any one of claims 1 to 5, characterized in that the method further comprises: after the computing unit completes the FA operation on the first data, deleting the first KV vector from the first cache.

7. The method according to any one of claims 1 to 6, characterized in that the size of the first data is smaller than the capacity of the first memory.

8. The method according to any one of claims 1 to 7, characterized in that the method further comprises: moving second data from the memory to the second memory, wherein the second data comprises data required by the computing unit to perform fast attention gradient (FAG) operations on a same batch, and the second data comprises a fourth Q vector, a first gradient, a fourth KV vector, and a fifth KV vector; performing, by the computing unit, the FAG operation on the fourth Q vector, the first gradient, and the fourth KV vector in the second memory to obtain a fifth operation result; and performing, by the computing unit, the FAG operation on the fourth Q vector, the first gradient, and the fifth KV vector in the first cache to obtain a sixth operation result.

9. The method according to claim 8, characterized in that the size of the second data is smaller than the capacity of the first memory.

10. A data processing apparatus, characterized in that the apparatus comprises: a data management module, configured to move first data from a memory to a second memory, wherein the first data comprises data required by a computing unit to perform fast attention (FA) operations on a same batch, and the first data comprises a first Q vector, a second Q vector, and a first KV vector; and a computing module of the computing unit, configured to perform the FA operation on the first Q vector and the first KV vector in the second memory to obtain a first operation result, and to perform the FA operation on the second Q vector and the first KV vector in a first cache to obtain a second operation result.

11. The apparatus according to claim 10, characterized in that the first data further comprises a second KV vector, and the computing module is further configured to perform the FA operation on the first Q vector, the second KV vector, and the first operation result in the first cache to obtain a third operation result, wherein the third operation result comprises an update result of the first operation result.

12. The apparatus according to claim 10 or 11, characterized in that the first data further comprises a third KV vector, and the computing module is further configured to perform the FA operation on the second Q vector, the third KV vector, and the second operation result in the first cache to obtain a fourth operation result, wherein the fourth operation result comprises an update result of the second operation result.

13. The apparatus according to any one of claims 10 to 12, characterized in that the memory is dedicated memory (GM), and the first cache is a level 1 cache (L1).

14. The apparatus according to any one of claims 10 to 13, characterized in that the data management module is further configured to delete the first KV vector from the first cache after the computing unit completes the FA operation on the first data.

15. The apparatus according to any one of claims 10 to 14, characterized in that the size of the first data is smaller than the capacity of the first memory.

16. The apparatus according to any one of claims 10 to 15, characterized in that an intermediate result obtained by performing the FA operation on the first data is stored in a second cache (L2), and the size of the intermediate result is smaller than the capacity of the second cache (L2).

17. The apparatus according to any one of claims 10 to 16, characterized in that the data management module is further configured to move second data from the memory to the second memory, wherein the second data comprises data required by the computing unit to perform fast attention gradient (FAG) operations on a same batch, and the second data comprises a fourth Q vector, a first gradient, a fourth KV vector, and a fifth KV vector; and the computing module is further configured to perform the FAG operation on the fourth Q vector, the first gradient, and the fourth KV vector in the second memory to obtain a fifth operation result, and to perform the FAG operation on the fourth Q vector, the first gradient, and the fifth KV vector in the first cache to obtain a sixth operation result.

18. The apparatus according to claim 17, characterized in that the size of the second data is smaller than the capacity of the first memory.

19. A computer storage medium, characterized in that the computer storage medium stores one or more instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the method according to any one of claims 1 to 9.

20. A computer program product, characterized in that the computer program product comprises computer-readable instructions that, when executed on a computer device, cause the computer device to perform the method according to any one of claims 1 to 9.

21. A system, comprising at least one processor and at least one memory, characterized in that the processor and the memory are connected via a communication bus and communicate with each other; the at least one memory is used to store code; and the at least one processor is used to execute the code to perform the method according to any one of claims 1 to 9.

22. A chip, comprising a processor, characterized in that the processor is used to support a data processing apparatus in implementing the method according to any one of claims 1 to 9.
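
Illustrative sketch (not part of the claims). The data flow recited in claims 1 to 3 can be pictured with a small pure-Python model, given here only as a hedged sketch under stated assumptions. It assumes that the "update result" of claims 2 and 3 corresponds to the usual running-softmax rescaling used by blockwise attention, and that each KV block, once resident on chip, is applied to every Q block before the next KV block is handled. The names `fa_step`, `finalize`, `attention_with_kv_reuse`, and `reference_attention`, and the `kv_fetches` counter, are hypothetical and do not come from the application; the memory hierarchy (GM, L1, L2) is not modeled.

```python
import math


def fa_step(q_rows, k_rows, v_rows, state=None):
    """Apply one KV block to one Q block and update the running softmax state.

    The state keeps, per Q row: (running max, running denominator,
    unnormalized output row). Passing a previous state models the
    "update result" of an earlier operation result (assumption).
    """
    scale = 1.0 / math.sqrt(len(q_rows[0]))
    if state is None:
        state = [(-math.inf, 0.0, [0.0] * len(v_rows[0])) for _ in q_rows]
    updated = []
    for q, (m_old, l_old, acc_old) in zip(q_rows, state):
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_rows]
        m_new = max(m_old, max(scores))
        corr = math.exp(m_old - m_new)  # rescales the earlier partial result
        p = [math.exp(s - m_new) for s in scores]
        l_new = corr * l_old + sum(p)
        acc_new = [corr * a + sum(pj * v[c] for pj, v in zip(p, v_rows))
                   for c, a in enumerate(acc_old)]
        updated.append((m_new, l_new, acc_new))
    return updated


def finalize(state):
    """Normalize the accumulated rows by the running denominator."""
    return [[a / l for a in acc] for (_, l, acc) in state]


def attention_with_kv_reuse(q_blocks, kv_blocks):
    """Hypothetical schedule: each KV block is fetched once, then used by all Q blocks."""
    kv_fetches = 0
    states = [None] * len(q_blocks)
    for k_rows, v_rows in kv_blocks:
        kv_fetches += 1  # modeled transfer of one KV block to on-chip storage
        for i, q_rows in enumerate(q_blocks):
            states[i] = fa_step(q_rows, k_rows, v_rows, states[i])
    return [finalize(s) for s in states], kv_fetches


def reference_attention(q_rows, k_rows, v_rows):
    """Plain softmax attention over the full K and V, for comparison only."""
    scale = 1.0 / math.sqrt(len(q_rows[0]))
    out = []
    for q in q_rows:
        e = [math.exp(scale * sum(a * b for a, b in zip(q, k))) for k in k_rows]
        z = sum(e)
        out.append([sum(ej * v[c] for ej, v in zip(e, v_rows)) / z
                    for c in range(len(v_rows[0]))])
    return out
```

In this model, calling `attention_with_kv_reuse([q1, q2], [(k1, v1), (k2, v2)])` counts one modeled transfer per KV block, whereas a schedule that re-fetched the KV block for every Q block would count one transfer per (Q block, KV block) pair. The blockwise outputs should match `reference_attention` over the concatenated K and V up to floating-point rounding.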