Method for executing attention mechanism computing task, and related product

WO2026124224A1 · PCT designated stage · Publication Date: 2026-06-18 · CAMBRICON (KUNSHAN) INFORMATION TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
CAMBRICON (KUNSHAN) INFORMATION TECHNOLOGY CO LTD
Filing Date
2025-11-27
Publication Date
2026-06-18

AI Technical Summary

Technical Problem

Existing technologies suffer from uneven computational load when using SIMD architecture to perform attention mechanism computation tasks, resulting in some computing units being idle, and it is difficult to achieve effective task splitting by designing complex load balancing algorithms.

Method used

An adaptive scheduling scheme is adopted, which allows the computing units to actively request the next computing task based on their own completion status, thus avoiding the design of complex load balancing algorithms and making full use of the parallel computing power of multiple computing units.

Benefits of technology

It enables fast execution of attention mechanism computation tasks, avoids idle computing units, simplifies development and maintenance, and improves computational efficiency.

✦ Generated by Eureka AI based on patent content.


Abstract

A method for using an artificial intelligence processor to execute an attention mechanism computing task, and a related product, which relate to the technical field of artificial intelligence. An artificial intelligence processor (201) may be comprised in a combined processing apparatus (20), and the combined processing apparatus (20) may further comprise an interface apparatus (202) and a processing apparatus (203). The artificial intelligence processor (201) interacts with the processing apparatus (203), so as to jointly complete a computing operation specified by a user. The combined processing apparatus (20) may further comprise a storage apparatus (204), wherein the storage apparatus (204) is separately connected to the artificial intelligence processor (201) and the processing apparatus (203), and is used for storing data of the artificial intelligence processor (201) and data of the processing apparatus (203). The method for using an artificial intelligence processor to execute an attention mechanism computing task and the related product provide an adaptive scheduling scheme for the attention mechanism computing task, such that the parallel computing capability of the artificial intelligence processor can be fully exerted, thereby avoiding the situation of idle computing power.

Description

Methods and related products for performing attention-based computational tasks

Cross-reference to related applications

[0001] This application claims priority to Chinese patent application filed on December 9, 2024, application number 202411796786.2, entitled "Method and related products for performing attention mechanism computational tasks".

Technical Field

[0002] This disclosure generally relates to the field of artificial intelligence technology. More specifically, this disclosure relates to a method for performing attention mechanism computational tasks using an artificial intelligence processor, an artificial intelligence processor, a chip, and a board.

Background Technology

[0003] Currently, deep learning algorithms based on attention mechanisms and their variants have demonstrated powerful performance in tasks such as natural language processing, image processing, and assisted programming, and their applications are becoming increasingly widespread at an unprecedented rate. However, training and using these algorithms for inference involves enormous computational demands, and improving the computational speed of attention mechanisms and their variants has become a significant challenge limiting the further development of these algorithms.

[0004] Artificial intelligence (AI) processor chips are typically designed to efficiently execute deep learning and machine learning algorithms. To handle massive amounts of data and complex algorithms, AI processor chips usually need high parallel computing capabilities, enabling them to process multiple data points simultaneously. Single Instruction Multiple Data (SIMD), as a parallel computing architecture, uses multiple processing units to execute the same instruction simultaneously. A single instruction acts on multiple data points, thus achieving parallel processing of multiple data points and improving program execution speed. In high-throughput scenarios, to maximize SIMD performance, the data to be processed needs to be distributed as evenly as possible across each processing unit. This avoids situations where some processing units are idle waiting for data while others are still processing, thereby improving overall processing efficiency.

[0005] However, for some complex computational tasks, such as the attention mechanism computation task mentioned above, task decomposition can be cumbersome and complicated. Although attempts are made to distribute the computational task evenly across multiple computing units according to the scale of the computational task, in practical applications, this often leads to computational bottlenecks in multiple computing units.

[0006] In view of this, there is an urgent need to provide a solution for using artificial intelligence processors to perform attention mechanism computation tasks, so as to achieve the computation of attention mechanism at a faster running speed and avoid wasting computing power.

Summary of the Invention

[0007] In order to at least solve one or more technical problems described in the background section above, this disclosure proposes the following technical solutions and several embodiments thereof.

[0008] In a first aspect, this disclosure proposes a method for performing attention mechanism computation tasks using an artificial intelligence processor, comprising: dividing the attention mechanism computation task into multiple unit computation tasks; and, until the attention mechanism computation task is completed, each computation unit cyclically performing: requesting a unit computation task; and completing the requested unit computation task; wherein the artificial intelligence processor comprises at least two of the computation units.
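
The request-and-complete loop of the first aspect can be sketched in software as a pull-based worker pool. This is an illustration only, assuming that Python threads stand in for computation units and that a thread-safe queue stands in for whatever mechanism hands out unit computation tasks on the processor. The names `run_pull_scheduled`, `unit_tasks`, and `num_units` are hypothetical and do not appear in the disclosure.

```python
import queue
import threading


def run_pull_scheduled(unit_tasks, num_units):
    """Run the callables in unit_tasks on num_units workers that pull work themselves.

    Assumption: threads model computation units; a queue models the task source.
    """
    if num_units < 2:
        raise ValueError("the disclosure assumes at least two computation units")

    pending = queue.Queue()
    for task_id, task in enumerate(unit_tasks):
        pending.put((task_id, task))

    results = {}
    completed_by = {unit: [] for unit in range(num_units)}
    lock = threading.Lock()

    def computation_unit(unit):
        while True:
            try:
                task_id, task = pending.get_nowait()  # request a unit computation task
            except queue.Empty:
                return  # nothing left to request: the overall task is complete
            value = task()  # complete the requested unit computation task
            with lock:
                results[task_id] = value
                completed_by[unit].append(task_id)

    workers = [
        threading.Thread(target=computation_unit, args=(unit,))
        for unit in range(num_units)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results, completed_by
```

No unit is told in advance which unit tasks it will run. A unit that finishes early simply requests again, so the number of tasks recorded in `completed_by` for each unit can differ from run to run.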

[0009] In a second aspect, this disclosure discloses an artificial intelligence processor, comprising: a control unit configured to divide an attention mechanism computation task into multiple unit computation tasks; and a computation unit configured to cyclically perform the following operations until the attention mechanism computation task ends: requesting a unit computation task; and completing the requested unit computation task; wherein the artificial intelligence processor includes at least two of the computation units.

[0010] In the third aspect, this disclosure discloses a chip configured to include the artificial intelligence processor described in the second aspect.

[0011] In the fourth aspect, this disclosure discloses a board that includes the chip described in the third aspect.

[0012] With the method, artificial intelligence processor, chip, and board proposed in this disclosure, a computing unit that completes its current round of tasks can automatically request the tasks to be processed in the next round. This adaptive approach solves the problem of uneven task allocation, prevents some computing units from being idle, and fully utilizes the parallel computing power of the computing units. Moreover, the adaptive scheduling scheme in the embodiments of this disclosure avoids the design of complex load balancing algorithms, which would increase development and maintenance difficulties.

Attached Figure Description

[0013] The above and other objects, features, and advantages of exemplary embodiments of this disclosure will become readily apparent upon reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of this disclosure are illustrated by way of example and not limitation, and like or corresponding reference numerals denote like or corresponding parts, wherein:

[0014] Figure 1 shows a schematic diagram of the board structure according to an embodiment of this disclosure;

[0015] Figure 2 shows a schematic diagram of the combined processing device in the chip according to an embodiment of this disclosure;

[0016] Figure 3 illustrates an exemplary internal structure diagram of an artificial intelligence processor in some embodiments of this disclosure;

[0017] Figure 4 shows an exemplary structural diagram of a processor core in some embodiments of this disclosure;

[0018] Figure 5 illustrates an exemplary schematic diagram of a processor core writing data to a processor core of another cluster in some embodiments of this disclosure;

[0019] Figure 6 shows an exemplary schematic diagram of a two-layer, three-stage pipeline in some embodiments of this disclosure;

[0020] Figure 7 shows an exemplary block diagram of the computing module in some embodiments of this disclosure;

[0021] Figure 8 shows an exemplary flowchart of a method for performing attention mechanism computational tasks using an artificial intelligence processor in some embodiments of this disclosure;

[0022] Figure 9 shows an exemplary schematic diagram of the self-attention mechanism calculation process in some embodiments of this disclosure;

[0023] Figure 10 illustrates an exemplary schematic diagram of the multi-head self-attention mechanism computation process in some embodiments of this disclosure;

[0024] Figure 11 shows a schematic diagram illustrating the impact of masking operations on the execution time of the arithmetic unit.

Detailed Implementation

[0025] The technical solutions in the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this disclosure, not all of them. Based on the embodiments in this disclosure, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this disclosure.

[0026] It should be understood that the terms “comprising” and “including” used in this disclosure and claims indicate the presence of the described features, integers, steps, operations, elements and / or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and / or collections thereof.

[0027] It should also be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of this disclosure. As used in this disclosure and claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” as used in this disclosure and claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes such combinations.

[0028] As used in this specification and claims, the term "if" may be interpreted, depending on the context, as "when," "once," "in response to determination," or "in response to detection." Similarly, the phrase "if determined" or "if [described condition or event] is detected" may be interpreted, depending on the context, as "once determined," "in response to determination," "once [described condition or event] is detected," or "in response to detection of [described condition or event]."

[0029] The specific embodiments disclosed herein will now be described in detail with reference to the accompanying drawings.

[0030] The working principle of the attention mechanism is mainly to allow the model to focus on information at different positions in the input sequence, capturing long-distance dependencies. As a variant of the attention mechanism, self-attention can handle internal dependencies in a sequence without relying on external information. Currently, deep learning algorithms based on attention mechanisms and their variants have demonstrated powerful performance in tasks such as natural language processing, image processing, and assisted programming, and their applications are becoming increasingly widespread at an unprecedented rate. However, training and using these algorithms for inference involves enormous computational costs, and improving the computational speed of attention mechanisms and their variants has become a significant problem limiting the further development of these algorithms.

[0031] Parallel processing technology is an important method for solving large-scale computing tasks. SIMD, as a parallel computing architecture, uses multiple processing units to execute the same instruction simultaneously, thereby achieving parallel processing of multiple data sets and improving program execution speed. SIMD architecture achieves instruction-level parallelism, meaning a single instruction operates on multiple data sets. This requires that when using SIMD architecture in high-throughput scenarios such as executing complex algorithms and processing large amounts of data, the computational data be distributed as evenly as possible across each processing unit.

[0032] For self-attention mechanisms and their variants, existing technologies typically handle these computational tasks within a SIMD architecture by evenly distributing the task across the various computational units. For example, the scale of a multi-head attention mechanism computational task is primarily determined by the batch size, head number, and sequence length. The batch size refers to the number of samples processed in a training batch or the number of samples processed simultaneously in a batch during inference. The head number refers to the number of attention heads obtained by splitting the Q (query), K (key), and V (value) matrices along the word embedding or model dimension. The sequence length refers to the number of elements in the input sequence; for example, when processing text data, the sequence length can be the number of words in a sentence. When splitting a multi-head attention mechanism computational task, a common approach in existing technologies is to evenly distribute the total number of elements (SUM = Batch Size * Head Number * Sequence Length) across the computational units.
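
The even split described above can be written down in a few lines. The sketch below is a hypothetical illustration of that prior-art approach, not code from the disclosure. The function name `even_split` and the choice to hand any remainder to the lowest-numbered units are assumptions.

```python
def even_split(batch_size, head_number, sequence_length, num_units):
    """Split SUM = batch_size * head_number * sequence_length elements evenly.

    Returns one (start, end) half-open index range per computational unit.
    Assumption: leftover elements go to the lowest-numbered units.
    """
    total = batch_size * head_number * sequence_length  # SUM in the text
    base, extra = divmod(total, num_units)
    ranges = []
    start = 0
    for unit in range(num_units):
        size = base + (1 if unit < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges
```

For example, `even_split(2, 4, 8, 4)` gives each of four units 16 of the 64 elements. The split looks only at element counts and ignores how much work each element actually requires.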

[0033] The inventors of this disclosure have discovered the following problem in the prior art: Due to the masking operation in multi-head attention computation, some data in the dot product of Q and K becomes invalid, having no effect on subsequent calculations. Therefore, the existing method of evenly distributing computational tasks across each computing unit based on the scale of the attention mechanism computation task essentially leads to uneven computational load among computing units, resulting in some computing power being idle. Artificial intelligence processors are typically designed to handle highly parallel computational tasks, and to simplify processor design, complex task scheduling logic is generally not required. Therefore, to solve the aforementioned uneven load problem, the conventional design approach involves analyzing the impact of the masking operation in detail before allocating tasks to computing units, and splitting tasks considering this impact to distribute them as evenly as possible. However, this solution requires designing very complex load balancing algorithms, which is very difficult to develop. Moreover, if the masking operation varies with different attention mechanism computation tasks, task splitting schemes need to be analyzed and designed separately for different attention mechanism computation tasks, increasing the workload. Furthermore, if operations that affect computational load, such as mask operations, are variable within a single attention mechanism computation task, it becomes nearly impossible to accurately analyze their impact and perform effective task decomposition.
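
A small calculation shows how a mask can unbalance an even split. The sketch below assumes a causal mask, in which query row i keeps only key columns 0 to i. The disclosure does not fix a particular mask, so this is one illustrative case, and the function name `unmasked_work_per_unit` is hypothetical.

```python
def unmasked_work_per_unit(sequence_length, num_units):
    """Count unmasked Q-K entries per unit when query rows are split evenly.

    Assumption: causal mask, so query row i keeps i + 1 entries (columns 0..i).
    """
    rows_per_unit, remainder = divmod(sequence_length, num_units)
    if remainder:
        raise ValueError("sketch assumes sequence_length is divisible by num_units")
    work = []
    for unit in range(num_units):
        first = unit * rows_per_unit
        last = first + rows_per_unit
        work.append(sum(row + 1 for row in range(first, last)))
    return work
```

With a sequence length of 8 and four units, every unit receives two rows, yet the unmasked entry counts are 3, 7, 11, and 15. If masked entries are skipped, the last unit has five times the effective work of the first.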

[0034] In light of this, the inventors have deviated from conventional design approaches. Instead of attempting to construct complex task balancing algorithms to implement task splitting schemes, which then distribute tasks to computing units that passively accept and execute assigned computational tasks, the inventors have enabled computing units to proactively request the next computational task based on their own performance—that is, to achieve adaptive scheduling. Therefore, this disclosure proposes an artificial intelligence processor and a method for using the artificial intelligence processor to execute attention mechanism computational tasks, thereby fully utilizing the parallel computing power of multiple computing units to complete attention mechanism computational tasks at a faster execution speed.
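
The difference between passive assignment and adaptive scheduling can be illustrated with a simple timing model. The sketch below is not the disclosed implementation. It assumes each unit computation task has a known cost in arbitrary time units, that requesting a task costs nothing, and that tasks are requested in list order. The function names are hypothetical.

```python
import heapq


def static_makespan(costs, num_units):
    """Finish time when tasks are pre-assigned in equal-sized contiguous blocks."""
    block = -(-len(costs) // num_units)  # ceiling division
    return max(sum(costs[i:i + block]) for i in range(0, len(costs), block))


def adaptive_makespan(costs, num_units):
    """Finish time when each unit requests the next task as soon as it is free.

    Assumption: zero request overhead; tasks are handed out in list order.
    """
    free_at = [0] * num_units  # min-heap of times at which units become idle
    heapq.heapify(free_at)
    for cost in costs:
        start = heapq.heappop(free_at)  # the earliest idle unit requests the task
        heapq.heappush(free_at, start + cost)
    return max(free_at)
```

With task costs 1 through 8 and two units, the pre-assigned split finishes at time 26 because one unit receives all the expensive tasks, while the request-driven schedule finishes at time 20. The ideal for a total cost of 36 on two units is 18, so the adaptive schedule is close to balanced without any analysis of the costs beforehand.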

[0035] It is important to note that in this disclosed embodiment, the attention mechanism computation task can exist in various application fields, and the data involved can be of various types. For example, some major application areas of attention mechanisms include: machine translation, sentiment analysis, text summarization, text classification, and named entity recognition in Natural Language Processing (NLP), with corresponding data types including text data, word embeddings, and sentence structures; facial recognition, image classification, object detection, and image segmentation in Computer Vision (CV), with corresponding data types including pixel data and image features; intelligent assistants, automatic caption generation, and speech-to-text conversion in Automatic Speech Recognition (ASR), with corresponding data types including audio signals and spectrograms; and generative AI applications. Generative AI refers to artificial intelligence technology that uses complex algorithms, models, and rules to learn from large-scale datasets to create new original content, such as, but not limited to, content of various types including text, images, sound, video, and code. Accordingly, the input data processed by the attention mechanism computation task can be image data, audio data, video data, speech data, text data, document data, etc. The output of attention mechanism computation tasks can include a probability score of an image belonging to a specific object category, a probability score of a document being about a specific topic, a probability score of a text fragment in the target language being a correct translation of a text fragment in the source language, or a probability score of a text fragment being a correct transcription of spoken discourse, etc.

[0036] Figure 1 shows a schematic diagram of the structure of a board 10 according to an embodiment of this disclosure. As shown in Figure 1, the board 10 includes a chip 101, which is a system-on-a-chip (SoC) that integrates one or more combined processing devices. The combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms, meeting the intelligent processing needs of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. In particular, deep learning technology is widely used in the field of cloud intelligence. A significant characteristic of cloud intelligence applications is the large amount of input data, which places high demands on the platform's storage and computing capabilities. The board 10 of this embodiment is suitable for cloud intelligence applications, possessing massive off-chip storage, on-chip storage, and substantial computing power.

[0037] Chip 101 is connected to external device 103 via external interface device 102. External device 103 may be, for example, a server, computer, camera, monitor, mouse, keyboard, network card, or Wi-Fi interface. Data to be processed can be transmitted from external device 103 to chip 101 via external interface device 102. The calculation results from chip 101 can be transmitted back to external device 103 via external interface device 102. Depending on the application scenario, external interface device 102 may have different interface forms, such as a PCIe interface.

[0038] The board 10 also includes a storage device 104 for storing data, which includes one or more memory cells 105. The storage device 104 is connected to and transmits data with the controller 106 and the chip 101 via a bus. The controller 106 in the board 10 is configured to regulate the state of the chip 101. Therefore, in one application scenario, the controller 106 may include a microcontroller (MCU).

[0039] Figure 2 is a structural diagram illustrating the combined processing device in chip 101 of this embodiment. As shown in Figure 2, the combined processing device 20 includes an artificial intelligence processor 201, an interface device 202, a processing device 203, and a storage device 204.

[0040] The artificial intelligence processor 201 is configured to execute user-specified operations. It is mainly implemented as a single-core or multi-core intelligent processor to perform deep learning or machine learning calculations. It can interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.

[0041] Interface device 202 is used to transmit data and control commands between artificial intelligence processor 201 and processing device 203. For example, artificial intelligence processor 201 can obtain input data from processing device 203 via interface device 202 and write it to on-chip storage device of artificial intelligence processor 201. Further, artificial intelligence processor 201 can obtain control commands from processing device 203 via interface device 202 and write them to on-chip control cache of artificial intelligence processor 201. Alternatively or optionally, interface device 202 can also read data from storage device of artificial intelligence processor 201 and transmit it to processing device 203.

[0042] The processing device 203, as a general-purpose processing device, performs basic controls including but not limited to data transfer and starting and / or stopping the artificial intelligence processor 201. Depending on the implementation, the processing device 203 may be one or more types of processors, including but not limited to digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs. As mentioned above, the artificial intelligence processor 201 disclosed herein can be considered as having a single-core structure or a homogeneous multi-core structure. However, when the artificial intelligence processor 201 and the processing device 203 are considered together, they are considered to form a heterogeneous multi-core structure.

[0043] Storage device 204 is an off-chip memory used to store data to be processed. It may be DRAM, for example DDR memory, typically 16G or larger, and stores data of artificial intelligence processor 201 and / or processing device 203.

[0044] Figure 3 shows a schematic diagram of the internal structure of the artificial intelligence processor 201. The artificial intelligence processor 201 is used to process input data from fields such as computer vision, speech, natural language processing, and data mining. The artificial intelligence processor 201 in the figure adopts a multi-core hierarchical architecture design. As a system-on-a-chip, it includes multiple clusters, and each cluster includes multiple processor cores. In other words, the artificial intelligence processor 201 is constructed in a hierarchical structure of system-on-a-chip, cluster, and processor cores.

[0045] From the perspective of the system-on-a-chip hierarchy, as shown in Figure 3, the artificial intelligence processor 201 includes an external storage controller 301, a peripheral communication module 302, an on-chip interconnect module 303, a synchronization module 304, and multiple clusters 305.

[0046] There can be multiple external storage controllers 301; two are shown exemplarily in the figure. These controllers respond to access requests from the processor core to access external storage devices, such as DRAM 204 in Figure 2, thereby reading data from or writing data to external storage. The peripheral communication module 302 receives control signals from the processing device 203 via the interface device 202, initiating the AI processor 201 to execute tasks. The on-chip interconnect module 303 connects the external storage controllers 301, the peripheral communication module 302, and multiple clusters 305 to transmit data and control signals between the modules. The synchronization module 304 is a global barrier controller (GBC) used to coordinate the working progress of each cluster and ensure information synchronization. The multiple clusters 305 are the computing cores of the AI processor 201; four are shown exemplarily in the figure. With hardware development, the AI processor 201 disclosed herein may also include eight, sixteen, sixty-four, or even more clusters 305. The clusters 305 are used to efficiently execute deep learning algorithms.

[0047] In terms of cluster hierarchy, as shown in Figure 3, each cluster 305 includes multiple processor cores (IPU cores) 306 and one storage core (MEM core) 307.

[0048] Four processor cores 306 are shown exemplarily in the figure, but this disclosure does not limit the number of processor cores 306. Figure 4 shows a schematic diagram of the internal structure of the processor core. Each processor core 306 includes three main modules: a control module 41, a computation module 42, and a storage module 43.

[0049] The control module 41 coordinates and controls the operation of the computation module 42 and the storage module 43 to complete the deep learning task. It includes an instruction fetch unit (IFU) 411 and an instruction decode unit (IDU) 412. The instruction fetch unit 411 fetches instructions from the processing device 203, and the instruction decode unit 412 decodes the fetched instructions and sends the decoding result as control information to the computation module 42 and the storage module 43.

[0050] The computation module 42 includes a vector operation unit 421 and a matrix operation unit 422. The vector operation unit 421 is used to perform vector operations and can support complex operations such as vector multiplication, addition, and nonlinear transformations; the matrix operation unit 422 is responsible for the core computations of deep learning algorithms, such as matrix multiplication and convolution.

[0051] Storage module 43 is used to store or move related data, including neuron RAM (NRAM) 431, weight RAM (WRAM) 432, input / output direct memory access (IODMA) 433, and move direct memory access (MVDMA) 434. NRAM 431 is used to store feature maps for computation by processor core 306 and intermediate results after computation; WRAM 432 is used to store the weights of the deep learning network; IODMA 433 controls the memory access of NRAM 431 / WRAM 432 and DRAM 204 through broadcast bus 309; MVDMA 434 controls the memory access of NRAM 431 / WRAM 432 and SRAM 308.

[0052] Returning to Figure 3, storage core 307 is primarily used for storage and communication, namely storing shared data or intermediate results among processor cores 306, and performing communication between cluster 305 and DRAM 204, communication between clusters 305, and communication between processor cores 306. In other embodiments, storage core 307 has scalar operation capabilities for performing scalar operations.

[0053] Storage core 307 includes a shared memory unit (SRAM) 308, a broadcast bus 309, a cluster direct memory access (CDMA) module 310, and a global direct memory access (GDMA) module 311. SRAM 308 acts as a high-performance data relay station. Data multiplexed between different processor cores 306 within the same cluster 305 does not need to be obtained from DRAM 204 by each processor core 306 individually. Instead, it is relayed between processor cores 306 via SRAM 308. Storage core 307 only needs to quickly distribute the multiplexed data from SRAM 308 to multiple processor cores 306, thereby improving inter-core communication efficiency and significantly reducing on-chip and off-chip I / O access.

[0054] Broadcast bus 309, CDMA 310, and GDMA 311 are used to perform communication between processor cores 306, communication between clusters 305, and data transfer between cluster 305 and DRAM 204, respectively. These will be explained below.

[0055] The broadcast bus 309 is used to complete high-speed communication between the processor cores 306 within the cluster 305. In this embodiment, the broadcast bus 309 supports inter-core communication methods including unicast, multicast, and broadcast. Unicast refers to point-to-point (i.e., data transmission from one processor core to another) data transmission. Multicast is a communication method that transmits a piece of data from SRAM 308 to several specific processor cores 306. Broadcast is a communication method that transmits a piece of data from SRAM 308 to all processor cores 306, and is a special case of multicast.
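
The relationship between the three communication methods can be modeled in a few lines. The sketch below is a toy model only: it assumes cores are numbered from 0 and represents a transfer as a dictionary mapping each receiving core to the data. It says nothing about how broadcast bus 309 is actually built, and all function names are hypothetical.

```python
def deliver(data, targets, num_cores):
    """Return a dict mapping each receiving core index to the delivered data."""
    targets = list(targets)
    for core in targets:
        if not 0 <= core < num_cores:
            raise ValueError(f"core {core} is outside the cluster")
    return {core: data for core in sorted(set(targets))}


def unicast(data, target, num_cores):
    # Point-to-point: exactly one receiving core.
    return deliver(data, [target], num_cores)


def multicast(data, targets, num_cores):
    # One piece of data to several specific cores.
    return deliver(data, targets, num_cores)


def broadcast(data, num_cores):
    # Broadcast is multicast with every core in the cluster as a target.
    return multicast(data, range(num_cores), num_cores)
```

Writing `broadcast` in terms of `multicast` mirrors the statement that broadcast is a special case of multicast.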

[0056] CDMA 310 is used to control SRAM 308 access between different clusters 305 within the same AI processor 201. Figure 5 illustrates the working principle of CDMA 310 when one processor core wants to write data to the processor core of another cluster. In this application scenario, the same AI processor includes multiple clusters. For ease of explanation, only cluster 0 and cluster 1 are shown in the figure. Cluster 0 and cluster 1 each include multiple processor cores. Similarly, for ease of explanation, only processor core 0 is shown in cluster 0, and only processor core 1 is shown in cluster 1. Processor core 0 wants to write data to processor core 1.

[0057] First, processor core 0 sends a unicast write request to write data into its local SRAM 0. CDMA 0 acts as the master and CDMA 1 acts as the slave. The master pushes the write request to the slave, that is, the master sends the write address AW and the write data W to transmit the data to SRAM 1 of cluster 1. Then, the slave sends a write response B as a response. Finally, processor core 1 of cluster 1 sends a unicast read request to read the data from SRAM 1.
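
The write sequence above can be traced step by step in a toy model. The sketch below assumes each cluster's SRAM is a single slot in a dictionary and records only the order of messages. It does not model timing or the real CDMA hardware, and the function name `cross_cluster_write` is hypothetical.

```python
def cross_cluster_write(data):
    """Trace core 0 (cluster 0) writing data to core 1 (cluster 1).

    Assumption: one SRAM slot per cluster; messages are recorded as strings.
    """
    sram = {0: None, 1: None}
    trace = []

    # Step 1: core 0 issues a unicast write into its local SRAM 0.
    sram[0] = data
    trace.append("core0 -> SRAM0: unicast write")

    # Step 2: master CDMA 0 pushes write address AW and write data W to slave CDMA 1.
    write_address, write_data = 1, sram[0]
    sram[write_address] = write_data
    trace.append("CDMA0 -> CDMA1: AW, W")

    # Step 3: slave CDMA 1 answers with write response B.
    trace.append("CDMA1 -> CDMA0: B")

    # Step 4: core 1 issues a unicast read from SRAM 1.
    received = sram[1]
    trace.append("SRAM1 -> core1: unicast read")
    return received, trace
```

Calling `cross_cluster_write("tile")` returns the data as seen by processor core 1 together with the four recorded steps in order.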

[0058] Returning to Figure 3, GDMA 311 works in conjunction with external memory controller 301 to control memory access from SRAM 308 to DRAM 204 in cluster 305, or to read data from DRAM 204 into SRAM 308. As previously mentioned, communication between DRAM 204 and NRAM 431 or WRAM 432 can be achieved through two channels. The first channel is direct communication between DRAM 204 and NRAM 431 or WRAM 432 via IODMA 433; the second channel involves first transferring data between DRAM 204 and SRAM 308 via GDMA 311, and then transferring data between SRAM 308 and NRAM 431 or WRAM 432 via MVDMA 434. Although the second channel appears to require more components and has a longer data flow, in some embodiments, the bandwidth of the second channel is actually much greater than that of the first channel. Therefore, communication between DRAM 204 and NRAM 431 or WRAM 432 may be more efficient via the second channel. The embodiments disclosed herein can select the data transmission channel based on their hardware capabilities.
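
The choice between the two channels can be framed as a comparison of estimated transfer times. The sketch below is a rough model under stated assumptions: each hop has a fixed bandwidth in bytes per second, the two hops of the second channel run one after the other without overlap, and setup latency is ignored. The function names and any bandwidth figures passed in are hypothetical.

```python
def transfer_time(num_bytes, bandwidths):
    """Seconds for data to cross the given hops in sequence (no overlap assumed)."""
    return sum(num_bytes / bandwidth for bandwidth in bandwidths)


def pick_channel(num_bytes, iodma_bw, gdma_bw, mvdma_bw):
    """Return ("first" or "second", estimated seconds) for a DRAM-to-NRAM/WRAM move.

    first  channel: DRAM -> NRAM/WRAM directly via IODMA
    second channel: DRAM -> SRAM via GDMA, then SRAM -> NRAM/WRAM via MVDMA
    """
    direct = transfer_time(num_bytes, [iodma_bw])
    staged = transfer_time(num_bytes, [gdma_bw, mvdma_bw])
    return ("second", staged) if staged < direct else ("first", direct)
```

Under this model the second channel wins only when both of its hops are fast enough that their combined time stays below the direct transfer time, which matches the observation that the longer path can still be the more efficient one.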

[0059] In other embodiments, the functions of GDMA 311 and IODMA 433 can be integrated into the same component. For ease of description, this disclosure treats GDMA 311 and IODMA 433 as different components. For those skilled in the art, any component whose implemented functions and achieved technical effects are similar to those disclosed herein falls within the scope of protection of this disclosure. Furthermore, the functions of GDMA 311, IODMA 433, CDMA 310, and MVDMA 434 can also be implemented by the same component. Similarly, any component whose implemented functions and achieved technical effects are similar to those disclosed herein falls within the scope of protection of this disclosure.

[0060] One of the key reasons why the computing device 201 has strong computing power is its three-level operation hierarchy of system-on-chip, cluster, and processor core, combined with a three-level memory design of DRAM, SRAM, and NRAM / WRAM, which allows data to be cached and computed at the appropriate level, forming an adequate pipeline.

[0061] The computing device 201 performs calculations in three main stages: Load stage: loading data; Compute stage: transferring data, performing calculations, and transferring intermediate results; Store stage: storing the results.

[0062] More specifically, in some embodiments, a two-layer, three-stage pipeline can be employed, as shown in Figure 6. The first-layer load stage 601, first-layer computation stage 602, and first-layer write-back stage 603 occur at the cluster level. In the first-layer load stage 601, the GDMA 311 loads data from DRAM 204 into SRAM 308. In the first-layer computation stage 602, the cluster 305 performs calculations on the loaded on-chip unit graph and generates the calculation results. In the first-layer write-back stage 603, the GDMA 311 writes the calculation results back from SRAM 308 to DRAM 204.

[0063] Since cluster 305 includes multiple processor cores 306, the first-layer computation stage 602 actually divides the on-chip unit graph into corresponding subgraphs through storage core 307 and broadcasts them to at least one processor core 306 for computation. Therefore, the second-layer three-stage pipeline operates within processor core 306. More specifically, in the second-layer load stage 604, MVDMA 434 loads the subgraph from SRAM 308 into NRAM 431. In the second-layer computation stage 605, the subgraph and sub-weights are moved to the arithmetic module 42 for computation, and the intermediate results are then moved back to NRAM 431. In the second-layer write-back stage 606, MVDMA 434 writes the intermediate results from NRAM 431 back to SRAM 308.

[0064] The first-layer pipeline means that the first-layer load stage 601, the first-layer computation stage 602, and the first-layer write-back stage 603 can be performed in parallel. Take as an example the same cluster 305 processing the j-th, (j+1)-th, and (j+2)-th on-chip unit graphs. First, the j-th on-chip unit graph is loaded into SRAM 308 in the first-layer load stage 601. Then, the j-th on-chip unit graph is computed in the first-layer computation stage 602, and the first calculation result is transferred back to SRAM 308. While the j-th on-chip unit graph is being computed, the (j+1)-th on-chip unit graph is loaded into SRAM 308 in the first-layer load stage 607. When the first calculation result is written back to DRAM 204 in the first-layer write-back stage 603, the (j+1)-th on-chip unit graph is computed in the first-layer computation stage 608, and the second calculation result is transferred back to SRAM 308. At the same time, the (j+2)-th on-chip unit graph is loaded into SRAM 308 in the first-layer load stage 610. The first-layer pipeline proceeds in this manner.

[0065] To facilitate the aforementioned pipelined operation, the SRAM 308 in this embodiment includes two storage spaces: a ping storage unit and a pong storage unit. Data pipelining according to the ping-pong attribute of the SRAM 308 is divided into three types: input / output ping-pong (IO parity), input ping-pong (input parity), and no ping-pong (no parity). IO parity supports parallel loading, computation, and write-back. To achieve IO parity, the ping and pong storage units need to be exactly equal in size, to be used for loading and write-back respectively. Input ping-pong only supports parallel write-back and computation, which adds extra time to data transfer within the SRAM 308. Compared to IO parity, the ping and pong storage units do not need to be exactly equal in size, but an additional cache of the same size as the write-back storage space needs to be allocated. No ping-pong means that loading / write-back and computation are serial, and no additional space is required.

[0066] To achieve the aforementioned first-layer pipeline, in some embodiments the SRAM 308 has ping and pong storage units of the same size to achieve the input / output ping-pong effect. Continuing with the example in Figure 6, the storage area involved in the first-layer load stage 601, first-layer computation stage 602, and first-layer write-back stage 603 of the j-th on-chip unit graph is limited to the ping storage unit; the storage area involved in the first-layer load stage 607, first-layer computation stage 608, and first-layer write-back stage 609 of the (j+1)-th on-chip unit graph is limited to the pong storage unit; and the storage area involved in the first-layer load stage 610, first-layer computation stage 611, and first-layer write-back stage 612 of the (j+2)-th on-chip unit graph is again limited to the ping storage unit. The ping and pong storage units are used alternately in this manner.
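As a rough illustration of the first-layer pipeline and the alternating storage units described above, the following Python sketch lists which stage is active for which on-chip unit graph at each time step. It assumes, purely for illustration, that every stage takes one time step; the function name `pipeline_schedule` and the buffer labels "ping" and "pong" are hypothetical and do not come from the disclosure.

```python
def pipeline_schedule(num_tiles):
    """Return, per time step, the (stage, tile, buffer) triples that are active.

    Tile j is loaded at step j, computed at step j + 1, and written back at
    step j + 2. Even-numbered tiles use "ping", odd-numbered tiles use "pong".
    Equal stage durations are an assumption made only for this sketch.
    """
    stages = ("load", "compute", "write-back")
    schedule = []
    for step in range(num_tiles + len(stages) - 1):
        active = []
        for offset, stage in enumerate(stages):
            tile = step - offset
            if 0 <= tile < num_tiles:
                buffer = "ping" if tile % 2 == 0 else "pong"
                active.append((stage, tile, buffer))
        schedule.append(active)
    return schedule


if __name__ == "__main__":
    for step, active in enumerate(pipeline_schedule(3)):
        print(step, active)
```

With three tiles, the third time step shows all three stages running at once, which corresponds to stages 603, 608, and 610 overlapping in the description above.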

[0067] The second-layer pipeline means that the second-layer load stage 604, the second-layer computation stage 605, and the second-layer write-back stage 606 can be performed in parallel. Take as an example the same processor core 306 processing the i-th, (i+1)-th, and (i+2)-th subgraphs of the j-th on-chip unit graph. First, the i-th subgraph is broadcast to NRAM 431 in the second-layer load stage 604. Then, the i-th subgraph is computed in the second-layer computation stage 605 to produce the i-th intermediate result, which is then moved back to NRAM 431. Simultaneously, the (i+1)-th subgraph is broadcast to NRAM 431 in the second-layer load stage 613. The i-th intermediate result is written back to SRAM 308 in the second-layer write-back stage 606. At the same time, the (i+1)-th subgraph is computed in the second-layer computation stage 614 to generate the (i+1)-th intermediate result, which is moved back to NRAM 431, and the (i+2)-th subgraph is loaded into NRAM 431 in the second-layer load stage 615.

[0068] Considering that each cluster 305 has different tasks and the completion time will naturally be different, the synchronization module 304 in some embodiments will use synchronization barrier instructions to synchronize the task completion time in order to avoid timing errors.

[0069] To improve computational efficiency and throughput, the arithmetic module 42 in processor core 306 can be designed to support multi-stage pipelined computation.

[0070] Figure 7 shows a block diagram of a computing module according to some embodiments of this disclosure. The computing module here can be a computing module within a single processor core, or a computing module considering multiple processor cores jointly. As shown in Figure 7, the computing module 42 can include multiple computing units, each computing unit can be configured as a multi-stage computing pipeline, and multiple computing units constitute multiple multi-stage computing pipelines.

[0071] Figure 7 illustrates a first group of pipelined operation circuits 701, a second group of pipelined operation circuits 702, and a third group of pipelined operation circuits 703, where each group of pipelined operation circuits can constitute a multi-stage operation pipeline in the context of this disclosure. Taking the first group of pipelined operation circuits 701, which constitutes the first multi-stage operation pipeline, as an example, it can perform pipelined operations including stage 1-1, stage 1-2, stage 1-3, ... stage 1-N, for a total of N stages of pipelined operations. Similarly, the second and third groups of pipelined operation circuits also have structures that support N stages of pipelined operations. Through this exemplary architecture, those skilled in the art can understand that the multiple groups of pipelined operation circuits disclosed herein can constitute multiple multi-stage operation pipelines, and these multiple multi-stage operation pipelines can execute their respective multiple operation instructions in parallel. These operation instructions can be obtained by parsing the computation instructions.

[0072] To execute each stage of the pipelined operation described above, arithmetic circuits, including one or more arithmetic units, can be arranged at each stage to execute the corresponding arithmetic instructions, thereby implementing the arithmetic operation at that stage. In some embodiments, in response to receiving multiple arithmetic instructions, one or more sets of pipelined circuits disclosed herein can be configured to perform multiple data operations, such as executing single instruction multiple data (SIMD) instructions. For example, the aforementioned multiple arithmetic instructions can be obtained by the control module in processor core 306 parsing the received computation instructions, and the opcode of the computation instructions can represent multiple operations performed by the multi-stage arithmetic pipeline.

[0073] To achieve multi-stage pipelined operations, each stage of the pipelined operation can include, but is not limited to, one or more of the following arithmetic units or circuits: random number processing circuit, addition / subtraction circuit, subtraction circuit, lookup table circuit, parameter configuration circuit, multiplier, pooler, comparator, absolute value circuit, logic unit, position indexing circuit, or filter.

[0074] The above describes the relevant hardware implementation of the embodiments disclosed herein. Based on the above hardware environment, this disclosure proposes a scheme for performing attention mechanism computation tasks using an artificial intelligence processor. This scheme can fully utilize the parallel computing power of multiple computing units in the artificial intelligence processor, avoiding idle situations for some computing units, thereby completing the attention mechanism computation task at a faster running speed.

[0075] Figure 8 shows an exemplary flowchart of a method for performing attention mechanism computation tasks using an artificial intelligence processor in some embodiments of this disclosure. As shown in Figure 8, in step 801, the attention mechanism computation task is divided into multiple unit computation tasks. Then, in step 802, until the attention mechanism computation task is completed, each computing unit in the artificial intelligence processor cyclically executes: requesting a unit computation task and completing the requested unit computation task. The artificial intelligence processor includes at least two of the aforementioned computing units. Step 801 can be executed, for example, by the control module 41 in the processor core 306 of the artificial intelligence processor shown in Figure 4, which can also be called a control unit. Step 802 is executed by each computing unit, corresponding to the computing module 42 in the processor core 306 shown in Figure 4. As mentioned above, the computing module 42 can include multiple computing units, each computing unit can be configured as a multi-stage computing pipeline, and multiple computing units constitute multiple multi-stage computing pipelines, which can execute their respective unit computation tasks in parallel.

[0076] It can be understood that, according to the method for using an AI processor to perform attention mechanism computation tasks as proposed in this disclosure, a computing unit that has completed its current round of tasks can automatically request the tasks to be processed in the next round. This task scheduling method is adaptive: it addresses uneven task allocation, avoids leaving some computing units idle, and makes full use of the parallel computing power of the computing units. Moreover, the adaptive scheduling scheme in the embodiments of this disclosure avoids the need to design complex load balancing algorithms, which would increase development and maintenance difficulty. It can also be understood that the attention mechanism here can be the basic attention mechanism or any of its variants, including but not limited to self-attention mechanisms, multi-head self-attention mechanisms, and self-attention and multi-head self-attention mechanisms with various masking operations.

[0077] Figure 9 illustrates an exemplary schematic diagram of the self-attention mechanism calculation process in some embodiments of this disclosure. As shown in Figure 9, for the input sequence data S = [S1; S2; S3; S4], firstly, three matrices W_Q, W_K, and W_V are used to perform a linear transformation on it to obtain the query (Q) matrix, key (K) matrix, and value (V) matrix, respectively; secondly, an inner product operation is performed on Q and K to obtain the original score (Logits) matrix; then, a softmax operation is performed on each row of Logits to obtain the attention score (Score) matrix; finally, the score is multiplied by V to obtain the output (O) of the self-attention mechanism.
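The calculation flow above can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation of the disclosure: matrices are lists of rows, the helper names `matmul`, `softmax`, and `self_attention` are hypothetical, and the commonly used scaling of the logits by the square root of the key dimension is omitted because the description does not mention it.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(s, w_q, w_k, w_v):
    """Compute O = softmax(Q K^T) V with Q = S W_Q, K = S W_K, V = S W_V."""
    q, k, v = matmul(s, w_q), matmul(s, w_k), matmul(s, w_v)
    k_t = [list(col) for col in zip(*k)]          # transpose K
    logits = matmul(q, k_t)                       # raw score matrix
    score = [softmax(row) for row in logits]      # row-wise softmax
    return matmul(score, v)                       # output O


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(self_attention(identity, identity, identity, identity))
```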

[0078] As shown by the dashed arrows in Figure 9, the masking operation is introduced into the self-attention mechanism to obtain the masked self-attention mechanism. The mask is typically a matrix with the same shape as the input sequence, where the elements are either 0 or 1. When calculating attention weights, the mask is combined element-wise with the original score matrix to control which positions can be attended to and which positions need to be masked. For example, in Transformer-based natural language processing models, the lengths of the input sequences are often inconsistent. To batch process these sequences, shorter sequences are usually padded so that their length matches that of the longest sequence. However, these padding values have no semantic meaning in the actual context. Including them in the attention score calculation would introduce noise and affect the model's performance. Therefore, a padding mask operation can be introduced into the Transformer to mask the positions corresponding to the padding values. As another example, to ensure that the model can only see the sequence information before the current position when predicting the next word, and not the sequence information after the current position, a look-ahead mask operation can be introduced in the Transformer to block the positions after the current position.
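The two mask types mentioned above can be illustrated with a short Python sketch. The function names `padding_mask` and `look_ahead_mask` are hypothetical, and the convention that 1 means "may be attended to" and 0 means "masked" is an assumption consistent with the 0 / 1 mask described above.

```python
def padding_mask(lengths, max_len):
    """One row per sequence: 1 marks a real token, 0 marks a padding position."""
    return [[1 if pos < n else 0 for pos in range(max_len)] for n in lengths]


def look_ahead_mask(seq_len):
    """Row i may attend to columns 0..i; positions after i are masked with 0."""
    return [[1 if col <= row else 0 for col in range(seq_len)]
            for row in range(seq_len)]


if __name__ == "__main__":
    print(padding_mask([2, 3], 3))
    print(look_ahead_mask(3))
```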

[0079] Multi-head self-attention, as an extension of self-attention, repeatedly computes the attention mechanism. Each computation of the attention mechanism can be regarded as an attention head. Through multiple self-attention heads, complex relationships in the input sequence can be captured from different subspaces or perspectives, thereby enhancing the expressive power of the model.

[0080] Figure 10 illustrates an exemplary schematic diagram of the multi-head self-attention mechanism computation process in some embodiments of this disclosure. As shown in Figure 10, the input S = [S1; S2; S3; S4] is first linearly transformed to obtain matrices Q, K, and V, respectively. Then, a splitting operation is performed on the word embedding dimension or the model dimension (the feature dimension used to represent sequence elements during model computation) to obtain multiple attention heads. Specifically, Q is split to obtain matrices Q0 and Q1, K is split to obtain matrices K0 and K1, and V is split to obtain matrices V0 and V1. Then, similar to the self-attention mechanism, O_0 is calculated based on Q0, K0, and V0 through inner product, softmax, and matrix multiplication; O_1 is calculated based on Q1, K1, and V1 through inner product, softmax, and matrix multiplication. Finally, O_0 and O_1 are concatenated to obtain the output O of the multi-head self-attention mechanism. Accordingly, a masking operation is introduced into the attention score calculation operation of the multi-head attention mechanism to obtain the masked multi-head self-attention mechanism.
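The splitting and concatenation steps can be sketched as follows. This minimal Python example only shows how a matrix is split along the feature dimension into heads and how per-head results are joined again; the helper names `split_heads` and `concat_heads` are hypothetical, and an evenly divisible feature dimension is assumed.

```python
def split_heads(matrix, num_heads):
    """Split each row of `matrix` into `num_heads` equal column slices."""
    width = len(matrix[0])
    if width % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    step = width // num_heads
    return [[row[h * step:(h + 1) * step] for row in matrix]
            for h in range(num_heads)]


def concat_heads(heads):
    """Join per-head matrices back together along the feature dimension."""
    return [sum((head[i] for head in heads), []) for i in range(len(heads[0]))]


if __name__ == "__main__":
    q = [[1, 2, 3, 4], [5, 6, 7, 8]]
    q0, q1 = split_heads(q, 2)
    print(q0, q1, concat_heads([q0, q1]))
```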

[0081] For ease of description, Figures 9 and 10 illustrate the computational processes of the self-attention and multi-head self-attention mechanisms using a single sample as an example. It can be understood that in practical applications, such as during training, the model can process samples in batches, performing calculations on one batch of samples per iteration and then updating the model parameters. Similarly, during inference, the model can process samples in batches, completing the calculations for a whole batch of samples in a single pass, thus improving inference efficiency.

[0082] Returning to Figure 8, it is understandable that, based on the scale of the attention mechanism computation task, the attention mechanism computation task can be divided into multiple unit computation tasks.

[0083] In some embodiments, the step of dividing the attention mechanism computation task into multiple unit computation tasks may include: determining the maximum amount of data that a single computation unit can compute in a single operation; and dividing the attention mechanism computation task into multiple unit computation tasks along the sequence length dimension based on the maximum amount of data. Accordingly, in some embodiments, the control unit in the artificial intelligence processor may be further configured to: determine the maximum amount of data that a single computation unit can compute in a single operation; and divide the attention mechanism computation task into multiple unit computation tasks along the sequence length dimension based on the maximum amount of data.

[0084] It can be understood that, in order to fully utilize the computing power of the computing unit, the divided unit computation tasks should preferably occupy the full processing bandwidth of the computing unit, thereby reducing the number of operations and increasing the processing speed. The processing bandwidth can be predetermined based on the specific hardware configuration, such as 512B. In addition, since the sequence length dimension is usually larger than the batch size and head number dimensions, it is preferable to split the attention mechanism computation task along the sequence length dimension. It can also be understood that, depending on the specific situation, the task can also be split along one or more other dimensions.

[0085] In some embodiments, the sequence length of the attention mechanism computation task is seq_q, the partition size is tile_q, the batch size is B, the number of heads in the attention mechanism is h, and the maximum amount of data that a single computation unit can compute in a single run is Nmax, where tile_q×B×h≤Nmax. Preferably, tile_q×B×h=Nmax.

[0086] To manage the scheduling of attention mechanism computation tasks and confirm task completion, in some embodiments, the method may further include: determining the total number of attention mechanism computation tasks; and storing the total number in a shared cache. Accordingly, the artificial intelligence processor disclosed herein also includes a shared cache for storing the total number of attention mechanism computation tasks. Here, the shared cache refers to the cache space shared by the computing units in the artificial intelligence processor 201. In some embodiments, the shared cache may be located in a high-speed storage area within the artificial intelligence processor 201, such as a memory shared among multiple clusters. In other embodiments, the shared cache may also be located outside the artificial intelligence processor 201, for example on storage device 204 in Figure 2.

[0087] The total number of attention mechanism computation tasks can be described or characterized in various ways. In some embodiments, it can be directly characterized by the number of data points in the attention mechanism computation tasks. For example, for a multi-head attention mechanism computation task, its scale can be described by three dimensions: Batch Size, Head Num, and Sequence Length. The total number of such computation tasks can be expressed as Batch Size * Head Num * Sequence Length. In other embodiments, it can be characterized by the number of unit computation tasks into which the task is divided. For example, when the sequence length seq_q is divided according to the partition size tile_q of the sequence length dimension, ceil(seq_q / tile_q) unit computation tasks can be obtained. The total number of such computation tasks can be expressed as ceil(seq_q / tile_q), where ceil represents rounding up.
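The relationship between tile_q, B, h, Nmax, and the number of unit computation tasks can be made concrete with a small Python sketch. The function name `plan_tiles` is hypothetical, and capping tile_q at seq_q is an assumption added for short sequences; the sketch picks the largest tile_q that satisfies tile_q × B × h ≤ Nmax and then counts ceil(seq_q / tile_q) unit computation tasks.

```python
def plan_tiles(seq_q, batch, heads, n_max):
    """Return (tile_q, num_tasks) for a split along the sequence length dimension.

    tile_q is the largest value with tile_q * batch * heads <= n_max, capped at
    seq_q (the cap is an assumption of this sketch). num_tasks equals
    ceil(seq_q / tile_q), computed with integer arithmetic.
    """
    tile_q = min(seq_q, n_max // (batch * heads))
    if tile_q < 1:
        raise ValueError("n_max is too small to hold a single sequence position")
    num_tasks = -(-seq_q // tile_q)  # integer ceiling division
    return tile_q, num_tasks


if __name__ == "__main__":
    print(plan_tiles(seq_q=1000, batch=2, heads=8, n_max=512))
```

For example, with seq_q = 1000, B = 2, h = 8, and Nmax = 512, the sketch yields tile_q = 32 and 32 unit computation tasks, the last of which covers only the remaining 8 sequence positions.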

[0088] By counting the total number of attention mechanism computation tasks, and by updating this total number whenever a computing unit requests a unit computation task so that it tracks the unprocessed (remaining) unit computation tasks, it can be ensured that all unit computation tasks are executed.

[0089] In some embodiments, the step of requesting a unit computation task in step 802 may include: the computing unit initiating the request for a unit computation task through an atomic operation. Accordingly, for the artificial intelligence processor disclosed herein, the computing unit is further configured to initiate the request for a unit computation task through an atomic operation.

[0090] Understandably, the process of a computation unit requesting a unit computation task involves concurrent access issues. When multiple computation units simultaneously request a unit computation task, atomic operations can ensure data consistency and correctness. Atomic operations can accurately avoid mutual interference between different computation units, guaranteeing that the task is correctly obtained.

[0091] In some embodiments, the computing unit may employ a three-stage data pipeline, including at least load-compute-write, to execute a unit computation task. For example, referring to the two-layer three-stage pipeline described above in conjunction with Figure 6, depending on the specific storage architecture, a three-stage data pipeline, including at least load-compute-write, can be configured. In this case, the computing unit can request the next unit computation task not only after the current unit computation task has completely finished (e.g., written back to memory), but also immediately after the loading phase of the current unit computation task. This ensures seamless data pipeline continuity and further shortens the overall processing time.

[0092] In some embodiments, the step of the computing unit initiating the request for a unit computation task via an atomic operation includes: accessing the shared cache to determine whether there are any unexecuted unit computation tasks; requesting a unit computation task in response to the existence of an unexecuted unit computation task; and updating the total number based on the requested unit computation task. Accordingly, for the artificial intelligence processor disclosed herein, the computing unit is further configured to implement the above steps.

[0093] Understandably, in the initial stage of executing the attention mechanism computation task, the total number of unit computation tasks is recorded in the shared cache. After a computation unit successfully requests a unit computation task, the number of requested unit computation tasks can be subtracted from the total number, for example, by subtracting 1, thereby updating the total number. In some embodiments, in response to the total number of attention mechanism computation tasks being greater than 0, it can be determined that there are unexecuted unit computation tasks. When the total number is less than or equal to 0, it indicates that the attention mechanism computation task has been completed.

[0094] It can be understood that when multiple computing units perform attention mechanism computation tasks, the number of invalid elements differs between unit computation tasks because of masking operations. Some computing units complete their tasks quickly and sit idle in the later stages, while others run at full capacity throughout, resulting in a load imbalance problem. To address this problem, the adaptive scheduling scheme proposed in this embodiment first divides the attention mechanism computation task into unit computation tasks and obtains the total number of unit computation tasks, SUM. Then, a counter is set in the shared cache of the AI processor to record the number of remaining unprocessed unit computation tasks, and the counter is initialized to SUM. Next, each computing unit requests a unit computation task through an atomic operation and, after completing the requested unit computation task, requests another unit computation task through an atomic operation, thereby keeping each computing unit in a computational state and improving the utilization rate of the computing units. In this process, each time a computing unit requests a unit computation task through an atomic operation, it first checks whether the counter is greater than 0. If the counter is greater than 0, a unit computation task is available, so the computing unit requests one and the counter is decremented by 1; otherwise, all unit computation tasks have been executed and the attention mechanism computation task has ended.
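The counter-based request loop can be sketched in software as follows. This Python example is only an analogy under stated assumptions: threads stand in for computing units, a `threading.Lock` stands in for the hardware atomic operation, and the class `TaskCounter`, the function `run_units`, and the mapping from the counter value to a task index are all hypothetical.

```python
import threading


class TaskCounter:
    """Shared counter of remaining unit computation tasks, initialized to SUM."""

    def __init__(self, total):
        self._remaining = total
        self._lock = threading.Lock()  # stands in for an atomic operation

    def request(self):
        """Atomically claim one task; return its index, or None if none remain."""
        with self._lock:
            if self._remaining > 0:
                self._remaining -= 1
                return self._remaining
            return None


def run_units(num_units, total_tasks, execute):
    """Let each unit repeatedly request and execute tasks until none remain."""
    counter = TaskCounter(total_tasks)
    claimed = [[] for _ in range(num_units)]

    def worker(unit_id):
        while True:
            task = counter.request()
            if task is None:
                break
            execute(task)
            claimed[unit_id].append(task)

    threads = [threading.Thread(target=worker, args=(u,)) for u in range(num_units)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return claimed


if __name__ == "__main__":
    result = run_units(num_units=4, total_tasks=10, execute=lambda task: None)
    print(sorted(task for unit in result for task in unit))
```

Because the check and the decrement happen inside one critical section, each task index is handed out exactly once regardless of how the units interleave.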

[0095] As previously described, the computing units in the artificial intelligence processor of this disclosure embodiment can be configured as multi-stage computing pipelines; multiple computing units constitute multiple multi-stage computing pipelines. Multiple multi-stage pipelines execute their respective unit computing tasks in parallel.

[0096] Specifically, based on the exemplary architecture of the multi-stage arithmetic pipeline described above in conjunction with Figure 7, those skilled in the art will understand that multiple multi-stage arithmetic pipelines can process large numbers of operands in parallel for the same instruction.

[0097] In some embodiments, the steps of completing the requested unit computation task include multi-level operations. These multi-level operations include: calculating a raw attention score for the query data block and key data block of the requested unit computation task; performing a masking operation on the raw attention score to obtain a masked raw attention score; performing a softmax operation on the masked raw attention score to obtain a masked attention score; and calculating the output of the requested unit computation task based on the masked attention score and the value data block of the requested unit computation task.
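The four operations of a unit computation task can be sketched in Python as below. The function name `unit_task` and the list-of-rows data layout are hypothetical. One point is an assumption of this sketch rather than a statement of the disclosure: positions whose mask value is 0 are excluded from the softmax and from the final weighted sum, so that they contribute nothing to the output and require no further computation.

```python
import math


def unit_task(q_block, k_block, v_block, mask):
    """Compute the output of one unit computation task.

    q_block, k_block, v_block are lists of rows; mask[i][j] is 1 if query row i
    may attend to key row j, and 0 if that position is invalid.
    """
    width = len(v_block[0])
    outputs = []
    for q_row, mask_row in zip(q_block, mask):
        # Operation 1: raw attention scores (dot product of Q and K rows).
        raw = [sum(a * b for a, b in zip(q_row, k_row)) for k_row in k_block]
        # Operation 2: masking; only valid positions are kept (assumption).
        valid = [j for j, keep in enumerate(mask_row) if keep]
        if not valid:
            outputs.append([0.0] * width)
            continue
        # Operation 3: softmax over the valid positions only.
        peak = max(raw[j] for j in valid)
        exps = {j: math.exp(raw[j] - peak) for j in valid}
        total = sum(exps.values())
        # Operation 4: weighted sum of the value rows.
        outputs.append([sum(exps[j] / total * v_block[j][c] for j in valid)
                        for c in range(width)])
    return outputs


if __name__ == "__main__":
    eye = [[1.0, 0.0], [0.0, 1.0]]
    print(unit_task(eye, eye, [[1.0, 2.0], [3.0, 4.0]], [[1, 0], [1, 1]]))
```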

[0098] Accordingly, for the artificial intelligence processor disclosed in this embodiment, each multi-stage computational pipeline can be configured to complete the requested unit computation task according to the steps described above. For example, the above computation can use multiple first multipliers (first stage) and at least one addition tree (second stage) included in the multi-stage pipelined computation circuit structure to calculate the raw attention score (i.e., the dot product of Q and K), use multiple second multipliers (third stage) to perform the masking operation (i.e., the element-wise multiplication of the raw attention score with the mask template), use a nonlinear arithmetic unit (fourth stage) to perform the softmax operation, and use multiple third multipliers (fifth stage) to obtain the output of the unit computation task (i.e., the multiplication of the masked attention score with V). It can also be understood that some pipeline stages can be further subdivided. For example, the pipeline stage performing the softmax operation can further include several sub-stages, such as comparators (for finding the maximum value), exponentiation operations, etc.

[0099] As mentioned earlier, at least some data in the dot product of Q and K becomes invalid due to the masking operation, rendering it useless for subsequent calculations. In this case, unused one or more stages of the pipelined computation circuit can be bypassed. This means that one or more stages of the pipelined computation circuit can be selectively used as needed, without requiring the computation to go through all stages. For example, for invalid data after the masking operation, subsequent pipeline stages can be bypassed, thus saving the invalid computation.

[0100] Since invalid computations can be omitted, in some embodiments, the multi-level operations described above further include: in response to differing numbers of invalid elements in the masked attention scores, the multiple multi-stage computation pipelines corresponding to the multiple computing units take different amounts of time to complete their respective unit computation tasks.

[0101] Figure 11 illustrates the impact of masking operations on the execution time of the computation units, to better understand the technical effects of the embodiments disclosed herein. As shown, the blocks in the figure represent the data to be processed, which is divided into four equal parts along the sequence length dimension and processed in parallel by two computation units, each performing attention mechanism operations on the corresponding data. Due to the masking operation, some data in the dot product of Q and K becomes invalid values; the shaded upper triangle in the figure represents these invalid values. The number of invalid values in each of the divided data parts is different, thus resulting in different computational loads for each computation unit and different times required to complete the data block computation.
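The uneven load caused by a triangular mask can be quantified with a short Python sketch. It assumes, for illustration only, a square score matrix in which every element strictly above the diagonal is invalid and a sequence length that divides evenly into blocks of rows; the function name `valid_per_block` is hypothetical.

```python
def valid_per_block(seq_len, num_blocks):
    """Count the valid (unmasked) score elements in each block of rows.

    Row i of the score matrix has i + 1 valid elements because columns after
    position i are masked (upper triangle invalid, an assumption of the sketch).
    """
    if seq_len % num_blocks != 0:
        raise ValueError("seq_len must be divisible by num_blocks")
    rows_per_block = seq_len // num_blocks
    counts = []
    for block in range(num_blocks):
        start = block * rows_per_block
        counts.append(sum(row + 1 for row in range(start, start + rows_per_block)))
    return counts


if __name__ == "__main__":
    print(valid_per_block(seq_len=8, num_blocks=4))
```

For a sequence length of 8 split into four blocks, the blocks contain 3, 7, 11, and 15 valid elements respectively, so equally sized blocks carry clearly unequal amounts of useful work.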

[0102] If, according to the existing allocation scheme, data block 1 is allocated to processing unit 1 and data block 2 is allocated to processing unit 2, processing unit 1 will typically finish executing faster than processing unit 2. However, processing unit 1 can only be allocated the next data block for processing after processing unit 2 has finished its operation. Therefore, although the computational tasks are attempted to be evenly distributed across each processing unit, this actually results in an uneven computational load among the processing units, leading to some computing power being idle.

[0103] In contrast, according to the adaptive scheduling scheme of this disclosure embodiment, data block 1 is allocated to computation unit 1, and data block 2 is allocated to computation unit 2. Similarly, computation unit 1 will complete its execution faster than computation unit 2. However, at this time, computation unit 1 can immediately request the next data block to be processed, such as data block 3, without waiting for computation unit 2 to complete its processing. In this way, after each computation unit finishes executing its current unit computation task, it can autonomously request the remaining unit computation tasks. Before the overall attention mechanism computation task is completed, all computation units are in a computational state. In other words, the multiple multi-level computation pipelines are always in a "flowing" state and will not be idle, thereby improving the execution speed of the attention mechanism computation task.
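The difference between the two allocation schemes can be illustrated with a toy timing model in Python. The model and its numbers are hypothetical: each data block has a fixed cost in arbitrary time units, `lockstep_makespan` models a scheme in which the next round of blocks is handed out only after every unit has finished the current round, and `adaptive_makespan` models each unit requesting the next block as soon as it becomes free. Scheduling overhead, such as the cost of the atomic request, is ignored.

```python
import heapq


def lockstep_makespan(costs, num_units):
    """Total time when each round waits for the slowest unit in that round."""
    total = 0
    for start in range(0, len(costs), num_units):
        total += max(costs[start:start + num_units])
    return total


def adaptive_makespan(costs, num_units):
    """Total time when the earliest-free unit always takes the next block."""
    free_at = [0] * num_units  # min-heap of the times at which units become free
    heapq.heapify(free_at)
    for cost in costs:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + cost)
    return max(free_at)


if __name__ == "__main__":
    costs = [2, 9, 4, 3, 8, 1]  # hypothetical per-block costs
    print(lockstep_makespan(costs, 2), adaptive_makespan(costs, 2))
```

With these hypothetical costs and two units, the lockstep model needs 21 time units while the adaptive model needs 17, because the unit that finishes its short block early picks up further blocks instead of waiting.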

[0104] Although the adaptive scheduling scheme of the embodiments of this disclosure has been described above in conjunction with attention mechanism computation tasks, those skilled in the art will understand that the scheme can also be applied to other scenarios, such as when load balancing is required for complex algorithms implemented on artificial intelligence processors, thereby avoiding the need to design complex load balancing algorithms.

[0105] Based on the foregoing description, those skilled in the art will understand that the artificial intelligence processor of the embodiments disclosed herein can be implemented in a single chip. Therefore, this disclosure also discloses a chip that includes the artificial intelligence processor as described in any of the foregoing embodiments. Additionally, this disclosure also discloses a circuit board that includes the aforementioned chip.

[0106] Depending on the application scenario, the electronic devices or apparatuses disclosed herein may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablets, smart terminals, PC devices, IoT terminals, mobile terminals, mobile phones, dashcams, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, means of transportation, home appliances, and/or medical devices. The means of transportation include airplanes, ships, and/or vehicles; the home appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, lights, gas stoves, and range hoods; the medical devices include MRI scanners, ultrasound machines, and/or electrocardiographs. The electronic devices or apparatuses disclosed herein can also be applied in fields such as the Internet, IoT, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and healthcare. Furthermore, the electronic devices or apparatuses disclosed herein can also be used in cloud, edge, and terminal application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, electronic devices or apparatuses with high computing power according to the present disclosure can be applied to cloud devices (e.g., cloud servers), while electronic devices or apparatuses with low power consumption can be applied to terminal devices and/or edge devices (e.g., smartphones or cameras). In one or more embodiments, the hardware information of the cloud devices and the hardware information of the terminal devices and/or edge devices are compatible with each other, so that, based on the hardware information of the terminal devices and/or edge devices, suitable hardware resources can be matched from the hardware resources of the cloud devices to simulate the hardware resources of the terminal devices and/or edge devices, thereby achieving unified management, scheduling, and collaborative work for end-to-cloud or cloud-edge-end integration.

[0107] It should be noted that, for the sake of brevity, this disclosure describes some methods and their embodiments as a series of actions and combinations thereof. However, those skilled in the art will understand that the solutions disclosed herein are not limited by the order of the described actions. Therefore, based on the disclosure or teachings of this document, those skilled in the art will understand that some steps can be performed in a different order or simultaneously. Furthermore, those skilled in the art will understand that the embodiments described in this disclosure can be considered optional embodiments, that is, the actions or modules involved are not necessarily essential for the implementation of one or more solutions disclosed herein. In addition, depending on the solution, the descriptions of different embodiments in this disclosure may have different emphases. In view of this, those skilled in the art will understand that, for parts not described in detail in a certain embodiment of this disclosure, reference may be made to the relevant descriptions of other embodiments.

[0108] In terms of specific implementation, based on the disclosure and teachings of this document, those skilled in the art will understand that several embodiments disclosed herein can also be implemented in other ways not disclosed herein. For example, regarding the various units in the electronic device or apparatus embodiments described above, this document divides them based on logical functions, but in actual implementation, there may be other division methods. As another example, multiple units or components can be combined or integrated into another system, or some features or functions in a unit or component can be selectively disabled. Regarding the connection relationships between different units or components, the connections discussed above in conjunction with the accompanying drawings can be direct or indirect couplings between units or components. In some scenarios, the aforementioned direct or indirect couplings involve communication connections utilizing interfaces, where the communication interface can support electrical, optical, acoustic, magnetic, or other forms of signal transmission.

[0109] In this disclosure, the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units. The aforementioned components or units may be located in the same location or distributed across multiple network units. Furthermore, depending on actual needs, some or all of the units can be selected to achieve the purpose of the solution described in the embodiments of this disclosure. Additionally, in some scenarios, multiple units in the embodiments of this disclosure may be integrated into one unit or each unit may exist physically independently.

[0110] In some implementation scenarios, the integrated unit described above can be implemented as a software program module. If implemented as a software program module and sold or used as an independent product, the integrated unit can be stored in a computer-readable memory. Accordingly, when the disclosed solution is embodied in a software product (e.g., a computer-readable storage medium), the software product can be stored in a memory and may include several instructions that cause a computer device (e.g., a personal computer, server, or network device) to execute some or all of the steps of the method described in the embodiments of this disclosure. The aforementioned memory may include, but is not limited to, various media capable of storing program code, such as USB drives, flash disks, read-only memory (ROM), random access memory (RAM), portable hard drives, magnetic disks, or optical disks.

[0111] In other implementation scenarios, the integrated units described above can also be implemented in hardware, i.e., as specific hardware circuits, which may include digital circuits and / or analog circuits. The physical implementation of the circuit's hardware structure may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors. Therefore, the various devices described herein (e.g., artificial intelligence processors or other processing devices) can be implemented using appropriate hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Furthermore, the aforementioned storage units or storage devices can be any suitable storage medium (including magnetic storage media or magneto-optical storage media), such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high-bandwidth memory (HBM), hybrid memory cube (HMC), ROM, and RAM.

[0112] While numerous embodiments of this disclosure have been shown and described herein, it will be apparent to those skilled in the art that such embodiments are provided by way of example only. Many modifications, alterations, and alternatives will occur to those skilled in the art without departing from the spirit and intent of this disclosure. It should be understood that various alternatives to the embodiments of this disclosure described herein may be employed in the practice of this disclosure. The appended claims are intended to define the scope of this disclosure and therefore cover equivalents or alternatives within the scope of these claims.

Claims

1. A method for executing an attention mechanism computation task using an artificial intelligence processor, comprising: dividing the attention mechanism computation task into a plurality of unit computation tasks; and cyclically performing, by each computation unit, the following steps until the attention mechanism computation task is completed: requesting a unit computation task; and completing the requested unit computation task; wherein the artificial intelligence processor includes at least two of the computation units.

2. The method of claim 1, wherein dividing the attention mechanism computation task into a plurality of unit computation tasks comprises: determining the maximum amount of data that a single computation unit can compute in a single operation; and dividing, based on the maximum amount of data, the attention mechanism computation task into the plurality of unit computation tasks along the sequence length dimension.

3. The method of claim 2, wherein each computation unit in the artificial intelligence processor is configured as a multi-stage computation pipeline; the plurality of computation units constitute a plurality of multi-stage computation pipelines; and the plurality of multi-stage computation pipelines execute their respective unit computation tasks in parallel.

4. The method of claim 3, wherein completing the requested unit computation task comprises multi-stage operations, the multi-stage operations comprising: calculating a raw attention score from the Query data block and the Key data block of the requested unit computation task; masking the raw attention score to obtain a masked raw attention score; performing a Softmax operation on the masked raw attention score to obtain a masked attention score; and calculating the output of the requested unit computation task based on the masked attention score and the Value data block of the requested unit computation task.
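The four stages recited in claim 4 can be sketched for a single unit computation task as follows. This is a minimal illustration under stated assumptions: the mask is causal, the raw score is scaled by the square root of the head dimension (a common convention that the claim does not recite), and the function name `masked_attention_tile` and the parameter `row_offset` are hypothetical.

```python
import math


def masked_attention_tile(q, k, v, row_offset=0):
    """Process one Query tile against the Key and Value blocks.

    q: tile_q x d, k: seq_k x d, v: seq_k x d_v (lists of lists).
    row_offset: position of the tile's first query row in the full
    sequence (used only by the assumed causal mask).
    """
    d = len(q[0])
    out = []
    for i, q_row in enumerate(q):
        # Stage 1: raw attention score (scaled dot product of Q and K).
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d)
                  for k_row in k]
        # Stage 2: masking; keys after the query position are invalid.
        limit = row_offset + i
        scores = [s if j <= limit else -math.inf
                  for j, s in enumerate(scores)]
        # Stage 3: Softmax over the masked scores (max-shifted for stability).
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        denom = sum(exps)
        weights = [e / denom for e in exps]
        # Stage 4: output as the weighted sum of the Value rows.
        out.append([sum(w * v_row[c] for w, v_row in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
print(masked_attention_tile(q, k, v, row_offset=0))  # [[1.0, 2.0]]
```

With `row_offset=0` the single query row may only attend to the first key, so the output equals the first Value row. Tiles with a larger `row_offset` contain fewer masked positions.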

5. The method of claim 4, wherein the multi-stage operations further comprise: in response to differing numbers of invalid elements in the masked attention score, the plurality of multi-stage computation pipelines taking different amounts of time to complete their respective unit computation tasks.

6. The method of any one of claims 2-5, wherein the sequence length of the attention mechanism computation task is seq_q, the partition size is tile_q, the batch size is B, the number of heads of the attention mechanism is h, and the maximum amount of data that a single computation unit can compute in a single operation is Nmax, where tile_q × B × h ≤ Nmax.
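The constraint tile_q × B × h ≤ Nmax can be turned into a partitioning rule as sketched below. The claim states only the inequality; choosing the largest tile_q that satisfies it, and allowing a shorter final tile, are assumptions made for this example, as are the function names and the numeric values.

```python
def max_tile_q(batch: int, heads: int, n_max: int) -> int:
    """Largest tile_q with tile_q * batch * heads <= n_max."""
    tile_q = n_max // (batch * heads)
    if tile_q < 1:
        raise ValueError("n_max is too small for even one row per tile")
    return tile_q


def num_unit_tasks(seq_q: int, tile_q: int) -> int:
    """Number of unit computation tasks along the sequence length
    dimension (ceiling division; the last tile may be shorter)."""
    return -(-seq_q // tile_q)


# Hypothetical values: B = 2, h = 8, Nmax = 1024, seq_q = 1000.
tile_q = max_tile_q(batch=2, heads=8, n_max=1024)
print(tile_q)                        # 64
print(num_unit_tasks(1000, tile_q))  # 16
```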

7. The method of any one of claims 1-6, further comprising: determining the total number of unit computation tasks of the attention mechanism computation task; and storing the total number in a shared cache.

8. The method of claim 7, wherein requesting a unit computation task comprises: initiating, by the computation unit, the request for the unit computation task through an atomic operation.

9. The method of claim 8, wherein the computation unit employs at least a three-stage load-compute-store data pipeline to execute the unit computation task, and initiating, by the computation unit, the request for the unit computation task through an atomic operation comprises: initiating the request in response to completion of the load stage of the current unit computation task.
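The timing recited in claim 9 can be sketched as an event trace. This is a sequential illustration only: real load, compute, and store stages would overlap in hardware, and the names `run_pipeline`, `request`, and `log` are hypothetical.

```python
def run_pipeline(request, log):
    """Execute unit tasks with a load-compute-store pipeline, issuing the
    request for the next task once the current task's load has completed."""
    task = request()
    while task is not None:
        log.append(("load", task))
        # The next request is issued here, before compute and store,
        # so the next task is known by the time the current one ends.
        next_task = request()
        log.append(("request_next", task))
        log.append(("compute", task))
        log.append(("store", task))
        task = next_task


pending = iter(range(2))               # two unit computation tasks
log = []
run_pipeline(lambda: next(pending, None), log)
print(log[:4])
# [('load', 0), ('request_next', 0), ('compute', 0), ('store', 0)]
```

Issuing the request right after the load stage means the unit does not have to wait for a task assignment after storing its current result.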

10. The method of any one of claims 8-9, wherein initiating, by the computation unit, the request for the unit computation task through an atomic operation comprises: accessing the shared cache to determine whether any unexecuted unit computation tasks exist; in response to unexecuted unit computation tasks existing, requesting a unit computation task; and updating the total number based on the requested unit computation task.

11. The method of claim 10, wherein determining whether any unexecuted unit computation tasks exist comprises: in response to the total number of unit computation tasks of the attention mechanism computation task being greater than 0, determining that unexecuted unit computation tasks exist.
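The request mechanism of claims 7 to 11 can be sketched with threads standing in for computation units. This is an illustration under assumptions, not the disclosed hardware implementation: a `threading.Lock` emulates the atomic operation, an ordinary Python object stands in for the shared cache, and the names `SharedTaskCounter` and `run_unit` are hypothetical.

```python
import threading


class SharedTaskCounter:
    """Stand-in for the total number stored in the shared cache."""

    def __init__(self, total):
        self._total = total
        self._remaining = total
        self._lock = threading.Lock()   # emulates the atomic operation

    def request(self):
        """Return the index of the next unit task, or None if none remain."""
        with self._lock:
            if self._remaining > 0:     # a count above 0 means work remains
                self._remaining -= 1    # update the total for the claimed task
                return self._total - self._remaining - 1
            return None


def run_unit(unit_id, counter, done):
    """Loop of one computation unit: request, execute, repeat."""
    while True:
        task = counter.request()
        if task is None:
            break
        done.append((unit_id, task))    # placeholder for executing the task


counter = SharedTaskCounter(total=8)
done = []
units = [threading.Thread(target=run_unit, args=(u, counter, done))
         for u in range(2)]
for t in units:
    t.start()
for t in units:
    t.join()
print(sorted(task for _, task in done))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Because the check and the decrement happen inside one atomic step, each unit task is claimed exactly once regardless of which unit asks first. Which unit ends up with which task varies from run to run.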

12. An artificial intelligence processor, comprising: a control unit configured to divide an attention mechanism computation task into a plurality of unit computation tasks; and computation units, each configured to cyclically perform the following operations until the attention mechanism computation task is completed: requesting a unit computation task; and completing the requested unit computation task; wherein the artificial intelligence processor includes at least two of the computation units.

13. The artificial intelligence processor of claim 12, wherein the control unit is further configured to: determine the maximum amount of data that a single computation unit can compute in a single operation; and divide, based on the maximum amount of data, the attention mechanism computation task into the plurality of unit computation tasks along the sequence length dimension.

14. The artificial intelligence processor of claim 13, wherein each computation unit is configured as a multi-stage computation pipeline; the plurality of computation units constitute a plurality of multi-stage computation pipelines; and the plurality of multi-stage computation pipelines execute their respective unit computation tasks in parallel.

15. The artificial intelligence processor of claim 14, wherein each multi-stage computation pipeline of the artificial intelligence processor is configured to complete the requested unit computation task by: calculating a raw attention score from the Query data block and the Key data block of the requested unit computation task; masking the raw attention score to obtain a masked raw attention score; performing a Softmax operation on the masked raw attention score to obtain a masked attention score; and calculating the output of the requested unit computation task based on the masked attention score and the Value data block of the requested unit computation task.

16. The artificial intelligence processor of claim 15, wherein, in response to differing numbers of invalid elements in the masked attention score, the plurality of multi-stage computation pipelines take different amounts of time to complete their respective unit computation tasks.

17. The artificial intelligence processor of any one of claims 13-16, wherein the sequence length of the attention mechanism computation task is seq_q, the partition size is tile_q, the batch size is B, the number of heads of the attention mechanism is h, and the maximum amount of data that a single computation unit can compute in a single operation is Nmax, where tile_q × B × h ≤ Nmax.

18. The artificial intelligence processor of any one of claims 12-17, further comprising: a shared cache configured to store the total number of unit computation tasks of the attention mechanism computation task.

19. The artificial intelligence processor of claim 18, wherein the computation unit is further configured to: initiate the request for the unit computation task through an atomic operation.

20. The artificial intelligence processor of claim 19, wherein the computation unit employs at least a three-stage load-compute-store data pipeline to execute the unit computation task, and the computation unit is further configured to: initiate the request in response to completion of the load stage of the current unit computation task.

21. The artificial intelligence processor of any one of claims 18-19, wherein the computation unit is further configured to: access the shared cache to determine whether any unexecuted unit computation tasks exist; in response to unexecuted unit computation tasks existing, request a unit computation task; and update the total number based on the requested unit computation task.

22. A chip configured to include an artificial intelligence processor as described in any one of claims 12-21.

23. A circuit board comprising the chip according to claim 22.