Task scheduling method based on multi-core system, task scheduling apparatus, and related product

By optimizing task scheduling in a multi-core system, the problems of low computational resource utilization and I/O bottleneck in group matrix multiplication tasks are solved, achieving efficient parallel processing and improving the computational performance of the multi-core system.

WO2026124153A1 · PCT designated stage · Publication Date: 2026-06-18 · CAMBRICON (KUNSHAN) INFORMATION TECHNOLOGY CO LTD


Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: CAMBRICON (KUNSHAN) INFORMATION TECHNOLOGY CO LTD
Filing Date: 2025-11-19
Publication Date: 2026-06-18



Abstract

Disclosed in the present disclosure are a task scheduling method based on a multi-core system, a task scheduling apparatus, and a related product. The task scheduling apparatus may be comprised in a combined processing apparatus as a processing apparatus, and the combined processing apparatus may further comprise an interface apparatus and a computing apparatus. The computing apparatus and the processing apparatus interact with each other and jointly complete a computing operation specified by a user. The combined processing apparatus may further comprise a storage apparatus, which is respectively connected to the computing apparatus and the processing apparatus, and is used for storing data of the computing apparatus and data of the processing apparatus. The solution of the present disclosure provides a method for scheduling group matrix multiplication calculation tasks on a multi-core system, which method can effectively reduce data memory access and improve the processing efficiency.


Description

Task scheduling method, task scheduling apparatus, and related products based on a multi-core system

Cross-references to related applications

[0001] This application claims priority to Chinese patent application filed on December 9, 2024, with application number 202411803713.1 and entitled "Task scheduling method, task scheduling device and related products based on multi-core system".

[0002] This application claims priority to Chinese patent application filed on December 9, 2024, with application number 202411804499.1 and entitled "Method, apparatus and related products for performing computational tasks on a multi-core system".

Technical Field

[0003] This disclosure generally relates to the field of task scheduling. More specifically, this disclosure relates to a task scheduling method, task scheduling apparatus, computer-readable storage medium, computer program product, processing device, chip, and board based on a multi-core system.

Background Technology

[0004] Mixture of Experts (MoE) is a machine learning architecture that divides an artificial intelligence model into different sub-networks (or "experts"), each focusing on a subset of the input data to collectively accomplish a task. This architecture significantly reduces computational costs during pre-training and inference, even for large models with billions of parameters, by selectively activating the experts needed for specific tasks, rather than activating the entire neural network for every task, thus improving efficiency.

[0005] Group General Matrix Multiplication (GEMM) is a core operator in MoE, handling batch matrix multiplication problems. However, unlike ordinary batch matrix multiplication, the matrix dimension of each expert in Group GEMM is variable. This design gives Group GEMM flexibility when handling different experts, but it also introduces computational challenges. The partitioning of computational tasks among multi-core processors has a significant impact on the performance of Group GEMM. Good task partitioning allows multiple cores to collaborate, greatly improving the performance of Group GEMM.
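As a minimal illustration of what distinguishes Group GEMM from ordinary batch matrix multiplication, the following sketch runs one independent matrix product per group, where each group may have a different shape. The function names `matmul` and `group_gemm` and the two example "experts" are hypothetical and are not taken from this disclosure; the sketch is a plain reference computation, not an optimized kernel.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain (M x K) @ (K x N) product, written out for clarity."""
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def group_gemm(groups: List[Tuple[Matrix, Matrix]]) -> List[Matrix]:
    """Run one independent GEMM per group; shapes may differ between groups."""
    return [matmul(a, b) for a, b in groups]


# Hypothetical example: two "experts" that receive 1 and 3 input rows.
expert0 = ([[1.0, 2.0]], [[1.0, 0.0], [0.0, 1.0]])                 # M=1, K=2, N=2
expert1 = ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [[2.0], [3.0]])   # M=3, K=2, N=1
outputs = group_gemm([expert0, expert1])
```

Because the M dimension differs from group to group, the amount of work per group is uneven, which is why the partitioning of the task across cores matters.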

[0006] In view of this, there is an urgent need to provide a solution for task scheduling based on multi-core systems to support efficient parallel processing of computational tasks such as Group GEMM.

Summary of the Invention

[0007] In order to at least address one or more of the technical problems mentioned above, this disclosure proposes a task scheduling scheme based on multi-core systems in several aspects.

[0008] In a first aspect, this disclosure provides a task scheduling method based on a multi-core system. The task includes a group matrix multiplication operation task, which comprises multiple sets of matrix multiplications, each set having a variable matrix size. The multi-core system includes multiple computational units, each computational unit including one or more processor cores, each processor core having local memory. The method includes: acquiring layout information of the group matrix multiplication operation task; determining, based on the layout information, a splitting block size block_k in a K-dimensional space and a splitting strategy within a single computational unit, where K is the column dimension of the left-multiplication matrix and the row dimension of the right-multiplication matrix; determining, based on block_k, a splitting block size block_n in an N-dimensional space and a splitting block size block_m in an M-dimensional space to satisfy an optimization objective; where M is the row dimension of the left-multiplication matrix and N is the column dimension of the right-multiplication matrix, and the optimization objective includes any one of the following: memory access, processor core utilization; and scheduling the group matrix multiplication operation task to execute in parallel on the computational units in the multi-core system according to the determined splitting sizes block_k, block_n, and block_m.
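To make the effect of the splitting sizes on memory access concrete, the following sketch estimates how many input elements are read for a given block_m and block_n. It uses a generic textbook model and is not the cost function of this disclosure: it assumes that each output tile reads its full row panel of the left matrix and its full column panel of the right matrix once, with no reuse across tiles. Under that assumption block_k only changes how each panel is streamed, so it does not appear in the total. The function names and the example shapes are hypothetical.

```python
from typing import Iterable, Tuple


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def input_elements_loaded(m: int, k: int, n: int, block_m: int, block_n: int) -> int:
    """Input elements read for one (M x K) @ (K x N) product under the assumed model."""
    tiles_m = ceil_div(m, block_m)   # rows of output tiles
    tiles_n = ceil_div(n, block_n)   # columns of output tiles
    # The left matrix is re-read once per tile column,
    # the right matrix once per tile row.
    return m * k * tiles_n + k * n * tiles_m


def group_traffic(shapes: Iterable[Tuple[int, int, int]], block_m: int, block_n: int) -> int:
    """Sum the estimate over all (M, K, N) groups of a group matrix multiplication."""
    return sum(input_elements_loaded(m, k, n, block_m, block_n) for m, k, n in shapes)


# Hypothetical group with two matrix multiplications of different M.
shapes = [(48, 64, 128), (200, 64, 128)]
small_blocks = group_traffic(shapes, block_m=16, block_n=16)
large_blocks = group_traffic(shapes, block_m=64, block_n=64)
```

In this model larger output blocks reduce the estimated traffic, but in practice the block sizes are bounded by the local memory of each processor core, which is one reason the splitting sizes have to be chosen jointly.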

[0009] In a second aspect, this disclosure provides a task scheduling apparatus based on a multi-core system, including a processor and a memory, wherein: the processor is configured to execute program instructions; the memory is configured to store the program instructions; and when the program instructions are loaded and executed by the processor, the processor performs the task scheduling method based on a multi-core system according to the first aspect.

[0010] In a third aspect, this disclosure provides a computer-readable storage medium storing program instructions that, when loaded and executed by a processor, cause the processor to perform the task scheduling method based on a multi-core system as described in the first aspect.

[0011] In a fourth aspect, this disclosure provides a computer program product, including a computer program or instructions that, when executed by a processor, implement the task scheduling method based on a multi-core system as described in the first aspect.

[0012] In a fifth aspect, this disclosure provides a processing apparatus including a task scheduling apparatus based on a multi-core system as described in the second aspect.

[0013] In a sixth aspect, this disclosure provides a chip including the processing apparatus described in the fifth aspect.

[0014] In a seventh aspect, this disclosure provides a board including the chip described in the sixth aspect.

[0015] Through the task scheduling method, task scheduling apparatus, and related products based on a multi-core system provided above, the embodiments of this disclosure provide an optimized task scheduling scheme for performing group matrix multiplication operations on multi-core systems. The scheme can effectively reduce frequent data exchange and loading, avoid input/output (I/O) bottlenecks, and make full use of the parallel operation characteristics of the multi-core architecture to improve processing efficiency.

Attached Figure Description

[0016] The above and other objects, features, and advantages of exemplary embodiments of this disclosure will become readily apparent upon reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of this disclosure are illustrated by way of example and not limitation, and like or corresponding reference numerals denote like or corresponding parts, wherein:

[0017] Figure 1 shows an exemplary structural diagram of a board in some embodiments of this disclosure;

[0018] Figure 2 shows an exemplary structural diagram of the combined processing apparatus in some embodiments of this disclosure;

[0019] Figure 3 illustrates an exemplary internal structure diagram of a computing device in some embodiments of this disclosure;

[0020] Figure 4 shows an exemplary structural diagram of a processor core in some embodiments of this disclosure;

[0021] Figure 5 illustrates an exemplary schematic diagram of a processor core writing data to a processor core of another cluster in some embodiments of this disclosure;

[0022] Figure 6 shows an exemplary schematic diagram of a two-layer, three-stage pipeline in some embodiments of this disclosure;

[0023] Figure 7 illustrates a schematic diagram of the storage (memory) architecture of a multi-core system to which embodiments of this disclosure can be applied;

[0024] Figure 8 shows an exemplary flowchart of a task scheduling method based on a multi-core system according to an embodiment of this disclosure;

[0025] Figure 9 illustrates a grid-block abstraction of a matrix multiplication task according to some embodiments of this disclosure;

[0026] Figure 10 illustrates the task block processing method for a group matrix multiplication task according to some embodiments of this disclosure;

[0027] Figure 11 shows a logical flowchart of the task block cyclic execution process according to some embodiments of this disclosure;

[0028] Figure 12 schematically illustrates the storage diagram of data at different storage levels;

[0029] Figure 13 illustrates a four-stage pipelined process for a matrix multiplication task according to some embodiments of this disclosure;

[0030] Figure 14 illustrates a schematic diagram of the output matrix blocks being rearranged according to some embodiments of this disclosure;

[0031] Figure 15 illustrates a schematic diagram of the output matrix block order rearrangement according to other embodiments of this disclosure;

[0032] Figure 16 schematically illustrates a situation of unbalanced load;

[0033] Figure 17 shows a schematic logic flowchart of adaptive scheduling for performing group matrix multiplication tasks according to some embodiments of this disclosure;

[0034] Figure 18 shows a flowchart of a method for performing computational tasks on a multi-core system according to an embodiment of this disclosure.

Detailed Implementation

[0035] The technical solutions in the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this disclosure, not all of them. Based on the embodiments in this disclosure, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this disclosure.

[0036] It should be understood that the terms “comprising” and “including” used in this disclosure and claims indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.

[0037] It should also be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of this disclosure. As used in this disclosure and claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” as used in this disclosure and claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes such combinations.

[0038] As used in this specification and claims, the term "if" may be interpreted, depending on the context, as "when," "once," "in response to determination," or "in response to detection." Similarly, the phrase "if determined" or "if [described condition or event] is detected" may be interpreted, depending on the context, as "once determined," "in response to determination," "once [described condition or event] is detected," or "in response to detection of [described condition or event]."

[0039] The specific embodiments disclosed herein will now be described in detail with reference to the accompanying drawings.

Exemplary hardware environment

[0040] Figure 1 shows a schematic diagram of the structure of a board 10 according to an embodiment of this disclosure. As shown in Figure 1, the board 10 includes a chip 101, which is a system-on-a-chip (SoC) that integrates one or more combined processing devices. The combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms, meeting the intelligent processing needs of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. In particular, deep learning technology is widely used in the field of cloud intelligence. A significant characteristic of cloud intelligence applications is the large amount of input data, which places high demands on the platform's storage and computing capabilities. The board 10 of this embodiment is suitable for cloud intelligence applications, possessing massive off-chip storage, on-chip storage, and abundant computing power.

[0041] Chip 101 is connected to external device 103 via external interface device 102. External device 103 may be, for example, a server, computer, camera, monitor, mouse, keyboard, network card, or Wi-Fi interface. Data to be processed can be transmitted from external device 103 to chip 101 via external interface device 102. The calculation results from chip 101 can be transmitted back to external device 103 via external interface device 102. Depending on the application scenario, external interface device 102 may have different interface forms, such as a PCIe interface.

[0042] The board 10 also includes a storage device 104 for storing data, which includes one or more storage units 105. The storage device 104 is connected to the controller 106 and the chip 101 via a bus and exchanges data with them. The controller 106 in the board 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the controller 106 may include a microcontroller (MCU).

[0043] Figure 2 is a structural diagram illustrating the combined processing device in chip 101 of this embodiment. As shown in Figure 2, the combined processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a storage device (DRAM) 204.

[0044] The computing device 201 is configured to perform user-specified operations. It is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor to perform deep learning or machine learning calculations. It can interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.

[0045] Interface device 202 is used to transmit data and control commands between computing device 201 and processing device 203. For example, computing device 201 can obtain input data from processing device 203 via interface device 202 and write it to the on-chip storage device of computing device 201. Further, computing device 201 can obtain control commands from processing device 203 via interface device 202 and write them to the on-chip control cache of computing device 201. Alternatively, interface device 202 can also read data from the storage device of computing device 201 and transmit it to processing device 203.

[0046] Processing device 203, as a general-purpose processing device, performs basic control including but not limited to data transfer, and starting and / or stopping computing device 201. Depending on the implementation, processing device 203 may be one or more types of processors, including but not limited to digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs. As mentioned above, computing device 201 disclosed herein can be considered as having a single-core structure or a homogeneous multi-core structure. However, when computing device 201 and processing device 203 are considered together, they are considered to form a heterogeneous multi-core structure.

[0047] Storage device 204 is an off-chip memory used to store data to be processed. It may be DRAM, for example DDR memory, typically 16 GB or larger in size, and stores data of computing device 201 and/or processing device 203.

[0048] Figure 3 shows a schematic diagram of the internal structure of computing device 201. Computing device 201 is used to process input data for computer vision, speech, natural language processing, data mining, etc. The computing device 201 in the figure adopts a multi-core hierarchical architecture design. As a system-on-a-chip, computing device 201 includes multiple clusters, and each cluster includes multiple processor cores. In other words, computing device 201 is constructed in a hierarchical structure of system-on-a-chip, clusters, and processor cores.

[0049] From the perspective of the system-on-a-chip hierarchy, as shown in Figure 3, the computing device 201 includes an external storage controller 301, a peripheral communication module 302, an on-chip interconnect module 303, a synchronization module 304, and multiple clusters 305.

[0050] There can be multiple external storage controllers 301; two are shown exemplarily in the figure. These controllers respond to access requests from the processor core to access external storage devices, such as DRAM 204 in Figure 2, thereby reading data from or writing data to external storage. The peripheral communication module 302 receives control signals from the processing device 203 via the interface device 202, initiating the computing device 201 to execute tasks. The on-chip interconnect module 303 connects the external storage controllers 301, the peripheral communication module 302, and multiple clusters 305 to transmit data and control signals between the modules. The synchronization module 304 is a global barrier controller (GBC) used to coordinate the working progress of each cluster and ensure information synchronization. The multiple clusters 305 are the computing core of the computing device 201; four are shown exemplarily in the figure. With hardware development, the computing device 201 disclosed herein may also include eight, sixteen, sixty-four, or even more clusters 305. The clusters 305 are used to efficiently execute deep learning algorithms.

[0051] In terms of cluster hierarchy, as shown in Figure 3, each cluster 305 includes multiple processor cores (IPU cores) 306 and one storage core (MEM core) 307.

[0052] Four processor cores 306 are shown exemplarily in the figure, but this disclosure does not limit the number of processor cores 306. Their internal architecture is shown in Figure 4. Each processor core 306 includes three main modules: a control module 41, a computation module 42, and a storage module 43.

[0053] The control module 41 coordinates and controls the operation of the computation module 42 and the storage module 43 to complete the deep learning task. It includes an instruction fetch unit (IFU) 411 and an instruction decode unit (IDU) 412. The instruction fetch unit 411 fetches instructions from the processing device 203, and the instruction decode unit 412 decodes the fetched instructions and sends the decoding result as control information to the computation module 42 and the storage module 43.

[0054] The computation module 42 includes a vector operation unit 421 and a matrix operation unit 422. The vector operation unit 421 is used to perform vector operations and can support complex operations such as vector multiplication, addition, and nonlinear transformations; the matrix operation unit 422 is responsible for the core computations of deep learning algorithms, such as matrix multiplication and convolution.

[0055] Storage module 43 is used to store or move related data, and includes neuron RAM (NRAM) 431, weight RAM (WRAM) 432, input/output direct memory access (IODMA) 433, and move direct memory access (MVDMA) 434. NRAM 431 stores the feature maps computed by processor core 306 and the intermediate results of the computation; WRAM 432 stores the weights of the deep learning network; IODMA 433 controls memory access between NRAM 431 / WRAM 432 and DRAM 204 through broadcast bus 309; MVDMA 434 controls memory access between NRAM 431 / WRAM 432 and SRAM 308.

[0056] Returning to Figure 3, storage core 307 is primarily used for storage and communication, namely storing shared data or intermediate results among processor cores 306, and performing communication between cluster 305 and DRAM 204, communication between clusters 305, and communication between processor cores 306. In other embodiments, storage core 307 has scalar operation capabilities for performing scalar operations.

[0057] Storage core 307 includes a shared memory unit (SRAM) 308, a broadcast bus 309, a cluster direct memory access (CDMA) module 310, and a global direct memory access (GDMA) module 311. SRAM 308 acts as a high-performance data relay station. Data multiplexed between different processor cores 306 within the same cluster 305 does not need to be obtained from DRAM 204 by each processor core 306 individually. Instead, it is relayed between processor cores 306 via SRAM 308. Storage core 307 only needs to quickly distribute the multiplexed data from SRAM 308 to multiple processor cores 306, thereby improving inter-core communication efficiency and significantly reducing on-chip and off-chip I / O access.

[0058] Broadcast bus 309, CDMA 310, and GDMA 311 are used to perform communication between processor cores 306, communication between clusters 305, and data transfer between cluster 305 and DRAM 204, respectively. These will be explained below.

[0059] The broadcast bus 309 is used to complete high-speed communication between the processor cores 306 within the cluster 305. In this embodiment, the broadcast bus 309 supports inter-core communication methods including unicast, multicast, and broadcast. Unicast is point-to-point data transmission, that is, transmission from one processor core to another. Multicast is a communication method that transmits a piece of data from SRAM 308 to several specific processor cores 306. Broadcast is a communication method that transmits a piece of data from SRAM 308 to all processor cores 306, and is a special case of multicast.

[0060] CDMA 310 is used to control SRAM 308 access between different clusters 305 within the same computing device 201. Figure 5 illustrates the working principle of CDMA 310 when one processor core wants to write data to the processor core of another cluster. In this application scenario, the same computing device includes multiple clusters. For ease of explanation, only cluster 0 and cluster 1 are shown in the figure. Cluster 0 and cluster 1 each include multiple processor cores. Similarly, for ease of explanation, only processor core 0 is shown in cluster 0, and only processor core 1 is shown in cluster 1. Processor core 0 wants to write data to processor core 1.

[0061] First, processor core 0 sends a unicast write request to write data into its local SRAM 0. CDMA 0 acts as the master and CDMA 1 acts as the slave. The master pushes the write request to the slave, that is, the master sends the write address AW and the write data W to transmit the data to SRAM 1 of cluster 1. Then, the slave sends a write response B as a response. Finally, processor core 1 of cluster 1 sends a unicast read request to read the data from SRAM 1.

[0062] Returning to Figure 3, GDMA 311 works in conjunction with external storage controller 301 to control memory access from SRAM 308 of cluster 305 to DRAM 204, or to read data from DRAM 204 into SRAM 308. As previously mentioned, communication between DRAM 204 and NRAM 431 or WRAM 432 can be achieved through two channels. The first channel is direct communication between DRAM 204 and NRAM 431 or WRAM 432 via IODMA 433; the second channel involves first transferring data between DRAM 204 and SRAM 308 via GDMA 311, and then transferring data between SRAM 308 and NRAM 431 or WRAM 432 via MVDMA 434. Although the second channel appears to require more components and has a longer data flow, in some embodiments, the bandwidth of the second channel is actually much greater than that of the first channel. Therefore, communication between DRAM 204 and NRAM 431 or WRAM 432 may be more efficient via the second channel. The embodiments disclosed herein can select the data transmission channel based on their hardware capabilities.
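The channel selection described above can be sketched as a simple comparison of estimated transfer times. This is only an illustration under stated assumptions: the bandwidth figures are invented for the example, the two hops of the second channel are assumed to run one after the other without overlap, and the function names are hypothetical rather than part of any hardware interface.

```python
def direct_time(size_bytes: int, iodma_bw: float) -> float:
    """First channel: DRAM <-> NRAM/WRAM directly via IODMA."""
    return size_bytes / iodma_bw


def relayed_time(size_bytes: int, gdma_bw: float, mvdma_bw: float) -> float:
    """Second channel: DRAM <-> SRAM via GDMA, then SRAM <-> NRAM/WRAM via MVDMA.
    Assumes the two hops are serialized (no overlap)."""
    return size_bytes / gdma_bw + size_bytes / mvdma_bw


def pick_channel(size_bytes: int, iodma_bw: float, gdma_bw: float, mvdma_bw: float) -> str:
    """Return the channel with the smaller estimated transfer time."""
    if relayed_time(size_bytes, gdma_bw, mvdma_bw) < direct_time(size_bytes, iodma_bw):
        return "relayed"
    return "direct"


# Assumed bandwidths in bytes per second; not measured values.
size = 1 << 20
wide_relay = pick_channel(size, iodma_bw=10e9, gdma_bw=100e9, mvdma_bw=200e9)
narrow_relay = pick_channel(size, iodma_bw=50e9, gdma_bw=60e9, mvdma_bw=60e9)
```

With the first set of assumed numbers the relayed path is faster despite the extra hop; with the second set the direct path is faster, which mirrors the statement that the choice depends on the hardware.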

[0063] In other embodiments, the functions of GDMA 311 and IODMA 433 can be integrated into the same component. For ease of description, this disclosure treats GDMA 311 and IODMA 433 as different components. For those skilled in the art, any component whose implemented functions and achieved technical effects are similar to those disclosed herein falls within the scope of protection of this disclosure. Furthermore, the functions of GDMA 311, IODMA 433, CDMA 310, and MVDMA 434 can also be implemented by the same component. Similarly, any component whose implemented functions and achieved technical effects are similar to those disclosed herein falls within the scope of protection of this disclosure.

[0064] One of the key reasons why the computing device 201 has strong computing power is its three-level operation hierarchy of system-on-chip, cluster, and processor core, combined with a three-level memory design of DRAM, SRAM, and NRAM/WRAM, which allows data to be cached and computed at the appropriate level and keeps the pipeline sufficiently filled.

[0065] The computing device 201 performs calculations in three main stages: Load stage: loading data; Compute stage: transferring data, performing calculations, and transferring intermediate results; Store stage: storing the results.

[0066] More specifically, in some embodiments, a two-layer, three-stage pipeline can be employed, as shown in Figure 6. The first-layer load stage 601, computation stage 602, and write-back stage 603 occur at the cluster level. In the first-layer load stage 601, GDMA 311 loads data from DRAM 204 into SRAM 308. In the first-layer computation stage 602, the cluster 305 performs calculations on the loaded on-chip unit graph and generates the calculation results. In the first-layer write-back stage 603, GDMA 311 writes the calculation results back from SRAM 308 to DRAM 204.

[0067] Since cluster 305 includes multiple processor cores 306, the first-layer computation stage 602 actually divides the on-chip unit graph into corresponding subgraphs through storage core 307 and broadcasts them to at least one processor core 306 for computation. The second-layer three-stage pipeline therefore runs in the processor cores 306. More specifically, in the second-layer load stage 604, MVDMA 434 loads the subgraph from SRAM 308 into NRAM 431. In the second-layer computation stage 605, the subgraph and sub-weights are moved to the computation module 42 for computation, and the intermediate results are then moved back to NRAM 431. In the second-layer write-back stage 606, MVDMA 434 writes the intermediate results from NRAM 431 back to SRAM 308.

[0068] The first-layer pipeline means that the first-layer load stage 601, the first-layer computation stage 602, and the first-layer write-back stage 603 can be performed in parallel. Take as an example the same cluster 305 processing the j-th, (j+1)-th, and (j+2)-th on-chip unit graphs. First, the j-th on-chip unit graph is loaded into SRAM 308 in the first-layer load stage 601. Then, the j-th on-chip unit graph is computed in the first-layer computation stage 602, and the first computation result is moved back to SRAM 308. While the j-th on-chip unit graph is being computed, the (j+1)-th on-chip unit graph is loaded into SRAM 308 in the first-layer load stage 607. While the first computation result is written back to DRAM 204 in the first-layer write-back stage 603, the (j+1)-th on-chip unit graph is computed in the first-layer computation stage 608, and the second computation result is moved back to SRAM 308. At the same time, the (j+2)-th on-chip unit graph is loaded into SRAM 308 in the first-layer load stage 610. The first-layer pipeline proceeds in this manner.
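The overlap described in this paragraph can be written out as a schedule. The following is a minimal sketch assuming that the three stages take the same amount of time, so that each time step advances every in-flight item by one stage; the function name `pipeline_schedule` is hypothetical.

```python
from typing import List, Optional, Tuple

Step = Tuple[Optional[int], Optional[int], Optional[int]]


def pipeline_schedule(num_items: int) -> List[Step]:
    """For each time step, return (load, compute, write_back) item indices.
    None marks an idle stage. Assumes equal stage durations."""
    if num_items <= 0:
        return []

    def item_or_none(i: int) -> Optional[int]:
        return i if 0 <= i < num_items else None

    # Item i is loaded at step i, computed at step i + 1, written back at step i + 2.
    return [(item_or_none(t), item_or_none(t - 1), item_or_none(t - 2))
            for t in range(num_items + 2)]


schedule = pipeline_schedule(3)
serial_steps = 3 * 3            # load, compute, write back each item in turn
pipelined_steps = len(schedule)
```

For three items the pipelined schedule needs five steps instead of nine, and in the middle step all three stages are busy with different items, which corresponds to the situation where the j-th result is written back, the (j+1)-th item is computed, and the (j+2)-th item is loaded at the same time.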

[0069] To facilitate the aforementioned pipelined operation, the SRAM 308 in this embodiment includes two storage spaces: a ping storage space and a pong storage space. Pipelining according to the ping-pong attribute of the SRAM 308 falls into three types: input/output ping-pong (IO parity), input ping-pong (input parity), and no ping-pong (no parity). IO parity supports parallel loading, computation, and write-back. To achieve IO parity, the ping and pong storage spaces need to be exactly equal in size, and each serves both loading and write-back. Input ping-pong supports only parallel write-back and computation, and adds extra time for moving data within the SRAM 308. Compared with IO parity, the ping and pong storage spaces do not need to be exactly equal, but an additional cache of the same size as the write-back storage space needs to be allocated. No ping-pong means that loading/write-back and computation are serial, and no additional space is required.

[0070] To achieve the aforementioned first-layer pipeline, in some embodiments the SRAM 308 has ping and pong storage spaces of the same size, giving an input/output ping-pong effect. Continuing with Figure 6, the storage area involved in the first-layer load stage 601, computation stage 602, and write-back stage 603 of the j-th on-chip unit graph is limited to the ping storage space; the storage area involved in the first-layer load stage 607, computation stage 608, and write-back stage 609 of the (j+1)-th on-chip unit graph is limited to the pong storage space; and the storage area involved in the first-layer load stage 610, computation stage 611, and write-back stage 612 of the (j+2)-th on-chip unit graph is again limited to the ping storage space. The ping and pong storage spaces are used alternately in this manner.
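The alternation can be expressed as a parity rule on the position of each item in the sequence. The sketch below only illustrates that rule; the function name `storage_space` and the starting index are hypothetical, and the first item processed is assumed to use the ping space.

```python
def storage_space(index: int, first_index: int = 0) -> str:
    """Return which of the two equally sized SRAM spaces an item uses.
    Items alternate between the two spaces, starting with "ping"."""
    return "ping" if (index - first_index) % 2 == 0 else "pong"


# Hypothetical starting index j = 5: items j, j+1, j+2.
j = 5
assignment = {idx: storage_space(idx, first_index=j) for idx in (j, j + 1, j + 2)}
```

Because consecutive items never share a space, loading the next item cannot overwrite the data of the item currently being computed.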

[0071] The second-layer pipeline means that the second-layer load stage 604, the second-layer computation stage 605, and the second-layer write-back stage 606 can be performed in parallel. Consider an example where the same processor core 306 processes the i-th, (i+1)-th, and (i+2)-th subgraphs of the j-th on-chip unit graph. First, the i-th subgraph is broadcast to NRAM 431 in the second-layer load stage 604. Then, the i-th subgraph is computed in the second-layer computation stage 605 to produce the i-th intermediate result, which is moved back to NRAM 431. Simultaneously, the (i+1)-th subgraph is broadcast to NRAM 431 in the second-layer load stage 613. Next, the i-th intermediate result is written back to SRAM 308 in the second-layer write-back stage 606. At the same time, the (i+1)-th subgraph is computed in the second-layer computation stage 614 to generate the (i+1)-th intermediate result, which is moved back to NRAM 431, and the (i+2)-th subgraph is loaded into NRAM 431 in the second-layer load stage 615.

[0072] Considering that each cluster 305 has different tasks and the completion time will naturally be different, the synchronization module 304 in some embodiments will use synchronization barrier instructions to synchronize the task completion time in order to avoid timing errors.

[0073] While an exemplary multi-core hierarchical architecture has been described above, considering that group matrix multiplication tasks may be implemented on various multi-core architectures, the task scheduling scheme needs to consider compatibility and universality across different multi-core architectures. The inventors note that the goals of task splitting optimization on multi-core architectures are primarily twofold: I / O memory access and processor core utilization.

[0074] On the one hand, because the capacity of internal storage resources is limited, processing massive amounts of data leads to a large amount of data interaction between the processor core and external storage devices. The bandwidth of the input / output (I / O) bus between the processor core and external memory is limited, which can easily lead to serious I / O bottlenecks, causing data transfer delays and significantly reducing computational efficiency during parallel processing. Furthermore, not only does the bandwidth limitation of the I / O bus become a bottleneck for system performance, but the large number of I / O accesses between the processor core and external storage devices also negatively impacts computational efficiency and power consumption.

[0075] On the other hand, current computer hardware employs tensor core technology to accelerate large-scale matrix multiplication, which implements a relatively large-scale matrix multiplication on a single computational unit. For example, some hardware architectures can compute a 16*8*16 (M*K*N) matrix multiplication in a single computational unit. This requires that the M and N dimensions be aligned to 16. However, in MoE's Group GEMM, the matrix dimensions of each expert are variable, leading to a loss of computational power due to alignment issues with the M and N dimensions. Therefore, it is necessary to optimize the utilization of processor cores.

[0076] In view of this, the embodiments disclosed herein abstract a storage (memory) architecture for various multi-core system structures to ensure compatibility with existing multi-core architectures. Based on this, considering the amount of data interaction between the processor core and external circuits (e.g., external storage devices) and / or the utilization rate of the processor core, a task scheduling scheme is provided to efficiently schedule group matrix multiplication tasks to be executed in parallel on a multi-core system, thereby improving the computational efficiency of parallel operations on the multi-core system.

[0077] Figure 7 illustrates a schematic diagram of the storage (memory) architecture of a multi-core system to which embodiments of this disclosure can be applied. This storage architecture takes into account a variety of multi-core systems and is compatible with different existing multi-core architectures.

[0078] As shown in the diagram, the lowest-level DRAM 710 is off-chip memory used to store input / output data, such as input data processed by the processor core and processing results obtained by the multi-core system. DRAM 710 can be DDR memory, with types including LPDDR, GDDR, and HBM. The processor core can access off-chip memory via an on-chip memory control unit (e.g., a DDR controller) based on a bus protocol. DRAM 710 can be considered global storage or global memory at this chip level.

[0079] The next layer up in the diagram is the L2 cache 720. This cache layer also belongs to off-chip memory. The L2 cache 720 can include an LLC (Last Level Cache). As mentioned earlier, the LLC can have multiple operating modes, including data latching, streaming access, and flush-write-back to system memory. When the LLC's data latching mode is enabled, the characteristics of LLC latching data access can be fully utilized to maximize the utilization of data residing in the LLC, achieving the effect of bandwidth expansion.

[0080] Moving up one level in the diagram is shared memory 730. Shared memory 730 allows for on-chip storage sharing among multiple processor cores within a computing cluster. It is understood that multiple shared memories can exist, each serving multiple processor cores; only one shared memory is shown as an example in the diagram.

[0081] The top layer in the diagram is the local memory 740 for each processor core 750, and the shared memory 730 is the on-chip memory for the corresponding processor core 750. Depending on the specific hardware architecture, the local memory 740 can be implemented in different ways, such as a data cache or a register page, and the embodiments disclosed herein are not limited in this respect.

[0082] Although Figure 7 illustrates memory architectures abstracted from various multi-core architectures, those skilled in the art should understand that some storage levels in Figure 7 can be omitted in different hardware implementations. For example, in one implementation, shared memory 730 may be absent; in another, L2 cache 720 may be absent, or even if LLC is configured in hardware, its data latching mode may not be enabled, which is equivalent to having no L2 cache; in yet another implementation, neither shared memory 730 nor L2 cache 720 may be present, or the data latching mode of LLC may not be enabled. In the task scheduling scheme of the embodiments disclosed herein, based on the storage architecture of Figure 7, and taking into account the various possible implementations described above, an optimized task splitting scheme is provided to support the efficient parallel execution of group matrix multiplication-type operations in a multi-core architecture.

[0083] In different multi-core systems or different operating modes of the same multi-core system, multiple processor cores may be scheduled to perform parallel operations in different units or granularities. For example, in NVIDIA's multi-core systems, a single streaming multiprocessor (SM) is generally used as the basic unit, while in some Cambricon multi-core systems, a compute cluster is generally used as the basic unit. In the embodiments disclosed herein, the scheduling granularity of processor cores is referred to as a "computation unit," and each computation unit includes one or more processor cores. The number of processor cores contained in a computation unit varies between multi-core systems. For example, in some implementations, a compute cluster can be used as a computation unit, containing 4 processor cores; in other implementations, four compute clusters can be used as a computation unit, containing 16 processor cores; and in still other implementations, a single processor core can be used as a computation unit.

[0084] It should be noted that the "group matrix multiplication tasks" mentioned in this disclosure include not only direct group matrix multiplication tasks, but also other tasks that can be transformed into group matrix multiplication operations, such as multiple convolution operations, multiple Einstein summations, etc.

[0085] A single matrix multiplication operation can usually be represented as: C = op(A) * op(B) (1)

[0086] Here, `op` indicates whether to transpose the matrix. Left multiplication matrix `op(A)` and right multiplication matrix `op(B)` are M×K and K×N matrices respectively, and the output matrix `C` is an M×N matrix. In most cases, matrices A and B do not need to be transposed, therefore `op(A) = A` and `op(B) = B`. For simplicity, matrices A and B will be mentioned directly below, omitting `op`. When there are multiple sets of matrix multiplication operations, the size of each set of multiplication matrices is variable. The input and output data of convolution operations typically involve dimensions such as batch, height (H), width (W), input channels (Ci), output channels (Co), kernel height (Kh), and width (Kw). These dimensions can be combined to transform into matrix multiplication operations. For example, for convolution operations, the M, N, and K dimensions can be M = Batch × H × W, N = Co, and K = Ci × Kh × Kw, respectively. Therefore, the scheme described in this disclosure based on matrix multiplication task can also be similarly applied to these computational tasks that can be transformed into matrix multiplication operations.
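
The dimension mapping for convolution given above can be written out as a short sketch. The names `GemmShape` and `conv_to_gemm_shape` are hypothetical, and `h` and `w` are used exactly as H and W appear in the text; which feature map they refer to is not specified there.

```python
from typing import NamedTuple


class GemmShape(NamedTuple):
    m: int
    n: int
    k: int


def conv_to_gemm_shape(batch: int, h: int, w: int,
                       ci: int, co: int, kh: int, kw: int) -> GemmShape:
    """Map convolution dimensions to the M, N, K of an equivalent matrix multiplication."""
    # M = Batch x H x W, N = Co, K = Ci x Kh x Kw, as stated in the text.
    return GemmShape(m=batch * h * w, n=co, k=ci * kh * kw)


shape = conv_to_gemm_shape(batch=2, h=8, w=8, ci=3, co=16, kh=3, kw=3)
```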

[0087] In intelligent computing systems, high-performance multi-core systems are typically used as accelerator cards in servers. They exchange data with the host CPU via the PCIe bus, forming a host-device working mode. For example, in the combined processing device 20 shown in Figure 2, computing device 201 is a high-performance multi-core system serving as the device, and processing device 203 can serve as the host. The scheme of the disclosed embodiments includes two aspects: a task scheduling scheme on the host side and a task execution scheme on the device side. The host side determines the optimization parameters of the matrix multiplication splitting scheme based on the optimization objective of task splitting and scheduling, thereby fixing the matrix multiplication splitting strategy. The host side then transmits these optimization parameters to the device side, i.e., the high-performance multi-core system, by launching the kernel. Each computation unit in the multi-core system on the device side determines and executes the Group GEMM task that it needs to process based on these optimization parameters.

Group Matrix Multiplication Task Scheduling Scheme

[0088] In this disclosed embodiment, a task scheduling scheme based on a multi-core system is provided to distribute group matrix multiplication operations to multiple processor cores of the multi-core system for execution.

[0089] The biggest problem with existing group matrix multiplication decomposition methods is that the M, K, and N dimensions of a single matrix multiplication within the group are not large enough to fully utilize the computing power of a multi-core system, making decomposition a significant challenge. If M and N are large enough, existing decomposition methods for GEMM can be reused, such as the chessboard decomposition scheme provided in Chinese patent application publication CN118733206A. However, when M and N are moderate and the task is a computational bottleneck, treating the group matrix multiplication problem as a single matrix multiplication problem cannot fully exploit the performance of group matrix multiplication.

[0090] Therefore, this disclosed embodiment specifically designs a partitioning optimization method for group matrix multiplication, taking into account its characteristics, that is, partitioning from the perspective of a single operational unit. In determining the size of the partition block, it distinguishes between different scenarios involving I / O bottlenecks and computational bottlenecks, thereby obtaining the optimal partitioning scheme. Here, the partition block refers to the size of the data block that each operational unit needs to compute in each round, which includes three dimensions M, N, and K, with the sizes of each dimension being block_m, block_n, and block_k, respectively.

[0091] Figure 8 illustrates an exemplary flowchart of a task scheduling method 800 based on a multi-core system according to an embodiment of this disclosure. The tasks to be scheduled include grouped matrix multiplication tasks, which comprise multiple sets of matrix multiplications, each with a variable matrix size. The multi-core system includes multiple computational units, each comprising one or more processor cores, and each processor core having local memory (e.g., referring to the memory architecture of Figure 7).

[0092] As shown in Figure 8, in step 810, the layout information of the group matrix multiplication operation task is obtained.

[0093] Before splitting the task, it is necessary to understand the computational scale of the task. Layout information may include the matrix dimensions of each group of matrix multiplications, such as the dimensions of M, K, and N, and whether transposition is required. Considering that the matrix dimensions of each group of matrix multiplications in group matrix multiplication tasks are variable, this disclosure embodiment designs the splitting scheme based on the maximum dimension size of each group of matrix multiplications. In some embodiments, obtaining the above layout information includes: obtaining the maximum values m_max, k_max, and n_max of each dimension of M, K, and N in the multiple groups of matrix multiplications. Depending on the location of the matrix multiplication data, there may be different ways to obtain the maximum values of the above dimensions.

[0094] In some embodiments, in response to the dimension data being on the host side, the maximum value of each dimension can be directly searched from the dimension data. It is understood that for Group GEMM, since M, K, and N of each group matrix multiplication are variable, the dimension data provided to the host side is arrays of M, K, and N (e.g., m_array[], k_array[], n_array[]), and the maximum value needs to be searched from each array.

[0095] In other embodiments, in response to the dimension data being on the device, the maximum value of each dimension can be determined based on prior knowledge. For example, in the inference scenario of a typical MoE network, M is variable, while K and N are fixed. Therefore, the maximum values of dimensions K and N are K and N themselves, and the maximum value m_max of dimension M can be taken as m_max = batch * sequence, where batch is the batch size in the inference scenario and sequence is the sequence length. As another example, in the training scenario of a typical MoE network, the scale of the forward and backward data computations is the same as in the inference scenario and can be set similarly; however, in the backward filter computation (i.e., the backward pass for the convolution kernel), K is variable, while M and N are fixed. Therefore, the maximum value of K can be set to the maximum value of dimension M (i.e., of m_array[]) from the inference scenario.

[0096] In some other embodiments, in response to the dimensional data being on the device, it is necessary to retrieve the dimensional data from the device and search for the maximum value of each dimension. This approach is general; for example, when M, K, and N are all variable on the device, the memcpy (memory copy) instruction can be used to copy them from the device to the host before performing the maximum value retrieval. Alternatively, based on the previous implementation, information from the device malloc (device memory allocation) stage or other prior information can be directly used to provide a maximum estimate. These estimates do not need to be highly precise; they only need to help determine whether the task is a computational bottleneck or an I / O bottleneck, as described later.
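
Two of the ways of obtaining m_max, k_max, and n_max described above can be sketched as follows. This is an illustrative sketch only; the function names are hypothetical, and the second function assumes the MoE inference case in which K and N are fixed.

```python
from typing import Sequence, Tuple


def max_dims_on_host(m_array: Sequence[int], k_array: Sequence[int],
                     n_array: Sequence[int]) -> Tuple[int, int, int]:
    """Dimension arrays are on the host: search each array for its maximum."""
    return max(m_array), max(k_array), max(n_array)


def max_dims_moe_inference(batch: int, sequence: int,
                           k: int, n: int) -> Tuple[int, int, int]:
    """Dimension data is on the device: bound M from prior knowledge instead."""
    # M varies per expert but cannot exceed batch * sequence; K and N are fixed.
    return batch * sequence, k, n


host_max = max_dims_on_host([96, 512, 40], [4096, 4096, 4096], [1024, 1024, 1024])
prior_max = max_dims_moe_inference(batch=4, sequence=128, k=4096, n=1024)
```

As the text notes, the estimate from prior knowledge may be looser than the exact maximum; it only needs to be good enough to classify the task.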

[0097] Next, in step 820, based on the acquired layout information, the split size block_k of the split block in the K dimension and the split strategy within a single operation unit are determined, where the K dimension is the column dimension of the left multiplication matrix and the row dimension of the right multiplication matrix.

[0098] Since the K dimension is both the column dimension of the left multiplication matrix and the row dimension of the right multiplication matrix, fixing the split size block_k in the K dimension in advance reduces the search volume in the subsequent search for the split sizes block_m and block_n in the M and N dimensions, thereby reducing the host-side overhead.

[0099] In some embodiments, determining the partition size block_k in the K dimension may include: setting block_k to meet the efficiency requirements of matrix multiplication instructions supported by multi-core systems in response to group matrix multiplication tasks being computationally bottlenecked tasks; or setting block_k to prioritize improving the memory access efficiency of the largest matrix in response to group matrix multiplication tasks being memory access bottleneck tasks; wherein the largest matrix is determined based on the size of the left-multiplied matrix and the size of the right-multiplied matrix. In these embodiments, by distinguishing between computationally bottlenecked and memory access bottleneck tasks, the needs of different types of tasks can be considered more meticulously to better match optimization objectives.

[0100] Generally, depending on the storage dimension in which dimension K is located, block_k can be set as follows: when the storage dimension in which dimension K is located is the lowest dimension, block_k is set to a multiple of the cache line size, where the multiple is a natural number; or when the storage dimension in which dimension K is located is not the lowest dimension, block_k is set to a multiple of the alignment granularity align_k of the instructions that implement matrix multiplication, where the multiple is a natural number.

[0101] Specifically, for computationally bottlenecked tasks, it is necessary to maximize computational efficiency, that is, to meet the efficiency requirements of matrix multiplication instructions supported by multi-core systems. The matrix multiplication instruction can be a single matrix multiplication instruction or an instruction composed of multiple matrix multiplication-accumulation (MMA) instructions; this disclosed embodiment has no limitations in this regard. When the storage dimension containing dimension K is the lowest dimension, block_k can be set to a cache line size based on the transpose information. For example, for GPUs, in scenarios where transpose information transB = true or transA = false, a memory sector size, such as 64B, can be selected. This setting method can guarantee basic memory access functionality and efficiency while reserving space for splitting dimensions M and N to improve computational efficiency. When the storage dimension containing dimension K is not the lowest dimension, block_k is set to a multiple of the alignment granularity align_k of the instruction implementing matrix multiplication, based on the transpose information. For example, in scenarios where transpose information transB = false and transA = true, the multiple of alignment granularity align_k is 2x, 3x, etc. This setup satisfies the minimum alignment granularity requirement of matrix multiplication instructions. At the same time, to meet the efficiency requirements of on-chip matrix multiplication calculation instructions and increase the depth of K-dimensional loops, the size of align_k can be selected to maximize the computing power of the operation unit and improve the computational efficiency of matrix multiplication instructions.

[0102] For memory access bottleneck tasks, it's crucial to ensure the memory access performance of the largest matrix to prevent it from being unavailable for computation due to memory bottlenecks. The largest matrix is determined by the sizes of the left and right multiplication matrices, specifically by the maximum values of N and M, based on the previously obtained layout information. When the storage dimension containing dimension K is the lowest dimension, the size of `block_k` can be appropriately increased. For example, in a transformer decoder scenario, when `transB = true`, `block_k` can be set to a multiple of the cache line size, such as 2 or 3 times. In the multi-core system described earlier, 2 times the cache line size, i.e., 1024B, is used. When the storage dimension containing dimension K is not the lowest dimension, `block_k` can be set smaller, ensuring only the minimum alignment granularity of the matrix multiplication instructions. For example, `block_k` can be set to the alignment granularity `align_k` of the instructions implementing matrix multiplication.
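
The four cases for block_k described in the preceding paragraphs can be summarized in a small sketch. It is a simplified reading, not the disclosed implementation: the function name and parameters are hypothetical, block_k is expressed in elements (the text quotes byte sizes), and a single `multiple` parameter stands in for the "2x, 3x" choices mentioned above.

```python
def choose_block_k(compute_bound: bool, k_is_lowest_dim: bool,
                   cache_line_bytes: int, elem_bytes: int,
                   align_k: int, multiple: int = 2) -> int:
    """Pick block_k (in elements) from the task type and the storage layout of K."""
    if k_is_lowest_dim:
        # Compute-bound: one cache line keeps accesses efficient and leaves room
        # for larger block_m / block_n. Memory-bound: several cache lines.
        lines = 1 if compute_bound else multiple
        return lines * cache_line_bytes // elem_bytes
    # K is not the lowest storage dimension: work in units of align_k.
    # Compute-bound tasks take a multiple; memory-bound tasks take the minimum.
    return align_k * multiple if compute_bound else align_k


examples = [
    choose_block_k(True, True, 64, 2, 16),     # compute-bound, K lowest
    choose_block_k(False, True, 512, 2, 16),   # memory-bound, K lowest
    choose_block_k(True, False, 64, 2, 16),    # compute-bound, K not lowest
    choose_block_k(False, False, 64, 2, 16),   # memory-bound, K not lowest
]
```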

[0103] In some embodiments, the task type of the group matrix multiplication operation is determined based on the layout information. Specifically: a computing power limit is determined based on the layout information of the group matrix multiplication operation, the data bit width of the left multiplication matrix, and the data bit width of the right multiplication matrix; the computation-to-memory access ratio of the multi-core system is determined from the bandwidth of the off-chip DRAM of the multi-core system and the number of floating-point operations per second of the multi-core system; the computation-to-memory access ratio and the computing power limit are then compared, and the task type of the group matrix multiplication operation is determined based on the comparison result. The following condition can be used to determine whether a task is a computation bottleneck or a memory access bottleneck. When the condition is met, the group matrix multiplication operation is determined to be a computation bottleneck task; otherwise, it is a memory access bottleneck task:

[0104]

[0105] Here, CGMA_DEV is the compute-to-memory access ratio of the multi-core system, reflecting the efficiency of computation and memory operations; TFLOPS is the number of floating-point operations per second of the multi-core system, in trillions; DRAM is the bandwidth of the off-chip DRAM (e.g., 204 in Figure 2), in GB / s; `group` represents the number of matrix multiplication groups in the matrix multiplication task; `sizeof()` represents the data bit width; A is the left multiplication matrix and B is the right multiplication matrix; and M, K, and N are the maximum values of dimensions M, K, and N obtained in the preceding steps. Unless otherwise specified, for the sake of brevity, M, K, and N are used in the task scheduling scheme of the disclosed embodiments to refer to the maximum value of their respective dimensions.
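
Because the condition itself is not reproduced in the text, the following is only one plausible roofline-style reading of the comparison described in paragraph [0103], and every formula in it is an assumption: the arithmetic intensity of the task is taken as 2 * group * M * N * K floating-point operations divided by the bytes of A and B, CGMA_DEV is taken as peak FLOP/s divided by DRAM bytes per second, and `sizeof` values are in bytes. The function name is hypothetical.

```python
def is_compute_bound(group: int, m: int, k: int, n: int,
                     sizeof_a: int, sizeof_b: int,
                     tflops: float, dram_gb_s: float) -> bool:
    """Assumed test: task intensity above the machine balance means compute-bound."""
    # Machine balance in FLOP per byte (TFLOPS -> FLOP/s, GB/s -> B/s).
    cgma_dev = (tflops * 1e12) / (dram_gb_s * 1e9)
    # Assumed work and traffic of the whole group, using the maximum dimensions.
    flops = 2.0 * group * m * n * k
    bytes_moved = group * (m * k * sizeof_a + k * n * sizeof_b)
    return flops / bytes_moved > cgma_dev


large = is_compute_bound(8, 8192, 8192, 8192, 2, 2, tflops=100.0, dram_gb_s=1000.0)
skinny = is_compute_bound(8, 1, 8192, 8192, 2, 2, tflops=100.0, dram_gb_s=1000.0)
```

Under these assumptions a large square problem is classified as a computation bottleneck, while a problem with M = 1 is classified as a memory access bottleneck.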

[0106] The above describes how to determine the K-dimensional partition size `block_k` to address computational and memory access bottlenecks. Step 820 can also determine the partitioning strategy within a single computational unit. As mentioned earlier, the number of processor cores contained within a single computational unit varies depending on the multi-core system.

[0107] In some embodiments, determining the splitting strategy within a single computation unit may include: determining the number of processor cores Ncore within a single computation unit; and splitting the split block into pm*pn parts, where pm*pn = Ncore, pm is the number of splits in the M dimension, and pn is the number of splits in the N dimension.

[0108] Specifically, either of the following splitting methods can be used: evenly distribute pm and pn across the M and N dimensions; or split the other dimension in response to either the M or N dimensions in the splitting block being less than the corresponding threshold.

[0109] For example, in an architecture with four processor cores within a single computation unit, a typical approach is a 2×2 split of a single split block, that is, splitting block_m and block_n into two parts each and distributing the resulting four parts across the four processor cores. However, in special cases, such as when one dimension is much smaller than the other, only the larger dimension is split and the smaller dimension is kept whole. For instance, if block_n is small, e.g., less than twice align_n (where align_n is the alignment granularity of the matrix multiplication instruction in the N dimension), while block_m is large, block_m can be split into four parts and block_n kept whole. Similarly, if block_m is small, e.g., less than align_m or 2*align_m (where align_m is the alignment granularity of the matrix multiplication instruction in the M dimension), while block_n is large, block_n can be split into four parts and block_m kept whole.
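
The splitting strategy within a computation unit can be sketched as follows. The function name is hypothetical, the threshold of twice the alignment granularity is one of the values mentioned in the text and is applied to both dimensions here as an assumption, and the tie-breaking order when both dimensions are small is arbitrary.

```python
from math import isqrt
from typing import Tuple


def split_within_unit(n_core: int, block_m: int, block_n: int,
                      align_m: int, align_n: int) -> Tuple[int, int]:
    """Return (pm, pn) with pm * pn == n_core."""
    if block_n < 2 * align_n:
        # N is too small to split: keep block_n whole and split only M.
        return n_core, 1
    if block_m < 2 * align_m:
        # M is too small to split: keep block_m whole and split only N.
        return 1, n_core
    # Otherwise split as evenly as possible across M and N.
    pm = isqrt(n_core)
    while n_core % pm != 0:
        pm -= 1
    return pm, n_core // pm


balanced = split_within_unit(4, 256, 256, 16, 16)
narrow_n = split_within_unit(4, 256, 16, 16, 16)
narrow_m = split_within_unit(4, 16, 256, 16, 16)
```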

[0110] Next, in step 830, based on the block_k determined above, the split size block_n in the N dimension and the split size block_m in the M dimension are determined so as to meet the optimization objective, where M is the row dimension of the left multiplication matrix and N is the column dimension of the right multiplication matrix. The optimization objective includes either of the following: memory access or processor core utilization.

[0111] In some embodiments, determining the split size block_n in the N dimension and the split size block_m in the M dimension may include: searching, in a search space along the N dimension, for a block_n that satisfies the above optimization objective and determining the corresponding block_m; or searching, in a search space along the M dimension, for a block_m that satisfies the above optimization objective and determining the corresponding block_n. The search can thus be conducted along either the N dimension or the M dimension, and the disclosed embodiments are not limited in this respect.

[0112] In some embodiments, the maximum search space is constructed as follows: The maximum search space in N or M dimensions is determined based on the partition size `block_k` in the K-dimensional space, the capacity of the local memory of the processor cores in a multi-core system, and the partitioning strategy within a single computational unit. Establishing the maximum search space in N or M dimensions based on these parameters avoids performing a full search in either dimension, thus reducing the search complexity.

[0113] As can be seen from the construction of the search space above, when splitting the search block size, a brute-force search in the three dimensions of M, K, and N was not adopted. Instead, a cyclic pruning method was used to reduce the search volume, thereby reducing the overhead on the host side. This optimized search scheme can significantly shorten the search time. For example, when M=K=N=8192, the host-side search time can reach about 3ms using a three-dimensional brute-force search scheme; while under the same conditions, the cyclic pruning method of this embodiment can complete the search on the host side in only 20us to 30us, while still ensuring the performance of matrix multiplication.

[0114] The following detailed description uses searching in N dimensions as an example. Those skilled in the art will understand that when using a search scheme in M dimensions, the specific search process can be implemented similarly, and will not be described in detail here.

[0115] First, based on block_k, the capacity of local memory, and the splitting strategy (pm*pn) within a single computational unit, the maximum N-dimensional search space is determined. That is, based on this known information, the maximum block_n value that can be accommodated on-chip is determined. Referring to the memory architecture of the multi-core system shown in Figure 7, the memory levels included in the hardware implementation of different multi-core systems may vary slightly. For example, in one implementation, the on-chip memory level may only consist of local memory; in this case, only the capacity of that local memory needs to be considered. In another implementation, the on-chip memory level includes both local memory and shared memory; therefore, the capacities of both need to be considered, and the smaller value is taken. Taking the latter implementation as an example, the maximum N-dimensional search space search_max_n can be set as follows:

[0116] search_max_n=min(max_n_l1*pn,max_n_smem) (3)

[0117] In formula (3) above, max_n_l1*pn represents the maximum block_n size that can be accommodated given the limited storage space of the local memory l1_cache; max_n_smem represents the maximum block_n size that can be accommodated given the limited storage space of the shared memory SMEM. The smaller of the two is taken as the maximum search space search_max_n in the N dimension. `l1_cache` represents the size of the local memory, and `SMEM` represents the size of the shared memory. For the internal structure of the processor core shown in Figure 4, the local memory can be regarded, for example, as the combination of NRAM 431 and WRAM 432. `align_m` represents the alignment granularity, in the M dimension, of the instructions implementing matrix multiplication; `pk` represents the number of parts into which `block_k` is split for software pipelining when data is moved between the local memory `l1_cache` and the shared memory `SMEM`; `sizeof()` represents the data bit width; A is the left multiplication matrix, B is the right multiplication matrix, and C is the output matrix. In the local memory, `align_m * (block_k / pk) * sizeof(A) * 2` represents the storage space required for the ping-pong buffer of the left multiplication matrix A, and `(align_m * sizeof(C)) * 2` relates to the storage space required for the ping-pong buffer of the output matrix C; storage space is likewise required for the ping-pong buffer of the right multiplication matrix B. In the shared memory SMEM, `pm * align_m * block_k * sizeof(A) * 2` represents the storage space occupied by the left multiplication matrix A, and `block_k * sizeof(B) * 2` represents the storage space occupied by the right multiplication matrix B.
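
Formula (3) can be illustrated with a sketch built from the storage terms listed above. The expressions for max_n_l1 and max_n_smem are not given explicitly in the text, so the ones below are assumptions: each capacity is reduced by the fixed cost of the A buffers and divided by the cost of one column of block_n. In particular, the per-column local-memory cost of B, `(block_k / pk) * sizeof(B) * 2`, is assumed by analogy with the A term. Sizes are in bytes and the function name is hypothetical.

```python
def search_max_n(l1_cache: int, smem: int, block_k: int, pk: int,
                 pm: int, pn: int, align_m: int,
                 sizeof_a: int, sizeof_b: int, sizeof_c: int) -> int:
    """Upper bound of the block_n search space, following formula (3)."""
    k_step = block_k // pk  # slice of block_k held in local memory at a time

    # Local memory: ping-pong A tile is a fixed cost; B and C grow with block_n.
    a_l1 = align_m * k_step * sizeof_a * 2
    per_n_l1 = k_step * sizeof_b * 2 + align_m * sizeof_c * 2  # B term assumed
    max_n_l1 = max(0, (l1_cache - a_l1) // per_n_l1)

    # Shared memory: A is a fixed cost; B grows with block_n.
    a_smem = pm * align_m * block_k * sizeof_a * 2
    per_n_smem = block_k * sizeof_b * 2
    max_n_smem = max(0, (smem - a_smem) // per_n_smem)

    # Each of the pn cores along N holds its own share of block_n.
    return min(max_n_l1 * pn, max_n_smem)


limit = search_max_n(l1_cache=512 * 1024, smem=2 * 1024 * 1024, block_k=256,
                     pk=2, pm=2, pn=2, align_m=16,
                     sizeof_a=2, sizeof_b=2, sizeof_c=4)
```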

[0118] Based on the maximum search space, the search steps of the above-mentioned cyclic pruning method include: determining the corresponding block_m or block_n according to the capacity of the local memory, block_k, and block_n or block_m in the current search loop; calculating the corresponding optimization objective according to the block_m or block_n corresponding to the current search loop; and determining the block_n and block_m corresponding to the optimal optimization objective according to the optimization objective corresponding to each search loop in the search space.

[0119] The following detailed description uses searching in N dimensions as an example. Those skilled in the art will understand that when using a search scheme in M dimensions, the specific search process can be implemented similarly, and will not be described in detail here.

[0120] In this embodiment, the search steps can be performed in the aforementioned N-dimensional maximum search space in a cyclical manner, including: determining the corresponding block_m based on the local memory capacity, block_k, and block_n in the current search loop; calculating the corresponding optimization objective based on block_m in the current search loop; and determining the optimal optimization objective's corresponding block_n and block_m based on the optimization objective for each search loop within the search space. The optimization objective can differ depending on the task type.

[0121] When group matrix multiplication tasks are memory access bottleneck tasks, the optimization objective is to minimize memory access time, i.e., to reduce memory access time as much as possible. In some embodiments, memory access time (I / O) can be calculated as follows:

[0122]

[0123] Where M, N, and K are the maximum values of dimensions M, N, and K, respectively, and sizeof() represents the data bit width.

[0124] When the group matrix multiplication task is a computational bottleneck task, the optimization objective is processor core utilization, meaning the higher the processor core utilization, the better. In some embodiments, the processor core utilization DEV_utils can be calculated as follows:

[0125]

[0126] Here, CGMA_DEV is the compute-to-memory access ratio of the multi-core system; TFLOPS is the number of floating-point operations per second of the multi-core system, in trillions; DRAM is the bandwidth of the off-chip DRAM, in GB / s; `group` is the number of groups in the matrix multiplication task; num_cluster is the number of computation units; total_block is the total number of split blocks; and PAD_UP(total_block, num_cluster) equals total_block divided by num_cluster, rounded up, and then multiplied by num_cluster. In formula (5) above, the first factor of the product indicates whether the compute-to-memory access performance of the split blocks is sufficient for memory access to be hidden by computation on the multi-core system; the second factor indicates the utilization rate of the processor cores in the multi-core system.
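
The PAD_UP operation defined above, and its effect on utilization, can be sketched as follows. `pad_up` follows the definition in the text; `unit_utilization` is an assumed form of the second factor of formula (5), namely the fraction of scheduled block slots that carry real work, since the formula itself is not reproduced here.

```python
def pad_up(total_block: int, num_cluster: int) -> int:
    """Round total_block up to the next multiple of num_cluster."""
    return -(-total_block // num_cluster) * num_cluster


def unit_utilization(total_block: int, num_cluster: int) -> float:
    """Assumed utilization term: useful blocks over padded block slots."""
    return total_block / pad_up(total_block, num_cluster)


padded = pad_up(10, 4)
utilization = unit_utilization(10, 4)
```

With 10 split blocks on 4 computation units, three rounds are needed and two of the twelve slots stay idle, so a split that yields a multiple of num_cluster blocks is preferable under this reading.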

[0127] The above iterative search process can be represented in pseudocode as follows:

[0128] As can be seen from the pseudocode above, within the maximum search space of the N dimension, each search loop determines the corresponding temp_block_m from the temp_block_n of the current loop in a way similar to the method for determining the maximum search space. That is, the maximum block_m that fits within the storage space of the local memory l1_cache and the maximum block_m that fits within the storage space of the shared memory SMEM are both computed, and the smaller of the two is taken as the temp_block_m of the current search loop. Afterwards, depending on whether the task type is a memory access bottleneck or a computational bottleneck, the corresponding optimization objective is calculated using formula (4) or formula (5), respectively, and the combination of block_n and block_m that makes the optimization objective optimal is selected.
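The search loop described above can be sketched as follows. This is a minimal illustration, not the original pseudocode: the capacity functions `max_m_l1` and `max_m_smem` and the `objective` callable are hypothetical placeholders supplied by the caller, standing in for the memory limits and for formulas (4) and (5).

```python
from typing import Callable, Iterable, Optional, Tuple


def search_block_sizes(
    block_n_candidates: Iterable[int],
    max_m_l1: Callable[[int], int],
    max_m_smem: Callable[[int], int],
    objective: Callable[[int, int], float],
    maximize: bool,
) -> Optional[Tuple[int, int, float]]:
    """Return (block_m, block_n, score) with the best objective, or None."""
    best = None
    for temp_block_n in block_n_candidates:
        # The smaller of the two memory limits bounds block_m in this loop.
        temp_block_m = min(max_m_l1(temp_block_n), max_m_smem(temp_block_n))
        if temp_block_m <= 0:
            continue
        score = objective(temp_block_m, temp_block_n)
        if best is None or (score > best[2] if maximize else score < best[2]):
            best = (temp_block_m, temp_block_n, score)
    return best


# Toy stand-ins: local memory holds 4096 output elements, SMEM caps block_m
# at 128, and the cost to minimize is a made-up traffic estimate.
best = search_block_sizes(
    [16, 32, 64, 128, 256],
    max_m_l1=lambda n: 4096 // n,
    max_m_smem=lambda n: 128,
    objective=lambda m, n: 1 / m + 1 / n,
    maximize=False,
)
print(best)
```

Passing `maximize=True` with a utilization-style objective would cover the computational bottleneck case with the same loop.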

[0129] Therefore, the optimal split sizes block_k, block_n, and block_m that make the optimization objective optimal can be determined.

[0130] Continuing with Figure 8, in step 840, group matrix multiplication tasks are scheduled to be executed in parallel on the computational units of a multi-core system according to the determined partition sizes block_k, block_n, and block_m. Specifically, the matrix multiplication tasks in the group matrix multiplication tasks can be divided into multiple partition blocks according to the determined partition sizes block_k, block_n, and block_m. Then, the corresponding partition blocks are cyclically distributed to each computational unit for execution according to the grid order of the M and N dimensions of the output matrix blocks.

[0131] As can be seen from the host-side task scheduling scheme described above, this scheme splits the group matrix multiplication task from the perspective of a single computational unit, which is more in line with the device-side implementation and facilitates task distribution on the device side. When searching for the split size, pre-determining block_k can greatly reduce the search volume for subsequent optimization parameters, reduce host-side overhead, and improve processing efficiency. Furthermore, when pre-determining block_k, it distinguishes between memory access bottlenecks and computational bottlenecks, thus better matching the optimization goals. Furthermore, in the design of optimization goals, different optimization goals are considered for different task types. For example, for computationally bottlenecked tasks, processor core utilization is taken into account, thereby overcoming the limitations of splitting from the perspective of a single computational unit. In summary, the disclosed embodiments provide an optimized task scheduling scheme for the implementation of group matrix multiplication tasks on multi-core systems, which can effectively reduce frequent data exchange and loading, avoid input / output (I / O) bottleneck problems, and fully utilize the parallel operation characteristics of multi-core architecture to improve processor core utilization.

[0132] Accordingly, this disclosure also provides a task scheduling apparatus, including a processor and a memory, wherein: the processor is configured to execute program instructions; the memory is configured to store the program instructions; when the program instructions are loaded and executed by the processor, the processor performs the task scheduling method described in any of the preceding embodiments. This disclosure also provides a computer-readable storage medium storing program instructions, which, when loaded and executed by a processor, cause the processor to perform the task scheduling method described in any of the preceding embodiments. This disclosure also provides a computer program product, including a computer program or instructions, which, when executed by a processor, implement the task scheduling method described in any of the preceding embodiments. This disclosure also provides a processing apparatus, including the task scheduling apparatus described above. This disclosure also provides a chip, including the processing apparatus described above. This disclosure also provides a board, including the chip described above.

Group Matrix Multiplication Task Execution Scheme

[0133] As mentioned earlier, in intelligent computing systems, high-performance multi-core systems are typically used as accelerator cards for servers, exchanging data with the host via the PCIe bus. The preceding sections described the group matrix multiplication task scheduling implemented on the host side in this embodiment, i.e., the task splitting process. The host side determines optimization parameters for the splitting scheme based on the optimization objective of task splitting and scheduling, including the split sizes block_k, block_n, and block_m, and the splitting strategy (pm, pn) within a single computational unit. The host side then transmits these optimization parameters to the high-performance multi-core system on the device side by launching the kernel. Each computational unit in the multi-core system determines and executes the Group GEMM task it needs to process based on these optimization parameters.

[0134] However, in some scenarios, such as when group matrix multiplication is itself a computational bottleneck, executing matrix multiplication one by one in a loop on a multi-core system is extremely inefficient. For example, if there are 6 groups, and M and N are relatively small (e.g., 512), while K is relatively large, distributing each group (i.e., one matrix multiplication) across 12 operation units and then looping through all 6 groups would be very inefficient. For instance, the computational efficiency on a certain multi-core system might only be slightly over 60%, even though this scenario is a computational bottleneck on that system. If the entire group matrix multiplication task is considered together, splitting M into two parts and N into one, resulting in 6 groups and a total of 12 blocks of output, and then distributing them across 12 operation units, the computational efficiency can reach over 90%. Moreover, in group matrix multiplication tasks, M, K, and N for each group are dynamically variable, making it difficult to effectively implement group matrix multiplication tasks using previous general matrix multiplication schemes. Therefore, the execution of group matrix multiplication tasks on the device side needs further adjustments to improve processing efficiency.

[0135] Figure 9 illustrates a grid-based abstraction of a matrix multiplication task according to some embodiments of this disclosure. As shown in Figure 9, a matrix multiplication task can be abstracted as a three-dimensional cuboid, where the horizontal coordinate X represents dimension N (the column dimension of the right-hand matrix), the vertical coordinate Y represents dimension M (the row dimension of the left-hand matrix), and the depth coordinate Z represents dimension K (the column dimension of the left-hand matrix and the row dimension of the right-hand matrix). This cuboid is divided into grid cells, each representing the task of one split block, whose sizes along X, Y, and Z are block_n, block_m, and block_k, respectively. Each grid cell can be identified by its index along X, Y, and Z: gridId.x is the index in the split of the N dimension, gridId.y is the index in the split of the M dimension, and gridId.z is the index in the split of the K dimension. Since matrix multiplication accumulates along the K dimension, the output matrix has size M×N and corresponds to the XY plane. The output matrix can therefore be divided into multiple output matrix blocks according to the grid on the XY plane, with each output matrix block corresponding to one task block.

[0136] Since the sizes of M, K, and N are variable in group matrix multiplication tasks, these task blocks can be uniformly numbered, allowing the computation task to be executed efficiently from the perspective of the entire group matrix multiplication task. For example, a counter `executed_blocks` can be maintained to indicate the number of task block grids already processed, and the task block grids can be sequentially distributed to the various computation units of the multi-core system.

[0137] In some embodiments, a method for performing group matrix multiplication operations on a multi-core system may include: dividing each group of matrix multiplications in the group matrix multiplication operation task into multiple task blocks according to predetermined task splitting information; and uniformly managing the task blocks of each group of matrix multiplications, with each operation unit cyclically executing the task blocks according to the grid order of the M and N dimensions of the output matrix block corresponding to the task block to generate the corresponding output matrix block, where M is the row dimension of the output matrix and N is the column dimension of the output matrix. The predetermined task splitting information may be determined and transmitted by the host according to the aforementioned group matrix multiplication task scheduling scheme, or it may be determined in other ways; the embodiments of this disclosure do not limit this aspect.

[0138] Figure 10 illustrates the task block processing method for a group matrix multiplication task according to some embodiments of this disclosure.

[0139] Figure 10 shows the output matrix blocks for a group matrix multiplication task with group=2. The horizontal axis represents the N dimension, and the vertical axis represents the M dimension. Each grid cell represents the output matrix block corresponding to a task block, with a size of block_n × block_m. The first group has an N-dimension size of N1 = 10 × block_n and an M-dimension size of M1 = 3 × block_m; the second group has an N-dimension size of N2 = 9 × block_n and an M-dimension size of M2 = 2 × block_m. Assuming there are 8 computational units involved, deviceId represents the hardware identifier of each computational unit, with deviceId = 0 to 7 corresponding to these 8 units. This group matrix multiplication task is implemented through multiple rounds of computation loops. The same shading is used to represent the same round of computation.

[0140] As shown in the diagram, in each task loop, the eight output matrix blocks are sequentially assigned to eight computational units for parallel processing. In the fourth task loop, eight output matrix blocks are assigned to eight computational units across groups. Because the task blocks for matrix multiplication across groups are managed uniformly, cross-group assignment is possible. This cross-group assignment method fully utilizes the computing power of available computational units in each loop, improving the overall execution efficiency of the group matrix multiplication task.
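The round structure of the Figure 10 example can be checked with a short sketch. It assumes a flat, row-major numbering of task blocks across groups, which is one straightforward reading of "managed uniformly"; the function name is hypothetical, and groups are numbered from 0 in the code.

```python
from typing import List, Tuple


def rounds_by_group(blocks_per_group: List[int], device_dim: int) -> List[List[Tuple[int, int]]]:
    """Split a flat numbering of task blocks into rounds of device_dim blocks.

    Each entry is (group_id, block index inside that group).
    """
    flat = [(g, b) for g, n in enumerate(blocks_per_group) for b in range(n)]
    return [flat[i:i + device_dim] for i in range(0, len(flat), device_dim)]


# Figure 10: the first group has 3 x 10 blocks, the second 2 x 9, on 8 units.
rounds = rounds_by_group([3 * 10, 2 * 9], device_dim=8)
for i, blocks in enumerate(rounds, start=1):
    groups = sorted({g for g, _ in blocks})
    print(f"round {i}: groups {groups}")
```

With 30 + 18 = 48 blocks and 8 units, this gives six full rounds, and the fourth round takes the last six blocks of the first group together with the first two blocks of the second group.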

[0141] Furthermore, since matrix multiplication accumulates along the K dimension of the input data, in some embodiments, the computational unit can iterate along the K dimension when processing an assigned task block until the corresponding output matrix block is obtained.

[0142] Figure 11 shows a logical flowchart of the task block cyclic execution process according to some embodiments of this disclosure. As shown in Figure 11, in step 1101, the parameters are first initialized. These parameters include, for example, the number of executed task blocks (executed_blocks), the matrix multiplication group identifier (group_id) currently being processed, and the task block identifier (block_id) currently being processed.

[0143] Next, in step 1102, it is determined whether there are any unprocessed matrix multiplication groups. For example, this can be determined by whether the current group number is less than the total number of groups. If so, proceed to step 1103; otherwise, skip directly to step 1108 to end the process.

[0144] In step 1103, the total number of task blocks BLKS in the split of the current matrix multiplication is calculated, and the current upper limit of the total task blocks is updated to upper_bound = executed_blocks + BLKS.

[0145] In step 1104, it is determined whether `block_id` is valid. For example, `block_id` is valid if it is less than the total task block limit `upper_bound` and greater than or equal to the number of executed task blocks `executed_blocks`. If so, proceed to step 1105; otherwise, jump to step 1107, which moves on to the next matrix multiplication group and updates the number of executed task blocks `executed_blocks`. As can be seen from the update of `block_id` in step 1106, checking whether `block_id` is valid essentially checks whether the current matrix multiplication group still has enough task blocks to allocate to the `deviceDim` operation units in a single task loop.

[0146] In step 1105, the corresponding task block in the current matrix multiplication group is located and executed. During the execution of a single task block, the product results are accumulated cyclically along the K-dimensional axis to obtain the final output matrix block.

[0147] Next, in step 1106, the task loop parameter block_id is updated to block_id + deviceDim, where deviceDim represents the number of computation units. By increasing block_id by deviceDim each time, it is ensured that there are enough task blocks allocated to deviceDim computation units in each task loop, and task blocks can be processed across groups in a single task loop.

[0148] Next, return to step 1104 and continue the next task loop.

[0149] The above task block execution process can be represented by the following pseudocode:
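A minimal Python sketch of steps 1101 to 1108, as seen from one computational unit, is given below. It is a sketch under assumptions rather than the original pseudocode: the initial value `block_id = device_id` and the `execute` callback are hypothetical, and the K-dimension accumulation of step 1105 is left to that callback.

```python
from typing import Callable, List


def run_device(device_id: int, device_dim: int, blocks_per_group: List[int],
               execute: Callable[[int, int], None]) -> None:
    """Walk the task blocks of all groups that belong to one computational unit."""
    executed_blocks = 0                       # step 1101: initialize parameters
    group_id = 0
    block_id = device_id                      # assumption: start at the unit's own id
    while group_id < len(blocks_per_group):   # step 1102: any group left?
        upper_bound = executed_blocks + blocks_per_group[group_id]  # step 1103
        while executed_blocks <= block_id < upper_bound:            # step 1104
            execute(group_id, block_id - executed_blocks)           # step 1105
            block_id += device_dim                                   # step 1106
        group_id += 1                                                # step 1107
        executed_blocks = upper_bound
    # step 1108: all groups processed


# Trace the Figure 10 example: 30 and 18 task blocks on 8 units.
trace = []
for dev in range(8):
    run_device(dev, 8, [30, 18], lambda g, b, d=dev: trace.append((d, g, b)))
print([(g, b) for d, g, b in trace if d == 6])
```

Because `block_id` keeps counting across groups, a unit moves into the next group without waiting for a round boundary, which is what allows a single task loop to span two groups.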

[0150] The above describes the execution flow of a group matrix multiplication task on a multi-core system, where the task blocks of each group of matrix multiplications are considered uniformly and distributed sequentially to the computational units for processing. As can be seen from the above flow, by considering the task blocks of each group of matrix multiplications uniformly, the performance degradation caused by group-based processing due to the variability of M, N, and K in different groups can be avoided. In the above execution process, the embodiments disclosed herein also provide numerous optimization measures to further improve processing efficiency.

[0151] In some embodiments, based on the memory architecture of the multi-core system described above in conjunction with Figure 7, a pipelining approach can be used to implement task block processing, thereby improving the parallelism and efficiency of task block processing. For example, for a multi-core system with a three-level cache architecture of global memory (first-level storage unit), shared memory (second-level storage unit), and local memory (L1 cache, third-level storage unit), a four-level pipelining approach can be used to perform task block matrix multiplication processing.

[0152] Figure 12 schematically illustrates the storage of data at different storage levels. As shown, the input matrices A and B, which require matrix multiplication, are stored in global memory 1210, which may be, for example, the off-chip DRAM 710 shown in Figure 7. A pipeline for loading the input matrices A and B is implemented on shared memory (SMEM) 1220. In some embodiments, the output matrix C is not cached in shared memory but resides directly in the local memory of the computation unit, and is then cyclically accumulated along the K-dimensional axis as needed. Smaller matrices A and B are also stored in the local memory 1230 of the computation unit. The figure also shows the matrix data used and calculated at each processor core of the computation unit 1240.

[0153] Data exchange or communication between global memory (first-level storage unit) 1210 and shared memory (second-level storage unit) 1220 can be achieved through a first direct memory access (DMA) interface. For example, referring to Figure 3, the first DMA interface can be a GDMA interface. Data exchange or communication between shared memory (second-level storage unit) 1220 and local memory (third-level storage unit) 1230 can be achieved through a second DMA interface. For example, referring to Figure 4, the second DMA interface can be an MVDMA interface. In some embodiments, a third DMA interface can exist between local memory (third-level storage unit) 1230 and global memory (first-level storage unit) 1210. For example, referring to Figure 4, the third DMA interface can be an IODMA interface.

[0154] As can be seen from the above description, the second-level storage unit 1220 and the third-level storage unit 1230 support parallel processing, and data communication between different functional units can be achieved through different DMA interfaces, thereby making full use of parallelism to realize data-level pipeline.

[0155] In some embodiments, the aforementioned four-stage pipeline can be implemented using global memory (off-chip memory, first-stage storage unit), shared memory (second-stage storage unit), local memory (third-stage storage unit), and processor cores (specifically, the computational circuitry therein), thereby forming a four-stage pipeline of data transfer—data transfer—computation—store. In this pipeline, off-chip memory, shared memory, and local memory can all be processed in parallel, and processor cores can also be processed in parallel with off-chip memory, shared memory, and / or local memory.

[0156] Specifically, in some implementations, the shared memory can be configured with at least two shared memory regions to support simultaneous data access between one shared memory region and off-chip memory via a first DMA interface, and data access between the other shared memory region and local memory via a second DMA interface different from the first DMA interface. Each configured shared memory region can be used for time-sharing storage of input data blocks and / or corresponding output data blocks.

[0157] In some implementations, the local memory can be configured with multiple local memory areas to support simultaneous data access between one local memory area and shared memory via a second DMA interface, while the processor performs computational processing on data in another local memory area. Each local memory area configured in the local memory is used to time-share input data blocks and output data blocks as the results of computational processing.

[0158] Furthermore, in some computational processes that require caching, the local memory can also be configured with a computation cache area to temporarily store data for the processor core's computational processing, such as caching intermediate results accumulated over K dimensions. In some embodiments, the final computation result of the computation cache area can be directly stored in off-chip memory via a third DMA interface from the local memory, without going through shared memory.

[0159] Figure 13 illustrates more intuitively the four-stage pipelined process of a matrix multiplication task according to some embodiments of this disclosure. As shown in Figure 13, the first stage 1310 represents moving data from global memory (off-chip memory) to shared memory (SMEM), the second stage 1320 represents moving data from shared memory (SMEM) to local memory, the third stage 1330 represents the processor core performing matrix multiplication operations using data from local memory, and the fourth stage 1340 represents the final loop iteration along the K dimension and storing the result back to off-chip memory.

[0160] In implementation, each of the above pipeline stages can be uniformly encapsulated into an operator Op. For example, the first stage is slowMemPromptOp, the second stage is fastMemPromptOp, the third stage is matmulOp, and the fourth stage is tailOp. The pipeline execution process in the K-dimension loop can be represented by the following pseudocode:

[0161] The pseudocode above illustrates a pipeline in a K-dimensional loop. In this pipeline, slowMemPromptOp, fastMemPromptOp, and matmulOp (or tailOp) are all parallel. syncOp acts as an intra-kernel synchronization barrier within a single computational unit, using a producer / consumer pattern.

[0162] Furthermore, as can be seen from the pseudocode above, two operations are executed in parallel: loading data from off-chip memory to shared memory for the first K-dimension iteration of the current task block (gridId.z=0), i.e., slowMemPromptOp(), and computing the last K-dimension iteration of the previous task block and storing the result back to off-chip memory, i.e., tailOp(currentMatmulInfo). This allows the K-dimension loops of consecutive task blocks to be connected end to end, achieving a seamless pipeline.
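The overlap between stages inside one K-dimension loop can be visualized with a generic software-pipeline schedule. This is an illustrative sketch only: it assumes each stage takes one time step, models a single task block, and does not reproduce the synchronization through syncOp or the overlap between consecutive task blocks; the function name is hypothetical.

```python
from typing import List


def pipeline_schedule(num_k_blocks: int) -> List[List[str]]:
    """List the ops that are active at each time step for one task block."""
    stages = ["slowMemPromptOp", "fastMemPromptOp", "matmulOp"]
    steps = []
    for t in range(num_k_blocks + len(stages) - 1):
        active = []
        for s, name in enumerate(stages):
            k = t - s  # K block handled by stage s at time step t
            if 0 <= k < num_k_blocks:
                # The last K block is finished by tailOp, which also stores C.
                if name == "matmulOp" and k == num_k_blocks - 1:
                    name = "tailOp"
                active.append(f"{name}(k={k})")
        steps.append(active)
    return steps


for t, ops in enumerate(pipeline_schedule(3)):
    print(t, ops)
```

In the steady state all three stages are busy at once, each on a different K block, which is the parallelism the pipeline is meant to expose.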

[0163] For the architecture described in Figure 3, slowMemPromptOp is initiated by the SMEM storage core within the cluster.

[0164] In some embodiments, when the number of K-dimensional partition blocks is 1, a data residency operation can be triggered, thereby reducing data memory access processing. For example, when the number of K-dimensional partition blocks is 1, if the data to be loaded already exists in the shared memory, it is not necessary to load the data again, and the relevant processing unit can be notified through a message passing mechanism.

[0165] Specifically, in some implementations, as shown in the pseudocode above, `slowMemPromptOp` records information about the previously copied tensor, `previousSlowMemTensors`. Before performing the copy, this Op compares the information of the tensor to be copied with that of the previously copied tensor (note that matrices A and B are compared separately). If they are the same, the copy will not be triggered. Simultaneously, `slowMemPromptOp` passes this information to the next Op, `fastMemPromptOp`, via `prepare_fastMemOp`. Upon receiving this information, the Op will also not trigger the copy from shared memory (SMEM) to local memory (L1 cache), thus achieving data residency.
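The residency check can be sketched as a comparison against the last copied tile, kept separately for A and B. The `TensorInfo` descriptor, the class name, and the `copies` counter are hypothetical stand-ins; a real implementation would issue a DMA copy and forward the returned flag to fastMemPromptOp.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class TensorInfo:
    """Hypothetical descriptor of a tile: base address and shape."""
    address: int
    rows: int
    cols: int


class SlowMemPrompt:
    """Tracks the previously copied tile of each input matrix."""

    def __init__(self) -> None:
        self.previous: Dict[str, Optional[TensorInfo]] = {"A": None, "B": None}
        self.copies = 0

    def load(self, name: str, info: TensorInfo) -> bool:
        """Return True if a copy was issued, False if the tile is already resident."""
        if self.previous[name] == info:
            return False          # same tile as last time: skip the copy
        self.previous[name] = info
        self.copies += 1          # stands in for the copy into shared memory
        return True


op = SlowMemPrompt()
tile_a = TensorInfo(0x1000, 128, 256)
print(op.load("A", tile_a))                        # first use of this A tile
print(op.load("B", TensorInfo(0x8000, 256, 64)))   # B is tracked separately
print(op.load("A", tile_a))                        # same A tile again
```

The boolean result plays the role of the information passed on via `prepare_fastMemOp`: when it is False, the next stage can skip its own copy as well.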

[0166] During the execution of group matrix multiplication tasks, it was found that performance was relatively poor when the remainder segments after splitting the M or N dimensions according to the block size in a certain group loop were very small. Therefore, optionally or additionally, in some embodiments, the size of the task block for the current group can be adjusted in real time to effectively improve processing performance.

[0167] Specifically, the remainder of the task block in the corresponding dimension is determined based on the split size block_n in N dimensions and the split size block_m in M dimensions; then the remainder of the task block in the corresponding dimension is compared with the corresponding predetermined value; based on the comparison result, the split size of the split block in the corresponding dimension of the current group is adjusted.

[0168] Based on the comparison results, the split size of the dimension with the remainder less than the predetermined value is adjusted. If both remainders are less than their respective predetermined values, both are adjusted.

[0169] In some implementations, the split size can be adjusted according to the following formula:

[0170] current_block_m=min(PAD_UP(ceil(m,m_loop),pm*align_m),block_m); (9)

[0171] current_block_n=min(PAD_UP(ceil(n,n_loop),pn*align_n),block_n); (10)

[0172] In the formula above, `m_loop = ceil(m, block_m)` represents the number of blocks in dimension M; `n_loop = ceil(n, block_n)` represents the number of blocks in dimension N. `PAD_UP(X, Y)` equals the result of dividing X by Y, rounded up, and then multiplied by Y. `PAD_UP(ceil(m, m_loop), pm*align_m)` represents the new split size after evenly distributing the M-dimensional blocks according to the original number of blocks while considering the M-dimensional alignment requirements. `min()` means taking the minimum value between the new split size and the original split size `block_m` as the adjusted split size `current_block_m`. The meaning of the N-dimensional block is similar and will not be repeated here.
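Formulas (9) and (10) translate directly into code. In the sketch below, `align` stands for `pm*align_m` (or `pn*align_n`), and the sizes in the example are made-up values chosen to show a small remainder.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up, written ceil(a, b) in the text."""
    return -(-a // b)


def pad_up(x: int, y: int) -> int:
    """Round x up to the nearest multiple of y."""
    return ceil_div(x, y) * y


def adjust_block(dim: int, block: int, align: int) -> int:
    """Formulas (9)/(10): spread dim evenly over the original number of blocks."""
    loops = ceil_div(dim, block)  # m_loop or n_loop
    return min(pad_up(ceil_div(dim, loops), align), block)


# m = 520 with block_m = 256 would split into 256, 256 and a remainder of 8.
# With pm * align_m = 16 the adjusted size is 176, giving 176, 176, 168.
print(adjust_block(520, 256, 16))
# An exact multiple keeps the original block size.
print(adjust_block(512, 256, 16))
```

The number of blocks stays the same after the adjustment; only the work per block is evened out.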

[0173] Therefore, by adjusting the block size in real time in those cases of a group matrix multiplication task where a small remainder would lead to relatively poor performance, the split can be adapted more precisely to the dimension sizes of each group, yielding a finer-grained efficiency improvement.

How to fully utilize the characteristics of global caches for performance optimization?

[0174] As described above with reference to Figure 7, some multi-core systems may also include the L2 cache 720 shown in the figure. The L2 cache 720 may include an LLC (the last-level cache, also referred to as the global cache in this document). LLCs can have various operating modes, including data latching, streaming access, and flushing back to system memory. When the LLC's data latching mode is enabled, the characteristics of LLC latching data access can be fully utilized to maximize the utilization of data residing within the LLC, achieving a bandwidth expansion effect. Therefore, in multi-core systems that also include a global cache, the global cache can be used to cache data so that computational units can preferentially load the data that needs to be cached from the global cache.

[0175] Specifically, on a hardware architecture with LLC (Global Cache), the characteristics of LLC can be fully utilized to rearrange the order of the output matrix blocks, improve the data reuse rate in LLC, thereby indirectly improving the computation-to-memory ratio or reducing the amount of memory access.

[0176] Figure 18 shows a flowchart of a method for performing computational tasks on a multi-core system according to an embodiment of this disclosure. The task includes group matrix multiplication tasks, which comprise multiple sets of matrix multiplications, each with a variable matrix size. The multi-core system includes multiple computational units and a global cache. Each computational unit includes one or more processor cores. The global cache is used to cache data so that the computational unit can preferentially load the data to be cached from the global cache. The method includes:

[0177] Step 1801): Based on the predetermined task splitting information, split each group matrix multiplication in the group matrix multiplication operation task into multiple task blocks;

[0178] Step 1802): The multi-core system adjusts, in the corresponding dimension, the grid order of the task blocks of the output matrix in the group matrix multiplication operation task;

[0179] Step 1803): The multi-core system manages the task blocks of each group of matrix multiplications in a unified manner. Each operation unit executes task blocks following the adjusted grid order of the output matrix in the M and N dimensions, so that in each loop the multi-core system computes a grid of Sm×Sn output matrix blocks. Here, M is the row dimension of the output matrix, N is the column dimension of the output matrix, Sm is the number of output matrix blocks in the M dimension, and Sn is the number of output matrix blocks in the N dimension.

[0180] In some embodiments, the rearrangement scheme for the output matrix blocks may include: adjusting the grid order of the output matrix blocks in the M and N dimensions such that, in each loop, the multi-core system obtains a grid of Sm×Sn output matrix blocks in one computation, where Sm×Sn = num_cluster and num_cluster represents the number of computational units in the multi-core system.

[0181] To determine Sm and Sn, num_cluster can be factorized into x and y, that is, x * y = num_cluster. The combination of x and y that optimizes the memory access ratio or the memory access volume is then selected.
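The candidate (x, y) combinations are simply the ordered factor pairs of num_cluster, which a short helper can enumerate (the function name is hypothetical):

```python
from typing import List, Tuple


def factor_pairs(num_cluster: int) -> List[Tuple[int, int]]:
    """All ordered pairs (x, y) of positive integers with x * y == num_cluster."""
    return [(x, num_cluster // x)
            for x in range(1, num_cluster + 1)
            if num_cluster % x == 0]


print(factor_pairs(8))
print(factor_pairs(12))
```

Both orderings of a pair are kept, since 2×4 and 4×2 reuse different amounts of the two input matrices.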

[0182] Since an output matrix block may consist of the sum of multiple split blocks in the K-dimensional dimension, the combination of x and y can be determined according to different principles depending on the splitting of the K-dimensional dimension, corresponding to the rearranged Sm and Sn, which in turn determines the data that needs to be cached.

[0183] When the K-dimensional value of a task block is greater than block_k (i.e., there are multiple split blocks in K-dimensionality), the optimization target can be the overall computation-to-memory ratio of the multi-core system. This maximizes the computation-to-memory ratio of the multi-core system after adjusting the output matrix blocks, and then the data to be cached is determined based on the maximum computation-to-memory ratio of the multi-core system. The expression for calculating the memory access ratio is: In other words, the combination of x and y that maximizes the memory access ratio is chosen as the values of Sm and Sn, thus determining the data that needs to be cached.

[0184] When the K-dimensional value of the task block is less than or equal to block_k (that is, the K-dimensional is not split and there is only one split block), it is necessary to consider the cases of rearranging the output matrix C along the M-dimensional and N-dimensional respectively, as well as the corresponding rearrangement size in different cases, and take the minimum memory access (IO amount) of the input data (such as matrix A and matrix B) of the group matrix multiplication operation task as the optimization target.

[0185] Specifically, after adjusting the grid order of the output matrix blocks in the N dimension, the minimum I/O amount of the input data with matrix A resident is determined among the combinations of x and y obtained by factoring num_cluster. The I/O amount IOA of the input data with matrix A resident can be calculated as follows:

[0186]

[0187] According to formula (11), select the combination of x and y that minimizes IOA.

[0188] Similarly, after adjusting the grid order of the output matrix blocks in the M dimension, the minimum I/O amount of the input data with matrix B resident is determined among the combinations of x and y obtained by factoring num_cluster. The I/O amount IOB of the input data with matrix B resident can be calculated as follows:

[0189]

[0190] According to formula (12), select the combination of x and y that minimizes IOB.

[0191] Then, the minimum I / O value is selected from the minimum I / O value of the input data residing in matrix A and the minimum I / O value of the input data residing in matrix B. Based on this minimum I / O value, the data to be cached is determined. That is, the combination of x and y that minimizes memory access is selected as the values of Sm and Sn, thus determining the data to be cached. It can be seen that in a multi-core system with LLC, to fully utilize the characteristics of LLC, the output matrix C is rearranged, which changes the calculation order of the output blocks and improves the data reuse rate in LLC. In the above processing, the output I / O value is not considered because the on-chip storage I / O value is constant and will not increase unnecessarily; therefore, the output I / O value is a common factor and does not need to be considered separately.

[0192] In the specific implementation, we can first traverse all combinations of x and y to find the smallest IOA and its corresponding x and y, assuming it is recorded as (x1, y1, IO1); then traverse all combinations of x and y to find the smallest IOB and its corresponding x and y, assuming it is recorded as (x2, y2, IO2); then compare the two IO quantities IO1 and IO2, and select the smallest IO quantity and its corresponding x and y combination, thus determining the rearrangement scheme, including the rearrangement direction and rearrangement pattern Sm×Sn.
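The two-pass traversal described above can be sketched as follows. The cost functions `io_a` and `io_b` are hypothetical callables standing in for formulas (11) and (12), and the toy costs in the example are placeholders with no physical meaning; breaking a tie in favor of resident A is likewise an assumption.

```python
from typing import Callable, Sequence, Tuple

Cost = Callable[[int, int], float]


def choose_rearrangement(pairs: Sequence[Tuple[int, int]],
                         io_a: Cost, io_b: Cost) -> Tuple[str, int, int, float]:
    """Return (resident matrix, x, y, I/O amount) with the smallest I/O amount."""
    x1, y1 = min(pairs, key=lambda p: io_a(*p))   # best pair with A resident
    io1 = io_a(x1, y1)
    x2, y2 = min(pairs, key=lambda p: io_b(*p))   # best pair with B resident
    io2 = io_b(x2, y2)
    if io1 <= io2:                                # assumption: ties go to A
        return ("A", x1, y1, io1)
    return ("B", x2, y2, io2)


pairs = [(1, 8), (2, 4), (4, 2), (8, 1)]
choice = choose_rearrangement(
    pairs,
    io_a=lambda x, y: 100 * x + 10 * y,   # placeholder cost, not formula (11)
    io_b=lambda x, y: 20 * x + 30 * y,    # placeholder cost, not formula (12)
)
print(choice)
```

The returned label identifies which matrix stays resident and therefore the rearrangement direction, while (x, y) gives the pattern Sm×Sn.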

[0193] Figure 14 illustrates a schematic diagram of rearranging the processing order of output matrix blocks according to some embodiments of this disclosure, wherein output matrix blocks processed in the same loop use the same background color or shading. In these embodiments, for simplicity, only the matrix multiplication of one group is shown; however, those skilled in the art will understand that the above rearrangement method can be applied to the matrix multiplication tasks of the entire group. This embodiment assumes that the value of the K dimension is greater than block_k.

[0194] As shown in Figure 14, the upper grid diagram represents the processing order of the output matrix blocks before rearrangement. In this example, the M dimension is M = 4 × block_m, the N dimension is N = 10 × block_n, and 8 computational units are involved in the calculation; deviceId = 0 to 7 corresponds to these 8 computational units. Following a row-first, column-later order, in each loop, 8 adjacent output matrix blocks are assigned to the 8 computational units for parallel processing. For example, in the first loop, 1 × 8 output matrix blocks are calculated. It can be seen that in this loop, if reuse is implemented based on the LLC, only one block of the left multiplication matrix A can be reused in the LLC; matrix B is not reused at all. The computation-to-memory ratio at this time is...

[0195] The lower grid diagram of Figure 14 represents the processing order of the rearranged output matrix blocks. Assuming that in this embodiment, following the rearrangement principle described above, the combination of x and y that maximizes the computation-to-memory ratio is 2×4, then after rearrangement, 2×4 output matrix blocks are calculated in the first loop. It can be seen that in this loop, if reuse is implemented based on the LLC, both matrix A and matrix B can be reused: two blocks of A and four blocks of B are reused, resulting in a computation-to-memory ratio of [value missing]. Compared to before the rearrangement, the computation-to-memory ratio is improved. Similarly, three more loops each compute 2×4 output matrix blocks, and one loop computes 4×2 output matrix blocks. In that last loop, matrices A and B can also be reused, with 4 blocks of A and 2 blocks of B being reused.
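The reuse effect in the first loop of Figure 14 can be checked with a small count. The sketch below is illustrative only: it assumes each output block (m, n) needs row block m of A and column block n of B, and the helper name `distinct_input_blocks` is hypothetical.

```python
def distinct_input_blocks(output_blocks):
    """Return (#distinct A row blocks, #distinct B column blocks) for one loop."""
    a_rows = {m for m, _ in output_blocks}
    b_cols = {n for _, n in output_blocks}
    return len(a_rows), len(b_cols)


# One loop on 8 computational units, as (m, n) output block coordinates.
before = [(0, n) for n in range(8)]                    # 1x8, row-first order
after = [(m, n) for m in range(2) for n in range(4)]   # 2x4 after rearrangement

a_before, b_before = distinct_input_blocks(before)
a_after, b_after = distinct_input_blocks(after)
print(a_before + b_before, a_after + b_after)   # prints "9 6"
```

Both loops produce 8 output blocks, but the 2×4 pattern touches 6 distinct input blocks per K step instead of 9, which is the source of the improved computation-to-memory ratio.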

[0196] Figure 15 illustrates a schematic diagram of rearranging the processing order of output matrix blocks according to other embodiments of this disclosure, wherein the output matrix blocks processed in the same loop use the same background color or shading. In these embodiments, for simplicity, only the matrix multiplication of a single group is shown. This embodiment assumes that the value of the K dimension is less than or equal to block_k, i.e., only one split block exists in the K dimension, so that the residency technique can be used.

[0197] As shown in Figure 15, the grid diagram above represents the processing order of the output matrix blocks before rearrangement. This part is similar to Figure 14 and will not be described again.

[0198] The lower grid diagram of Figure 15 represents the processing order of the rearranged output matrix blocks. In this embodiment, assuming the rearrangement principle is followed, the rearrangement scheme that minimizes I/O is to rearrange along the N dimension with a rearrangement pattern of 4×2. After rearrangement, 4×2 output matrix blocks are calculated in the first loop. It can be seen that in this loop, if reuse is implemented based on the LLC, both matrix A and matrix B can be reused: four blocks of A and two blocks of B are reused. Combined with the residency technique, the block of matrix A used by each computational unit only needs to be loaded once and resides on-chip thereafter without reloading. Before rearrangement, taking the first computational unit deviceId = 0 as an example, matrix A was reused only once, between the first and second loops; the other loops would trigger a reload.

[0199] The above rearrangement process can be represented by the following pseudocode:
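As a rough illustration only, and not a reproduction of the disclosure's own pseudocode, one possible tile-ordered traversal is sketched below. The function name `rearranged_order` is hypothetical, and edge tiles are simply clipped to the grid, whereas Figure 14 handles the leftover columns with a separate 4×2 loop.

```python
def rearranged_order(grid_m, grid_n, sm, sn):
    """List output block coordinates (m, n) tile by tile, each tile Sm x Sn.

    Tiles are visited along N first, then along M; blocks inside a tile are
    visited row by row. Tiles that overhang the grid are clipped.
    """
    order = []
    for m0 in range(0, grid_m, sm):
        for n0 in range(0, grid_n, sn):
            for m in range(m0, min(m0 + sm, grid_m)):
                for n in range(n0, min(n0 + sn, grid_n)):
                    order.append((m, n))
    return order


# Grid of Figure 14: 4 blocks in M, 10 blocks in N, pattern 2x4.
order = rearranged_order(4, 10, 2, 4)
first_loop = order[:8]   # the 8 blocks handed to the 8 computational units
```

With this ordering the first 8 blocks form the 2×4 tile covering rows 0 to 1 and columns 0 to 3, matching the first loop described for the rearranged grid.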

[0200] The implementation schemes for various group matrix multiplication tasks described above are mainly for scenarios with variable M dimensions. However, in future training scenarios, scenarios with variable K dimensions will be encountered, which will lead to a very unbalanced load.

[0201] Figure 16 schematically illustrates a situation of unbalanced load. As shown in the figure, suppose that according to the task distribution, the i-th computing unit is assigned to compute Groups 0, 2, and 4, all of which have relatively small K values, while the j-th computing unit is assigned to compute Groups 1 and 3, both of which have very large K values. In this case, the i-th computing unit will remain idle for a long time, resulting in very low overall hardware utilization.

[0202] More generally, during MoE network inference, in the encoder phase, a Group GEMM task may encounter a mixed bottleneck of computation and memory access. For example, in a scenario where M is variable, some experts might be allocated a very large number of tokens, reaching thousands or even tens of thousands, while others are allocated only a few or a few dozen. Such tasks face the same problem described above for variable K: some computational units quickly complete their computational tasks and remain idle in the later stages, while other computational units continuously perform matrix multiplication at full load. These situations can all be categorized as load imbalance.

[0203] High-performance multi-core systems are typically designed to handle highly parallel computing tasks. To simplify processor design, they generally do not include complex task scheduling logic. Therefore, to address the aforementioned load imbalance, conventional design approaches analyze the impact of various factors before allocating tasks to computational units, and then split the tasks so as to distribute them as evenly as possible. However, in scenarios where the K dimension is variable, the K value differs from group to group, so a task splitting scheme has to be analyzed and designed separately for each group, which increases the workload. Furthermore, if the K value is not known beforehand, effective task splitting cannot be performed in advance.

[0204] In light of this, the inventors deviated from conventional design thinking. Instead of attempting to construct complex task balancing algorithms to implement task splitting schemes, which then distribute tasks to computing units that passively accept and execute them, the inventors allowed computing units to proactively request the next computing task based on their own performance—that is, to achieve adaptive scheduling. This fully utilizes the parallel computing power of multiple computing units, thereby completing group matrix multiplication tasks at a faster speed.
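The benefit of letting units request work can be illustrated with a toy timing model. This is a sketch under stated assumptions: the task costs are made-up numbers loosely following Figure 16, tasks are indivisible, and the function names `static_makespan` and `adaptive_makespan` are hypothetical.

```python
import heapq


def static_makespan(costs, num_units):
    """Finish time when task i is fixed in advance to unit i % num_units."""
    totals = [0] * num_units
    for i, cost in enumerate(costs):
        totals[i % num_units] += cost
    return max(totals)


def adaptive_makespan(costs, num_units):
    """Finish time when each unit requests the next task as soon as it is free."""
    free_at = [(0, unit) for unit in range(num_units)]   # (time free, unit id)
    heapq.heapify(free_at)
    finish = 0
    for cost in costs:
        start, unit = heapq.heappop(free_at)   # earliest free unit takes the task
        heapq.heappush(free_at, (start + cost, unit))
        finish = max(finish, start + cost)
    return finish


# Groups 0, 2, 4 have small K (cost 1); groups 1 and 3 have large K (cost 100).
costs = [1, 100, 1, 100, 1]
print(static_makespan(costs, 2), adaptive_makespan(costs, 2))   # prints "200 102"
```

In this toy case the static assignment leaves one unit idle for almost the whole run, while request-driven assignment roughly halves the finish time without any prior knowledge of the K values.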

[0205] In some embodiments, until the group matrix multiplication task is completed, each computational unit in the multi-core system cyclically executes the following: completing the current task block and requesting the next task block.

[0206] Specifically, a counter can be maintained in global memory to record the number of task blocks that have been executed. After each computational unit finishes its current task block computation (e.g., completing the loading of the last K-dimensional block), it immediately requests the next task block. This ensures that each computational unit always has computational tasks available, maximizing the overall performance of Group GEMM.

[0207] Figure 17 shows a schematic logic flowchart of adaptive scheduling for executing group matrix multiplication tasks according to some embodiments of this disclosure.

[0208] As shown in Figure 17, in step 1700, space is first allocated in global memory to store the task block counter, and this space is initialized. The task block counter records the number of executed task blocks. In some implementations, the process by which a computational unit claims a task block is implemented using the atomic operation atomicAdd, which increments the counter by 1 and returns the value the counter held before the increment; this returned value is assigned to the current block identifier block_id. The counter therefore needs to be initialized to 0. For example, memset can be used, or a kernel can be launched, to perform the initialization counter = 0.

[0209] Then the kernel for the group matrix multiplication task is started, and the process enters the matrix multiplication task block execution loop.

[0210] In step 1701, relevant parameters are initialized. These parameters include, for example, the number of executed task blocks (executed_blocks) described earlier, the identifier group_id of the currently processed matrix multiplication group, and the identifier block_id of the currently processed task block. Furthermore, where shared memory is present, to allow all processor cores within a computational unit to share the allocated task block, a counter (smem_task_counter) can be set up in the shared memory to hold the task assigned to that computational unit.

[0211] Then, each group is looped through. For example, in step 1702, it is determined whether there are any unprocessed matrix multiplication groups. If so, proceed to step 1703; otherwise, skip directly to step 1708 to end.

[0212] In step 1703, the information of the group is obtained, including M, K, N and the leading dimension sizes, the starting addresses of matrices A, B, and C, and the upper limit of the total task blocks. Specifically, this includes calculating the total number of task blocks BLKS into which the current matrix multiplication is split, and updating the current upper limit of the total task blocks to upper_bound = executed_blocks + BLKS.

[0213] In step 1704, if the current task block identifier block_id is valid (block_id is greater than or equal to executed_blocks and less than upper_bound), the process enters a task loop and proceeds to step 1705; otherwise, it jumps to step 1707 to prepare for the next matrix multiplication and update the number of executed task blocks executed_blocks.

[0214] In step 1705, at the beginning of the loop, the corresponding task block in the current matrix multiplication group is located and executed. Specifically, the position of the block to be calculated in the grid space of the output matrix block in the group is calculated, and then the product results are accumulated in the K-dimensional loop to obtain the final output matrix block, thus completing the calculation of the task block.

[0215] After a computational unit completes the current task block, a processor core within that unit can request the next task block using the `atomicAdd` instruction. Specifically, in step 1706, the next task block is requested, the task loop parameter is updated as `block_id = atomicAdd(&counter, 1)`, and the process returns to step 1704 to continue with the next task loop. Updating the task loop parameter may include accessing the counter maintained in global memory and incrementing the task block counter by 1, thereby updating the number of executed task blocks. Then, based on the result of updating the task block counter, the information for the next task block is determined. For example, the result of the update serves as the `block_id` of the next task block.

[0216] A task block request needs to be initiated by only one processor core within the computational unit; the result of the request must be written to shared memory and made known to the other processor cores within the computational unit.

[0217] In step 1704, if the block_id of the requested task block is less than upper_bound, the current group has not yet completed its calculation, and the process continues to step 1705 for loop processing. If the block_id of the requested task block has reached or exceeded upper_bound, the loop is exited and the process proceeds to step 1707, where the number of executed task blocks (executed_blocks) is updated, and then the calculation of the next group begins.
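Steps 1700 to 1707 can be sketched on a host CPU as follows. This is a minimal simulation, not the device kernel: threads stand in for computational units, the hypothetical `AtomicCounter` class imitates `atomicAdd` by returning the value before the increment, and the K loop of step 1705 is replaced by recording which block was claimed.

```python
import threading


class AtomicCounter:
    """Stand-in for the global-memory task block counter (atomicAdd semantics)."""

    def __init__(self):
        self._value = 0                      # step 1700: counter = 0
        self._lock = threading.Lock()

    def fetch_add(self, delta=1):
        # Add delta and return the value held BEFORE the addition.
        with self._lock:
            old = self._value
            self._value += delta
            return old


def ceil_div(a, b):
    return -(-a // b)


def run_unit(unit_id, groups, block_m, block_n, counter, done):
    executed_blocks = 0                      # step 1701
    block_id = counter.fetch_add(1)          # first request
    for group_id, (m, _k, n) in enumerate(groups):        # step 1702
        grid_n = ceil_div(n, block_n)                     # step 1703
        blks = ceil_div(m, block_m) * grid_n
        upper_bound = executed_blocks + blks
        while executed_blocks <= block_id < upper_bound:  # step 1704
            local = block_id - executed_blocks            # step 1705
            # A real kernel would run the K loop here; this only records the block.
            done.append((unit_id, group_id, local // grid_n, local % grid_n))
            block_id = counter.fetch_add(1)               # step 1706
        executed_blocks = upper_bound                     # step 1707


def run_group_gemm(groups, num_units, block_m, block_n):
    counter, done = AtomicCounter(), []
    threads = [threading.Thread(target=run_unit,
                                args=(u, groups, block_m, block_n, counter, done))
               for u in range(num_units)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

Because every unit walks the groups in the same order and compares its claimed `block_id` against the same running `upper_bound`, each task block is claimed by exactly one unit, regardless of how quickly the individual units progress.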

[0218] Within a processing unit, a processor core uses the atomicAdd instruction to request a task, i.e., block_id = atomicAdd(&counter, 1), where block_id is the block id of the output matrix block corresponding to the requested task block. Then, the processor core writes the task to smem_task_counter[] on the shared memory SMEM and inserts a synchronization instruction (the synchronization range may vary slightly depending on the multi-core system; for example, in the multi-core system in Figure 3, all processor cores within the cluster synchronize). Then, each processor core retrieves the current task id from smem_task_counter.
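The request-and-broadcast pattern within one computational unit can be sketched as below. This is a host-side simulation under assumptions: threads stand in for processor cores, a one-element list stands in for `smem_task_counter` in shared memory, `threading.Barrier` stands in for the synchronization instruction, and the function name `run_cluster` is hypothetical.

```python
import itertools
import threading


def run_cluster(num_cores, total_blocks, request_next):
    smem_task_counter = [0]                  # one slot in simulated shared memory
    barrier = threading.Barrier(num_cores)   # stands in for the cluster-wide sync
    seen = [[] for _ in range(num_cores)]

    def core(core_id):
        while True:
            if core_id == 0:                 # only one core issues the request
                smem_task_counter[0] = request_next()
            barrier.wait()                   # result is now visible to every core
            block_id = smem_task_counter[0]
            barrier.wait()                   # all cores have read before the next write
            if block_id >= total_blocks:
                return
            seen[core_id].append(block_id)   # every core works on the same task block

    threads = [threading.Thread(target=core, args=(c,)) for c in range(num_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


global_counter = itertools.count()           # simulated global-memory counter
seen = run_cluster(num_cores=4, total_blocks=5,
                   request_next=lambda: next(global_counter))
```

The second barrier is a design choice of this sketch: it prevents the requesting core from overwriting the shared slot before the slower cores have read the current task id.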

[0219] In some embodiments, considering that pipelined computation of task blocks can be used, the timing of requesting a task block does not have to wait until the current task block is completed. Instead, it can be requested by the processor core immediately after the last split block in the K-dimensional loop is loaded (that is, in response to the execution of loading data from off-chip memory in the last K-dimensional loop of the current task block, a request can be initiated immediately), thus maintaining seamless pipeline continuity and allowing the process to continue. The above process of requesting a task block can be represented by the following pseudocode:
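As an illustration only, and not the disclosure's own pseudocode, the timing of the early request can be modelled as an event sequence. The sketch assumes a sequential model with at least one K step per task block; the function name `run_blocks` and the event labels are hypothetical.

```python
import itertools


def run_blocks(num_blocks, k_steps, request_next):
    """Record the order of load / request / compute / store events per task block."""
    events = []
    block_id = request_next()
    while block_id < num_blocks:
        next_id = None
        for k in range(k_steps):
            events.append(("load", block_id, k))        # off-chip -> on-chip
            if k == k_steps - 1:
                # Last K slice has been loaded: request the next block now,
                # before the remaining compute and store of this block.
                next_id = request_next()
                events.append(("request", block_id, next_id))
            events.append(("compute", block_id, k))
        events.append(("store", block_id))
        block_id = next_id
    return events


counter = itertools.count()                  # simulated task block counter
events = run_blocks(num_blocks=2, k_steps=3, request_next=lambda: next(counter))
```

In the recorded sequence the request for the next block always precedes the final compute and the store of the current block, which is what lets a pipelined implementation start loading the next block without a gap.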

[0220] As can be seen from the device-side group matrix multiplication task execution scheme described above, by considering the task blocks split from each group uniformly, the task blocks processed in a single task loop can span groups. Therefore, the computing power of available computational units is fully utilized in each loop, avoiding idle computational units and improving the execution efficiency of the entire group matrix multiplication task. Furthermore, in some embodiments, a four-stage pipeline can be used to process the block matrix multiplication task. The pipeline approach can improve processing parallelism and increase processing efficiency. During the pipeline process, the head and tail of the K-dimensional loop can be identified, achieving seamless transition between the beginning and end. Matrix residency technology can also be adopted in some scenarios during the pipeline process to reduce memory access. In some embodiments, the size of the split blocks can be adjusted in real time according to the actual dimensionality of the currently processed group to better adapt to the current group and improve processing performance. In some embodiments, the characteristics of LLC can be fully utilized to rearrange the output matrix blocks, thereby improving the reuse rate in LLC, increasing the computation-to-memory ratio, or reducing memory access. In some embodiments, for unbalanced load situations, an adaptive scheduling scheme can be provided to fully utilize the computing power of computational units, avoid idleness, and improve utilization.

[0221] Accordingly, this disclosure also provides a multi-core device, including a processor and a memory, wherein: the processor is configured to execute program instructions; the memory is configured to store the program instructions; and when the program instructions are loaded and executed by the processor, the processor performs the method for performing computational tasks on a multi-core system as described in any of the preceding embodiments. This disclosure also provides a computer-readable storage medium storing program instructions that, when loaded and executed by a processor, cause the processor to perform the method for performing computational tasks on a multi-core system as described in any of the preceding embodiments. This disclosure also provides a computer program product, including a computer program or instructions that, when executed by a processor, implement the method for performing computational tasks on a multi-core system as described in any of the preceding embodiments. This disclosure also provides a chip, including the multi-core system described above. This disclosure also provides a board, including the chip described above.

[0222] It should be noted that, for the sake of brevity, this disclosure describes some methods and their embodiments as a series of actions and combinations thereof. However, those skilled in the art will understand that the solutions disclosed herein are not limited by the order of the described actions. Therefore, based on the disclosure or teachings of this document, those skilled in the art will understand that some steps can be performed in a different order or simultaneously. Furthermore, those skilled in the art will understand that the embodiments described in this disclosure can be considered optional embodiments, that is, the actions or modules involved are not necessarily essential for the implementation of one or more solutions disclosed herein. In addition, depending on the solution, the description of some embodiments in this disclosure may have different emphases. In view of this, those skilled in the art will understand that parts not described in detail in a certain embodiment of this disclosure can also be referred to the relevant descriptions of other embodiments.

[0223] In terms of specific implementation, based on the disclosure and teachings of this document, those skilled in the art will understand that several embodiments disclosed herein can also be implemented in other ways not disclosed herein. For example, regarding the various units in the electronic device or apparatus embodiments described above, this document divides them based on logical functions, but in actual implementation, there may be other division methods. As another example, multiple units or components can be combined or integrated into another system, or some features or functions in a unit or component can be selectively disabled. Regarding the connection relationships between different units or components, the connections discussed above in conjunction with the accompanying drawings can be direct or indirect couplings between units or components. In some scenarios, the aforementioned direct or indirect couplings involve communication connections utilizing interfaces, where the communication interface can support electrical, optical, acoustic, magnetic, or other forms of signal transmission.

[0224] In this disclosure, the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units. The aforementioned components or units may be located in the same location or distributed across multiple network units. Furthermore, depending on actual needs, some or all of the units can be selected to achieve the purpose of the solution described in the embodiments of this disclosure. Additionally, in some scenarios, multiple units in the embodiments of this disclosure may be integrated into one unit or each unit may exist physically independently.

[0225] In some implementation scenarios, the integrated unit described above can be implemented as a software program module. If implemented as a software program module and sold or used as an independent product, the integrated unit can be stored in a computer-readable memory. Therefore, when the disclosed solution is embodied in a software product (e.g., a computer-readable storage medium), the software product can be stored in a memory and may include several instructions that cause a computer device (e.g., a personal computer, server, or network device) to execute some or all of the steps of the method described in the embodiments of this disclosure. The aforementioned memory may include, but is not limited to, various media capable of storing program code, such as USB flash drives, flash memory, read-only memory (ROM), random access memory (RAM), portable hard drives, magnetic disks, or optical disks.

[0226] In other implementation scenarios, the integrated units described above can also be implemented in hardware, i.e., as specific hardware circuits, which may include digital circuits and / or analog circuits. The physical implementation of the circuit's hardware structure may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors. Therefore, the various devices described herein (e.g., computing devices or other processing devices) can be implemented using appropriate hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Furthermore, the aforementioned storage units or storage devices can be any suitable storage medium (including magnetic storage media or magneto-optical storage media), such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high-bandwidth memory (HBM), hybrid memory cube (HMC), ROM, and RAM.

[0227] The foregoing can be better understood in accordance with the following clauses:

[0228] Clause 1. A method for performing computational tasks on a multi-core system, the tasks including group matrix multiplication tasks, the group matrix multiplication tasks including multiple sets of matrix multiplications, the matrix size of each set of matrix multiplications being variable, the multi-core system including multiple computational units and a global cache, each computational unit including one or more processor cores, the global cache being used to cache data so that the computational unit can preferentially load the data to be cached from the global cache; the method includes:

[0229] Based on the predetermined task splitting information, each group matrix multiplication in the group matrix multiplication operation task is split into multiple task blocks;

[0230] The multi-core system adjusts the grid order, in the corresponding dimension, of the task blocks of the output matrix in the group matrix multiplication operation task.

[0231] The multi-core system manages the task blocks of each group of matrix multiplications in a unified manner. Each operation unit performs the operation in the order of the adjusted task blocks of the output matrix in the M-dimensional and N-dimensional grids. This allows the multi-core system to obtain Sm×Sn grid-sized output matrix blocks after each loop calculates the task blocks of the input matrix. Here, M is the row dimension of the output matrix, N is the column dimension of the output matrix, Sm is the number of output matrix blocks in the M-dimensional dimension, and Sn is the number of output matrix blocks in the N-dimensional dimension.

[0232] Clause 2. The method described in Clause 1, wherein the computational unit executes task blocks as follows:

[0233] The computational unit iterates over the K dimension to generate the corresponding output matrix block, where K is the column dimension of the left multiplication matrix and the row dimension of the right multiplication matrix.

[0234] Clause 3. The method according to Clause 1, wherein the step of adjusting the grid order of the task blocks of the output matrix in the corresponding dimension in the group matrix multiplication type operation task of the multi-core system further includes:

[0235] When the value of the K dimension of the task block is greater than block_k, the grid order of the task blocks of the output matrix in the group matrix multiplication operation task is adjusted in the M and N dimensions so that the computation-to-memory ratio of the multi-core system corresponding to the adjusted output matrix blocks reaches its maximum, and the grid order of the task blocks of the output matrix in the corresponding dimension is determined based on this maximum computation-to-memory ratio. The expression for the computation-to-memory ratio is: where x*y = num_cluster, num_cluster represents the number of computational units, and x and y are obtained by factorizing num_cluster;

[0236] When the value of the K dimension of the task block is less than or equal to block_k: after the grid order of the output matrix blocks is adjusted in the N dimension, the minimum memory access amount of the input data of the group matrix multiplication operation task corresponding to the left multiplication matrix A is determined among the combinations of x and y obtained by factorizing num_cluster; after the grid order of the output matrix blocks is adjusted in the M dimension, the minimum memory access amount of the input data of the group matrix multiplication operation task corresponding to the right multiplication matrix B is determined among the combinations of x and y obtained by factorizing num_cluster; the minimum memory access amount of the multi-core system is selected from the minimum memory access amount corresponding to the left multiplication matrix A and the minimum memory access amount corresponding to the right multiplication matrix B; and the grid order of the task blocks of the output matrix in the corresponding dimension is determined based on the minimum memory access amount of the multi-core system.

[0237] Clause 4. The method according to Clause 3, wherein the method further comprises:

[0238] When the K-dimensional value of the task block is greater than block_k, the data that the global cache needs to cache is determined based on the maximum computation-to-memory ratio of the multi-core system.

[0239] When the K-dimensional value of the task block is less than or equal to block_k, the data that the global cache needs to cache is determined based on the minimum memory access amount of the multi-core system.

[0240] Clause 5. The method according to Clause 2, wherein the computational unit includes shared memory, and the execution of the task block is implemented using a four-stage pipeline, wherein the four-stage pipeline is implemented through the off-chip memory of the multi-core system, the shared memory, the local memory, and the processor core.

[0241] Clause 6. The method according to Clause 5, wherein the shared memory is configured with at least two shared memory regions in the four-stage pipeline:

[0242] While accessing data between one of the shared memory areas and the off-chip memory via the first DMA interface, accessing data between another shared memory area and the local memory via a second DMA interface different from the first DMA interface.

[0243] Clause 7. The method according to Clause 6, wherein the local memory is configured with a plurality of local memory areas in the four-stage pipeline:

[0244] While accessing data between one of the local memory areas and the shared memory via the second DMA interface, the processor core performs matrix multiplication on data in another local memory area.

[0245] Clause 8. The method according to Clause 7, wherein, in the four-stage pipeline:

[0246] While data is loaded from off-chip memory into the shared memory in the first K-dimension loop iteration of the current task block, the calculation of the last K-dimension loop iteration of the previous task block is performed and the result is stored back into the off-chip memory.

[0247] Clause 9. The method according to any one of Clauses 1-8, wherein the step of each computational unit of the multi-core system performing computations in an M-dimensional and N-dimensional grid order according to the task blocks of the adjusted output matrix includes:

[0248] The operation unit performs the following operation cyclically until the group matrix multiplication operation task is completed:

[0249] Complete the current task block; and

[0250] Request the next task block.

[0251] Clause 10. The method according to Clause 9, wherein the computational unit requests the next task block including:

[0252] The request is initiated by one of the processor cores in the computational unit through an atomic operation.

[0253] Clause 11. The method according to Clause 10, wherein the processor core initiates the request via an atomic operation comprising:

[0254] The request is initiated in response to the loading of data from off-chip memory having been executed in the last K-dimension loop iteration of the current task block.

[0255] Clause 12. The method according to any one of Clauses 10-11, wherein the processor core initiates the request via an atomic operation comprising:

[0256] Update the task block counter in the global memory of the multi-core system, wherein the task block counter is used to record the number of executed task blocks; and

[0257] Based on the update result of the task block counter, the information of the next task block is determined.

[0258] Clause 13. An apparatus for performing computational tasks on a multi-core system, comprising a processor and memory, wherein:

[0259] The processor is configured to execute program instructions;

[0260] The memory is configured to store the program instructions;

[0261] When the program instructions are loaded and executed by the processor, the processor performs the method of performing computational tasks on a multi-core system as described in any of Clauses 1-12.

[0262] Clause 14. A computer-readable storage medium storing program instructions that, when loaded and executed by a processor, cause the processor to perform a method for performing a computational task on a multi-core system according to any one of Clauses 1-12.

[0263] Clause 15. A computer program product comprising a computer program or instructions which, when executed by a processor, implement the method of performing a computational task on a multi-core system as described in any one of Clauses 1-12.

[0264] Clause 16. A chip including the apparatus for performing computational tasks on a multi-core system as described in Clause 13.

[0265] Clause 17. A board including the chip described in Clause 16.

[0266] While numerous embodiments of this disclosure have been shown and described herein, it will be apparent to those skilled in the art that such embodiments are provided by way of example only. Many modifications, alterations, and alternatives will occur to those skilled in the art without departing from the spirit and intent of this disclosure. It should be understood that various alternatives to the embodiments of this disclosure described herein may be employed in the practice of this disclosure. The appended claims are intended to define the scope of this disclosure and therefore cover equivalents or alternatives within the scope of these claims.

Claims

1. A task scheduling method based on a multi-core system, wherein the tasks include group matrix multiplication operations, the group matrix multiplication operations include multiple groups of matrix multiplications, the matrix size of each group of matrix multiplications is variable, the multi-core system includes multiple computing units, each computing unit includes one or more processor cores, and each processor core has local memory, the method including: obtaining the layout information of the group matrix multiplication operation task; determining, based on the layout information, the split size block_k in the K dimension and the splitting strategy within a single computing unit, where K is the column dimension of the left multiplication matrix and the row dimension of the right multiplication matrix; determining, based on the determined block_k, the split size block_n in the N dimension and the split size block_m in the M dimension so as to satisfy an optimization objective, where M is the row dimension of the left multiplication matrix, N is the column dimension of the right multiplication matrix, and the optimization objective includes any of the following: memory access, processor core utilization; and scheduling, based on the determined split sizes block_k, block_n, and block_m, the group matrix multiplication operations to be executed in parallel on the computing units of the multi-core system.

2. The method according to claim 1, wherein obtaining the layout information comprises: obtaining the maximum values m_max, k_max, and n_max of the M, K, and N dimensions, respectively, across the multiple groups of matrix multiplications.
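
As a minimal illustration of the step recited in claim 2, the following Python sketch scans a list of per-group sizes and returns the three maxima. The function name `max_dims` and the representation of each group as an `(M, K, N)` tuple are assumptions made for this example; the claims do not prescribe a data structure.

```python
from typing import Iterable, Tuple


def max_dims(shapes: Iterable[Tuple[int, int, int]]) -> Tuple[int, int, int]:
    """Return (m_max, k_max, n_max) over per-group (M, K, N) sizes.

    Hypothetical helper: each group computes an (M x K) by (K x N) product.
    """
    shapes = list(shapes)
    if not shapes:
        raise ValueError("at least one group is required")
    m_max = max(m for m, _, _ in shapes)
    k_max = max(k for _, k, _ in shapes)
    n_max = max(n for _, _, n in shapes)
    return m_max, k_max, n_max


# Three groups with different sizes, as in a variable-size grouped GEMM.
groups = [(128, 512, 64), (32, 1024, 256), (300, 256, 128)]
m_max, k_max, n_max = max_dims(groups)
```

Note that the three maxima may come from different groups, so (m_max, k_max, n_max) bounds the workload rather than describing any single matrix multiplication.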

3. The method according to claim 2, wherein obtaining the maximum values comprises any of the following: in response to the dimension data being on the host side, searching the dimension data for the maximum values; in response to the dimension data being on the device side, retrieving the dimension data from the device side and searching it for the maximum values; or in response to the dimension data being on the device side, determining the maximum value of each dimension based on prior knowledge.

4. The method according to any one of claims 1-3, wherein determining the split size block_k in the K dimension comprises: in response to the group matrix multiplication task being a computation bottleneck task, setting block_k to meet the efficiency requirement of the instruction that supports matrix multiplication in the multi-core system; or in response to the group matrix multiplication task being a memory access bottleneck task, setting block_k to prioritize the memory access efficiency of the largest matrix, wherein the largest matrix is determined based on the size of the left-hand matrix and the size of the right-hand matrix.

5. The method according to claim 4, wherein the method further comprises: when the storage dimension in which the K dimension is located is the lowest dimension, setting block_k to a multiple of the cache line size, wherein the multiple is a natural number; or when the storage dimension in which the K dimension is located is not the lowest dimension, setting block_k to a multiple of the alignment granularity align_k of the instruction that implements matrix multiplication, wherein the multiple is a natural number.
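
The alignment rule of claim 5 can be sketched as follows. This is only an illustration under stated assumptions: the function name `choose_block_k` and its parameters are hypothetical, and the cache line size is converted from bytes to elements, which is one possible reading of "a multiple of the cache line size".

```python
def choose_block_k(k_is_lowest_dim: bool,
                   cache_line_bytes: int,
                   elem_bytes: int,
                   align_k: int,
                   multiple: int = 1) -> int:
    """Pick block_k as a natural-number multiple of an alignment unit.

    Assumption: when K is the lowest (contiguous) storage dimension, the
    unit is the number of elements in one cache line; otherwise it is the
    alignment granularity align_k of the matrix multiplication instruction.
    """
    if multiple < 1:
        raise ValueError("multiple must be a natural number (>= 1)")
    if k_is_lowest_dim:
        unit = cache_line_bytes // elem_bytes  # elements per cache line
    else:
        unit = align_k
    return multiple * unit


# K contiguous in memory, 128-byte cache lines, 2-byte elements.
block_k_contiguous = choose_block_k(True, 128, 2, align_k=32, multiple=2)
# K not contiguous: fall back to the instruction alignment granularity.
block_k_strided = choose_block_k(False, 128, 2, align_k=32, multiple=3)
```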

6. The method according to any one of claims 4-5, wherein the group matrix multiplication task is determined to be a computation bottleneck task when the following condition is met, and is otherwise a memory access bottleneck task: [formula not reproduced in the source text], wherein CGMA_DEV is the compute-to-memory-access ratio of the multi-core system; TFLOPS denotes trillions of floating-point operations per second; Bandwidth_DRAM is the bandwidth of the off-chip DRAM, in GB/s; group is the number of groups of matrix multiplications; and sizeof() denotes the data bit width.

7. The method according to any one of claims 1-6, wherein determining the splitting strategy within a single computing unit comprises: determining the number of processor cores, Ncore, within a single computing unit; and splitting a split block into pm*pn parts, wherein pm*pn = Ncore, pm is the number of splits in the M dimension, and pn is the number of splits in the N dimension.

8. The method according to claim 7, wherein the splitting comprises any of the following: distributing pm and pn evenly across the M and N dimensions; or in response to either the M dimension or the N dimension of the split block being smaller than a corresponding threshold, splitting the other dimension.
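
One way to realize the intra-unit split of claims 7 and 8 is sketched below. The function name `split_within_unit`, the threshold parameters `min_m` and `min_n`, and the choice of the factor pair closest to the square root of Ncore as the "even" distribution are all assumptions made for this example.

```python
import math
from typing import Tuple


def split_within_unit(block_m: int, block_n: int, n_core: int,
                      min_m: int = 1, min_n: int = 1) -> Tuple[int, int]:
    """Return (pm, pn) with pm * pn == n_core.

    min_m and min_n are hypothetical thresholds below which a dimension
    is considered too small to be split across cores.
    """
    if n_core < 1:
        raise ValueError("n_core must be positive")
    if block_m < min_m:
        return 1, n_core          # M too small: split only along N
    if block_n < min_n:
        return n_core, 1          # N too small: split only along M
    # Balanced case: largest divisor of n_core not exceeding sqrt(n_core).
    pm = math.isqrt(n_core)
    while n_core % pm != 0:
        pm -= 1
    return pm, n_core // pm


balanced = split_within_unit(256, 256, n_core=4)
narrow_m = split_within_unit(8, 256, n_core=4, min_m=16, min_n=16)
```

With a prime Ncore the balanced branch degenerates to a 1 by Ncore split, which is the only factorization that keeps pm*pn = Ncore.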

9. The method according to any one of claims 1-8, wherein determining the split size block_n in the N dimension and the split size block_m in the M dimension comprises: searching a search space of the N dimension for a block_n that satisfies the optimization objective and determining the corresponding block_m; or searching a search space of the M dimension for a block_m that satisfies the optimization objective and determining the corresponding block_n.

10. The method according to claim 9, wherein the search space is constructed as follows: determining the maximum search space in the N dimension or the M dimension based on the split size block_k in the K dimension, the capacity of the local memory, and the splitting strategy within a single computing unit.

11. The method according to any one of claims 9-10, wherein the searching comprises: determining, based on the capacity of the local memory, block_k, and the block_n or block_m of the current search iteration, the corresponding block_m or block_n; calculating the corresponding optimization objective based on the block_m or block_n corresponding to the current search iteration; and determining, based on the optimization objective of each search iteration within the search space, the block_n and block_m that correspond to the optimal optimization objective.
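
The search loop of claims 9 to 11 can be sketched as below, searching along N and deriving block_m from the memory budget. Several parts are assumptions and not taken from the claims: the local-memory model (one A tile, one B tile, and one C tile resident at once), the traffic estimate `traffic_proxy` used as a stand-in objective (the claimed formulas are not reproduced in this text), the fixed search step, and the omission of the intra-unit split from the search bound.

```python
from typing import Tuple


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def derive_block_m(capacity_bytes: int, block_k: int, block_n: int,
                   elem_bytes: int) -> int:
    """Largest block_m such that the assumed footprint
    (block_m*block_k + block_k*block_n + block_m*block_n) elements fits."""
    budget = capacity_bytes // elem_bytes - block_k * block_n
    return max(budget // (block_k + block_n), 0)


def traffic_proxy(m: int, n: int, k: int, block_m: int, block_n: int) -> int:
    """Stand-in objective: elements read when A is re-read once per column
    of tiles and B once per row of tiles (not the claimed formula)."""
    return m * k * ceil_div(n, block_n) + k * n * ceil_div(m, block_m)


def search_blocks(m_max: int, k_max: int, n_max: int, block_k: int,
                  capacity_bytes: int, elem_bytes: int = 2,
                  step: int = 16) -> Tuple[int, int]:
    """Return (block_n, block_m) minimizing the stand-in objective."""
    best = None
    for block_n in range(step, n_max + step, step):
        block_m = min(derive_block_m(capacity_bytes, block_k, block_n,
                                     elem_bytes), m_max)
        if block_m < 1:
            break  # larger block_n leaves no room for any block_m
        cost = traffic_proxy(m_max, n_max, k_max, block_m, block_n)
        if best is None or cost < best[0]:
            best = (cost, block_n, block_m)
    if best is None:
        raise ValueError("no feasible split fits in local memory")
    return best[1], best[2]


block_n, block_m = search_blocks(1024, 1024, 1024, block_k=64,
                                 capacity_bytes=64 * 1024)
```

Searching along M instead is symmetric: iterate over block_m and derive block_n from the same budget.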

12. The method according to any one of claims 1-11, wherein the optimization objective comprises: when the group matrix multiplication task is a memory access bottleneck task, the optimization objective is the memory access volume, which is calculated as follows: [formula not reproduced in the source text], wherein sizeof() denotes the data bit width; or when the group matrix multiplication task is a computation bottleneck task, the optimization objective is the processor core utilization, and the processor core utilization DEV_utils is calculated as follows: [formula not reproduced in the source text], wherein CGMA_DEV is the compute-to-memory-access ratio of the multi-core system; TFLOPS denotes trillions of floating-point operations per second; Bandwidth_DRAM is the bandwidth of the off-chip DRAM, in GB/s; group is the number of groups of matrix multiplications; num_cluster is the number of computing units; total_block is the total number of split blocks; and PAD_UP(total_block, num_cluster) equals total_block divided by num_cluster, rounded up, and then multiplied by num_cluster.
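
The PAD_UP operation defined in claim 12 is simple enough to state directly in code. The helper `wave_occupancy` that follows it is an assumption added for illustration only: it shows how evenly the split blocks fill the computing units, and it is not the utilization formula of the claim, which is not reproduced in this text.

```python
def pad_up(total_block: int, num_cluster: int) -> int:
    """Round total_block up to the next multiple of num_cluster."""
    if num_cluster <= 0:
        raise ValueError("num_cluster must be positive")
    return -(-total_block // num_cluster) * num_cluster


def wave_occupancy(total_block: int, num_cluster: int) -> float:
    """Hypothetical indicator: fraction of unit slots that hold real work
    when blocks are issued in waves of num_cluster."""
    padded = pad_up(total_block, num_cluster)
    return total_block / padded if padded else 0.0


padded_blocks = pad_up(10, 4)        # 10 blocks on 4 units need 3 waves
occupancy = wave_occupancy(6, 4)     # 6 useful blocks out of 8 slots
```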

13. The method according to any one of claims 1-12, wherein scheduling the group matrix multiplication operation to be executed in parallel on the computing units comprises: splitting each group of matrix multiplication in the group matrix multiplication task into multiple split blocks according to the determined split sizes block_k, block_n, and block_m; and distributing the split blocks cyclically to the computing units for execution, following the grid order of the M and N dimensions of the output matrix blocks.
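
The cyclic distribution of claim 13 can be sketched as a round-robin assignment over the output tile grid. The row-major traversal (M outer, N inner), the continuation of the round-robin counter across groups, and the function name `distribute_blocks` are assumptions made for this example; the K dimension is not distributed here, on the assumption that each unit accumulates over K in steps of block_k for the tiles it owns.

```python
from typing import List, Sequence, Tuple

Tile = Tuple[int, int, int]  # (group index, tile row in M, tile column in N)


def distribute_blocks(shapes: Sequence[Tuple[int, int, int]],
                      block_m: int, block_n: int,
                      num_cluster: int) -> List[List[Tile]]:
    """Assign output tiles of every group to computing units cyclically."""
    assignments: List[List[Tile]] = [[] for _ in range(num_cluster)]
    idx = 0  # running tile counter shared by all groups
    for g, (m, _k, n) in enumerate(shapes):
        tiles_m = -(-m // block_m)  # ceiling division
        tiles_n = -(-n // block_n)
        for tm in range(tiles_m):
            for tn in range(tiles_n):
                assignments[idx % num_cluster].append((g, tm, tn))
                idx += 1
    return assignments


# Two groups of different sizes, 2x2 output tiles, three computing units.
plan = distribute_blocks([(4, 8, 4), (2, 8, 6)], block_m=2, block_n=2,
                         num_cluster=3)
```

Sharing one counter across groups keeps the units balanced even when an individual group produces fewer tiles than there are units.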

14. A task scheduling apparatus based on a multi-core system, comprising a processor and a memory, wherein: the processor is configured to execute program instructions; the memory is configured to store the program instructions; and when the program instructions are loaded and executed by the processor, the processor performs the task scheduling method according to any one of claims 1-13.

15. A computer-readable storage medium storing program instructions that, when loaded and executed by a processor, cause the processor to perform the task scheduling method for a multi-core system according to any one of claims 1-13.

16. A computer program product comprising a computer program or instructions which, when executed by a processor, implement the task scheduling method for a multi-core system according to any one of claims 1-13.

17. A processing apparatus comprising the task scheduling apparatus based on a multi-core system according to claim 14.

18. A chip comprising the processing apparatus according to claim 17.

19. A circuit board comprising the chip according to claim 18.