Methods for performing task scheduling and related products

By employing a dual-thread architecture in the artificial intelligence processor, tasks are split into prefetching and actual tasks, enabling parallel execution of tasks. This solves the problems of high complexity and large thread switching overhead in multithreading technology, and improves task execution efficiency and performance.

CN117234674B · Active · Publication Date: 2026-06-30 · CAMBRICON TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: CAMBRICON TECH CO LTD
Filing Date: 2022-06-07
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing multithreading technology is highly complex in artificial intelligence processors, with large thread switching overhead and unstable performance gains.

Method used

A dual-threaded architecture is adopted, which splits tasks into prefetch tasks and actual tasks, and starts executing the prefetch task of the next task while the actual task of the current task is being executed, thus achieving parallel execution of tasks.

Benefits of technology

It reduces the complexity of thread switching, improves the parallelism and speed of task execution, and achieves stable performance gains.

✦ Generated by Eureka AI based on patent content.

Abstract

This disclosure relates to a method for performing task scheduling and related products, wherein the related products include a task scheduler, an artificial intelligence processor, a device, a board, and a computer-readable storage medium. The device may be included in a computing processing unit of a combined processing apparatus, which may include one or more data processing units. The aforementioned combined processing apparatus may also include interface devices and other processing units. The computing processing unit interacts with other processing units to jointly complete user-specified computational operations. The combined processing apparatus may also include a storage device connected to the device and other processing units respectively, for storing data from the device and other processing units. The solution of this disclosure can optimize scheduling operations and achieve parallel processing of tasks in multi-tasking scenarios.

Description

Technical Field

[0001] This disclosure generally relates to the field of computers. More specifically, this disclosure relates to a method for performing task scheduling, a task scheduler for performing the aforementioned method, an artificial intelligence processor, a board, an apparatus, and a computer-readable storage medium.

Background Technology

[0002] To improve parallel processing performance, traditional central processing units ("CPUs") typically employ multithreading techniques in their microarchitecture design, and the same applies to graphics processing units ("GPUs") in the field of artificial intelligence. The advantage of multithreading is that it fully exploits parallelism between threads and thus provides a higher level of parallelism. Its disadvantages are increased hardware complexity and increased thread switching overhead: more threads require more complex control logic, the cost of switching between them grows accordingly, and the net benefit is not always positive. Therefore, reducing the complexity of multithreading while achieving stable performance gains is a pressing issue that needs to be addressed.

Summary of the Invention

[0003] In view of the technical problems mentioned in the background section above, this disclosure proposes a scheme for efficient task scheduling. Using the scheme of this disclosure, a dual-threaded architecture with relatively low complexity and good performance gains can be achieved. Therefore, this disclosure provides solutions for task scheduling in the following aspects.

[0004] In a first aspect, this disclosure provides a task scheduler disposed in an artificial intelligence processor, the artificial intelligence processor further comprising execution circuitry for performing tasks, the task scheduler comprising: a first sending circuitry configured to send a prefetch task of a subsequent task to the execution circuitry during the execution of the actual task of the current task, wherein tasks in the task scheduler are split into prefetch tasks and actual tasks that are associated with each other; and a second sending circuitry configured to send the actual task of the subsequent task to the execution circuitry after the execution circuitry has completed the execution of the prefetch task of the subsequent task, such that the execution circuitry executes the actual task of the subsequent task after the actual task of the current task has been completed.

[0005] In a second aspect, this disclosure provides an artificial intelligence processor, comprising: an execution circuit configured to perform a plurality of tasks; and a task scheduler according to the first aspect, configured to interact with the execution circuit so that the execution circuit performs the scheduled plurality of tasks.

[0006] In a third aspect, this disclosure provides a board including an artificial intelligence processor as described in the second aspect.

[0007] In a fourth aspect, this disclosure provides a method for performing task scheduling, comprising: during the execution of an actual task of a current task by an execution circuit, sending a prefetch task of a subsequent task to the execution circuit, wherein the task is split into prefetch tasks and actual tasks that are associated with each other; and after the execution circuit has completed the execution of the prefetch task of the subsequent task, sending the actual task of the subsequent task to the execution circuit, such that the execution circuit executes the actual task of the subsequent task after the actual task of the current task has been completed.

[0008] In a fifth aspect, this disclosure provides an apparatus for scheduling and executing tasks, comprising: a processor; and a memory storing program instructions for scheduling tasks, wherein when the program instructions are executed by the processor, the various embodiments described above and discussed below are performed.

[0009] In a sixth aspect, this disclosure provides a computer-readable storage medium storing computer program instructions for task scheduling, which, when executed by a processor, cause the methods described above and several embodiments thereof to be implemented.

[0010] The solutions provided in the foregoing aspects of this disclosure enable a relatively simple and stable dual-threaded architecture for task scheduling. Specifically, this disclosure divides tasks into prefetch tasks and actual tasks, and begins executing the prefetch task of the next task during the execution of the current task's actual task. This ensures that the corresponding prefetch task is completed before the actual task of the next task is executed, thereby improving the parallelism and speed of task execution. Furthermore, by supporting the parallel execution of prefetch tasks and actual tasks, the processor can reduce thread switching overhead, achieve dual-threaded task scheduling, and obtain stable performance gains.

Brief Description of the Drawings

[0011] The above and other objects, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent upon reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present disclosure are illustrated by way of example and not limitation, and like or corresponding reference numerals denote like or corresponding parts, wherein:

[0012] Figure 1 is a simplified block diagram schematically illustrating an artificial intelligence processor according to an embodiment of the present disclosure;

[0013] Figure 2 is a detailed structural block diagram schematically illustrating a task scheduler according to an embodiment of the present disclosure;

[0014] Figure 3 is a simplified flowchart illustrating a method for performing task scheduling according to the present disclosure;

[0015] Figure 4 is a flowchart schematically illustrating details of a method for performing task scheduling according to an embodiment of the present disclosure;

[0016] Figure 5 is a schematic flowchart illustrating a task scheduling method according to an embodiment of the present disclosure;

[0017] Figure 6 is a schematic state transition diagram for performing task scheduling according to an embodiment of the present disclosure;

[0018] Figure 7 is a schematic diagram illustrating the hardware and software architecture of dataflow programming according to embodiments of the present disclosure;

[0019] Figure 8 is a structural diagram of a board according to an embodiment of the present disclosure;

[0020] Figure 9 is a structural diagram illustrating a combined processing apparatus according to an embodiment of the present disclosure;

[0021] Figure 10 is a schematic diagram showing the internal structure of a computing device according to an embodiment of the present disclosure;

[0022] Figure 11 is a schematic diagram illustrating the internal structure of a processor core according to an embodiment of the present disclosure; and

[0023] Figure 12 is a schematic diagram illustrating the data writing process between processor cores of different clusters according to embodiments of the present disclosure.

Detailed Description

[0024] The technical solutions of the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this disclosure, and not all of them. Based on the embodiments of this disclosure, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this disclosure.

[0025] It should be understood that the terms "first," "second," "third," and "fourth," etc., in the claims, specification, and drawings of this disclosure are used to distinguish different objects, rather than to describe a specific order. The terms "comprising" and "including" as used in the specification and claims of this disclosure indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

[0026] It should also be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of this disclosure. As used in this disclosure and claims, the singular forms "a," "an," and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in this disclosure and claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes such combinations.

[0027] As used in this specification and claims, the term "if" may be interpreted, depending on the context, as "when," "once," "in response to determination," or "in response to detection." Similarly, the phrase "if determined" or "if [described condition or event] is detected" may be interpreted, depending on the context, as "once determined," "in response to determination," "once [described condition or event] is detected," or "in response to detection of [described condition or event]."

[0028] As mentioned earlier, to achieve efficient task scheduling and execution, this disclosure proposes a dual-thread mechanism. Specifically, by abstractly dividing the tasks run by the processor into prefetch tasks and actual tasks, and completing the prefetch task of the next task while the actual task of the current task is executing, a "pseudo" dual-thread task scheduling can be achieved. The solution of this disclosure can therefore achieve a certain degree of parallel execution between the current task and the next task, thereby improving the speed and efficiency of task execution and reducing the overhead of thread switching and the complexity of control logic.

[0029] The specific embodiments of this disclosure will now be described in detail with reference to the accompanying drawings.

[0030] Figure 1 is a simplified block diagram schematically illustrating an artificial intelligence ("AI") processor 100 according to an embodiment of the present disclosure. It will be understood that the AI processor described herein may be the AI processor 701 described below with reference to Figure 7 or the computing device 901 shown in Figure 9, and has one or more processor cores so that multiple tasks can be executed in parallel.

[0031] As shown in Figure 1, the artificial intelligence processor 100 may include a task scheduler 102 and an execution circuit 108. Here, the task scheduler can receive one or more tasks from the upper layer of the computing platform and dispatch these tasks to the execution circuit 108 for execution. In some scenarios, task flows from different users (each of which may include one or more tasks) can be dispatched for execution by the task scheduler. In the context of this disclosure, the execution circuit 108 may be the arithmetic logic unit (or computing core) in an artificial intelligence processor, and it can cooperate with the task scheduler to execute the dispatched tasks. Although only one execution circuit 108 is shown in Figure 1, those skilled in the art will understand that the execution circuit 108 of this disclosure is not limited to one. In one implementation scenario, the artificial intelligence processor 100 of this disclosure may also have multiple execution circuits 108 to enable the smooth execution of tasks.

[0032] In one scenario, the task scheduler of this disclosure may include a first sending circuit 104 and a second sending circuit 106. Specifically, the first sending circuit is used to send a prefetch task of a subsequent task to the execution circuit while the execution circuit is executing the actual task of the current task, wherein the tasks in the task scheduler are split into prefetch tasks and actual tasks that are related to each other. Correspondingly, the second sending circuit is used to send the actual task of the subsequent task to the execution circuit after the execution circuit has completed executing the prefetch task of the subsequent task, so that the execution circuit executes the actual task of the subsequent task after the actual task of the current task has been completed.

[0033] In the context of this disclosure, the execution circuit will perform multiple tasks. For ease of description, the task that is about to be executed by the execution circuit is referred to as the current task, and the task immediately following the current task is referred to as the next task. Those skilled in the art will understand that after the current task has been completed by the execution circuit, the aforementioned next task becomes the current task to be executed by the execution circuit, and the task immediately following it becomes the next task.

[0034] As previously stated, to implement a dual-threaded task scheduling mechanism, this disclosure abstracts the tasks executed by the execution circuit into two types of tasks: one is the actual running task (“exe task”), and the other is the task that serves the actual running task (“prefetch task”). The former is referred to as the actual task, and the latter as the prefetch task. Thus, this disclosure divides a task executed by the execution circuit into two parts: the prefetch task and the actual task.

[0035] Regarding prefetch tasks and actual tasks, these two parts of a task can be divided in different ways. As an example, a task can be split into related prefetch tasks and actual tasks via program instructions. These prefetch tasks and actual tasks can be configured to have the same identifier to indicate their association. Alternatively, the task scheduler of this disclosure can be equipped with dedicated functional modules or circuitry for task partitioning to achieve the division of a task into prefetch tasks and actual tasks. In one implementation scenario, prefetch tasks and actual tasks can share a common task identifier to indicate their association and that they constitute a complete task.

[0036] In some scenarios, when a task includes execution steps such as instruction fetch, looking up the Translation Lookaside Buffer ("TLB"), virtual-to-physical address translation (e.g., looking up the page table to find the address mapping), and execution, this disclosure classifies instruction fetch, TLB lookup, and page table lookup (including parameter loading) as steps to be performed by the prefetch task, while classifying the execution steps as the actual task. In some scenarios, when address translation can be performed using the TLB stored on-chip, such as static random access memory ("SRAM"), the operation of looking up the page table on off-chip dynamic random access memory ("DRAM") can be omitted. By scheduling the corresponding prefetch task to be executed before the actual task, operations such as instruction fetch and lookup can be omitted when the actual task is executed, thereby improving the task execution speed and enabling parallel execution of the two types of tasks.
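As a purely illustrative software sketch of the split described above (the disclosure itself describes circuitry or user-written program instructions, not this code), a task might be divided into two halves that share one identifier. The step names, the `SubTask` type, and the `split_task` helper are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical step names; the text only names the kinds of steps involved.
PREFETCH_STEPS = {
    "instruction_fetch",
    "tlb_lookup",
    "page_table_lookup",
    "parameter_load",
}


@dataclass(frozen=True)
class SubTask:
    task_id: int             # shared identifier linking the two halves
    kind: str                # "prefetch" or "actual"
    steps: Tuple[str, ...]   # steps assigned to this half, in original order


def split_task(task_id: int, steps: List[str]) -> Tuple[SubTask, SubTask]:
    """Split one task into an associated prefetch task and actual task."""
    prefetch = tuple(s for s in steps if s in PREFETCH_STEPS)
    actual = tuple(s for s in steps if s not in PREFETCH_STEPS)
    return (
        SubTask(task_id, "prefetch", prefetch),
        SubTask(task_id, "actual", actual),
    )


pre, act = split_task(
    7, ["instruction_fetch", "tlb_lookup", "page_table_lookup", "execute"]
)
```

In this sketch the two halves carry the same `task_id`, which mirrors the shared task identifier used to indicate that a prefetch task and an actual task belong to one complete task.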

[0037] Figure 2 is a detailed structural block diagram schematically illustrating a task scheduler according to an embodiment of the present disclosure. It should be understood that the task scheduler shown in Figure 2 can be regarded as one implementation of the task scheduler shown in Figure 1; therefore, the description of Figure 1 also applies to Figure 2.

[0038] As shown in Figure 2, the task scheduler 102 of this disclosure includes a first sending circuit 104 and a second sending circuit 106. The main functions of these two sending circuits have already been described with reference to Figure 1 and will not be repeated here.

[0039] In one embodiment, the task scheduler 102 further includes a first receiving circuit 108 for receiving tasks that have been split into prefetch tasks and actual tasks associated with each other via program instructions. As previously mentioned, the program instructions here can be code instructions written manually by programmers or users, and the execution of these code instructions causes a task to be split into prefetch tasks and actual tasks. For example, the instruction fetching and address lookup, parameter substitution, and other parts of a task can be assigned to prefetch tasks, while the remaining unexecuted parts of the task can be assigned to actual tasks. Additionally or alternatively, the task scheduler 102 may also include a partitioning circuit 112 for splitting the received task into prefetch tasks and actual tasks associated with each other. In other words, the task scheduler 102 of this disclosure can actively partition a task into prefetch tasks and actual tasks.

[0040] To achieve parallel execution of tasks, the first sending circuit 104 in the task scheduler 102 can be used to send the prefetch task of the next task to the execution circuit at a predetermined time before the execution of the actual task of the current task is completed, so that the execution circuit can execute the prefetch task of the next task during the execution of the actual task of the current task. By executing the prefetch task of the next task while the actual task of the current task is still executing, the scheme of this disclosure realizes parallel task execution under a dual-thread mechanism.

[0041] As previously stated, the prefetch task of this disclosure may include the translation of virtual addresses to physical addresses, and as an implementation, the aforementioned address translation can be achieved through page table lookups, where the page tables are typically stored on off-chip dynamic random access memory (“DRAM”). Based on this, the predetermined timing mentioned above can be determined based on the number of page table levels in the page table lookup and the latency of each page table level. For example, when the number of page table levels is 4, and the lookup time for each page table level is 500 nanoseconds (“ns”), then the predetermined timing of this disclosure can be determined as 4 × 500ns = 2µs (“microseconds”).
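The arithmetic in the example above can be written out as a small sketch. The function names are hypothetical, and `expected_finish_ns` (an estimate of when the current actual task will finish) is an assumption added for illustration; the text only specifies that the lead time follows from the number of page table levels and the per-level latency.

```python
def prefetch_lead_time_ns(page_table_levels: int, latency_per_level_ns: int) -> int:
    """Page-walk time assuming one lookup per page table level."""
    if page_table_levels < 0 or latency_per_level_ns < 0:
        raise ValueError("levels and latency must be non-negative")
    return page_table_levels * latency_per_level_ns


def prefetch_send_time_ns(expected_finish_ns: int, lead_time_ns: int) -> int:
    """Latest time at which to send the next prefetch task (hypothetical helper).

    Clamped at 0 so that a very short task never yields a negative send time.
    """
    return max(0, expected_finish_ns - lead_time_ns)


# 4 levels x 500 ns per level = 2000 ns = 2 microseconds, as in the text.
lead_ns = prefetch_lead_time_ns(4, 500)

# For an actual task assumed to finish at t = 10 000 ns, send at t = 8000 ns.
send_at_ns = prefetch_send_time_ns(10_000, lead_ns)
```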

[0042] In one implementation scenario, the task scheduler 102 may further include a second receiving circuit 110 for receiving a pre-completion indication from the execution circuit for the actual task of the current task. In response to receiving the aforementioned pre-completion indication, the first sending circuit 104 may send a prefetch task for the next task to the execution circuit, so that the execution circuit can release hardware resources to execute the prefetch task of the next task.

[0043] To monitor the execution of actual tasks, the task scheduler may further include a third receiving circuit 114 and a timer (or timing circuit) 118. In operation, the third receiving circuit 114 can receive a completion indication of the prefetch task of the next task from the execution circuit 108. Upon receiving this indication from the execution circuit 108, the timer 118 can be started to time the execution of the current task by the execution circuit. In one scenario, if the timer 118 exceeds a predetermined threshold and the third receiving circuit 114 has not received a completion indication of the prefetch task of the next task from the execution circuit 108, the first sending circuit 104 can resend the prefetch task of the next task to the execution circuit 108 for re-execution. Alternatively, the first sending circuit 104 can send the prefetch task of the next task to another execution circuit different from the execution circuit 108, so that the other execution circuit executes the prefetch task of the next task.

[0044] To ensure that the resent prefetch task can be executed as quickly as possible, a send queue for priority sending tasks can be set up in the task scheduler. In this case, when the timer exceeds a predetermined threshold and no indication is received from the execution circuit 108, the task scheduler 102 can place the prefetch task of the next task into the priority send queue so that the prefetch task of the next task can be resent to the execution circuit 108 or another execution circuit with the highest send priority.
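The timeout and priority resend behavior described above might be modeled as follows. This is a minimal sketch under stated assumptions: `SendQueues`, `on_timer_check`, and the string task labels are hypothetical names, and elapsed time is passed in as a plain number rather than read from a hardware timer.

```python
from collections import deque
from typing import Deque, Optional


class SendQueues:
    """Two FIFO queues; the priority queue is always drained first."""

    def __init__(self) -> None:
        self.priority: Deque[str] = deque()
        self.normal: Deque[str] = deque()

    def next_to_send(self) -> Optional[str]:
        if self.priority:
            return self.priority.popleft()
        if self.normal:
            return self.normal.popleft()
        return None


def on_timer_check(
    queues: SendQueues,
    pending_prefetch: str,
    elapsed_ns: int,
    threshold_ns: int,
    completion_received: bool,
) -> bool:
    """Queue the prefetch task for priority resend if the timer has expired
    without a completion indication. Returns True if a resend was queued."""
    if not completion_received and elapsed_ns > threshold_ns:
        queues.priority.append(pending_prefetch)
        return True
    return False


queues = SendQueues()
queues.normal.append("prefetch:task3")
resent = on_timer_check(queues, "prefetch:task2", 2500, 2000, False)
first = queues.next_to_send()    # the resent prefetch task jumps the queue
second = queues.next_to_send()
```

Whether the resent task goes back to the same execution circuit or to another one is left outside this sketch, since the text allows either.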

[0045] To monitor and report task execution, the task scheduler of this disclosure may also include a recording circuit 120 and an error reporting circuit 122. In one implementation scenario, the recording circuit can be used to record errors that occur during the execution of the prefetch task. These errors may include, for example, an indication that a pre-completion instruction has not been received from the execution circuit 108, or various error messages fed back by the execution circuit 108 during execution. Subsequently, the error reporting circuit 122 can report the errors recorded by the recording circuit to the upper-level user, thereby taking appropriate measures regarding the execution errors of the prefetch task. In one scenario, the error reporting circuit 122 can report errors during the execution of the actual task associated with the prefetch task. Through such error reporting, the user can instruct the execution circuit 108 to correct the errors during the actual task execution to complete the entire task. Additionally, if the consequences of erroneous execution of the prefetch task cannot be overcome, the execution circuit can also report back to the task scheduler, instructing it to resend the prefetch task with the execution error for execution, or resend it to another execution circuit for execution.

[0046] In one scenario, when the execution circuit 108 serves as a processing unit in an artificial intelligence processor (such as the cluster 1005 shown in Figure 10), it can include multiple processor cores (such as the processor cores 1006 shown in Figure 10) that operate on tasks in parallel. In this case, a task of this disclosure can be divided into multiple subtasks, and each subtask can have an associated sub-prefetch task and sub-actual task. Similar to the previous description, the sub-prefetch tasks here can include operations such as instruction fetching and address translation, while the sub-actual tasks can be the specific actual task executions.

[0047] Based on the above subtask division, the task scheduler 102 of this disclosure can also interact with the multiple processor cores so that they execute the prefetch subtasks and actual subtasks of the corresponding subtasks in parallel. During this interaction, the first sending circuit can send the corresponding prefetch subtask of the next task to each of the multiple processor cores in response to receiving, from all of the multiple processor cores, a pre-completion indication for the prefetch subtask of the current task. Correspondingly, the second sending circuit can send the corresponding actual subtask of the next task to each of the multiple processor cores, for parallel execution by those cores, in response to receiving from all of the multiple processor cores both a completion indication for the actual subtask of the current task and a pre-completion indication for the prefetch subtask of the next task. In the scheme of this disclosure, when the task scheduler receives the pre-completion indication, it can release the computing resources of the corresponding processor core, which allows it to schedule tasks flexibly according to the resource occupancy of the multiple processor cores.
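The "all cores have reported" condition for the second sending circuit can be sketched as a simple gate. The class and method names below are hypothetical, and integer core identifiers are an assumption; the sketch only models the condition stated above, namely that every core must report both completion of the current actual subtask and (pre-)completion of the next prefetch subtask.

```python
from typing import Iterable, Set


class MultiCoreGate:
    """Tracks per-core indications for one current/next task pair."""

    def __init__(self, core_ids: Iterable[int]) -> None:
        self.core_ids = frozenset(core_ids)
        self.actual_done: Set[int] = set()
        self.next_prefetch_done: Set[int] = set()

    def _check(self, core_id: int) -> None:
        if core_id not in self.core_ids:
            raise ValueError(f"unknown core {core_id}")

    def report_actual_done(self, core_id: int) -> None:
        """Core finished its actual subtask of the current task."""
        self._check(core_id)
        self.actual_done.add(core_id)

    def report_next_prefetch_done(self, core_id: int) -> None:
        """Core reported on its prefetch subtask of the next task."""
        self._check(core_id)
        self.next_prefetch_done.add(core_id)

    def may_send_next_actual(self) -> bool:
        """True only once every core has reported both indications."""
        return (
            self.actual_done == self.core_ids
            and self.next_prefetch_done == self.core_ids
        )


gate = MultiCoreGate([0, 1])
gate.report_actual_done(0)
gate.report_actual_done(1)
gate.report_next_prefetch_done(0)
ready_early = gate.may_send_next_actual()   # core 1 has not yet reported
gate.report_next_prefetch_done(1)
ready_now = gate.may_send_next_actual()
```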

[0048] The composition details of the task scheduler in the embodiments of this disclosure have been described above with reference to Figure 2. Based on the above description, those skilled in the art will understand that the task scheduler of this disclosure can be implemented in various ways and is not limited to the multiple circuits shown in Figure 2. Furthermore, although Figure 2 shows the various components of the task scheduler in the form of circuit modules, the implementation of the task scheduler is not limited to the form shown in Figure 2. Based on the teachings of this disclosure, those skilled in the art will also realize that the task scheduler can take other implementation forms, such as software or a combination of software and hardware. When implemented in software, the circuits shown in Figure 2 can be replaced by corresponding program modules or units. Using the task scheduler of this disclosure, simplified dual-thread task scheduling can be implemented, thereby achieving inter-thread parallelism with minimal design complexity.

[0049] Figure 3 is a simplified flowchart illustrating a method 300 for performing task scheduling according to this disclosure. Based on the foregoing description with reference to Figures 1 and 2, those skilled in the art will understand that method 300 can be executed by the task scheduler of this disclosure, thereby achieving parallelism in task execution with minimal thread switching overhead.

[0050] As shown in Figure 3, in step S302, during the execution of the actual task of the current task by the execution circuit, a prefetch task for the next task is sent to the execution circuit, wherein each task is split into a prefetch task and an actual task that are associated with each other. Then, in step S304, after the execution circuit has completed the prefetch task of the next task, the actual task of the next task is sent to the execution circuit, so that the execution circuit executes the actual task of the next task after the actual task of the current task has been completed. As mentioned above, the prefetch task and actual task can be divided by the user or programmer through written software instructions, or directly by the task scheduler of this disclosure. Furthermore, prefetch tasks and actual tasks belonging to the same task can be associated through task identifiers, so that the same execution circuit can complete the prefetch task and actual task of the same task.

[0051] It can be seen that, by executing the method steps shown in Figure 3, the task scheduler of this disclosure achieves dual-threaded task scheduling and highly parallelized task processing: the prefetch task of the next task is executed during the execution of the actual task of the current task, and the actual task of the next task is executed after the actual task of the current task is completed.
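The send order implied by steps S302 and S304 can be illustrated with a short sketch that emits a trace for a single execution circuit. The `schedule` function and its event strings are hypothetical; timing, indications, and error handling are deliberately left out, so this shows only the interleaving, not the disclosed hardware.

```python
from typing import List


def schedule(task_ids: List[int]) -> List[str]:
    """Return the order in which prefetch and actual tasks are sent."""
    events: List[str] = []
    if not task_ids:
        return events
    # The first task has no predecessor to overlap with, so its prefetch
    # task is simply sent first.
    events.append(f"send prefetch {task_ids[0]}")
    for i, tid in enumerate(task_ids):
        # S304: the actual task is sent once its own prefetch task is done.
        events.append(f"send actual {tid}")
        if i + 1 < len(task_ids):
            # S302: while actual task `tid` runs, send the next prefetch task.
            nxt = task_ids[i + 1]
            events.append(f"send prefetch {nxt} (overlaps actual {tid})")
    return events


trace = schedule([1, 2, 3])
```

For three tasks the trace contains six sends, with each prefetch task after the first overlapping the actual task of its predecessor.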

[0052] Figure 4 This is a flowchart schematically illustrating details of a method 400 for performing task scheduling according to an embodiment of the present disclosure. It will be understood that method 400 illustrates further implementation steps and details of method 300, and therefore the description of method 300 also applies. Figure 4 The method steps. Additionally, since method 400 can also be executed by the task scheduler, the aforementioned steps, combined with 1- Figure 3 The description of the task scheduler also applies to the following text. Figure 4 The description will be repeated here, and the same content will not be elaborated upon again.

[0053] like Figure 4As shown, at step S402, tasks are received that have been split into prefetch tasks and actual tasks associated with each other via program instructions. Alternatively, at step S404, the received tasks are split into prefetch tasks and actual tasks associated with each other. The tasks here can be any task executed by the execution circuit, such as tensor-based computation tasks, including, for example, convolution operation tasks. As mentioned earlier, the task here can be one of many tasks in one or more task flows. Assuming it is the current task that the current execution circuit will execute, then the task immediately following it is also the next task.

[0054] Next, at step S406, at a predetermined time before the actual completion of the current task, a prefetch task for the next task is sent to the execution unit. Then, at step S408, a pre-finish instruction (“Pre finish”) for the actual task of the current task is received from the execution circuit. At step S410, in response to receiving the aforementioned pre-finish instruction, the hardware resources of the execution circuit are released for executing the prefetch task of the next task.

[0055] At step S412, in response to receiving a completion indication for the prefetch task of the next task from the execution circuit, the actual task of the next task is sent to the execution circuit. As an optional step, at step S414, the execution circuit can be timing the execution of the actual task of the current task; for example, it can use... Figure 2 The timer shown in the diagram keeps track of time, and a predetermined threshold is determined for the time. In response to the time exceeding the predetermined threshold, an incomplete indication indicating that the actual task execution has not been completed can be received from the execution circuit. Subsequently, in response to receiving the incomplete indication, a prefetch task for the next task can be sent to the execution circuit or another execution circuit. In other words, since the actual task of the current task has not been completed within the predetermined time, the scheme of this disclosure chooses to resend the prefetch task for the next task so that the execution circuit has sufficient time to complete the actual task of the current task. Alternatively, the prefetch task for the next task can also be sent to another execution circuit upon receiving the incomplete indication. This is particularly advantageous in scenarios with multiple execution circuits. By sending the prefetch task for the next task to another execution circuit, the parallel scheduling of this disclosure is not affected by the execution speed of a single execution circuit, but can maximize the advantages of multiple execution circuits.

[0056] One implementation scheme and scenario of the present disclosure has been described above in conjunction with Figure 4, but the implementation forms and scenarios of the present disclosure are not limited thereto. For example, when the execution circuit completes the actual task of the current task within the predetermined threshold of the timer, the task scheduler can directly send the actual task of the next task to the execution circuit. In other words, the execution circuit has now completed the actual task of the current task, is temporarily in an idle state, and can execute the actual task of the next task.
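The timer-based decision described in the two preceding paragraphs can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the names `Dispatch` and `next_dispatch`, the string labels, and the policy of preferring the first other circuit in a list are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Dispatch:
    task: str     # "prefetch(next)" or "actual(next)" (hypothetical labels)
    circuit: int  # index of the execution circuit that receives the task


def next_dispatch(elapsed: float, threshold: float, completed: bool,
                  current: int, others: List[int]) -> Optional[Dispatch]:
    """Decide what the scheduler sends after checking the timer."""
    if completed and elapsed <= threshold:
        # The actual task finished within the threshold, so the circuit is
        # idle and the next actual task can be sent to it directly.
        return Dispatch("actual(next)", current)
    if not completed and elapsed > threshold:
        # Incomplete indication: send the prefetch task of the next task,
        # preferring another circuit when one exists (assumed policy).
        target = others[0] if others else current
        return Dispatch("prefetch(next)", target)
    return None  # no timer-driven action in the remaining cases
```

For example, `next_dispatch(12.0, 10.0, False, 0, [1])` models a timeout on circuit 0 with circuit 1 available, and selects circuit 1 for the prefetch task.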

[0057] Figure 5 is a schematic flowchart illustrating task scheduling 500 according to an embodiment of the present disclosure. To aid further understanding of the scheduling scheme of the present disclosure, Figure 5 illustrates the processing flow of the task scheduler in a manner similar to a sequence diagram. Given that the operational details of the task scheduler of this disclosure have already been described in detail in conjunction with Figures 1-4, the same or similar technical content is presented concisely below.

[0058] As shown in Figure 5, at S501 (indicated by the arrow), the task scheduler of this disclosure can send the prefetch task of the current task to the execution circuit. Then, after the execution circuit completes the prefetch task, at S502 (indicated by the arrow), the task scheduler can send the actual task to the execution circuit. Subsequently, at the predetermined time mentioned above, at S503 (indicated by the arrow), the task scheduler can receive a pre-completion indication from the execution circuit. Afterwards, while the execution circuit executes the actual task of the current task at S504 (indicated by the arrow), the task scheduler sends the prefetch task of the next task to the execution circuit once the pre-completion indication has been received, and the execution circuit completes the prefetch task.

[0059] As shown in the figure, to ensure the execution circuit can successfully complete the actual task of the current task, even though the prefetch task of the next task has been completed, the actual task of the next task is not sent until the actual task of the current task is completed, as indicated by the arrow S504. In response to the completion of the actual task of the current task, for example, receiving a completion indication from the execution circuit, the task scheduler can send the actual task of the next task to the execution circuit along the arrow S506, so that the execution circuit can then execute the actual task of the next task. Although not further shown in the figure, based on the detailed description above, those skilled in the art will understand that for more than two tasks, the processing flow can be repeatedly executed in a similar manner until all tasks are scheduled and executed. For example, during the execution of the actual task of a later task, the task scheduler can send the prefetch task of the task immediately following the next task to the execution circuit for execution. And so on, the task scheduler of this disclosure ultimately dispatches all tasks to the execution circuit for execution.
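The message ordering described for Figure 5 can be sketched as a trace generator. The following Python sketch is illustrative only: the function name `schedule_trace` and the event labels are hypothetical, and the sketch assumes every task completes normally and that the last task also reports a pre-completion indication.

```python
from typing import List, Tuple

Event = Tuple[str, str, str]  # (direction, message, task name)


def schedule_trace(tasks: List[str]) -> List[Event]:
    """List scheduler-side events in the order described for Figure 5."""
    trace: List[Event] = []
    if not tasks:
        return trace
    # The first task has nothing to overlap with: prefetch, then actual task.
    trace.append(("send", "prefetch", tasks[0]))
    trace.append(("recv", "prefetch_done", tasks[0]))
    trace.append(("send", "actual", tasks[0]))
    for cur, nxt in zip(tasks, tasks[1:]):
        # The pre-completion indication opens the window for the next prefetch.
        trace.append(("recv", "pre_finish", cur))
        trace.append(("send", "prefetch", nxt))
        trace.append(("recv", "prefetch_done", nxt))
        # The next actual task is held back until the current one completes.
        trace.append(("recv", "finish", cur))
        trace.append(("send", "actual", nxt))
    trace.append(("recv", "pre_finish", tasks[-1]))
    trace.append(("recv", "finish", tasks[-1]))
    return trace
```

Calling `schedule_trace(["A", "B"])` places the prefetch of B between the pre-completion and completion indications of A, which is the overlap the scheme relies on.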

[0060] Figure 6 is a schematic diagram illustrating state transitions for performing task scheduling 600 according to an embodiment of the present disclosure. It will be understood that the state transition diagram of Figure 6 is merely exemplary, and those skilled in the art, based on the foregoing description, will understand that the task scheduling scheme of this disclosure also includes state transitions not shown in the diagram. Furthermore, the foregoing description of the task scheduler operations in conjunction with Figures 1-5 also applies to Figure 6, so the same content is described only briefly and is not repeated.

[0061] As shown in the diagram, at the initial stage of task scheduling, specifically at state node 601, both the prefetch task (as indicated by "Prefetch") and the actual task (Exe) are idle. Then, as indicated by arrow S606, when the task scheduler sends the prefetch task for the current task to the execution circuit, the state transitions to state node 602. At this state node, the execution of the prefetch task is busy, while the execution of the actual task is idle because the actual task has not yet been sent from the task scheduler to the execution circuit. Next, after the execution circuit completes the prefetch task for the current task, the state transitions back to state node 601, as indicated by arrow S607. At this state node, since the prefetch task has been completed, both the prefetch task and the actual task will again be idle.

[0062] According to the scheme of this disclosure, the task scheduler can then send the actual task of the current task to the execution circuit, as shown by arrow S608. At this time, the state transitions from state node 601 to state node 603. At state node 603, the execution of the prefetch task of the current task remains idle while the pre-execution of the actual task (as shown by "Pre-Exe" in the figure) is busy. Here, pre-execution can be used to indicate the execution operation of the actual task of the current task by the execution circuit before the predetermined time mentioned above.

[0063] Next, as the execution circuit executes the actual task of the current task, the state transitions from state node 603 to state node 604 via arrow S609. After this state transition, since the prefetch task of the current task has been completed, the prefetch task remains idle, while the execution of the actual task is busy from the predetermined time to the final stage, meaning the post-execution of the current task's actual task (shown as "Post-Exe" in the figure) is in progress. Afterward, the task scheduler sends the prefetch task of the next task to the execution circuit. Thus, the state transitions from state node 604 to state node 605 via arrow S610. After this state transition, since the execution circuit executes the prefetch task of the next task, the execution of the next task's prefetch task becomes busy. Simultaneously, since the post-execution of the current task's actual task is still in progress, the post-execution remains busy.

[0064] Subsequently, as indicated by arrow S611, the state transitions from state node 605 back to state node 604. As before, after this state transition the execution of the prefetch task is idle because the execution circuit has completed the prefetch task for the next task; meanwhile, since the execution circuit is still in the post-execution of the actual task of the current task, the post-execution remains busy. When the execution circuit completes the post-execution of the actual task of the current task at state node 605, as indicated by arrow S612, the execution circuit sends a completion indication for the actual task to the task scheduler, so that the state transitions to state node 602, where the execution of the actual task is idle.
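The state nodes and arrows described for Figure 6 can be collected into a small transition table. The Python sketch below encodes only the seven transitions named in the text (S606 to S612); the names `STATES`, `TRANSITIONS`, `step`, and `run` are hypothetical, and transitions not shown in the figure are deliberately left out.

```python
# (prefetch status, actual-task status) for each state node in Figure 6
STATES = {
    601: ("idle", "idle"),
    602: ("busy", "idle"),
    603: ("idle", "pre-exe busy"),
    604: ("idle", "post-exe busy"),
    605: ("busy", "post-exe busy"),
}

# (current state node, arrow) -> next state node
TRANSITIONS = {
    (601, "S606"): 602,  # scheduler sends the prefetch task of the current task
    (602, "S607"): 601,  # prefetch task of the current task completes
    (601, "S608"): 603,  # scheduler sends the actual task of the current task
    (603, "S609"): 604,  # actual task passes the predetermined time
    (604, "S610"): 605,  # scheduler sends the prefetch task of the next task
    (605, "S611"): 604,  # prefetch task of the next task completes
    (605, "S612"): 602,  # actual task completes while the prefetch is busy
}


def step(state: int, arrow: str) -> int:
    """Follow one arrow; raise ValueError if it is not defined from `state`."""
    try:
        return TRANSITIONS[(state, arrow)]
    except KeyError:
        raise ValueError(f"arrow {arrow} is not defined from state node {state}") from None


def run(arrows, start: int = 601) -> int:
    """Apply a sequence of arrows starting from the initial state node."""
    state = start
    for arrow in arrows:
        state = step(state, arrow)
    return state
```

Error states, which the disclosure mentions as a possible extension, would require additional entries in both tables.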

[0065] The state transitions in parallel scheduling performed by the task scheduler of this disclosure have been described above by way of example in conjunction with Figure 6. It is understood that the description herein is merely exemplary and not restrictive. Those skilled in the art can also incorporate the error states mentioned above based on the description herein. Such an error can occur as an execution error of a prefetch task or an actual task; thus, the states described above can also include, for example, a situation where the prefetch task is idle while the actual task malfunctions, or a situation where the actual task is busy while the prefetch task malfunctions. In this scenario, the artificial intelligence processor of this disclosure may also include a control circuit that is connected to the execution circuit and collects error information about the executed task from the execution circuit in order to notify the task scheduler. In some cases, for execution errors, options include rescheduling the malfunctioning task, having the user modify the task code, or restarting the execution circuit.

[0066] Figure 7 illustrates a hardware and software architecture design according to an embodiment of this disclosure. As shown in the diagram, the hardware and software architecture in this embodiment may include an AI processor 701, a driver and operating system 702, a compiler and programming language 703, a library 704, a framework layer 705, and an application layer 706. It is understood that this hardware and software architecture can be applied to the artificial intelligence computing system or computing platform of this application.

[0067] Specifically, the AI processor 701 (which may, for example, be included in the board described below in conjunction with the accompanying drawings) incorporates both computational and data handling optimizations in its hardware design. To this end, it employs customized computing units to accelerate computation and on-chip memory to accelerate data handling, thereby achieving extremely high performance and energy efficiency. Furthermore, to support various algorithm optimizations, the AI processor 701 can have customized computing units and instruction sets, where the instruction set can provide computational instructions of different granularities (scalar, vector, and / or matrix). Moreover, considering factors such as algorithm memory access characteristics, hardware cost, and verification difficulty, on-chip memory can be used, and data handling can be optimized. In practical operation, the AI processor of this disclosure can achieve speeds tens of times faster than mainstream GPUs (Graphics Processing Units).

[0068] The driver and operating system 702 is primarily responsible for scheduling tasks on the AI processor 701. This scheduling can, for example, dispatch tasks according to task priority and handle communication and synchronization between multiple devices. For a compiled program, the operating system and driver can schedule and execute the tasks to be performed on a specific processor, including but not limited to the following operations: allocating and releasing device memory, enabling data transfer between devices, maintaining task queues, and scheduling tasks according to priority to achieve synchronization and cooperation between multiple devices.
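Maintaining a task queue and scheduling by priority, as mentioned above, can be sketched with a binary heap. This Python sketch is only an illustration: the class name `TaskQueue`, the convention that a smaller number means a higher priority, and the first-in-first-out tie-break are assumptions, not details given in the disclosure.

```python
import heapq
import itertools


class TaskQueue:
    """Priority task queue; smaller numbers are served first (assumed convention)."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # preserves submission order on ties

    def submit(self, name: str, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._order), name))

    def pop_next(self):
        """Return the name of the next task to dispatch, or None when empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

A driver built along these lines would call `pop_next()` whenever an execution resource becomes available.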

[0069] The compiler and programming language 703 can be an assembly language developed for the instruction set of the AI processor 701. In applications, it can translate deep learning operators developed for the AI processor 701 into combinations of processor instructions, enabling efficient use of the AI processor 701. In some application scenarios, the compiler can optimize compilation through an intermediate representation stage.

[0070] Library 704 may include runtime library 714 and machine learning library 724. In one implementation scenario, the aforementioned library 704 can use the instruction set of AI processor 701 and perform partial optimizations based on the instruction set of AI processor 701 to improve the running speed of operators. Runtime library 714 may be a high-performance operator library specifically developed for AI processor 701, and it can be used to complete the interaction between general-purpose processors and artificial intelligence processors. Furthermore, runtime library 714 can also provide a set of interfaces for artificial intelligence processors. As for machine learning library 724, it can be used to accelerate various machine learning or deep learning algorithms on artificial intelligence processors. Specifically, machine learning library 724 can provide a set of efficient, general-purpose, flexible and scalable programming interfaces. Its upper-layer machine learning applications can directly adopt the programming interfaces of various programming frameworks (such as PyTorch, TensorFlow, Caffe, MXNet, etc.), or can directly program using the interface provided by machine learning library 724. In addition, the machine learning library 724 disclosed herein can be easily called by hardware platforms, while runtime library 714 can implement some basic and commonly used operators, such as convolution, pooling and other operations.

[0071] Framework layer 705 can add encapsulation for operators developed for AI processors, primarily encapsulating operators from runtime library 714. In addition, framework layer 705 can modify related task scheduling or memory management components. In one application scenario, framework layer 705 can adopt the architecture of frameworks such as TensorFlow.

[0072] The device side in this embodiment may be an artificial intelligence chip, a board, or the like. Figure 8 shows a schematic diagram of the structure of a board 800 according to an embodiment of this disclosure. As shown in Figure 8, board 800 includes a chip (or "processing chip") 801, which is a system-on-a-chip (SoC) integrating one or more combined processing devices. These combined processing devices are artificial intelligence computing units used to support various deep learning and machine learning algorithms, meeting the intelligent processing needs of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. In particular, deep learning technology is widely used in cloud intelligence. A significant characteristic of cloud intelligence applications is the large volume of input data, which places high demands on the platform's storage and computing capabilities. Board 800 in this embodiment is suitable for cloud intelligence applications, possessing massive off-chip storage, on-chip storage, and substantial computing power.

[0073] Chip 801 is connected to external device 803 via external interface device 802. External device 803 may be, for example, a server, computer, camera, monitor, mouse, keyboard, network card, or Wi-Fi interface. Data to be processed can be transmitted from external device 803 to chip 801 via external interface device 802. The calculation results from chip 801 can be transmitted back to external device 803 via external interface device 802. Depending on the application scenario, external interface device 802 may have different interface forms, such as a PCIe interface.

[0074] The board 800 also includes a storage device 804 for storing data, which includes one or more storage units 805. The storage device 804 is connected to and transmits data with the controller 806 and the chip 801 via a bus. The controller 806 in the board 800 is configured to regulate the state of the chip 801. To this end, in one application scenario, the controller 806 may include a microcontroller (MCU). In the application scenario of the scheduling scheme of this disclosure, the controller can run a driver program and includes a task scheduler. When the aforementioned driver program is run under the control of the controller, the task scheduler performs the operation process described above in conjunction with Figures 1-6, thereby distributing tasks to processing chips or processor cores for execution.

[0075] Figure 9 is a structural diagram illustrating the combined processing device 900 in chip 801 of this embodiment. As shown in Figure 9, the combined processing device 900 includes a computing device 901, an interface device 902, a processing device 903, and a DRAM 904.

[0076] The computing device 901 is configured to perform user-specified operations. It is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor to perform deep learning or machine learning calculations. It can interact with the processing device 903 through the interface device 902 to jointly complete the user-specified operations.

[0077] Interface device 902 is used to transmit data and control commands between computing device 901 and processing device 903. For example, computing device 901 can obtain input data from processing device 903 via interface device 902 and write it to on-chip storage device of computing device 901. Further, computing device 901 can obtain control commands from processing device 903 via interface device 902 and write them to on-chip control cache of computing device 901. Alternatively or optionally, interface device 902 can also read data from storage device of computing device 901 and transmit it to processing device 903.

[0078] Processing device 903, as a general-purpose processing device, performs basic control including but not limited to data transfer, and starting and / or stopping computing device 901. Depending on the implementation, processing device 903 may be one or more types of processors, including but not limited to digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs. As mentioned above, for the purposes of this disclosure only, computing device 901 can be considered to have a single-core structure or a homogeneous multi-core structure. However, when computing device 901 and processing device 903 are considered together, they are considered to form a heterogeneous multi-core structure.

[0079] DRAM 904 is used to store data to be processed. It is DDR memory, typically 16GB or larger, and is used to store data in computing device 901 and / or processing device 903.

[0080] Figure 10 shows a schematic diagram of the internal structure of computing device 901. Computing device 901 is used to process input data for computer vision, speech, natural language processing, data mining, etc. The computing device 901 in the diagram adopts a multi-core hierarchical architecture design. As a system-on-a-chip, computing device 901 includes multiple clusters, and each cluster includes multiple processor cores, which can be used to execute the tasks disclosed herein. In other words, computing device 901 is constructed in a hierarchical structure of system-on-a-chip, clusters, and processor cores.

[0081] At the system-on-a-chip level, as shown in Figure 10, the computing device 901 includes an external storage controller 1001, a peripheral communication module 1002, an on-chip interconnect module 1003, a synchronization module 1004, and multiple clusters 1005.

[0082] There can be multiple external storage controllers 1001; two are shown as an example in the figure. These controllers respond to access requests issued by the processor cores and access external storage devices, such as the DRAM 904 in Figure 9, so that data can be read from or written to the external storage device. The peripheral communication module 1002 receives control signals from the processing device 903 via the interface device 902 and starts the computing device 901 to execute tasks, such as the prefetch task and actual task mentioned above in this disclosure. The on-chip interconnect module 1003 connects the external storage controller 1001, the peripheral communication module 1002, and the multiple clusters 1005 to transmit data and control signals between the modules. The synchronization module 1004 is a global barrier controller (GBC) used to coordinate the working progress of each cluster and ensure information synchronization. The multiple clusters 1005 are the computing core of the computing device 901. Four are shown by way of example in the figure; however, with hardware development, the computing device 901 disclosed herein may also include eight, sixteen, sixty-four, or even more clusters 1005.

[0083] At the cluster level, as shown in Figure 10, each cluster 1005 includes multiple processor cores (IPU cores) 1006 and one memory core (MEM core) 1007.

[0084] Four processor cores 1006 are shown in the figure as an example; this disclosure does not limit the number of processor cores 1006. Their internal architecture is as shown in Figure 10. Each processor core 1006 includes three main modules: a control module 91, a computation module 92, and a storage module 93.
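The system-on-a-chip, cluster, and processor-core hierarchy described above can be modeled with a few data classes. The Python sketch below is illustrative only: the class and function names are hypothetical, and the default of four clusters with four cores each simply mirrors the counts shown as examples in the figure.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ProcessorCore:
    core_id: int
    # The three main modules named in the text (control 91, computation 92, storage 93).
    modules: Tuple[str, ...] = ("control", "computation", "storage")


@dataclass
class Cluster:
    cluster_id: int
    cores: List[ProcessorCore] = field(default_factory=list)


@dataclass
class ComputingDevice:
    clusters: List[Cluster] = field(default_factory=list)

    def total_cores(self) -> int:
        return sum(len(cluster.cores) for cluster in self.clusters)


def build_device(num_clusters: int = 4, cores_per_cluster: int = 4) -> ComputingDevice:
    """Build a device with the example counts; both counts are configurable."""
    return ComputingDevice([
        Cluster(c, [ProcessorCore(i) for i in range(cores_per_cluster)])
        for c in range(num_clusters)
    ])
```

With the default arguments the model contains sixteen processor cores; the memory core of each cluster is omitted to keep the sketch short.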

[0085] The control module 91 coordinates and controls the operation of the computation module 92 and the storage module 93 to complete the deep learning task. It includes an instruction fetch unit (IFU) 1111 and an instruction decode unit (IDU) 1112. The instruction fetch unit 1111 fetches instructions from the processing device 903, and the instruction decode unit 1112 decodes the fetched instructions and sends the decoding result as control information to the computation module 92 and the storage module 93. The instruction fetch and instruction decode operations here can be considered as the prefetching task of this disclosure.

[0086] The computation module 92 includes a vector operation unit 1121 and a matrix operation unit 1122. The vector operation unit 1121 is used to perform vector operations and can support complex operations such as vector multiplication, addition, and nonlinear transformations; the matrix operation unit 1122 is responsible for the core computations of deep learning algorithms, namely matrix multiplication and convolution.

[0087] Storage module 93 is used to store or move related data, including neuron RAM (NRAM) 1131, weight RAM (WRAM) 1132, input / output direct memory access (IODMA) 1133, and move direct memory access (MVDMA) 1134. NRAM 1131 is used to store input, output data, and intermediate results for the processor core 1006 to calculate; WRAM 1132 is used to store the weights of the deep learning network; IODMA 1133 controls the memory access of NRAM 1131 / WRAM 1132 and DRAM 904 through broadcast bus 1009; MVDMA 1134 controls the memory access of NRAM 1131 / WRAM 1132 and SRAM 1008.

[0088] Returning to Figure 10, the storage core 1007 is primarily used for storage and communication, namely storing shared data or intermediate results among processor cores 1006, and performing communication between cluster 1005 and DRAM 904, communication between clusters 1005, and communication between processor cores 1006. In other embodiments, the storage core 1007 has scalar operation capabilities and is used to perform scalar operations.

[0089] Storage core 1007 includes a shared memory unit (SRAM) 1008, a broadcast bus 1009, a cluster direct memory access (CDMA) module 1010, and a global direct memory access (GDMA) module 1011. SRAM 1008 acts as a high-performance data relay station. Data multiplexed between different processor cores 1006 within the same cluster 1005 does not need to be obtained from DRAM 904 by each processor core 1006 individually. Instead, it is relayed between processor cores 1006 via SRAM 1008. Storage core 1007 only needs to quickly distribute the multiplexed data from SRAM 1008 to multiple processor cores 1006, thereby improving inter-core communication efficiency and significantly reducing on-chip and off-chip I / O access.
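The saving in off-chip access that the SRAM relay provides can be shown with simple counting. The Python sketch below is an idealized model under stated assumptions: a single shared data block, one DRAM read per fetch, and a hypothetical function name `dram_reads`.

```python
def dram_reads(num_cores: int, use_sram_relay: bool) -> int:
    """Number of DRAM reads needed to give every core the same shared block."""
    if num_cores <= 0:
        return 0
    # With the relay, the block is fetched from DRAM once into SRAM and then
    # distributed on-chip; without it, each core fetches its own copy.
    return 1 if use_sram_relay else num_cores
```

For a cluster with four processor cores, the model gives four DRAM reads without the relay and one with it.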

[0090] Broadcast bus 1009, CDMA 1010, and GDMA 1011 are used to perform communication between processor cores 1006, communication between clusters 1005, and data transfer between cluster 1005 and DRAM 904, respectively. These will be explained separately below.

[0091] The broadcast bus 1009 is used to complete high-speed communication between the processor cores 1006 within the cluster 1005. In this embodiment, the broadcast bus 1009 supports inter-core communication methods including unicast, multicast, and broadcast. Unicast refers to point-to-point (i.e., data transmission from one processor core to another) data transmission. Multicast is a communication method that transmits a piece of data from SRAM 1008 to several specific processor cores 1006. Broadcast is a communication method that transmits a piece of data from SRAM 1008 to all processor cores 1006, and is a special case of multicast.
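The relationship between the three communication methods can be made concrete by computing the set of processor cores that receive a transfer. This Python sketch is illustrative: the function name `receivers` and the integer core indices are assumptions, and it models only the destination set, not the bus itself.

```python
from typing import Iterable, Optional, Set


def receivers(mode: str, num_cores: int,
              targets: Optional[Iterable[int]] = None) -> Set[int]:
    """Return the set of processor cores that receive one transfer."""
    all_cores = set(range(num_cores))
    if mode == "broadcast":
        return all_cores  # every core in the cluster
    if targets is None:
        raise ValueError("unicast and multicast need explicit targets")
    chosen = set(targets)
    if not chosen <= all_cores:
        raise ValueError("target core does not exist in this cluster")
    if mode == "unicast":
        if len(chosen) != 1:
            raise ValueError("unicast has exactly one destination core")
        return chosen
    if mode == "multicast":
        return chosen
    raise ValueError(f"unknown mode: {mode}")
```

A broadcast yields the same destination set as a multicast addressed to every core, which matches the statement that broadcast is a special case of multicast.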

[0092] CDMA 1010 is used to control SRAM 1008 access between different clusters 1005 within the same computing device 901. Figure 12 illustrates the operation of CDMA 1010 when one processor core attempts to write data to another processor core in a different cluster. In this application scenario, the same computing device comprises multiple clusters. For simplicity, only clusters 0 and 1 are shown in the diagram. Both clusters 0 and 1 contain multiple processor cores; similarly, for ease of explanation, only processor core 0 is shown in cluster 0, and only processor core 1 is shown in cluster 1. Processor core 0 intends to write data to processor core 1.

[0093] First, processor core 0 sends a unicast write request to write data into its local SRAM 0. CDMA 0 acts as the master and CDMA 1 acts as the slave. The master pushes the write request to the slave, that is, the master sends the write address AW and the write data W to transmit the data to SRAM 1 of cluster 1. Then, the slave sends a write response B as a response. Finally, processor core 1 of cluster 1 sends a unicast read request to read the data from SRAM 1.
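The cross-cluster write sequence above can be traced step by step in a small simulation. The Python sketch below is a simplified model, not the hardware protocol: each SRAM is a dictionary, the AW, W, and B messages are represented only as log entries, and the names `ClusterSram` and `cross_cluster_write` are hypothetical.

```python
from typing import Dict, List, Tuple


class ClusterSram:
    """A cluster reduced to its shared SRAM, modeled as an address -> data map."""

    def __init__(self, name: str):
        self.name = name
        self.sram: Dict[int, bytes] = {}


def cross_cluster_write(src: ClusterSram, dst: ClusterSram,
                        addr: int, data: bytes) -> Tuple[bytes, List[str]]:
    """Move data from a core in `src` to a core in `dst` via the two SRAMs."""
    log: List[str] = []
    src.sram[addr] = data          # core 0: unicast write into its local SRAM 0
    log.append("unicast_write")
    aw, w = addr, src.sram[addr]   # master CDMA issues write address and write data
    log.append("AW")
    log.append("W")
    dst.sram[aw] = w               # slave CDMA stores the data in SRAM 1
    log.append("B")                # slave returns the write response
    value = dst.sram[aw]           # core 1: unicast read from SRAM 1
    log.append("unicast_read")
    return value, log
```

The returned log lists the five steps in the order given in the text, and the returned value is what processor core 1 reads from SRAM 1.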

[0094] Returning to Figure 10, the GDMA 1011 works in conjunction with the external memory controller 1001 to control memory access from the SRAM 1008 in the cluster 1005 to the DRAM 904, or to read data from the DRAM 904 into the SRAM 1008. As described above, communication between the DRAM 904 and the NRAM 1131 or WRAM 1132 can be achieved through two channels. The first channel is a direct connection between the DRAM 904 and the NRAM 1131 or WRAM 1132 via the IODMA 1133; the second channel involves first transmitting data between the DRAM 904 and the SRAM 1008 via the GDMA 1011, and then transmitting data between the SRAM 1008 and the NRAM 1131 or WRAM 1132 via the MVDMA 1134. Although the second channel appears to require more components and has a longer data flow, in some embodiments the bandwidth of the second channel is actually much greater than that of the first channel. Therefore, communication between DRAM 904 and NRAM 1131 or WRAM 1132 may be more efficient via the second channel. Embodiments of this disclosure may select the data transmission channel based on their hardware capabilities.
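Selecting a data transmission channel from hardware capabilities can be illustrated with a rough time estimate. The Python sketch below rests on assumptions that the disclosure does not state: transfer time is size divided by bandwidth, the two hops of the second channel run one after the other without overlap, and all bandwidth figures are hypothetical inputs supplied by the caller.

```python
def transfer_time(size: float, *bandwidths: float) -> float:
    """Time to push `size` units through hops executed one after another."""
    return sum(size / bw for bw in bandwidths)


def pick_channel(size: float, bw_iodma: float,
                 bw_gdma: float, bw_mvdma: float) -> str:
    """Return "first" (via IODMA) or "second" (via GDMA, then MVDMA)."""
    first = transfer_time(size, bw_iodma)
    second = transfer_time(size, bw_gdma, bw_mvdma)
    return "first" if first <= second else "second"
```

Under this model the second channel is chosen only when its hops are fast enough to outweigh the extra stage, for example `pick_channel(1024, 1.0, 4.0, 4.0)` selects the second channel.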

[0095] In other embodiments, the functions of GDMA 1011 and IODMA 1133 can be integrated into the same component. For ease of description, this disclosure treats GDMA 1011 and IODMA 1133 as different components. For those skilled in the art, any component whose implemented functions and achieved technical effects are similar to this disclosure is within the scope of protection of this disclosure. Furthermore, the functions of GDMA 1011, IODMA 1133, CDMA 1010, and MVDMA 1134 can also be implemented by the same component. Similarly, any component whose implemented functions and achieved technical effects are similar to this disclosure is within the scope of protection of this disclosure.

[0096] The hardware and software architecture and internal structure of this disclosure have been described in detail above in conjunction with Figures 7-12. It is understood that the above description is merely exemplary and not restrictive. Depending on different application scenarios and hardware specifications, those skilled in the art may also make changes to the board (or artificial intelligence device) and its internal structure disclosed herein, and such changes shall still fall within the protection scope of this disclosure.

[0097] Based on the foregoing description, those skilled in the art will understand that this application also discloses a device including a processor and a memory. Specifically, the memory can store program instructions for scheduling tasks, which, when executed by the processor, implement the scheduling operation steps described in conjunction with Figures 1-6. Furthermore, since the solution of this application can be implemented through computer program instructions, this application also discloses a computer-readable storage medium or computer program product storing a computer program / instructions for task scheduling, thereby implementing the scheduling operation steps described in conjunction with Figures 1-6.

[0098] The solutions disclosed herein have been described in detail above with reference to the accompanying drawings. Depending on the application scenario, the devices or apparatus disclosed herein may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC devices, IoT terminals, mobile terminals, mobile phones, dashcams, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, means of transportation, home appliances, and / or medical devices. The means of transportation include airplanes, ships, and / or vehicles; the home appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, lights, gas stoves, and range hoods; the medical devices include MRI scanners, ultrasound machines, and / or electrocardiographs. The devices or apparatus disclosed herein can also be applied in fields such as the Internet, IoT, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and healthcare.

[0099] Furthermore, the devices or apparatuses disclosed herein can also be used in application scenarios related to artificial intelligence, big data, and / or cloud computing, such as the cloud, the edge, and terminals. In one or more embodiments, the high-power devices or apparatuses according to the disclosed scheme can be applied to cloud devices (e.g., cloud servers), while the low-power devices or apparatuses can be applied to terminal devices and / or edge devices (e.g., smartphones or cameras). In one or more embodiments, the hardware information of the cloud devices and the hardware information of the terminal devices and / or edge devices are compatible with each other, so that, based on the hardware information of the terminal devices and / or edge devices, suitable hardware resources can be matched from the hardware resources of the cloud devices to simulate the hardware resources of the terminal devices and / or edge devices, so as to complete the unified management, scheduling, and collaborative work of end-to-cloud or cloud-edge-end integration.

[0100] It should be noted that, for the sake of brevity, this disclosure describes some methods and their embodiments as a series of actions and combinations thereof. However, those skilled in the art will understand that the solutions disclosed herein are not limited by the order of the described actions. Therefore, based on the disclosure or teachings of this document, those skilled in the art will understand that some steps can be performed in a different order or simultaneously. Furthermore, those skilled in the art will understand that the embodiments described in this disclosure can be considered optional embodiments, that is, the actions or modules involved are not necessarily essential for the implementation of one or more solutions disclosed herein. In addition, depending on the solution, the description of some embodiments in this disclosure may have different emphases. In view of this, those skilled in the art will understand that parts not described in detail in a certain embodiment of this disclosure can also be referred to the relevant descriptions of other embodiments.

[0101] In terms of specific implementation, based on the disclosure and teachings of this document, those skilled in the art will understand that several embodiments disclosed herein can also be implemented in other ways not disclosed herein. For example, regarding the various units in the device or apparatus embodiments described above, this document divides them based on logical functions, but in actual implementation, there may be other division methods. As another example, multiple units or components can be combined or integrated into another system, or some features or functions in a unit or component can be selectively disabled. Regarding the connection relationships between different units or components, the connections discussed above in conjunction with the accompanying drawings can be direct or indirect couplings between units or components. In some scenarios, the aforementioned direct or indirect couplings involve communication connections utilizing interfaces, where the communication interface can support electrical, optical, acoustic, magnetic, or other forms of signal transmission.

[0102] In this disclosure, the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units. The aforementioned components or units may be located in the same location or distributed across multiple network units. Furthermore, depending on actual needs, some or all of the units can be selected to achieve the purpose of the solution described in the embodiments of this disclosure. Additionally, in some scenarios, multiple units in the embodiments of this disclosure may be integrated into one unit or each unit may exist physically independently.

[0103] In some implementation scenarios, the integrated unit described above can be implemented as a software program module. If implemented as a software program module and sold or used as an independent product, the integrated unit can be stored in a computer-readable memory. Accordingly, when the disclosed solution is embodied in a software product (e.g., a computer-readable storage medium), the software product can be stored in a memory and may include several instructions that cause a computer device (e.g., a personal computer, server, or network device) to execute some or all of the steps of the method described in the embodiments of this disclosure. The aforementioned memory may include, but is not limited to, various media capable of storing program code, such as USB drives, flash drives, read-only memory (ROM), random access memory (RAM), portable hard drives, magnetic disks, or optical disks.

[0104] In other implementation scenarios, the integrated units described above can also be implemented in hardware, i.e., as specific hardware circuits, which may include digital circuits and / or analog circuits. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors. Therefore, the various devices described herein (e.g., computing devices or other processing devices) can be implemented using appropriate hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Furthermore, the aforementioned storage unit or storage device can be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), such as resistive random access memory ("RRAM"), dynamic random access memory ("DRAM"), static random access memory ("SRAM"), enhanced dynamic random access memory ("EDRAM"), high bandwidth memory ("HBM"), hybrid memory cube ("HMC"), ROM, and RAM, etc.

[0105] The foregoing can be better understood in accordance with the following clauses:

[0106] Clause A1. A task scheduler disposed in an artificial intelligence processor, the artificial intelligence processor further comprising execution circuitry for performing tasks, the task scheduler comprising:

[0107] A first transmitting circuit is configured to transmit a prefetch task of a subsequent task to the execution circuit during the execution of the actual task of the current task, wherein tasks in the task scheduler are split into prefetch tasks and actual tasks that are associated with each other; and

[0108] The second sending circuit is used to send the actual task of the next task to the execution circuit after the execution circuit has completed the prefetching task of the next task, so that the execution circuit executes the actual task of the next task after the actual task of the current task has been completed.
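The ordering described in Clause A1 can be illustrated with a minimal sketch. It is a simplified software model, not the hardware circuits of this disclosure: the names `Task` and `dispatch_order` are hypothetical, and the model assumes that the prefetch task of the next task is sent at some point between the start and the end of the actual task of the current task.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Task:
    """Hypothetical task; each task is split into a prefetch part and an actual part."""
    name: str


def dispatch_order(tasks: List[Task]) -> List[Tuple[str, str]]:
    """Return the sequence of (event, task name) pairs for a list of tasks.

    Assumption: the prefetch task of task i+1 is sent while the actual task
    of task i is still running, and the actual task of task i+1 starts only
    after the actual task of task i is done.
    """
    events: List[Tuple[str, str]] = []
    if not tasks:
        return events
    # The first task has no predecessor to overlap with, so its prefetch runs alone.
    events.append(("prefetch", tasks[0].name))
    for i, task in enumerate(tasks):
        events.append(("actual_start", task.name))
        if i + 1 < len(tasks):
            # Sent during the actual task of the current task.
            events.append(("prefetch", tasks[i + 1].name))
        events.append(("actual_done", task.name))
    return events
```

For two tasks A and B, the prefetch of B appears between the start and the completion of the actual task of A, which is the overlap the clause describes.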

[0109] Clause A2, the task scheduler described in Clause A1, further includes:

[0110] A first receiving circuit is configured to receive the tasks, which are split into prefetch tasks and actual tasks associated with each other via program instructions; or

[0111] A partitioning circuit is used to split the received task into prefetch tasks and actual tasks that are related to each other.

[0112] Clause A3. The task scheduler according to Clause A1, wherein during the actual execution of the current task by the execution circuit, in sending a prefetch task of the next task to the execution circuit, the second sending circuit is further configured to:

[0113] At a predetermined time before the actual execution of the current task is completed, a prefetch task for the next task is sent to the execution circuit, so that the execution circuit executes the prefetch task for the next task during the execution of the actual task of the current task.

[0114] Clause A4. The task scheduler described in Clause A1 further includes:

[0115] A second receiving circuit is configured to receive a pre-completion indication from the execution circuit for the actual task of the current task; and

[0116] The first transmitting circuit is configured to send a prefetch task of the next task to the execution circuit in response to receiving the pre-completion indication, so that the execution circuit can release hardware resources to execute the prefetch task of the next task.

[0117] Clause A5. The task scheduler according to any one of Clauses A1-A4 further includes:

[0118] A third receiving circuit is configured to receive from the execution circuit a completion indication of the prefetch task for the next task; and

[0119] A timer is activated in response to receiving a completion indication of a prefetch task for a subsequent task from the execution circuit to time the actual task being performed by the execution circuit for the current task.

[0120] Clause A6. The task scheduler described in Clause A5 further includes:

[0121] A fourth receiving circuit is configured to receive from the execution circuit an incomplete indication indicating that the actual task has not been completed; and

[0122] The first transmitting circuit is configured to, in response to receiving the incomplete indication, transmit the prefetch task of the next task to the execution circuit or another execution circuit.

[0123] Clause A7, the task scheduler according to Clause A5, wherein the first sending circuit is further configured to send a prefetch task of the next task to the execution circuit or another execution circuit in response to the timer exceeding a predetermined threshold and not receiving any indication from the execution circuit.
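The decisions in Clauses A6 and A7 can be summarized as a small decision function. This is only a sketch under stated assumptions: the function name `next_action`, the string labels for the indications, and the returned action names are hypothetical and do not appear in this disclosure.

```python
from typing import Optional


def next_action(elapsed: float, threshold: float, indication: Optional[str]) -> str:
    """Decide what the scheduler does with the prefetch task of the next task.

    indication is one of "complete", "incomplete", or None (nothing received).
    All labels and return values are illustrative assumptions.
    """
    if indication == "incomplete":
        # Clause A6: the actual task has not been completed, so the prefetch
        # task is sent again to this or another execution circuit.
        return "resend_prefetch"
    if indication is None and elapsed > threshold:
        # Clause A7: the timer exceeded the threshold and nothing was received.
        return "resend_prefetch"
    if indication == "complete":
        return "proceed"
    return "wait"
```

Keeping the decision a pure function of the timer value and the last indication makes the two resend conditions easy to check independently.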

[0124] Clause A8. The task scheduler according to Clause A6 or A7, wherein in sending the prefetch task of the subsequent task to the execution circuit or another execution circuit, the first sending circuit is further configured to:

[0125] The prefetch task of the next task is placed in the priority sending queue so that the prefetch task of the next task can be resent to the execution circuit or another execution circuit with the highest sending authority.
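One way to picture the priority sending queue of Clause A8 is a two-level priority queue in which a resent prefetch task overtakes everything already waiting. The sketch below is an assumption for illustration only; the class name `SendQueue` and the `urgent` flag are hypothetical, and it uses Python's standard `heapq` module rather than any hardware queue.

```python
import heapq
import itertools


class SendQueue:
    """Hypothetical send queue: urgent (resent) items leave before normal ones."""

    def __init__(self):
        self._heap = []
        # Monotonic counter keeps first-in, first-out order within a priority level.
        self._counter = itertools.count()

    def push(self, item, urgent=False):
        priority = 0 if urgent else 1  # lower value is sent first
        heapq.heappush(self._heap, (priority, next(self._counter), item))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

Pushing a resent prefetch task with `urgent=True` places it ahead of earlier normal entries, while normal entries keep their arrival order.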

[0126] Clause A9. The task scheduler described in Clause A1 further includes:

[0127] A recording circuit is used to record errors that occur during the execution of the prefetch task.

[0128] Clause A10, the task scheduler as described in Clause A9, further includes:

[0129] An error reporting circuit is used to report the error when the actual task associated with the prefetch task is executed.
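Clauses A9 and A10 together describe deferred error reporting: an error seen during a prefetch task is recorded, and reported only when the associated actual task is executed. A minimal sketch of that bookkeeping follows; the class `PrefetchErrorLog`, its method names, and the use of string task identifiers are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List


class PrefetchErrorLog:
    """Hypothetical log that holds prefetch errors until the actual task runs."""

    def __init__(self):
        self._pending: Dict[str, List[str]] = defaultdict(list)

    def record(self, task_id: str, error: str) -> None:
        # Clause A9: record the error that occurred during the prefetch task.
        self._pending[task_id].append(error)

    def report_on_actual(self, task_id: str) -> List[str]:
        # Clause A10: hand over the recorded errors when the associated
        # actual task is executed, then forget them.
        return self._pending.pop(task_id, [])
```

Under this sketch, a prefetch error for a task whose actual task never runs is never reported.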

[0130] Clause A11. A task scheduler according to Clause A1, wherein the execution circuitry includes multiple processor cores operating on tasks to execute in parallel, wherein the tasks are divided into multiple subtasks and each subtask is executed by a corresponding processor core, and the task scheduler is further configured to:

[0131] It interacts with the multiple processor cores so that the multiple processor cores execute the prefetch subtasks and actual subtasks of the corresponding subtasks in parallel.

[0132] Clause A12, the task scheduler according to Clause A11, wherein in interacting with the plurality of processor cores to execute a task, the first transmitting circuit is further configured to:

[0133] In response to receiving a pre-completion indication for a prefetch subtask of the current task from all of the plurality of processor cores, the corresponding prefetch subtask of the next task is sent to each of the plurality of processor cores; and

[0134] The second transmitting circuit is further configured to, in response to receiving from all the plurality of processor cores a completion indication of the actual subtask of the current task and a pre-completion indication of the prefetched subtask of the next task, transmit the corresponding actual subtask of the next task to each of the plurality of processor cores so that it may be executed in parallel by the plurality of processor cores.
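The "from all of the plurality of processor cores" condition in Clause A12 behaves like a barrier: nothing is sent for the next task until every core has reported. The sketch below models only that counting logic; the class name `AllCoresBarrier` and the integer core identifiers are hypothetical.

```python
class AllCoresBarrier:
    """Hypothetical barrier that trips once every core has sent its indication."""

    def __init__(self, num_cores: int):
        if num_cores <= 0:
            raise ValueError("num_cores must be positive")
        self.num_cores = num_cores
        self._seen = set()

    def signal(self, core_id: int) -> bool:
        """Record an indication from core_id; return True when all cores have reported."""
        if not 0 <= core_id < self.num_cores:
            raise ValueError("unknown core id")
        self._seen.add(core_id)  # duplicate indications from one core count once
        if len(self._seen) == self.num_cores:
            self._seen.clear()  # reset for the next round of subtasks
            return True
        return False
```

A scheduler model could keep one such barrier per indication type and send the corresponding subtasks of the next task only when `signal` returns `True`.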

[0135] Clause A13, a task scheduler according to any one of Clauses A1-A12, wherein the prefetch task includes at least one of instruction fetching, querying a translation lookaside buffer (TLB), or virtual address to physical address translation.

[0136] Clause A14, the task scheduler according to Clause A13, wherein the virtual address to physical address translation is achieved by page table lookup, and the predetermined time is determined based on the number of page table levels in the page table lookup and the latency of each page table level.
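Clause A14 states only that the predetermined time depends on the number of page table levels and the latency of each level; it does not give a formula. As one possible reading, and purely as an assumption, the lead time could be estimated as the sum of the per-level latencies plus an optional margin. The function name, the nanosecond unit, and the `margin_ns` parameter below are hypothetical.

```python
from typing import Sequence


def prefetch_lead_time_ns(level_latencies_ns: Sequence[float], margin_ns: float = 0.0) -> float:
    """Estimate how long before the end of the current actual task to send the prefetch.

    Assumption: a page table walk visits each level once, so its cost is the
    sum of the per-level latencies; margin_ns is an illustrative safety margin.
    """
    if margin_ns < 0 or any(lat < 0 for lat in level_latencies_ns):
        raise ValueError("latencies and margin must be non-negative")
    return sum(level_latencies_ns) + margin_ns
```

For example, four page table levels at 100 ns each give a lead time of 400 ns under this assumption, or 450 ns with a 50 ns margin.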

[0137] Clause A15, the task scheduler as described in Clause A13, wherein the actual task includes executing the instructions.

[0138] Clause A16, an artificial intelligence processor, comprising:

[0139] Execution circuitry, configured to perform multiple tasks; and

[0140] A task scheduler according to any one of Clauses A1-A15, configured to interact with the execution circuitry to execute the scheduled plurality of tasks by the execution circuitry.

[0141] Clause A17, a board including the artificial intelligence processor described in Clause A16.

[0142] Clause A18. A method for performing task scheduling, comprising:

[0143] During the execution of the actual task of the current task by the execution circuit, a prefetch task for the next task is sent to the execution circuit, wherein the task is split into prefetch tasks and actual tasks that are related to each other; and

[0144] After the execution circuit completes the prefetch task of the next task, it sends the actual task of the next task to the execution circuit, so that the execution circuit executes the actual task of the next task after the actual task of the current task has been completed.

[0145] Clause A19, the method described pursuant to Clause A18, further includes:

[0146] Receive the task that has been split into prefetch tasks and actual tasks that are related to each other via program instructions; or

[0147] The received tasks are broken down into prefetch tasks and actual tasks that are related to each other.

[0148] Clause A20, the method according to Clause A18, wherein during the actual execution of the current task by the execution circuit, in which a prefetch task of the next task is sent to the execution circuit, the method further comprises:

[0149] At a predetermined time before the actual execution of the current task is completed, a prefetch task for the next task is sent to the execution circuit, so that the execution circuit executes the prefetch task for the next task during the execution of the actual task of the current task.

[0150] Clause A21, the method described pursuant to Clause A18, further includes:

[0151] Receive a pre-completion indication from the execution circuit for the actual task of the current task; and

[0152] In response to receiving the pre-completion indication, the hardware resources of the execution circuit are released to be used for the prefetching of the next task.

[0153] Clause A22, the method described pursuant to any one of Clauses A18-A21, further includes:

[0154] In response to receiving a completion indication of the prefetch task of the next task from the execution circuit, a timer is started to time the actual task of the current task being executed by the execution circuit.

[0155] Clause A23, the methods described pursuant to Clause A22, further include:

[0156] In response to the timing exceeding a predetermined threshold, an incomplete indication is received from the execution circuit to indicate that the actual task execution has not been completed; and

[0157] In response to receiving the incomplete indication, a prefetch task for the next task is sent to the execution circuit or another execution circuit.

[0158] Clause A24, the methods described pursuant to Clause A22, further include:

[0159] In response to the timing exceeding a predetermined threshold and no indication being received from the execution circuit, a prefetch task for the next task is sent to the execution circuit or another execution circuit.

[0160] Clause A25, the method according to Clause A23 or A24, wherein in sending the prefetch task of the subsequent task to the execution circuit or another execution circuit, the method further comprises:

[0161] The prefetch task of the next task is placed in the priority sending queue so that the prefetch task of the next task can be resent to the execution circuit or another execution circuit with the highest sending authority.

[0162] Clause A26, the method described pursuant to Clause A18, further includes:

[0163] Record any errors that occur during the execution of the prefetch task.

[0164] Clause A27, the methods described pursuant to Clause A26, further include:

[0165] The error will be reported when the actual task associated with the prefetch task is executed.

[0166] Clause A28. The method according to Clause A18, wherein the execution circuitry includes multiple processor cores operating on parallel execution tasks, wherein the tasks are divided into multiple subtasks and each subtask is executed by a corresponding processor core, the method further comprising:

[0167] It interacts with the multiple processor cores so that the multiple processor cores execute the prefetch subtasks and actual subtasks of the corresponding subtasks in parallel.

[0168] Clause A29, the method according to Clause A28, wherein in interacting with the plurality of processor cores to perform a task, the method further comprises:

[0169] In response to receiving a pre-completion indication for a prefetched subtask of the current task from all of the plurality of processor cores, a corresponding prefetched subtask of the next task is sent to each of the plurality of processor cores; and in response to receiving a completion indication for an actual subtask of the current task and a pre-completion indication for a prefetched subtask of the next task from all of the plurality of processor cores, a corresponding actual subtask of the next task is sent to each of the plurality of processor cores so that it may be executed in parallel by the plurality of processor cores.

[0170] Clause A30, the method according to any one of Clauses A18-A29, wherein the prefetch task includes at least one of instruction fetching, querying a translation lookaside buffer (TLB), or virtual address to physical address translation.

[0171] Clause A31, the method according to Clause A30, wherein the virtual address to physical address translation is achieved by page table lookup, and the predetermined time is determined based on the number of page table levels in the page table lookup and the delay of each page table level.

[0172] Clause A32, the method described in Clause A30, wherein the actual task includes executing the instructions.

[0173] Clause A33, an apparatus for scheduling and executing tasks, comprising:

[0174] A processor; and a memory storing program instructions for scheduling tasks, which, when executed by the processor, cause the method described under any one of clauses A18-A32 to be implemented.

[0175] Clause A34. A computer-readable storage medium storing program instructions for scheduling tasks, which, when executed by a processor, cause the implementation of the method according to any one of clauses A18-A32.

[0176] While the embodiments of this disclosure are described above, the content is merely an example for the purpose of facilitating understanding of this disclosure and is not intended to limit the scope or application scenarios of this disclosure. Any person skilled in the art can make any modifications and changes in form and detail of the implementation without departing from the spirit and scope disclosed herein; however, the patent protection scope of this disclosure shall still be determined by the scope defined in the appended claims.

Claims

1. A task scheduler disposed in an artificial intelligence processor, the artificial intelligence processor further comprising execution circuitry for performing tasks, the task scheduler comprising: a first sending circuit configured to send a prefetch task of a next task to the execution circuit during the execution of the actual task of the current task by the execution circuit, wherein the tasks in the task scheduler are split into prefetch tasks and actual tasks that are associated with each other; and a second sending circuit configured to send the actual task of the next task to the execution circuit after the execution circuit has completed the prefetch task of the next task, so that the execution circuit executes the actual task of the next task after the actual task of the current task has been completed.

2. The task scheduler according to claim 1, further comprising: A first receiving circuit is configured to receive the tasks, which are split into prefetch tasks and actual tasks associated with each other via program instructions; or A partitioning circuit is used to split the received task into prefetch tasks and actual tasks that are related to each other.

3. The task scheduler according to claim 1, wherein during the actual task execution of the current task by the execution circuit, in sending a prefetch task of the next task to the execution circuit, the second sending circuit is further configured to: At a predetermined time before the actual execution of the current task is completed, a prefetch task for the next task is sent to the execution circuit, so that the execution circuit executes the prefetch task for the next task during the execution of the actual task of the current task.

4. The task scheduler according to claim 1, further comprising: a second receiving circuit configured to receive a pre-completion indication from the execution circuit for the actual task of the current task; and wherein the first sending circuit is configured to send the prefetch task of the next task to the execution circuit in response to receiving the pre-completion indication, so that the execution circuit can release hardware resources to execute the prefetch task of the next task.

5. The task scheduler according to any one of claims 1-4, further comprising: a third receiving circuit configured to receive a completion indication of the prefetch task of the next task from the execution circuit; and a timer configured to be started in response to receiving the completion indication of the prefetch task of the next task from the execution circuit, so as to time the actual task of the current task being executed by the execution circuit.

6. The task scheduler according to claim 5, further comprising: a fourth receiving circuit configured to receive from the execution circuit an incomplete indication indicating that the actual task has not been completed; and wherein the first sending circuit is configured to, in response to receiving the incomplete indication, send the prefetch task of the next task to the execution circuit or another execution circuit.

7. The task scheduler of claim 5, wherein the first sending circuit is further configured to send a prefetch task of the next task to the execution circuit or another execution circuit in response to the timer exceeding a predetermined threshold and not receiving any indication from the execution circuit.

8. The task scheduler according to claim 6 or 7, wherein in sending the prefetch task of the subsequent task to the execution circuit or another execution circuit, the first sending circuit is further configured to: The prefetch task of the next task is placed in the priority sending queue so that the prefetch task of the next task can be resent to the execution circuit or another execution circuit with the highest sending authority.

9. The task scheduler according to claim 1, further comprising: A recording circuit is used to record errors that occur during the execution of the prefetch task.

10. The task scheduler according to claim 9, further comprising: An error reporting circuit is used to report the error when the actual task associated with the prefetch task is executed.

11. The task scheduler of claim 1, wherein the execution circuitry includes a plurality of processor cores operating on parallel execution of tasks, wherein the tasks are divided into a plurality of subtasks and each subtask is executed by a corresponding processor core, and the task scheduler is further configured to: It interacts with the multiple processor cores so that the multiple processor cores execute the prefetch subtasks and actual subtasks of the corresponding subtasks in parallel.

12. The task scheduler of claim 11, wherein, in interacting with the plurality of processor cores to execute a task, the first sending circuit is further configured to, in response to receiving a pre-completion indication for a prefetch subtask of the current task from all of the plurality of processor cores, send the corresponding prefetch subtask of the next task to each of the plurality of processor cores; and the second sending circuit is further configured to, in response to receiving from all of the plurality of processor cores a completion indication of the actual subtask of the current task and a pre-completion indication of the prefetch subtask of the next task, send the corresponding actual subtask of the next task to each of the plurality of processor cores so that it may be executed in parallel by the plurality of processor cores.

13. The task scheduler according to any one of claims 1-12, wherein the prefetch task includes at least one of instruction fetching, querying a translation lookaside buffer (TLB), or virtual address to physical address translation.

14. The task scheduler of claim 13, wherein the virtual address to physical address translation is implemented by page table lookup, and the predetermined time is determined based on the number of page table levels in the page table lookup and the delay of each page table level.

15. The task scheduler of claim 13, wherein the actual task includes executing the instructions.

16. An artificial intelligence processor, comprising: an execution circuit configured to perform multiple tasks; and the task scheduler according to any one of claims 1-15, configured to interact with the execution circuit so that the execution circuit executes the scheduled plurality of tasks.

17. A board comprising the artificial intelligence processor according to claim 16.

18. A method for performing task scheduling, comprising: during the execution of the actual task of the current task by the execution circuit, sending a prefetch task of the next task to the execution circuit, wherein the task is split into prefetch tasks and actual tasks that are related to each other; and after the execution circuit completes the prefetch task of the next task, sending the actual task of the next task to the execution circuit, so that the execution circuit executes the actual task of the next task after the actual task of the current task has been completed.

19. The method of claim 18, further comprising: Receive the task that has been split into prefetch tasks and actual tasks that are related to each other via program instructions; or The received tasks are broken down into prefetch tasks and actual tasks that are related to each other.

20. The method of claim 18, wherein during the actual execution of the current task by the execution circuit, in which a prefetch task of a subsequent task is sent to the execution circuit, the method further comprises: At a predetermined time before the actual execution of the current task is completed, a prefetch task for the next task is sent to the execution circuit, so that the execution circuit executes the prefetch task for the next task during the execution of the actual task of the current task.

21. The method of claim 18, further comprising: receiving a pre-completion indication from the execution circuit for the actual task of the current task; and in response to receiving the pre-completion indication, releasing the hardware resources of the execution circuit for use in the prefetch task of the next task.

22. The method according to any one of claims 18-21, further comprising: in response to receiving a completion indication of the prefetch task of the next task from the execution circuit, starting a timer to time the actual task of the current task being executed by the execution circuit.

23. The method of claim 22, further comprising: in response to the timing exceeding a predetermined threshold, receiving from the execution circuit an incomplete indication indicating that the actual task execution has not been completed; and in response to receiving the incomplete indication, sending a prefetch task of the next task to the execution circuit or another execution circuit.

24. The method of claim 22, further comprising: In response to the timing exceeding a predetermined threshold and no indication being received from the execution circuit, a prefetch task for the next task is sent to the execution circuit or another execution circuit.

25. The method of claim 23 or 24, wherein in sending the prefetch task of the subsequent task to the execution circuit or another execution circuit, the method further comprises: The prefetch task of the next task is placed in the priority sending queue so that the prefetch task of the next task can be resent to the execution circuit or another execution circuit with the highest sending authority.

26. The method of claim 18, further comprising: Record any errors that occur during the execution of the prefetch task.

27. The method of claim 26, further comprising: The error will be reported when the actual task associated with the prefetch task is executed.

28. The method of claim 18, wherein the execution circuitry includes a plurality of processor cores operating on a parallel execution task, wherein the task is divided into a plurality of subtasks and each subtask is executed by a corresponding processor core, the method further comprising: It interacts with the multiple processor cores so that the multiple processor cores execute the prefetch subtasks and actual subtasks of the corresponding subtasks in parallel.

29. The method of claim 28, wherein in interacting with the plurality of processor cores to perform a task, the method further comprises: in response to receiving a pre-completion indication for a prefetch subtask of the current task from all of the plurality of processor cores, sending the corresponding prefetch subtask of the next task to each of the plurality of processor cores; and in response to receiving completion indications for the actual subtasks of the current task and pre-completion indications for the prefetch subtasks of the next task from all of the plurality of processor cores, sending the corresponding actual subtask of the next task to each of the plurality of processor cores so that it can be executed in parallel by the plurality of processor cores.

30. The method according to any one of claims 18-29, wherein the prefetch task includes at least one of instruction fetching, querying a translation lookaside buffer (TLB), or virtual address to physical address translation.

31. The method of claim 30, wherein the virtual address to physical address translation is implemented by page table lookup, and the predetermined time is determined based on the number of page table levels in the page table lookup and the delay of each page table level.

32. The method of claim 30, wherein the actual task includes executing instructions.

33. An apparatus for scheduling and executing tasks, comprising: a processor; and a memory storing program instructions for scheduling tasks, which, when executed by the processor, cause the method according to any one of claims 18-32 to be implemented.

34. A computer-readable storage medium storing program instructions for scheduling tasks, which, when executed by a processor, cause the method according to any one of claims 18-32 to be implemented.