A processor, processing method, and related device

By introducing a graph computation flow unit and a general-purpose arithmetic unit into the processor core to execute instructions in parallel, the communication latency problem between the graph computation accelerator and the general-purpose processor is solved, improving the parallelism and computational efficiency of the processor and achieving the effect of graph computation acceleration.

CN115668142B · Active · Publication Date: 2026-06-26 · HUAWEI TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: HUAWEI TECH CO LTD
Filing Date: 2020-05-30
Publication Date: 2026-06-26

Abstract

The application discloses a processor, a processing method, and related equipment. The processor comprises a processor core, and the processor core comprises an instruction scheduling unit, a graph computing flow unit connected to the instruction scheduling unit, and at least one general-purpose computing unit. The instruction scheduling unit is configured to distribute the general-purpose computing instructions among the decoded to-be-executed instructions to the at least one general-purpose computing unit, and to distribute the graph computing control instructions among the decoded to-be-executed instructions to the graph computing flow unit. The general-purpose computing instructions instruct the execution of general-purpose computing tasks, and the graph computing control instructions instruct the execution of graph computing tasks. The at least one general-purpose computing unit is configured to execute the general-purpose computing instructions, and the graph computing flow unit is configured to execute the graph computing control instructions. The application can improve the processing efficiency of the processor.

Description

Technical Field

[0001] This invention relates to the field of directed graph computation technology, and in particular to a processor, processing method and related equipment.

Background Technology

[0002] As the scale and complexity of data continue to increase across various fields, the demands on processor computing power and processing performance are also rising. A superscalar central processing unit (CPU) architecture implements instruction-level parallelism within a single processor core. It improves performance by issuing multiple instructions per clock cycle and uses hardware logic units to resolve the dependencies between instructions that are issued in parallel. This technique achieves higher CPU throughput at the same clock frequency.

[0003] However, the dependencies between numerous hardware logic units in superscalar CPUs increase the difficulty of hardware design verification, consuming significant amounts of energy and hardware space. Furthermore, as it becomes increasingly difficult to improve the operating frequency, instruction width, and complexity of superscalar CPUs, their performance cannot scale linearly, resulting in power consumption increases exceeding performance gains and deteriorating energy efficiency.

[0004] To address the aforementioned technical issues, the prior art has proposed a graph computing accelerator scheme called the Specialization Engine for Explicit Dataflow (SEED) architecture. Its core idea is to describe instruction dependencies explicitly at the instruction set level, exposing the parallelism between instructions directly to the hardware and thereby accelerating the processor. This architecture combines a graph computing accelerator and a superscalar processor into a hybrid architecture. The graph computing accelerator and the superscalar processor share a common cache, and a communication bus transmits the liveIn data from the superscalar processor's registers to the graph computing accelerator. After the graph computing accelerator completes its computation, it transmits the liveOut data back to the superscalar processor's registers via a separate communication bus. With the SEED architecture, program segments suitable for the graph computing architecture can be scheduled onto the graph computing accelerator, while program segments unsuitable for it can be scheduled onto the superscalar processor. Execution can therefore switch between the graph architecture and the superscalar architecture.

[0005] However, the SEED architecture uses an accelerator model: the graph computing accelerator and the general-purpose processor are completely independent systems. The graph computing accelerator has its own independent data and instruction input channels, and the superscalar processor communicates with it through a message channel or shared memory. This results in significant communication latency between the superscalar processor and the graph computing accelerator. Furthermore, the graph computing accelerator cannot handle interrupts, cannot run an operating system, and cannot be shared by multiple processes. As a result, the SEED architecture cannot further improve the parallelism between hardware components (that is, between the graph computing accelerator and the superscalar processor), which limits the architecture's operating efficiency and overall performance and significantly reduces the availability of the graph computing accelerator.

[0006] In summary, how to provide a more efficient graph computing model to accelerate the operation of general-purpose processors has become an urgent technical problem to be solved.

Summary of the Invention

[0007] This invention provides a processor, a processing method, and related devices that enable graph computing to accelerate the operation of general-purpose processors.

[0008] In a first aspect, embodiments of the present invention provide a processor, including a processor core, wherein the processor core includes an instruction scheduling unit, a graph computation flow unit connected to the instruction scheduling unit, and at least one general-purpose arithmetic unit, wherein:

[0009] The instruction scheduling unit is configured to: allocate general computation instructions from the decoded instructions to be executed to the at least one general-purpose arithmetic unit, and allocate graph computation control instructions from the decoded instructions to be executed to the graph computation flow unit, wherein the general computation instructions are used to instruct the execution of a general computation task, and the graph computation control instructions are used to instruct the execution of a graph computation task; the at least one general-purpose arithmetic unit is configured to execute the general computation instructions; and the graph computation flow unit is configured to execute the graph computation control instructions.
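
The dispatch rule of paragraph [0009] can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the opcode names, the `ExecutionUnit` and `InstructionSchedulingUnit` classes, and the per-unit queues are all hypothetical, since the text defines no instruction encoding or queue structure.

```python
from dataclasses import dataclass, field

# Hypothetical opcode names; the source text does not define an encoding.
GRAPH_CONTROL_OPCODES = {
    "graph_build", "graph_pass_param", "graph_start", "graph_return_param",
}


@dataclass
class ExecutionUnit:
    name: str
    queue: list = field(default_factory=list)

    def accept(self, opcode):
        self.queue.append(opcode)


@dataclass
class InstructionSchedulingUnit:
    graph_flow_unit: ExecutionUnit
    general_units: dict  # unit class (e.g. "alu") -> ExecutionUnit

    def dispatch(self, decoded):
        """Route each decoded (opcode, unit_class) pair to its execution unit."""
        for opcode, unit_class in decoded:
            if opcode in GRAPH_CONTROL_OPCODES:
                # Graph computation control instructions go to the flow unit.
                self.graph_flow_unit.accept(opcode)
            else:
                # General computation instructions go to a general-purpose unit.
                self.general_units[unit_class].accept(opcode)


flow_unit = ExecutionUnit("graph computation flow unit")
alu = ExecutionUnit("ALU")
lsu = ExecutionUnit("LSU")
scheduler = InstructionSchedulingUnit(flow_unit, {"alu": alu, "lsu": lsu})
scheduler.dispatch([
    ("add", "alu"),
    ("graph_build", None),
    ("load", "lsu"),
    ("graph_start", None),
])
```

In this sketch all units sit behind the same scheduler, which mirrors the idea that the graph computation flow unit is one more execution unit inside the core rather than a separate accelerator.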

[0010] This invention provides a processor that implements graph computing to accelerate the operation of a general-purpose processor, specifically including hardware and software design. From a hardware perspective, this invention adds a hardware graph computation flow unit to the processor core and positions it along with other general-purpose arithmetic units (such as arithmetic logic units and floating-point units) in the processor's execution pipeline. This allows the processor to accelerate graph computing by either independently executing instructions through the graph computation flow unit or concurrently executing instructions with other general-purpose arithmetic units. From a software perspective, this invention designs extended instructions (such as graph computation control instructions) specifically for graph computation acceleration based on the general-purpose processor's instruction set. During the instruction scheduling phase, the instruction scheduling unit in the processor core directly schedules the graph computation control instructions to the graph computation flow unit for execution, thereby achieving graph computation acceleration. In this application, since the graph computation flow unit is located within the processor core, the instruction scheduling unit within the core can connect to and communicate directly with this graph computation flow unit. This allows graph computation control instructions to be directly scheduled to the graph computation flow unit without needing to communicate through other message channels or memory read / write methods, significantly reducing communication latency. Simultaneously, because the graph computation flow unit is located within the processor core, its synchronous or asynchronous operation with other computational units can be controlled, improving the processor's parallelism and computational efficiency. 
Furthermore, repetitive instruction sequences can be executed repeatedly within the graph computation architecture (i.e., the graph computation flow unit), reducing the number of times the processor core fetches instructions from memory and the bandwidth required. This also reduces the overhead of instruction dependency checks, branch prediction, and register access, makes effective use of the computational resources of the graph computation flow unit, and further improves the processor's operating efficiency and performance. In summary, based on the microarchitecture design of the aforementioned processor and the extension of the related instruction set, this application integrates the graph computing architecture into a general-purpose processor and uses it as an execution unit within the general-purpose processor core. It can execute graph computing tasks independently during the execution pipeline stage or concurrently with other general-purpose computing units, so that computing tasks are executed efficiently in the same processor through the collaboration of the graph computation flow unit and one or more general-purpose computing units.

[0011] In one possible implementation, the processor core further includes: an instruction fetching unit for fetching a target program to be executed; and an instruction decoding unit for decoding the target program to obtain the decoded instructions to be executed.

[0012] In this embodiment of the invention, the processor core further includes an instruction fetching unit and an instruction decoding unit, and the processor further includes a memory unit outside the processor core. The memory unit stores the target program to be executed. The instruction fetching unit inside the processor core retrieves the target program from the memory unit, and the instruction decoding unit inside the core decodes it to obtain instructions that can be executed directly by the execution units in the processor (such as the general-purpose arithmetic units and the graph computation flow unit), so that each instruction can be scheduled to the corresponding execution unit for execution.

[0013] In one possible implementation, the processor core further includes a result write-back unit; the graph computation flow unit and the at least one general-purpose arithmetic unit are respectively connected to the result write-back unit; the at least one general-purpose arithmetic unit is further configured to send a first execution result of the general-purpose computation task to the result write-back unit, the first execution result of the general-purpose computation task being the result obtained by executing the general-purpose computation instruction; the graph computation flow unit is further configured to send a second execution result of the graph computation task to the result write-back unit, the second execution result of the graph computation task being the result obtained by executing the graph computation control instruction; the result write-back unit is configured to write back part or all of the first execution result and the second execution result to the instruction scheduling unit.

[0014] In this embodiment of the invention, the processor core also includes a result write-back unit. The result write-back unit can temporarily store the results calculated by the general-purpose arithmetic units or the graph computation flow unit, and write some or all of those results back to the instruction scheduling unit for use in scheduling. The result write-back unit can also reorder results obtained from out-of-order execution. For example, it can order the results of instructions according to the order in which the instructions were fetched, and commit an instruction only after all instructions ahead of it have finished executing, at which point that instruction's result is final. Since the instruction scheduling unit in the processor core has the authority and conditions to obtain the relevant computational status of the graph computation flow unit (i.e., intermediate or final computational results temporarily stored in the result write-back unit), it can better control and access the graph computation flow unit, thereby controlling its synchronous or asynchronous operation with other execution units and improving the processor's parallelism and operating efficiency.
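
The reordering behaviour described in [0014] resembles a reorder buffer. The following Python sketch is an assumption-laden illustration of that idea only: the `ResultWriteBackUnit` class, its fields, and the use of a fetch sequence number are hypothetical and are not specified in the text.

```python
class ResultWriteBackUnit:
    """Toy reorder buffer: results arrive out of order, commit in fetch order."""

    def __init__(self):
        self.pending = {}        # fetch sequence number -> result
        self.next_to_commit = 0  # oldest instruction not yet committed
        self.committed = []      # results in fetch (program) order

    def receive(self, seq, result):
        """Store a result, then commit every result whose predecessors are done."""
        self.pending[seq] = result
        while self.next_to_commit in self.pending:
            self.committed.append(self.pending.pop(self.next_to_commit))
            self.next_to_commit += 1


write_back = ResultWriteBackUnit()
write_back.receive(2, "r2")  # a younger instruction finishes first; nothing commits
write_back.receive(0, "r0")  # the oldest instruction finishes; only it commits
after_two = list(write_back.committed)
write_back.receive(1, "r1")  # the gap is filled; instructions 1 and 2 commit in order
```

A result is held in `pending` until every earlier instruction has committed, which is the "all instructions ahead of it have finished" condition from the paragraph above.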

[0015] In one possible implementation, the processor further includes a memory unit; the graph computation flow unit includes N computation nodes; the graph computation control instructions include a start graph construction instruction, which carries a target address in the memory unit; the graph computation flow unit is specifically used to receive the start graph construction instruction and read graph block information from the memory unit according to the target address, the graph block information including the operation method of each of the N computation nodes, and the connection and order information between the N computation nodes.

[0016] In this embodiment of the invention, the graph computation control instruction received by the graph computation flow unit may specifically be a start graph construction instruction. This instruction instructs the graph computation flow unit to read the graph block information stored in the memory unit outside the processor core, according to the target address carried in the instruction. The graph block information includes the operation method of each computing node in the graph computation flow unit and the dependency relationships between the computing nodes, that is, the relationship between the calculation results and the input conditions of related computing nodes (the two computing nodes connected by an edge in the graph computation). Based on this information, the graph computation flow unit can complete the calculation of a complete graph block. It should be noted that the graph block can be one of the graph blocks in the graph computation or all of them; that is, a complete graph computation task can include one or more graph blocks after splitting.
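
One way to picture the graph block information is as a record holding per-node operations and a list of edges, looked up by target address. The Python sketch below assumes a layout for illustration only: the `Edge` and `GraphBlock` types, the `MEMORY` dictionary, the address `0x1000`, and the "left"/"right" port names are hypothetical, and the text does not specify how graph block information is encoded in memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: int       # node whose result is forwarded
    destination: int  # node that receives the result
    port: str         # destination input fed by this edge ("left" or "right")


@dataclass(frozen=True)
class GraphBlock:
    operations: dict  # node id -> operation name
    edges: tuple      # Edge records (connection and order information)


# Hypothetical memory image: target address -> graph block information.
MEMORY = {
    0x1000: GraphBlock(
        operations={0: "add", 1: "mul", 2: "sub"},
        edges=(Edge(0, 2, "left"), Edge(1, 2, "right")),
    )
}


def start_graph_construction(target_address):
    """Model the start graph construction instruction: read block info by address."""
    return MEMORY[target_address]


block = start_graph_construction(0x1000)
consumers_of_node_0 = [e.destination for e in block.edges if e.source == 0]
```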

[0017] In one possible implementation, the graph computation control instruction includes a parameter passing instruction, which carries the identifiers of M computation nodes and the input parameters corresponding to the identifiers of the M computation nodes, wherein the M computation nodes are some or all of the N nodes; the graph computation flow unit is used to receive the parameter passing instruction and input the input parameters corresponding to the identifiers of the M computation nodes to the M computation nodes respectively.

[0018] In this embodiment of the invention, if the graph computation control instruction received by the graph computation flow unit is specifically a parameter passing instruction, the parameter passing instruction contains the initial input parameters required by multiple computation nodes during a single graph block computation process. After the multiple computation nodes obtain the corresponding parameters from outside the graph computation flow unit, the graph computation flow unit meets the conditions for starting to execute the graph computation task, that is, it can start to perform graph computation.

[0019] In one possible implementation, the connection and order information between the N computing nodes includes source nodes and destination nodes corresponding to L edges respectively; the graph computation flow unit is specifically used to: monitor whether the input parameters required by each of the N computing nodes are ready; for a target computing node whose input parameters are ready, input the input parameters of the target computing node into the operation method corresponding to the target computing node for calculation to obtain the calculation result; and according to the source nodes and destination nodes corresponding to the L edges respectively, input the calculation result of the source node in each edge as an input parameter to the corresponding destination node.

[0020] In this embodiment of the invention, for each computation node in the graph computation flow unit, as long as the computation method for each computation node has been loaded and the input parameters have been obtained, the computation node can begin graph computation. Some computation nodes (such as the source node corresponding to an edge) obtain their initial input parameters from outside the graph computation flow unit, while other computation nodes (such as the destination node corresponding to an edge) may need to wait for the computation of their related computation nodes (such as the source node) to complete before using their computation results as their input parameters to begin graph computation. Therefore, the computation start time for each computation node may be inconsistent, but for each computation node, computation can begin once the computation method and input parameters (which may include left input parameters, right input parameters, or conditional parameters) are prepared.
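
The firing rule of [0019] and [0020] (a node computes once its input parameters are ready, and its result is forwarded along each outgoing edge) can be sketched in Python. This is a simplified software model under assumptions: every node takes exactly a left and a right input, conditional parameters are omitted, and the function and variable names are hypothetical.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul, "sub": operator.sub}


def run_graph_block(operations, edges, initial_inputs):
    """Fire each node once both inputs are ready; forward results along edges.

    operations:     node id -> operation name
    edges:          list of (source, destination, port), port is "left" or "right"
    initial_inputs: (node id, port) -> value, as a parameter passing
                    instruction might supply
    """
    inputs = {node: {} for node in operations}
    for (node, port), value in initial_inputs.items():
        inputs[node][port] = value

    results = {}
    fired_order = []
    progress = True
    while progress:
        progress = False
        for node, op_name in operations.items():
            ready = "left" in inputs[node] and "right" in inputs[node]
            if node in results or not ready:
                continue
            results[node] = OPS[op_name](inputs[node]["left"], inputs[node]["right"])
            fired_order.append(node)
            # The source node's result becomes an input of each destination node.
            for source, destination, port in edges:
                if source == node:
                    inputs[destination][port] = results[node]
            progress = True
    return results, fired_order


operations = {2: "sub", 0: "add", 1: "mul"}
edges = [(0, 2, "left"), (1, 2, "right")]
initial = {(0, "left"): 1, (0, "right"): 2, (1, "left"): 3, (1, "right"): 4}
results, fired_order = run_graph_block(operations, edges, initial)
```

Node 2 is listed first but fires last, because its inputs only become ready after nodes 0 and 1 have produced their results. Hardware would evaluate ready nodes concurrently; the sequential loop here only models the readiness condition.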

[0021] In one possible implementation, the graph computation control instruction includes a graph computation start instruction; the graph computation flow unit is specifically configured to: after receiving the graph computation start instruction, check whether the graph block information read by the graph computation flow unit is consistent with the pre-started graph block address, and determine whether the input parameters in the M computation nodes have been input; if the graph block information is consistent with the pre-started graph block address and the input parameters in the M computation nodes have been input, then start the execution of the graph computation task.

[0022] In this embodiment of the invention, the start graph computation instruction among the graph computation control instructions triggers the graph computation flow unit to perform the relevant checks before starting computation (e.g., checking whether the graph block information is correct and whether the initial input parameters are in place). After the graph computation flow unit completes these checks, it determines that graph construction has been completed and that execution of the graph computation task can start.
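
The pre-start checks in [0021] and [0022] amount to two conditions that must both hold. The Python sketch below is a minimal model under assumptions: block identity is compared by address, inputs are tracked as sets of (node, port) pairs, and all names are hypothetical.

```python
def can_start_graph_computation(loaded_block_address, requested_block_address,
                                required_inputs, received_inputs):
    """Return True only if the loaded block matches and all initial inputs arrived.

    required_inputs and received_inputs are sets of (node id, port) pairs.
    """
    block_matches = loaded_block_address == requested_block_address
    inputs_ready = required_inputs <= received_inputs  # subset test
    return block_matches and inputs_ready


required = {(0, "left"), (0, "right")}
ok = can_start_graph_computation(0x1000, 0x1000, required, {(0, "left"), (0, "right")})
wrong_block = can_start_graph_computation(0x2000, 0x1000, required, required)
missing_input = can_start_graph_computation(0x1000, 0x1000, required, {(0, "left")})
```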

[0023] In one possible implementation, the instruction scheduling unit is further configured to: control the processor core to enter a blocked state after the graph computation flow unit receives the start graph computation instruction but before completing the graph computation task. Further optionally, the instruction scheduling unit is further configured to: control the processor core to exit the blocked state after the graph computation flow unit completes the graph computation task.

[0024] In this embodiment of the invention, the processor core can start the graph computation function synchronously (i.e., the graph computation flow unit and the other general-purpose computing units execute tasks serially). Specifically, while the graph computation flow unit is executing a graph computation task, the processor core's pipeline is blocked until the graph computation flow unit completes the task. This ensures that only the graph computation flow unit executes tasks during this period, while the other computing units temporarily cannot, thereby reducing processor power consumption. The start graph computation instruction can thus switch the computation mode between the other computing units and the graph computation flow unit within the processor core, and can be applied to synchronous computation programs.

[0025] In one possible implementation, the instruction scheduling unit is further configured to: send a synchronous execution result instruction to the graph computation flow unit, and, after the graph computation flow unit receives the synchronous execution result instruction but before completing the graph computation task, control the processor core to enter a blocked state. Further optionally, the instruction scheduling unit is further configured to: control the processor core to exit the blocked state after the graph computation flow unit completes the graph computation task.

[0026] In this embodiment of the invention, the processor core can start the graph computation function asynchronously (i.e., the graph computation flow unit and the other general-purpose computing units can execute tasks in parallel). Specifically, while the graph computation flow unit is executing its graph computation task, the processor core's pipeline is not blocked, and the other computing units can operate normally. This continues until the processor sends a synchronous execution result instruction to the graph computation flow unit via the instruction scheduling unit (e.g., when the computation of another computing unit depends on the execution result of the graph computation flow unit). If the graph computation flow unit has not yet completed its graph computation task at that point, the processor core's pipeline is blocked until the graph computation flow unit completes the task and provides the execution result, after which the blocked state is lifted. This ensures that the other computing units wait for the graph computation flow unit's execution result when they need it before continuing execution, while still improving the parallelism of the processor core. The synchronous execution result instruction can thus implement a parallel computation mode between the other computing units within the processor and the graph computation flow unit, and can be applied to asynchronous computation programs.
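
The difference between the synchronous mode of [0023] and [0024] and the asynchronous mode of [0025] and [0026] can be illustrated with an idealized cycle-count model in Python. The model is an assumption made for illustration: it treats the graph task and the general work preceding the synchronization point as fixed cycle counts, ignores all other pipeline effects, and uses a hypothetical function name.

```python
def run_core(graph_cycles, general_cycles_before_sync, mode):
    """Return (total_cycles, blocked_cycles) for one graph task plus general work.

    mode "sync":  the pipeline blocks as soon as the graph task starts, so the
                  general work runs only after the graph task completes.
    mode "async": general work overlaps the graph task; the pipeline blocks at
                  the synchronous execution result instruction only if the
                  graph task is still running.
    """
    if mode == "sync":
        blocked = graph_cycles
        total = graph_cycles + general_cycles_before_sync
    elif mode == "async":
        blocked = max(0, graph_cycles - general_cycles_before_sync)
        total = max(graph_cycles, general_cycles_before_sync)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return total, blocked


sync_result = run_core(10, 6, "sync")
async_result = run_core(10, 6, "async")
async_no_stall = run_core(4, 6, "async")
```

With a 10-cycle graph task and 6 cycles of independent general work, the synchronous mode takes 16 cycles with 10 blocked, while the asynchronous mode takes 10 cycles with 4 blocked. If the graph task finishes before the synchronization point is reached, the asynchronous mode does not block at all.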

[0027] In one possible implementation, the processor core further includes a result write-back unit, which includes multiple registers; the graph computation flow unit and the at least one general-purpose arithmetic unit are respectively connected to the result write-back unit; the graph computation control instruction includes a return parameter instruction, which carries the identifiers of K computation nodes and the registers corresponding to the identifiers of the K computation nodes; the graph computation flow unit is specifically used to control the sending of the computation results of the K computation nodes to the corresponding registers in the result write-back unit.

[0028] In this embodiment of the invention, for the N computing nodes of the graph computing flow unit, some computing nodes may need to output the calculation results to the result write-back unit outside the graph computing flow unit after the final calculation is completed. That is, the graph computing flow unit can control the final calculation results of the K computing nodes as the calculation results of the entire graph block, and output them to the result write-back unit outside the graph computing flow unit, so that the subsequent execution unit can perform further calculations based on the above calculation results.
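
The return parameter instruction of [0027] and [0028] maps K node results to registers in the result write-back unit. A minimal Python sketch follows, assuming registers are addressed by name and modelled as a dictionary; the function name and register names are hypothetical.

```python
def return_parameters(node_results, return_map, registers):
    """Copy the results of the K selected nodes into their assigned registers.

    node_results: node id -> computed value (all N nodes)
    return_map:   node id -> register name, as carried by the return
                  parameter instruction
    registers:    register name -> value, standing in for the registers of
                  the result write-back unit
    """
    for node, register in return_map.items():
        registers[register] = node_results[node]
    return registers


node_results = {0: 3, 1: 12, 2: -9}
registers = {"r0": 0, "r1": 0}
return_parameters(node_results, {2: "r1"}, registers)
```

Only node 2 is named in the return map here, so intermediate results of nodes 0 and 1 stay inside the graph computation flow unit and never reach a register.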

[0029] In one possible implementation, the general-purpose arithmetic instructions include general-purpose arithmetic logic instructions; the at least one general-purpose arithmetic unit includes an arithmetic logic unit (ALU) for receiving general-purpose arithmetic logic instructions sent by the instruction scheduling unit and performing logical operations; optionally, the general-purpose arithmetic instructions include memory read / write instructions; the at least one general-purpose arithmetic unit includes a memory read / write unit (LSU) for receiving memory read / write instructions sent by the instruction scheduling unit and performing memory read / write operations.

[0030] In this embodiment of the invention, the at least one arithmetic unit may further include an arithmetic logic unit or a memory read / write unit. The arithmetic logic unit is mainly used for input-related logical operations, while the memory read / write unit is used to perform memory read / write operations. That is, the above-mentioned units are all in the execution pipeline stage with the graph computation flow unit, and together complete various types of computation tasks after decoding in the CPU. They can be executed in parallel, in serial order, or in a combination of parallel and serial execution, so as to complete the processor's computation tasks more efficiently.

[0031] In one possible implementation, the graph computation control instructions include data read / write instructions, which carry memory read / write addresses; the graph computation flow unit is further configured to: read data from or write data to the memory read / write unit (LSU) according to the memory read / write address in the data read / write instructions.

[0032] In this embodiment of the invention, the graph computation flow unit in the processor core can reuse the function of the memory read / write unit in the processor core, and read data from or write data to the memory read / write unit (LSU) according to the read / write address in the relevant data read / write instructions.

[0033] In one possible implementation, the at least one general-purpose computing unit further includes a floating-point unit (FPU); the graph computation task includes floating-point operations; the graph computation flow unit is further configured to: send the data of the floating-point operations to the FPU for computation, and receive the computation result fed back by the FPU. Optionally, the at least one general-purpose computing unit further includes a vector operation unit (SIMD); the graph computation task includes vector operations; the graph computation flow unit is further configured to: send the data of the vector operations to the SIMD for computation, and receive the computation result fed back by the SIMD.

[0034] In this embodiment of the invention, the general-purpose computing unit may further include a floating-point unit (FPU) and / or a vector operation unit (SIMD). The FPU is used for floating-point arithmetic tasks that require higher data precision, while the SIMD is used for single-instruction multiple-data operations. Since the general-purpose computing units and the graph computation flow unit are in the same execution pipeline stage and have data transmission channels between them, when the graph computation flow unit encounters floating-point or SIMD operations while processing a graph computation task, it can send them to the corresponding general-purpose computing unit through the corresponding data transmission channel. This eliminates the need to duplicate such processing units inside the graph computation flow unit, thereby greatly saving hardware area and overhead.
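
The offloading described in [0033] and [0034] can be pictured as the graph computation flow unit delegating certain operation types to a shared unit. The Python sketch below is an illustration under assumptions: the class names, the opcode names ("fadd", "fmul", "add", "mul"), and the request counter are hypothetical, and a method call stands in for the data transmission channel.

```python
class FloatingPointUnit:
    """Stand-in for the core's shared FPU."""

    def __init__(self):
        self.requests = 0

    def compute(self, op, a, b):
        self.requests += 1
        if op == "fadd":
            return a + b
        if op == "fmul":
            return a * b
        raise ValueError(f"unsupported FPU operation: {op}")


class GraphFlowUnit:
    """Handles integer operations locally and forwards floating-point ones."""

    def __init__(self, fpu):
        self.fpu = fpu

    def execute(self, op, a, b):
        if op in ("fadd", "fmul"):
            # Sent to the FPU; the result is fed back to the flow unit.
            return self.fpu.compute(op, a, b)
        if op == "add":
            return a + b
        if op == "mul":
            return a * b
        raise ValueError(f"unsupported operation: {op}")


fpu = FloatingPointUnit()
flow_unit = GraphFlowUnit(fpu)
int_result = flow_unit.execute("add", 2, 3)
float_result = flow_unit.execute("fmul", 1.5, 2.0)
```

Because the flow unit holds only a reference to the shared FPU, no floating-point logic is duplicated inside it, which corresponds to the hardware-area saving stated above.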

[0035] In a second aspect, embodiments of the present invention provide a processing method applied to a processor, the processor including a processor core, the processor core including an instruction scheduling unit, a graph computation flow unit connected to the instruction scheduling unit, and at least one general-purpose arithmetic unit; the method includes:

[0036] The instruction scheduling unit allocates the general computing instructions from the decoded instructions to be executed to the at least one general-purpose arithmetic unit, and allocates the graph computing control instructions from the decoded instructions to be executed to the graph computation flow unit. The general computing instructions are used to instruct the execution of general computing tasks, and the graph computing control instructions are used to instruct the execution of graph computing tasks.

[0037] The general computing instructions are executed through the at least one general-purpose arithmetic unit;

[0038] The graph computation control instructions are executed by the graph computation flow unit.

[0039] In one possible implementation, the processor core further includes an instruction fetch unit and an instruction decoding unit; the method further includes:

[0040] The target program to be executed is obtained through the instruction fetch unit;

[0041] The target program is decoded by the instruction decoding unit to obtain the decoded instruction to be executed.

[0042] In one possible implementation, the processor core further includes a result write-back unit; the graph computation flow unit and the at least one general-purpose arithmetic unit are respectively connected to the result write-back unit; the method further includes:

[0043] The first execution result of the general computing task is sent to the result write-back unit through the at least one general computing unit, and the first execution result of the general computing task is the result obtained by executing the general computing instruction;

[0044] The graph computation flow unit sends the second execution result of the graph computation task to the result write-back unit, whereby the second execution result of the graph computation task is the result obtained by executing the graph computation control instructions.

[0045] The result write-back unit writes back part or all of the first execution result and the second execution result to the instruction scheduling unit.

[0046] In one possible implementation, the processor further includes a memory unit; the graph computation flow unit includes N computation nodes; the graph computation control instructions include a start graph construction instruction, the start graph construction instruction carrying a target address in the memory unit; the execution of the graph computation control instructions through the graph computation flow unit includes:

[0047] The graph computation flow unit receives the start graph construction instruction and reads the graph block information from the memory unit according to the target address. The graph block information includes the operation method of each of the N computing nodes and the connection and order information between the N computing nodes.

[0048] In one possible implementation, the graph computation control instruction includes a parameter passing instruction, which carries identifiers of M computation nodes and input parameters corresponding to the identifiers of the M computation nodes, wherein the M computation nodes are some or all of the N nodes; executing the graph computation control instruction through the graph computation flow unit includes:

[0049] The graph computation flow unit receives the parameter passing instruction and inputs the input parameters corresponding to the identifiers of the M computation nodes into the M computation nodes respectively.

[0050] In one possible implementation, the connection and order information between the N computing nodes includes the source and destination nodes corresponding to the L edges respectively; the execution of the graph computing control instructions through the graph computing flow unit includes:

[0051] The graph computation flow unit monitors whether the input parameters required for each of the N computation nodes are ready; for a target computation node whose input parameters are ready, the input parameters of the target computation node are input into the corresponding operation method of the target computation node for calculation to obtain the calculation result; according to the source node and destination node corresponding to the L edges respectively, the calculation result of the source node in each edge is used as the input parameter and input to the corresponding destination node.
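The firing behavior described above can be sketched in software as follows. This is a minimal model only, assuming every node takes exactly two inputs and each input slot is fed by at most one edge; `run_graph`, `OPS`, and the `(src, dst, port)` edge tuple are illustrative names that do not come from the source.

```python
import operator
from collections import deque

# Hypothetical operation methods; each node is assumed to take two inputs.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def run_graph(ops, edges, params):
    """Execute a graph block by firing each node once its inputs are ready.

    ops:    node id -> operation name
    edges:  list of (src, dst, port) tuples
    params: (node id, port) -> initial input value
    """
    inputs = {node: {} for node in ops}
    for (node, port), value in params.items():
        inputs[node][port] = value

    results = {}
    ready = deque(node for node in ops if len(inputs[node]) == 2)
    while ready:
        node = ready.popleft()
        results[node] = OPS[ops[node]](inputs[node][0], inputs[node][1])
        # Forward the result along every edge whose source is this node.
        for src, dst, port in edges:
            if src != node:
                continue
            inputs[dst][port] = results[node]
            if len(inputs[dst]) == 2 and dst not in results and dst not in ready:
                ready.append(dst)
    return results
```

For example, with nodes 0 (add) and 1 (mul) feeding node 2 (sub), nodes 0 and 1 fire as soon as their parameters arrive, and node 2 fires only after both results have been forwarded to it.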

[0052] In one possible implementation, the graph computation control instructions include a graph computation initiation instruction; the step of executing the graph computation control instructions through the graph computation flow unit to obtain the execution result of the graph computation task includes:

[0053] After receiving the start graph computation instruction, the graph computation flow unit checks whether the graph block information it has read is consistent with the address of the graph block to be started, and determines whether the input parameters of the M computation nodes have all been input. If the graph block information is consistent with that address and the input parameters of the M computation nodes have all been input, the graph computation task is started.
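The two start conditions can be expressed as a simple predicate. The sketch below is only an assumption about how such a check might look in a software model; the function name `can_start` and its arguments are hypothetical.

```python
def can_start(loaded_block_address, start_address, expected_nodes, nodes_with_params):
    """Return True only if both start conditions hold.

    loaded_block_address: address of the graph block already read by the unit
    start_address:        graph block address named by the start instruction
    expected_nodes:       identifiers of the M nodes that need input parameters
    nodes_with_params:    identifiers of nodes whose parameters have been input
    """
    address_matches = loaded_block_address == start_address
    params_ready = set(expected_nodes) <= set(nodes_with_params)
    return address_matches and params_ready
```

If either condition fails, this model simply does not start the task; how real hardware would report or handle that case is not specified in the source.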

[0054] In one possible implementation, the method further includes:

[0055] The instruction scheduling unit controls the processor core to enter a blocked state after the graph computation flow unit receives the start graph computation instruction but before the graph computation task is completed.

[0056] In one possible implementation, the method further includes:

[0057] The instruction scheduling unit sends a synchronous execution result instruction to the graph computation flow unit, and after the graph computation flow unit receives the synchronous execution result instruction but before completing the graph computation task, it controls the processor core to enter a blocked state.

[0058] In one possible implementation, the method further includes:

[0059] After the graph computation flow unit completes the graph computation task, the instruction scheduling unit controls the processor core to exit the blocked state.
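The blocking behavior in the three implementations above can be summarized as a small state model. This is a sketch under the assumption that blocking is either requested when the task starts or deferred until a synchronous execution result instruction arrives; the class and method names are hypothetical.

```python
class CoreBlockingModel:
    """Tracks whether the processor core is blocked on a graph computation task."""

    def __init__(self):
        self.task_running = False
        self.blocked = False

    def start_graph(self, block_on_start):
        # The task starts; the core blocks immediately only in the synchronous case.
        self.task_running = True
        self.blocked = block_on_start

    def sync_result(self):
        # A synchronous execution result instruction blocks the core
        # only while the task is still in flight.
        if self.task_running:
            self.blocked = True

    def graph_done(self):
        # Completion of the task releases the core from the blocked state.
        self.task_running = False
        self.blocked = False
```

Calling `start_graph(False)` followed later by `sync_result()` models asynchronous execution, in which the core keeps issuing other instructions until it needs the graph result.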

[0060] In one possible implementation, the processor core further includes a result write-back unit, which includes multiple registers; the graph computation flow unit and the at least one general-purpose arithmetic unit are respectively connected to the result write-back unit; the graph computation control instruction includes a return parameter instruction, which carries the identifiers of K computation nodes and the registers corresponding to the identifiers of the K computation nodes; the execution of the graph computation control instruction through the graph computation flow unit to obtain the execution result of the graph computation task includes:

[0061] The graph computation flow unit sends the computation results of the K computation nodes to the corresponding registers in the result write-back unit.
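A return parameter instruction can be pictured as a mapping from node identifiers to registers. The following sketch assumes registers are addressed by name and that node results are already available; `return_parameters` and the register name `x5` are illustrative only.

```python
def return_parameters(node_results, return_map, registers):
    """Copy the results of the K named nodes into their assigned registers.

    node_results: node id -> computation result
    return_map:   node id -> register name carried by the return parameter instruction
    registers:    register name -> value (the result write-back unit's registers)
    """
    for node_id, register in return_map.items():
        registers[register] = node_results[node_id]
    return registers
```

Only the K nodes listed in `return_map` are written back; the results of other nodes stay inside the graph computation flow unit in this model.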

[0062] In one possible implementation, the general-purpose arithmetic instructions include general-purpose arithmetic logic instructions; the at least one general-purpose arithmetic unit includes an arithmetic logic unit (ALU); executing the general-purpose arithmetic instructions through the at least one general-purpose arithmetic unit includes:

[0063] The arithmetic logic unit (ALU) receives general arithmetic logic instructions sent by the instruction scheduling unit and performs logical operations.

[0064] In one possible implementation, the general-purpose arithmetic instructions include memory read / write instructions; the at least one general-purpose arithmetic unit includes a memory read / write unit (LSU); and the step of executing the general-purpose arithmetic instructions through the at least one general-purpose arithmetic unit to obtain the execution result of the general-purpose arithmetic task includes:

[0065] The memory read / write unit (LSU) receives memory read / write instructions sent by the instruction scheduling unit and performs memory read / write operations.

[0066] In one possible implementation, the graph computation control instructions include data read / write instructions, which carry memory read / write addresses; the method further includes:

[0067] The graph computation flow unit reads data from or writes data to the memory read / write unit (LSU) according to the memory read / write address in the data read / write instruction.

[0068] In one possible implementation, the at least one general-purpose arithmetic unit further includes a floating-point arithmetic unit (FPU); the graph computation task includes floating-point operations; the method further includes:

[0069] The graph computation flow unit sends the floating-point operation data to the floating-point arithmetic unit (FPU) for calculation, and receives the calculation results fed back by the FPU.

[0070] In one possible implementation, the at least one general-purpose computation unit further includes a vector computation unit (SIMD); the graph computation task includes vector operations; and the method further includes:

[0071] The graph computation flow unit sends the data for vector operations to the vector computation unit (SIMD) for computation, and receives the computation results fed back by the SIMD.

[0072] In a third aspect, this application provides a semiconductor chip that may include a processor provided by any of the implementations of the first aspect described above.

[0073] In a fourth aspect, this application provides a semiconductor chip that may include: a processor provided by any of the implementations of the first aspect above, an internal memory coupled to the processor, and an external memory.

[0074] In a fifth aspect, this application provides a System-on-a-Chip (SoC) chip, which includes a processor provided by any of the implementations of the first aspect above, internal memory coupled to the processor, and external memory. This SoC chip can be composed of chips or may include chips and other discrete devices.

[0075] In a sixth aspect, this application provides a chip system including a multi-core processor provided by any implementation of the first aspect described above. In one possible design, the chip system further includes a memory for storing program instructions and data necessary or related to the operation of the multi-core processor. This chip system may be composed of chips or may include chips and other discrete devices.

[0076] In a seventh aspect, this application provides a processing apparatus that has the function of implementing any of the processing methods described in the second aspect above. This function can be implemented in hardware or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the aforementioned functions.

[0077] In an eighth aspect, this application provides a terminal including a processor, which is a processor provided in any of the implementations of the first aspect described above. The terminal may further include a memory coupled to the processor, which stores the program instructions and data necessary for the terminal. The terminal may also include a communication interface for communicating with other devices or communication networks.

[0078] In a ninth aspect, this application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the processing method flow described in any implementation of the second aspect above.

[0079] In a tenth aspect, embodiments of the present invention provide a computer program including instructions that, when executed by a processor, cause the processor to perform the processing method flow described in any implementation of the second aspect above.

Description of the Drawings

[0080] Figure 1 is a schematic diagram of the structure of a processor provided in an embodiment of the present invention.

[0081] Figure 2 is a schematic diagram of the structure of another processor provided in an embodiment of the present invention.

[0082] Figure 3 is a schematic diagram of the structure of another processor provided in an embodiment of the present invention.

[0083] Figure 4 is a schematic diagram illustrating the process of source code synthesis, compilation, and execution provided in an embodiment of the present invention.

[0084] Figure 5 is a schematic diagram of a computational model of a graph computation flow unit provided in an embodiment of the present invention.

[0085] Figure 6 is a schematic diagram of a graph computation flow control instruction provided in an embodiment of the present invention.

[0086] Figure 7 is a schematic diagram of an abstract model of computation nodes in a graph block provided in an embodiment of the present invention.

[0087] Figure 8 is a schematic diagram of an abstract model of graph computation flow instructions provided in an embodiment of the present invention.

[0088] Figure 9 is a schematic diagram illustrating how code is abstracted into a data flow graph, as provided in an embodiment of the present invention.

[0089] Figure 10 is a flowchart illustrating a processing method provided in an embodiment of the present invention.

Detailed Description

[0090] The embodiments of the present invention will now be described with reference to the accompanying drawings.

[0091] The terms "first," "second," "third," and "fourth," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish different objects, not to describe a specific order. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or apparatus that includes a series of steps or units is not limited to the listed steps or units, but may optionally include steps or units not listed, or may optionally include other steps or units inherent to these processes, methods, products, or apparatuses.

[0092] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.

[0093] As used in this specification, the terms "component," "module," "system," and the like refer to computer-related entities: hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a component can be, but is not limited to, a process running on a processor, a processor, an object, an executable file, a thread of execution, a program, and / or a computer. By way of illustration, both an application running on a computing device and the computing device itself can be components. One or more components may reside in a process and / or a thread of execution, and a component may be located on a single computer and / or distributed between two or more computers. Furthermore, these components can execute from various computer-readable media on which various data structures are stored. Components can communicate via local and / or remote processes, for example according to a signal having one or more data packets (e.g., data from two components interacting with another component in a local system, in a distributed system, and / or across a network, such as the Internet, that interacts with other systems by way of the signal).

[0094] First, some of the terms used in this application will be explained to facilitate understanding by those skilled in the art.

[0095] (1) A graph is an abstract data structure used to represent the relationships between objects. It is described using vertices and edges: vertices represent objects and edges represent the relationships between objects.

[0096] (2) Superscalar processor architecture refers to a type of parallel operation that performs instruction-level parallelism within a single processor core. This technology can achieve higher CPU throughput at the same CPU clock speed.

[0097] (3) Single Instruction Multiple Data (SIMD) refers to a set of instructions that can copy multiple operands and pack them into a large register, so that one instruction operates on all of them.

[0098] (4) Instruction pipelining is a method to improve the efficiency of processor instruction execution by dividing the operation of an instruction into multiple small steps, each of which is completed by a dedicated circuit. For example, the execution of an instruction requires three stages: instruction fetch, decode, and execution. Each stage takes one machine cycle. Without pipelining, the execution of this instruction would take three machine cycles. With instruction pipelining, when the instruction completes "fetch" and enters "decode", the next instruction can be "fetched" at the same time, thus improving the execution efficiency of the instruction.
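The benefit in the three-stage example can be quantified with a small calculation. The sketch below assumes an ideal pipeline in which each stage takes one machine cycle and there are no stalls; the function names are illustrative.

```python
def cycles_unpipelined(n_instructions, n_stages):
    """Each instruction occupies the processor for all of its stages in turn."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages):
    """The first instruction takes n_stages cycles; each later one finishes one cycle after."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)
```

Under these assumptions, four instructions on a three-stage design take 12 cycles without pipelining and 6 cycles with it.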

[0099] (5) The Execution Unit (EU) is responsible for the execution of instructions and in practice has the functions of both a controller and an arithmetic unit.

[0100] (6) The register file, also known as the register stack, is an array of multiple registers in the CPU, usually implemented by a fast static random access memory (SRAM). This type of RAM has dedicated read and write ports, allowing multiple concurrent accesses to different registers.

[0101] (7) An integrated circuit (IC) is a miniature electronic device or component. Using certain processes, the transistors, resistors, capacitors, inductors, and other components required for a circuit, along with their interconnections, are fabricated on a small piece or several small pieces of semiconductor wafers or dielectric substrates, and then packaged in a casing to form a miniature structure with the required circuit function; that is, an IC chip is an integrated circuit formed by placing a large number of microelectronic components (transistors, resistors, capacitors, etc.) on a plastic substrate to make a chip.

[0102] Next, to facilitate understanding of the embodiments of the present invention, the processor architecture and the instruction set involved in this application are analyzed and presented.

[0103] Currently, in general-purpose processors based on the von Neumann architecture (also known as the control-flow architecture), the core idea is instruction-driven computation. The processor reads instructions sequentially according to their execution sequence and then calls data for processing based on the control information contained within the instructions. The challenge of this control-flow architecture is how to ensure continuous instruction execution without interruption while maintaining the processor's clock speed, thereby improving performance. Against this backdrop, techniques such as superscalar, very long instruction word (VLIW), dynamic scheduling algorithms, and instruction prefetching have emerged to enhance processor performance. However, these techniques still suffer from high performance overhead. Simultaneously, dataflow architecture has emerged to address these issues. Dataflow architecture explicitly describes instruction dependencies at the instruction set level, directly presenting the parallelism between instructions to the hardware for execution. Dataflow architecture can be abstracted as a directed graph consisting of N nodes. Connections between nodes represent a dataflow. Once the input of each node is ready, the current node can perform computation and pass the result to the next node. Therefore, nodes not on the same path within the same graph can run concurrently, thereby improving the parallelism of processing. Currently, traditional dataflow architectures also require support for control flow. Therefore, in this application, (dataflow + control flow) will be uniformly referred to as graph computing architecture. It should be noted that the control flow in graph computing architecture is not entirely equivalent to the control flow of a general-purpose processor. 
The control flow in a general-purpose processor architecture mainly refers to the execution instructions for general-purpose operations, while the control flow in graph computing architecture mainly refers to various graph computing control instructions within the graph (such as switch / gate / predicate / gate instructions, etc.).
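The statement that nodes not on the same path can run concurrently can be illustrated by grouping the nodes of a directed acyclic graph into levels. This is a minimal sketch, not a description of the hardware; `concurrency_levels` and the example node names are hypothetical.

```python
def concurrency_levels(nodes, edges):
    """Group nodes so that all nodes in one level have every input ready at once.

    nodes: list of node identifiers
    edges: list of (src, dst) pairs describing the dataflow
    """
    preds = {n: set() for n in nodes}
    for src, dst in edges:
        preds[dst].add(src)

    done, levels = set(), []
    while len(done) < len(nodes):
        # A node is ready when all of its predecessors have already produced results.
        level = sorted(n for n in nodes if n not in done and preds[n] <= done)
        if not level:
            raise ValueError("cycle detected: not a directed acyclic graph")
        levels.append(level)
        done.update(level)
    return levels
```

For nodes a, b, c, d with edges a→c, b→c, and c→d, nodes a and b fall into the same level and could execute in parallel, while c and d must wait for their inputs.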

[0104] To address the shortcomings of existing technologies, this application proposes integrating a graph computing architecture (data flow + control flow) into a general-purpose processor architecture, where it functions as an execution unit within a processor core (the Graph Flow Unit (GFU) in this application), executing computational tasks synchronously or asynchronously with other execution units. Furthermore, this application designs functions for the processor to execute general-purpose computational functions and control the operation of the GFU based on the control flow architecture of the general-purpose processor, and designs the computational functions within the GFU based on a (data flow + control flow) architecture suitable for graph computing. That is, the general-purpose computational task portion still uses a control flow approach, while the graph computational task portion (e.g., hot loops and hot instruction sequences) uses a (data flow + control flow) approach, thereby achieving the function of accelerating the operation of a general-purpose processor using a graph computing architecture. Because the graph computation flow unit is located within the processor core, it can communicate directly with other functional modules or execution units within the processor core, without needing to communicate through other message channels or memory read / write methods, significantly reducing communication latency. Simultaneously, since the graph computation flow unit is located within the processor core, the processor core can better control and access the graph computation flow unit, thereby controlling its synchronous or asynchronous operation with other hardware units, improving the processor's parallelism and computational efficiency. 
Furthermore, some repetitive instruction sequences can be executed repeatedly within the graph computation architecture (i.e., the graph computation flow unit), reducing both the number of times the processor core fetches instructions from memory and the bandwidth required. This also reduces the overhead of instruction dependency checks, branch prediction, and register access, makes effective use of the computational resources of the graph computation flow unit, and further improves the processor's operating efficiency and performance.

[0105] It should be noted that the graph computations involved in this application all refer to directed graph computations, which will not be elaborated further hereafter.

[0106] The processor architecture provided in this application can schedule instructions suited to the graph computing architecture for execution on the graph computation flow unit in the processor core, and schedule instructions unsuited to the graph computing architecture for execution on the other general-purpose computing units in the processor core. Moreover, the processor can invoke the GFU alone, or invoke the GFU and other execution units concurrently. This addresses the high switching overhead, poor parallelism, and low operating efficiency that arise in existing graph acceleration processor architectures (such as SEED) because the graph accelerator cannot be shared among multiple processes. It enables highly parallel, low-power, energy-efficient processing, thereby improving both performance and energy efficiency.

[0107] Based on the processor architecture provided in this application, an embodiment of the present invention also provides a pipeline structure suitable for that processor architecture. In this pipeline structure, the lifecycle of an instruction may include: instruction fetch pipeline → decode pipeline → scheduling (issue) pipeline → execution pipeline → memory access pipeline → write-back pipeline. That is, this pipeline structure divides the execution of an instruction into at least the following six stages:

[0108] Instruction fetch pipeline: Instruction fetch refers to the process of reading instructions from memory.

[0109] Decoding Pipeline: Instruction decoding refers to the process of translating instructions fetched from memory.

[0110] Scheduling (issue) pipeline: Instruction dispatch and issue read the registers to obtain operands and, based on the instruction type, send the instruction to the corresponding execution unit (EU) for execution.

[0111] Execution pipeline: After instruction decoding, the required computation type is known, and the necessary operands have been read from the general-purpose register set. The next step is instruction execution (Instruction Execute) to complete the computation task. Instruction execution refers to the actual computation performed for the instruction. For example, if the instruction is an addition instruction, the operands are added; if it is a subtraction instruction, the subtraction operation is performed; if it is a graph computation, the graph computation operation is performed.

[0112] Memory access pipeline: Memory access refers to the process by which memory access instructions read data from or write data to memory, mainly by executing load / store instructions.

[0113] Write-back pipeline: Write-back refers to the process of writing the result of instruction execution back to the general-purpose register set. For ordinary arithmetic instructions, the result comes from the calculation result in the "execution" stage; for memory read instructions, the result comes from the data read from memory in the "memory access" stage.

[0114] In the above pipelined architecture, each instruction in the processor must undergo the aforementioned operation steps. However, different operation steps of multiple instructions can be executed simultaneously, thus accelerating the overall instruction flow and shortening the program execution time. It is understood that the above processor architecture and pipelined architecture are merely exemplary implementations provided by embodiments of the present invention, and the processor architecture and pipelined architecture in these embodiments include, but are not limited to, the above implementations.

[0115] Based on the processor architecture and pipeline structure described above, this application provides a processor. Refer to Figure 1, which is a schematic diagram of the structure of a processor according to an embodiment of the present invention. The processor 10 can be located in any electronic device, such as a computer, mobile phone, tablet, personal digital assistant, smart wearable device, smart vehicle, or smart home appliance. The processor 10 can be a chip, a chipset, or a circuit board carrying a chip or chipset, any of which can operate under the necessary software drivers. Specifically:

[0116] The processor 10 may include at least one processor core 101, and the processor core 101 may include an instruction scheduling unit 1011, a graph computation flow unit 1012 connected to the instruction scheduling unit 1011, and at least one general-purpose arithmetic unit 1013. The instruction scheduling unit 1011 operates in the issue pipeline stage of the processor core 101 to schedule and distribute instructions to be executed, while the graph computation flow unit 1012 and the at least one general-purpose arithmetic unit 1013 both operate as execution units (EUs, also referred to as functional units, FUs) of the processor 10 in the execution pipeline stage to complete various types of computational tasks. Specifically, through the instruction scheduling unit 1011, the processor 10 can assign the graph computation tasks in the instructions to be executed directly to the graph computation flow unit 1012, so that the graph computation mode accelerates the general-purpose processor, and can schedule the general-purpose computation tasks in the instructions to be executed to the at least one general-purpose arithmetic unit 1013 to realize general-purpose computing functions. Optionally, depending on the computational task, the processor 10 may invoke only the graph computation flow unit 1012, invoke only the at least one general-purpose arithmetic unit 1013, or invoke the graph computation flow unit 1012 and the at least one general-purpose arithmetic unit 1013 in parallel. It is understood that the instruction scheduling unit 1011 can be connected to the graph computation flow unit 1012 and the at least one general-purpose arithmetic unit 1013 via a bus or other means for direct communication. The connections shown in Figure 1 do not limit how these units are connected.

[0117] In one possible implementation, refer to Figure 2, which is a schematic diagram of the structure of another processor provided in an embodiment of the present invention. The processor 10 may include multiple processor cores (Figure 2 takes F processor cores as an example, where F is an integer greater than 1), such as processor core 101, processor core 102, processor core 103... processor core 10F. The processor cores can be homogeneous or heterogeneous; that is, the structures of processor cores (102, 103... 10F) and processor core 101 can be the same or different, which this embodiment of the invention does not specifically limit. Optionally, processor core 101 can serve as the main processing core, and processor cores (102, 103... 10F) can serve as slave processing cores. The main processing core and the (F-1) slave processing cores can be located in one or more chips (ICs). It is understood that the main processing core 101 and the (F-1) slave processing cores can communicate via a bus or other coupling methods, which is not specifically limited here. It should be noted that the pipeline structure can vary depending on the structure of each processor core. Therefore, the pipeline structure referred to in this application is the pipeline structure of processor core 101, and the pipeline structures of the other processor cores are not specifically limited.

[0118] In one possible implementation, refer to Figure 3, which is a schematic diagram of the structure of another processor provided in an embodiment of the present invention. The processor core 101 may further include an instruction fetch unit 1015 and an instruction decoding unit 1016, which operate in the instruction fetch pipeline stage and the decode pipeline stage, respectively, and perform the corresponding instruction fetch and instruction decoding functions. Optionally, as shown in Figure 3, the at least one general-purpose arithmetic unit 1013 may specifically include one or more of the following: a memory read / write unit (LSU) 1013A, a floating-point arithmetic unit (FPU) 1013B, a vector arithmetic unit (SIMD) 1013C, and an arithmetic logic unit (ALU) 1013D. The aforementioned general-purpose arithmetic units (including 1013A, 1013B, 1013C, and 1013D) and the graph computation flow unit 1012 are all connected to the instruction scheduling unit 1011 and operate as execution units (EUs) of the processor in the execution pipeline stage. These execution units each receive the type of instruction that the instruction scheduling unit 1011 schedules to them and execute the kind of computational task that their hardware structure is suited to. Optionally, outside the processor core 101, the processor 10 also includes a memory unit 1017. The memory read / write unit (LSU) reads data from and writes data to the memory unit 1017 during the memory access pipeline stage. Further optionally, the processor core 101 also includes a result write-back unit 1014, which operates during the write-back pipeline stage and is responsible for writing the calculation results of instructions back to the destination register. Optionally, the memory unit 1017 is typically a volatile memory, whose stored contents are lost when power is off; it can also be called main memory or RAM.
This memory unit 1017 can serve as a temporary data storage medium for the operating system or other running programs in the processor 10. For example, the operating system running on the processor 10 retrieves data that needs to be calculated from the memory unit 1017 to the processor core 101 for calculation, and after the calculation is completed, the processor core 101 then sends the result back. The memory unit 1017 may include one or more of the following: dynamic random access memory (DRAM), static random access memory (SRAM), synchronous dynamic random access memory (SDRAM), level 1 cache (L1 cache), level 2 cache (L2 cache), and level 3 cache (L3 cache).

[0119] It should be noted that the functional modules within the processor in Figure 3 can communicate via a bus or other connection methods, and the connections shown in Figure 3 do not limit how they are connected. Each functional module is described further in subsequent embodiments and is not detailed here.

[0120] It is understood that the processor structures in Figure 1, Figure 2, and Figure 3 are merely exemplary implementations provided in the embodiments of the present invention. The processor structure in the embodiments of the present invention includes, but is not limited to, the above implementations.

[0121] Based on the microarchitecture of the processor provided in Figure 1, Figure 2, and Figure 3 of this application, in the embodiments of the present invention the processor 10 may specifically implement the following functions:

[0122] The instruction fetch unit 1015 fetches the target program to be executed from the memory unit 1017; the instruction decoding unit 1016 decodes the target program according to a predetermined instruction format to obtain the decoded instructions to be executed. The instruction scheduling unit 1011 receives the decoded instructions to be executed, which include general-purpose computing instructions and graph computing control instructions. The general-purpose computing instructions are used to instruct the execution of general-purpose computing tasks, and the graph computing control instructions are used to instruct the execution of graph computing tasks. The instruction scheduling unit 1011 sends the general-purpose computing instructions to the at least one general-purpose computing unit 1013 and sends the graph computing control instructions to the graph computing flow unit 1012. The at least one general-purpose computing unit 1013 receives and executes the general-purpose computing instructions to obtain the execution result of the general-purpose computing task. The graph computing flow unit 1012 receives and executes the graph computing control instructions to obtain the execution result of the graph computing task. The at least one general-purpose computing unit 1013 also sends the first execution result of the general-purpose computing task to the result write-back unit 1014; the graph computing flow unit 1012 also sends the second execution result of the graph computing task to the result write-back unit 1014; the result write-back unit 1014 stores the first execution result and the second execution result, and writes some or all of them back to the instruction scheduling unit 1011.
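The routing performed by the instruction scheduling unit can be sketched as a lookup on the decoded opcode. The opcode mnemonics below (`gf_build`, `gf_start`, `fadd`, and so on) are invented for illustration, since the source does not define an instruction encoding, and the per-unit queues are an assumption of this model.

```python
# Hypothetical mnemonics; the source does not specify opcode names.
GRAPH_CONTROL_OPCODES = {"gf_build", "gf_param", "gf_start", "gf_sync", "gf_return"}
GENERAL_OPCODE_TO_UNIT = {
    "add": "ALU", "sub": "ALU",
    "load": "LSU", "store": "LSU",
    "fadd": "FPU",
    "vadd": "SIMD",
}


def dispatch(decoded_instructions):
    """Route each decoded opcode to the queue of the unit that executes it."""
    queues = {"GFU": [], "ALU": [], "LSU": [], "FPU": [], "SIMD": []}
    for opcode in decoded_instructions:
        if opcode in GRAPH_CONTROL_OPCODES:
            # Graph computing control instructions go to the graph computation flow unit.
            queues["GFU"].append(opcode)
        elif opcode in GENERAL_OPCODE_TO_UNIT:
            # General-purpose instructions go to the matching general-purpose unit.
            queues[GENERAL_OPCODE_TO_UNIT[opcode]].append(opcode)
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return queues
```

Because each unit has its own queue in this model, graph control instructions and general-purpose instructions from the same program can be in flight at the same time.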

[0123] First, based on the structure and function of the processor 10 described above, the compilation and execution process of the target program involved in this application is explained. Please refer to Figure 4, which is a schematic diagram of the overall process of source code synthesis, compilation, and execution provided in an embodiment of the present invention.

[0124] 1. Provide source code written in a high-level language, such as source code written by developers in various programming languages (such as C or Java).

[0125] 2. Based on the cost estimation model, determine which parts of the source code are suitable for compilation using the general operation mode and which parts are suitable for compilation using the graph computation flow mode. Then, compile the source code into either a general operation object file or a graph computation flow object file (both in binary format) according to the different compilation modes. For example, an application (APP) may have millions of instructions, and these instructions actually have input and output relationships. For instance, if the input condition for executing one instruction is the output result of another instruction, then these two instructions can constitute the basic elements (vertices and edges) in graph computation. Therefore, during the source code compilation stage, based on the cost estimation model, complex instruction sequences (such as those with complex relationships, indirect jumps, or interruptions) or instruction sequences used only once can be compiled using the general operation mode; while for instruction sequences suitable for repeated execution, such as loops or repeatedly called functions (whose relationships can be complex or simple, but are usually required to be executed repeatedly), the graph computation flow mode can be used for compilation. Graph computation flow compilation refers to abstracting the logic between code into a graph architecture. Operations such as checks, jumps, and predictions, which were originally performed by the processor, are all generated as binary machine instructions within the graph architecture during the program compilation stage (i.e., through a graph computation flow compiler). 
Because these instructions within the graph architecture contain the input and output relationships between various instructions, the logical judgments between instructions can be greatly reduced during actual operation in the processor, significantly saving CPU core overhead, resulting in better performance and lower power consumption.

[0126] 3. The compiled general-purpose computational object files and graph computational flow object files are linked together by the linker into a synthesized program (executable file). For example, the object files are .o files, etc. When the program is to be executed, it still needs to be linked. During the linking process, the above object files (such as .o files) are mainly linked with libraries to create an executable file. It can be understood that the compilation stages corresponding to 1, 2, and 3 above can be completed on a device other than the device where processor 10 is located (such as a server, compiler, etc.), or they can be pre-compiled on the device where processor 10 is located, or they can be compiled and executed on the device where processor 10 is located. No specific limitation is made here.

[0127] 4. After the executable file is executed on the processor 10, the processor 10 will load the target program to be executed (such as code segment, data segment, BSS segment or stack, etc.) in the executable file into memory unit 1017 through a series of instruction loading, instruction prefetching, instruction predecoding and instruction prediction operations.

[0128] 5. The instruction fetching unit 1015 can fetch the target program from the memory unit 1017 in a continuous manner, fetching one instruction at a time. Then, each instruction enters the instruction decoding unit 1016 from the instruction fetching unit 1015 for decoding.

[0129] 6. The instruction decoding unit 1016 will split and interpret the instruction to be executed according to the predetermined instruction format to further obtain the micro-operation instruction, that is, the decoded instruction to be executed in this application, and send it to the instruction scheduling unit 1011.

[0130] 7. After receiving the decoded instructions to be executed, the instruction scheduling unit 1011 distributes them to various execution units for computation according to their type. For example, it may schedule them to the general-purpose arithmetic unit 1013 or the graph computation stream unit 1012 for computation. Since the graph computation stream unit 1012 is located in the processor core 101 of the processor 10, the instruction scheduling unit 1011 can directly connect and communicate with the graph computation stream unit 1012, thereby directly scheduling the identified graph computation control instructions to the graph computation stream unit 1012 without communicating through other message channels or memory read / write methods, greatly reducing communication latency. In one possible implementation, the general-purpose computation instructions and graph computation control instructions in this application can be identified by different flag bits (which can be added during the compilation stage), that is, different types of instructions can correspond to different instruction IDs, so that the instruction scheduling unit 1011 can identify them according to the instruction ID.
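
The type-based distribution in step 7 can be illustrated with a minimal software sketch. It assumes that the flag bit added at compile time is modelled as an enumeration value; the names `InstrKind`, `DecodedInstr`, and `dispatch`, as well as the general-purpose instruction `add x2, x0, x1`, are hypothetical and are not defined in this application.

```python
from dataclasses import dataclass
from enum import Enum


class InstrKind(Enum):
    GENERAL = 0  # general-purpose computation instruction
    GRAPH = 1    # graph computation control instruction


@dataclass(frozen=True)
class DecodedInstr:
    text: str
    kind: InstrKind  # stands in for the flag bit / instruction ID (assumption)


def dispatch(instrs):
    """Route each decoded instruction to a per-unit queue according to its kind."""
    queues = {InstrKind.GENERAL: [], InstrKind.GRAPH: []}
    for ins in instrs:
        queues[ins.kind].append(ins.text)
    return queues


program = [
    DecodedInstr("add x2, x0, x1", InstrKind.GENERAL),  # hypothetical instruction
    DecodedInstr("gfb 0x600960", InstrKind.GRAPH),
    DecodedInstr("gfmov x0, 1r", InstrKind.GRAPH),
]
queues = dispatch(program)
```

In this sketch the graph queue stands for the direct connection between the instruction scheduling unit 1011 and the graph computation flow unit 1012; no message channel or memory access is modelled.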

[0131] 8. The graph computation flow unit 1012 receives and executes graph computation control instructions to obtain the execution result of the graph computation task; one or more general-purpose arithmetic units 1013 receive and execute general-purpose computation instructions to obtain the execution result of the general-purpose computation task. Optionally, the graph computation flow unit 1012 and the general-purpose arithmetic unit 1013 can execute instructions in parallel or serially, depending on the logical relationship between the instructions executed by these execution units in the target program. This embodiment of the invention does not specifically limit this.

[0132] 9. Finally, both the graph computation flow unit 1012 and the general-purpose arithmetic unit 1013 can send their calculation results to the result write-back unit 1014, and the result write-back unit 1014 can feed back part or all of the calculation results to the instruction scheduling unit 1011, for example, as parameters of instructions subsequently scheduled by the instruction scheduling unit 1011. Optionally, the first execution result or the second execution result can be written to the memory unit 1017 directly, or written to the memory unit 1017 through the memory read / write unit 1013A, so that the relevant execution units (such as the graph computation flow unit 1012 or the memory read / write unit 1013A) can obtain the required parameters from the corresponding storage location. Since the graph computation flow unit 1012 is located in the processor core 101 of the processor 10, the processor core 101 has the authority and conditions to obtain the relevant calculation status (such as the first execution result and the second execution result mentioned above) of the graph computation flow unit 1012 and the other general-purpose arithmetic units 1013, and can control whether the graph computation flow unit 1012 runs synchronously or asynchronously with the other arithmetic units, thereby improving the processor's operating efficiency.

[0133] In summary, the graph computation flow unit 1012, like other general-purpose arithmetic units, receives graph liveIn data from the instruction scheduling unit 1011 (including instruction issue and reservation stations) and passes this input to the corresponding computation node of the graph computation flow unit 1012. Similarly, the graph computation flow unit 1012 also writes the graph liveOut output data back to the result write-back unit 1014 (including registers and reorder buffers), thereby writing the graph output to the corresponding registers and instruction reservation stations that depend on the graph output.

[0134] Next, the computational model used by the graph computation flow unit when performing graph computation in the above execution phase is further explained. Please refer to Figure 5, which is a schematic diagram of a computational model of a graph computation flow unit provided in an embodiment of the present invention.

[0135] The theoretical computational model of Graphflow in this application can be abstracted into N fully connected computation nodes (corresponding to vertices of the graph). Each node can contain one instruction, perform one operation, and can pass the result to itself or other nodes. The Graphflow theoretical computational model can be divided into two iteratively switching stages:

[0136] 1. Graph Build Phase: N instructions of a graph block are read from instruction memory (1-a in Figure 5), and each node in the graph block (1-b in Figure 5) is configured with one operation instruction and at most two target nodes. Assuming N equals 16, there are 16 computation nodes in 1-b: 0, 1, 2, 3, 4 ... 15. Once the graph is built (1-b in Figure 5), the operation and connections of each node are fixed (read-only). For example, the operation instruction in computation node 0 is the `add` instruction, which performs addition; the operation instruction in computation node 2 is the `sll` instruction, which performs a shift operation; and the operation instruction in computation node 3 is the `xor` instruction, which performs an XOR operation. Computation node 5 takes the results of computation node 1 and computation node 2 as its inputs; computation node 6 takes the results of computation node 2 and computation node 3 as its inputs, and so on. The operations of the other computation nodes are not described in detail.

[0137] 2. Execution Phase (Graph Execute): An external module supplies the inputs (LiveIn) to start the data flow. All computation nodes run in parallel. Each node (1-d in Figure 5) performs its operation as soon as its inputs arrive and passes the result to the next computation node; a node whose inputs have not arrived stays idle. Execution continues until the data flow reaches the end node (tm). Some computation nodes (such as computation nodes 0, 1, 2, and 3) obtain their input parameters externally, that is, start data is input from the external memory unit 1017 (1-e in Figure 5); the other computation nodes (such as computation nodes 5, 6, 8, 9, 10, 11, 12, 13, 14, and 15) must first obtain, from inside the unit, the results output by the computation nodes connected to them, after which they perform their own operations and pass the results to the computation nodes associated with them.
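
The two phases above can be illustrated with a minimal software sketch. It assumes a small hypothetical three-node graph rather than the 16-node graph of Figure 5, and it assumes that every node needs exactly a left and a right input. The names `GRAPH`, `execute`, and `live_in` are illustrative only, and the sequential loop merely imitates nodes that run in parallel in hardware.

```python
import operator

# Build phase: each node gets one operation and at most two targets.
# A target is (destination node, input port), with port "l" or "r".
# This graph is hypothetical; it is not the graph of Figure 5.
GRAPH = {
    0: (operator.add, [(2, "l")]),
    1: (operator.xor, [(2, "r")]),
    2: (operator.lshift, []),  # no targets: treated here as the end node
}


def execute(graph, live_in):
    """Fire every node whose two inputs have arrived until no node is ready."""
    inputs = {node: {} for node in graph}
    for (node, port), value in live_in.items():
        inputs[node][port] = value
    results = {}
    progress = True
    while progress:
        progress = False
        for node, (op, targets) in graph.items():
            if node not in results and len(inputs[node]) == 2:
                results[node] = op(inputs[node]["l"], inputs[node]["r"])
                for dst, port in targets:
                    inputs[dst][port] = results[node]  # pass result downstream
                progress = True
    return results


# LiveIn data supplied from outside the graph (hypothetical values).
live_in = {(0, "l"): 1, (0, "r"): 2, (1, "l"): 6, (1, "r"): 5}
results = execute(GRAPH, live_in)
```

Here nodes 0 and 1 fire on the LiveIn data, and node 2 fires only after both of its inputs have been produced inside the graph.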

[0138] Based on the graph computation flow model provided in this application, when the instruction scheduling unit 1011 in the processor 10 schedules graph computation control instructions to the controller in the graph computation flow unit 1012 to execute graph computation tasks, it includes various control instructions with different functions, thereby instructing the graph computation flow unit 1012 to execute the corresponding graph computation function. In terms of timing, the graph computation control instructions provided in this application mainly include: graph construction start instruction → parameter passing instruction → graph computation start instruction → parameter return instruction. The features and functions of the above instructions are described in detail below:

[0139] In one possible implementation, the processor 10 further includes a memory unit 1017; the graph computation flow unit 1012 includes N computation nodes; the graph computation control instruction includes a graph construction start instruction, which carries a target address in the memory unit 1017; the graph computation flow unit 1012 receives the graph construction start instruction and reads graph block information from the memory unit 1017 according to the target address, the graph block information including the operation method of each of the N computation nodes and the connection and order information between the N computation nodes. In this embodiment of the invention, if the graph computation control instruction received by the graph computation flow unit is a graph construction start instruction, the instruction instructs the graph computation flow unit to read the graph block information stored in the memory unit 1017, which is outside the processor core 101, according to the target address carried in the instruction. The graph block information includes the operation method corresponding to each of the N computation nodes in the graph computation flow unit and the dependency relationships between the N computation nodes, that is, the relationship between the computation results and input conditions of related computation nodes (the two computation nodes joined by an edge in the graph computation). This corresponds to the N fixed-flow instructions in the graph computation model of Figure 5. Based on the graph block information, the graph computation flow unit 1012 can complete the computation of one complete graph block. It should be noted that a graph block may be one of several graph blocks in the graph computation or the only one, that is, a complete graph computation task may consist of one graph block or be split into multiple graph blocks.

[0140] For example, as shown in Figure 6, which is a schematic diagram of graph computation flow control instructions provided in an embodiment of the present invention, the graph construction start instruction is gfb 0x600960, where gfb is the opcode and the operand 0x600960 is an address in memory unit 1017. According to the graph construction start instruction, graph computation flow unit 1012 obtains the graph block information at address 0x600960 from memory unit 1017 and starts graph construction. That is, instruction scheduling unit 1011 sends graph computation flow unit 1012 a pointer to the address of the graph computation instructions to be executed, and graph computation flow unit 1012 reads the graph block information from memory unit 1017 at that address.
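
The handling of the graph construction start instruction can be illustrated with a minimal software sketch. It assumes that the memory unit is modelled as a dictionary keyed by address and that the graph block information is a table of per-node configurations; `NodeConfig`, `MEMORY`, `execute_gfb`, and the block contents are hypothetical and do not reflect the actual binary layout, which this application does not specify here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeConfig:
    op: str          # operation of the node, e.g. "add"
    targets: tuple   # up to two (destination node, port) pairs


# Hypothetical memory contents: address -> graph block information.
MEMORY = {
    0x600960: {
        0: NodeConfig("add", ((2, "l"),)),
        1: NodeConfig("xor", ((2, "r"),)),
        2: NodeConfig("sll", ()),
    }
}


def execute_gfb(instruction, memory):
    """Decode 'gfb <address>' and read the graph block stored at that address."""
    opcode, operand = instruction.split()
    if opcode != "gfb":
        raise ValueError(f"unexpected opcode: {opcode}")
    address = int(operand, 16)
    block = memory[address]
    if any(len(cfg.targets) > 2 for cfg in block.values()):
        raise ValueError("a node may have at most two target nodes")
    return address, block


address, block = execute_gfb("gfb 0x600960", MEMORY)
```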

[0141] In one possible implementation, the graph computation control instruction includes a parameter passing instruction, which carries the identifiers of M computation nodes and the input parameters corresponding to the identifiers of the M computation nodes, wherein the M computation nodes are some or all of the N nodes; the graph computation flow unit is used to receive the parameter passing instruction and input the input parameters corresponding to the identifiers of the M computation nodes to the M computation nodes respectively.

[0142] For example, as shown in Figure 6, the parameter passing instructions include `gfmov x0, 1r`, which means using the parameter value in register x0 as the right input parameter of computation node 1, and `gfmov x1, 10l`, which means using the parameter value in register x1 as the left input parameter of computation node 10; the remaining instructions are not listed here. In this embodiment of the invention, the graph computation control instructions received by the graph computation flow unit include parameter passing instructions, which carry the initial input parameters required by multiple computation nodes during the computation of one graph block (for example, computation nodes 0, 1, 2, and 3 in Figure 5 above). Once these computation nodes have obtained their parameters from outside the graph computation flow unit, the graph computation flow unit meets the conditions to start executing the graph computation task, that is, it can start graph computation.
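
The parameter passing form shown above can be illustrated with a minimal software sketch. It assumes the textual form `gfmov <register>, <node><port>` exactly as in the two examples; the function names and the register contents are hypothetical, and the parameter return form of gfmov is not modelled.

```python
import re

# Matches the form shown in the text, e.g. "gfmov x0, 1r".
_GFMOV = re.compile(r"gfmov\s+(x\d+)\s*,\s*(\d+)([lr])")


def parse_gfmov(text):
    """Return (source register, node id, input port) for a gfmov line."""
    match = _GFMOV.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a parameter passing instruction: {text!r}")
    reg, node, port = match.groups()
    return reg, int(node), port


def pass_parameters(lines, registers):
    """Place register values on the node inputs named by each instruction."""
    node_inputs = {}
    for line in lines:
        reg, node, port = parse_gfmov(line)
        node_inputs[(node, port)] = registers[reg]
    return node_inputs


registers = {"x0": 7, "x1": 9}  # hypothetical register contents
node_inputs = pass_parameters(["gfmov x0, 1r", "gfmov x1, 10l"], registers)
```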

[0143] In one possible implementation, the graph computation control instruction includes a graph computation start instruction; the graph computation flow unit 1012, upon receiving the graph computation start instruction, determines whether the current graph construction has been completed; if completed, it starts executing the graph computation task. Specifically, in one possible implementation, after receiving the graph computation start instruction, the graph computation flow unit 1012 checks whether the graph block information read by the graph computation flow unit is consistent with the pre-started graph block address, and determines whether the input parameters in the M computation nodes have been input; if consistent and input completed, it starts executing the graph computation task.
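
The two checks performed on receipt of the graph computation start instruction can be illustrated with a minimal software sketch. It assumes that the state of the graph computation flow unit is reduced to the address of the graph block already built and two sets of node inputs; `GfuState` and `can_start` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GfuState:
    built_address: Optional[int] = None  # address of the graph block already built
    required_inputs: set = field(default_factory=set)
    received_inputs: set = field(default_factory=set)


def can_start(state, requested_address):
    """True if the built block matches and every required input has arrived."""
    address_ok = state.built_address == requested_address
    inputs_ok = state.required_inputs <= state.received_inputs
    return address_ok and inputs_ok


state = GfuState(built_address=0x600960, required_inputs={(1, "r"), (10, "l")})
before = can_start(state, 0x600960)    # inputs still missing
state.received_inputs |= {(1, "r"), (10, "l")}
after = can_start(state, 0x600960)     # block matches, inputs complete
mismatch = can_start(state, 0x600000)  # a different graph block address
```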

[0144] Furthermore, the processor 10 controls the graph computation flow unit 1012 to begin executing the graph computation task through the aforementioned startup graph computation instruction, specifically including the following two control methods:

[0145] Method 1: Synchronously start parallel graph computation

[0146] After receiving the start graph computation instruction, the graph computation flow unit 1012 determines whether the current graph construction has been completed. If it has, the graph computation task is started. Further, after the graph computation flow unit 1012 receives the start graph computation instruction but before the graph computation task is completed, the instruction scheduling unit 1011 controls the processor core 101 to enter a blocked state, and after the graph computation flow unit 1012 completes the graph computation task, it controls the processor core 101 to exit the blocked state.

[0147] Specifically, processor 10 can initiate the execution phase of graph flow unit 1012 via the gfe (graph flow execute) instruction. If the graph flow unit 1012 has not completed its graph construction, gfe will wait until the graph construction is complete before initiating the execution of graph flow unit 1012. During the execution phase of graph flow unit 1012, other units of processor core 101 are in a power-gate phase and do not perform other operations; the only running units are the interrupt and exception units of processor core 101. Therefore, processor core 101 enters a blocking state after executing gfe. If there is a graph construction error or an execution error, gfe will generate a corresponding exception. CPU instructions following gfe, including the parameter return instruction gfmov, can only continue to execute after the graph flow unit 1012 has finished executing.

[0148] For example, as shown in Figure 6, in the graph computation execution phase, the start graph computation instruction gflow <GBB_address> triggers graph computation flow unit 1012 to check whether the GFU graph construction is complete and whether the previously pre-started graph block address <GBB_address> matches the graph block to be executed. If they do not match, the graph construction unit needs to be restarted to rebuild the graph. If they match, graph computation can be initiated immediately. Optionally, this instruction can block the pipeline of processor core 101 until the entire graph computation is complete, thus preventing other arithmetic units of processor core 101 from executing instructions following the gflow instruction. This instruction can be used to switch between general-purpose computation mode and graph computation mode. It can also be used to let the processor use only the GFU for computation in order to reduce power consumption. During graph computation, data and control flow through the graph as defined by the program. When the graph flow reaches the end node gfterm, the graph computation ends. gfterm then causes the processor instruction gflow to enter the commit phase and restarts the pipeline of processor core 101.

[0149] In this embodiment of the invention, the processor core can synchronously initiate graph computation functionality (i.e., tasks can be executed serially between the graph computation stream unit and other general-purpose computing units). Specifically, while the graph computation stream unit is executing a graph computation task, the processor core's pipeline is blocked until the graph computation stream unit completes its task, thus ensuring that only the graph computation stream unit is operating during this period, while other computing units cannot operate, thereby reducing CPU power consumption. This instruction can switch the computation mode between other computing units within the processor and the graph computation stream unit, and can be applied to programs with synchronous computation.

[0150] Method 2: Asynchronous Start of Parallel Graph Computation

[0151] After receiving the start graph computation instruction, the graph computation flow unit 1012 determines whether the current graph construction has been completed. If it has, it starts executing the graph computation task. Further, the instruction scheduling unit 1011 sends a synchronization execution result instruction to the graph computation flow unit 1012. After the graph computation flow unit 1012 receives the synchronization execution result instruction but before completing the graph computation task, it controls the processor core 101 to enter a blocked state. After the graph computation flow unit 1012 completes the graph computation task, it controls the processor core 101 to exit the blocked state.

[0152] Specifically, processor 10 can initiate the execution phase of asynchronous graph flow unit 1012 via the gff (graph flow fork) instruction. If the graph construction of graph flow unit 1012 is not completed, gff will wait for the graph construction to be completed before initiating the execution of graph flow unit 1012. While gff initiates the execution of graph flow unit 1012, other arithmetic units of processor core 101 can perform other operations, so gff does not occupy resources in ROB. After asynchronous execution, the processor uses the gfj (graph flow join) instruction to synchronize the execution result of graph flow unit 1012. Only after the Graphflow execution is completed can CPU instructions after gfj continue to be executed, including the return parameter instruction gfmov.

[0153] For example, embodiments of the present invention add two new CPU instructions to the instruction set to start parallel operation of the GFU and the other arithmetic units of processor core 101: the instruction gffork <GBB_address> and the instruction gfjoin <GBB_address>. The `gffork` instruction first checks whether the GFU graph construction is complete and whether the previously pre-started graph block address <GBB_address> matches the graph block to be executed. If they do not match, the graph construction unit needs to be restarted to rebuild the graph. If they match, the `gffork` instruction can immediately initiate the graph computation. The `gffork` instruction does not block the CPU pipeline, so other CPU modules can execute asynchronously with the graph computation. `gfjoin` is executed before CPU instructions that require the results of the graph computation. If the graph computation has already completed, `gfjoin` returns immediately. If the graph computation is still not complete, `gfjoin` blocks the CPU pipeline until the graph computation is finished.

[0154] In this embodiment of the invention, the processor core can start the graph computation function asynchronously (i.e., the graph computation flow unit and other general-purpose arithmetic units can execute tasks in parallel). Specifically, while the graph computation flow unit is executing its graph computation task, the processor core's pipeline is not blocked, and other arithmetic units can operate normally. The pipeline stays unblocked until the processor sends a synchronization execution result instruction to the graph computation flow unit via the instruction scheduling unit (e.g., when other arithmetic units require the execution result of the graph computation flow unit). If the graph computation flow unit has not yet completed its graph computation task at that point, the processor's pipeline is blocked until the graph computation flow unit completes the task and provides the execution result. This ensures that arithmetic units that need the execution result of the graph computation flow unit wait for it before continuing, while the units otherwise run in parallel, thereby improving the parallelism of the processor core. This instruction can implement a parallel computation mode between the graph computation flow unit and the other arithmetic units in the processor, and can be applied to programs with asynchronous computation.
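
The difference between the synchronous and asynchronous start methods can be illustrated with a software analogy. The sketch below assumes that the graph computation task is a function run on a worker thread; it is an analogy for the blocking behaviour only, not a model of the pipeline, and `graph_task`, `run_sync`, and `run_async` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def graph_task(x):
    """Stand-in for a graph computation task on the graph computation flow unit."""
    time.sleep(0.01)
    return x * x


def run_sync(x):
    # Synchronous start: the caller is blocked until the graph task finishes,
    # and only then does the following work run.
    result = graph_task(x)
    other = sum(range(10))
    return result, other


def run_async(x):
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(graph_task, x)  # fork-like: start without blocking
        other = sum(range(10))               # other units keep working meanwhile
        result = future.result()             # join-like: block until the result exists
    return result, other


sync_result, _ = run_sync(4)
async_result, other_work = run_async(4)
```

In the asynchronous variant the join returns immediately if the task has already finished and blocks otherwise, which mirrors the behaviour described for the synchronization execution result instruction.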

[0155] In addition to controlling the graph computation flow unit 1012 to start graph computation through graph computation control instructions as described above, this embodiment of the invention also provides an implementation in which the start of graph computation is triggered by the graph computation flow unit 1012's own judgment. Specifically, the graph block information includes the operation method of each of the N computation nodes and the connection and order information between the N computation nodes, and the connection and order information includes the source node and destination node of each of L edges. The graph computation flow unit 1012 monitors whether the input parameters required by each of the N computation nodes are ready; for a target computation node whose input parameters are ready, it inputs those parameters to the operation method of the target computation node to obtain a calculation result; and, according to the source node and destination node of each of the L edges, it inputs the calculation result of the source node of each edge to the corresponding destination node as an input parameter. A graph includes multiple nodes and the edges connecting them, and each edge consists of a source node, a destination node, and the association between the two. The graph computing architecture in this application therefore abstracts data flow and control flow programs into a graph consisting of N nodes, where a connection between nodes represents a data flow or a control flow. Each node serves as a graph instruction. Once the inputs required by a graph instruction are ready, the instruction can perform its computation and pass the result to the corresponding input of the next instruction.

[0156] For example, as shown in Figure 7, which is a schematic diagram of an abstract model of the computation nodes in a graph block provided by an embodiment of the present invention, it is assumed that each graph instruction requires a left input (l), a right input (r), and a conditional input (p). Once the inputs required by an instruction are ready, the computation can be performed, and the result is passed to the input of the corresponding downstream node. For example, after the operation a+b in instruction 1 is completed, the result can be passed to the left input of instruction 4. In the graph architecture instruction set, this can be written as "1add 4l", meaning that once the inputs of instruction 1 are ready, its result is passed to the left input of instruction 4. An instruction in the graph architecture instruction set only needs to specify its output address and does not need to specify where its inputs come from; it is only necessary to ensure that each input of each instruction is supplied by one or more instructions. Therefore, the graph architecture instruction set encoding in this application makes the graph computation process simpler and faster.

[0157] As can be seen from Figure 7, the parallelism between instructions is evident. Instructions without dependencies can naturally run concurrently. For example, instructions 0, 1, and 2 can run in one cycle, and instructions 3 and 4 in another. In the hardware implementation (i.e., the graph computation flow unit), dependency checking only requires checking the ready and valid fields of each input. Compared with superscalar processors in the prior art, the graph computation process in this application does not require a large amount of hardware to check the dependencies between registers.
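
The grouping of independent instructions into cycles can be illustrated with a minimal software sketch. It assumes a hypothetical edge list for five instructions, which is not taken from Figure 7, and it replaces the ready and valid fields with a count of outstanding inputs per instruction; `group_into_cycles` is an illustrative name.

```python
def group_into_cycles(num_instrs, edges):
    """Group instructions into cycles; each runs once all its producers have run."""
    pending = {i: 0 for i in range(num_instrs)}      # inputs not yet ready
    consumers = {i: [] for i in range(num_instrs)}
    for src, dst in edges:
        pending[dst] += 1
        consumers[src].append(dst)
    ready = [i for i in range(num_instrs) if pending[i] == 0]
    schedule = []
    while ready:
        schedule.append(ready)
        next_ready = []
        for src in ready:
            for dst in consumers[src]:
                pending[dst] -= 1            # one more input of dst is ready
                if pending[dst] == 0:
                    next_ready.append(dst)
        ready = sorted(next_ready)
    return schedule


# Hypothetical dependencies: (producer, consumer) pairs.
edges = [(0, 3), (1, 3), (1, 4), (2, 4)]
schedule = group_into_cycles(5, edges)
```

Only a per-input readiness count is inspected here; no comparison of register names between instructions is needed.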

[0158] Optionally, assuming a graph consists of N nodes, the ideal hardware for executing this graph would be for each node to have a processing unit, namely the computing node (Process Engine, PE) in this application, and to pass the result to the corresponding next-level computing node in the next iteration via an ideal N-to-N shared bus (Crossbar). However, when N is very large, such an N-to-N Crossbar is difficult to implement. Therefore, in a real-world hardware design, in one possible implementation, this embodiment defines that P instructions share X computing nodes. That is, in each iteration, a computing node selects at most X instructions from the P instructions (instructions that must be ready) for simultaneous computation.
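
The sharing of X computation nodes among P instructions can be illustrated with a minimal software sketch. It assumes that ready instructions are selected in ascending order of instruction number, which is a selection policy chosen only for illustration, and it reuses a hypothetical edge list; `run_with_shared_nodes` is an illustrative name.

```python
def run_with_shared_nodes(num_instrs, edges, x):
    """Each iteration issues at most x ready instructions (lowest number first)."""
    if x < 1:
        raise ValueError("x must be at least 1")
    pending = [0] * num_instrs
    consumers = [[] for _ in range(num_instrs)]
    for src, dst in edges:
        pending[dst] += 1
        consumers[src].append(dst)
    done = set()
    iterations = []
    while len(done) < num_instrs:
        ready = [i for i in range(num_instrs)
                 if i not in done and pending[i] == 0]
        if not ready:
            raise ValueError("dependency cycle: no instruction is ready")
        issued = ready[:x]                   # only x computation nodes available
        for src in issued:
            done.add(src)
            for dst in consumers[src]:
                pending[dst] -= 1
        iterations.append(issued)
    return iterations


# Hypothetical dependencies: (producer, consumer) pairs.
edges = [(0, 3), (1, 3), (1, 4), (2, 4)]
iterations = run_with_shared_nodes(5, edges, x=2)
```

With X equal to 2, instruction 2 is ready in the first iteration but has to wait for a free computation node, so the five instructions take three iterations instead of two.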

[0159] In this embodiment of the invention, for each computation node in the graph computation flow unit, as long as the computation method for each computation node has been loaded and the input parameters have been obtained, the computation node can begin graph computation. Some computation nodes (such as the source node corresponding to an edge) obtain their initial input parameters from outside the graph computation flow unit, while other computation nodes (such as the destination node corresponding to an edge) may need to wait for the computation of their related computation nodes (such as the source node) to complete before using their computation results as their input parameters to begin graph computation. Therefore, the computation start time for each computation node may be inconsistent, but for each computation node, computation can begin once the computation method and input parameters (which may include left input parameters, right input parameters, or conditional parameters) are prepared.

[0160] In one possible implementation, the processor core 101 further includes a result write-back unit; the graph computation flow unit 1012 and the at least one general-purpose arithmetic unit are respectively connected to the result write-back unit 1014; the graph computation control instruction includes a return parameter instruction, which carries the identifiers of K computation nodes and the result registers corresponding to the identifiers of the K computation nodes; the graph computation flow unit 1012 is specifically used to control the sending of the computation results of the K computation nodes to the result write-back unit 1014. In this embodiment of the invention, for the N computation nodes of the graph computation flow unit, some computation nodes may need to output their computation results to the result write-back unit outside the graph computation flow unit after the final computation is completed. That is, the graph computation flow unit can control the output of the final computation results of the K computation nodes as the computation results of the entire graph block, based on the identifiers of the K computation nodes carried in the return parameter instruction of the received graph computation control instruction, so as to facilitate further computation by the subsequent execution unit.

[0161] Optionally, the result write-back unit 1014 specifically includes a reorder buffer, used to store the instruction execution order before out-of-order execution. After the instruction set is executed out of order, the results are committed according to the original instruction order. Further optionally, the result write-back unit 1014 also includes a register set, such as general-purpose registers and special-purpose registers. The general-purpose register set is used to store operands and intermediate results participating in the operation; while special-purpose registers are usually status registers that cannot be changed by the program and are controlled by the processor itself to indicate a certain state.
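
The in-order commit performed by a reorder buffer can be illustrated with a minimal software sketch. It assumes that each instruction is identified by a tag and that a result of `None` means "not yet finished"; the class `ReorderBuffer` and its method names are hypothetical.

```python
from collections import OrderedDict


class ReorderBuffer:
    """Records program order at issue and commits results in that order."""

    def __init__(self):
        self._entries = OrderedDict()  # tag -> result, None while unfinished

    def issue(self, tag):
        self._entries[tag] = None

    def complete(self, tag, result):
        self._entries[tag] = result    # updating a key keeps its position

    def commit(self):
        """Pop finished entries from the head; stop at the first unfinished one."""
        committed = []
        while self._entries:
            tag, result = next(iter(self._entries.items()))
            if result is None:
                break
            self._entries.popitem(last=False)
            committed.append((tag, result))
        return committed


rob = ReorderBuffer()
for tag in ("i0", "i1", "i2"):
    rob.issue(tag)
rob.complete("i2", 30)   # finishes first, out of order
first = rob.commit()     # head i0 is unfinished, so nothing commits
rob.complete("i0", 10)
rob.complete("i1", 20)
second = rob.commit()    # all three commit in original order
```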

[0162] The general-purpose arithmetic unit 1013 in the processor 10 of this application may include various types of hardware execution units to perform or accelerate different types of computing tasks. It may mainly include one or more of the following: a memory read / write unit 1013A (LSU), a floating-point arithmetic unit 1013B (FPU), a vector arithmetic unit 1013C (SIMD), and an arithmetic logic unit 1013D (ALU). The features and functions of the above-mentioned general-purpose arithmetic unit are described in detail below:

[0163] In one possible implementation, the general-purpose arithmetic instructions include general-purpose arithmetic logic instructions or memory read / write instructions; the at least one general-purpose arithmetic unit includes: an arithmetic logic unit 1013D (ALU) for receiving general-purpose arithmetic logic instructions sent by the instruction scheduling unit 1011 and performing logical operations; or a memory read / write unit 1013A (LSU) for receiving memory read / write instructions sent by the instruction scheduling unit 1011 and performing memory read / write operations.

[0164] The arithmetic logic unit (ALU) primarily performs fixed-point arithmetic operations (addition, subtraction, multiplication, and division), logical operations (AND, OR, NOT, and XOR), and shift and rotate operations (such as ASL and ROL) on binary data. The ALU influences various operations within the processor, including compression and decompression, process scheduling, compiler syntax analysis, computer-aided circuit design, and game AI processing.

[0165] The Load / Store Unit (LSU) is used to calculate addresses. Instructions that access memory (generally load / store) typically include the address of the memory they wish to use. The LSU processes these instructions and calculates the address carried in the instruction. Using a dedicated LSU to calculate the address of memory access instructions allows that LSU to execute instructions in parallel with other execution units, improving the execution efficiency of memory access instructions and enhancing processor performance.

[0166] In this embodiment of the invention, the at least one arithmetic unit may further include an arithmetic logic unit 1013D and a memory read / write unit. The arithmetic logic unit is mainly used for input-related logical operations, while the memory read / write unit is used to execute data read / write operation instructions. That is, both of these units are in the execution pipeline stage with the graph computation flow unit, jointly completing various types of computation tasks after decoding in the CPU. They can be executed in parallel, serially, or partially in parallel and partially serially, so as to complete the processor's computation tasks more efficiently. This embodiment of the invention embeds the directed graph flow computation architecture (Graphflow) into a module of a superscalar processor and reuses the existing arithmetic units in the superscalar processor core to achieve better performance and lower power consumption.

[0167] In one possible implementation, the graph computation control instructions include data read / write instructions, which carry read / write addresses in the memory read / write unit 1013A. The graph computation flow unit 1012 is further configured to: read data from or write data to the memory read / write unit 1013A (LSU) according to the memory read / write addresses in the data read / write instructions. For example, the graph computation flow unit 1012 can read the instructions, parameters, etc. required for graph computation from the memory read / write unit 1013A (LSU) through relevant load or store instructions, or write the execution results of graph computation to the memory read / write unit 1013A (LSU). Different operations can be performed depending on the specific instruction content in the target program. It can be understood that data read from the memory read / write unit 1013A (LSU) is actually data that the memory read / write unit 1013A reads from the memory unit 1017, while data written to the memory read / write unit 1013A (LSU) is first written to the memory read / write unit 1013A and then written from the memory read / write unit 1013A to the memory unit 1017.

[0168] Optionally, the graph computation flow unit 1012 can also directly read data from the memory unit 1017 according to the graph computation control instructions, or directly write the execution result to the memory unit 1017, depending on the specific instructions in the target program to be executed. That is, the graph computation flow unit 1012 can obtain data from the memory read / write unit 1013A or the memory unit 1017 according to the graph computation control instructions. Similarly, it can write data to the memory read / write unit 1013A or the memory unit 1017 according to the graph computation control instructions.

[0169] In this embodiment of the invention, the graph computing flow unit in the processor core 101 can reuse the function of the memory read / write unit in the processor core 101, and read data from or write data to the memory read / write unit LSU according to the read / write address in the relevant data read / write instructions.

[0170] In one possible implementation, the at least one general-purpose computing unit further includes a floating-point unit (FPU) or a vector computing unit (SIMD) 1013C; the graph computing task includes floating-point operations or vector operations; the graph computing flow unit 1012 is further configured to: send the data of the floating-point operations to the floating-point unit (FPU) for computation and receive the computation result fed back by the FPU; or send the data of the vector operations to the vector computing unit (SIMD) 1013C for computation and receive the computation result fed back by the SIMD.

[0171] The Floating Point Unit (FPU) 1013B is primarily responsible for floating-point operations and high-precision integer operations. Floating-point computing power is a crucial indicator of a CPU's performance in multimedia applications, audio / video encoding / decoding, and image / 3D graphics processing. It also affects the CPU's scientific computing performance, such as in fluid mechanics and quantum mechanics.

[0172] Single Instruction Multiple Data (SIMD), also known as the Vector Arithmetic Unit 1013C, is a technique for achieving data-level parallelism. The Vector Arithmetic Unit 1013C executes multiple operations simultaneously within a single instruction to increase processor throughput. Specifically, it uses a single vector instruction to initiate a group of data operations, where data loading, storage, and computation are performed in a pipelined manner. It is suitable for applications involving a large number of fine-grained, homogeneous, and independent data operations, such as multimedia, big data, and artificial intelligence applications.

[0173] Based on the above, the memory read / write operations of the graph computation flow unit 1012 (GFU) can reuse the memory read / write unit 1013A (LSU) in the processor 10, and floating-point and complex vector operations reuse the computation logic of the FPU and SIMD. This avoids the duplication of the computation logic inside the GFU, saves a lot of hardware area, and reduces the latency of switching from ordinary operations to graph operations.

[0174] In this embodiment of the invention, the general-purpose arithmetic unit may further include a floating-point arithmetic unit (FPU) and / or a vector arithmetic unit (SIMD). The FPU is used for floating-point arithmetic tasks that require higher data precision, while the SIMD unit is used for single-instruction multiple-data (SIMD) arithmetic. Since the general-purpose arithmetic units (including some dedicated arithmetic units) and the graph computation flow unit are in the same execution pipeline stage and have data transmission channels with each other, when the graph computation flow unit is processing graph computation tasks, if there are floating-point arithmetic tasks or SIMD arithmetic tasks, they can be sent to the corresponding general-purpose arithmetic units for processing through the corresponding data transmission channels. This eliminates the need to repeatedly set up corresponding processing units in the graph computation flow unit to process the corresponding types of arithmetic tasks, thereby greatly saving hardware area and overhead.

[0175] Based on the processor's structure and functional design described in this application, and the theoretical computational model of Graphflow, in one possible implementation, this application further defines the basic format of a flow instruction in the Graphflow Instruction-Set Architecture (Graphflow ISA). This format represents the computation method of each of the N computation nodes contained in the graph block information of this application, as well as the connection and sequence information between the N computation nodes. The format of an execution instruction executed by a single computation node can be represented as: [ID + opcode + dest0ID + dest1ID]

[0176] Refer to Figure 8, which shows an abstract model of graph computation flow instructions provided by an embodiment of the present invention. A flow instruction is placed on the computation node whose ID matches the instruction ID. The range of IDs is [0, N-1], where N is the total number of nodes in Graphflow. A flow instruction can express one or two dependencies, indicating that its result data is passed to dest0ID and dest1ID.

[0177] To better explain the graph computation flow architecture in this application, each computation node of Graphflow is abstracted as shown in Figure 8. Each abstract computation node can hold one instruction and has at most two outputs. Each computation node has its own left input (l) and right input (r) buffers, an operation code (opcode), and two destination pointers (dest0T and dest1T, where T indicates the left or right input of the destination instruction). Since the N nodes are assumed to be fully connected, the range of dest is [0, N-1], which means that the output of any node can point to the left input (l) or right input (r) buffer of any node.

[0178] The (opcode, dest0T, dest1T) fields of an abstract node can be written during the graph construction phase, but are read-only during the execution phase. Once the execution phase begins, all nodes check in parallel whether their left and right inputs have arrived. If both inputs are ready, the node performs its computation and passes the result to the left or right input of each destination node. If its inputs have not arrived, the node remains idle.
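
The firing rule of an abstract computation node can be sketched in Python as follows. The class `ComputeNode`, its attribute names, and the use of Python callables as opcodes are illustrative assumptions, not part of this application. The sketch models one node with left and right input buffers, an opcode, and up to two destination pointers, and it fires only when both inputs are present.

```python
import operator


class ComputeNode:
    """Hypothetical model of one abstract Graphflow computation node."""

    def __init__(self, opcode, dest0=None, dest1=None):
        # Written during graph construction, treated as read-only afterwards.
        self.opcode = opcode
        # Each destination is a (node ID, "l" or "r") pair; at most two.
        self.dests = [d for d in (dest0, dest1) if d is not None]
        # Left and right input buffers; None means "not arrived yet".
        self.inputs = {"l": None, "r": None}

    def ready(self):
        return all(value is not None for value in self.inputs.values())

    def fire(self, nodes):
        # A node whose inputs have not both arrived stays idle.
        if not self.ready():
            return False
        result = self.opcode(self.inputs["l"], self.inputs["r"])
        self.inputs = {"l": None, "r": None}
        # Pass the result to the named input buffer of each destination.
        for dest_id, side in self.dests:
            nodes[dest_id].inputs[side] = result
        return True


nodes = {
    0: ComputeNode(operator.add, dest0=(1, "l")),
    1: ComputeNode(operator.mul),
}
nodes[0].inputs["l"] = 2
fired_early = nodes[0].fire(nodes)   # right input missing, so the node idles
nodes[0].inputs["r"] = 3
fired_late = nodes[0].fire(nodes)    # both inputs present, result goes to 1l
```

In hardware all nodes perform the readiness check in parallel; calling `fire` on one node at a time is only a sequential stand-in for that behaviour.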

[0179] For example, this application can connect all the variables in a piece of code into a graph, as shown in Figure 9, which is a schematic diagram of abstracting code into a data flow graph, provided as an embodiment of the present invention:

[0180] Instructions 0, 1, 2, 5, 6, and 9 are placed in their respective computation units according to their IDs. Instructions 0 and 5 calculate the address of A[i], while instructions 1, 2, and 6 calculate the data (a+b). (c+d). Each instruction indicates the direction of data flow. The corresponding inputs and connections are configured during the graph construction phase.

[0181] During the execution phase, all computing nodes check their inputs for readiness in parallel. Therefore, the assembly code above is semantically concurrent, not sequential. `2add 6r` means "once all inputs 2l, 2r for the addition operation of instruction 2 arrive, perform the addition and pass the result to the right input (6r) of instruction 6". For example, `9st` means "once all inputs 9l, 9r for the store operation of instruction 9 arrive, perform the store operation". The store does not need to pass data to other instructions, therefore, a destination does not need to be declared in instruction 9.
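
The textual notation used above (`2add 6r`, `9st`) can be parsed into the [ID + opcode + dest0ID + dest1ID] fields with a short Python sketch. The exact assembly syntax is not specified in this application, so the grammar assumed here (a decimal ID fused with a lowercase opcode, followed by up to two destinations written as an ID plus `l` or `r`) and the names `FlowInstruction` and `parse_flow_instruction` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

Dest = Optional[Tuple[int, str]]  # (destination node ID, "l" or "r")


@dataclass(frozen=True)
class FlowInstruction:
    node_id: int
    opcode: str
    dest0: Dest = None
    dest1: Dest = None


_HEAD = re.compile(r"(\d+)([a-z]+)")   # assumed form, e.g. "2add"
_DEST = re.compile(r"(\d+)([lr])")     # assumed form, e.g. "6r"


def parse_flow_instruction(text: str, n_nodes: int) -> FlowInstruction:
    head, *dest_tokens = text.split()
    head_match = _HEAD.fullmatch(head)
    if head_match is None:
        raise ValueError(f"bad instruction head: {head!r}")
    if len(dest_tokens) > 2:
        raise ValueError("a flow instruction names at most two destinations")
    dests = []
    for token in dest_tokens:
        dest_match = _DEST.fullmatch(token)
        if dest_match is None:
            raise ValueError(f"bad destination: {token!r}")
        dests.append((int(dest_match.group(1)), dest_match.group(2)))
    node_id = int(head_match.group(1))
    # Every ID must lie in [0, N-1].
    for ident in [node_id] + [d[0] for d in dests]:
        if not 0 <= ident < n_nodes:
            raise ValueError(f"ID {ident} outside [0, {n_nodes - 1}]")
    dests += [None] * (2 - len(dests))
    return FlowInstruction(node_id, head_match.group(2), dests[0], dests[1])


add_instr = parse_flow_instruction("2add 6r", n_nodes=16)
store_instr = parse_flow_instruction("9st", n_nodes=16)
```

Note that the parsed instruction carries no source fields: consistent with the description, only destinations are encoded.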

[0182] As can be seen from the graph connections, the parallelism between instructions is explicit (e.g., instructions 0, 1, 2 and 5, 6). The only thing the hardware needs to do is check in parallel whether the inputs required by each node have arrived. This is why the Graphflow architecture does not require much hardware logic for dependency analysis. During the execution phase, a node can compute as soon as its inputs arrive, so there is no need to encode the source information of an instruction in the flow instruction. The inputs of each flow instruction may be passed in dynamically from different nodes or from other hardware modules. An instruction does not need to care where its data comes from; as long as other instructions send it the data it needs, it can compute. If its inputs have not arrived, it keeps waiting. Graphflow execution is therefore out-of-order issue and concurrent execution, and no fixed number of cycles is required for each computation node. Consequently, if Graphflow computation is interrupted at an arbitrary moment, there is no precise graph state; however, the intermediate state of the graph is held in the left and right input buffers of each instruction, so that intermediate state can be saved to memory.

[0183] Refer to Figure 10, which is a flowchart of a processing method provided in an embodiment of the present invention. The processing method is applied to a processor that includes a processor core; the processor core includes an instruction scheduling unit, a graph computation flow unit connected to the instruction scheduling unit, and at least one general-purpose arithmetic unit. The processing method is applicable to any of the processors in Figures 1-3 above and to a device containing such a processor (such as a mobile phone, computer, or server). The method may include the following steps S201-S203:

[0184] Step S201: The instruction scheduling unit allocates the general computing instructions in the decoded instructions to be executed to the at least one general-purpose arithmetic unit, and allocates the graph computing control instructions in the decoded instructions to be executed to the graph computation flow unit. The general computing instructions are used to instruct the execution of general computing tasks, and the graph computing control instructions are used to instruct the execution of graph computing tasks.

[0185] Step S202: Execute the general computing instructions through the at least one general-purpose computing unit;

[0186] Step S203: Execute the graph computation control instructions through the graph computation flow unit.
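
The routing performed in step S201 can be sketched as follows. The opcode mnemonics (`gf_build`, `gf_param`, `gf_start`, `gf_return`) and the queue names are hypothetical placeholders chosen for illustration; this application does not define them. The sketch covers only the allocation decision, after which each unit would execute its own queue (steps S202 and S203).

```python
from collections import defaultdict

# Hypothetical mnemonics for graph computation control instructions.
GRAPH_CONTROL_OPCODES = {"gf_build", "gf_param", "gf_start", "gf_return"}


def dispatch(decoded_instructions):
    """S201: route each decoded instruction to the unit that executes it."""
    queues = defaultdict(list)
    for opcode, operands in decoded_instructions:
        if opcode in GRAPH_CONTROL_OPCODES:
            unit = "graph_flow_unit"
        else:
            unit = "general_purpose_unit"
        queues[unit].append((opcode, operands))
    return dict(queues)


program = [
    ("add", (1, 2)),
    ("gf_build", (0x1000,)),
    ("load", (0x20,)),
    ("gf_start", (0x1000,)),
]
queues = dispatch(program)
```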

[0187] In one possible implementation, the processor core further includes an instruction fetch unit and an instruction decoding unit, and the above method further includes:

[0188] The target program to be executed is obtained through the instruction fetch unit;

[0189] The target program is decoded by the instruction decoding unit to obtain the decoded instruction to be executed.

[0190] In one possible implementation, the processor core further includes a result write-back unit; the graph computation flow unit and the at least one general-purpose arithmetic unit are respectively connected to the result write-back unit; the method further includes:

[0191] The first execution result of the general computing task is sent to the result write-back unit through the at least one general computing unit, and the first execution result of the general computing task is the result obtained by executing the general computing instruction;

[0192] The graph computation flow unit sends the second execution result of the graph computation task to the result write-back unit, whereby the second execution result of the graph computation task is the result obtained by executing the graph computation control instructions.

[0193] The result write-back unit writes back part or all of the first execution result and the second execution result to the instruction scheduling unit.

[0194] In one possible implementation, the processor further includes a memory unit; the graph computation flow unit includes N computation nodes; the graph computation control instructions include a start graph construction instruction, the start graph construction instruction carrying a target address in the memory unit; the execution of the graph computation control instructions through the graph computation flow unit includes:

[0195] The graph computation flow unit receives the start graph construction instruction and reads the graph block information from the memory unit according to the target address. The graph block information includes the operation method of each of the N computation nodes and the connection and order information between the N computation nodes.

[0196] In one possible implementation, the graph computation control instruction includes a parameter passing instruction, which carries identifiers of M computation nodes and input parameters corresponding to the identifiers of the M computation nodes, wherein the M computation nodes are some or all of the N nodes; executing the graph computation control instruction through the graph computation flow unit includes:

[0197] The graph computation flow unit receives the parameter transmission instruction and inputs the input parameters corresponding to the identifiers of the M computation nodes into the M computation nodes respectively.

[0198] In one possible implementation, the connection and order information between the N computing nodes includes the source and destination nodes corresponding to the L edges respectively; the execution of the graph computing control instructions through the graph computing flow unit includes:

[0199] The graph computation flow unit monitors whether the input parameters required for each of the N computation nodes are ready; for a target computation node whose input parameters are ready, the input parameters of the target computation node are input into the corresponding operation method of the target computation node for calculation to obtain the calculation result; according to the source node and destination node corresponding to the L edges respectively, the calculation result of the source node in each edge is used as the input parameter and input to the corresponding destination node.
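
The monitor, compute, and forward loop described above can be sketched over an explicit edge list. The function `run_graph`, the tuple layout `(source, destination, side)`, and the example operators are assumptions made for illustration. In particular, the multiplication at node 6 is only a placeholder for whatever operation combines the two sums. The sequential loop stands in for checks that the hardware performs in parallel.

```python
import operator


def run_graph(ops, edges, initial_inputs):
    """Execute a dataflow graph in which each node fires once.

    ops:            node ID -> binary function (the node's operation method)
    edges:          list of (source ID, destination ID, "l" or "r")
    initial_inputs: (node ID, side) -> value supplied from outside the graph
    Returns node ID -> computed result.
    """
    buffers = {node: {} for node in ops}
    for (node, side), value in initial_inputs.items():
        buffers[node][side] = value

    results = {}
    progressed = True
    while progressed:
        progressed = False
        for node, op in ops.items():
            # Skip nodes that already fired or whose inputs are not ready.
            if node in results or len(buffers[node]) < 2:
                continue
            results[node] = op(buffers[node]["l"], buffers[node]["r"])
            # Forward the result along every edge leaving this node.
            for src, dst, side in edges:
                if src == node:
                    buffers[dst][side] = results[node]
            progressed = True
    return results


ops = {1: operator.add, 2: operator.add, 6: operator.mul}
edges = [(1, 6, "l"), (2, 6, "r")]
initial_inputs = {(1, "l"): 1, (1, "r"): 2, (2, "l"): 3, (2, "r"): 4}
results = run_graph(ops, edges, initial_inputs)
```

Nodes 1 and 2 have no dependency on each other and become ready at the same time, which is the parallelism the edge list makes explicit; node 6 fires only after both have forwarded their results.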

[0200] In one possible implementation, the graph computation control instructions include a start graph computation instruction; the step of executing the graph computation control instructions through the graph computation flow unit to obtain the execution result of the graph computation task includes:

[0201] After receiving the start graph computation instruction, the graph computation flow unit checks whether the graph block information it has read is consistent with the pre-started graph block address, and determines whether the input parameters of the M computation nodes have been input. If the graph block information is consistent with the pre-started graph block address and the input parameters of the M computation nodes have been input, the graph computation task is started.
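
The two conditions checked before a graph computation task is started can be sketched as a small state holder. This is one possible reading, under the assumption that the start graph computation instruction names a graph block address that is compared with the address of the block already loaded; the class `GraphFlowUnitState` and its methods are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class GraphFlowUnitState:
    loaded_block_address: Optional[int] = None
    param_node_ids: Set[int] = field(default_factory=set)       # the M nodes
    received_params: Dict[int, int] = field(default_factory=dict)

    def build_graph(self, target_address, param_node_ids):
        # Models reading graph block information from the target address.
        self.loaded_block_address = target_address
        self.param_node_ids = set(param_node_ids)
        self.received_params = {}

    def pass_params(self, params):
        # Models a parameter passing instruction (node ID -> input value).
        self.received_params.update(params)

    def can_start(self, block_address):
        address_matches = self.loaded_block_address == block_address
        params_ready = self.param_node_ids.issubset(self.received_params)
        return address_matches and params_ready


gfu = GraphFlowUnitState()
gfu.build_graph(0x1000, param_node_ids=[0, 1])
gfu.pass_params({0: 5})
missing_param = gfu.can_start(0x1000)   # node 1 has no input yet
gfu.pass_params({1: 7})
wrong_block = gfu.can_start(0x2000)     # a different block was loaded
ready = gfu.can_start(0x1000)
```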

[0202] In one possible implementation, the method further includes:

[0203] After the graph computation flow unit receives the start graph computation instruction but before the graph computation task is completed, the instruction scheduling unit controls the processor core to enter a blocked state.

[0204] In one possible implementation, the method further includes:

[0205] The instruction scheduling unit sends a synchronous execution result instruction to the graph computation flow unit, and after the graph computation flow unit receives the synchronous execution result instruction but before completing the graph computation task, it controls the processor core to enter a blocked state.

[0206] In one possible implementation, the method further includes:

[0207] After the graph computation flow unit completes the graph computation task, the instruction scheduling unit controls the processor core to exit the blocked state.

[0208] In one possible implementation, the processor core further includes a result write-back unit, which includes multiple registers; the graph computation flow unit and the at least one general-purpose arithmetic unit are respectively connected to the result write-back unit; the graph computation control instruction includes a return parameter instruction, which carries the identifiers of K computation nodes and the registers corresponding to the identifiers of the K computation nodes; the execution of the graph computation control instruction through the graph computation flow unit to obtain the execution result of the graph computation task includes:

[0209] The graph computation flow unit controls the sending of the computation results of the K computation nodes to the corresponding registers in the result write-back unit.
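
The effect of a return parameter instruction can be sketched as a mapping from node identifiers to result registers. The function name `write_back_returns` and the register names `r0`, `r1`, and `r2` are hypothetical; the sketch only shows the results of K selected nodes being copied into their designated registers while other registers are left untouched.

```python
def write_back_returns(node_results, return_map, registers):
    """Copy the results of the K nodes named in return_map into registers.

    node_results: node ID -> computation result held by the graph flow unit
    return_map:   node ID -> name of the result register (K entries)
    registers:    register name -> value (the result write-back unit)
    """
    for node_id, register in return_map.items():
        registers[register] = node_results[node_id]
    return registers


node_results = {1: 3, 2: 7, 6: 21}
return_map = {6: "r0", 2: "r1"}          # K = 2
registers = write_back_returns(
    node_results, return_map, {"r0": 0, "r1": 0, "r2": 0}
)
```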

[0210] In one possible implementation, the general-purpose arithmetic instructions include general-purpose arithmetic logic instructions; the at least one general-purpose arithmetic unit includes an arithmetic logic unit (ALU); executing the general-purpose arithmetic instructions through the at least one general-purpose arithmetic unit includes:

[0211] The arithmetic logic unit (ALU) receives general arithmetic logic instructions sent by the instruction scheduling unit and performs logical operations.

[0212] In one possible implementation, the general-purpose arithmetic instructions include memory read / write instructions; the at least one general-purpose arithmetic unit includes a memory read / write unit (LSU); and the step of executing the general-purpose arithmetic instructions through the at least one general-purpose arithmetic unit to obtain the execution result of the general-purpose arithmetic task includes:

[0213] The memory read / write unit (LSU) receives memory read / write instructions sent by the instruction scheduling unit and performs memory read / write operations.

[0214] In one possible implementation, the graph computation control instructions include data read / write instructions, which carry memory read / write addresses; the method further includes:

[0215] The graph computation flow unit reads data from or writes data to the memory read / write unit (LSU) according to the memory read / write address in the data read / write instruction.

[0216] In one possible implementation, the at least one general-purpose arithmetic unit further includes a floating-point arithmetic unit (FPU); the graph computation task includes floating-point operations; the method further includes:

[0217] The graph computation flow unit sends the floating-point operation data to the floating-point arithmetic unit (FPU) for calculation, and receives the calculation results fed back by the FPU.

[0218] In one possible implementation, the at least one general-purpose computation unit further includes a vector computation unit (SIMD); the graph computation task includes vector operations; and the method further includes:

[0219] The graph computation flow unit sends the data for vector operations to the vector operation unit SIMD for computation, and receives the computation results fed back by the SIMD.

[0220] It should be noted that, for the specific process of the processing method described in the embodiments of the present invention, reference may be made to the relevant descriptions in the embodiments of Figures 1-9 above, which are not repeated here.

[0221] This invention also provides a computer-readable storage medium, wherein the computer-readable storage medium may store a program, which, when executed by a processor, enables the processor to perform some or all of the steps described in any of the above method embodiments.

[0222] This invention also provides a computer program that includes instructions that, when executed by a multi-core processor, enable the processor to perform some or all of the steps described in the above method embodiments.

[0223] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0224] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that this application is not limited to the described order of actions, as some steps may be performed in other orders or simultaneously according to this application. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily essential to this application.

[0225] In the several embodiments provided in this application, it should be understood that the disclosed apparatus can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units described above is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection between devices or units through some interfaces, and may be electrical or in other forms.

[0226] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0227] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0228] If the aforementioned integrated units are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which can be a personal computer, server, or network device, specifically a processor in the computer device) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium may include various media capable of storing program code, such as a USB flash drive, portable hard drive, magnetic disk, optical disk, read-only memory (ROM), or random access memory (RAM).

[0229] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.

Claims

1. A processor, characterized in that the processor comprises a processor core, and the processor core includes an instruction scheduling unit, a graph computation flow unit connected to the instruction scheduling unit, and at least one general-purpose arithmetic unit; wherein the instruction scheduling unit is configured to: allocate general computing instructions from the decoded instructions to be executed to the at least one general-purpose arithmetic unit, and allocate graph computing control instructions from the decoded instructions to be executed to the graph computation flow unit, wherein the general computing instructions are used to instruct the execution of a general computing task, and the graph computing control instructions are used to instruct the execution of a graph computing task; the at least one general-purpose arithmetic unit is used to execute the general computing instructions; the graph computation flow unit is used to execute the graph computing control instructions; the processor further includes a memory unit; the graph computation flow unit includes N computation nodes; the graph computing control instructions include a graph construction start instruction and a graph computation start instruction; the graph construction start instruction carries a target address in the memory unit; the graph computation flow unit is specifically used to: receive the graph construction start instruction and read graph block information from the memory unit according to the target address, wherein the graph block information includes the operation method of each of the N computation nodes and the connection and order information between the N computation nodes; after receiving the graph computation start instruction, check whether the graph block information read by the graph computation flow unit is consistent with the pre-started graph block address, and determine whether the input parameters in the M computation nodes have been input; and if the graph block information is consistent with the pre-started graph block address and the input parameters in the M computation nodes have been input, start the execution of the graph computation task.

2. The processor according to claim 1, characterized in that the processor core further includes: an instruction fetch unit, used to obtain the target program to be executed; and an instruction decoding unit, used to decode the target program to obtain the decoded instructions to be executed.

3. The processor according to claim 1, characterized in that the processor core further includes a result write-back unit; the graph computation flow unit and the at least one general-purpose arithmetic unit are respectively connected to the result write-back unit; the at least one general-purpose arithmetic unit is further configured to send the first execution result of the general computing task to the result write-back unit, wherein the first execution result of the general computing task is the result obtained by executing the general computing instructions; the graph computation flow unit is further configured to send the second execution result of the graph computing task to the result write-back unit, wherein the second execution result of the graph computing task is the result obtained by executing the graph computing control instructions; and the result write-back unit is used to write back part or all of the first execution result and the second execution result to the instruction scheduling unit.

4. The processor according to claim 1, characterized in that the graph computing control instructions include a parameter passing instruction, which carries the identifiers of M computation nodes and the input parameters corresponding to the identifiers of the M computation nodes, wherein the M computation nodes are some or all of the N computation nodes; and the graph computation flow unit is used to receive the parameter passing instruction and input the input parameters corresponding to the identifiers of the M computation nodes into the M computation nodes respectively.

5. The processor according to claim 4, characterized in that the connection and order information between the N computation nodes includes the source node and destination node respectively corresponding to each of L edges; the graph computation flow unit is specifically configured to: monitor whether the input parameters required by each of the N computation nodes are ready; for a target computation node whose input parameters are ready, input the input parameters of the target computation node into the operation method corresponding to the target computation node for calculation, to obtain a calculation result; and, according to the source node and destination node respectively corresponding to the L edges, input the calculation result of the source node of each edge into the corresponding destination node as an input parameter.
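
The firing rule in claim 5 (a node computes once its inputs are ready, and its result travels along each outgoing edge to the destination node) can be sketched in software as follows. This is only an illustrative model under stated assumptions: the function `run_graph`, the integer input ports, the per-node `arity` table, and the sequential polling loop are all hypothetical and say nothing about how the hardware monitors readiness.

```python
import operator


def run_graph(ops, arity, edges, initial_inputs):
    """Software model of dataflow execution.

    ops:            node id -> operation method (a callable)
    arity:          node id -> number of input parameters the node needs
    edges:          list of (source node, destination node, destination port)
    initial_inputs: list of (node id, port, value), as if passed in beforehand
    """
    inputs = {node: {} for node in ops}
    for node, port, value in initial_inputs:
        inputs[node][port] = value

    results = {}
    progressed = True
    while progressed:
        progressed = False
        for node in ops:
            # Skip nodes that already fired or whose inputs are not all ready.
            if node in results or len(inputs[node]) < arity[node]:
                continue
            args = [inputs[node][port] for port in sorted(inputs[node])]
            results[node] = ops[node](*args)
            # Forward the result along every edge whose source is this node.
            for src, dst, port in edges:
                if src == node:
                    inputs[dst][port] = results[node]
            progressed = True
    return results


# Three nodes: node 2 computes (1 + 2) - (3 * 4) from the results of nodes 0 and 1.
ops = {0: operator.add, 1: operator.mul, 2: operator.sub}
arity = {0: 2, 1: 2, 2: 2}
edges = [(0, 2, 0), (1, 2, 1)]  # L = 2 edges
results = run_graph(ops, arity, edges, [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)])
```

In this sketch no program counter orders the nodes; node 2 runs only because both of its inputs have arrived, which is the property the claim relies on.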

6. The processor according to claim 1, characterized in that the instruction scheduling unit is further configured to: after the graph computation flow unit receives the start graph computation instruction and before the graph computation task is completed, control the processor core to enter a blocked state.

7. The processor according to claim 1, characterized in that the instruction scheduling unit is further configured to: send a synchronous execution result instruction to the graph computation flow unit, and, after the graph computation flow unit receives the synchronous execution result instruction and before the graph computation task is completed, control the processor core to enter a blocked state.

8. The processor according to claim 6 or 7, characterized in that the instruction scheduling unit is further configured to: after the graph computation flow unit completes the graph computation task, control the processor core to exit the blocked state.
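
Claims 6 to 8 describe two alternative points at which the core blocks (on the start instruction, or later on a synchronous execution result instruction) and one point at which it unblocks. A small state model may make the difference easier to see. The class `SchedulerModel`, its method names, and the `block_on_start` flag are assumptions made for this sketch only.

```python
from enum import Enum


class CoreState(Enum):
    RUNNING = "running"
    BLOCKED = "blocked"


class SchedulerModel:
    """Hypothetical model of the blocking behaviour of the instruction scheduling unit."""

    def __init__(self, block_on_start):
        # True models claim 6 (block at start); False models claim 7 (block at sync).
        self.block_on_start = block_on_start
        self.state = CoreState.RUNNING
        self.task_pending = False

    def start_graph_computation(self):
        self.task_pending = True
        if self.block_on_start:
            self.state = CoreState.BLOCKED

    def sync_execution_result(self):
        # Block only if the graph computation task has not finished yet.
        if self.task_pending:
            self.state = CoreState.BLOCKED

    def graph_task_completed(self):
        # Claim 8: completing the task releases the blocked state.
        self.task_pending = False
        self.state = CoreState.RUNNING
```

With `block_on_start=False`, the core in this model stays in the running state between the start and the synchronization point, so other instructions could proceed while the graph computation task is still in flight.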

9. The processor according to any one of claims 1-7, characterized in that, The processor core further includes a result write-back unit, which includes multiple registers; the graph computation flow unit and the at least one general-purpose arithmetic unit are respectively connected to the result write-back unit; the graph computation control instruction includes a return parameter instruction, which carries the identifiers of K computation nodes and the registers corresponding to the identifiers of the K computation nodes; The graph computation flow unit is specifically used to: control the sending of the computation results of the K computation nodes to the corresponding registers in the result write-back unit.
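
The return parameter instruction of claim 9 pairs each of K node identifiers with a destination register. A minimal sketch of that mapping follows; the function `return_parameters` and the register names `x0` to `x2` are hypothetical and chosen only for illustration.

```python
def return_parameters(node_results, return_map, registers):
    """Copy the results of the K listed nodes into their paired registers.

    node_results: node id -> calculation result
    return_map:   list of (node id, register name), the instruction payload
    registers:    register name -> value, standing in for the result write-back unit
    """
    for node_id, register in return_map:
        registers[register] = node_results[node_id]
    return registers


node_results = {0: 3, 1: 12, 2: -9}
registers = {"x0": 0, "x1": 0, "x2": 0}
return_parameters(node_results, [(2, "x0"), (0, "x2")], registers)  # K = 2
```

Registers that the instruction does not name (here `x1`) are left untouched in this model.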

10. The processor according to any one of claims 1-7, characterized in that, The general computing instructions include general arithmetic logic instructions; the at least one general arithmetic unit includes: The arithmetic logic unit (ALU) is used to receive the general arithmetic logic instructions sent by the instruction scheduling unit and perform logical operations.

11. The processor according to any one of claims 1-7, characterized in that, The general-purpose computing instructions include memory read / write instructions; the at least one general-purpose arithmetic unit includes: The memory read / write unit (LSU) is used to receive the memory read / write instructions sent by the instruction scheduling unit and perform memory read / write operations.

12. The processor according to claim 11, characterized in that the graph computation control instructions include data read/write instructions, which carry memory read/write addresses; the graph computation flow unit is further configured to: read data from or write data to the memory read/write unit (LSU) according to the memory read/write address in the data read/write instruction.

13. The processor according to claim 12, characterized in that the at least one general-purpose arithmetic unit further includes a floating-point arithmetic unit (FPU); the graph computation task includes floating-point operations; the graph computation flow unit is further configured to: send the data of the floating-point operations to the floating-point arithmetic unit (FPU) for calculation, and receive the calculation results fed back by the FPU.

14. The processor according to claim 12, characterized in that the at least one general-purpose arithmetic unit further includes a vector operation unit (SIMD); the graph computation task includes vector operations; the graph computation flow unit is further configured to: send the data of the vector operations to the vector operation unit (SIMD) for calculation, and receive the calculation results fed back by the SIMD.
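
Claims 12 to 14 have the graph computation flow unit reuse the core's LSU, FPU, and SIMD units rather than duplicating them. The sketch below models that delegation in software. Everything named here (`LsuModel`, `fpu_execute`, `simd_execute`, `graph_unit_execute`, the operation names, and the dictionary layout of a node) is an assumption made for illustration, not an interface taken from the patent.

```python
import operator


class LsuModel:
    """Hypothetical stand-in for the memory read/write unit (LSU)."""

    def __init__(self):
        self.memory = {}

    def read(self, address):
        return self.memory.get(address, 0)

    def write(self, address, value):
        self.memory[address] = value


def fpu_execute(op, a, b):
    """Stand-in for the floating-point arithmetic unit (FPU)."""
    fn = {"fadd": operator.add, "fmul": operator.mul}[op]
    return float(fn(a, b))


def simd_execute(op, xs, ys):
    """Stand-in for the vector operation unit (SIMD): element-wise operation."""
    fn = {"vadd": operator.add, "vmul": operator.mul}[op]
    return [fn(x, y) for x, y in zip(xs, ys)]


def graph_unit_execute(node, lsu):
    """Route one graph node's work to the shared unit that handles its kind."""
    kind = node["kind"]
    if kind == "load":
        return lsu.read(node["address"])
    if kind == "store":
        lsu.write(node["address"], node["value"])
        return None
    if kind == "float":
        return fpu_execute(node["op"], *node["args"])
    if kind == "vector":
        return simd_execute(node["op"], *node["args"])
    raise ValueError(f"unknown node kind: {kind}")


lsu = LsuModel()
graph_unit_execute({"kind": "store", "address": 0x40, "value": 7}, lsu)
loaded = graph_unit_execute({"kind": "load", "address": 0x40}, lsu)
```

The design point this illustrates is that results flow back to the graph computation flow unit from each shared unit, so a graph node can consume them like any other input.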

15. A processing method, characterized in that the method is applied to a processor, the processor includes a processor core, and the processor core includes an instruction scheduling unit, a graph computation flow unit connected to the instruction scheduling unit, and at least one general-purpose arithmetic unit; the processor further includes a memory unit; the graph computation flow unit includes N computation nodes; the method includes: allocating, by the instruction scheduling unit, the general-purpose computing instructions among the decoded instructions to be executed to the at least one general-purpose arithmetic unit, and allocating the graph computation control instructions among the decoded instructions to be executed to the graph computation flow unit, wherein the general-purpose computing instructions are used to instruct the execution of general-purpose computing tasks and the graph computation control instructions are used to instruct the execution of graph computation tasks; executing the general-purpose computing instructions through the at least one general-purpose arithmetic unit; and executing the graph computation control instructions through the graph computation flow unit; wherein the graph computation control instructions include a graph construction start instruction and a graph computation start instruction, and the graph construction start instruction carries a target address in the memory unit; executing the graph computation control instructions through the graph computation flow unit includes: receiving the graph construction start instruction through the graph computation flow unit, and reading graph block information from the memory unit according to the target address, wherein the graph block information includes the operation method of each of the N computation nodes and the connection and order information between the N computation nodes; after receiving the graph computation start instruction, checking whether the graph block information read by the graph computation flow unit is consistent with the pre-started graph block address, and determining whether the input parameters in the M computation nodes have been input; and if the graph block information is consistent with the pre-started graph block address and the input parameters in the M computation nodes have been input, starting the execution of the graph computation task.
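
The build-then-start sequence in claim 15 can be sketched as a small software model. It rests on one interpretive assumption that should be read with care: the consistency check is modelled as comparing the address the graph block was loaded from with the address named when computation is started. The class `GraphFlowUnitModel`, its method names, and the `input_nodes` field of the graph block are likewise hypothetical.

```python
class GraphFlowUnitModel:
    """Hypothetical model of graph construction and the start-up check."""

    def __init__(self, memory):
        self.memory = memory          # target address -> graph block information
        self.loaded_address = None
        self.block = None
        self.inputs = {}              # node id -> input parameter already passed in

    def start_graph_construction(self, target_address):
        # Read the graph block information from the memory unit at the target address.
        self.block = self.memory[target_address]
        self.loaded_address = target_address
        self.inputs = {}

    def pass_parameter(self, node_id, value):
        self.inputs[node_id] = value

    def start_graph_computation(self, block_address):
        """Return True only if both start conditions hold."""
        if self.block is None or self.loaded_address != block_address:
            return False              # loaded graph block does not match
        missing = [n for n in self.block["input_nodes"] if n not in self.inputs]
        return not missing            # every expected input parameter has arrived


memory = {0x1000: {"input_nodes": [0, 1]}}
gfu = GraphFlowUnitModel(memory)
gfu.start_graph_construction(0x1000)
gfu.pass_parameter(0, 5)
started_early = gfu.start_graph_computation(0x1000)   # node 1 still lacks its input
gfu.pass_parameter(1, 7)
started = gfu.start_graph_computation(0x1000)
```

Both conditions must hold before execution starts in this model: a matching graph block and a complete set of input parameters.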

16. The method according to claim 15, characterized in that the processor core further includes an instruction acquisition unit and an instruction decoding unit; the method further includes: acquiring the target program to be executed through the instruction acquisition unit; and decoding the target program through the instruction decoding unit to obtain the decoded instructions to be executed.

17. The method according to claim 15, characterized in that the processor core further includes a result write-back unit; the graph computation flow unit and the at least one general-purpose arithmetic unit are respectively connected to the result write-back unit; the method further includes: sending a first execution result of the general-purpose computing task to the result write-back unit through the at least one general-purpose arithmetic unit, wherein the first execution result is the result obtained by executing the general-purpose computing instructions; sending a second execution result of the graph computation task to the result write-back unit through the graph computation flow unit, wherein the second execution result is the result obtained by executing the graph computation control instructions; and writing back, by the result write-back unit, part or all of the first execution result and the second execution result to the instruction scheduling unit.

18. The method according to claim 15, characterized in that the graph computation control instructions include a parameter passing instruction, which carries the identifiers of M computation nodes and the input parameters corresponding to the identifiers of the M computation nodes, wherein the M computation nodes are some or all of the N computation nodes; executing the graph computation control instructions through the graph computation flow unit includes: receiving the parameter passing instruction through the graph computation flow unit, and inputting the input parameters corresponding to the identifiers of the M computation nodes into the M computation nodes respectively.

19. The method according to claim 18, characterized in that, The connection and order information between the N computing nodes includes the source and destination nodes corresponding to the L edges respectively; the execution of the graph computing control instructions through the graph computing flow unit includes: The graph computation flow unit monitors whether the input parameters required for each of the N computation nodes are ready; for a target computation node whose input parameters are ready, the input parameters of the target computation node are input into the corresponding operation method of the target computation node for calculation to obtain the calculation result; according to the source node and destination node corresponding to the L edges respectively, the calculation result of the source node in each edge is used as the input parameter and input to the corresponding destination node.

20. The method according to claim 15, characterized in that, The method further includes: The instruction scheduling unit controls the processor core to enter a blocked state after the graph computation flow unit receives the start graph computation instruction but before the graph computation task is completed.

21. The method according to claim 15, characterized in that, The method further includes: The instruction scheduling unit sends a synchronous execution result instruction to the graph computation flow unit, and after the graph computation flow unit receives the synchronous execution result instruction but before completing the graph computation task, it controls the processor core to enter a blocked state.

22. The method according to claim 20 or 21, characterized in that the method further includes: after the graph computation flow unit completes the graph computation task, controlling, by the instruction scheduling unit, the processor core to exit the blocked state.

23. The method according to any one of claims 15-21, characterized in that the processor core further includes a result write-back unit, which includes multiple registers; the graph computation flow unit and the at least one general-purpose arithmetic unit are respectively connected to the result write-back unit; the graph computation control instructions include a return parameter instruction, which carries the identifiers of K computation nodes and the registers corresponding to the identifiers of the K computation nodes; executing the graph computation control instructions through the graph computation flow unit to obtain the execution result of the graph computation task includes: sending, under the control of the graph computation flow unit, the calculation results of the K computation nodes to the corresponding registers in the result write-back unit.

24. The method according to any one of claims 15-21, characterized in that, The general computing instructions include general arithmetic logic instructions; the at least one general arithmetic unit includes an arithmetic logic unit (ALU); executing the general computing instructions through the at least one general arithmetic unit includes: The arithmetic logic unit (ALU) receives general arithmetic logic instructions sent by the instruction scheduling unit and performs logical operations.

25. The method according to any one of claims 15-21, characterized in that, The general-purpose computing instructions include memory read / write instructions; the at least one general-purpose arithmetic unit includes a memory read / write unit (LSU); the execution of the general-purpose computing instructions through the at least one general-purpose arithmetic unit to obtain the execution result of the general-purpose computing task includes: The memory read / write unit (LSU) receives memory read / write instructions sent by the instruction scheduling unit and performs memory read / write operations.

26. The method according to claim 25, characterized in that the graph computation control instructions include data read/write instructions, which carry memory read/write addresses; the method further includes: reading data from or writing data to the memory read/write unit (LSU), by the graph computation flow unit, according to the memory read/write address in the data read/write instruction.

27. The method according to claim 26, characterized in that, The at least one general-purpose arithmetic unit further includes a floating-point arithmetic unit (FPU); the graph computation task includes floating-point operations; the method further includes: The graph computation flow unit sends the floating-point operation data to the floating-point arithmetic unit (FPU) for calculation and receives the calculation results fed back by the FPU.

28. The method according to claim 26, characterized in that, The at least one general-purpose computing unit further includes a vector operation unit (SIMD); the graph computation task includes vector operations; the method further includes: The graph computation flow unit sends the data for vector operations to the vector operation unit SIMD for computation, and receives the computation results fed back by the SIMD.

29. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program that, when executed by a processor, implements the method described in any one of claims 15-28.

30. A computer program, characterized in that the computer program includes instructions that, when executed by a processor, cause the processor to perform the method described in any one of claims 15-28.