Circuit and method for instruction execution dependent on trigger conditions

The spatial architecture addresses the von-Neumann bottleneck by executing groups of program instructions in parallel based on trigger conditions, enhancing parallelism and reducing pipeline bubbles, thus improving execution efficiency.

JP7884006B2 · Active · Publication Date: 2026-07-02 · ARM LTD

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Patents
Current Assignee / Owner: ARM LTD
Filing Date: 2022-01-19
Publication Date: 2026-07-02

AI Technical Summary

Technical Problem

Conventional architectures face challenges in maximizing instruction-level parallelism (ILP) due to the von-Neumann bottleneck, which limits performance in executing program instructions efficiently.

Method used

A spatial architecture is employed that utilizes a processing circuit with an instruction storage unit and a trigger circuit to execute groups of program instructions in parallel based on trigger conditions, leveraging a network-on-chip for data communication and distributed memory, and optimizing instruction bundling to enhance parallelism.

Benefits of technology

This approach significantly enhances parallel execution by reducing pipeline bubbles and branch prediction complexities, improving scalability and efficiency by allowing concurrent execution of multiple instructions without the need for traditional renaming operations or real-time operand tracking.

Patent Text Reader

Abstract

The circuit comprises processing circuitry configured to execute program instructions dependent on a respective trigger condition matching a current trigger state and to set a next trigger state in response to execution of the program instructions, the processing circuitry comprising an instruction store configured to selectively provide groups of two or more program instructions for execution in parallel, and a trigger circuit that controls the instruction store to provide program instructions of a given group of program instructions for execution in response to generation of a trigger state by execution of the program instructions and a trigger condition associated with the given group of program instructions.

Description

Technical Field

[0001] This disclosure relates to circuits and methods.

[0002] The so-called "spatial architecture" can accelerate an application by unfolding or unrolling specific calculations that form the time-consuming part of the application execution mainly in "space" rather than in time.

[0003] The calculations are unfolded in "space" by using a number of hardware units capable of concurrent operation. In addition to taking advantage of the concurrency opportunities provided by computations spread across the chip, the spatial architecture also utilizes distributed on-chip memory, whereby each processing element is associated with one or more memory blocks in its proximity. As a result, the spatial architecture can alleviate the so-called von-Neumann bottleneck that affects many conventional architectures and potentially hinders performance.

[0004] This disclosure relates to potential improvements in such configurations.

Summary of the Invention

[0005] In an exemplary configuration, a circuit is provided that includes: a processing circuit configured to execute program instructions depending on each trigger condition matching the current trigger state and to set the next trigger state in response to the execution of the program instructions, the processing circuit including an instruction storage unit configured to selectively provide groups of two or more program instructions for parallel execution; and a trigger circuit that controls the instruction storage unit to provide program instructions of a given group of program instructions for execution in response to the generation of a trigger state by the execution of program instructions and a trigger condition associated with the given group of program instructions.

[0006] In another exemplary configuration, a processing array is provided, comprising an array of such circuits and a data communication circuit for communicating data between the circuits of the array.

[0007] In another exemplary configuration, a method is provided that includes: executing program instructions depending on each trigger condition matching the current trigger state, and setting the next trigger state in response to the execution of the program instructions; providing, by an instruction storage unit, groups of two or more program instructions for parallel execution; and controlling the instruction storage unit to provide program instructions of a given group of program instructions for execution in response to the generation of a trigger state by the execution of program instructions and a trigger condition associated with the given group of program instructions.

[0008] In another exemplary configuration, a computer-implemented method is provided that includes: generating program instructions for execution depending on their respective trigger conditions, the execution of a program instruction setting the next trigger state; dividing the program instructions into groups of program instructions, wherein at least some groups contain two or more program instructions, and wherein a program instruction within a given group does not depend on the result of another program instruction within that group; and generating an input trigger condition and an output trigger state for each group, wherein the input trigger condition, when met, enables the execution of the program instructions in that group, and the output trigger state is a state to be generated in response to the completion of the execution of all program instructions within that group.

[0009] In another exemplary configuration, a compiler is provided comprising computer program code which, when executed by a computer, causes the computer to perform the method defined above.

[0010] Further aspects and features of this technology are defined by the attached claims.

Brief Description of the Drawings

[0011] The present technology will be further described, merely as an example, with reference to the embodiments shown in the attached drawings.

[0012]
[Figure 1] A schematic diagram of an exemplary processing circuit array.
[Figure 2] A schematic diagram of a compute tile.
[Figure 3] A schematic diagram of a memory tile.
[Figure 4] A schematic diagram of a dataflow graph.
[Figure 5] A schematic example of memory usage.
[Figure 6] A schematic diagram of a respective circuit.
[Figure 7] A schematic diagram of a respective circuit.
[Figure 8] A schematic diagram of a respective circuit.
[Figure 9] A schematic diagram of an exemplary data processing device.
[Figure 10] A schematic representation of the operation of a compiler.
[Figure 11] A schematic flowchart illustrating a respective method.
[Figure 12] A schematic flowchart illustrating a respective method.
[Figure 13] A schematic illustration of a simulation embodiment.

Detailed Description of the Invention

Exemplary processing array

[0013] Referring to the drawings, an exemplary instance of the spatial architecture is schematically shown in FIG. 1.

[0014] In this exemplary configuration, a two-dimensional array 100 of data processing elements 110 is connected to a memory array 120 such as a cache hierarchy or main memory via a data transfer unit 130 called an Interface Tile (IT).

[0015] This example of the spatial architecture includes two types of data processing elements, namely, a so-called Compute Tile (CT) 112 that performs most of the data processing operations and arithmetic calculations, and a so-called Memory Tile (MT) 114 that is mainly responsible for data access to locally connected memories and data transfer to / from remotely located memory regions and other processing elements.

[0016] In an exemplary embodiment, a local memory block, also called a scratch pad, connected to or associated with each memory tile (MT) (not shown in FIG. 1 but described later with reference to FIG. 3) is provided, and each MT has a direct connection to one respective compute tile (CT).

[0017] Each MT-CT cluster represents a data processing element 110, which is connected via a switch 140 (also called a router in some examples) to a network-on-chip 150, which represents an example of a data communication circuit that communicates data between circuits 110 in the array 100. In this example, the network-on-chip is used to transfer data between MTs and between each MT and an interface tile (IT) 130. However, other configurations are possible, such as having a single scratchpad shared among several MTs, or having MTs with direct access to two or more scratchpads. A one-to-one correspondence between CTs and MTs is not mandatory in the architecture, and one MT may be connected to two or more CTs, and vice versa. In other words, each processing element is connected via a set of input and output channels to a network-on-chip with switches and data links between these switches, forming a two-dimensional torus layout as shown in Figure 1.
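The wrap-around connectivity of a two-dimensional torus can be illustrated with a minimal sketch. The function name `torus_neighbors`, the 4 x 4 array size, and the assumption of exactly four links per switch are illustrative choices and are not taken from the disclosure.

```python
def torus_neighbors(row, col, rows, cols):
    """Return the four neighbours of switch (row, col) in a rows x cols 2D torus."""
    return [
        ((row - 1) % rows, col),  # north, wrapping from the top row to the bottom row
        ((row + 1) % rows, col),  # south, wrapping from the bottom row to the top row
        (row, (col - 1) % cols),  # west, wrapping from the first column to the last
        (row, (col + 1) % cols),  # east, wrapping from the last column to the first
    ]


# Corner switch of a hypothetical 4 x 4 array: two of its links wrap around.
print(torus_neighbors(0, 0, 4, 4))  # [(3, 0), (1, 0), (0, 3), (0, 1)]
```

Because of the modulo arithmetic, every switch has the same number of neighbours, including those on the edge of the array.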

[0018] Although not shown in Figure 1, a first-in, first-out (FIFO) buffer called a "channel," which will be explained below with reference to Figures 2 and 3, is used to distribute data to the CT and MT and to transport processed data from them.

[0019] The architectures of CT and MT are based on a so-called Triggered Instruction Architecture (see Reference 1 cited below), which has been extended to support vector processing and more advanced data transfer operations.

Triggered instructions

[0020] In some examples, each instruction has or is associated with one or more sets of "trigger conditions," and is issued or sent to the execution unit only when those trigger conditions are valid, i.e., when the trigger conditions match a trigger state generated by the execution of another program instruction. In practice, the execution of a program instruction not only depends on its trigger condition matching the current trigger state, but can also itself set the next trigger state.

[0021] The trigger conditions are specific to a particular circuit 110 and may depend on execution results, channel occupancy, or some other defined state of the processing element. Upon completion, each instruction can set one or more predicate registers that affect the trigger, and which can therefore subsequently be used to determine whether other instructions are ready to execute. In particular, this type of architecture typically does not have an explicit program counter or dedicated branch instructions. Its main advantages are the consequent simplification of the front-end circuitry of the processing element and the avoidance of pipeline bubbles caused by control flow hazards without relying on complex branch prediction mechanisms.

Triggered instruction: exemplary format

[0022] Generally, triggered instructions have the following format:

[0023]

[Instruction format listing not reproduced in the source text.]

[0024] The destination and source operands may be vector registers, scalar registers, predicates, or channels. In other words, the executed instruction may be a scalar instruction or a vector instruction (in which case the circuits 110, or at least some of them, may include a vector processing circuit configured to execute vector processing instructions, each vector processing instruction applying its respective processing operation to each vector of two or more data elements).

[0025] Therefore, the following instruction may be used as an example.

[0026]

[Example instruction listing not reproduced in the source text.]

[0027] The predicates may be maintained, for example, in the predicate storage unit or register 212 in the execution circuit 210, and can be read by the trigger circuit 250, which also receives information from the queue 240 that defines the trigger conditions associated with the queued program instructions (these trigger conditions may, in turn, be generated by the compilation operation to populate the queue). Therefore, in the example given above, the detection of the condition "when p==1001" is performed as follows. The queue 240 provides the trigger condition "p==1001" to the trigger circuit 250 (along with other trigger conditions applicable to instructions held in the queue). The trigger circuit 250 communicates with the predicate storage unit 212 and detects when the predicate stored there is equal to 1001. In response, the trigger circuit 250 issues a control signal to the queue 240, prompting the queue to issue the associated instruction "add z2,ich2,ich3;set p=1010" for decoding and execution.

[0028] In response to the completion of the execution of this instruction, the execution circuit 210 sets the predicate held by the predicate storage unit 212 to a new value 1010. The process outlined above is then repeated, with the trigger circuit detecting a match between this new predicate value and a trigger condition (communicated by the queue 240) associated with another queued instruction. Execution thus follows a chain in which each trigger state matches a subsequent trigger condition, the chain being established at the time of program code compilation.
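The issue-and-update cycle described in the two preceding paragraphs can be sketched in software. This is a behavioural illustration only, not the circuit of the disclosure: the `Triggered` record, the `run` function, and the second queued instruction (`mul z3,z2,z2` with next state 0000) are hypothetical; only the `add` instruction and its predicate values come from the text.

```python
from collections import namedtuple

# Hypothetical record: trigger condition, operation label, next trigger state.
Triggered = namedtuple("Triggered", "when op set_p")


def run(queue, p, max_steps=10):
    """Repeatedly issue the queued instruction whose trigger condition equals p."""
    trace = []
    for _ in range(max_steps):
        match = next((i for i in queue if i.when == p), None)
        if match is None:       # no trigger condition matches, so nothing issues
            break
        trace.append(match.op)  # stands in for decode and execute
        p = match.set_p         # completion sets the next trigger state
    return trace, p


queue = [
    Triggered("1010", "mul z3,z2,z2", "0000"),      # hypothetical follow-on instruction
    Triggered("1001", "add z2,ich2,ich3", "1010"),  # the example from the text
]
trace, final_p = run(queue, "1001")
print(trace, final_p)
```

Note that the order of entries in the queue does not determine the execution order; only the chain of predicate values does, which mirrors the absence of a program counter.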

[0029] It should be noted that the predicate storage unit 212 may be provided as part of the trigger circuit 250, or it may be provided as a separate circuit item that is writable by the execution circuit 210 and readable by the trigger circuit 250.

[0030] There are many constraints on the amount of computation that can be performed at any given location within a spatial architecture. Such constraints may include the speed at which data can be delivered to or removed from a given location, as well as power constraints or thermal constraints. As a result, some exemplary embodiments may act to adapt the amount of processing performed at a given location, for example, depending on the available network or data transfer bandwidth.

[0031] The exemplary embodiments of this disclosure may provide additional configuration options that may offer potentially higher parallelism opportunities and may make it potentially easier to balance the ratio of computation to network or memory bandwidth.

Exemplary compute tile

[0032] Figure 2 provides a typical example of a compute tile 112 that can act on scalar instructions and / or vector instructions, as described above.

[0033] As discussed above, one or more FIFO elements 200 act as input channels, providing input to the execution circuit 210, and one or more FIFO elements 220 act as output channels. Execution is performed by the execution circuit 210 with respect to one or more processor registers 230.

[0034] The instruction queue 240 provides an example of an instruction storage unit that provides program instructions for execution. The trigger circuit 250 is responsive to the generation of a trigger state by the execution of a program instruction (e.g., a previously executed program instruction) and to the trigger conditions associated with instructions held by the queue 240, and controls the queue 240 to provide program instructions for execution. Program instructions issued by the queue 240 are decoded by a decoding circuit 260 before being passed to the execution circuit 210.

Exemplary memory tile

[0035] The general schematic diagram of the MT 114 in Figure 3 is substantially the same as the schematic diagram of the CT in Figure 2, with the following exceptions.
(a) A "memory" block or circuit 300 is connected to the execution circuit.
(b) An "array interface" block or circuit 310 having an interface to the communication circuit 150 is connected to the execution circuit.

[0036] Otherwise, the operation of memory tile 114 corresponds to the operation of the compute tile discussed above. Note that MT uses triggered instructions like CT, and the execution path may be simpler in MT than in CT because (in some examples) MT does not necessarily need to perform bulk data processing, but MT still retains sufficient functionality to perform address calculation, data substitution, and data transfer between local storage 300 and the rest of the array via array interface 310.

[0037] Communication of data items between the CT 112 and the MT 114 of a circuit 110 occurs via one or more output channels of the tile that transmits the data items and one or more input channels of the tile that receives the data items.

Dataflow graph

[0038] In the techniques discussed below, it may be useful to first determine a dataflow graph (DFG) that represents the operations within the application in order to translate the computations within the application into space.

[0039] An example of such a DFG is schematically shown in Figure 4. The input channel (ichn) is represented along the top of the DFG, the output channel (och0) is represented at the bottom of the DFG, and computational operations such as multiplication (×), addition (+), subtraction (-), and maximum / minimum functions (max / min) are represented using input operands and output destinations of operations, along lines that schematically link the various operations.

[0040] This dataflow graph can then be divided and distributed across available hardware units. However, realistic DFGs tend to be larger than the available spatial resources, and therefore some form of time slicing is ultimately performed to allow the DFG to be mapped to hardware. Nevertheless, in some examples of the present spatial design, the scope of such time slicing may be limited compared to conventional architectures.

[0041] Within a spatial architecture, the overall speedup observed over conventional architectures is derived from a mixture of instruction-level parallelism (ILP), data-level parallelism (DLP), and task-level parallelism (TLP). Task-level parallelism, or the construction of task pipelines (connected via data streams), can be considered orthogonal to the orchestration level, including ILP and / or DLP. Thus, for example, task A can be placed in one set of processing elements and task B in another (with the two sets connected so that B can consume the data generated by A), and the two sets of processing elements can operate simultaneously, while each set individually attempts to fully utilize the ILP and / or DLP present in those parts of the DFG.

[0042] Vectorization (enabling a functional unit to operate on a group of data elements simultaneously) and tiling (dividing a dataset into fixed chunks spread across a spatial fabric) typically enable DLP extraction. However, even if the system has improved opportunities for DLP and TLP extraction as described above, there may still be a shortage of ILP, which is a potential source of parallel processing that is not currently adequately addressed in the triggered architecture. Embodiments of the present invention propose a method for efficiently instructing processing elements to act on a group of operations in parallel in a way that can potentially better balance the trade-off between the computational intensity of each processing element and the available memory or network bandwidth. In any embodiment of the present invention, the processing circuit may comprise a vector processing circuit configured to execute two or more vector processing instructions in parallel, each vector processing instruction applying its respective processing operation to each vector of two or more data elements.

Summary of the proposed technique

[0043] Next, we will describe an example in which the architectural extension proposed by this disclosure can potentially extract additional parallel execution.

[0044] This example concerns an application divided into multiple smaller parts, each running on a different processing element. Assume that one part of the application or kernel has the dataflow graph shown in Figure 4. An example is given of how this can be implemented in a triggered processing element.

[0045] It is assumed that vectorization and tiling have already been performed in this regard, and it should be noted that the operation shown in Figure 4 can process the vector data supplied from the input channel (where appropriate) and return the vector data to other processing elements or memory.

[0046] Observing the DFG in Figure 4, it can be noted that multiple operations can be executed in parallel. That is, nodes at the same "level" (vertical position as depicted) in the DFG do not have interdependencies and can be executed simultaneously without causing data hazards.
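The notion of nodes at the same "level" can be made concrete with a small sketch that assigns each node a level from its dependencies. The dependency table below is a hypothetical stand-in for the DFG of Figure 4 (which is not reproduced here); it is chosen only to be consistent with the node numbering used in the text.

```python
def levels(deps):
    """Group nodes by level: 0 for nodes with no predecessors,
    otherwise one more than the highest level among their predecessors."""
    memo = {}

    def level(node):
        if node not in memo:
            memo[node] = 1 + max((level(d) for d in deps[node]), default=-1)
        return memo[node]

    by_level = {}
    for node in sorted(deps):
        by_level.setdefault(level(node), []).append(node)
    return by_level


# Hypothetical DFG: nodes 1-4 read input channels, 5 and 6 combine them, 7 combines 5 and 6.
deps = {1: [], 2: [], 3: [], 4: [], 5: [1, 2], 6: [3, 4], 7: [5, 6]}
print(levels(deps))  # {0: [1, 2, 3, 4], 1: [5, 6], 2: [7]}
```

Nodes that share a level have no dependency on one another, so they are candidates for simultaneous execution.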

[0047] In the absence of the techniques proposed by this disclosure, one previously proposed implementation of this data flow may be as follows:

[0048]

[Program listing (lines 1-9) not reproduced in the source text.]

[0049] Here, "mul" represents the multiplication operation, "sub" represents the subtraction operation, and "ne" represents a "not equal" comparison, in this case against the immediate value 0.

[0050] The operations represented by the instructions on lines 1-7 correspond to the nodes (labeled 1-7) in the DFG in Figure 4. Furthermore, on lines 3 and 4, a channel dequeue operation ("deq") is performed, which removes the data item at the beginning of each specified channel. This can occur once all associated data has been utilized.

[0051] Register r1 holds a special iteration count value that is updated on line 8 (decremented by an immediate value of 1), and the comparison instruction on line 9 is used to choose between triggering the instruction on line 1 to restart the sequence of state transitions that implement the DFG, or to set it to some other state that presumably indicates that all relevant values have been processed.

[0052] The letter "z" is used in line 9 to indicate that this particular bit is set at runtime depending on the result of the comparison. That is, the bit at position 3 is set to 1 if r1!=0, but to 0 if r1==0. Thus, the final predicate result is 1000 if the comparison (ne) is successful, and 0000 if the check fails.
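The run-time resolution of a "z" bit can be sketched as follows. The helper name `resolve_predicate` and the representation of predicates as bit strings (most significant bit first, so that bit position 3 is the leftmost character of a four-bit predicate) are assumptions made for illustration.

```python
def resolve_predicate(pattern, results):
    """Fill each 'z' bit of a predicate pattern from a comparison result.

    pattern is written most significant bit first; results maps a bit
    position (0 = least significant) to the boolean outcome of a comparison.
    """
    width = len(pattern)
    bits = []
    for i, ch in enumerate(pattern):
        pos = width - 1 - i  # bit position of this character
        bits.append(("1" if results[pos] else "0") if ch == "z" else ch)
    return "".join(bits)


r1 = 3
print(resolve_predicate("z000", {3: r1 != 0}))  # 1000: the loop is triggered again
r1 = 0
print(resolve_predicate("z000", {3: r1 != 0}))  # 0000: some other state is triggered
```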

[0053] The configuration proposed by this disclosure allows a programmer or compiler to specify a "group" or "bundle" (the two terms are considered equivalent for the purposes of this specification) of one or more instructions that can be processed in parallel. In such a configuration, compilation or another operation may involve: generating program instructions for execution depending on their respective trigger conditions, the execution of a program instruction setting the next trigger state; dividing or bundling the program instructions into groups of program instructions, where at least some groups contain two or more program instructions, and where a program instruction in a given group does not depend on the result of another program instruction in that group; and generating an input trigger condition and an output trigger state for each group, where the input trigger condition, if met, enables the execution of the program instructions in that group, and the output trigger state is a state to be generated in response to the completion of the execution of all program instructions in that group.
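One way such a bundling step could be approached is sketched below. This is an assumed greedy strategy, not the compiler of the disclosure: the functions `bundle` and `chain_triggers`, the integer trigger states, and the dependency table are all hypothetical.

```python
def bundle(deps, width):
    """Greedy sketch: repeatedly take up to `width` nodes whose inputs are all done."""
    done, bundles = set(), []
    remaining = sorted(deps)
    while remaining:
        ready = [n for n in remaining if all(d in done for d in deps[n])]
        group = ready[:width]
        if not group:
            raise ValueError("dependency cycle: no instruction is ready")
        bundles.append(group)  # members are mutually independent by construction
        done.update(group)
        remaining = [n for n in remaining if n not in done]
    return bundles


def chain_triggers(bundles):
    """Give bundle k the (hypothetical) input condition k and output state k + 1."""
    return [{"when": k, "ops": ops, "set": k + 1} for k, ops in enumerate(bundles)]


# Hypothetical dependencies: nodes 1-4 are independent, 5 and 6 combine them, 7 combines 5 and 6.
deps = {1: [], 2: [], 3: [], 4: [], 5: [1, 2], 6: [3, 4], 7: [5, 6]}
for entry in chain_triggers(bundle(deps, width=4)):
    print(entry)
```

Each group is formed only from instructions whose inputs were produced by earlier groups, so no instruction depends on another member of its own group, and each group's output state is the next group's input condition.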

[0054] In other words, the following format may be defined to specify a bundle of instructions.

[0055]

[Bundle format listing not reproduced in the source text.]

[0056] In other words, rather than the trigger condition being associated with the start of each individual instruction and that single instruction generating a trigger state in response to its execution, in the proposed configuration, the trigger circuit controls queue 240 to provide that bundle of instructions for execution in response to the generation of a trigger state by the execution of the bundle of instructions and the trigger condition associated with a given bundle. In other words, predicate generation is performed on a bundle basis, and the testing of the predicate against the trigger condition is also performed on a bundle-by-bundle basis, so that the trigger condition is associated with the bundle rather than with individual instructions.

[0057] While a bundle can contain only one instruction, it should be noted that in many cases a bundle may contain two or more instructions, and in any case, the configuration of the present invention enables bundle-based trigger condition testing and trigger state generation.

[0058] Therefore, applying this method to the above code (and assuming that the circuit in use can execute up to, for example, four instructions in parallel), the following program code may be generated. Note that this code employs software pipelining, and in this case there is a prologue section (lines 1-10) used to align the operations in time. Similar bundled instruction programs can also be generated by unrolling the previous code and grouping the operations within the unrolled body, but such unrolling tends to result in code "bloat" (a term used to describe an undesirable growth in the total number of program instructions that perform a particular set of tasks) and is better used in scenarios where instruction space is not a critical resource.

[0059]

[Bundled program listing (lines 1-24) not reproduced in the source text.]

[0060] Comparing this code to the DFG in Figure 4, the following bundles of operations are defined.
• The operations shown as 1, 2, 3, and 4 in Figure 4. These are defined by the bundle shown in lines 1-6 of the listing. This bundle is executed when p==0010 and sets p=1010 in response to its execution.
• The operations shown as 5 and 6 in Figure 4. These are defined by the bundle shown in lines 7-10 of the listing. This bundle is executed when p==1010 and sets p=1000 in response to its execution.

[0061] As described above, these two bundles form a so-called prologue before the main loop. The execution of the bundles on lines 1-6 and 7-10 provides preliminary values for z5 and z6, namely z5' and z6'. Subsequently, by executing the bundles on lines 12-17 and 18-23, new values for z5 and z6, namely z5'' and z6'', are generated, while the old values z5' and z6' are consumed (see the operation on line 21). This process is then repeated (looping back from line 24 to line 12). That is, the operation on line 21 always consumes the previously generated values of z5 and z6, while the bundle on lines 18-23 generates new values for z5 and z6 for future iterations. The prologue is necessary to obtain the first pair of values for z5 and z6 at the beginning of the sequence.

[0062] Therefore, the main loop includes the following bundles.
• The operations shown as 1, 2, 3, and 4 in Figure 4. These are defined by the bundle shown in lines 12-17 of the listing. This bundle is executed when p==1000 and sets p=1100 in response to its execution.
• The operations shown as 5 and 6 in Figure 4. These are defined by the bundle shown in lines 18-23 of the listing. This bundle is executed when p==1100 and sets p=1101 in response to its execution.
The loop continuation operation discussed above is represented as a single-instruction bundle on line 24, which is executed when p==1101 and sets p=z000.

[0063] The loop then branches back to line 12, i.e., p==1000, if r1!=0. On the other hand, if r1==0, the check fails, and the code triggers an instruction / bundle with input predicate 0000 (not shown in this list) that handles further operations other than the DFG shown in Figure 4.
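The sequence of predicate states described above can be traced with a short sketch. The transition table is taken from the bundle descriptions in the text; the function name `trace_states` and, in particular, the point at which the iteration counter r1 is decremented (here, in the bundle triggered by p==1100) are assumptions, since the listing itself is not reproduced.

```python
# Bundle-level state transitions taken from the description of the listing.
TRANSITIONS = {
    "0010": "1010",  # prologue bundle, lines 1-6
    "1010": "1000",  # prologue bundle, lines 7-10
    "1000": "1100",  # main-loop bundle, lines 12-17
    "1100": "1101",  # main-loop bundle, lines 18-23
}


def trace_states(r1, p="0010"):
    """Return the predicate states visited for an initial iteration count r1."""
    if r1 < 1:
        raise ValueError("iteration count must be at least 1")
    states = [p]
    while p != "0000":
        if p == "1101":
            # Loop continuation bundle: set p = z000, with z resolved by r1 != 0.
            p = "1000" if r1 != 0 else "0000"
        else:
            if p == "1100":
                r1 -= 1  # assumption: the counter is decremented in this bundle
            p = TRANSITIONS[p]
        states.append(p)
    return states


print(trace_states(2))
```

For an iteration count of 2, the trace passes through the prologue once and the main loop twice before reaching 0000.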

[0064] A side effect of the bundled triggering technique demonstrated above is that it can help reduce the number of active predicate bits or state space that must be traversed by the program, leading to potential hardware improvements. Multiple instructions can utilize a single trigger condition, thereby potentially improving the scalability of instruction selection logic.

[0065] Furthermore, instruction bundles enable concurrent execution without the need for traditional renaming operations, dependency checks, or real-time operand availability tracking units, which are common in conventional processors and can potentially lead to inefficiencies due to the additional complexity they introduce.

[0066] Note that all channel dequeue (or enqueue) operations are deferred until after all instructions in the bundle have completed. Therefore, the dequeue operations specified on lines 4, 5, 15, and 16 will not be executed until after the entire corresponding bundle has completed. In other words, for one or more input channels that provide input data for the execution of a group, compiling the code may include generating one or more operations that dequeue the input data from those channels after the program instructions of the group have executed.

[0067] The illustrative circuits described below can ensure that instructions are executed truly concurrently and therefore no writes to intermediate register variables are seen until the end of the bundle. For example, writes to vector register z5 on line 19 and vector register z6 on line 20 will not occur until the end of the bundle and will therefore not be seen in the "sub" operation on line 21. Multiple writes to the same destination vector register or channel within a bundle can lead to non-deterministic behavior that should ideally be detected by the compiler. If not, some embodiments may generate exceptions, while some other embodiments may provide system registers that can indicate (and can be used for debugging purposes) that such failures have occurred.
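The "read old values, commit at the end" behaviour can be modelled in a few lines. This is a software analogy under stated assumptions: scalars stand in for vector registers, the register contents and the two `mul` source registers are hypothetical, and raising an exception on a duplicate destination is only one of the responses the text mentions.

```python
def execute_bundle(regs, ops):
    """Run a bundle: every op reads pre-bundle register values; writes commit at the end.

    ops is a list of (dest, function, sources) tuples. A ValueError is raised when
    two ops target the same destination, which would otherwise be non-deterministic.
    """
    pending = {}
    for dest, fn, srcs in ops:
        if dest in pending:
            raise ValueError(f"multiple writes to {dest} within one bundle")
        pending[dest] = fn(*(regs[s] for s in srcs))  # reads see old values only
    new_regs = dict(regs)
    new_regs.update(pending)  # all writes become visible together
    return new_regs


# Hypothetical register contents; z5 and z6 hold values from a previous iteration.
regs = {"z1": 2, "z2": 3, "z3": 4, "z4": 5, "z5": 10, "z6": 1, "z7": 0}
ops = [
    ("z5", lambda a, b: a * b, ("z1", "z2")),
    ("z6", lambda a, b: a * b, ("z3", "z4")),
    ("z7", lambda a, b: a - b, ("z5", "z6")),  # uses the old z5 and z6
]
out = execute_bundle(regs, ops)
print(out["z5"], out["z6"], out["z7"])  # 6 20 9
```

The subtraction yields 9 (10 - 1) rather than -14 (6 - 20), showing that the new values of z5 and z6 are not seen within the bundle that writes them.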

[0068] All instructions within a bundle share the same trigger condition and, therefore, depending on the current state of the machine, either all instructions proceed or none proceed. Thus, for example, if no data exists on input channel 3 (ich3), the "add" on line 14 and the "sub" on line 16 cannot be executed, but because they are bundled with the "mul" operation on line 13 and another "mul" operation on line 15, these operations will also be stalled, even if the data on which they operate is actually available. While this indirect synchronization of multi-channel readers has some advantages, it can be costly for large bundles due to the increased probability of stalls. As a result, some embodiments of the compiler may be able to operate to artificially reduce the bundle size from a specified size in the code in order to gain some of the benefits of bundling while limiting the frequency of stalls. In some examples, the bundle may be limited to the degree of parallelism provided by the circuitry used to execute the instructions, for example, some of the examples given below bundle up to four instructions.
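The all-or-nothing behaviour, together with dequeues deferred to the end of the bundle, can be sketched as below. The function names, the channel contents, and the choice of which channels the bundle reads and dequeues are hypothetical.

```python
from collections import deque


def bundle_ready(channels, needed):
    """A bundle may fire only if every input channel it reads holds data."""
    return all(len(channels[name]) > 0 for name in needed)


def fire(channels, needed, dequeues):
    """Fire the bundle if ready; dequeue the listed channels only after it completes."""
    if not bundle_ready(channels, needed):
        return False  # the whole bundle stalls, including ops whose data is present
    # ... the bundle's operations would read the channel heads here ...
    for name in dequeues:
        channels[name].popleft()  # deferred until the end of the bundle
    return True


channels = {"ich0": deque([1]), "ich1": deque([2]), "ich2": deque([3]), "ich3": deque()}
needed = ["ich0", "ich1", "ich2", "ich3"]
first = fire(channels, needed, dequeues=["ich2", "ich3"])   # stalls: ich3 is empty
channels["ich3"].append(4)
second = fire(channels, needed, dequeues=["ich2", "ich3"])  # now proceeds
print(first, second)  # False True
```

The more channels a bundle reads, the more conditions must hold at once before it can fire, which is the stall cost the paragraph above attributes to large bundles.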

[0069] Multiple execution pipelines or execution paths within a triggered microarchitecture capable of executing bundled instructions can operate simultaneously, but such pipelines may not be symmetrical. Consequently, in some embodiments, a compiler or some other similar tool may be used to verify that the mixture of instructions within each bundle is supported by the underlying hardware. If a certain execution "slot" or execution path accepts only a limited number of instructions, it may be necessary to rearrange (reconfigure) the instructions within the bundle to match the available hardware. Furthermore, if a certain combination of simultaneous operations is not supported, the bundle may need to be broken down by the compiler, or the triggered instruction hardware may invoke an ordering unit that can time-slice the operations within the bundle at runtime. In practice, this can also be true when the number of instructions in a bundle exceeds the maximum supported execution (parallel) width. Even when the bundle width matches the execution width, in some embodiments, due to register access constraints, not all instructions in the bundle can proceed simultaneously, and therefore some form of ordering may be unavoidable.
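A compiler-side check of this kind might look like the following sketch, which searches for an arrangement of a bundle's instructions over asymmetric execution slots. The function `assign_slots` and the capabilities of the four slots are hypothetical, and an exhaustive search over permutations is used only because the bundle widths involved are small.

```python
from itertools import permutations


def assign_slots(opcodes, slots):
    """Return a per-slot arrangement of opcodes (None = empty slot), or None if impossible.

    slots[i] is the set of opcodes that execution slot i accepts.
    """
    if len(opcodes) > len(slots):
        return None  # wider than the hardware: the bundle would need to be split
    padded = list(opcodes) + [None] * (len(slots) - len(opcodes))
    for perm in permutations(padded):
        if all(op is None or op in slots[i] for i, op in enumerate(perm)):
            return list(perm)
    return None


# Hypothetical asymmetric hardware with four execution slots.
slots = [{"mul", "add", "sub"}, {"add", "sub"}, {"add", "sub", "ne"}, {"mul"}]
print(assign_slots(["mul", "mul", "add", "sub"], slots))  # ['mul', 'add', 'sub', 'mul']
print(assign_slots(["mul", "mul", "mul"], slots))         # None: only two slots accept mul
```

A result of None corresponds to the case in which the bundle would have to be broken down by the compiler or time-sliced at run time.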

[0070] State transition instructions, such as comparisons (which write to predicate registers), are permitted within a bundle, but multiple writes to the same predicate register can lead to non-deterministic behavior. In addition, the predicate set operation that applies to the entire bundle carries a high-impedance indicator (such as the one shown on line 24) at the relevant bit position, which allows that bit to be set by the result of one or more comparisons within the bundle.

[0071] In other words, if there is only one comparison within the bundle, something like the following example might be used.

    * set p = z101 : when p == 1111
    sub z1, z2, z3
    mul z7, z6, z8
    ne p3, r1, #0
    sub z0, z1, z6
    *

[0072] Here, the transition in bit 3, which is normally associated with a single instruction, is now associated with the entire bundle, and the predicate bit "3" is updated only when the bundle is complete.

[0073] If there are more comparison operations within a bundle, it must be ensured that updates do not conflict, or in other words, that updates preferably apply to different predicate bits. For example,

    * set p = z10z : when p == 1111
    sub z1, z2, z3
    ne p0, r2, #0
    ne p3, r1, #0
    sub z0, z1, z6
    *
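The conflict rule and the role of the high-impedance ("z") positions can be sketched in Python as follows. The tuple encoding of instructions is an assumption, as is the convention that the leftmost character of the mask is the highest-numbered predicate bit and that a "z" position with no comparison result keeps its current value.

```python
def predicate_write_conflicts(bundle):
    """Return the predicate registers written by more than one instruction."""
    seen, conflicts = set(), set()
    for opcode, dest, *_ in bundle:
        if dest.startswith("p"):
            if dest in seen:
                conflicts.add(dest)
            seen.add(dest)
    return conflicts


def next_predicates(current, mask, compare_results):
    """Apply a bundle-wide set mask (leftmost character = highest bit).
    '0'/'1' force the bit; 'z' takes a comparison result if one targets
    that bit, otherwise the bit keeps its current value (assumption)."""
    width = len(mask)
    out = []
    for pos, m in enumerate(mask):
        bit = width - 1 - pos
        if m in "01":
            out.append(m)
        elif bit in compare_results:
            out.append("1" if compare_results[bit] else "0")
        else:
            out.append(current[pos])
    return "".join(out)


# The two-comparison bundle above, encoded as (opcode, dest, sources...).
bundle = [("sub", "z1", "z2", "z3"), ("ne", "p0", "r2", "#0"),
          ("ne", "p3", "r1", "#0"), ("sub", "z0", "z1", "z6")]
print(predicate_write_conflicts(bundle))                    # set()
# Hypothetical outcomes: the p3 comparison is true, the p0 one false.
print(next_predicates("1111", "z10z", {3: True, 0: False}))  # 1100
```

In this sketch the predicate bits are computed once for the whole bundle, which mirrors the statement that a bit is updated only when the bundle is complete.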

[0074] One aspect worth considering is the fact that bundle granularity does not have to be constant throughout the program, and in some programs it can even vary quite irregularly. Examining the programs with bundled trigger instructions presented above, we can observe several instances of this irregularity, as the programs have bundles of sizes 1, 2, and 4. Therefore, hardware implementations that lay out bundles in instruction memory as visible within the program may suffer from inefficiencies or underutilization of certain memory regions. In this regard, refer to Figure 5, which provides schematic examples of bundles of different sizes (represented by the first shading 500) and the corresponding underutilized or wasted memory regions (represented by the second shading 510) in each case.

[0075] Assuming that each triggered instruction involves a multiway state transition operation, the instructions do not necessarily need to be arranged sequentially in instruction memory, but can be freely rearranged. As a result, the exemplary embodiment proposes the instruction memory organization shown in Figure 6 to enable potentially more efficient storage of bundled instructions. Figure 6 shows one exemplary embodiment of a processing element in a spatial triggered architecture that supports instruction bundling.

[0076] The configuration in Figure 6 (and in practice, the configuration in Figure 7 or Figure 8) can represent the computation tile and / or memory tile configurations (other than the storage unit and array interface) discussed above, in other words, the circuitry at one node of the array in Figure 1. The compilation operation may involve routing groups of program instructions to individual processors in such a processor or array of circuits.

[0077] In other words, in Figure 6, bundles of similar size are grouped together within instruction queues (or more generally, instruction storage units) of different widths, and bundles of different sizes are placed in separate memory modules, thus collectively having a higher packing density than the implementation shown in Figure 5.
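The packing-density argument can be made concrete with a small Python calculation. The bundle mix below is hypothetical, and the model ignores queue depth; it only counts instruction slots.

```python
def slots_used_fixed(bundle_sizes, row_width):
    """Figure 5 style: every bundle occupies one full row of row_width slots."""
    return len(bundle_sizes) * row_width


def slots_used_per_width(bundle_sizes, queue_widths):
    """Figure 6 style: each bundle goes to the narrowest queue that fits it.
    Raises ValueError if a bundle is wider than every queue."""
    total = 0
    for size in bundle_sizes:
        total += min(w for w in queue_widths if w >= size)
    return total


def utilization(bundle_sizes, slots):
    """Fraction of allocated slots that actually hold an instruction."""
    return sum(bundle_sizes) / slots


sizes = [1, 2, 4, 1, 2, 4, 4]            # hypothetical program
fixed = slots_used_fixed(sizes, 4)
split = slots_used_per_width(sizes, (1, 2, 4))
print(fixed, split)                      # 28 18
print(utilization(sizes, fixed) < utilization(sizes, split))   # True
```

For this mix, the per-width organization uses 18 slots against 28 for the fixed-width layout, so nothing is wasted as long as every bundle size matches a queue width.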

[0078] The instruction output trigger 610, a signal from the trigger circuit 615, is extended to carry a "way" number or other indication (as described with reference to Figure 2) of the size of the bundle to be triggered next, thereby simplifying the retrieval of the associated bundle. The "way" number is also provided to the multiplexer 620 so that program instructions provided by the associated queue in the instruction storage unit 600 are routed for decoding and execution.

[0079] The actual sizing of the queues in the instruction storage unit 600 may depend on a mixture of bundle sizes within the profiled program. The routing circuit 625 operates under the control of signals or data provided by the compiler, for example, to route instructions into the appropriate queues of the instruction storage unit 600 for queuing, thereby routing groups of program instructions to an instruction queue selected from several instruction queues, which is configured to selectively provide groups of program instructions in parallel for execution. It should be noted that the present invention's configuration of bundling and providing queues in the instruction storage unit 600 provides the possibility that a particular bundle may contain one instruction, but also explicitly provides the possibility of multi-instruction bundles. If a bundle contains two or more instructions, the instruction storage unit is configured to provide two or more instructions (such as two, three, or four instructions) for parallel execution. It should be noted that, under the control of the compiler, a bundle of one or two instructions may be routed to a four-instruction queue by the routing circuit 625, depending on the need to store other bundles during its processing stage.
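The routing and selection behaviour described for the instruction storage unit can be modelled roughly as below. The class and method names are hypothetical, and the `way` argument merely stands in for the compiler-provided control and for the "way" number carried by the trigger signal.

```python
from collections import deque


class InstructionStore:
    """Toy model of per-width instruction queues (widths 1, 2 and 4)."""

    def __init__(self, widths=(1, 2, 4)):
        self.queues = {w: deque() for w in sorted(widths)}

    def route(self, bundle, way=None):
        """Queue a bundle. With no override, pick the narrowest queue that
        fits; `way` models a compiler decision to use a wider queue."""
        if way is None:
            way = next((w for w in self.queues if w >= len(bundle)), None)
        if way not in self.queues or len(bundle) > way:
            raise ValueError("bundle does not fit the selected queue")
        self.queues[way].append(list(bundle))
        return way

    def issue(self, way):
        """Model the multiplexer: pop the head bundle of the selected queue."""
        return self.queues[way].popleft()


store = InstructionStore()
print(store.route(["mul", "add", "sub"]))   # 4: narrowest queue that fits
print(store.route(["ne"]))                  # 1
print(store.route(["sub"], way=4))          # 4: compiler override
print(store.issue(4))                       # ['mul', 'add', 'sub']
```

The override path reflects the remark that a bundle of one or two instructions may be placed in the four-instruction queue when that suits the storage of other bundles.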

[0080] While it is not a requirement that a queue be provided to hold the entire bundle, in the exemplary configuration, the bundle can be compiled so that it does not exceed the maximum width of the queue provided by the circuit, as discussed below.

[0081] A queue can include one or more bundle depths, but in an exemplary embodiment, the queue includes at least two bundle depths so that when a bundle is provided by the queue for execution, another bundle is already queued and ready to become the new head of the queue.

[0082] It should be noted that the multiple instruction decoding units 630 shown in Figure 6 may be absent in embodiments where it is more efficient to store bit patterns corresponding to previously decoded instructions in a memory within the front-end (which can then be accessed in parallel, instead of replicating instruction decoding across multiple hardware units). In addition, in some embodiments both approaches may be present, and the power requirements or pipeline depth of the front-end can in some instances be reduced by skipping the decoding stage using a special memory that holds previously decoded bundles (obtained in advance during the compilation phase or when decoding preceding instances of bundled instructions).

[0083] The trigger circuit 615 is used to evaluate the current state of the processing elements (possibly using data from the result bus and channel occupancy information and / or accessing the predicate store (not shown) as described above) to determine which bundle's trigger condition has been met, select the triggered bundle from the instruction memory, and control the multiplexer 620 for routing the selected bundle.

[0084] Execution is performed on values held by register file 635 and input channel 640, which are routed to execution circuit 645 by an appropriate multiplexer 650. The execution results are presented on result bus 655, from which they can be routed to output channel 660 and returned to register file 635 and / or trigger circuit 615.

[0085] Accordingly, Figure 6 provides an example in which the instruction storage unit 600 comprises at least two instruction queues, each configured to provide groups of program instructions for execution, the instruction queues comprising a first instruction queue configured to provide groups of up to n program instructions for execution in parallel, and a second instruction queue configured to provide groups of up to m program instructions for execution in parallel, where m is not equal to n. For example, in Figure 6, m and n can each be any of the values 1, 2, and 4.

[0086] In this example, the trigger circuit 615 controls a given instruction queue 610 to provide a program instruction that has been queued for execution in response to the generation of a trigger state by the execution of a program instruction and a trigger condition associated with a given instruction queue among at least two instruction queues of the instruction storage unit 600.

[0087] This configuration also includes a routing circuit 625 configured to route groups of program instructions to a selected one of the instruction queues, and multiple execution paths 630, 645 that execute in parallel a number of program instructions greater than or equal to the maximum number of program instructions (four in this example) provided in parallel by any one of the instruction queues of the instruction storage unit 600. In other embodiments, this number could be seven, for example, so that bundles from every queue can be executed in parallel.

[0088] Accordingly, Figure 6 (and, in practice, Figure 7 or 8) provides an example of a circuit comprising processing circuits 630, 645 configured to execute program instructions depending on whether each trigger condition matches the current trigger state, and to set the next trigger state in response to the execution of the program instructions, the processing circuit comprising an instruction storage unit 600 configured to selectively provide groups of two or more program instructions for execution in parallel, and a trigger circuit 615 that controls the instruction storage unit to provide program instructions (for example, a given group of program instructions) for execution in response to the generation of a trigger state by the execution of a group of program instructions and a trigger condition associated with the given group of program instructions.

Register file considerations

[0089] Figure 6 provides a common or monolithic register file 635. Thus, this configuration provides multiple execution paths, represented in Figure 6 by the decoding stage 630 and the execution circuit 645, together with a set of processor registers common to the execution of program instructions by any of the execution paths, i.e., a set of processor registers common to the execution of program instructions within a group of program instructions, and this set of processor registers is accessible during the execution of program instructions.

[0090] However, this configuration presents challenges in practical designs, because the register file 635 may need numerous read and write ports, for example 12 read ports and 4 write ports in one embodiment. This potentially prohibitive cost arises because each execution slot should ideally be able to access up to three source operands simultaneously, and the bandwidth of the back-end data paths supplying the forwarding paths and register write circuitry from the functional units must be high enough to avoid introducing stalls. Several potential solutions exist.

[0091] In some exemplary embodiments, operations such as multiply-and-accumulate (MLA), which require three source operands, may be limited to a small number of pipelines (execution paths), while other pipelines may support register file access for only one or two source operands. This approach can reduce the overall number of ports but may compromise flexibility and performance. Therefore, in this example, at least one of the execution paths is configured to execute program instructions having at most a first number of operands, and at least another execution path is configured to execute program instructions having at most a second number of operands, the second number being different from the first number. For example, the first and second numbers of operands may be 2 and 3.
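The port arithmetic behind this trade-off is simple enough to state as a Python sketch. The function name and the particular mix of one three-operand path with three two-operand paths are illustrative assumptions, as is the rule of one write port per path.

```python
def port_budget(max_source_operands):
    """Given the most source operands each execution path may read,
    return (read_ports, write_ports) for a monolithic register file.
    One result per path is assumed for the write-port count."""
    read_ports = sum(max_source_operands)
    write_ports = len(max_source_operands)
    return read_ports, write_ports


# Four symmetric paths, each able to read three operands.
print(port_budget([3, 3, 3, 3]))   # (12, 4)
# Hypothetical asymmetric mix: only one path supports three operands.
print(port_budget([3, 2, 2, 2]))   # (9, 4)
```

The first result matches the 12 read ports and 4 write ports mentioned for one embodiment; the second shows how restricting three-operand instructions to one path would lower the read-port count in this simplified model.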

Split register file

[0092] Next, another embodiment using a partitioned register file will be described with reference to Figure 7. Here, the same reference numbers are given to the configuration and features common to Figure 6, and they will not be described in detail again.

[0093] The configuration in Figure 7 provides at least two sets of processor registers (shown in Figure 7 as "register bank n" 700, where n = 0, 1, 2, or 3), one set for each execution path, where the set of processor registers for a given execution path is accessible during the execution of program instructions by that execution path, and a communication circuit 710 for communicating data between the sets of processor registers.

[0094] In this embodiment, during the compilation phase, the register allocator can be restricted such that all instructions sharing the same slot or execution path share a common portion of the register file (bank n), but an explicit "register move" is required to transfer information between banks using the inter-bank communication circuit 710. In this scheme, lines 12-23 of the previously shown code can be rewritten as follows.

[0095] (Code listing: lines 12-23 rewritten for the banked register file; not reproduced.)

[0096] To reduce the number of inter-bank data movement operations, we can see that the instructions within one of the bundles are "reordered." In the register notation used here, the suffixes "_0", "_1", "_2", and "_3" identify the respective register bank containing the registers. The explicit movement operations required to support the instruction bundling are shown on line 19.
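A check for the banking constraint might be sketched in Python as follows. The register names such as "r2_0" follow the suffix notation described above but are otherwise hypothetical, and the sketch assumes that slot i is served by bank i.

```python
def bank_of(reg):
    """'r5_2' -> 2. Names without a numeric bank suffix (for example
    channels or immediates) return None."""
    _name, sep, suffix = reg.rpartition("_")
    return int(suffix) if sep and suffix.isdigit() else None


def cross_bank_reads(bundle):
    """bundle[i] is (opcode, dest, sources) for slot i. Return every
    (slot, register) pair where a source lives in another bank, i.e.
    where an explicit inter-bank move would be needed first."""
    problems = []
    for slot, (_opcode, _dest, sources) in enumerate(bundle):
        for src in sources:
            bank = bank_of(src)
            if bank is not None and bank != slot:
                problems.append((slot, src))
    return problems


# Hypothetical two-instruction bundle: slot 1 reads a bank-0 register.
bundle = [("add", "r1_0", ["r2_0", "r3_0"]),
          ("mul", "r4_1", ["r2_0", "r5_1"])]
print(cross_bank_reads(bundle))   # [(1, 'r2_0')]
```

A register allocator could use such a report either to insert a move through the inter-bank communication circuit or to reorder instructions within the bundle so that fewer moves are needed.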

[0097] In Figure 7, there are four input channels 640 that are not banked like a register file. This can be an acceptable trade-off, as the number of input channels is typically much smaller than that of a main register file and they are "read-only". In addition, since data sent to the four output channels is never read back, banking can be avoided in some embodiments. However, in other examples, the input channels may also be banked in the same way as a register file.

[0098] It should be noted that the return path 720 allows the results of execution by a particular execution path to be written back to the register bank associated with that execution path, without requiring any additional operation using the inter-bank communication circuit 710.

[0099] In another embodiment using a partitioned register file, as shown in Figure 8, a further example of the register access scheme provides a bundled triggered instruction set architecture that reduces the number of read ports required for the register file while simultaneously enabling parallel access by multiple slots or execution paths.

[0100] Here, too, features common to Figure 6 or Figure 7 are given the same numbers and will not be explained in further detail.

[0101] However, unlike the method in Figure 7, in Figure 8, no explicit move operation is required to maintain coherence between slots or execution paths. In this method, each slot has a small local register file called a Slot Register Buffer (SRB0...3) 800, which holds copies of operands retrieved from the main register file 810 (common to all paths) or operands written by previous data processing operations performed by its execution path (note that there is a return route 820 for data generated by a given execution path).

[0102] There are many possible techniques for managing the information flowing into or out of such a structure, one of which is described below. By keeping each SRB 800 small relative to the main register file 810 and limiting the number of read ports each has, it is possible to exploit opportunities for concurrent execution of bundles while remaining within a reasonable power envelope.

[0103] The Register Buffer Management Unit (RBMU) 830 is responsible for moving data between the main register file 810 and each SRB 800. Whenever a slot attempts to read a register value from its SRB 800, it consults a small local structure called the Register Index Translation Table (RITT) 840 and, if a valid entry is found, retrieves the actual value from the SRB 800 using the corresponding index stored in the RITT 840. In some embodiments, the RITT 840 may be implemented using latches. The indirection provided by the RITT allows the SRB 800 to be compact and fully associative while avoiding the introduction of numerous expensive comparators into potentially critical paths. If no valid entry is found, the value may be obtained by directly accessing another SRB 800 (that is, the SRB of another path). If no other SRB has a valid entry, the register value is read from the main register file 810. The newly retrieved value is placed in the SRB 800 of the requesting slot, and an entry is created in the corresponding RITT 840 with its valid bit set to true. As one might expect, at the start of a program containing bundled instructions, many reads must be served directly from the main register file 810, and these accesses may initially be delayed by the limited number of read ports of the main register file 810. However, as the program progresses and begins to iterate over bundled triggered instructions, more accesses may be served by the SRBs 800 than by the main register file, potentially improving parallel execution.
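The read path described above (own SRB, then another SRB, then the main register file) can be sketched in Python. All class and function names are hypothetical, the RITT is modelled as a dictionary holding only valid entries, and eviction on a full SRB is left out.

```python
class SlotBuffer:
    """Toy SRB plus RITT for one execution path."""

    def __init__(self, capacity=4):
        self.values = [0] * capacity        # SRB storage
        self.free = list(range(capacity))   # unused SRB indices
        self.ritt = {}                      # register number -> SRB index

    def read(self, reg):
        """Return (True, value) on a valid RITT entry, else (False, None)."""
        if reg in self.ritt:
            return True, self.values[self.ritt[reg]]
        return False, None

    def fill(self, reg, value):
        """Store a value and create its RITT entry; False if the SRB is full."""
        if reg in self.ritt:
            idx = self.ritt[reg]
        elif self.free:
            idx = self.free.pop()
            self.ritt[reg] = idx
        else:
            return False
        self.values[idx] = value
        return True


def read_register(slot_id, reg, slots, main_rf, stats):
    """Serve a read from the slot's own SRB, another SRB, or the main file."""
    hit, value = slots[slot_id].read(reg)
    if hit:
        stats["srb"] += 1
        return value
    for other_id, other in enumerate(slots):
        if other_id != slot_id:
            hit, value = other.read(reg)
            if hit:
                stats["other_srb"] += 1
                break
    else:
        value = main_rf[reg]
        stats["main"] += 1
    slots[slot_id].fill(reg, value)
    return value


slots = [SlotBuffer() for _ in range(4)]
main_rf = {n: n * 10 for n in range(16)}
stats = {"srb": 0, "other_srb": 0, "main": 0}
read_register(0, 5, slots, main_rf, stats)   # first touch: main register file
read_register(0, 5, slots, main_rf, stats)   # now served by slot 0's SRB
read_register(1, 5, slots, main_rf, stats)   # served by slot 0's SRB copy
print(stats)   # {'srb': 1, 'other_srb': 1, 'main': 1}
```

The counters illustrate the warm-up effect: only the first access to a register reaches the main register file, and later accesses are absorbed by the slot buffers.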

[0104] The RBMU 830 broadcasts any update to a specific register to all SRBs 800 that hold a copy of that register. This broadcast scheme works because only one functional unit is allowed to write to each destination register within a bundle. A "modified" bit is set in the corresponding entry to ensure that the entry in the SRB 800 is written back to the main register file if it is later evicted. Such evictions may arise from capacity limits when new values are added to an SRB; to make the process efficient, usage may be tracked with access counters, and infrequently used entries may be evicted. If no SRB holds a matching entry for the register being written, the value is written to the main register file.
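The write path can be sketched in Python along the same lines. Each SRB is modelled as a dictionary from register number to entry; the names are hypothetical, and setting the modified bit in every updated copy is one possible reading of the description rather than a confirmed detail.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    value: int
    modified: bool = False   # must be written back on eviction
    uses: int = 0            # access counter for usage tracking


def write_register(reg, value, srbs, main_rf):
    """Broadcast an update to every SRB holding `reg`; if none does,
    write the value straight to the main register file."""
    holders = [srb for srb in srbs if reg in srb]
    for srb in holders:
        srb[reg].value = value
        srb[reg].modified = True
    if not holders:
        main_rf[reg] = value


def evict_least_used(srb, main_rf):
    """Evict the entry with the lowest access count, writing it back to
    the main register file if its modified bit is set."""
    victim = min(srb, key=lambda r: srb[r].uses)
    entry = srb.pop(victim)
    if entry.modified:
        main_rf[victim] = entry.value
    return victim


main_rf = {1: 0, 2: 0}
srbs = [{1: Entry(0)}, {1: Entry(0, uses=1), 2: Entry(0, uses=5)}]
write_register(1, 7, srbs, main_rf)       # both copies updated, main file stale
write_register(3, 9, srbs, main_rf)       # no copy anywhere: main file written
print(evict_least_used(srbs[1], main_rf)) # 1 (least used entry)
print(main_rf)                            # {1: 7, 2: 0, 3: 9}
```

Because every update reaches all copies, any surviving copy is current, so writing back whichever modified copy is evicted keeps the main register file consistent in this model.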

[0105] The multilevel register storage structure described above might be unacceptable in a conventional processor because of the additional delay it introduces when a misspeculation or an exception is encountered: the architectural register state can only be made visible by potentially draining all of the SRBs, which can be an expensive and time-consuming operation. However, since the primary objective of the bundled triggered instruction architecture is flexible multi-domain spatial acceleration, the proposed structure is not only acceptable but also advantageous in this use case. This use case places few of the demands associated with general-purpose conventional processors, such as aggressive speculation, precise exception handling, and fast exception entry.

[0106] In Figure 8, a hardware structure called the Input Channel Buffer Management Unit (ICBMU) 850 operates similarly to the RBMU, transferring incoming data from the channels to small storage areas called Slot Channel Buffers (SCB0...3) 860. However, because there are far fewer input channels than registers in the main register file, each SCB can be direct-mapped, unlike the fully associative SRBs. In addition, since instructions that read up to three channels simultaneously are rare, each SCB may require fewer ports than an SRB. Furthermore, the input channels are "read-only," so the ICBMU does not need as much coherence functionality as the RBMU. Its primary purpose is to listen for "dequeue" and "channel read" events: whenever a dequeue operation occurs at the end of a bundle, the new data at the head of the corresponding channel is propagated to all slots that have recently read from that channel. Since the SCBs are direct-mapped structures, eviction is not necessary.
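A rough Python model of the direct-mapped channel buffers is given below. The class name and methods are hypothetical, `None` is used as the "not cached" marker (so channel data is assumed never to be `None`), and "recently read" is approximated as "currently holds a cached copy".

```python
from collections import deque


class ChannelBuffers:
    """Toy ICBMU: one direct-mapped SCB per slot, indexed by channel number."""

    def __init__(self, n_channels=4, n_slots=4):
        self.channels = [deque() for _ in range(n_channels)]
        # scb[slot][channel] holds a copy of the channel head, or None.
        self.scb = [[None] * n_channels for _ in range(n_slots)]

    def read(self, slot, ch):
        """'Channel read' event: copy the channel head into the slot's SCB.
        The channel is assumed non-empty, as the trigger would require."""
        if self.scb[slot][ch] is None:
            self.scb[slot][ch] = self.channels[ch][0]
        return self.scb[slot][ch]

    def dequeue(self, ch):
        """'Dequeue' event at the end of a bundle: drop the head and push
        the new head to every slot holding a copy for this channel."""
        self.channels[ch].popleft()
        new_head = self.channels[ch][0] if self.channels[ch] else None
        for slot_scb in self.scb:
            if slot_scb[ch] is not None:
                slot_scb[ch] = new_head


cb = ChannelBuffers()
cb.channels[3].extend([10, 20])
print(cb.read(0, 3), cb.read(2, 3))   # 10 10
cb.dequeue(3)
print(cb.scb[0][3], cb.scb[1][3])     # 20 None
```

Since each channel maps to exactly one SCB entry, there is no replacement decision to make, which is why this model needs no eviction logic.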

[0107] Since there are no critical events in the backend that require synchronization between data in transit and data in the output channel, information from the execution unit to the output channel does not need to traverse any special management units. In addition, the two assumptions that only one writer per channel is allowed within the bundle, and that data is never read back from these channels by the functional units of the triggered processing element, can further simplify the construction of the output paths required to support bundling.

[0108] Therefore, Figure 8 shows a set of processor registers 810 common to the execution of program instructions by any of the execution paths, a buffer circuit 800 for each execution path that stores a copy of the data held by one or more of the processor registers, and control circuits 830, 840 that control the copying of data between the set of processor registers and the buffer circuit. As described above, in some examples, the result of execution by any execution path can be written back directly to the buffer circuit 800 associated with that execution path.

Processing array

[0109] The configuration in Figure 1, when implemented using a circuit according to any of the exemplary embodiments of the present invention, provides an example of a processing array comprising an array of such circuits and data communication circuits 140, 150 for communicating data between the circuits of the array.

Compiler example

[0110] Next, exemplary configurations illustrating embodiments of the compiler will be described with reference to Figures 9 and 10. Here, Figure 9 schematically shows a data processing device 900 that can be used to perform compiler operations that generate the bundled program code described above, for example, according to the method described below with reference to Figure 12. Note that any of the aforementioned circuits may be used to perform compiler operations, but in at least some examples, a general-purpose data processing device such as device 900 may be used.

[0111] The device 900 comprises one or more processing elements or central processing units (CPU) 910, a random access memory 920, a non-volatile storage unit 930 such as flash memory, an optical or magnetic disk, or read-only memory (forming an example of a machine-readable non-transitory storage medium capable of providing or storing computer software that performs the compiler operations described herein), a graphical user interface 940 such as one or more of a display, keyboard, or pointing control, and one or more other interfaces 950 such as a network interface, all interconnected by a bus structure 960. During operation, program instructions for the compiler or other functions are read from the non-volatile storage unit 930 and executed by the CPU 910 in cooperation with the random access memory 920.

[0112] Figure 10 schematically illustrates a compilation operation in which source code 1000 is compiled by a compilation operation 1010 using a process described, for example, with reference to Figure 12, to generate executable program instructions 1020.

Method overview

[0113] As an overview of the technique discussed above, Figure 11 is a schematic flowchart illustrating a method (which can be implemented, for example, by the circuit described above), the method comprising:
(at step 1100) executing program instructions depending on whether each trigger condition matches the current trigger state, and setting the next trigger state in response to the execution of the program instructions;
(at step 1110) providing, by an instruction storage unit, groups of two or more program instructions for execution in parallel; and
(at step 1120) controlling the instruction storage unit to provide program instructions of a given group of program instructions for execution in response to the generation of a trigger state by the execution of program instructions and a trigger condition associated with the given group of program instructions.
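The three steps can be illustrated by a toy Python interpreter for triggered bundles. The dictionary encoding, the "x" (don't care) character in conditions, and the function names are assumptions made for illustration; the sketch models only the trigger mechanism and does not execute the operations themselves.

```python
def matches(condition, state):
    """A condition character matches if it is 'x' (don't care, an
    assumption here) or equals the corresponding state bit."""
    return all(c in ("x", s) for c, s in zip(condition, state))


def apply_set(mask, state):
    """'z' leaves a state bit unchanged; '0' or '1' forces it."""
    return "".join(s if m == "z" else m for m, s in zip(mask, state))


def run(bundles, state, max_steps=10):
    """Repeatedly pick a bundle whose trigger condition matches the current
    state, issue all of its instructions together, and set the next state."""
    trace = []
    for _ in range(max_steps):
        ready = next((b for b in bundles if matches(b["when"], state)), None)
        if ready is None:
            break                          # nothing is triggered
        trace.append(ready["ops"])         # whole group issues in parallel
        state = apply_set(ready["set"], state)
    return state, trace


# Hypothetical two-bundle program over a two-bit trigger state.
bundles = [
    {"when": "00", "ops": ["mul", "add"], "set": "01"},
    {"when": "01", "ops": ["sub"], "set": "1z"},
]
print(run(bundles, "00"))   # ('11', [['mul', 'add'], ['sub']])
```

The state produced by one bundle is what triggers the next, so control flow in this sketch emerges from the trigger conditions rather than from a program counter.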

[0114] As a further summary of the techniques discussed above, Figure 12 is a schematic flowchart showing a computer-implemented method, the method comprising:
(at step 1200) generating program instructions for execution depending on respective trigger conditions, the execution of a program instruction setting the next trigger state;
(at step 1210) dividing the program instructions into groups of program instructions, at least some of the groups containing two or more program instructions, wherein a program instruction within a given group does not depend on the result of another program instruction within that group; and
(at step 1220) generating, for each group, an input trigger condition and an output trigger state, the input trigger condition being a condition that, if met, enables the execution of the program instructions in that group, and the output trigger state being a state to be generated in response to the completion of the execution of all program instructions in that group.
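One simple way to realize the grouping and trigger generation steps is a greedy in-order pass, sketched below in Python. This is an assumed strategy rather than the compiler's actual algorithm; the instruction encoding, register names, and the sequential numbering of trigger states are all hypothetical.

```python
def group_instructions(instrs, max_width=4):
    """Greedy in-order bundling. An instruction joins the current group
    unless the group is full, or it reads or overwrites a register written
    by an instruction already in the group."""
    groups, current, written = [], [], set()
    for opcode, dest, sources in instrs:
        depends = dest in written or any(s in written for s in sources)
        if current and (depends or len(current) == max_width):
            groups.append(current)
            current, written = [], set()
        current.append((opcode, dest, sources))
        written.add(dest)
    if current:
        groups.append(current)
    return groups


def assign_triggers(groups):
    """Give group i the input condition 'state == i' and the output state
    i + 1 (wrapping to 0), a deliberately simple sequential encoding."""
    n = len(groups)
    return [{"when": i, "set": (i + 1) % n, "ops": g}
            for i, g in enumerate(groups)]


instrs = [("mul", "r1", ["r2", "r3"]),
          ("add", "r4", ["r5", "r6"]),
          ("sub", "r7", ["r1", "r4"]),    # needs r1 and r4: starts a new group
          ("add", "r8", ["r2", "r2"])]
groups = group_instructions(instrs)
print([len(g) for g in groups])                               # [2, 2]
print([(t["when"], t["set"]) for t in assign_triggers(groups)])  # [(0, 1), (1, 0)]
```

A production compiler would likely reorder instructions to form larger groups; the in-order pass is kept here only because it makes the independence rule easy to see.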

[0115] In an exemplary configuration, each of steps 1200, 1210, and 1220 may be implemented by a computer, such as the device in Figure 9, which operates under the control of computer software (which may be stored in the non-volatile memory unit 930).

Simulation embodiment

[0116] Figure 13 illustrates a simulator implementation that may be used. While the above embodiments implement the present invention in terms of devices and methods for operating specific processing hardware that supports the technique, it is also possible to provide an instruction execution environment according to the embodiments described herein that is implemented using a computer program. Such a computer program is often referred to as a simulator, insofar as it provides a software-based implementation of a hardware architecture. Varieties of simulator computer programs include emulators, virtual machines, models, and binary translators, including dynamic binary translators. Typically, a simulator implementation runs on a host processor 1330, optionally running a host operating system 1320, that supports the simulator program 1310. In some configurations, there may be multiple layers of simulation between the hardware and the provided instruction execution environment, and / or multiple different instruction execution environments provided on the same host processor. Historically, powerful processors have been required to provide simulator implementations that run at reasonable speeds, but such an approach may be justified in certain circumstances, such as when it is desirable to run code native to another processor for compatibility or reuse reasons. For example, a simulator implementation may provide an instruction execution environment with additional functionality not supported by the host processor hardware, or it may provide an instruction execution environment typically associated with a different hardware architecture. An overview of simulation is provided in reference [2] below.

[0117] While embodiments have been described with reference to specific hardware components or characteristics, in simulated embodiments, equivalent functionality can be provided by appropriate software components or characteristics. For example, a particular circuit may be implemented as computer program logic in a simulated embodiment. Similarly, memory hardware such as registers or caches may be implemented as software data structures in a simulated embodiment. In configurations where one or more of the hardware elements referenced in the above embodiments reside on host hardware (e.g., host processor 1330), some simulated embodiments may utilize the host hardware, if preferred.

[0118] The simulator program 1310 may be stored on a computer-readable storage medium (which may be a non-transitory medium) and can provide the target code 1300 (which may include applications, operating systems, and hypervisors) with a program interface (instruction execution environment) that is the same as the interface of the hardware architecture modeled by the simulator program 1310. Thus, program instructions of the target code 1300 may be executed from within the instruction execution environment using the simulator program 1310, which provides: processing program logic configured to execute program instructions depending on whether each trigger condition matches the current trigger state and to set the next trigger state in response to the execution of the program instructions; instruction storage program logic configured to selectively provide groups of two or more program instructions for execution in parallel; and trigger program logic that controls the instruction storage program logic to provide the above-mentioned program instructions for execution in response to the generation of trigger states by the execution of program instructions and trigger conditions associated with a given group of program instructions. A host computer 1330 that does not actually possess the hardware functions of the device discussed above can thereby emulate those functions.

[0119] Accordingly, an exemplary embodiment can provide a virtual machine computer program comprising instructions for controlling a host data processing device to provide an instruction execution environment comprising: processing program logic configured to execute program instructions depending on whether each trigger condition matches the current trigger state and to set the next trigger state in response to the execution of the program instructions; instruction storage program logic configured to selectively provide groups of two or more program instructions for execution in parallel; and trigger program logic that controls the instruction storage program logic to provide program instructions for execution in response to the generation of trigger states by the execution of program instructions and trigger conditions associated with a given group of program instructions.

[0120] In this application, the term "configured to..." is used to mean that an element of the device has a configuration capable of performing a defined operation. In this context, "configuration" means a hardware or software arrangement or method of interconnection. For example, the device may have dedicated hardware to provide a defined operation, or a processor or other processing device may be programmed to perform a function. "Configured to..." does not mean that any modifications must be made to the device element in order to provide the defined operation.

[0121] While exemplary embodiments of the present technique have been described in detail herein with reference to the accompanying drawings, it should be understood that the present technique is not limited to those exact embodiments, and that various changes, additions, and modifications can be made by those skilled in the art without departing from the scope and spirit of the technique as defined by the accompanying claims. For example, various combinations of the features of the dependent claims can be made with the features of the independent claims without departing from the scope of the present technique.

[0122] References:
[1] A. Parashar et al., "Efficient Spatial Processing Element Control via Triggered Instructions," IEEE Micro, vol. 34, no. 3, pp. 120-137, May-June 2014, doi:10.1109/MM.2014.14.
[2] R. Bedichek, "Some Efficient Architecture Simulation Techniques," Winter 1990 USENIX Conference, pp. 53-63.

Claims

1. A circuit comprising:
a processing circuit configured to execute program instructions depending on whether each trigger condition matches the current trigger state, and to set the next trigger state in response to the execution of the program instructions,
the processing circuit comprising:
an instruction storage unit configured to selectively provide groups of two or more program instructions for execution in parallel; and
a trigger circuit that controls the instruction storage unit to provide program instructions of a given group of program instructions for execution in response to the generation of a trigger state by the execution of program instructions and a trigger condition associated with the given group of program instructions,
wherein the instruction storage unit comprises at least two instruction queues, each configured to provide groups of program instructions for execution, the instruction queues including a first instruction queue configured to provide groups of up to n program instructions for execution in parallel, and a second instruction queue configured to provide groups of up to m program instructions for execution in parallel, where m is not equal to n.

2. The circuit according to claim 1, wherein the trigger circuit controls the given instruction queue to provide a program instruction queued for execution in response to the generation of a trigger state by the execution of a program instruction and a trigger condition associated with a given instruction queue among the at least two instruction queues.

3. The circuit according to claim 1 or 2, further comprising a routing circuit configured to route a group of program instructions to a selected one of the instruction queues.

4. The circuit according to any one of claims 1 to 3, comprising a plurality of execution paths for executing in parallel a number of program instructions that is greater than or equal to the maximum number of program instructions provided in parallel by any of the instruction queues.

5. The circuit according to claim 4, comprising a set of processor registers common to the execution of program instructions within a group of program instructions, wherein the set of processor registers is accessible during the execution of the program instructions.

6. The circuit according to claim 4, comprising: at least two sets of processor registers, one for each of the execution paths, the set of processor registers for a given execution path being accessible during execution of program instructions by that execution path; and a communication circuit for communicating data between the sets of processor registers.

7. The circuit according to claim 4, comprising: a set of processor registers common to the execution of program instructions by any of the execution paths; a buffer circuit for each execution path, configured to store a copy of the data held by one or more of the processor registers; and a control circuit configured to control the copying of data between the set of processor registers and the buffer circuits.

8. The circuit according to any one of claims 4 to 7, wherein at least one of the execution paths is configured to execute a program instruction having at most a first number of operands, and at least another execution path is configured to execute a program instruction having at most a second number of operands, the second number being different from the first number.

9. The circuit according to any one of claims 1 to 8, wherein the processing circuit comprises a vector processing circuit configured to execute two or more vector processing instructions in parallel, each vector processing instruction applying a respective processing operation to a respective vector of two or more data elements.

10. A processing array comprising: an array of circuits according to any one of claims 1 to 9; and a data communication circuit for communicating data between the circuits of the array.

11. A method comprising: executing program instructions in dependence upon a respective trigger condition matching a current trigger state, and setting a next trigger state in response to execution of the program instructions; providing, by an instruction storage unit, groups of two or more program instructions for execution in parallel, the instruction storage unit comprising at least two instruction queues including a first instruction queue for providing groups of up to n program instructions for parallel execution and a second instruction queue for providing groups of up to m program instructions for parallel execution, where m is not equal to n; and controlling the instruction storage unit to provide the program instructions of a given group of program instructions for execution in response to a trigger state generated by execution of program instructions and a trigger condition associated with the given group of program instructions.

12. A computer-implemented method comprising: generating program instructions for execution in dependence upon respective trigger conditions, execution of the program instructions setting a next trigger state; dividing the program instructions into groups of program instructions, at least some of the groups each containing two or more program instructions, wherein a program instruction in a given group does not depend on the result of another program instruction in that group; generating an input trigger condition and an output trigger state for each group, wherein the input trigger condition, when met, enables execution of the program instructions of that group, and the output trigger state is a state to be generated in response to completion of execution of all the program instructions of that group; and routing a given group of program instructions to an instruction storage unit having at least two instruction queues, each configured to provide groups of program instructions for execution, the instruction queues comprising a first instruction queue configured to provide groups of up to n program instructions for parallel execution and a second instruction queue configured to provide groups of up to m program instructions for parallel execution, where m is not equal to n.

13. The method according to claim 12, comprising, for one or more input channels that provide input data for execution of a group, generating one or more operations to dequeue the input data from those input channels after execution of the program instructions of the group.

14. The method according to claim 12 or 13, comprising routing a group of program instructions to individual processors of an array of processors.

15. The method according to claim 12 or 13, comprising routing a group of program instructions to an instruction queue selected from a plurality of instruction queues, wherein the instruction queue is configured to provide the group of program instructions in parallel for execution.

16. A compiler comprising computer program code, which, when executed by a computer, causes the computer to perform the method according to any one of claims 12 to 15.

17. A non-transitory machine-readable storage medium storing the compiler according to claim 16.
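
The arrangement recited in claims 1 to 3 and 11 can be illustrated with a small software model. The following Python sketch is not part of the patent and is not the claimed circuit. It assumes that a trigger state can be modelled as a plain string label, that each instruction is a function returning a (destination, value) pair, and that routing picks the narrowest queue able to hold a group. The names `Group`, `InstructionQueue`, `route`, and `run` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Regs = Dict[str, int]
Op = Callable[[Regs], Tuple[str, int]]  # reads registers, returns (destination, value)

@dataclass
class Group:
    trigger: str      # input trigger condition (assumption: a plain string label)
    next_state: str   # output trigger state, set once every op in the group has run
    ops: List[Op]

@dataclass
class InstructionQueue:
    width: int                                   # max instructions per group (n or m)
    groups: List[Group] = field(default_factory=list)

    def match(self, state: str) -> Optional[Group]:
        """Return the first queued group whose trigger condition matches the state."""
        for g in self.groups:
            if g.trigger == state:
                return g
        return None

def route(group: Group, queues: List[InstructionQueue]) -> InstructionQueue:
    """Hypothetical routing policy: use the narrowest queue that can hold the group."""
    fits = [q for q in queues if len(group.ops) <= q.width]
    if not fits:
        raise ValueError("group is wider than every queue")
    target = min(fits, key=lambda q: q.width)
    target.groups.append(group)
    return target

def run(queues: List[InstructionQueue], regs: Regs, state: str, max_steps: int = 16):
    """Issue the group matching the current trigger state until no group matches."""
    trace = []
    for _ in range(max_steps):
        group = None
        for q in queues:
            group = q.match(state)
            if group is not None:
                break
        if group is None:
            break
        # Model parallel issue: every op in the group reads the same register snapshot.
        snapshot = dict(regs)
        results = [op(snapshot) for op in group.ops]
        regs.update(results)
        trace.append(state)
        state = group.next_state  # the next trigger state is set after the whole group
    return state, trace

# Two queues of different widths (n = 2, m = 1), as in claim 1.
wide, narrow = InstructionQueue(width=2), InstructionQueue(width=1)
queues = [wide, narrow]
route(Group("start", "combine", [lambda r: ("s", r["a"] + r["b"]),
                                 lambda r: ("d", r["a"] - r["b"])]), queues)
route(Group("combine", "done", [lambda r: ("p", r["s"] * r["d"])]), queues)

regs = {"a": 3, "b": 4}
final_state, trace = run(queues, regs, "start")
print(final_state, trace, regs["p"])  # prints: done ['start', 'combine'] -7
```

In this sketch the two-instruction group lands in the wider queue and the single-instruction group in the narrower one. Reading all operands from one register snapshot stands in for the requirement that instructions within a group do not depend on each other's results.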