Method and apparatus for compiling a program
By scheduling and encoding instruction sequences and utilizing root slot attributes to reduce sub-slot involvement, a binary file suitable for VLI-designed processors is generated. This solves the compilation complexity and memory consumption problems caused by the highly flexible VLI design and achieves an efficient compilation method.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HUAWEI TECH CO LTD
- Filing Date
- 2022-10-31
- Publication Date
- 2026-06-19
[Figure: CN117950670B_ABST (abstract drawing)]
Description
Technical Field
[0001] This application relates to the field of computer technology, and in particular to a method and apparatus for compiling a program.
Background Technology
[0002] With the development of artificial intelligence (AI) technology, the demands on processor computing power are increasing. To meet the power consumption requirements of high-performance processors, current processors typically employ a Domain Specific Architecture (DSA). In a DSA, processors often use a Very Long Instruction Word (VLIW) design to improve instruction processing parallelism.
[0003] In VLIW design, various Variable Length Instruction (VLI) designs were proposed to reduce the memory footprint of the compiled program. VLI design reduces the code length of instructions through time compression or space compression, thereby reducing the memory footprint of the compiled program.
[0004] However, a highly flexible VLI design affects compilation processes such as instruction selection, instruction scheduling, instruction packing, and instruction printing. To reduce the memory footprint of the compiled program without compromising compilation efficiency, a compilation method that supports highly flexible VLI designs is urgently needed.
Summary of the Invention
[0005] This application provides a program compilation method and apparatus that solves the complex compilation problems caused by the highly flexible VLI design and parallel instruction issuance, and can fully support the flexibility of VLI design coding.
[0006] In a first aspect, this application provides a method for compiling instructions, the method comprising: scheduling an instruction sequence to obtain a plurality of Very Long Instruction Words (VLIWs), each VLIW including at least one instruction from the instruction sequence, the at least one instruction being located in at least one target root slot, the target root slot indicating the resources required to execute the placed instruction; determining an optional VLIW mode for the VLIW from a plurality of VLIW modes based on the target root slot where each instruction in the VLIW is located, the optional VLIW mode including a target optional sub-slot for each instruction in the VLIW, the target optional sub-slot being a sub-slot of the target root slot where the instruction is located, the types of instructions supported by the target root slot including the types of instructions supported by the sub-slots of the target root slot; and determining an encoded instruction for each instruction in the VLIW based on the shortest optional VLIW mode among the optional VLIW modes of the VLIW, thereby obtaining the VLIW encoding corresponding to the VLIW.
[0007] The plurality of VLIW modes are all validly encoded VLIW modes that can be recognized by the processor's decoding circuitry. A VLIW mode may include at least one slot, which can be a root slot or a sub-slot.
[0008] The advantages are as follows. The method supports compiling programs for VLI designs in which an instruction can be placed in different slots (including root slots and sub-slots), and generates binary or assembly files targeting a VLI-designed processor. It solves the complex compilation problems caused by the high flexibility of VLI designs and parallel instruction issuance, and fully supports the flexibility of VLI design coding. Because the encoded instruction for each instruction in the VLIW is determined based on the shortest optional VLIW mode, program memory usage is effectively reduced compared with existing technologies, without affecting the flexibility of the VLI design, compilation efficiency, or code execution efficiency.
[0009] In one possible implementation, each instruction corresponds to at least one piece of scheduling information, which indicates the resources available for each stage of the instruction. Scheduling the instruction sequence to obtain multiple VLIWs includes: determining the bypass equivalence of each instruction in the instruction sequence based on a dependency graph of the instruction sequence; updating the weights of the edges between bypass-equivalent instructions and their successor instructions in the dependency graph; determining the height offset records of bypass-inequivalent instructions, where a height offset record includes the height offset of the bypass-inequivalent instruction when executed on the resources indicated by the corresponding scheduling information; and performing instruction scheduling for the current cycle based on the dependency graph and the height offset records of the bypass-inequivalent instructions.
[0010] For example, the scheduling information of an instruction may include: an optional root slot for the instruction, at least one optional cycle for each stage of the instruction and optional functional units in each optional cycle, the effective machine cycle corresponding to at least one operand of the instruction (including input operands and / or output operands), and the bypass corresponding to each operand of the instruction. The optional root slot for the instruction is determined by the scheduling information corresponding to the instruction, and the optional root slots included in all the scheduling information corresponding to the instruction constitute all the optional root slots of the instruction.
[0011] The beneficial effect is as follows. Based on the attribute that sub-slots are contained within the root slot, only the root VLIW mode is used for instruction scheduling and instruction packing, and the scheduling result is also represented only through the root VLIW mode. Sub-slots therefore do not need to participate in instruction scheduling or instruction packing, which reduces the scheduling scale and improves scheduling efficiency.
[0012] In one possible implementation, the process of determining the bypass equivalence of each instruction includes: if the instruction has a data dependency with subsequent instructions, and the weight of the edge between the instruction and each subsequent instruction remains unchanged when the instruction is executed on the resource indicated by the corresponding scheduling information, then the bypass of the instruction is determined to be equivalent; if the instruction has a data dependency with subsequent instructions, and the weight of the edge between the instruction and any subsequent instruction changes when the instruction is executed on the resource indicated by the corresponding scheduling information, then the bypass of the instruction is determined to be inequivalent. For bypass-equivalent instructions, their timing in the DAG remains unchanged during scheduling, and their depth offset and height offset when the instruction is executed on the resource indicated by the corresponding scheduling information are both defaulted to 0.
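As an illustrative sketch of the bypass-equivalence test described above (not the applicant's implementation), the following Python fragment assumes a hypothetical data layout in which, for each successor instruction, the edge weight under every scheduling-information option is already known. The names `is_bypass_equivalent` and `default_offsets` are invented for illustration. An instruction with no successors is treated here as equivalent, which is an assumption the text does not address.

```python
from typing import Dict, List


def is_bypass_equivalent(edge_weights: Dict[str, List[int]]) -> bool:
    """Return True when every successor edge keeps a single weight.

    edge_weights maps a successor instruction to the edge weight observed
    under each scheduling-information option (hypothetical layout).
    """
    return all(len(set(weights)) == 1 for weights in edge_weights.values())


def default_offsets(num_options: int) -> List[int]:
    """Bypass-equivalent instructions default every offset to 0."""
    return [0] * num_options


same = {"mul": [2, 2], "add": [1, 1]}      # weights never change
differs = {"mul": [2, 1], "add": [1, 1]}   # edge to "mul" depends on the resource
print(is_bypass_equivalent(same), is_bypass_equivalent(differs))
```

In this sketch, only the instruction described by `differs` would need a height offset record; the other keeps depth and height offsets of 0.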
[0013] Its beneficial effect is that it incorporates the choice of bypass for each instruction into the scheduling process. This allows for instruction scheduling even when the same instruction has different encodings in different slots and different VLIW modes, and when instruction bypass is optional, laying the foundation for subsequent program compilation supporting VLI designs. Furthermore, this scheduling process is based on the attribute that sub-slots are contained within the root slot, utilizing only the root slot for instruction scheduling, and the scheduling result is represented only through the root slot. This eliminates the need for sub-slots to participate in instruction scheduling, thereby reducing the scheduling scale and improving scheduling efficiency.
[0014] In one possible implementation, performing instruction scheduling for the current cycle based on the dependency graph and the height offset records of the bypass-inequivalent instructions includes: determining the ready instruction set for the current cycle based on the dependency graph, the depth offset record of each instruction in the instruction sequence, and the height offset records of the bypass-inequivalent instructions, wherein the ready instructions in the ready instruction set are arranged in priority order, and the depth offset record includes the depth offset of the instruction when it is executed on the resource indicated by the corresponding scheduling information; and scheduling each ready instruction in the ready instruction set sequentially.
[0015] In one possible implementation, determining the ready instruction set for the current cycle based on the dependency graph, the depth offset record of each instruction in the instruction sequence, and the height offset records of the bypass-inequivalent instructions includes: determining at least one ready instruction for the current cycle using the depth of unscheduled instructions in the dependency graph and the depth offset record; and prioritizing the at least one ready instruction based on the priority parameter of each ready instruction to obtain the ready instruction set, wherein the priority parameter of a ready instruction includes at least one of the following: the height of the ready instruction in the dependency graph, the number of successor instructions of the ready instruction in the dependency graph, and the height offset record of the ready instruction.
[0016] For example, when determining at least one ready instruction in the current cycle, the depth and depth offset of an unscheduled instruction in the dependency graph can be used to determine whether the unscheduled instruction meets the data conflict requirements of the current cycle. For bypass-equivalent instructions, the depth offset of their successor instructions is always 0. For bypass-inequivalent instructions, the depth offset of their successor instructions is calculated dynamically during the scheduling process. The depth offset record of an unscheduled instruction can be determined by detecting whether there is a bypass between the functional unit of each operand in the resource indicated by the scheduling information corresponding to the unscheduled instruction and the functional unit of the corresponding operand in the resource occupied by the predecessor instruction. For unscheduled instructions without a predecessor instruction, the depth offset record is 0 by default, and the depth in the dependency graph is also 0.
[0017] The following describes how the at least one ready instruction is prioritized, taking as an example the case in which the priority parameters of a ready instruction include: the height of the ready instruction in the dependency graph, the number of successor instructions of the ready instruction in the dependency graph, and the height offset record of the ready instruction.
[0018] In one possible implementation, the step of prioritizing the at least one ready instruction to obtain the ready instruction set based on the priority parameter of each ready instruction includes: prioritizing the at least one ready instruction based on the height of each ready instruction; when there are multiple first ready instructions with the same height among the at least one ready instructions, prioritizing the multiple first ready instructions based on the number of successor instructions for each first ready instruction; when there are multiple second ready instructions with the same number of successor instructions among the multiple first ready instructions, prioritizing the multiple second ready instructions based on the height offset record of each second ready instruction to obtain the ready instruction set.
[0019] The beneficial effect is as follows. The greater an instruction's height in the dependency graph, or the more successor instructions it has, the more likely the instruction is on the critical path. Scheduling such an instruction first allows more instructions to be scheduled within a single cycle. Therefore, to schedule more instructions in the current cycle, a ready instruction is given higher priority the larger these two parameters are.
[0020] In one possible implementation, prioritizing the plurality of second ready instructions based on the height offset record of each second ready instruction includes: determining second ready instructions with positive height offsets in the height offset record as high priority; determining second ready instructions with all height offsets of 0 in the height offset record as secondary priority; and determining second ready instructions with negative height offsets in the height offset record as low priority.
[0021] The lower the height of an instruction, the lower the height of its subgraph, and the lower the program's memory usage. To minimize the height of the subgraph containing the second ready instruction, scheduling can be done sequentially based on the height offset. If a second ready instruction has a positive height offset, it will have a higher height in subsequent scheduling cycles, so its priority can be set to high. If a second ready instruction has a negative height offset, its height may not increase even in subsequent scheduling cycles, so its priority can be set to low.
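The three-level ordering described above (height, then number of successors, then height offset record) can be sketched as a single sort key. This is a minimal illustration under assumptions: the `Ready` record and its field names are hypothetical, and a record containing both positive and negative offsets is classified as high priority here, a case the text does not specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Ready:
    name: str
    height: int                 # height in the dependency graph
    num_successors: int         # successor count in the dependency graph
    height_offsets: List[int]   # one offset per scheduling-information option


def offset_class(offsets: List[int]) -> int:
    """0 = high (a positive offset exists), 1 = secondary (all zero), 2 = low."""
    if any(o > 0 for o in offsets):
        return 0
    if all(o == 0 for o in offsets):
        return 1
    return 2


def prioritize(ready: List[Ready]) -> List[Ready]:
    # Larger height first, then more successors, then the offset class.
    return sorted(
        ready,
        key=lambda r: (-r.height, -r.num_successors, offset_class(r.height_offsets)),
    )


candidates = [
    Ready("a", 3, 1, [0]),
    Ready("b", 5, 1, [0]),
    Ready("c", 3, 2, [0]),
    Ready("d", 3, 1, [1]),
    Ready("e", 3, 1, [-1]),
]
print([r.name for r in prioritize(candidates)])
```

Because Python's `sorted` is stable, instructions that tie on all three keys keep their original relative order.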
[0022] In one possible implementation, a ready instruction corresponds to multiple pieces of scheduling information, and the step of sequentially scheduling each ready instruction in the ready instruction set includes: when the bypass of the ready instruction is equivalent, performing scheduling attempts on the resources indicated by the multiple pieces of scheduling information respectively; when the bypass of the ready instruction is not equivalent, determining the priority of the multiple pieces of scheduling information based on the height offset record of the ready instruction, and performing scheduling attempts sequentially according to that priority.
[0023] The beneficial effect is as follows. The smaller the height offset of the instruction when it is executed on the resource indicated by the corresponding scheduling information, the smaller the interval between the instruction and its successor instruction, the more compact the program, and usually the smaller the program's memory usage. Therefore, to minimize the program's memory usage, scheduling information with a smaller height offset is given higher priority.
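A minimal sketch of this ordering rule follows. It assumes each piece of scheduling information is a dictionary with hypothetical `slot` and `height_offset` keys; the function name `order_scheduling_infos` is likewise invented for illustration.

```python
from typing import Dict, List


def order_scheduling_infos(infos: List[Dict], bypass_equivalent: bool) -> List[Dict]:
    """Return the order in which scheduling attempts are made."""
    if bypass_equivalent:
        # All options are interchangeable, so try them in the given order.
        return list(infos)
    # Bypass-inequivalent: a smaller height offset means a higher priority.
    return sorted(infos, key=lambda info: info["height_offset"])


infos = [
    {"slot": "R0", "height_offset": 1},
    {"slot": "R1", "height_offset": -1},
    {"slot": "R2", "height_offset": 0},
]
print([i["slot"] for i in order_scheduling_infos(infos, bypass_equivalent=False)])
```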
[0024] In one possible implementation, the method further includes: after the ready instruction is scheduled, updating the weight of the edge between the ready instruction and its predecessor instruction and the depth of the ready instruction in the dependency graph.
[0025] In one possible implementation, the method further includes: determining whether each predecessor instruction of the target successor instruction of the ready instruction has been scheduled; when each predecessor instruction of the target successor instruction has been scheduled, determining the depth and depth offset record of the target successor instruction in the dependency graph; determining the current depth of the target successor instruction based on the depth and depth offset record of the target successor instruction in the dependency graph; and determining whether to add the target successor instruction to the ready instruction set based on the current depth of the target successor instruction.
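The successor check above can be sketched as follows. The text does not state how the current depth is derived or compared, so this sketch assumes (hypothetically) that the current depth is the graph depth plus a single depth offset, and that the successor becomes ready once its current depth does not exceed the current cycle. All names are illustrative.

```python
from typing import Dict, List, Set


def try_enqueue(
    succ: str,
    preds: Dict[str, List[str]],
    scheduled: Set[str],
    depth: Dict[str, int],
    depth_offset: Dict[str, int],
    cycle: int,
    ready: List[str],
) -> bool:
    """Add succ to the ready list if all predecessors are scheduled and it is due."""
    if any(p not in scheduled for p in preds[succ]):
        return False                      # a predecessor is still unscheduled
    current_depth = depth[succ] + depth_offset.get(succ, 0)  # assumed formula
    if current_depth > cycle:
        return False                      # not yet due in this cycle
    ready.append(succ)
    return True


preds = {"st": ["add", "mul"]}
ready: List[str] = []
print(try_enqueue("st", preds, {"add", "mul"}, {"st": 2}, {"st": 1}, 3, ready), ready)
```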
[0026] In one possible implementation, the method further includes: converting each instruction in the intermediate code sequence into a composite coded pseudo-instruction to obtain the instruction sequence.
[0027] The beneficial effect is as follows. A composite coded pseudo-instruction represents instructions that have the same function but different resource consumption and encoding (different slots or functional units occupied); it is not an actual instruction code. By using composite coded pseudo-instructions, instructions with the same function but different resource consumption and encoding can be scheduled uniformly, making maximum use of instruction-level parallelism (ILP), thereby achieving better scheduling performance and effectively supporting further compilation for flexible VLI designs.
[0028] In one possible implementation, determining the optional VLIW mode from multiple VLIW modes based on the target root slot where each instruction in the VLIW is located includes: determining the target optional sub-slot for each instruction in the VLIW based on the target root slot, the optional sub-slots, and slot information of each instruction, wherein the slot information includes the correspondence between a root slot and the sub-slots included in that root slot; permuting and combining the target optional sub-slots of each instruction in the VLIW to obtain at least one initial VLIW mode; and determining the VLIW modes common to the at least one initial VLIW mode and the multiple VLIW modes as the optional VLIW modes of the VLIW.
[0029] The optional sub-slots for each instruction are pre-set. The sub-slots included in the target root slot where the instruction is located can be determined based on the slot information. Then, the intersection of the sub-slots included in the target root slot where the instruction is located and the optional sub-slots of the instruction is determined as the target optional sub-slots of the instruction.
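The intersection and combination steps can be sketched as below. This is an illustration under assumptions: a VLIW mode is modeled as a tuple of slot names, an initial mode matches a valid mode when their slot sets are equal, and two instructions may not share a sub-slot. The slot names and the function `optional_modes` are hypothetical.

```python
from itertools import product
from typing import Dict, List, Set, Tuple


def optional_modes(
    vliw: List[Tuple[str, str]],               # (instruction, target root slot)
    slot_info: Dict[str, List[str]],           # root slot -> its sub-slots
    optional_sub_slots: Dict[str, Set[str]],   # instruction -> preset sub-slots
    valid_modes: List[Tuple[str, ...]],        # modes the decoder recognizes
) -> List[Tuple[str, ...]]:
    # Target optional sub-slots: sub-slots of the root slot that the
    # instruction is also allowed to use.
    per_instr = [
        [s for s in slot_info[root] if s in optional_sub_slots[instr]]
        for instr, root in vliw
    ]
    # Initial modes: one sub-slot per instruction, no sub-slot used twice.
    initial = {
        frozenset(combo)
        for combo in product(*per_instr)
        if len(set(combo)) == len(combo)
    }
    # Optional modes: initial modes that are also valid modes.
    return [m for m in valid_modes if frozenset(m) in initial]


slot_info = {"R0": ["R0a", "R0b"], "R1": ["R1a", "R1b"]}
allowed = {"add": {"R0a", "R0b"}, "ld": {"R1b"}}
valid = [("R0a", "R1a"), ("R0b", "R1b"), ("R0a", "R1b")]
print(optional_modes([("add", "R0"), ("ld", "R1")], slot_info, allowed, valid))
```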
[0030] In one possible implementation, determining the encoded instruction for each instruction in the VLIW based on the shortest optional VLIW mode among the optional VLIW modes of the VLIW includes: determining the machine instruction corresponding to each instruction in the VLIW from a first mapping relationship based on the slot corresponding to each instruction in the shortest optional VLIW mode, wherein the first mapping relationship represents the machine instruction of the instruction in each slot; and determining the encoded instruction for each instruction in the VLIW based on the machine instruction corresponding to each instruction in the VLIW, the shortest optional VLIW mode, and a second mapping relationship, wherein the second mapping relationship represents the encoding of the machine instruction in each VLIW mode.
[0031] Its beneficial effect is that by determining the shortest VLIW encoding through the first and second mapping relationships, the framework of mainstream compilers can remain unchanged, thus improving the versatility and scalability of the scheme.
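The two-table lookup can be sketched as follows. The table contents, mode names, bit lengths, and encodings are invented purely for illustration, and `encode_vliw` is a hypothetical helper rather than part of any real compiler framework.

```python
from typing import Dict, List, Tuple


def encode_vliw(
    optional_modes: Dict[str, Dict[str, str]],   # mode -> {instruction: slot}
    mode_bits: Dict[str, int],                   # mode -> encoded length in bits
    first_map: Dict[Tuple[str, str], str],       # (instruction, slot) -> machine instruction
    second_map: Dict[Tuple[str, str], str],      # (machine instruction, mode) -> encoding
) -> Tuple[str, List[str]]:
    # Pick the shortest optional VLIW mode.
    shortest = min(optional_modes, key=lambda m: mode_bits[m])
    encoded = []
    for instr, slot in optional_modes[shortest].items():
        machine = first_map[(instr, slot)]             # first mapping relationship
        encoded.append(second_map[(machine, shortest)])  # second mapping relationship
    return shortest, encoded


modes = {
    "M64": {"add": "R0a", "ld": "R1b"},
    "M48": {"add": "R0b", "ld": "R1b"},
}
bits = {"M64": 64, "M48": 48}
first = {("add", "R0a"): "ADD_A", ("add", "R0b"): "ADD_B", ("ld", "R1b"): "LD_B"}
second = {
    ("ADD_A", "M64"): "0x01A0", ("LD_B", "M64"): "0x02B0",
    ("ADD_B", "M48"): "0x1A", ("LD_B", "M48"): "0x2B",
}
print(encode_vliw(modes, bits, first, second))
```

Keeping both relationships as plain lookup tables reflects the point made above: the selection logic stays outside the compiler's existing instruction definitions.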
[0032] In a second aspect, this application provides an instruction compilation apparatus, the apparatus comprising: a scheduling module, configured to schedule an instruction sequence to obtain a plurality of Very Long Instruction Words (VLIWs), each VLIW including at least one instruction from the instruction sequence, the at least one instruction being located in at least one target root slot, the target root slot indicating the resources required to execute the placed instruction; a first determining module, configured to determine an optional VLIW mode of the VLIW from a plurality of VLIW modes based on the target root slot where each instruction in the VLIW is located, the optional VLIW mode including a target optional sub-slot for each instruction in the VLIW, the target optional sub-slot being a sub-slot of the target root slot where the instruction is located, the types of instructions supported by the target root slot including the types of instructions supported by the sub-slots of the target root slot; and a second determining module, configured to determine the encoded instruction of each instruction in the VLIW based on the shortest optional VLIW mode among the optional VLIW modes of the VLIW, thereby obtaining the VLIW encoding corresponding to the VLIW.
[0033] In one possible implementation, each instruction corresponds to at least one scheduling information, which indicates the resources available for each stage of the instruction. The scheduling module is specifically configured to: determine the bypass equivalence of each instruction in the instruction sequence based on a dependency graph of the instruction sequence; update the weights of the edges between instructions with bypass equivalence and subsequent instructions in the dependency graph; determine the height offset records of bypass-inequivalent instructions, the height offset records including the height offset of the bypass-inequivalent instructions when executed on the resources indicated by the corresponding scheduling information; and perform instruction scheduling for the current cycle based on the dependency graph and the height offset records of the bypass-inequivalent instructions.
[0034] In one possible implementation, the scheduling module is specifically configured to: determine the ready instruction set for the current cycle based on the dependency graph, the depth offset record of each instruction in the instruction sequence, and the height offset records of the bypass-inequivalent instructions, wherein the ready instructions in the ready instruction set are arranged in priority order, and the depth offset record includes the depth offset of the instruction when it is executed on the resource indicated by the corresponding scheduling information; and schedule each ready instruction in the ready instruction set sequentially.
[0035] In one possible implementation, the scheduling module is specifically configured to: determine at least one ready instruction for the current cycle using the depth of unscheduled instructions in the dependency graph and the depth offset record; and sort the at least one ready instruction by priority based on the priority parameter of each ready instruction to obtain the ready instruction set, wherein the priority parameter of a ready instruction includes at least one of the following: the height of the ready instruction in the dependency graph, the number of successor instructions of the ready instruction in the dependency graph, and the height offset record of the ready instruction.
[0036] In one possible implementation, the apparatus further includes a conversion module for converting each instruction in the intermediate code sequence into a composite coded pseudo-instruction to obtain the instruction sequence.
[0037] In one possible implementation, the first determining module is specifically configured to: determine the target optional sub-slot of each instruction in the VLIW based on the target root slot, the optional sub-slots, and slot information of each instruction in the VLIW, wherein the slot information includes the correspondence between a root slot and the sub-slots included in that root slot; permute and combine the target optional sub-slots of each instruction in the VLIW to obtain at least one initial VLIW mode; and determine the VLIW modes common to the at least one initial VLIW mode and the multiple VLIW modes as the optional VLIW modes of the VLIW.
[0038] In one possible implementation, the second determining module is specifically configured to: determine the machine instruction corresponding to each instruction in the VLIW from a first mapping relationship based on the slot corresponding to each instruction in the VLIW in the shortest optional VLIW mode, wherein the first mapping relationship represents the machine instruction of the instruction in each slot; and determine the encoded instruction of each instruction in the VLIW based on the machine instruction corresponding to each instruction in the VLIW, the shortest optional VLIW mode, and the second mapping relationship, wherein the second mapping relationship represents the encoding of the machine instruction in each VLIW mode.
[0039] In a third aspect, this application provides an instruction compilation apparatus, the apparatus comprising: one or more processors; and a memory for storing one or more computer programs or instructions; wherein when the one or more computer programs or instructions are executed by the one or more processors, the one or more processors implement the method according to any implementation of the first aspect.
[0040] In a fourth aspect, this application provides an instruction compilation apparatus, including a processor for executing the method according to any implementation of the first aspect.
[0041] In a fifth aspect, this application provides a computer-readable storage medium including a computer program or instructions that, when executed on a computer, cause the computer to perform the method according to any implementation of the first aspect.
[0042] In a sixth aspect, this application provides a chip, including: an input interface, an output interface, and at least one processor. Optionally, the chip further includes a memory. The at least one processor is used to execute code in the memory, and when the at least one processor executes the code, the chip implements the method according to any implementation of the first aspect.
[0043] Optionally, the chip described above can also be an integrated circuit.
[0044] In a seventh aspect, this application provides a computer program product containing instructions that, when run on a computer, cause the computer to implement the method according to any implementation of the first aspect.
Brief Description of the Drawings
[0045] Figure 1 is a schematic diagram of a VLI design provided in an embodiment of this application;
[0046] Figure 2 is a schematic diagram of instruction placement provided in an embodiment of this application;
[0047] Figure 3 is a schematic diagram of the structure of a compiler provided in an embodiment of this application;
[0048] Figure 4 is a flowchart of a program compilation method provided in an embodiment of this application;
[0049] Figure 5 is a flowchart of another program compilation method provided in an embodiment of this application;
[0050] Figure 6 is a block diagram of a second instruction scheduling module provided in an embodiment of this application;
[0051] Figure 7 is a schematic diagram of a process for scheduling an instruction sequence provided in an embodiment of this application;
[0052] Figure 8 is a flowchart of an algorithm for constructing a dependency graph and determining bypass equivalence and height offset records provided in an embodiment of this application;
[0053] Figure 9 is a flowchart of an instruction scheduling algorithm provided in an embodiment of this application;
[0054] Figure 10 is a block diagram of a Bundle selection module provided in an embodiment of this application;
[0055] Figure 11 is a schematic diagram of a process for determining an optional VLIW mode from multiple VLIW modes provided in an embodiment of this application;
[0056] Figure 12 is a flowchart of an algorithm for obtaining the VLIW encoding corresponding to each VLIW provided in an embodiment of this application;
[0057] Figure 13 is a schematic diagram of business source code provided in an embodiment of this application;
[0058] Figure 14 is a schematic diagram of an IR sequence provided in an embodiment of this application;
[0059] Figure 15 is a schematic diagram of an instruction sequence provided in an embodiment of this application;
[0060] Figure 16 is a schematic diagram of a DAG provided in an embodiment of this application;
[0061] Figure 17 is a schematic diagram of the DAG after bypass adjustments are taken into account, provided in an embodiment of this application;
[0062] Figure 18 is a schematic diagram of a DAG after partial instruction scheduling provided in an embodiment of this application;
[0063] Figure 19 is a schematic diagram of the DAG after scheduling is completed, provided in an embodiment of this application;
[0064] Figure 20 is a schematic diagram of the scheduled instruction sequence provided in an embodiment of this application;
[0065] Figure 21 is a schematic diagram of multiple VLIWs provided in an embodiment of this application;
[0066] Figure 22 is a schematic diagram of the first VLIW in Figure 21, provided in an embodiment of this application;
[0067] Figure 23 is a schematic diagram of a machine instruction provided in an embodiment of this application;
[0068] Figure 24 is a schematic diagram of an assembly instruction provided in an embodiment of this application;
[0069] Figure 25 is a block diagram of a program compilation apparatus provided in an embodiment of this application;
[0070] Figure 26 is a block diagram of another program compilation apparatus provided in an embodiment of this application;
[0071] Figure 27 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application;
[0072] Figure 28 is a schematic diagram of the structure of a program compilation device provided in an embodiment of this application.
Detailed Implementation
[0073] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0074] The terms "first," "second," etc., used in the specification, embodiments, claims, and drawings of this application are for distinguishing purposes only and should not be construed as indicating or implying relative importance or order. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or apparatus that includes a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or apparatus.
[0075] It should be understood that in this application, "at least one (item)" means one or more, and "a plurality of" means two or more. "And / or" is used to describe the relationship between related objects, indicating that three relationships can exist. For example, "A and / or B" can represent three cases: only A exists, only B exists, and both A and B exist simultaneously, where A and B can be singular or plural. The character " / " generally indicates that the preceding and following related objects are in an "or" relationship. "At least one (item) of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one (item) of a, b, or c can represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can be single or multiple.
[0076] With the development of semiconductor technology, processor computing power has been greatly improved, and AI technology has also developed rapidly. The development of AI technology, in turn, places increasingly higher demands on processor computing power. To meet the power consumption requirements of high-performance processors, current processors typically employ a DSA. In a DSA, processors often use a VLIW design to improve instruction processing parallelism. The instructions processed by the processor refer to independently coded and issued basic hardware actions that can complete data transfers, such as a 32-bit (binary digit) addition instruction.
[0077] In the current VLIW design, the compiler first determines the VLIW for each cycle through an instruction scheduling process. The VLIW employs a VLIW mode. A VLIW mode includes at least one slot, indicating at least one coding region. Each slot can hold at least one instruction. For any given instruction, the slot that can accommodate that instruction is called an optional slot. During instruction scheduling, once an instruction is placed in a slot in the current cycle, no other instructions can be placed in that slot. Other instructions can be placed in other optional slots of the current cycle and in optional slots of subsequent cycles. There is a temporal order between the current cycle and subsequent cycles.
[0078] Each slot corresponds to a functional unit (FU), which executes the instructions that can be placed in that slot. The functional unit corresponding to an optional slot of an instruction includes the functional unit that executes that instruction.
[0079] In each cycle, the processor reads the instructions placed in the slots and issues them to the corresponding functional units for execution. An instruction, when issued by the processor, may be divided into multiple stages for execution. Each stage is executed in at least one cycle and occupies a corresponding functional unit during those cycles. Therefore, during the instruction scheduling phase, if an instruction is divided into *a* stages for execution, and these *a* stages occupy a total of *b* cycles, then if the instruction is placed in cycle *c*, it will occupy the corresponding functional unit from cycle *c* to cycle *c+b-1*.
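The occupancy rule in the preceding paragraph can be sketched as follows. This is a minimal illustration only; the function names `occupied_cycles` and `conflicts` and the set-based reservation record are assumptions introduced here, not part of the application.

```python
from __future__ import annotations


def occupied_cycles(issue_cycle: int, total_cycles: int) -> range:
    """Cycles c .. c+b-1 during which the instruction holds its functional unit."""
    if total_cycles < 1:
        raise ValueError("an instruction occupies at least one cycle")
    return range(issue_cycle, issue_cycle + total_cycles)


def conflicts(reserved: set[int], issue_cycle: int, total_cycles: int) -> bool:
    """True if any cycle the instruction needs is already reserved on the unit."""
    return any(c in reserved for c in occupied_cycles(issue_cycle, total_cycles))


# An instruction whose stages take b = 3 cycles, placed in cycle c = 5,
# holds the functional unit during cycles 5, 6 and 7.
busy = set(occupied_cycles(5, 3))
```

A scheduler could consult such a record before placing a further instruction on the same functional unit.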
[0080] Because processors can recognize only a few preset VLIW modes during decoding, the compiler must place instructions according to one of these modes. If no instruction is available for a certain slot, either because of conflicts (data conflicts, structure conflicts, and control conflicts) or because only some of the instructions can be placed in the current cycle, a No Operation (NOP) instruction needs to be added to that slot as a placeholder so that the processor can recognize the final VLIW. However, NOP instructions increase the memory footprint of the compiled program, resulting in a larger memory overhead.
[0081] To reduce the memory footprint of compiled programs in VLIW designs, various VLI designs have been proposed. VLI designs reduce program memory usage through either time compression or space compression. For time compression, the encoding of placeholder NOP instructions is removed, and additional encoding is used to represent the current valid VLIW slots (slots containing instructions) and their order. Additional encoding can include masks added to the lower bits or VLIW type encoding, etc.
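As a rough illustration of time compression, the sketch below drops empty (NOP) slots and records the valid slots in a bit mask. The mask layout (bit i set when slot i holds an instruction), the function names, and the `LOAD` instruction are assumptions made for this example; the application only states that a mask or a VLIW type encoding may be used.

```python
from __future__ import annotations

from typing import Optional, Sequence


def compress_bundle(slots: Sequence[Optional[str]]) -> tuple[int, list[str]]:
    """Remove placeholder slots; return (valid-slot mask, remaining instructions)."""
    mask = 0
    payload: list[str] = []
    for index, instr in enumerate(slots):
        if instr is not None:          # None stands for a placeholder NOP
            mask |= 1 << index         # hypothetical layout: bit i marks slot i
            payload.append(instr)
    return mask, payload


def expand_bundle(mask: int, payload: Sequence[str], width: int) -> list[Optional[str]]:
    """Inverse of compress_bundle: restore each instruction to its slot position."""
    remaining = iter(payload)
    return [next(remaining) if (mask >> i) & 1 else None for i in range(width)]


mask, payload = compress_bundle(["ADD32", None, None, "LOAD"])
```

Only the two valid instructions and the mask need to be stored, and the decoder side can recover the slot order from the mask.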
[0082] In space compression, each slot in several preset VLIW modes is used as the root slot. Sub-slots are extracted from the root slot to form new VLIW modes. The instructions that can be placed in the root slot include those that can be placed in the sub-slots. The more instructions that can be placed in a slot, the greater the code length required for each instruction, and the greater the code length required to distinguish these instructions, thus increasing the total length of the final VLIW code. In space compression, because fewer instructions can be placed in the sub-slots, the total length of the final VLIW code is correspondingly reduced, resulting in a smaller memory footprint for the compiled program.
[0083] As can be seen from the aforementioned VLI design, the slot combination method in VLIW mode is very flexible. Furthermore, instructions within a single VLIW can be processed in parallel. Considering the impact of instruction parallelism requirements, code length, VLIW alignment constraints, and decoding circuit complexity on VLI, the following situations may occur in highly flexible VLI designs:
[0084] 1. The same instruction (i.e., an instruction with the same function) can be placed in several different VLIW modes, and the encoding of the instruction is different when placed in different VLIW modes. In this way, when an instruction is used frequently, it can be placed in multiple slots, realizing flexible instruction scheduling.
[0085] Please refer to Figure 1, which is a schematic diagram of a VLI design provided in an embodiment of this application. Figure 1 shows four VLIW modes. The first VLIW mode consists of slots A, B, C, and D; the second VLIW mode consists of slots A1, B1, and C1; the third VLIW mode consists of slots A11 and B11; and the fourth VLIW mode consists of slots A and B. A, A1, and A11 are sub-slots of slot A; B, B1, and B11 are sub-slots of slot B; and C and C1 are sub-slots of slot C.
[0086] For example, please refer to Figure 2, which is a schematic diagram of instruction placement provided in an embodiment of this application. Figure 2 uses the first and second VLIW modes in Figure 1 for illustration. As shown in Figure 2, the ADD32 instruction can be placed in slot A and in slot D of the first VLIW mode, or in slot A1 of the second VLIW mode. The encoding of the ADD32 instruction is different in each of these three slots.
[0087] 2. The same instruction can be placed in different slots within a VLIW mode, with different encodings depending on the slot. This allows for flexible instruction scheduling when an instruction is used frequently, enabling placement in multiple slots.
[0088] In some VLIW modes, the same instruction can be placed in multiple locations simultaneously. Because different functional units are occupied when the instruction is placed in different locations, the bypasses may not be equivalent, which makes the subsequent instruction scheduling process more complex.
[0089] Bypass affects the minimum interval between instructions, thus impacting read-after-write (RAW) and write-after-read (WAR) dependencies in data dependencies.
[0090] For RAW dependencies, after the processor issues an instruction, if there is a bypass between the functional unit used to output the operand and the functional unit used to input the operand in the paired instruction, the functional unit can directly transfer the data to another functional unit that needs to read the output without going through the register write process. Since no register write is required, one cycle of overhead can be reduced.
[0091] For WAR dependencies, a read operation on a register must be executed before a write operation. Therefore, the instruction corresponding to the write operation is usually scheduled after the instruction corresponding to the read operation. When the read operation is executed in a later stage of the corresponding instruction pipeline, and the write operation is executed in an earlier stage, even if the instruction corresponding to the write register is scheduled before the instruction corresponding to the read operation, the read operation will still execute before the write operation. In this case, there is a negative edge between the instructions corresponding to the read and write operations. However, if there is a bypass between the functional units for writing and reading registers, the negative edge weight will increase. For such instructions, if the impact of the bypass on WAR is not considered during scheduling, the order of reading and writing registers will be disordered, leading to errors in the result of reading registers.
[0092] If an instruction can be placed in different slots, and the weight of each edge between the instruction and its successor instruction remains unchanged in the dependency graph when placed in different slots (i.e., the timing of the dependency graph remains unchanged), then the bypass of this instruction is equivalent. Otherwise, the bypass of this instruction is not equivalent. For example, if an instruction can be bypassed with a certain successor instruction when placed in some slots, but not when placed in other slots, then the bypass of this instruction is not equivalent.
[0093] 3. For space compression, the VLIW mode is relatively complex due to the combination of sub-slots.
[0094] As described above, the potential issues arising from VLI design can influence instruction selection, instruction scheduling, instruction packing, and instruction printing during program compilation. This places high demands on the compiler, and current compiler frameworks cannot support highly flexible VLI designs. Therefore, a compilation method that supports highly flexible VLI designs is urgently needed, aiming to reduce the memory footprint of the compiled program without compromising compilation or code execution efficiency.
[0095] This application provides a program compilation method that supports the compilation of programs with high flexibility in VLI designs. It reduces the memory footprint of the compiled program without affecting the flexibility, compilation efficiency, or code execution efficiency of the VLI design. This method can be applied to compilers, such as those targeting VLIW processors or accelerators, including LLVM. The method does not limit the architecture of the processor or accelerator and can combine VLI designs with other processor technologies to broaden the application scope of VLI designs. The processor or accelerator can implement static or non-static multiple-issue processing of the compiled file; this application does not limit the issuance method of the compiled file.
[0096] Please refer to Figure 3, which is a schematic diagram of the structure of a compiler provided in an embodiment of this application. The compiler includes a compiler front-end, an optimizer, and a compiler back-end. The compiler back-end includes an instruction selection module, a first instruction scheduling module, a software pipelining scheduling module, a register allocation module, a second instruction scheduling module, an instruction packing module, a bundle selection module, and a code printing module.
[0097] As shown in Figure 3, the business source code is input into the compiler front-end in text form. The compiler front-end processes the business source code and outputs an intermediate representation (IR) sequence to the optimizer. The optimizer performs machine-independent optimizations on the IR sequence and outputs the optimized IR sequence to the compiler back-end. The instruction selection module converts the IR sequence into a machine instruction sequence and outputs the machine instruction sequence to the first instruction scheduling module. This machine instruction sequence is processed sequentially by the first instruction scheduling module, the software pipelining scheduling module, the register allocation module, the second instruction scheduling module, the instruction packing module, and the bundle selection module. Finally, the code printing module generates and outputs the target code based on the input machine instruction sequence. The business source code can be developed by developers using a high-level language based on a VLIW processor or accelerator, and the target code can be a binary file or an assembly file.
[0098] The second instruction scheduling module handles the scheduling of instructions with bypass options. An instruction has bypass options if it can be placed in different slots such that, depending on the slot, it either has or does not have a bypass with a successor instruction. The bundle selection module determines the shortest bundle code and generates the final bundle code; a bundle refers to a VLIW.
[0099] Please refer to Figure 4, which is a flowchart of a program compilation method provided in an embodiment of this application. The method can be applied to a compiler, for example, the compiler shown in Figure 3. As shown in Figure 4, the method may include the following processes:
[0100] 101. Schedule the instruction sequence to obtain multiple VLIWs. Each VLIW includes at least one instruction from the instruction sequence. Each instruction is located in at least one target root slot, which indicates the resources required to execute the placed instruction.
[0101] The compiler can schedule instruction sequences based on instruction scheduling information and instruction scheduling algorithms to implement the scheduling of optional bypass instructions. Each instruction corresponds to at least one type of scheduling information, which indicates the resources available for each stage of the instruction. For example, the instruction scheduling information may include: an optional root slot for the instruction, at least one optional cycle for each stage of the instruction and optional functional units within each optional cycle, the effective machine cycle corresponding to at least one operand of the instruction (including input operands and / or output operands), and the bypass corresponding to each operand of the instruction. This information is pre-set. As described above, the optional root slots of an instruction are determined by the scheduling information corresponding to the instruction, and the optional root slots included in all the scheduling information corresponding to the instruction constitute all the optional root slots of the instruction.
[0102] 102. Based on the target root slot where each instruction in VLIW is located, determine the optional VLIW mode from multiple VLIW modes. The optional VLIW mode includes a target optional sub-slot for each instruction in VLIW. The target optional sub-slot of the instruction is a sub-slot of the target root slot where the instruction is located.
[0103] The compiler can determine the optional VLIW mode from multiple VLIW modes based on the target root slot where each instruction is located, optional sub-slots, and slot information. The slot information can include the correspondence between the root slot and the sub-slots included in the root slot. A VLIW mode includes at least one slot, and any slot can be either the root slot or a sub-slot.
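Process 102 can be sketched as a filter over the preset VLIW modes. The slot and mode names below follow Figure 1, but the data layout, the `MUL32` instruction, the sets of allowed slots, and the assumption that no two instructions of one VLIW share a root slot are illustrative choices, not details taken from the application.

```python
from __future__ import annotations

# Slot information: each slot mapped to its root slot (a root maps to itself).
ROOT_OF = {
    "A": "A", "A1": "A", "A11": "A",
    "B": "B", "B1": "B", "B11": "B",
    "C": "C", "C1": "C",
    "D": "D",
}

# The four VLIW modes of Figure 1.
MODES = {
    "mode1": ["A", "B", "C", "D"],
    "mode2": ["A1", "B1", "C1"],
    "mode3": ["A11", "B11"],
    "mode4": ["A", "B"],
}


def optional_modes(bundle: dict[str, tuple[str, set[str]]]) -> dict[str, dict[str, str]]:
    """bundle maps instruction -> (target root slot, slots that can hold it).

    Returns, for every optional mode, the slot chosen for each instruction.
    """
    result: dict[str, dict[str, str]] = {}
    for name, slots in MODES.items():
        by_root = {ROOT_OF[s]: s for s in slots}
        assignment: dict[str, str] = {}
        for instr, (root, allowed) in bundle.items():
            slot = by_root.get(root)
            if slot is None or slot not in allowed:
                break                      # this mode cannot hold the instruction
            assignment[instr] = slot
        else:
            result[name] = assignment      # every instruction found a slot
    return result


# Hypothetical VLIW: ADD32 scheduled in root slot A, MUL32 in root slot B.
bundle = {"ADD32": ("A", {"A", "A1"}), "MUL32": ("B", {"B", "B1", "B11"})}
candidates = optional_modes(bundle)
```

Because scheduling only records root slots, the sub-slot choice is deferred to this step, which keeps the scheduler independent of the number of sub-slot combinations.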
[0104] 103. Based on the shortest optional VLIW mode among the optional VLIW modes, determine the encoding instruction for each instruction in the VLIW, and obtain the VLIW encoding corresponding to the VLIW.
[0105] Optionally, the machine instruction corresponding to each instruction in the VLIW can be determined from the first mapping relationship based on the slot corresponding to each instruction in the shortest selectable VLIW mode. The first mapping relationship represents the machine instruction of the instruction in each slot. Then, based on the machine instruction corresponding to each instruction in the VLIW, the shortest selectable VLIW mode, and the second mapping relationship, the encoded instruction of each instruction in the VLIW is determined, thus obtaining the VLIW encoding corresponding to the VLIW. The second mapping relationship represents the encoding of the machine instruction in each VLIW mode.
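The two-step lookup described above might look like the following sketch. All table contents (code lengths in bits, machine instruction names, and encoding values) are invented for illustration; only the overall flow (pick the shortest optional mode, apply the first mapping, then the second mapping) comes from the text.

```python
from __future__ import annotations

# Hypothetical total code length of each VLIW mode, in bits.
MODE_BITS = {"mode1": 128, "mode2": 96, "mode4": 64}

# First mapping: (instruction, slot) -> machine instruction.
FIRST_MAP = {
    ("ADD32", "A"): "ADD32_A",
    ("ADD32", "A1"): "ADD32_A1",
}

# Second mapping: (machine instruction, VLIW mode) -> encoding.
SECOND_MAP = {
    ("ADD32_A", "mode1"): 0x11,
    ("ADD32_A", "mode4"): 0x05,
    ("ADD32_A1", "mode2"): 0x09,
}


def encode_vliw(candidates: dict[str, dict[str, str]]) -> tuple[str, list[int]]:
    """candidates maps each optional mode to its instruction -> slot assignment."""
    shortest = min(candidates, key=MODE_BITS.__getitem__)
    assignment = candidates[shortest]
    machine = [FIRST_MAP[(instr, slot)] for instr, slot in assignment.items()]
    codes = [SECOND_MAP[(m, shortest)] for m in machine]
    return shortest, codes


mode, codes = encode_vliw(
    {"mode1": {"ADD32": "A"}, "mode2": {"ADD32": "A1"}, "mode4": {"ADD32": "A"}}
)
```

Keeping the two mappings separate reflects the fact that the same instruction has a different machine instruction per slot and a different encoding per mode.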
[0106] In summary, the compilation method provided in this application schedules an instruction sequence to obtain multiple VLIWs. Each VLIW includes at least one instruction from the instruction sequence, with each instruction located in at least one target root slot. Based on the target root slot of each instruction in the VLIW, an optional VLIW mode is determined from the multiple VLIW modes. The optional VLIW mode includes a target optional sub-slot for each instruction in the VLIW, where the target optional sub-slot is a sub-slot of the target root slot where the instruction is located. Finally, based on the shortest optional VLIW mode among the optional VLIW modes, the encoded instruction for each instruction in the VLIW is determined, resulting in the VLIW encoding corresponding to the VLIW. This process supports the compilation of programs for VLI designs where instructions can be placed in different slots (including root slots and sub-slots). It can generate binary or assembly files targeting VLI design processors, solving the complex compilation problems caused by highly flexible VLI designs and parallel instruction issuance, and fully supporting the flexibility of VLI design encoding. The encoding instructions for each instruction in the VLIW are determined based on the shortest optional VLIW mode. Compared with existing technologies, this effectively reduces the memory footprint of the program without affecting the flexibility of VLI design, compilation efficiency, and code execution efficiency.
[0107] Please refer to Figure 5, which is a flowchart of another program compilation method provided in an embodiment of this application. The method can be applied to a compiler, for example, the compiler shown in Figure 3. As shown in Figure 5, the method may include the following processes:
[0108] 201. Schedule the instruction sequence to obtain multiple VLIWs. Each VLIW includes at least one instruction from the instruction sequence. Each instruction is located in at least one target root slot, which indicates the resources required to execute the placed instruction.
[0109] This process can be executed by the second instruction scheduling module of the compiler shown in Figure 3. While the instruction sequence passes from the first instruction scheduling module to the second instruction scheduling module, instructions may be added, deleted, or replaced. Therefore, the instruction sequence needs to be scheduled again to ensure code performance.
[0110] The scheduling process executed by the second instruction scheduling module may include: a dependency graph construction process and an instruction scheduling process. The dependency graph may include a directed acyclic graph (DAG). The specific structure of the second instruction scheduling module will be described below.
[0111] For example, please refer to Figure 6, which is a block diagram of a second instruction scheduling module provided in an embodiment of this application. The module includes: a dependency graph construction unit, a bypass equivalence and height offset (hs) record determination unit, a ready instruction priority determination unit, a remaining unscheduled instruction determination unit, an instruction scheduling unit, and a depth and depth offset calculation unit.
[0112] The instruction sequence is input to the dependency graph construction unit, which constructs a dependency graph based on the instruction sequence. The bypass equivalence and height offset record determination unit determines, based on the dependency graph, the bypass equivalence of each instruction and the height offset record of each bypass-non-equivalent instruction. Each instruction in the instruction sequence corresponds to at least one type of scheduling information, which indicates the resources occupied by the instruction at each stage. The height offset record includes the height offset of a bypass-non-equivalent instruction when it is executed on the resources indicated by the corresponding scheduling information.
[0113] The remaining unscheduled instruction determination unit is used to determine whether there are any remaining unscheduled instructions during the entire scheduling process. If there are remaining unscheduled instructions, the ready instruction priority determination unit is used to determine the ready instructions that can be scheduled in the current cycle and the priority of each ready instruction. The instruction scheduling unit is used to schedule ready instructions according to their priorities, and the depth and depth offset calculation unit is used to calculate the depth and depth offset of the successor instructions of each ready instruction. When there are no remaining unscheduled instructions, the instruction scheduling unit outputs a sequence of instructions arranged according to the scheduling timing.
[0114] The following describes process 201 using the structure of the second instruction scheduling module shown in Figure 6 as an example. Please refer to Figure 7, which is a flowchart of scheduling an instruction sequence provided in an embodiment of this application. The scheduling process is as follows:
[0115] 2011. Based on the dependency graph of the instruction sequence, determine the bypass equivalence of each instruction in the instruction sequence.
[0116] First, the process of establishing the dependency graph of the instruction sequence is explained. This process can be executed by the dependency graph construction unit shown in Figure 6. The dependency graph construction unit can build a dependency graph based on the scheduling information of each instruction in the instruction sequence.
[0117] For example, a dependency graph includes a Directed Acyclic Graph (DAG). The dependency graph construction unit can use the instructions in an instruction sequence as "vertices," the dependencies between instructions as "directed edges," and the minimum interval between instructions as "edge weights," thus forming a DAG. At this stage, the impact of bypasses on edge weights is not considered. Then, the initial height of each vertex is calculated based on the edge weights. The height of a vertex in a DAG is the sum of the weights of all edges on the path from that vertex to the deepest leaf vertex.
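The initial height calculation can be sketched as a longest-path computation over the DAG. Reading "the path to the deepest leaf vertex" as the maximum-weight path to any leaf is an interpretation made here, and the adjacency format and instruction names are assumptions for the example.

```python
from __future__ import annotations

from functools import lru_cache


def heights(edges: dict[str, list[tuple[str, int]]]) -> dict[str, int]:
    """edges maps a vertex to its (successor, edge weight) pairs.

    The height of a leaf is 0; otherwise it is the largest sum of edge
    weights along any path from the vertex down to a leaf.
    """
    @lru_cache(maxsize=None)
    def height(vertex: str) -> int:
        return max((w + height(s) for s, w in edges.get(vertex, ())), default=0)

    vertices = set(edges) | {s for succ in edges.values() for s, _ in succ}
    return {v: height(v) for v in vertices}


# Hypothetical instruction sequence: load -> add -> store, and mul -> store.
dag = {
    "load": [("add", 2)],
    "add": [("store", 1)],
    "mul": [("store", 3)],
}
initial_heights = heights(dag)
```

Bypass effects are deliberately left out of this step, matching the text; they are applied later as edge weight updates or height offsets.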
[0118] The aforementioned process is illustrated using a Directed Acyclic Graph (DAG) as an example. For each instruction in the DAG, all its successor instructions are traversed. In this embodiment, the bypass results when an instruction is executed according to the resources indicated by various scheduling information may differ. However, bypassing can affect data dependencies. Therefore, when traversing successor instructions, it is necessary to detect whether there are data dependencies between the instruction and its successors to achieve better instruction scheduling. For example, data dependencies may include RAW dependencies and / or WAR dependencies. In a RAW dependency scenario, bypassing may decrease the height of the instruction by one and decrease the depth of its successor instructions by one. In a WAR dependency scenario, bypassing may increase the height of the instruction by one and increase the depth of its successor instructions by one. Here, the depth of the instruction is the maximum value of the sum of weights on the paths from the vertex corresponding to the instruction to its ancestor vertices in the DAG. The effect of bypassing on the height of the instruction is called height offset, and the effect on the depth of the instruction is called depth offset.
[0119] When there is a data dependency between instructions and their successors, it is necessary to determine the bypass equivalence of the instructions and then process them accordingly. This usually allows for better instruction scheduling, enabling a larger number of instructions to be executed in parallel within a single cycle, thereby reducing the memory footprint of the compiled program.
[0120] For example, if an instruction has a data dependency with its successor instructions, and the weight of the edge between the instruction and each successor instruction remains unchanged when the instruction is executed on the resource indicated by the various scheduling information, then the instruction's bypass is determined to be equivalent. If an instruction has a data dependency with its successor instructions, and the weight of the edge between the instruction and any successor instruction changes when the instruction is executed on the resource indicated by the various scheduling information, then the instruction's bypass is determined to be inequivalent. For instructions with equivalent bypasses, their timing in the DAG remains unchanged during scheduling, and their depth and height offsets when executed on the resource indicated by the various scheduling information are both defaulted to 0.
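The equivalence test can be sketched as a comparison of edge weights across the scheduling information of one instruction. The input format (one weight table per piece of scheduling information, already reflecting any bypass) and the names used are assumptions for illustration.

```python
from __future__ import annotations


def bypass_equivalent(edge_weights: dict[str, dict[str, int]]) -> bool:
    """edge_weights maps scheduling information -> {successor: edge weight}.

    The bypass is equivalent when every successor edge has the same weight
    no matter which scheduling information the instruction is executed on.
    """
    options = list(edge_weights.values())
    return all(option == options[0] for option in options[1:])


# Hypothetical ADD32 with one successor "sub": a bypass exists only from slot A.
non_equivalent = bypass_equivalent({"slot_A": {"sub": 1}, "slot_D": {"sub": 2}})
equivalent = bypass_equivalent({"slot_A": {"sub": 1}, "slot_D": {"sub": 1}})
```

An instruction with a single piece of scheduling information is trivially equivalent under this check.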
[0121] Optionally, after determining the bypass equivalence of each instruction, the bypass equivalence of each instruction can also be marked in the dependency graph.
[0122] 2012. In the dependency graph, update the weights of the edges between bypass-equivalent instructions that have a bypass and their successor instructions.
[0123] This process can be performed by the bypass equivalence and height offset record determination unit. For example, if a bypass-equivalent instruction has a bypass with each successor instruction when executed on the resources indicated by its scheduling information, the weight of the edge between the instruction and each successor instruction can be reduced by one. Correspondingly, the height of the bypass-equivalent instruction is also reduced by one. For bypass-equivalent instructions without a bypass, the weights of the edges to their successor instructions remain unchanged.
[0124] 2013. Determine the height offset record of each bypass-non-equivalent instruction. The height offset record includes the height offset of the bypass-non-equivalent instruction when it is executed on the resources indicated by the corresponding scheduling information.
[0125] This process can be performed by the bypass equivalence and height offset record determination unit. After the height offset record of a bypass-non-equivalent instruction is determined, the record can be stored.
[0126] If a bypass-non-equivalent instruction corresponds to multiple pieces of scheduling information, the priority of each piece can be set based on the height offset record. The smaller the height offset of an instruction when it is executed on the resources indicated by a piece of scheduling information, the shorter the interval between the instruction and its successor, resulting in a more compact program and generally lower memory usage. Therefore, to minimize memory consumption, scheduling information with a smaller height offset can be given a higher priority.
[0127] For example, the height offset can typically be +1, 0, or -1. Scheduling information with a height offset of +1 can be assigned to low priority, scheduling information with a height offset of 0 to medium priority, and scheduling information with a height offset of -1 to high priority.
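The priority assignment in this example can be sketched as a sort on the height offset. The record format and the scheduling information names are hypothetical.

```python
from __future__ import annotations

# Priority label for each typical height offset, as in the example above.
PRIORITY_OF_OFFSET = {-1: "high", 0: "medium", 1: "low"}


def rank_scheduling_info(height_offsets: dict[str, int]) -> list[str]:
    """Order scheduling information from highest to lowest priority.

    A smaller height offset means a shorter interval to the successor,
    so it is tried first.
    """
    return sorted(height_offsets, key=height_offsets.get)


# Hypothetical height offset record of one bypass-non-equivalent instruction.
record = {"sched_D": 1, "sched_A": -1, "sched_A1": 0}
ranked = rank_scheduling_info(record)
labels = [PRIORITY_OF_OFFSET[record[name]] for name in ranked]
```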
[0128] The following uses a flowchart to illustrate processes 2011 to 2013. Please refer to Figure 8, which is an algorithm flowchart for constructing a dependency graph and determining bypass equivalence and height offset records provided in an embodiment of this application. As shown in Figure 8, a dependency graph is first established. The instructions in the dependency graph are traversed, and it is determined whether an instruction has been encountered. If no instruction is encountered, the process ends. When instruction Q is encountered, its successor instructions are traversed. It is then determined whether a successor instruction of Q has been encountered. If no successor instruction of Q has been encountered, it is determined whether instruction Q has an already traversed successor instruction. If instruction Q has no traversed successor instruction, the process continues traversing other instructions in the dependency graph.
[0129] When traversing to a successor instruction of instruction Q, it checks whether there is a RAW or WAR dependency between instruction Q and the current successor instruction. If there is no RAW or WAR dependency, it means that instruction Q and the current successor instruction have a WAW dependency, and there is no need to check the bypass; it can continue traversing other successor instructions of instruction Q. If there is a RAW or WAR dependency, it checks and records the impact of the bypass on the edge weights. Then it continues traversing other successor instructions of instruction Q. When another successor instruction of instruction Q is encountered, the aforementioned process continues until no other successor instructions of instruction Q are encountered. Next, it checks whether instruction Q has any already traversed successor instructions. At this point, instruction Q has previously encountered successor instructions, so it continues to check whether the bypasses of instruction Q are equivalent. If the bypass of instruction Q is equivalent, it updates the edge weights between instruction Q and its successor instructions, and then continues traversing other instructions in the dependency graph. If the bypass is not equivalent, it determines the priority of the height offset record and scheduling information, and then continues traversing other instructions in the dependency graph.
[0130] After traversing other instructions in the dependency graph, continue executing the aforementioned process until no instructions are found in the dependency graph, at which point the entire process ends.
[0131] 2014. Based on the dependency graph and the height offset records of bypass-non-equivalent instructions, perform instruction scheduling for the current cycle.
[0132] The compiler can determine the ready instruction set for the current cycle based on the dependency graph, the depth offset record of each instruction in the instruction sequence, and the height offset records of bypass-non-equivalent instructions. The ready instructions in the ready instruction set are arranged in priority order, and the depth offset record includes the depth offset of the instruction when it is executed on the resources indicated by the corresponding scheduling information. Then, each ready instruction in the ready instruction set is scheduled sequentially.
[0133] The process of determining the ready instruction set for the current cycle can be executed by the ready instruction priority determination unit. This unit first identifies instructions that meet the data conflict requirements of the current cycle as ready instructions, and then sorts these ready instructions by priority to obtain the ready instruction set. If no instruction meets the data conflict requirements in the current cycle, a NOP (No Operation) of one cycle is inserted, and the cycle is incremented by 1. This process is then executed in the next cycle.
[0134] The ready instruction priority determination unit can use the depth and depth offset records of unscheduled instructions in the dependency graph to determine the instructions that meet the data conflict requirements of the current cycle, thereby determining at least one ready instruction for the current cycle. Then, based on the priority parameters of each ready instruction, the at least one ready instruction is prioritized to obtain a ready instruction set. The priority parameters of a ready instruction include at least one of the following: the height of the ready instruction in the dependency graph, the number of successor instructions in the dependency graph, and the height offset record of the ready instruction. The ready instruction set can exist in the form of a queue.
[0135] When determining at least one ready instruction for the current cycle, the depth and depth offset of the unscheduled instruction in the dependency graph can be used to determine whether the unscheduled instruction meets the data conflict requirements of the current cycle. The depth offset of the unscheduled instruction may differ depending on the resource indicated by different scheduling information. Instructions are scheduled starting from cycle 0, with each cycle incrementing by 1. The compiler can sum the depth of the unscheduled instruction in the dependency graph with each depth offset recorded to obtain a sum for each scheduling information. When there exists a sum less than or equal to the current cycle number, the unscheduled instruction is determined as a ready instruction, and the scheduling information corresponding to the sum less than or equal to the current cycle number is recorded.
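The readiness test described above can be sketched as follows. The function name, the record format, and the example values are assumptions; the comparison itself (depth plus depth offset less than or equal to the current cycle number) follows the text.

```python
from __future__ import annotations


def ready_scheduling_info(depth: int, depth_offsets: dict[str, int], cycle: int) -> list[str]:
    """Return the scheduling information under which the instruction is ready.

    depth_offsets maps scheduling information -> depth offset. The instruction
    is a ready instruction for this cycle if the returned list is non-empty.
    """
    return [info for info, offset in depth_offsets.items() if depth + offset <= cycle]


# Hypothetical unscheduled instruction with depth 3 in the dependency graph.
offsets = {"sched_A": -1, "sched_D": 0}
ready_at_1 = ready_scheduling_info(3, offsets, cycle=1)
ready_at_2 = ready_scheduling_info(3, offsets, cycle=2)
ready_at_3 = ready_scheduling_info(3, offsets, cycle=3)
```

In this example the bypass reachable through `sched_A` makes the instruction ready one cycle earlier than it would be through `sched_D`.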
[0136] The following explains the depth offset of unscheduled instructions. For instructions with bypass equivalents, the depth offset of their successor instructions is always 0. For instructions with non-bypass equivalents, the depth offset of their successor instructions is dynamically calculated during the scheduling process. The depth offset record of an unscheduled instruction is calculated after all its predecessor instructions have been scheduled. At this point, the scheduling information used by the predecessor instructions of the unscheduled instruction has been determined, that is, the resources occupied (including the target root slot and the functional units occupied by each stage) have been determined. The compiler can detect whether there is a bypass between the functional units of each operand in the resources indicated by the scheduling information corresponding to the unscheduled instruction and the functional units of the corresponding operands in the resources occupied by the predecessor instructions, in order to determine the depth offset record of the unscheduled instruction.
[0137] For unscheduled instructions without a predecessor instruction, their depth offset records are all 0 by default, and their depth in the dependency graph is also 0. Therefore, if the current cycle is the first cycle (cycle = 0), then the depth offset of any unscheduled instruction has not been calculated, and only instructions without a predecessor instruction satisfy the data conflict requirement. If the current cycle is not the first cycle (cycle ≠ 0), then the compiler records the depth offset records of some unscheduled instructions calculated in previous cycles.
[0138] The following describes how at least one ready instruction is prioritized, taking as an example the case where the priority parameters of a ready instruction include the height of the ready instruction in the dependency graph, the number of its successor instructions in the dependency graph, and its height offset record. In one implementation, a priority can be determined for each ready instruction according to each of these three priority parameters separately, and the highest of the three resulting priorities is taken as the priority of the ready instruction. In another implementation, the priority of each ready instruction can be determined first according to one of the priority parameters. When ready instructions with the same priority exist, they are further ordered according to a second priority parameter. If ready instructions with the same priority still exist, a final ordering is performed according to the remaining priority parameter.
[0139] For example, corresponding to the other implementation described above, the compiler can first prioritize the at least one ready instruction based on the height of each ready instruction. When the at least one ready instruction includes multiple first ready instructions with the same height, these first ready instructions are prioritized based on the number of successor instructions of each of them. When the multiple first ready instructions include multiple second ready instructions with the same number of successor instructions, these second ready instructions are prioritized based on the height offset record of each of them, thus obtaining the ready instruction set.
[0140] When the at least one ready instruction is prioritized based on height, the ready instructions are arranged in descending order of height, so that a ready instruction with a greater height has a higher priority. When the multiple first ready instructions are prioritized based on the number of successor instructions, they are arranged in descending order of that number, so that a first ready instruction with more successor instructions has a higher priority. A greater height or a larger number of successor instructions in the dependency graph indicates that the instruction is on the critical path, and scheduling it first allows more instructions to be scheduled in a cycle. Therefore, to schedule more instructions in the current cycle, a larger value of either parameter gives the ready instruction a higher priority.
[0141] When prioritizing multiple second-ready instructions based on their height offset records, those with positive height offsets are designated as high-priority instructions (e.g., instructions with positive height offsets but no negative ones). Instructions with all height offsets of 0 are designated as medium-priority instructions. Instructions with negative height offsets are designated as low-priority instructions. The lower the instruction's height, the lower the height of the subgraph containing that instruction, and generally, the lower the program's memory usage. To minimize the height of subgraphs containing second-ready instructions, they can be scheduled sequentially according to their height offsets. If a second-ready instruction has a positive height offset, its height will be higher in subsequent scheduling cycles, thus its priority can be set to high. If a second-ready instruction has a negative height offset, its height may not increase even in subsequent scheduling cycles, thus its priority can be set to low.
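The three-level ordering described above (height, then number of successor instructions, then height offset record) can be sketched as a single sort key. This is an illustrative sketch only: the dictionary layout and the function names are hypothetical, and an instruction with any negative height offset is read here as low priority.

```python
def offset_class(height_offsets):
    """Classify a height offset record: 2 = high, 1 = medium, 0 = low."""
    if any(o < 0 for o in height_offsets):
        return 0  # a negative offset exists: low priority
    if any(o > 0 for o in height_offsets):
        return 2  # positive offsets and no negative ones: high priority
    return 1      # all offsets are 0: medium priority


def prioritize(ready):
    """Order ready instructions by height, successor count, then offset class."""
    return sorted(
        ready,
        key=lambda r: (-r["height"], -r["successors"],
                       -offset_class(r["height_offsets"])),
    )


ready = [
    {"name": "i0", "height": 3, "successors": 1, "height_offsets": [0]},
    {"name": "i1", "height": 5, "successors": 2, "height_offsets": [-1, 0]},
    {"name": "i2", "height": 5, "successors": 2, "height_offsets": [1]},
    {"name": "i3", "height": 5, "successors": 3, "height_offsets": [0]},
]
order = [r["name"] for r in prioritize(ready)]
```

Here `i3` comes first because it has the most successor instructions among the instructions of height 5, and `i2` precedes `i1` only because of the height offset record.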
[0142] The process of sequentially scheduling each ready instruction in the ready instruction set can be executed by the instruction scheduling unit. The instruction scheduling unit reads the ready instructions sequentially according to their order in the ready instruction set. Then, based on structural conflict constraints and the root VLIW mode among several pre-set VLIW modes, the unit attempts to schedule the read ready instructions. The root VLIW mode includes all root slots.
[0143] The multiple VLIW patterns are all validly encoded VLIW patterns, which can be recognized by the processor's decoding circuitry. A VLIW pattern can include at least one slot, which can be a root slot or a sub-slot. VLIW patterns also have inclusion relationships with each other based on the slots they include. For example, if VLIW pattern A and another VLIW pattern B have a one-to-one slot correspondence, and each slot in VLIW pattern A is a sub-slot of the corresponding slot in VLIW pattern B, then VLIW pattern A is said to be included by VLIW pattern B. After all VLIW patterns that are included by other VLIW patterns are removed, the remaining VLIW patterns, which have no inclusion relationships with each other, are called root VLIW patterns (or root bundle patterns). All slots in a root VLIW pattern are root slots.
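The derivation of root VLIW patterns from the inclusion relationship can be sketched as follows. The slot names, the `PARENT` table, and the assumption that the one-to-one slot correspondence is positional are hypothetical choices made for illustration. The sketch also assumes a single level of sub-slots under each root slot.

```python
# Hypothetical slot hierarchy: sub-slot -> root slot that contains it.
PARENT = {"S0a": "S0", "S0b": "S0", "S1a": "S1"}


def is_sub_slot(slot, other):
    """A slot counts as a sub-slot of itself and of its root slot."""
    return slot == other or PARENT.get(slot) == other


def included_by(a, b):
    """True if pattern a is included by a different pattern b."""
    return (a != b and len(a) == len(b)
            and all(is_sub_slot(x, y) for x, y in zip(a, b)))


def root_patterns(patterns):
    """Keep only the patterns that no other pattern includes."""
    return [p for p in patterns
            if not any(included_by(p, q) for q in patterns)]


patterns = [("S0", "S1"), ("S0a", "S1"), ("S0b", "S1a"), ("S0",), ("S0a",)]
roots = root_patterns(patterns)
```

With these example patterns, only `("S0", "S1")` and `("S0",)` survive, and every slot in them is a root slot.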
[0144] If the current cycle is not the first cycle, then there is at least one scheduling branch up to the current cycle. Each scheduling branch includes the instruction scheduling result for each cycle up to the current cycle. The instruction scheduling results for the same cycle under different scheduling branches are different. The instruction scheduling result for any cycle includes: the target root VLIW mode for that cycle and the resource usage of the instructions placed in the target root VLIW mode. The resource usage includes: the root slot occupied by the instructions in the target root VLIW mode, the cycle and functional unit occupied by each stage of the instruction.
[0145] After reading any ready instruction, the instruction scheduling unit needs to check for structural conflict constraints based on the instruction scheduling results of each scheduling branch in order to schedule the ready instructions. The scheduling process under one scheduling branch is described below; the scheduling under other scheduling branches can refer to this process. For any scheduling branch, after reading the first ready instruction, since there are no scheduled instructions before the first ready instruction, the scheduling unit attempts to schedule the first ready instruction in each root VLIW mode according to the scheduling results of all instructions in that scheduling branch, for each resource indicated by the corresponding scheduling information.
[0146] Specifically, the instruction scheduling unit can determine whether there are resource conflicts indicated by the various scheduling information of the first ready instruction in each root VLIW mode, based on the scheduling results of all instructions in any scheduling branch. The instruction scheduling unit can first determine the currently selectable root VLIW mode among multiple root VLIW modes based on the various scheduling information of the first ready instruction. The currently selectable root VLIW mode includes the selectable root slot of the first ready instruction.
[0147] Subsequently, for each currently selectable root VLIW mode, the instruction scheduling unit determines whether the resources indicated by each piece of scheduling information of the first ready instruction conflict in that root VLIW mode. Taking one currently selectable root VLIW mode and one piece of scheduling information of the first ready instruction as an example, the instruction scheduling unit can determine whether the selectable root slot in the scheduling information conflicts with the slots in which other instructions are placed in the selectable root VLIW mode (for the first ready instruction there is no such conflict). It also determines whether the selectable functional units of each stage, in the selectable cycle, conflict with the functional units occupied by already scheduled instructions in the scheduling branch. When neither conflict exists, the instruction scheduling unit determines that the first ready instruction can be placed in the root VLIW mode, and uses the selectable root VLIW mode and the resources indicated by the scheduling information as an instruction scheduling result for the current cycle under the scheduling branch. When either conflict exists, the instruction scheduling unit determines that the first ready instruction cannot be placed in the root VLIW mode.
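The two structural conflict checks described above (root slot and per-stage functional unit) can be sketched for a single scheduling branch as follows. The data layout is an assumption made for illustration: a branch is a dictionary of occupied root slots per cycle and a set of occupied `(cycle, functional unit)` pairs, and a piece of scheduling information lists its stages as `(cycle offset, functional unit)` pairs. The slot and unit names are hypothetical.

```python
def can_place(sched_info, root_pattern, branch, cycle):
    """Return True if the instruction fits without a structural conflict."""
    slot = sched_info["root_slot"]
    if slot not in root_pattern:
        return False  # this root VLIW mode is not selectable
    if slot in branch["slots"].get(cycle, set()):
        return False  # the root slot already holds an instruction
    # Every stage must find its functional unit free in its own cycle.
    return all((cycle + offset, unit) not in branch["units"]
               for offset, unit in sched_info["stages"])


def place(sched_info, branch, cycle):
    """Record the resources used by a placed instruction in the branch."""
    branch["slots"].setdefault(cycle, set()).add(sched_info["root_slot"])
    for offset, unit in sched_info["stages"]:
        branch["units"].add((cycle + offset, unit))


branch = {"slots": {}, "units": set()}
pattern = ("MEM", "ALU")
load = {"root_slot": "MEM", "stages": [(0, "AGU"), (1, "LSU")]}
store = {"root_slot": "MEM", "stages": [(0, "AGU"), (1, "LSU")]}
add = {"root_slot": "ALU", "stages": [(0, "ALU0")]}
probe = {"root_slot": "ALU", "stages": [(1, "LSU")]}

first = can_place(load, pattern, branch, cycle=0)
place(load, branch, cycle=0)
slot_conflict = can_place(store, pattern, branch, cycle=0)
unit_conflict = can_place(probe, pattern, branch, cycle=0)
no_conflict = can_place(add, pattern, branch, cycle=0)
next_cycle = can_place(store, pattern, branch, cycle=1)
```

After `load` is placed in cycle 0, `store` fails on the root slot, `probe` fails on the functional unit of its second stage, and `add` can still share the cycle.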
[0148] After the above process is completed, when the instruction scheduling result of the first ready instruction exists under some scheduling branches, it indicates that the first ready instruction can be placed in the current cycle, that is, the first ready instruction can be executed in the current cycle. The instruction scheduling unit then reads the next ready instruction in the ready instruction set, and under each scheduling branch, according to the instruction scheduling result of the scheduling branch, attempts to schedule the next ready instruction in each currently selectable root VLIW mode, until there are no ready instructions in the ready instruction set that can be scheduled in the current cycle. The scheduling attempt process of the next ready instruction can refer to the scheduling attempt process of the first ready instruction, which will not be described in detail in this embodiment. Then the cycle number is incremented by 1, and processes 2014 and 2015 are repeated in the next cycle. Processes 2014 and 2015 are repeated until all instructions in the instruction sequence are scheduled.
[0149] Optionally, for scheduling branches whose instruction scheduling results do not have a first ready instruction, these scheduling branches can be removed. Subsequent ready instruction scheduling in the current cycle will not attempt to schedule under the removed branches. This allows for scheduling more instructions in a single cycle, typically reducing the overall length of instruction scheduling and potentially decreasing the memory footprint of the instruction program.
[0150] When no instruction scheduling result for the first ready instruction exists under any scheduling branch, it indicates that the first ready instruction cannot be placed in the current cycle, meaning the first ready instruction cannot be executed in the current cycle. The instruction scheduling unit delays the scheduling of the first ready instruction to the next cycle. Then, it reads the next ready instruction from the ready instruction set and attempts to schedule it in the current cycle, until the ready instruction set contains no more ready instructions that can be scheduled in the current cycle. The cycle number is then incremented by 1, and processes 2014 and 2015 are repeated in the next cycle. Processes 2014 and 2015 are repeated until all instructions in the instruction sequence have been scheduled.
[0151] When there are no ready instructions available for scheduling in the current cycle from the ready instruction set, in one example, the scheduling results of each instruction under each scheduling branch can be used as branches to continue scheduling in the current cycle. In another example, only the scheduling result of one instruction under each scheduling branch can be retained to continue scheduling. This application does not limit this approach.
[0152] Optionally, the compiler includes a resource allocation table for each scheduling branch up to the current cycle. The instruction scheduling unit can record the instruction scheduling results in the scheduling branch in the corresponding resource allocation table. Since the instruction scheduling results are updated sequentially during instruction scheduling, the scheduling branches are also updated accordingly. Therefore, the resource allocation table is also updated sequentially during instruction scheduling. Each time a new scheduling branch is added, a new resource allocation table corresponding to that branch needs to be created. For each scheduling branch, each time an instruction scheduling result is added under that branch, that instruction scheduling result needs to be recorded in the resource allocation table.
[0153] When the instruction scheduling unit attempts to schedule ready instructions for the resources indicated by the various scheduling information, if a ready instruction corresponds to multiple scheduling information, the instruction scheduling unit needs to perform corresponding processing based on the bypass equivalence of the ready instruction.
[0154] The process of handling the bypass equivalence of ready instructions includes: when the bypass of a ready instruction is equivalent, the instruction scheduling unit attempts to schedule the resources indicated by the various scheduling information respectively. When the bypass of a ready instruction is not equivalent, the priority of the various scheduling information is determined based on the height offset record of the ready instruction, and scheduling attempts are made in descending order of priority. The priority of each scheduling information can be obtained from the aforementioned process 2013. For example, the instruction scheduling unit can first obtain the instruction scheduling result of the first priority (i.e., the highest priority) scheduling information among the various scheduling information, and then determine the selectable first priority scheduling information. The instruction scheduling result of any first priority scheduling information can be referred to the foregoing description, and will not be repeated here in the embodiments of this application.
[0155] When there are selectable first-priority scheduling information, the first example process of the bypass equivalent ready instruction described above can be referred to, and will not be repeated in this embodiment. If there are multiple selectable first-priority scheduling information, and the weight of the edge between the ready instruction and each subsequent instruction in the dependency graph remains unchanged when the ready instruction is executed on the resource indicated by multiple selectable first-priority scheduling information, then the second example process of the bypass equivalent ready instruction described above can be referred to, and will not be repeated in this embodiment. When there is no selectable first-priority scheduling information, the scheduling attempt result of the second-priority scheduling information among multiple scheduling information is obtained. The subsequent process can be referred to the description in this section, and will not be repeated in this embodiment.
[0156] During the scheduling of instructions in the current cycle, after a ready instruction is scheduled, the weights of the edges between the ready instruction and its predecessor instruction in the dependency graph, as well as the depth of the ready instruction, are updated. This process can be performed by the depth and depth offset calculation unit.
[0157] After a ready instruction is scheduled, the depth and depth offset calculation unit also needs to determine whether each predecessor instruction of the target successor instruction has been scheduled. When each predecessor instruction of the target successor instruction has been scheduled, the depth and depth offset record of the target successor instruction in the dependency graph are determined. Based on the depth and depth offset record of the target successor instruction in the dependency graph, the current depth of the target successor instruction is determined. Based on the current depth of the target successor instruction, it is determined whether to add the target successor instruction to the ready instruction set. The depth and depth offset calculation unit determines whether the current depth of the target successor instruction meets the data conflict requirements of the current cycle. This process can be referred to the aforementioned process 2014, and will not be elaborated upon here in this embodiment.
[0158] The ready instruction is then removed from the ready instruction set, and the next ready instruction in the ready instruction set is scheduled. If there is no ready instruction in the ready instruction set that can be placed in the current cycle, the scheduling process for the current cycle ends, and the scheduling process for the next cycle begins.
[0159] It should be noted that ready instructions that cannot be placed in the current cycle will be delayed until the next cycle for scheduling. In the next cycle, the ready instructions delayed in the current cycle also need to be confirmed as ready instructions. Correspondingly, if the current cycle is not the first cycle, then in the aforementioned process 2014, the ready instructions delayed in previous cycles also need to be confirmed as ready instructions for the current cycle. For example, ready instructions delayed in the current cycle can be placed in a pending queue. In the next cycle, the instructions in the pending queue also need to be confirmed as ready instructions for the next cycle. Correspondingly, in the aforementioned process 2014, the instructions currently in the pending queue also need to be confirmed as ready instructions for the current cycle.
[0160] This process is illustrated using a single-cycle instruction scheduling example; instruction scheduling for other cycles can refer to this process, and will not be elaborated upon in this embodiment. In each subsequent cycle, the ready instructions for each cycle are scheduled according to the instruction scheduling results of each scheduling branch. That is, in the entire scheduling process, starting from at least one instruction scheduling result of the first cycle, the instruction scheduling results of each subsequent cycle extend sequentially, forming a path diagram with multiple scheduling branches.
[0161] The scheduling process ends when no unscheduled instructions exist in a given cycle. If multiple scheduling branches exist, one branch can be selected, and the instruction scheduling result for each ready instruction is obtained based on the best-performing branch. Optionally, the earliest scheduling branch that allows all instructions to be placed can be selected. This ensures good performance in instruction scheduling. During the extension of scheduling branches, pruning can be performed when the branch size becomes too large, thereby fully utilizing the parallelism of instructions in the VLI design while improving compilation efficiency. The scheduled instruction sequence is then output.
[0162] The aforementioned instruction scheduling process takes into account the WAR dependency between an instruction and its successor, meaning there are cases where a bypass increases the instruction's height and depth, thus enabling precise timing scheduling and negative edge scheduling. The instruction scheduling process of this embodiment can also be applied to processors or accelerators with non-precise timing. In this case, it is not necessary to consider the WAR dependency between the instruction and its successor, nor is there a case where a bypass increases the instruction's height and depth. In this case, the height and depth offsets are negative or 0, and there is no need to consider cases where the height and depth offsets are positive.
[0163] The following flowchart illustrates the scheduling process of the entire instruction sequence. Refer to Figure 9, which shows a flowchart of an instruction scheduling algorithm provided in an embodiment of this application. As shown in Figure 9, the ready instruction set is first obtained based on the instruction depth, depth offset, and the Pending queue. The ready instruction set is traversed, checking if it is empty. If empty, the cycle number is incremented, and a new ready instruction set is obtained based on the instruction depth, depth offset, and the Pending queue. If not empty, subsets H of the ready instruction set are traversed in descending order of height, where all ready instructions in H have the same height. Subsets P of H are then traversed in descending order of the number of successor instructions, where all ready instructions in P have the same number of successor instructions. Next, the highest priority instruction in P is determined based on the height offset, and an attempt is made to schedule it. A structural conflict is checked; if a conflict exists, the instruction is added to the Pending queue, and the ready instruction set is traversed again. If no conflict exists, the depth of the instruction and the weights of its predecessor edges are updated, and its successor instructions are traversed, calculating their depth and depth offset. The process checks whether the sum of the depth and depth offset of a successor instruction meets the depth requirement of the current cycle. If it does, the successor instruction is added to the ready instruction set, and the previously scheduled instruction is removed from the ready instruction set. If the depth requirement is not met, the previously scheduled instruction is removed from the ready instruction set. Finally, the process checks whether the instruction sequence has been fully scheduled. If it has, the scheduling process ends; otherwise, the ready instruction set is re-traversed.
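The overall loop can be illustrated with a heavily simplified, single-branch sketch. It is not the algorithm of this embodiment: it assumes one piece of scheduling information per instruction (a single root slot), one instruction per root slot per cycle, fixed edge latencies in place of depth offsets, an acyclic dependency graph, and no addition of successors to the ready set within the same cycle. All names and the example data are hypothetical.

```python
def list_schedule(instrs, deps, slot_of, height, root_slots):
    """Return {instruction: cycle}. deps maps instr -> {pred: latency}."""
    for i in instrs:
        if slot_of[i] not in root_slots:
            raise ValueError(f"no root slot available for {i}")
    succ_count = {i: 0 for i in instrs}
    for preds in deps.values():
        for p in preds:
            succ_count[p] += 1

    placed = {}
    cycle = 0
    while len(placed) < len(instrs):
        # Ready: every predecessor is scheduled and its latency has elapsed.
        # Instructions delayed in earlier cycles are reconsidered here,
        # which plays the role of the Pending queue.
        ready = [i for i in instrs if i not in placed and all(
            p in placed and placed[p] + lat <= cycle
            for p, lat in deps.get(i, {}).items())]
        ready.sort(key=lambda i: (-height[i], -succ_count[i]))
        used = set()
        for i in ready:
            if slot_of[i] not in used:  # structural conflict check
                used.add(slot_of[i])
                placed[i] = cycle
        cycle += 1  # an empty cycle corresponds to an inserted NOP
    return placed


instrs = ["ld0", "ld1", "add", "st"]
deps = {"add": {"ld0": 2, "ld1": 2}, "st": {"add": 1}}
slot_of = {"ld0": "MEM", "ld1": "MEM", "add": "ALU", "st": "MEM"}
height = {"ld0": 3, "ld1": 3, "add": 1, "st": 0}
schedule = list_schedule(instrs, deps, slot_of, height, {"MEM", "ALU"})
```

In this example the two loads compete for the same root slot, so `ld1` is delayed to cycle 1, and cycle 2 stays empty because `add` must wait for `ld1`.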
[0164] For the scheduled instruction sequence, the compiler needs to pack the instructions to obtain multiple VLIWs. In one implementation, this process can be performed by the instruction packing module shown in Figure 3, and the scheduled instruction sequence can be a linear sequence. The instruction packing module, based on the root VLIW mode among multiple pre-set VLIW modes and the set of instructions that can be placed in each root slot of the root VLIW mode, and taking structural and data conflicts into account, packs the instruction sequence of each cycle into a VLIW, i.e., a Bundle. Within a VLIW, the target root slot of an instruction can be indicated by the instruction's attribute tag. In another implementation, the compiler may not include an instruction packing module; the instruction packing process can be executed by the second instruction scheduling module, whose packing process can refer to the aforementioned instruction packing module process. This embodiment of the application does not limit this.
[0165] It should be noted that during the entire scheduling process, the remaining unscheduled instruction determination unit needs to determine whether there are any remaining unscheduled instructions. If there are remaining unscheduled instructions, the aforementioned processes 2014 and 2015 are executed; if there are no remaining unscheduled instructions, the instruction scheduling ends.
[0166] The aforementioned process is based on the attribute that the sub-slot is contained within the root slot. It only utilizes the root VLIW mode for instruction scheduling and instruction packaging, and the scheduling result is also represented only through the root VLIW mode. This eliminates the need for sub-slots to participate in instruction scheduling and instruction packaging, thereby reducing the scheduling scale and improving scheduling efficiency.
[0167] The instruction sequence in process 201 is derived from the business source code. As shown in Figure 3 above, the business source code sequentially passes through the compiler front-end and the optimizer to obtain the IR sequence. The instruction selection module converts each instruction in the IR sequence into a composite coded pseudo-instruction, resulting in an instruction sequence. For example, the instruction selection module can first convert the IR sequence into a DAG, then perform optimizations such as merging, data legalization, and instruction legalization on the DAG. Then, composite coded pseudo-instructions are obtained from the DAG using an instruction selection algorithm. Instruction selection algorithms can include tree covering algorithms, DAG covering algorithms, and graph covering algorithms. A composite coded pseudo-instruction can represent instructions with the same function but different resource consumption and encoding (different slots or functional units occupied); it is not the actual encoding of the instruction. By using composite coded pseudo-instructions, instructions with the same function but different resource consumption and encoding can be uniformly scheduled, making full use of instruction-level parallelism (ILP), thereby achieving better scheduling performance and effectively supporting further compilation of flexible VLI encoding.
[0168] Before executing process 201, the first instruction scheduling module can adjust the order of the instruction sequence to improve the execution performance of the compiled code. At this time, the operands of the instructions are virtual registers. The first instruction scheduling module adjusts the order of the instruction sequence based on the dependencies between instructions and resource usage, while also reducing register pressure. Optionally, the first instruction scheduling module can use a list scheduling algorithm to adjust the order of the instruction sequence.
[0169] The software pipelining module then schedules the instruction sequence within the loop kernel to improve the loop's execution efficiency. At this point, the composite coded pseudo-instructions are in static single assignment (SSA) form.
[0170] The register allocation module converts virtual registers in the instruction sequence into physical registers and deconstructs the SSA form of the composite coded pseudo-instructions. When the number of physical registers is insufficient, the register allocation module also needs to perform register splitting and spilling. Optionally, the register allocation module can use graph coloring or linear scan algorithms to implement the aforementioned process. The instruction sequence, which contains composite coded pseudo-instructions, is then output to the second instruction scheduling module. The second instruction scheduling module executes process 201 based on the instruction sequence.
[0171] This process 201 implements the scheduling of optional bypass instructions and schedules each composite coded pseudo-instruction to a specific root slot. It should be noted that at this point, the instructions in the root slot do not have instruction encoding information; the encoding instructions need to be determined based on subsequent processes to complete the instruction compilation.
[0172] 202. Based on the target root slot where each instruction in VLIW is located, determine the optional VLIW mode from multiple VLIW modes. The optional VLIW mode includes a target optional sub-slot for each instruction in VLIW. The target optional sub-slot of the instruction is a sub-slot of the target root slot where the instruction is located.
[0173] This process can be executed by the Bundle selection module of the compiler shown in Figure 3 above. The specific structure of the Bundle selection module is explained below. For an example, refer to Figure 10, which is a block diagram of a Bundle selection module provided in an embodiment of this application. The Bundle selection module includes: a Bundle building unit, a Bundle initial selection unit, and a Bundle final selection unit.
[0174] The Bundle building unit and the Bundle initial selection unit are used to execute this process 202. Multiple VLIWs (i.e., multiple Bundles) are input to the Bundle building unit, which is used to build multiple initial VLIW patterns based on the VLIWs. The Bundle initial selection unit is used to perform validity checks on all the initial VLIW patterns built based on the multiple VLIW patterns to obtain the optional VLIW patterns.
[0175] The following describes process 202 in detail, using the structure of the Bundle selection module shown in Figure 10 above as an example. Refer to Figure 11, which is a flowchart of the process of determining an optional VLIW mode from multiple VLIW modes provided in this application embodiment. As shown in Figure 11, the process includes:
[0176] 2021. Based on the target root slot, optional sub-slots, and slot information of each instruction in VLIW, determine the target optional sub-slots for each instruction. The slot information includes the correspondence between the root slot and the sub-slots included in the root slot.
[0177] This process can be performed by the Bundle building unit in Figure 10. Instructions that can be placed in a sub-slot can also be placed in the root slot to which the sub-slot belongs, and the sub-slots of a root slot include the root slot itself.
[0178] The optional subslots for each instruction are pre-defined. The Bundle building unit can determine the subslots included in the target root slot of the instruction based on the slot information, and then determine the target optional subslots of the instruction by the intersection of the subslots included in the target root slot of the instruction and the optional subslots of the instruction.
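The intersection described above can be sketched as follows. The slot names and the contents of `SLOT_INFO` are hypothetical and serve only to illustrate the rule that a root slot counts among its own sub-slots.

```python
# Hypothetical slot information: root slot -> sub-slots it includes
# (each root slot is listed among its own sub-slots).
SLOT_INFO = {
    "S0": {"S0", "S0a", "S0b"},
    "S1": {"S1", "S1a"},
}


def target_optional_sub_slots(target_root_slot, optional_sub_slots):
    """Intersect the root slot's sub-slots with the instruction's optional ones."""
    return SLOT_INFO[target_root_slot] & set(optional_sub_slots)


# An instruction scheduled to root slot S0 whose pre-defined optional
# sub-slots are S0, S0a and S1a.
targets = target_optional_sub_slots("S0", ["S0", "S0a", "S1a"])
```

Here `S1a` is dropped because it is not a sub-slot of the target root slot `S0`.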
[0179] 2022. Arrange and combine the target optional sub-slots of each instruction in VLIW to obtain at least one initial VLIW mode.
[0180] This process can be performed by the Bundle building unit in Figure 10. There is a one-to-one correspondence between the slots in the VLIW and the slots in the initial VLIW mode, and any slot in the VLIW is the root slot of the corresponding slot in the initial VLIW mode.
[0181] 2023. Determine, as the optional VLIW modes of the VLIW, the VLIW modes that appear both among the at least one initial VLIW mode and among the multiple VLIW modes.
[0182] Invalid VLIW patterns may exist among the at least one initial VLIW pattern obtained in the aforementioned process 2022, and these invalid VLIW patterns cannot be recognized by the processor's decoding circuit. Therefore, it is necessary to use the multiple VLIW patterns to filter out the valid initial VLIW patterns from the at least one initial VLIW pattern. This process can be performed by the Bundle initial selection unit in Figure 10.
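Processes 2022 and 2023 can be sketched as a Cartesian product followed by a membership filter. This is an illustrative sketch: the slot names, the set of valid patterns, and the representation of a pattern as a tuple of slots in bundle order are assumptions.

```python
from itertools import product


def initial_patterns(target_sub_slots):
    """Combine one target optional sub-slot per instruction (process 2022)."""
    return [tuple(combo) for combo in product(*target_sub_slots)]


def optional_patterns(initial, valid_patterns):
    """Keep the initial patterns the decoder can recognize (process 2023)."""
    return [p for p in initial if p in valid_patterns]


# Target optional sub-slots of the two instructions in one VLIW.
target_sub_slots = [["S0", "S0a"], ["S1", "S1a"]]
# Hypothetical set of validly encoded VLIW patterns.
valid = {("S0", "S1"), ("S0a", "S1a"), ("S0b", "S1")}

initial = initial_patterns(target_sub_slots)
optional = optional_patterns(initial, valid)
```

Of the four combinations produced here, only the two that also appear in `valid` remain as optional patterns.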
[0183] 203. Based on the slot corresponding to each instruction in the shortest selectable VLIW mode, determine the machine instruction corresponding to each instruction in the VLIW from the first mapping relationship. The first mapping relationship represents the machine instruction of the instruction in each slot.
[0184] The compiler first selects the shortest optional VLIW mode from the optional VLIW modes and then executes process 203. The selection can be performed by the Bundle final selection unit shown in Figure 10. The Bundle final selection unit traverses all optional VLIW modes of the VLIW and selects the one with the shortest code length. If several optional VLIW modes share the shortest code length, any one of them is selected as the shortest optional VLIW mode.
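Selecting the shortest optional VLIW mode is a minimum search over code lengths. The sketch below is illustrative only: the code lengths in `CODE_LENGTH` are hypothetical values chosen so that ABCD is shorter than ABCDE, since the text does not state the actual lengths.

```python
# Hypothetical code lengths in bytes; the source does not state them.
CODE_LENGTH = {"ABCD": 8, "ABCDE": 10}


def shortest_optional_mode(optional_modes):
    """Return the optional mode with the smallest code length.
    On ties, min() keeps the first one, which is one valid way of
    picking any of the shortest modes."""
    return min(optional_modes, key=CODE_LENGTH.__getitem__)


chosen = shortest_optional_mode(["ABCDE", "ABCD"])
```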
[0185] Process 203 can be performed by the Bundle final selection unit shown in Figure 10. The machine instructions in each slot are unencoded machine instructions.
[0186] It should be noted that if there is a slot with no instruction placed in the shortest optional VLIW mode, then a NOP needs to be added to that slot.
[0187] 204. Based on the machine instruction corresponding to each instruction in VLIW, the shortest optional VLIW mode, and the second mapping relationship, determine the encoding instruction of each instruction in VLIW, and obtain the VLIW encoding corresponding to VLIW. The second mapping relationship represents the encoding of the machine instruction in each VLIW mode.
[0188] This process can be performed by the Bundle final selection unit shown in Figure 10. After determining the encoded instruction for each instruction in the VLIW based on the second mapping relationship, the Bundle final selection unit concatenates the encoded instructions to obtain the VLIW encoding corresponding to the VLIW.
[0189] The encoding of the same slot may differ between VLIW modes; for example, the same slot may carry an additional preamble in some VLIW modes. Processes 203 and 204 allow instructions to be compiled even when the encoding of the same slot differs between VLIW modes.
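The two-level lookup of processes 203 and 204 can be sketched as two dictionary lookups followed by concatenation. Everything in the tables below is hypothetical (instruction names, machine-instruction names, and byte values); only the structure, a first mapping keyed by instruction and slot and a second mapping keyed by machine instruction and VLIW mode, follows the text.

```python
# First mapping: (instruction, slot) -> unencoded machine instruction.
# All names are hypothetical.
FIRST_MAPPING = {
    ("op_x", "A3"): "x_a3",
    ("op_y", "C2"): "y_c2",
}

# Second mapping: (machine instruction, VLIW mode) -> encoded bytes.
# The byte values are hypothetical.
SECOND_MAPPING = {
    ("x_a3", "A3C2"): b"\x40\x01",
    ("y_c2", "A3C2"): b"\x02",
}


def encode_vliw(placed_instructions, mode):
    """Map each (instruction, slot) pair to its machine instruction, encode
    it for the chosen VLIW mode, and concatenate the encoded instructions."""
    machine = [FIRST_MAPPING[(instr, slot)] for instr, slot in placed_instructions]
    return b"".join(SECOND_MAPPING[(m, mode)] for m in machine)


encoding = encode_vliw([("op_x", "A3"), ("op_y", "C2")], "A3C2")
```

Keying the second mapping by VLIW mode is what lets the same slot carry a different encoding in different modes.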
[0190] The process of obtaining the VLIW encoding for each VLIW is illustrated below with a flowchart. Refer to Figure 12, which is a flowchart of an algorithm for obtaining the VLIW encoding corresponding to each VLIW according to an embodiment of this application. As shown in Figure 12, multiple VLIWs are first input, and the target optional sub-slots of each instruction in a VLIW are determined based on the optional sub-slots and the slot information. Next, an initial VLIW mode is constructed and checked for validity against the multiple VLIW modes. If the initial VLIW mode is invalid, it is discarded. If it is valid, the shortest optional VLIW mode is selected. The VLIW encoding is then obtained through the first-level and second-level mapping tables and the shortest optional VLIW mode. Finally, it is determined whether a VLIW encoding has been generated for every VLIW. If so, the algorithm ends. If not, the optional slots of each instruction are re-determined for the VLIWs for which no VLIW encoding has been generated.
[0191] After the VLIW encoding corresponding to each VLIW is obtained, each VLIW encoding is mapped to generate a binary file or an assembly file. This process can be performed by the code printing module shown in Figure 3.
[0192] The following takes the LLVM compiler as an example to illustrate how a specific instruction sequence is compiled through processes 201 to 204 described above. For the structure of the LLVM compiler, refer to Figure 3. Refer to Figure 13, which is a schematic diagram of business source code according to an embodiment of this application. The business source code is test.c. Refer to Figure 14, which is a schematic diagram of an IR sequence according to an embodiment of this application. The test.c source code shown in Figure 13 is processed by the LLVM compiler front end in Figure 3 to generate the IR sequence shown in Figure 14.
[0193] Refer to Figure 15, which is a schematic diagram of an instruction sequence according to an embodiment of this application. The IR sequence shown in Figure 14 passes through the optimizer and is then transmitted to the LLVM compiler back end. The instruction selection module converts the optimized IR sequence into an instruction sequence, which at this point is in SSA form. The SSA instruction sequence is then processed by the first instruction scheduling module and the register allocation module to generate the non-SSA instruction sequence shown in Figure 15.
[0194] The second instruction scheduling module then schedules the instruction sequence shown in Figure 15. It first builds a dependency graph based on the scheduling information of each instruction. The following embodiment uses a directed acyclic graph (DAG) as an example. The scheduling information of some instructions in the instruction sequence is shown below; the scheduling information of the other instructions can be understood by reference to the following description and is not repeated in this embodiment. It should be noted that the scheduling information is extended to support the optional bypass cases of instructions.
[0195] The scheduling information for some instructions is shown below:
[0196]
[0197]
[0198] FuncUnit and Bypass are defined as follows; they are simply strings with specific meanings, used to distinguish different functional units and bypasses.
[0199] class FuncUnit;
[0200] class Bypass;
[0201] InstrSchedData represents the specific scheduling information of the instruction. Since composite coded pseudo-instructions are fitted together from instructions with the same function but different codes, there may be multiple InstrSchedData, each corresponding to the scheduling information when the composite coded pseudo-instruction is placed in different slots.
[0202] For example, the PS_const_hw instruction has three types of InstrSchedData, corresponding to the scheduling information when the PS_const_hw instruction is placed in slots A, B, and D, respectively. The three types of scheduling information for the PS_const_hw instruction are as follows:
[0203]
[0204]
[0205] The similar PS_sll_rr instruction has an InstrSchedData, which corresponds to the scheduling information when the PS_sll_rr instruction is placed in Slot A. The scheduling information for the PS_sll_rr instruction is as follows:
[0206]
[0207] The scheduling information for other instructions in the embodiment is as follows:
[0208]
[0209]
[0210] Based on the ComInstrSchedData data structure described above, scheduling information for each instruction needs to be defined, forming an extended scheduling information table. The DAG is constructed using the instruction sequence and the extended scheduling information table as input. The following explains the DAG construction process.
[0211] First, based on the dependencies between the inputs and outputs of each instruction, and following the order of the instruction sequence, specific instructions in the sequence are designated as "vertices," the dependencies between instructions as "directed edges," and the minimum interval between instructions as "edge weights," forming a Directed Acyclic Graph (DAG). Then, the DAG is traversed from top to bottom to calculate the height of each instruction. The higher the height of an instruction, the more cycles are required for the entire instruction scheduling process.
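The height calculation can be sketched as a longest-path computation over the weighted DAG. The sketch assumes that the height of an instruction is the largest sum of edge weights on any path from that instruction to an instruction without successors; the text does not define height precisely, so this definition, the function name `compute_heights`, and the sample edge weights are assumptions.

```python
from functools import lru_cache


def compute_heights(edges):
    """edges maps each vertex to a list of (successor, weight) pairs, where
    the weight is the minimum interval between the two instructions.
    Returns the assumed height of every vertex."""

    @lru_cache(maxsize=None)
    def height(node):
        # A vertex without successors has height 0.
        return max((w + height(succ) for succ, w in edges.get(node, [])), default=0)

    nodes = set(edges) | {succ for succs in edges.values() for succ, _ in succs}
    return {node: height(node) for node in nodes}


# Hypothetical fragment of a DAG: SU1 -> SU4 -> {SU5, SU6}, all weights 2.
edges = {"SU1": [("SU4", 2)], "SU4": [("SU5", 2), ("SU6", 2)]}
heights = compute_heights(edges)
```

Under this definition, an instruction with a larger height sits on a longer dependency chain, which is why it is given priority during scheduling.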
[0212] For example, refer to Figure 16, which is a schematic diagram of a DAG according to an embodiment of this application. This DAG is constructed without considering the impact of bypass. As shown in Figure 16, the DAG contains the sentinel instruction EntrySU, the return instruction ExitSU: jump_ret (which implicitly uses the ra register), and 15 SU units, SU0 to SU14. Each SU unit contains an instruction and other related information. In the box of each SU unit, the input operands of the contained instruction are shown at the top and the output operands at the bottom. Thick solid lines represent RAW dependencies between the instructions in the SU units, thin solid lines represent WAR dependencies, and dashed lines represent WAW dependencies.
[0213] Subsequent instruction scheduling is performed on the basis of SU units (scheduling units). The other related information in an SU unit includes the bypass equivalence of the instruction, the depth offset record, and the height offset record. This information is initially set to default values and is updated in real time during subsequent instruction scheduling. For example, this information is shown below:
[0214]
[0215] Then, the DAG is traversed from top to bottom, executing the aforementioned processes 2011 to 2013. In the following, each SU unit is used to directly denote the instruction it contains. As shown in Figure 16, EntrySU is skipped and the traversal starts from SU1. SU1 has only one successor instruction, SU4, and SU1 and SU4 have a RAW dependency, so the bypass equivalence of SU1 needs to be determined. Based on the aforementioned scheduling information of SU1, the bypass of SU1 when executed on the resources indicated by each piece of scheduling information is determined; likewise, based on the scheduling information of SU4, the bypass of SU4 when executed on the resources indicated by each piece of scheduling information is determined. SU4 corresponds to one piece of scheduling information, whose optional root slot is slot A. It follows that SU1 has a bypass with SU4, namely P0, when SU1 is executed on the resources indicated by the scheduling information whose optional root slot is slot B. This bypass makes the height offset of SU1 -1. When SU1 is executed on the resources indicated by the other scheduling information, there is no bypass between SU1 and SU4, and the height offset is 0. Because the weight of the edge between SU1 and SU4 changes depending on which scheduling information's resources are used, the DAG timing is inconsistent, and the bypass of SU1 is therefore not equivalent. The inequivalence of SU1's bypass can be marked in the DAG.
[0216] For example, refer to Figure 17, which is a schematic diagram of a DAG with bypass adjustment according to an embodiment of this application. Figure 17 is the result of adjusting Figure 16 according to the traversal process. As shown in Figure 17, the bypass equivalence of an instruction is marked by the value of isEquivalentBypass (isE), where 0 indicates inequivalence and 1 indicates equivalence. The value of isE in the SU1 unit is 0. The height offset record of SU1, in the order of its three pieces of scheduling information, is [0, -1, 0]. Since the height offset of SU1 is not yet finalized, the value of hs for SU1 in the DAG is set to the default value 0.
[0217] Then, following Figure 16, the traversal reaches SU4. SU4 has the successor instructions SU5 and SU6, and there are RAW dependencies between SU4 and both successors, so the bypass equivalence of SU4 needs to be determined. SU4 corresponds to one piece of scheduling information, whose optional root slot is slot A. Based on the scheduling information of SU5 and SU6, the bypasses of SU5 and SU6 when executed on the resources indicated by each piece of scheduling information are determined. SU5 and SU6 can each be executed only on the resources indicated by one piece of scheduling information. It follows that there is a bypass P0 between SU4 and both SU5 and SU6. This bypass reduces the weights of the edges from SU4 to SU5 and to SU6 by 1 and makes the height offset -1. Because these edge weights do not change with the scheduling information used, the DAG timing is consistent, and the bypass of SU4 is therefore equivalent. As shown in Figure 17, the isE value of SU4 in the DAG is 1 and the hs value is -1. The weights of the edges between SU4 and SU5 and between SU4 and SU6 are updated to 1, and the height of SU4 is reduced by 1 accordingly.
[0218] Similarly, all instructions in the DAG are traversed, the bypass equivalence of each instruction is determined, and the result is marked in the DAG. If an instruction's bypass is equivalent, the weight of the edge between the instruction and its successor and the height of the instruction in the DAG are updated according to whether a bypass exists. If an instruction's bypass is not equivalent, the instruction's height offset record and the priorities of its scheduling information are determined. After all instructions in the DAG shown in Figure 16 have been traversed, the updated DAG is as shown in Figure 17.
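The equivalence test used in the SU1 and SU4 examples can be sketched as follows. The sketch assumes that a bypass is equivalent exactly when every piece of scheduling information yields the same height offset, which is consistent with the two examples but is not stated as a general rule in the text. The function names are hypothetical, and the initial edge weights of 2 are inferred from the statement that the weights become 1 after being reduced by 1.

```python
def is_equivalent_bypass(height_offsets):
    """A bypass is treated as equivalent when all scheduling options give
    the same height offset, so the DAG timing does not depend on the choice."""
    return len(set(height_offsets)) == 1


def apply_equivalent_bypass(edge_weights, offset):
    """For an equivalent bypass, fold the common offset into the weights of
    the edges to the successor instructions."""
    return {succ: weight + offset for succ, weight in edge_weights.items()}


# SU1: height offset record [0, -1, 0] over its three scheduling options.
su1_is_e = is_equivalent_bypass([0, -1, 0])
# SU4: a single scheduling option with height offset -1.
su4_is_e = is_equivalent_bypass([-1])
su4_edges = apply_equivalent_bypass({"SU5": 2, "SU6": 2}, -1)
```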
[0219] The aforementioned processes 2014 and 2015 are then executed according to the DAG shown in Figure 17, the aforementioned scheduling information, the SU units, and the multiple VLIW modes. For example, refer to Table 1, which lists the multiple VLIW modes provided in an embodiment of this application. Table 1 includes the columns Num, Bundle, slotTypesBitSet, isRootBundle, and encoding. The Num column contains seven numerical indices in the range 0 to 7; the Bundle column contains seven VLIW modes (i.e., seven Bundle modes); the slotTypesBitSet column contains the BitSet of each VLIW mode; the isRootBundle column contains the root Bundle identifier of each VLIW mode (YES indicates a root Bundle, NO indicates not a root Bundle); and the encoding column contains the encoding of each VLIW mode. As shown in Table 1, the VLIW mode with Num 4 is A3C2, and its slotTypesBitSet is 100000100000. A3C2 can be contained in ABCD or ABCDE, so it is not a root Bundle, and its encoding is 0x400000.
[0220] The `slotTypesBitSet` of a VLIW mode is obtained by a bitwise OR of the `BitSetPosition` of each slot in the VLIW mode. For example, the `slotTypesBitSet` of A3C2, 100000100000, is obtained by a bitwise OR of the `BitSetPosition` of A3 (0b100000) and the `BitSetPosition` of C2 (0b100000000000). This facilitates determining, based on the multiple VLIW modes, the optional VLIW modes of the first VLIW from the multiple initial VLIW modes.
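The bitwise OR can be checked with a short sketch. The bit indices in `BIT_INDEX` are inferred from the slotTypesBitSet values in Table 1 (for example, A1 is 1000, so A1 occupies bit 3); they are an assumption rather than a value given directly in the text, and B3 is omitted because its position cannot be derived from Table 1. The function name `slot_types_bitset` is hypothetical.

```python
# Bit index of each slot, inferred from the slotTypesBitSet column of Table 1.
BIT_INDEX = {
    "A": 2, "A1": 3, "A2": 4, "A3": 5,
    "B": 6, "B1": 7, "B2": 8,
    "C": 9, "C1": 10, "C2": 11,
    "D": 12, "E": 13,
}


def slot_types_bitset(slots):
    """OR together the single-bit BitSetPosition of every slot in a VLIW mode."""
    bits = 0
    for slot in slots:
        bits |= 1 << BIT_INDEX[slot]
    return bits


a3c2 = format(slot_types_bitset(["A3", "C2"]), "b")
abcd = format(slot_types_bitset(["A", "B", "C", "D"]), "b")
```

Because every slot owns exactly one bit, two VLIW modes can be compared with a single integer comparison instead of a slot-by-slot match.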
[0221] Table 1
[0222]
Num  Bundle  slotTypesBitSet  isRootBundle  encoding
0    A1      1000             NO            0x100
1    B1      10000000         NO            0x2000
2    C1      10000000000      NO            0x30000
4    A3C2    100000100000     NO            0x400000
5    A2B2    100010000        NO            0x5000000
6    ABCD    1001001000100    YES           0x60000000000000000
7    ABCDE   11001001000100   YES           0x700000000000000000000
[0223] The following takes as an example the case in which only one instruction scheduling result is retained for the instructions scheduled in each cycle. The DAG shown in Figure 17 is traversed from top to bottom. First, based on the depth and depth offset records of each instruction in the DAG, it is determined whether the unscheduled instructions meet the depth requirement of the current cycle. The current cycle is cycle 0 (i.e., the depth is 0), the Pending queue is empty, and the instructions whose sum of depth and depth offset is less than or equal to 0 are SU0, SU1, and SU2. The depth and depth offset of SU0, SU1, and SU2 are all 0, so these three instructions are identified as ready instructions.
[0224] SU0, SU1, and SU2 are then prioritized. First, the three instructions are prioritized by height. From the DAG shown in Figure 17 and the scheduling information, the height order of the three instructions is SU1 = SU2 > SU0. SU1 and SU2 are further prioritized by number of successor instructions, and the order is SU1 = SU2. They are then prioritized by their height offset records, which were obtained in the aforementioned process. These records show that SU1 has a negative height offset, while SU2 has a height offset of 0. Therefore, the priority of SU1 is lower than that of SU2, and the ready instruction set is {SU2, SU1, SU0}.
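The three-level ordering (height, then number of successors, then height offset record) can be sketched as a sort key. The numeric heights and successor counts below are hypothetical values chosen only to reproduce the stated relations SU1 = SU2 > SU0; the class name `SU` and the function `sort_ready` are also hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SU:
    name: str
    height: int
    num_successors: int
    height_offsets: list


def sort_ready(ready):
    """Order ready instructions by larger height first, then by more
    successors, then put instructions with a negative height offset last."""
    return sorted(
        ready,
        key=lambda su: (-su.height, -su.num_successors, min(su.height_offsets) < 0),
    )


ready = [
    SU("SU0", height=2, num_successors=1, height_offsets=[0]),
    SU("SU1", height=4, num_successors=1, height_offsets=[0, -1, 0]),
    SU("SU2", height=4, num_successors=1, height_offsets=[0]),
]
order = [su.name for su in sort_ready(ready)]
```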
[0225] First, read SU2. From the aforementioned scheduling information for SU2 and Table 1, we know that SU2's bypass is equivalent, and SU2 can only be executed in the resources indicated by the scheduling information with slot D as the optional root slot. The current optional root VLIW modes include ABCD and ABCDE in Table 1. From the resource occupancy table, we know that no other instructions are currently placed in slot D, and when SU2 is placed in slot D, the functional units required by each stage are not occupied by other instructions. Therefore, SU2 can be placed in slot D and executed by the resources indicated by the scheduling information with slot D as the optional root slot. Update the resource occupancy table, setting SU2 to cycle 0, i.e., C = 0, and setting isdIdx of SU2 in the aforementioned scheduling information to 0. isdIdx indicates which scheduling information (i.e., which selection) SU2 uses in the aforementioned scheduling information. The scheduling information shows the resources occupied by the instruction in various scheduling information; therefore, isdIdx can represent the resources occupied by SU2 by representing the scheduling information used by SU2. SU2 is the root vertex and has no predecessor instructions, so there is no need to adjust the depth of SU2 or the weight of the edge between SU2 and its predecessor instructions. Its depth is naturally 0.
[0226] SU3, the successor instruction of SU2, has only one predecessor instruction, SU2. After SU2 is scheduled, all predecessor instructions of SU3 have been scheduled, so it is necessary to determine whether SU3 can be added to the ready instruction set. The depth of the instruction is added to each depth offset in its depth offset record. If the sum is less than or equal to the current cycle number, the instruction is determined to be a ready instruction; if the sum is greater than the current cycle number, the instruction is not processed. The depth of SU3 is 2. It is checked whether there is a bypass that reduces the weight of SU3's predecessor path by 1, so that the depth offset of SU3 is -1 when it is executed on the functional unit with the bypass. According to the scheduling information of SU3 and SU2, there is no bypass between SU3 and SU2, and SU3 can only be executed on the resources indicated by the scheduling information whose optional root slot is slot D. Therefore, the depth offset of SU3 is 0, and the depth offset in the scheduling information of SU3 is set to 0. Since the sum of the depth and the depth offset is greater than the current cycle number, SU3 cannot be a ready instruction.
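The readiness test described above can be sketched as follows. The sketch assumes that an instruction is ready as soon as at least one entry of its depth offset record satisfies the condition; the text does not say whether one or all entries must satisfy it, so this is an assumption, as is the function name `is_ready`.

```python
def is_ready(depth, depth_offsets, cycle):
    """An instruction is ready when depth + depth offset <= current cycle
    for at least one scheduling option (assumed interpretation)."""
    return any(depth + offset <= cycle for offset in depth_offsets)


# SU3: depth 2, depth offset 0, checked in cycle 0.
su3_ready_cycle0 = is_ready(2, [0], 0)
# SU4: depth 2, depth offset -1, checked in cycle 0 and in cycle 1.
su4_ready_cycle0 = is_ready(2, [-1], 0)
su4_ready_cycle1 = is_ready(2, [-1], 1)
```

The SU4 case shows the effect of a bypass: the negative depth offset lets the instruction become ready one cycle earlier than its depth alone would allow.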
[0227] Next, SU1 in the ready instruction set is read. Based on the aforementioned scheduling information for SU1, the bypass of SU1 is not equivalent. The priorities of the three scheduling information types need to be determined according to the height offset record [0, -1, 0] of SU1. Then, based on the structural conflict constraints, scheduling attempts are made in the ABCD and ABCDE VLIW modes in descending order of priority among the three scheduling information types. The priorities of the three scheduling information types are: fuPriority[0, 1, 2] = {default, high, default}, meaning fuPriority1 has the highest priority. According to the resource occupancy table, no other instructions are currently placed in slot B, which includes fuPriority1. Furthermore, when SU1 is placed in slot B, the resources required by each stage are not occupied by other instructions. Therefore, SU1 can be placed in slot B and executed using the resources indicated by fuPriority1. The currently selectable root VLIW modes include ABCD and ABCDE in Table 1. Update the resource usage table, set SU1 to cycle 0 (C=0), and set the value of isdIdx of SU1 in the aforementioned scheduling information to represent the optional index of InstrSchedData corresponding to scheduling information 1 in SU1's scheduling information, i.e., set isdIdx=1. SU1 is the root vertex and has no predecessor, so there is no need to adjust the weight and depth of the predecessor edge, as the depth is naturally 0.
[0228] SU4, the successor instruction to SU1, has only one predecessor instruction SU1. After SU1 is scheduled, all predecessor instructions of SU4 have been scheduled. It is necessary to determine whether SU4 can be added to the ready instruction set. Similar to the determination method for SU3 mentioned above, the depth of SU4 is 2. According to the scheduling information of SU4 and SU1, there is a bypass between SU4 and SU1, and SU4 can only be executed on the resource indicated by one scheduling information. Therefore, the depth offset of SU4 is -1, and the deepOffset in the scheduling information of SU4 is set to -1. The sum of the depth of SU4 and the depth offset is greater than the depth of the current cycle, so SU4 cannot be used as a ready instruction.
[0229] Continuing to traverse the ready instruction set and read SU0, as seen from the aforementioned scheduling information for SU0, SU0's bypass is equivalent, and SU0 can only be executed in the resource indicated by one scheduling information, namely, the resource indicated by the scheduling information for slot C as the optional root slot. The resource occupancy table shows that no other instructions are currently placed in slot C, and when SU0 is placed in slot C, the resources required by each stage are not occupied by other instructions. Therefore, SU0 can be placed in slot C and executed by the resource indicated by this scheduling information. The currently available root VLIW modes include ABCD and ABCDE in Table 1. Update the resource occupancy table, setting SU0 to cycle 0, i.e., C = 0, and setting isdIdx of SU0 in the aforementioned scheduling information to 0. SU0 is the root vertex and has no predecessor instructions, therefore, there is no need to adjust the weight and depth of the predecessor edge; the depth is naturally 0.
[0230] SU14, the successor instruction of SU0, has only one predecessor instruction, SU0. After SU0 is scheduled, all predecessor instructions of SU14 have been scheduled, so it is necessary to determine whether SU14 can be added to the ready instruction set. Following the aforementioned determination process, SU14 cannot be a ready instruction. At this point, there are no ready instructions left. Of ABCD and ABCDE, the shorter VLIW mode, ABCD, is selected; that is, the VLIW mode for cycle 0 is ABCD. The cycle number is incremented by 1, and instruction scheduling for cycle 1 begins.
[0231] In each cycle, the first step is to determine whether each unscheduled instruction meets the depth requirement of the current cycle based on its depth and depth offset in the DAG. Unscheduled instructions that meet the depth requirement are then designated as ready instructions. This process can be referenced from the process in cycle 0 described above, and will not be repeated here. It should be noted that the height and depth offsets of bypass equivalent instructions are always 0, while the height and depth offsets of bypass inequivalent instructions change during the scheduling process.
[0232] The DAG shown in Figure 17 is traversed from top to bottom. The instruction whose sum of depth and depth offset is less than or equal to 1 is SU4, which has a depth of 2 and a depth offset of -1, so SU4 is identified as a ready instruction. Since SU4 is the only ready instruction in the current ready instruction set, no priority sorting is required.
[0233] SU4 is read from the ready instruction set. According to the aforementioned scheduling information of SU4, its bypass is equivalent, and SU4 can only be executed on the resources indicated by one piece of scheduling information, namely the scheduling information whose optional root slot is slot A. The currently optional root VLIW modes are ABCD and ABCDE in Table 1. The resource occupancy table shows that no other instruction is currently placed in slot A and that, when SU4 is placed in slot A, the resources required by each stage are not occupied by other instructions. Therefore, SU4 can be placed in slot A and executed on the resources indicated by this scheduling information. The resource occupancy table is updated, SU4 is set to cycle 1 (C=1), and isdIdx of SU4 in the aforementioned scheduling information is set to 0. SU4 has a depth offset of -1 when executed on the resources indicated by this scheduling information. The weight of the edge between SU4 and its predecessor instruction is updated, and the depth of SU4 is updated to the sum of its initial depth and the depth offset. For example, refer to Figure 18, which is a schematic diagram of a DAG after partial instruction scheduling according to an embodiment of this application. Figure 18 is the result of performing the aforementioned scheduling process on Figure 17. As shown in Figure 18, the weight of the edge between SU4 and its predecessor instruction in the updated DAG is changed from 2 to 1.
[0234] Since the predecessor instructions of SU5 and SU10 in the successor instructions of SU4 have been scheduled, it is necessary to determine whether SU5 and SU10 can be added to the ready instruction set. Referring to the aforementioned determination process, SU5 cannot be a ready instruction, but SU10 can. SU10 can only be executed in the resource indicated by one type of scheduling information, namely the resource indicated by the scheduling information with slot A as the optional root slot. The resource occupancy table shows that slot A already contains SU4; therefore, SU10 is delayed in scheduling and added to the Pending queue. At this point, there are no ready instructions in the ready instruction set, so the shortest VLIW mode ABCD is selected from ABCD and ABCDE, i.e., the VLIW mode for cycle 1 is ABCD. The cycle number is incremented by 1, and instruction scheduling for cycle 2 begins.
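The slot conflict that sends SU10 to the Pending queue can be sketched as follows. This is a simplification: it models only whether the root slot is already occupied in the current cycle and ignores the per-stage functional-unit occupancy that the resource occupancy table also tracks. The function name `place_ready_instructions` is hypothetical.

```python
def place_ready_instructions(ready, occupied_slots):
    """Try to place each ready instruction in its root slot for the current
    cycle. Instructions whose slot is already taken are deferred.
    Returns (placed, pending), where placed maps slot -> instruction."""
    placed = {}
    pending = []
    for name, root_slot in ready:
        if root_slot in occupied_slots or root_slot in placed:
            pending.append(name)  # delay scheduling to a later cycle
        else:
            placed[root_slot] = name
    return placed, pending


# Cycle 1: slot A already holds SU4 when SU10 becomes ready.
placed, pending = place_ready_instructions([("SU10", "A")], occupied_slots={"A"})
```

Instructions left in `pending` rejoin the ready instructions of the next cycle, as SU10 does in cycle 2.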
[0235] The DAG shown in Figure 18 is traversed from top to bottom. The instructions whose sum of depth and depth offset is less than or equal to 2 are SU3, SU5, and SU14. These three instructions all have a depth of 2 and a depth offset of 0. These three instructions, together with SU10 from the Pending queue of cycle 1, are identified as ready instructions.
[0236] SU3, SU5, SU10, and SU14 are then prioritized. First, the four instructions are prioritized by height. From the DAG shown in Figure 18 and the scheduling information, the height order of the four instructions is SU3 = SU5 > SU10 > SU14. SU3 and SU5 are further prioritized by number of successor instructions, and the order is SU3 > SU5. Therefore, the ready instruction set is {SU3, SU5, SU10, SU14}.
[0237] First, SU3 is read. Following the aforementioned method, it is determined that SU3 can be placed in cycle 2. The resource occupancy table is updated, SU3 is set to cycle 2 (C=2), and isdIdx of SU3 in the aforementioned scheduling information is set to 0. SU3's bypass is equivalent, and SU3 has no bypass when placed in cycle 2, so there is no need to update the depth of SU3 or the weight of the edge between SU3 and its predecessor instruction.
[0238] The successor instructions to SU3 include SU6 and SU7. All the preceding instructions for SU6 have been scheduled, while the preceding instructions for SU7 have not yet been scheduled. It is necessary to determine whether SU6 can be added to the ready instruction set. Referring to the aforementioned determination process, SU6 cannot be used as a ready instruction.
[0239] SU5 is read. Following the aforementioned method, it is determined that SU5 can be placed in cycle 2. The resource occupancy table is updated, SU5 is set to cycle 2 (C=2), and isdIdx of SU5 in the aforementioned scheduling information is set to 0. SU5's bypass is equivalent, and SU5 has no bypass when placed in cycle 2, so there is no need to update the depth of SU5 or the weight of the edge between SU5 and its predecessor instruction.
[0240] The successor instruction to SU5 is SU7, and all the preceding instructions for SU7 have been scheduled. Referring to the aforementioned determination process, SU7 cannot be considered a ready instruction.
[0241] Read SU10 and, referring to the aforementioned method, determine that the slot where SU10 can be placed has been occupied by SU5. Therefore, SU10 is delayed in scheduling and is added to the Pending queue.
[0242] SU14 is read. Following the aforementioned method, it is determined that SU14 can be placed in cycle 2. The resource occupancy table is updated, SU14 is set to cycle 2 (C=2), and isdIdx of SU14 in the aforementioned scheduling information is set to 0. SU14's bypass is equivalent, and SU14 has no bypass when placed in cycle 2, so there is no need to update the depth of SU14 or the weight of the edge between SU14 and its predecessor instruction.
[0243] There are no successor instructions for SU14. At this time, there are no ready instructions in the ready instruction set. The cycle number is incremented by 1, and the instruction scheduling of cycle 3 begins.
[0244] The DAG shown in Figure 18 is traversed from top to bottom. The instructions whose sum of depth and depth offset is less than or equal to 3 are SU6 and SU7. Both instructions have a depth of 3 and a depth offset of 0. These two instructions, together with SU10 from the Pending queue of cycle 2, are identified as ready instructions.
[0245] SU6, SU7, and SU10 are then prioritized. First, the three instructions are prioritized by height. From the DAG shown in Figure 18 and the scheduling information, the height order of the three instructions is SU6 = SU7 > SU10. SU6 and SU7 are further prioritized by number of successor instructions, and the order is SU7 > SU6. Therefore, the ready instruction set is {SU7, SU6, SU10}.
[0246] First, SU7 is read. Following the aforementioned method, it is determined that SU7 can be placed in cycle 3. The resource occupancy table is updated, SU7 is set to cycle 3 (C=3), and isdIdx of SU7 in the aforementioned scheduling information is set to 0. SU7's bypass is equivalent, and SU7 has no bypass when placed in cycle 3, so there is no need to update the depth of SU7 or the weight of the edge between SU7 and its predecessor instruction.
[0247] The successor instruction to SU7 is SU9, and all the preceding instructions for SU9 have been scheduled. Referring to the aforementioned determination process, SU9 cannot be considered a ready instruction.
[0248] Read SU6. Based on the above method, it is determined that SU6 cannot be placed in cycle 3. Therefore, SU6 is delayed in scheduling and added to the Pending queue.
[0249] Read SU10. Referring to the aforementioned method, it is determined that SU10 cannot be placed in cycle 3. Therefore, SU10 is delayed in scheduling and added to the Pending queue. At this time, there are no ready instructions in the ready instruction set, the cycle number is incremented by 1, and instruction scheduling enters cycle 4.
[0250] Traverse from top to bottom Figure 18 The DAG shown includes instructions whose depth and depth offset sum is less than or equal to 4, including SU9, which has a depth of 4 and a depth offset of 0. SU9, along with SU6 and SU10 in the Pending queue of cycle 3, are identified as ready instructions.
[0251] Prioritize SU6, SU9, and SU10. First, prioritize these three instructions based on their altitude. Figure 18 From the DAG and scheduling information shown, the height order of the three instructions is: SU6 > SU9 > SU10. Therefore, the ready instruction set is {SU6, SU9, SU10}. Following the aforementioned method, scheduling is performed sequentially. It is determined that SU6 and SU9 can be placed in cycle 4, while SU10 cannot. Therefore, SU10 is delayed in scheduling. SU6 and SU9 are then placed in cycle 4, i.e., C = 4, and so on, until all instructions have been scheduled. For an example, please refer to... Figure 19 , Figure 19 This is a schematic diagram of the DAG after scheduling is completed, provided in an embodiment of this application. Figure 19 The diagram shows the weights of each edge after the scheduling is completed, as well as the cycle in which each instruction is executed.
[0252] Please refer to Figure 20, which is a schematic diagram of the scheduled instruction sequence provided in an embodiment of this application. Based on the aforementioned method, the instruction sequence obtained after scheduling the instruction sequence shown in Figure 15 is as shown in Figure 20.
[0253] The instruction packaging module then packages the instruction sequence shown in Figure 20 to obtain multiple VLIWs in the form of root slots and root bundles. For an example, refer to Figure 21, which is a schematic diagram of multiple VLIWs provided in the embodiments of this application. The instruction sequence shown in Figure 20 is packaged according to the root slot and root bundle patterns shown in Table 1 to obtain multiple VLIWs (i.e., multiple bundles), as shown in Figure 21.
[0254] Next, the aforementioned processes 203 and 204 are executed on the multiple VLIWs shown in Figure 21 to obtain multiple VLIW encodings. For each VLIW, process 2021 is first executed to determine the optional slots of the instructions based on the target root slot where each instruction is located in the VLIW, the slot information, and the initial optional slots. The slot information used in this process and the initial optional slots of the instructions are explained below.
[0255] Please refer to Table 2, which shows slot information provided in an embodiment of this application. Table 2 includes a number (Num) column, a root slot (rootSlot) column, a slotTypes column, and a slotTypesBitSet column. The Num column includes numerical indices from 0 to 4, the rootSlot column includes 5 root slots, the slotTypes column includes sub-slot groups for each root slot, and the slotTypesBitSet column includes the BitSet for each sub-slot group. As shown in Table 2, the root slot with Num of 0 is A, and A includes slotTypes A, A1, A2, A3, and slotTypesBitSet 111100.
[0256] Table 2
[0257]
Num  rootSlot  slotTypes       slotTypesBitSet
0    A         A, A1, A2, A3   111100
1    B         B, B1, B2, B3   111000000
2    C         C, C1, C2       111000000000
3    D         D               1000000000000
4    E         E               10000000000000
[0258] The value in the SlotTypesBitSet column is obtained by ORing the BitSet positions (Pos) of each slot in the slotTypes column. Please refer to Table 3, which shows the BitSetPos for each slotType (i.e., slot) and the root slot identifier. The root slot identifier indicates whether the slotType is the root slot, denoted as isRootSlot (Y indicates it is the root slot, N indicates it is not the root slot). As shown in Table 3, the VarSlot corresponding to BitSetPos = 0 represents the slot where the composite coded pseudo-instruction is located. Since the composite coded pseudo-instruction represents an instruction that can be placed in multiple real slots, VarSlot is a virtual slot concept. NoSlot corresponding to BitSetPos = 1 is the default value. Other BitSetPos correspond to real slots; for example, BitSetPos = 2 corresponds to slot A, which is the root slot, and BitSetPos = 10 corresponds to slot C1, which is a sub-slot.
[0259] Table 3
[0260]
[0261]
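As a hedged illustration of how a slotTypesBitSet value is formed by ORing BitSetPos values, the sketch below rebuilds the bit sets of Table 2. Table 3 is not reproduced in this text, so the position map is an assumption inferred from the stated positions (VarSlot = 0, NoSlot = 1, A = 2, C1 = 10) and from the bit sets in Table 2. Because the bit set listed for root slot B has only three set bits, B3 is left out of the map.

```python
# Assumed BitSetPos values (inferred, not copied from Table 3).
BITSET_POS = {
    "VarSlot": 0, "NoSlot": 1,
    "A": 2, "A1": 3, "A2": 4, "A3": 5,
    "B": 6, "B1": 7, "B2": 8,
    "C": 9, "C1": 10, "C2": 11,
    "D": 12, "E": 13,
}


def slot_types_bitset(slots):
    """OR together one bit per slot, at that slot's BitSetPos."""
    bits = 0
    for slot in slots:
        bits |= 1 << BITSET_POS[slot]
    return bits


root_a = slot_types_bitset(["A", "A1", "A2", "A3"])
root_c = slot_types_bitset(["C", "C1", "C2"])
print(format(root_a, "b"))  # 111100
print(format(root_c, "b"))  # 111000000000
```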
[0262] Please refer to Table 4, which shows the optional sub-slots for each instruction provided in an embodiment of this application. Table 4 includes a Num column, a string (Instr) column, a slotTypes column, and a slotTypesBitSet column. The Num column includes 11 numeric indices from 0 to 30, the Instr column includes 11 compound coded pseudo-instructions, the slotTypes column includes the optional sub-slot set for each compound coded pseudo-instruction, and the slotTypesBitSet column includes the BitSet for each initial optional slot set. As shown in Table 4, the compound coded pseudo-instruction with Num of 0 is PS_sll_rr, and the optional sub-slots for PS_sll_rr include A, A1, A2, and A3, with a slotTypesBitSet of 111100.
[0263] Table 4
[0264]
Num  Instr        slotTypes                          slotTypesBitSet
0    PS_sll_rr    A, A1, A2, A3                      111100
2    PS_sra_rr    A, A1, A2, A3                      111100
4    PS_movi      D                                  1000000000000
8    PS_const_hw  A, A1, A2, A3, B, B1, B2, B3, D    1000111111100
9    PS_or_ir     A, A1, A2, A3                      111100
17   PS_add_rr    A, A1, A2, A3                      111100
21   PS_ld_w      C, C1, C2                          111000000000
22   PS_st_w      C, C1, C2                          111000000000
26   PS_jump_ret  A, A1, A2, A3                      111100
27   PS_sp_adj    C, C1, C2                          111000000000
30   PS_mv_rr     B, B1, B2, B3                      111000000
[0265] Based on the aforementioned Tables 2 to 4, the determination of the optional VLIW modes is illustrated below, taking the first VLIW in Figure 21 as an example. The first VLIW in Figure 21 is shown in Figure 22. As shown in Figure 22, the first VLIW includes three instructions: PS_const_hw, PS_sp_adj, and PS_movi, located in target root slots B, C, and D, respectively. Based on the attribute label of PS_const_hw, the root slot of PS_const_hw is determined to be B. Referring to Table 2, the sub-slots included in B are B, B1, B2, and B3, denoted as root_B_slots. Referring to Table 4, the optional sub-slots of PS_const_hw include A, A1, A2, A3, B, B1, B2, B3, and D, denoted as instr_slots. Performing a bitwise AND operation on root_B_slots and instr_slots yields the target optional sub-slots of PS_const_hw: B, B1, B2, and B3, denoted as final_B_slots. Similarly, the final_C_slots of PS_sp_adj are C, C1, and C2, and the final_D_slots of PS_movi are D.
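The bitwise AND in process 2021 can be checked directly on the bit sets listed in Tables 2 and 4. The following is a small sketch that uses only those table values; the variable names are illustrative and not taken from this application.

```python
# slotTypesBitSet values copied from Table 2 (root slots) and Table 4 (instructions).
ROOT_B_SLOTS = 0b111000000
ROOT_C_SLOTS = 0b111000000000
ROOT_D_SLOTS = 0b1000000000000

PS_CONST_HW_SLOTS = 0b1000111111100
PS_SP_ADJ_SLOTS = 0b111000000000
PS_MOVI_SLOTS = 0b1000000000000

# Target optional sub-slots = sub-slots of the root slot AND optional sub-slots.
final_b_slots = ROOT_B_SLOTS & PS_CONST_HW_SLOTS
final_c_slots = ROOT_C_SLOTS & PS_SP_ADJ_SLOTS
final_d_slots = ROOT_D_SLOTS & PS_MOVI_SLOTS

print(format(final_b_slots, "b"))  # 111000000
print(format(final_c_slots, "b"))  # 111000000000
print(format(final_d_slots, "b"))  # 1000000000000
```

The A-group and D bits of PS_const_hw are cleared by the AND, leaving only the bits that belong to root slot B.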
[0266] Then, execute the aforementioned process 2022, arranging and combining the slots in final_B_slots, final_C_slots, and final_D_slots. For example, combining B from final_B_slots, C from final_C_slots, and D from final_D_slots yields the BCD pattern. The slotTypeBitSet of this BCD pattern is obtained by bitwise ORing the BitSetPositions of B, C, and D, and is denoted as BCD_bitSet. Similarly, combining B1 from final_B_slots and D from final_D_slots yields the B1D pattern, and the slotTypeBitSet of the B1D pattern is denoted as B1D_bitSet. Combining B2 from final_B_slots and D from final_D_slots yields the B2D pattern, and the slotTypeBitSet of the B2D pattern is denoted as B2D_bitSet, and so on.
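One possible reading of process 2022 is sketched below: exactly one slot is chosen per instruction and the chosen positions are ORed into a pattern bit set. This is an assumption; the B1D and B2D examples in the text suggest the actual enumeration may also produce patterns that do not take one slot from every instruction. The BitSetPos values are inferred as before (Table 3 is not reproduced here, and B3 is omitted because the bit set of root slot B in Table 2 has only three set bits).

```python
from itertools import product

# Assumed BitSetPos values for the slots used in this example.
BITSET_POS = {"B": 6, "B1": 7, "B2": 8, "C": 9, "C1": 10, "C2": 11, "D": 12}

# Target optional sub-slots per instruction (final_B_slots, final_C_slots, final_D_slots).
final_slots = [["B", "B1", "B2"], ["C", "C1", "C2"], ["D"]]


def initial_patterns(per_instruction_slots):
    """Map each combination name (e.g. 'BCD') to its pattern bit set."""
    patterns = {}
    for combo in product(*per_instruction_slots):
        bits = 0
        for slot in combo:
            bits |= 1 << BITSET_POS[slot]
        patterns["".join(combo)] = bits
    return patterns


patterns = initial_patterns(final_slots)
print(len(patterns))                   # 9
print(format(patterns["BCD"], "b"))    # 1001001000000
```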
[0267] Next, the aforementioned process 2023 is executed: based on the multiple preset VLIW modes, the optional VLIW modes of the first VLIW are determined from the multiple initial VLIW modes formed by the aforementioned permutations and combinations. Starting from the first VLIW mode in Table 1, it can be determined whether each VLIW mode in Table 1 exists among the initial VLIW modes. For example, a bitwise AND operation can be performed between the slotTypeBitSet of the first VLIW mode and the slotTypeBitSet of each initial VLIW mode. If the result of the bitwise AND operation equals the slotTypeBitSet of a certain initial VLIW mode, then that initial VLIW mode is determined to be an optional VLIW mode of the VLIW in Figure 22. Based on the aforementioned initial VLIW modes and Table 1, the optional VLIW modes of the VLIW in Figure 22 are ABCD and ABCDE.
[0268] Since the VLIW modes in Table 1 are arranged from top to bottom in ascending order of encoding bit width, when determining, starting from the top of Table 1, whether each VLIW mode is an optional VLIW mode of the VLIW in Figure 22, the first matching VLIW mode in Table 1 is the shortest optional VLIW mode of the VLIW in Figure 22. The first matched VLIW mode in Table 1 is ABCD; therefore, ABCD is the shortest optional VLIW mode of the VLIW in Figure 22.
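The matching and first-match selection can be sketched as follows. Table 1 is not reproduced in this section, so the mode list below is hypothetical: the mode names are taken from the column names of Table 6, but their order (ascending encoding width) and their slot composition are assumptions. The match test follows the AND comparison described above: a preset mode matches an initial pattern when `mode & pattern == pattern`.

```python
# Assumed BitSetPos values (inferred; Table 3 is not reproduced in this text).
BITSET_POS = {
    "A": 2, "A1": 3, "A2": 4, "A3": 5,
    "B": 6, "B1": 7, "B2": 8,
    "C": 9, "C1": 10, "C2": 11,
    "D": 12, "E": 13,
}


def bitset(slots):
    bits = 0
    for slot in slots:
        bits |= 1 << BITSET_POS[slot]
    return bits


# Hypothetical stand-in for Table 1, ordered from narrowest to widest encoding.
MODES_BY_WIDTH = [
    ("A1", ["A1"]),
    ("B1", ["B1"]),
    ("C1", ["C1"]),
    ("A3C2", ["A3", "C2"]),
    ("A2B2", ["A2", "B2"]),
    ("ABCD", ["A", "B", "C", "D"]),
    ("ABCDE", ["A", "B", "C", "D", "E"]),
]


def optional_modes(initial_pattern_bitsets):
    """Return the preset modes matching at least one initial pattern, in table order."""
    matches = []
    for name, slots in MODES_BY_WIDTH:
        mode_bits = bitset(slots)
        if any(mode_bits & p == p for p in initial_pattern_bitsets):
            matches.append(name)
    return matches


# The BCD pattern of the first VLIW (instructions in slots B, C and D).
modes = optional_modes([bitset(["B", "C", "D"])])
shortest = modes[0]
print(modes, shortest)
```

Because the list is ordered by width, taking the first match avoids comparing encoding lengths explicitly.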
[0269] The determination of the optional VLIW mode for each subsequent VLIW can refer to the relevant process of the first VLIW mentioned above, and will not be repeated here in the embodiments of this application.
[0270] After the optional VLIW modes of each VLIW in Figure 21 are determined, the aforementioned process 203 is executed. Based on the slot corresponding to each instruction in the shortest optional VLIW mode, the machine instruction corresponding to each instruction in the VLIW is determined from the first mapping relationship. For example, please refer to Table 5, which shows the first mapping relationship provided in this embodiment. Table 5 includes columns Num, Instr, Slot A, Slot A1, Slot A2, Slot A3, Slot B, Slot B1, Slot B2, Slot C, Slot C1, Slot C2, Slot D, and Slot E. The Num column includes 11 numerical indices from 0 to 27, the Instr column includes 11 composite coded pseudo-instructions, and each slot column includes the machine instruction corresponding to the 11 composite coded pseudo-instructions in the slot. As shown in Table 5, the machine instruction for the composite coded pseudo-instruction PS_sra_rr with Num=2 is sra_rr under A, sra_rr_A1 under A1, sra_rr_A2 under A2, and sra_rr_A3 under A3.
[0271] Table 5
[0272]
[0273]
[0274] Taking the VLIW in Figure 22 as an example, its shortest optional VLIW mode is ABCD. The three instructions in Figure 22 are placed in slots B, C, and D, respectively. Since no instruction is placed in slot A, a NOP needs to be added at slot A. The slot corresponding to PS_const_hw in ABCD is B. Referring to Table 5, the machine instruction for PS_const_hw in slot B is const_hw. Similarly, the machine instruction for PS_sp_adj in slot C is sp_adj, and the machine instruction for PS_movi in slot D is movi.
[0275] The machine instructions for each instruction in each VLIW are determined following the aforementioned process. Please refer to Figure 23, which is a schematic diagram of machine instructions provided in an embodiment of this application. The machine instructions obtained after determining the machine instruction of each instruction in each VLIW in Figure 21 are shown in Figure 23.
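A minimal sketch of the first-mapping lookup for this VLIW follows. Table 5 is not reproduced in this text, so the dictionary contains only the entries mentioned in the description. The `<slot>_nop` naming for padding is an assumption generalized from the A_nop instruction mentioned later in the description.

```python
# Partial first mapping: (pseudo-instruction, slot) -> machine instruction.
FIRST_MAPPING = {
    ("PS_sra_rr", "A"): "sra_rr",
    ("PS_sra_rr", "A1"): "sra_rr_A1",
    ("PS_sra_rr", "A2"): "sra_rr_A2",
    ("PS_sra_rr", "A3"): "sra_rr_A3",
    ("PS_const_hw", "B"): "const_hw",
    ("PS_sp_adj", "C"): "sp_adj",
    ("PS_movi", "D"): "movi",
}


def lower_bundle(mode_slots, placement):
    """Resolve each slot of the chosen mode to a machine instruction.

    mode_slots: slots of the shortest optional VLIW mode, in order.
    placement:  slot -> pseudo-instruction placed there (absent if empty).
    """
    machine = []
    for slot in mode_slots:
        instr = placement.get(slot)
        if instr is None:
            machine.append(f"{slot}_nop")   # assumed NOP naming
        else:
            machine.append(FIRST_MAPPING[(instr, slot)])
    return machine


bundle = lower_bundle(
    ["A", "B", "C", "D"],
    {"B": "PS_const_hw", "C": "PS_sp_adj", "D": "PS_movi"},
)
print(bundle)  # ['A_nop', 'const_hw', 'sp_adj', 'movi']
```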
[0276] Finally, the aforementioned process 204 is executed. Based on the machine instruction corresponding to each instruction in the VLIW and the shortest optional VLIW mode, the encoded instruction of each instruction in the VLIW is determined based on the second mapping relationship, thus obtaining the VLIW encoding corresponding to the VLIW. For example, please refer to Table 6, which shows the second mapping relationship provided in the embodiments of this application. Table 6 includes a Num column, a machine instruction column, a VLIW mode ABCDE column, a VLIW mode ABCD column, a VLIW mode A2B2 column, a VLIW mode A3C2 column, a VLIW mode C1 column, a VLIW mode B1 column, and a VLIW mode A1 column. The Num column includes nine numerical indices from 0 to 9, the machine instruction column includes nine machine instructions, and each VLIW mode column includes the encoding of the nine machine instructions in the VLIW mode. The first row of the encoding represents the encoding of the machine instruction in the VLIW mode, denoted as encoding, and the second row represents the mask corresponding to encoding. Since multiple machine instruction encodings correspond to the same VLIW mode, a mask is used to identify the machine instruction. The mask can identify the bits occupied by machine instructions. This means that even if the machine instructions are distributed in multiple segments throughout the VLIW encoding, as long as the corresponding bit in the mask of the machine instruction is set to 1, multiple instructions in the VLIW can be arranged and encoded in any order, thus enhancing the flexibility of the VLIW.
[0277] As shown in Table 6, the machine instruction const_hw with Num=1 is encoded as 0Xxxxxxa1e4xxxxxxxxxxxx and 0x00000FFFF00000000000 in ABCDE, and as 0Xxxxxx4e25xxxxxxxx and 0x00000FFFF00000000 in ABCD.
[0278] Table 6
[0279]
[0280]
[0281] Since the same instruction may have different encodings when placed in different slots in the embodiments of this application, it is necessary to further determine the encoded instruction based on the aforementioned Table 6. Taking the machine instruction `const_hw` in the first VLIW in Figure 23 as an example, the shortest VLIW mode for the first VLIW is ABCD. Looking up Table 6, the encoding of `const_hw` under ABCD is 0Xxxxxx4e25xxxxxxxx, and the mask is 0x00000FFFF00000000. Performing a bitwise AND operation between 0Xxxxxx4e25xxxxxxxx and 0x00000FFFF00000000 gives the encoding of `const_hw`, denoted as `encodingMask1`. Similarly, the encodings of the instructions `A_nop`, `sp_adj`, and `movi` are obtained as `encodingMask0`, `encodingMask2`, and `encodingMask3`, respectively. The encoding of ABCD is looked up in Table 4 and denoted as `BundleEncoding`. Performing an OR operation on these encodings yields the VLIW encoding: `BundleEncoding|encodingMask0|encodingMask1|encodingMask2|encodingMask3`.
[0282] After the VLIW encoding of each VLIW is determined using the aforementioned process, code printing is performed to generate binary or assembly instructions. For an example, refer to Figure 24, which is a schematic diagram of assembly instructions provided in an embodiment of this application. After the multiple VLIW encodings are determined based on Figure 23, the assembly instructions obtained by printing the multiple VLIW encodings are shown in Figure 24.
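The mask-and-OR assembly can be sketched as below. Only the `const_hw` encoding and mask under ABCD come from the description, with the don't-care digits (x) treated as 0. The bundle header and the encodings and masks of `A_nop`, `sp_adj`, and `movi` are hypothetical placeholders, since Table 6 is not reproduced in this text.

```python
# const_hw under ABCD: encoding xxxxx4e25xxxxxxxx (x taken as 0), mask 0x00000FFFF00000000.
CONST_HW = (0x4E25 << 32, 0xFFFF << 32)

# Hypothetical placeholders, chosen only so that the masks do not overlap.
BUNDLE_ENCODING = 0x1 << 64
A_NOP = (0x00A0 << 48, 0xFFFF << 48)
SP_ADJ = (0x1234 << 16, 0xFFFF << 16)
MOVI = (0xBEEF, 0xFFFF)


def encode_bundle(bundle_encoding, fields):
    """OR the masked encoding of every machine instruction into the bundle word."""
    word = bundle_encoding
    used = 0
    for encoding, mask in fields:
        if used & mask:
            raise ValueError("instruction masks overlap")
        used |= mask
        word |= encoding & mask   # corresponds to encodingMask0..3 in the text
    return word


vliw = encode_bundle(BUNDLE_ENCODING, [A_NOP, CONST_HW, SP_ADJ, MOVI])
print(hex(vliw))  # 0x100a04e251234beef
```

Because each instruction contributes only the bits selected by its own mask, the order in which the fields are ORed does not affect the result.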
[0283] It should be noted that Tables 1 to 6 above are all illustrative examples, and some tables do not show all the information, and do not limit the methods of the embodiments of this application.
[0284] In summary, the program compilation method provided in this application schedules the instruction sequence to obtain multiple VLIWs. Each VLIW includes at least one instruction in the instruction sequence, and the at least one instruction is located in at least one target root slot. Based on the target root slot where each instruction in the VLIW is located, the optional VLIW modes of the VLIW are determined from the multiple VLIW modes. An optional VLIW mode includes a target optional sub-slot for each instruction in the VLIW, and the target optional sub-slot of an instruction is a sub-slot of the target root slot where the instruction is located. Based on the slot corresponding to each instruction in the shortest optional VLIW mode, the machine instruction corresponding to each instruction in the VLIW is determined from the first mapping relationship, which represents the machine instruction of each instruction in each slot. Finally, based on the machine instruction corresponding to each instruction in the VLIW, the shortest optional VLIW mode, and the second mapping relationship, the encoded instruction of each instruction in the VLIW is determined, yielding the VLIW encoding. This process supports the compilation of programs for VLI designs in which instructions can be placed in different slots (including root slots and sub-slots). It can generate binary or assembly files targeting a VLI-designed processor, solving the complex compilation problems caused by the highly flexible VLI design and parallel instruction issuance, and fully supporting the flexibility of VLI design encoding. Because the shortest VLIW encoding is determined step by step through the first and second mapping relationships, the memory footprint of the compiled program is effectively reduced compared with existing technologies, without affecting the flexibility, compilation efficiency, or code execution efficiency of the VLI design.
Furthermore, the method of determining the shortest VLIW encoding through the first and second mapping relationships can maintain the framework of mainstream compilers, improving the versatility and scalability of the solution.
[0285] When scheduling an instruction sequence, the bypass equivalence of each instruction is first determined based on the dependency graph of the instruction sequence. Then, the weights of the edges between bypass-equivalent instructions and their successors in the dependency graph are updated. The height offset records of bypass-inequivalent instructions are determined; each record includes the height offset of the bypass-inequivalent instruction when it is executed on the resource indicated by the corresponding scheduling information. Finally, based on the dependency graph, the depth offset records of each instruction in the instruction sequence, and the height offset records of the bypass-inequivalent instructions, the ready instruction set for the current cycle is determined, and each ready instruction in the ready instruction set is scheduled sequentially. The ready instructions in the ready instruction set are arranged in priority order. This process incorporates the bypass selection of each instruction into the scheduling process through the bypass equivalence of each instruction and the dependency graph. It achieves instruction scheduling when the same instruction has different encodings in different slots and different VLIW modes, and when instruction bypass is optional, laying the foundation for subsequent program compilation that supports VLI design. Furthermore, this scheduling process relies on the attribute that a sub-slot is contained within a root slot: only root slots are used for instruction scheduling, and the scheduling result is also represented solely through root slots. This eliminates the need for sub-slots to participate in instruction scheduling, thereby reducing the scheduling scale and improving scheduling efficiency.
[0286] Furthermore, the instructions in the instruction sequence can be compound coded pseudo-instructions. Compound coded pseudo-instructions allow instructions with the same functionality but different resource consumption and encoding to be scheduled uniformly, maximizing the utilization of the instruction's ILP characteristics and thus achieving better scheduling performance. This effectively supports further compilation of flexible VLI.
[0287] The order of the methods provided in the embodiments of this application can be adjusted appropriately, and processes can be added or removed as needed. For example, if the encoding of the same slot is identical across different VLIW modes in a VLI design, then the aforementioned process 204 may not be executed. Any variations that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the protection scope of this application, and the embodiments of this application do not limit this.
[0288] The foregoing primarily describes the program compilation method provided in the embodiments of this application from the perspective of the device. It is understood that, in order to achieve the above functions, the device includes the corresponding hardware structure and / or software modules for executing each function. Those skilled in the art should readily recognize that, in conjunction with the algorithm steps of the examples described in the embodiments disclosed herein, this application can be implemented in hardware or a combination of hardware and computer software. Whether a function is executed in hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0289] This application embodiment can divide the device into functional modules according to the above method example. For example, each function can be divided into its own functional module, or two or more functions can be integrated into one processing module. The integrated module can be implemented in hardware or as a software functional module. It should be noted that the module division in this application embodiment is illustrative and only represents one logical functional division. In actual implementation, there may be other division methods.
[0290] Figure 25 is a block diagram of a program compilation apparatus provided in an embodiment of this application. Exemplarily, the program compilation apparatus can be the aforementioned compiler, or a chip therein, or other combined devices or components having the aforementioned program compilation function. When each functional module is divided according to its corresponding function, the program compilation apparatus 300 includes:
[0291] The scheduling module 301 is used to schedule the instruction sequence to obtain multiple Very Long Instruction Words (VLIWs). Each VLIW includes at least one instruction from the instruction sequence. The at least one instruction is located in at least one target root slot, and the target root slot indicates the resources required to execute the placed instruction.
[0292] The first determining module 302 is used to determine the optional VLIW mode of the VLIW from multiple VLIW modes based on the target root slot where each instruction in the VLIW is located. The optional VLIW mode includes a target optional sub-slot for each instruction in the VLIW. The target optional sub-slot of the instruction is a sub-slot of the target root slot where the instruction is located. The types of instructions that the target root slot supports include the types of instructions that the sub-slots of the target root slot support.
[0293] The second determining module 303 is used to determine the encoding instruction of each instruction in the VLIW according to the shortest optional VLIW mode among the optional VLIW modes of the VLIW, and obtain the VLIW encoding corresponding to the VLIW.
[0294] In conjunction with the above scheme, each instruction corresponds to at least one piece of scheduling information, which indicates the resources that the instruction can occupy at each stage. The scheduling module 301 is specifically used to: determine the bypass equivalence of each instruction in the instruction sequence according to the dependency graph of the instruction sequence; update the weight of the edge between each bypass-equivalent instruction and its successor instruction in the dependency graph; determine the height offset record of each bypass-inequivalent instruction, which includes the height offset of the bypass-inequivalent instruction when it is executed on the resource indicated by the corresponding scheduling information; and perform instruction scheduling for the current cycle based on the dependency graph and the height offset records of the bypass-inequivalent instructions.
[0295] In conjunction with the above scheme, the scheduling module 301 is specifically used to: determine the ready instruction set for the current cycle based on the dependency graph, the depth offset record of each instruction in the instruction sequence, and the height offset records of the bypass-inequivalent instructions, wherein the ready instructions in the ready instruction set are arranged in priority order, and the depth offset record includes the depth offset of the instruction when it is executed on the resource indicated by the corresponding scheduling information; and schedule each ready instruction in the ready instruction set in sequence.
[0296] In conjunction with the above scheme, the scheduling module 301 is specifically used to: determine at least one ready instruction for the current cycle based on the depth of the unscheduled instructions in the dependency graph and the depth offset records of the instructions in the instruction sequence; and sort the at least one ready instruction by priority based on the priority parameter of each ready instruction to obtain the ready instruction set, wherein the priority parameter of the ready instruction includes at least one of the following: the height of the ready instruction in the dependency graph, the number of successor instructions of the ready instruction in the dependency graph, and the height offset record of the ready instruction.
[0297] In conjunction with the above scheme, Figure 26 is a block diagram of another program compilation apparatus provided in an embodiment of this application. On the basis of Figure 25, the apparatus 300 further includes:
[0298] The conversion module 304 is used to convert each instruction in the intermediate code sequence into a composite coded pseudo-instruction to obtain the instruction sequence.
[0299] In conjunction with the above scheme, the first determining module 302 is specifically used for: determining the target optional sub-slot of each instruction in the VLIW based on the target root slot, optional sub-slots, and slot information of each instruction in the VLIW, wherein the slot information includes the correspondence between the root slot and the sub-slots included in the root slot; arranging and combining the target optional sub-slots of each instruction in the VLIW to obtain at least one initial VLIW mode; and determining the overlapping VLIW mode of the at least one initial VLIW mode and the multiple VLIW modes as the optional VLIW mode of the VLIW.
[0300] In conjunction with the above scheme, the second determining module 303 is specifically used for: determining the machine instruction corresponding to each instruction in the VLIW from the first mapping relationship based on the slot corresponding to each instruction in the shortest selectable VLIW mode, wherein the first mapping relationship represents the machine instruction of the instruction in each slot; and determining the encoded instruction of each instruction in the VLIW based on the machine instruction corresponding to each instruction in the VLIW, the shortest selectable VLIW mode, and the second mapping relationship, wherein the second mapping relationship represents the encoding of the machine instruction in each VLIW mode.
[0301] Figure 27 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. The electronic device 400 can be a compiler, or a chip or functional module within a compiler. As shown in Figure 27, the electronic device 400 includes a processor 401, a transceiver 402, and a communication line 403.
[0302] The processor 401 is used to perform any step of the method embodiments shown in Figure 4, Figure 5, Figure 7, and Figure 11. When performing processes such as sending and receiving instructions, the processor 401 may invoke the transceiver 402 and the communication line 403 to complete the corresponding operations.
[0303] Furthermore, the electronic device 400 may also include a memory 404. The processor 401, memory 404, and transceiver 402 can be connected via a communication line 403.
[0304] The processor 401 can be a processor, a general-purpose processor, a network processor (NP), a digital signal processor (DSP), a microprocessor, a microcontroller, a programmable logic device (PLD), or any combination thereof. The processor 401 can also be other devices with processing capabilities, such as circuits, devices, or software modules, without limitation.
[0305] Transceiver 402 is used to communicate with other devices or other communication networks, such as Ethernet, radio access network (RAN), wireless local area network (WLAN), etc. Transceiver 402 can be a module, circuit, transceiver, or any device capable of enabling communication.
[0306] The transceiver 402 is mainly used for sending and receiving instructions, and may include a transmitter and a receiver to send and receive instructions respectively. Operations other than sending and receiving instructions are implemented by the processor, such as scheduling instruction sequences and determining the optional VLIW mode from multiple VLIW modes.
[0307] Communication line 403 is used to transmit information between the various components included in electronic device 400.
[0308] In one design, the processor can be viewed as a logic circuit, and the transceiver as an interface circuit.
[0309] Memory 404 is used to store instructions. These instructions can be computer programs.
[0310] The memory 404 can be volatile memory or non-volatile memory, or it can include both. The non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory can be random access memory (RAM), which is used as an external cache. By way of example, but not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous DRAM (SDRAM), double data rate synchronous DRAM (DDR SDRAM), enhanced synchronous DRAM (ESDRAM), synchronous linked DRAM (SLDRAM), and direct rambus RAM (DRRAM). Memory 404 can also be a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media, or other magnetic storage devices. It should be noted that the memory in the systems and methods described herein is intended to include, but is not limited to, these and any other suitable types of memory.
[0311] It should be noted that the memory 404 can exist independently of the processor 401, or it can be integrated with the processor 401. The memory 404 can be used to store instructions, program code, or some data, etc. The memory 404 can be located inside or outside the electronic device 400, without limitation. The processor 401 is used to execute the instructions stored in the memory 404 to implement the methods provided in the above embodiments of this application.
[0312] In one example, processor 401 may include one or more processors, for example, processor 0 and processor 1 in Figure 27.
[0313] As an optional implementation, the electronic device 400 includes multiple processors; for example, in addition to processor 401 in Figure 27, it may also include processor 407.
[0314] As an optional implementation, the electronic device 400 also includes an output device 405 and an input device 406. For example, the input device 406 is a device such as a keyboard, mouse, microphone, or joystick, and the output device 405 is a device such as a display screen or speaker.
[0315] It should be noted that the electronic device 400 can be a chip system or a device with a structure similar to that in Figure 27. The chip system can be composed of chips or include chips and other discrete components. Actions, terminology, etc., involved in the various embodiments of this application can be referenced interchangeably without limitation. The message names or parameter names in the messages used for interaction between devices in the embodiments of this application are merely examples; other names can be used in specific implementations without limitation. Furthermore, the structural composition shown in Figure 27 does not constitute a limitation on the electronic device 400. In addition to the components shown in Figure 27, the electronic device 400 may include more or fewer components than shown in Figure 27, or combine certain components, or have a different component arrangement.
[0316] The processor and transceiver described in this application can be implemented on integrated circuits (ICs), analog ICs, radio frequency integrated circuits, mixed-signal ICs, application-specific integrated circuits (ASICs), printed circuit boards (PCBs), electronic devices, etc. The processor and transceiver can also be manufactured using various IC process technologies, such as complementary metal-oxide semiconductors (CMOS), n-metal-oxide-semiconductor (NMOS), positive-channel metal-oxide semiconductors (PMOS), bipolar junction transistors (BJTs), bipolar CMOS (BiCMOS), silicon germanium (SiGe), gallium arsenide (GaAs), etc.
[0317] Figure 28This is a schematic diagram of a program compilation apparatus provided in an embodiment of this application. This program compilation apparatus is applicable to the scenarios shown in the above method embodiments. For ease of explanation, Figure 28 Only the main components of the program's compilation device are shown, including the processor, memory, control circuitry, and input / output devices. The processor is primarily used to process communication protocols and data, execute software programs, and process the software program's data. The memory is mainly used to store the software program and data. The control circuitry is mainly used for power supply and the transmission of various electrical signals. The input / output devices are mainly used to receive user input data and output data to the user.
[0318] When the program compilation apparatus is a compiler, the control circuit can be a motherboard, and the memory includes storage media such as hard disks, RAM, and ROM. The processor can include a baseband processor and a central processing unit (CPU). The baseband processor is mainly used to process communication protocols and communication data, while the CPU is mainly used to control the entire program compilation apparatus, execute the software program, and process the software program's data. Input/output devices include displays, keyboards, and mice. The control circuit can further include or be connected to transceiver circuits or transceivers, such as network cable interfaces, for sending or receiving data or signals, for example for data transmission and communication with other devices. Furthermore, it can also include an antenna for transmitting and receiving wireless signals for data/signal transmission with other devices.
[0319] According to the method provided in the embodiments of this application, this application also provides a computer program product, which includes computer program code. When the computer program code is run on a computer, it causes the computer to execute any of the methods described in the embodiments of this application.
[0320] This application also provides a computer-readable storage medium. All or part of the processes in the above method embodiments can be executed by a computer or a device with program compilation capabilities, using computer programs or instructions to control related hardware. The computer program or set of instructions can be stored in the aforementioned computer-readable storage medium. When executed, the computer program or set of instructions can include the processes described in the above method embodiments. The computer-readable storage medium can be an internal storage unit of the compiler in any of the foregoing embodiments, such as the compiler's hard disk or memory. The aforementioned computer-readable storage medium can also be an external storage device of the compiler, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, etc., equipped on the compiler. Further, the aforementioned computer-readable storage medium can include both the compiler's internal storage unit and external storage devices. The aforementioned computer-readable storage medium is used to store the aforementioned computer program or instructions, as well as other programs and data required by the compiler. The aforementioned computer-readable storage medium can also be used to temporarily store data that has been output or will be output.
[0321] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0322] Those skilled in the art will understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be found by referring to the corresponding processes in the foregoing method embodiments, and are not repeated here.
[0323] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division of units is only a logical functional division, and in actual implementation there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or of other forms.
[0324] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0325] In addition, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.
[0326] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, ROM, RAM, magnetic disks, or optical disks.
[0327] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A method of compiling an instruction, characterized in that the method includes: scheduling an instruction sequence to obtain multiple very long instruction words (VLIWs), where each VLIW includes at least one instruction from the instruction sequence, the at least one instruction is located in at least one target root slot, and the target root slot indicates the resources required to execute the instruction placed in it; determining, based on the target root slot where each instruction in the VLIW is located, the optional VLIW modes of the VLIW from multiple VLIW modes, where each optional VLIW mode includes a target optional sub-slot for each instruction in the VLIW, the target optional sub-slot of an instruction is a sub-slot of the target root slot where the instruction is located, and the instructions that can be placed in the target root slot include the instructions that can be placed in the sub-slots of the target root slot; and determining, based on the shortest optional VLIW mode among the optional VLIW modes of the VLIW, the encoding instruction for each instruction in the VLIW, to obtain the VLIW encoding corresponding to the VLIW.
2. The method of claim 1, wherein each instruction corresponds to at least one piece of scheduling information, which indicates the resources available for each stage of the instruction, and scheduling the instruction sequence to obtain multiple very long instruction words (VLIWs) includes: determining, based on the dependency graph of the instruction sequence, the bypass equivalence of each instruction in the instruction sequence; updating the weights of the edges in the dependency graph between the instructions that have a bypass and are bypass-equivalent and their successors; determining the height offset record of the bypass-inequivalent instruction, the height offset record including the height offset of the bypass-inequivalent instruction when it is executed on the resource indicated by the corresponding scheduling information; and performing instruction scheduling for the current cycle based on the dependency graph and the height offset record of the bypass-inequivalent instruction.
3. The method of claim 2, wherein performing instruction scheduling for the current cycle based on the dependency graph and the height offset record of the bypass-inequivalent instruction includes: determining the ready instruction set for the current cycle based on the dependency graph, the depth offset record of each instruction in the instruction sequence, and the height offset record of the bypass-inequivalent instruction, where the ready instructions in the ready instruction set are arranged in order of priority, and the depth offset record includes the depth offset of the instruction when it is executed on the resource indicated by the corresponding scheduling information; and scheduling each ready instruction in the ready instruction set in sequence.
4. The method of claim 3, wherein determining the ready instruction set for the current cycle based on the dependency graph, the depth offset record of each instruction in the instruction sequence, and the height offset record of the bypass-inequivalent instruction includes: determining at least one ready instruction for the current cycle using the depth offset record and the depth, in the dependency graph, of the unscheduled instructions in the instruction sequence; and prioritizing the at least one ready instruction based on the priority parameter of each ready instruction to obtain the ready instruction set, where the priority parameter of a ready instruction includes at least one of the following: the height of the ready instruction in the dependency graph, the number of successor instructions of the ready instruction in the dependency graph, and the height offset record of the ready instruction.
5. The method according to any one of claims 1 to 4, characterized in that the method further includes: converting each instruction in the intermediate code sequence into a composite coded pseudo-instruction to obtain the instruction sequence.
6. The method according to any one of claims 1 to 5, characterized in that determining the optional VLIW mode of the VLIW from multiple VLIW modes based on the target root slot where each instruction in the VLIW is located includes: determining the target optional sub-slots of each instruction in the VLIW based on the target root slot, the optional sub-slots, and the slot information of each instruction in the VLIW, where the slot information includes the correspondence between each root slot and the sub-slots included in that root slot; arranging and combining the target optional sub-slots of each instruction in the VLIW to obtain at least one initial VLIW mode; and determining the VLIW modes that appear in both the at least one initial VLIW mode and the multiple VLIW modes as the optional VLIW modes of the VLIW.
7. The method according to any one of claims 1 to 6, characterized in that determining the encoding instruction for each instruction in the VLIW based on the shortest optional VLIW mode among the optional VLIW modes includes: determining, from the first mapping relationship, the machine instruction corresponding to each instruction in the VLIW based on the slot corresponding to that instruction in the shortest optional VLIW mode, where the first mapping relationship represents the machine instruction of an instruction in each slot; and determining the encoding instruction for each instruction in the VLIW based on the machine instruction corresponding to that instruction, the shortest optional VLIW mode, and the second mapping relationship, where the second mapping relationship represents the encoding of a machine instruction in each VLIW mode.
8. An instruction compilation apparatus, characterized in that the apparatus includes: a scheduling module, used to schedule an instruction sequence to obtain multiple very long instruction words (VLIWs), where each VLIW includes at least one instruction from the instruction sequence, the at least one instruction is located in at least one target root slot, and the target root slot indicates the resources required to execute the instruction placed in it; a first determining module, used to determine the optional VLIW modes of the VLIW from multiple VLIW modes based on the target root slot where each instruction in the VLIW is located, where each optional VLIW mode includes a target optional sub-slot for each instruction in the VLIW, the target optional sub-slot of an instruction is a sub-slot of the target root slot where the instruction is located, and the types of instructions that can be placed in the target root slot include the types of instructions that can be placed in the sub-slots of the target root slot; and a second determining module, used to determine the encoding instruction of each instruction in the VLIW based on the shortest optional VLIW mode among the optional VLIW modes of the VLIW, to obtain the VLIW encoding corresponding to the VLIW.
9. The apparatus of claim 8, wherein each instruction corresponds to at least one piece of scheduling information, which indicates the resources available for each stage of the instruction, and the scheduling module is specifically used for: determining, based on the dependency graph of the instruction sequence, the bypass equivalence of each instruction in the instruction sequence; updating the weights of the edges in the dependency graph between the instructions that have a bypass and are bypass-equivalent and their successors; determining the height offset record of the bypass-inequivalent instruction, the height offset record including the height offset of the bypass-inequivalent instruction when it is executed on the resource indicated by the corresponding scheduling information; and performing instruction scheduling for the current cycle based on the dependency graph and the height offset record of the bypass-inequivalent instruction.
10. The apparatus of claim 9, wherein the scheduling module is specifically used for: determining the ready instruction set for the current cycle based on the dependency graph, the depth offset record of each instruction in the instruction sequence, and the height offset record of the bypass-inequivalent instruction, where the ready instructions in the ready instruction set are arranged in order of priority, and the depth offset record includes the depth offset of the instruction when it is executed on the resource indicated by the corresponding scheduling information; and scheduling each ready instruction in the ready instruction set in sequence.
11. The apparatus of claim 10, wherein the scheduling module is specifically used for: determining at least one ready instruction for the current cycle using the depth offset record and the depth, in the dependency graph, of the unscheduled instructions in the instruction sequence; and prioritizing the at least one ready instruction based on the priority parameter of each ready instruction to obtain the ready instruction set, where the priority parameter of a ready instruction includes at least one of the following: the height of the ready instruction in the dependency graph, the number of successor instructions of the ready instruction in the dependency graph, and the height offset record of the ready instruction.
12. The apparatus according to any one of claims 8 to 11, characterized in that the apparatus further includes: a conversion module, used to convert each instruction in the intermediate code sequence into a composite coded pseudo-instruction to obtain the instruction sequence.
13. The apparatus of any one of claims 8 to 12, wherein the first determining module is specifically used for: determining the target optional sub-slots of each instruction in the VLIW based on the target root slot, the optional sub-slots, and the slot information of each instruction in the VLIW, where the slot information includes the correspondence between each root slot and the sub-slots included in that root slot; arranging and combining the target optional sub-slots of each instruction in the VLIW to obtain at least one initial VLIW mode; and determining the VLIW modes that appear in both the at least one initial VLIW mode and the multiple VLIW modes as the optional VLIW modes of the VLIW.
14. The apparatus of any one of claims 8 to 13, wherein the second determining module is specifically used for: determining, from the first mapping relationship, the machine instruction corresponding to each instruction in the VLIW based on the slot corresponding to that instruction in the shortest optional VLIW mode, where the first mapping relationship represents the machine instruction of an instruction in each slot; and determining the encoding instruction for each instruction in the VLIW based on the machine instruction corresponding to that instruction, the shortest optional VLIW mode, and the second mapping relationship, where the second mapping relationship represents the encoding of a machine instruction in each VLIW mode.
15. A computer-readable storage medium, characterized in that it includes a computer program or instructions that, when executed on a computer, cause the computer to perform the method as described in any one of claims 1 to 7.
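Illustrative example (non-limiting). The mode selection recited in claims 6 and 7 can be sketched as follows: the target optional sub-slots of each instruction are combined into initial VLIW modes, the initial modes are intersected with the supported VLIW modes, and the shortest surviving mode is chosen. This is a minimal sketch under stated assumptions, not the implementation of this application. The slot names, the table contents, the bit lengths, and the function names `optional_modes` and `shortest_mode` are all hypothetical, and a mode is assumed to be a tuple of sub-slots listed in the same order as the instructions of the VLIW.

```python
from itertools import product

# Hypothetical slot information: root slot -> sub-slots it contains.
SLOT_INFO = {
    "ALU": ["ALU_FULL", "ALU_SHORT"],
    "MEM": ["MEM_FULL", "MEM_SHORT"],
}

# Hypothetical table of supported VLIW modes:
# tuple of sub-slots -> encoded length in bits.
VLIW_MODES = {
    ("ALU_FULL", "MEM_FULL"): 64,
    ("ALU_SHORT", "MEM_FULL"): 48,
    ("ALU_SHORT", "MEM_SHORT"): 32,
}


def optional_modes(vliw, optional_sub_slots):
    """Return the supported modes that fit every instruction of the VLIW.

    vliw: list of (instruction, target root slot) pairs.
    optional_sub_slots: instruction -> set of sub-slots it may occupy.
    """
    per_instruction = []
    for instr, root in vliw:
        # Target optional sub-slots: sub-slots of the instruction's root
        # slot that the instruction is also allowed to occupy.
        targets = [s for s in SLOT_INFO[root] if s in optional_sub_slots[instr]]
        per_instruction.append(targets)
    # Arrange and combine the per-instruction choices into initial modes.
    initial = set(product(*per_instruction))
    # Keep only the initial modes that are also supported modes.
    return sorted(m for m in initial if m in VLIW_MODES)


def shortest_mode(modes):
    """Pick the mode with the smallest encoded length (ties broken by name)."""
    return min(modes, key=lambda m: (VLIW_MODES[m], m))


# Hypothetical VLIW: "add" in root slot ALU, "load" in root slot MEM.
vliw = [("add", "ALU"), ("load", "MEM")]
allowed = {"add": {"ALU_FULL", "ALU_SHORT"}, "load": {"MEM_FULL"}}

candidates = optional_modes(vliw, allowed)
best = shortest_mode(candidates)
print(candidates)
print(best, VLIW_MODES[best])
```

In this sketch only the root slot is fixed during scheduling, and the sub-slot choice is deferred until the candidate modes are enumerated, which is the reduction in sub-slot involvement that the claims describe. Looking up the machine instruction and its encoding through the first and second mapping relationships would follow once `best` is known.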