A method and system for automatically generating assembly code for a Shenwei many-core processor

By performing line-by-line analysis and dependency graph construction on high-level scripting language code segments, the problem of insufficient code optimization on domestic heterogeneous many-core processors was solved, achieving efficient automatic generation of assembly code and improving development efficiency and code correctness.

CN121934826B · Active · Publication Date: 2026-06-26 · SHANDONG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: SHANDONG UNIV
Filing Date: 2026-03-27
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing technologies lack sufficient code optimization capabilities on domestically produced heterogeneous many-core processors. Manually writing assembly code is inefficient, error-prone, and makes it difficult to generate optimal code.

Method used

An automatic assembly code generation method for the Sunway many-core processor is adopted. By analyzing the code segments of the high-level scripting language line by line, a dependency graph is constructed, instructions are distributed to the operation and memory access queues, and a dual-channel static rearrangement is performed based on the list scheduling algorithm to generate a dual-channel instruction pair sequence, which is then output as assembly code.

Benefits of technology

It keeps the dual-issue pipeline as full as the dependency constraints allow, improves development efficiency, removes the guesswork of manual scheduling, and ensures the logical correctness and maintainability of the generated code.

Abstract

The application provides a method and system for automatically generating assembly code for a Shenwei many-core processor, and relates to the technical field of computer architecture and high-performance computing software optimization. The method comprises: performing line-by-line analysis on a high-level scripting language code segment to obtain a variable register mapping table and a linear instruction list; performing data dependency analysis between instructions to construct a dependency graph containing read-after-write dependencies and write-after-read dependencies; dividing the instructions in the linear instruction list into an operation instruction queue and a memory access instruction queue, performing dual-channel static rearrangement on the instructions in the two queues, and generating a dual-channel instruction pair sequence; and outputting the dual-channel instruction pair sequence in the dual-issue syntax format of the Shenwei assembler to generate assembly code for the Shenwei many-core processor. The application maximizes the filling of the hardware pipeline and fully releases the peak computing power of the domestic heterogeneous many-core processor, thereby providing key underlying performance support for large-scale scientific computing applications such as molecular dynamics simulation.

Description

Technical Field

[0001] This invention relates to the field of computer architecture and high-performance computing software optimization technology, specifically to an automatic assembly code generation method and system for the Sunway many-core processor.

Background Technology

[0002] With the continuous improvement of supercomputer performance, heterogeneous many-core processors, represented by the domestically produced Sunway supercomputer, have become the mainstream hardware platform for high-performance computing. These processors typically adopt an architecture of "control core (MPE) + compute core array (CPE)," where each CPE has its own local scratchpad memory (LDM) and a large register file. They also employ Reduced Instruction Set Computing (RISC) and SIMD vector technology to provide powerful floating-point computing capabilities. However, the improvement in hardware computing power has brought enormous challenges to software development: to reach peak performance, low-level optimizations must be performed for the specific hardware architecture.

[0003] While mainstream high-level scripting language compilers are general-purpose, they often struggle to generate optimal code for many-core processors, which have unique microarchitectural features. Domestic many-core processors typically employ static dual-issue pipelines and lack out-of-order execution capability. This means that the instruction issue order must be fixed by software at compile time. General-purpose compilers struggle to estimate the long and varied latencies of individual instructions accurately, which leaves numerous pipeline bubbles in the generated code and severely reduces computational density. Furthermore, scientific computing often requires a large number of temporary variables. Even when a large register file is available, the register allocation algorithms of general-purpose compilers tend to be conservative, favoring correctness over full use of the registers, which leads to unnecessary register spills, frequent memory accesses, and performance degradation.

[0004] To overcome the shortcomings of compilers, programmers often have to hand-write the core operators of critical applications such as molecular dynamics in inline assembly. However, manual assembly development is extremely inefficient: programmers must manage register numbers and calculate memory offsets by hand, which is error-prone. To hide the latency of memory access and computation, they must design complex "software pipelines" and manually interleave memory access instructions with arithmetic instructions. When dozens or even hundreds of instructions are involved, finding the optimal schedule by hand is almost impossible. Whenever the algorithm is adjusted, all register allocations and instruction sequences must be re-derived, so code reusability is extremely poor. In these core loops, wasting even one clock cycle can lead to a large performance loss once it is amplified over hundreds of billions of iterations.

[0005] Existing general-purpose compilation methods have insufficient code optimization capabilities on domestically produced heterogeneous many-core processors, while manually writing assembly code is inefficient and error-prone, which makes it difficult to generate optimal code.

Summary of the Invention

[0006] To address the aforementioned issues, this invention proposes an automatic assembly code generation method and system for the Sunway many-core processor. By combining the flexibility of high-level scripting languages with the underlying hardware architecture characteristics of the Sunway many-core processor, this method maximizes the filling of the hardware pipeline, fully unleashes the peak computing power of the domestically produced heterogeneous many-core processor, and provides crucial underlying performance support for large-scale scientific computing applications such as molecular dynamics simulations.

[0007] According to some embodiments, the present invention adopts the following technical solution:

[0008] An automatic assembly code generation method for the Sunway many-core processor includes:

[0009] The code segment of the high-level scripting language to be compiled is analyzed line by line to obtain a variable register mapping table and a linear instruction list consisting of intermediate representation instructions;

[0010] Based on the variable register mapping table and the linear instruction list, data dependency analysis is performed between instructions, and a dependency graph is constructed that includes write-after-read dependencies and read-after-write dependencies;

[0011] According to the dependency graph, the instructions in the linear instruction list are split into the operation instruction queue and the memory access instruction queue. Based on the list scheduling algorithm, the instructions in the two queues are statically rearranged in a dual-channel manner. In each simulated clock cycle, one instruction that meets the dependency constraints is selected from each of the two queues and the two are issued as a pair. Empty instructions are filled into any channel whose dependency constraints are not met, thus generating a dual-channel instruction pair sequence;

[0012] The dual-channel instruction pair sequence is output as assembly code according to the dual-issue syntax format of the Sunway assembler, and used in the Sunway many-core processor.

[0013] According to some embodiments, the present invention adopts the following technical solution:

[0014] An automatic assembly code generation system for the Sunway many-core processor includes:

[0015] The code analysis module is configured to perform line-by-line analysis of the high-level scripting language code segment to be compiled, and obtain a variable register mapping table and a linear instruction list consisting of intermediate representation instructions;

[0016] The dependency building module is configured to: perform data dependency analysis between instructions based on the variable register map and the linear instruction list, and construct a dependency graph that includes write-after-read dependencies and read-after-write dependencies;

[0017] The instruction reordering module is configured to: based on the dependency graph, distribute the instructions in the linear instruction list to the operation instruction queue and the memory access instruction queue; perform dual-channel static reordering of the instructions in the two queues based on the list scheduling algorithm; select one instruction that satisfies the dependency constraints from each of the two queues in each simulated clock cycle and issue them as a pair; and fill empty instructions into any channel whose dependency constraints are not satisfied, thereby generating a dual-channel instruction pair sequence;

[0018] The assembly output module is configured to output the dual-channel instruction pair sequence as assembly code according to the dual-issue syntax format of the Sunway assembler, for use in the Sunway many-core processor.

[0019] According to some embodiments, the present invention adopts the following technical solution:

[0020] A computer program product includes a computer program that, when executed by a processor, implements the aforementioned method for automatically generating assembly code for the Sunway many-core processor.

[0021] According to some embodiments, the present invention adopts the following technical solution:

[0022] A non-transitory computer-readable storage medium is provided for storing computer instructions, which, when executed by a processor, implement the aforementioned method for automatically generating assembly code for the Sunway many-core processor.

[0023] According to some embodiments, the present invention adopts the following technical solution:

[0024] An electronic device includes a processor, a memory, and a computer program; wherein the processor is connected to the memory, the computer program is stored in the memory, and when the electronic device is running, the processor executes the computer program stored in the memory to enable the electronic device to execute an automatic assembly code generation method for the Sunway many-core processor.

[0025] Compared with the prior art, the beneficial effects of the present invention are as follows:

[0026] Extreme pipeline parallelism: By accurately modeling the latency of each instruction, the scheduler can automatically place independent instructions between dependent ones to hide long latencies, maximizing the filling of the dual-issue pipeline and removing the guesswork from manual scheduling.

[0027] Significantly improved development efficiency: Developers can use the variable, loop, and function features of high-level scripting languages to describe complex mathematical operators without having to manually manage hundreds or thousands of register numbers and stack offsets. This significantly reduces the amount of code and makes it easier to maintain and port.

[0028] Correctness guarantee: The automated dependency analysis mechanism (RAW / WAR check) fundamentally eliminates the "data race" and "pipeline conflict" errors common in manual assembly, ensuring the logical correctness of the generated code.

Description of the Drawings

[0029] The accompanying drawings, which form part of this invention, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an improper limitation of the invention.

[0030] Figure 1 is a flowchart of the method in Example 1.

[0031] Figure 2 is a flowchart of the code analysis in Example 1.

[0032] Figure 3 is an example mapping table for Example 1.

[0033] Figure 4 is an example of the dual-channel instruction pair sequence in Example 1.

Detailed Implementation

[0034] The present invention will be further described below with reference to the accompanying drawings and embodiments.

[0035] It should be noted that the following detailed descriptions are exemplary and intended to provide further illustration of the invention. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.

[0036] It should be noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of exemplary embodiments according to the invention. As used herein, the singular form is intended to include the plural form as well, unless the context clearly indicates otherwise. Furthermore, it should be understood that when the terms "comprising" and / or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and / or combinations thereof.

[0037] Example 1

[0038] One embodiment of the present invention provides an automatic assembly code generation method for the Sunway many-core processor which, as shown in Figure 1, includes:

[0039] Step S1: Perform line-by-line analysis of the high-level scripting language code segment to be compiled to obtain a variable register mapping table and a linear instruction list consisting of intermediate representation instructions;

[0040] As shown in Figure 2, the system identifies variable declaration statements and arithmetic operation statements in each code line; for variable declaration statements, it calls the virtual register dynamic allocation mechanism to allocate physical register numbers for them and stores the mapping relationship between variable names and register numbers in the variable register mapping table; for arithmetic operation statements, it calls the corresponding instruction abstraction model interface to generate intermediate representation instructions, including:

[0041] (1) Define the instruction abstract model interface:

[0042] Define the instruction abstract model interface for the Sunway many-core processor. Write corresponding instruction abstract model interfaces for each type of operation (including vector double-precision storage vstd, vector double-precision loading vldd, vector word shuffling vsshfw, vector double-precision multiply-accumulate vsmad, and loading immediate value ldi). Analyze the code characteristics of each type of operation and write an interface to convert the input code into an intermediate representation instruction of a 5-tuple. The 5-tuple includes: instruction text, queue type, hardware latency period, target operand set, and source operand set.

[0043] In the automatic assembly code generation method for the Sunway many-core processor, a standardized instruction abstraction model interface is defined to shield the complex microarchitectural details of the underlying hardware. A corresponding instruction generation function is written for each type of underlying operation. The core mechanism of these functions is to receive operands (such as allocated physical register numbers, memory base addresses, or offsets) from the higher logic layer and, strictly following the hardware characteristics of the Sunway slave core, uniformly convert and encapsulate them into a standard 5-tuple intermediate instruction representation. This 5-tuple records the instruction text, queue type (execution channel), hardware latency cycle, target operand set (registers written), and source operand set (registers read). Its general field setting rules are as follows:

[0044] Instruction text: Using the passed operand parameters, the complete assembly code string is formatted and concatenated according to the underlying assembly syntax rules of the Sunway supercomputer. This field is the entity carrier of the final output target code.

[0045] Queue type: Explicitly marks the hardware execution channel to which the instruction belongs, and assigns a resource tag of 0 (representing arithmetic instruction queue Q0) or 1 (representing memory access instruction queue Q1) to it according to the physical nature of the instruction.

[0046] Hardware latency period: The inherent execution time of this type of instruction on specific hardware is hard-coded. This parameter serves as the timing benchmark for static scheduling, guiding the scheduler to accurately calculate the waiting time of subsequent related instructions, thereby achieving seamless connection between instructions.

[0047] Target operand set: In the form of a data list, this precisely declares the set of all registers that the instruction will write to or modify after execution; if the instruction does not modify any registers, this set is set to empty.

[0048] Source operand set: In the form of a data list, it precisely declares the set of all registers that the instruction must read or depend on during execution.

[0049] By encapsulating each instruction in this quintuple, the instruction generation functions map the diverse underlying hardware instructions into an intermediate representation with a unified data structure. The subsequent scheduling module does not need to parse assembly strings at all; it only needs to read the timing and queue attributes in the quintuple to complete pipeline orchestration through pure mathematical logic. For example, for vector double-precision multiply-accumulate (vmad), the defined instruction generation function is gen_vmad(self, a, b, c, d), where self, a, b, c, and d are the formal parameters of the function. Within the function body, the values of these parameters are formatted directly into the vmad assembly syntax: self.insns.append(("vmad %s,%s, %s, %s" % (a, b, c, d), 0, 7, [d], [a, b, c])), which yields the final quintuple.
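
The quintuple encapsulation described above can be sketched in Python, the language of the gen_vmad fragment in the text. This is a minimal illustration rather than the patent's implementation: the namedtuple field names, the Q_ARITH and Q_MEM constants, the Generator class name, and the gen_vldd signature with its operand syntax are assumptions. The queue tags 0 and 1 and the latencies of 7 cycles for vmad and 5 cycles for vldd are the values given in the description.

```python
from collections import namedtuple

# Field names are illustrative; the description only fixes the order of the
# five fields: text, queue type, latency, target operands, source operands.
Insn = namedtuple("Insn", ["text", "queue", "latency", "defs", "uses"])

Q_ARITH = 0  # arithmetic instruction queue Q0
Q_MEM = 1    # memory access instruction queue Q1


class Generator:
    """Collects unscheduled IR quintuples in program order."""

    def __init__(self):
        self.insns = []  # linear instruction list

    def gen_vmad(self, a, b, c, d):
        # Reads registers a, b, c and writes register d; 7-cycle latency.
        text = "vmad %s, %s, %s, %s" % (a, b, c, d)
        self.insns.append(Insn(text, Q_ARITH, 7, [d], [a, b, c]))

    def gen_vldd(self, dst, offset, base):
        # Hypothetical signature and operand syntax for a vector load:
        # reads the base address register, writes dst; 5-cycle latency.
        text = "vldd %s, %d(%s)" % (dst, offset, base)
        self.insns.append(Insn(text, Q_MEM, 5, [dst], [base]))


gen = Generator()
gen.gen_vldd("$32", 0, "$1")
gen.gen_vmad("$32", "$33", "$34", "$35")
```

Because a namedtuple is still a tuple, a scheduler that indexes the fields positionally would work unchanged with these records.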

[0050] (2) Dynamic allocation of virtual registers:

[0051] A mapping mechanism between symbolic variables and physical registers is established. By maintaining a counter of available physical registers, when the generator receives a new variable declaration request, it automatically allocates the next available physical register number and stores the variable name and physical register number in the mapping table, thereby realizing automated management from high-level language variables to low-level registers.

[0052] (3) Generation of linear instruction sequences:

[0053] The code segment of the high-level scripting language to be compiled is traversed line by line. Each line of code is analyzed. If it is a variable declaration statement, the virtual register dynamic allocation mechanism is called to allocate a physical register number for it, and the mapping relationship between the variable name and the register number is stored in the variable register mapping table. For arithmetic operation statements, the corresponding instruction abstraction model interface is called to generate intermediate representation instructions according to the type of arithmetic operation in the code line.

[0054] The computational logic is described by calling the instruction abstract model interface. Instead of immediately outputting assembly code, the generated instructions are stored in a linear instruction list in sequence in the form of an intermediate representation. This list records the instruction text, execution channel, execution delay, target register write set, and source register read set of each instruction, i.e., a quintuple.

[0055] Here is a specific example:

[0056] Step S1-1: Generator Initialization and Resource Pool Configuration

[0057] In the initial stage of algorithm instantiation, the core data structure is first constructed.

[0058] (1) Instruction buffer initialization: Create an empty linear instruction list self.insns as a container for intermediate representation instructions (IR) to store unscheduled intermediate representation instructions in order.

[0059] (2) Register resource pool configuration: Because the first 32 registers ($0-$31) in the Sunway slave core architecture are usually used for the system stack pointer, the link register, and parameter passing, this embodiment initializes the available physical register counter self.free_regs to 32. Dynamically allocated registers therefore start strictly from $32 and increase monotonically, which keeps them out of the system reserved area and prevents underlying resource conflicts.

[0060] (3) Establishment of variable register mapping table: Initialize the mapping table self.vars, which is used to maintain a two-way mapping between high-level scripting language variable name strings and hardware register number strings.

[0061] Step S1-2: Dynamic allocation logic of virtual registers

[0062] This step binds variables to physical registers. When the upper-level operator description logic requests a variable named `name`, the generator reads the current value N of the available physical register counter `self.free_regs` and generates the corresponding register identifier string `"$N"`. It then stores the key-value pair `{name: "$N"}` in the mapping table `self.vars` and increments `self.free_regs` by 1, ultimately forming the mapping table shown in Figure 3.

[0063] This process ensures that, regardless of the number of temporary variables involved in loop unrolling or complex operator computation, a unique physical register can be assigned to each one, completely eliminating the risk of register name conflicts and overwriting that is common when manually writing assembly code.
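
The allocation logic of steps S1-1 and S1-2 can be sketched as follows. The attribute names free_regs and vars follow the text; the class name RegisterAllocator, the method name alloc, and the first_free parameter are assumptions made for illustration.

```python
class RegisterAllocator:
    """Maps symbolic variable names to monotonically increasing registers."""

    def __init__(self, first_free=32):
        # $0-$31 are treated as reserved, so allocation starts at $32.
        self.free_regs = first_free
        self.vars = {}  # variable name -> register string such as "$32"

    def alloc(self, name):
        reg = "$%d" % self.free_regs   # next available physical register
        self.vars[name] = reg          # record the binding
        self.free_regs += 1            # advance the counter
        return reg


ra = RegisterAllocator()
ra.alloc("x")
ra.alloc("y")
print(ra.vars)
```

The sketch follows the description literally and therefore has no upper bound on the counter. The text does not state the size of the register file, so a practical version would need a limit check that is not shown here.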

[0064] Step S1-3: Construction of Instruction Abstract Model Interface and Intermediate Representation

[0065] During the intermediate representation instruction generation stage, each line of operation statement is encapsulated into an intermediate representation instruction containing quintuple information by calling the corresponding instruction abstract model interface, and then appended to the linear instruction list self.insns.

[0066] The instruction abstract model interface here is written in advance by programmers who analyze the code characteristics of each type of operation. Taking vector double-precision multiply-accumulate (vmad) as an example, the corresponding instruction generation function is gen_vmad, and the generated IR tuple structure (instruction text, queue type, hardware latency period, target operand set, and source operand set) is as follows:

[0067] Instruction text: Generates a string that conforms to assembly syntax based on the passed operands, such as "vmad $35, $59, $44, $44".

[0068] Queue type: Explicitly identifies the hardware pipeline to which this instruction belongs. In this embodiment, 0 represents the arithmetic instruction queue Q0, and 1 represents the memory access instruction queue Q1.

[0069] Hardware latency cycle: Presets the execution cycle of the instruction on specific hardware. For example, vector double multiply-add vmad is hardcoded to 7 cycles, and vector double load vldd is encoded to 5 cycles. This parameter is the timing basis for subsequent static scheduling.

[0070] The target operand set and the source operand set represent the data flow, explicitly recording which registers the instruction reads and writes, and provide the raw data for dependency analysis.

[0071] Step S2: Based on the variable register mapping table and the linear instruction list, perform data dependency analysis between instructions and construct a dependency graph that includes write-after-read dependency and read-after-write dependency;

[0072] Furthermore, constructing a dependency graph includes:

[0073] Traverse the linear instruction list, recording the source operands and destination operands of each instruction; maintain the last-defined register table, which records the index of the instruction that last defined each register and is used to build read-after-write dependencies; maintain the last-used register table, which records the index of the instruction that last used each register and is used to build write-after-read dependencies; store the indexes of the instructions that each instruction depends on in that instruction's read-after-write dependency list and write-after-read dependency list, respectively.

[0074] Specifically, before instruction reordering, the linear instruction list `self.insns` is traversed to construct a dependency graph between instructions. Two temporary dictionaries keyed by register are maintained: `last_def` (the last-defined register table, recording the index of the instruction that last defined, i.e., wrote, each register) and `last_use` (the last-used register table, recording the index of the instruction that last used, i.e., read, each register). Two lists are maintained for each instruction: `rawlist` (a read-after-write dependency list, recording the preceding write instructions it depends on) and `warlist` (a write-after-read dependency list, recording the preceding read instructions it depends on). The following logic is executed for each instruction in the linear instruction list:

[0075] (1) Read-after-write (RAW) dependency analysis: For each source operand (Use) of the current instruction, i.e., each register the current instruction reads, look up the register in the last_def table; if there is a definer, i.e., an earlier instruction that writes the register, record the index of that preceding instruction in the rawlist of the current instruction. This means that the current instruction cannot start until the preceding instruction has been issued and its hardware latency has elapsed.

[0076] (2) Write-after-read (WAR) dependency analysis: For each target operand (Def) of the current instruction, i.e., each register the current instruction writes, look up the register in the last_use table; if there is a user, i.e., an earlier instruction that reads the register, record the index of that preceding instruction in the warlist of the current instruction. This means that the current instruction must be issued after the preceding instruction to prevent the data from being overwritten before it is read.

[0077] (3) Status update: After each instruction is processed, the last_def and last_use tables of the register area involved in the current instruction are updated immediately to ensure the transitivity of the dependency relationship.
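
The three steps above can be sketched as a single pass over the instruction list. The names last_def, last_use, rawlist, and warlist follow the text; the function name build_deps and the choice to return the two lists as parallel arrays indexed by instruction are assumptions, as are the sample instructions.

```python
def build_deps(insns):
    """insns: list of (text, queue, latency, defs, uses) quintuples."""
    last_def = {}  # register -> index of the last instruction that wrote it
    last_use = {}  # register -> index of the last instruction that read it
    rawlist = [[] for _ in insns]
    warlist = [[] for _ in insns]

    for i, (_text, _queue, _latency, defs, uses) in enumerate(insns):
        # (1) RAW: this instruction reads a register written earlier.
        for reg in uses:
            if reg in last_def:
                rawlist[i].append(last_def[reg])
        # (2) WAR: this instruction writes a register read earlier.
        for reg in defs:
            if reg in last_use:
                warlist[i].append(last_use[reg])
        # (3) Status update, done only after both checks.
        for reg in uses:
            last_use[reg] = i
        for reg in defs:
            last_def[reg] = i
    return rawlist, warlist


example = [
    ("vldd $32, 0($1)", 1, 5, ["$32"], ["$1"]),
    ("vmad $32, $33, $34, $35", 0, 7, ["$35"], ["$32", "$33", "$34"]),
    ("vldd $32, 32($1)", 1, 5, ["$32"], ["$1"]),
]
raw, war = build_deps(example)
print(raw)  # [[], [0], []]
print(war)  # [[], [], [1]]
```

In the example, the vmad reads $32 after the first load writes it (RAW on instruction 0), and the second load overwrites $32 after the vmad reads it (WAR on instruction 1). Only the two dependency kinds named in the description are tracked; write-after-write ordering is not covered by this sketch.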

[0078] Step S3: According to the dependency graph, the instructions in the linear instruction list are split into the operation instruction queue and the memory access instruction queue. Based on the list scheduling algorithm, the instructions in the two queues are statically rearranged in a dual-channel manner. In each simulated clock cycle, one instruction that meets the dependency constraints is selected from each of the two queues and the two are issued as a pair. Empty instructions are filled into any channel whose dependency constraints are not met, thus generating a dual-channel instruction pair sequence.

[0079] Furthermore, the instructions in the linear instruction list are distributed to the arithmetic instruction queue and the memory access instruction queue, specifically including:

[0080] Based on the queue type field marked in the intermediate instruction, all instruction indices are filled into the operation instruction queue and memory access instruction queue respectively, while the order of instructions in the original linear list is preserved as a scheduling priority reference.

[0081] Furthermore, in each simulated clock cycle, one instruction satisfying the dependency constraints is selected from each of the two queues and the two are issued as a pair, specifically including:

[0082] Maintain a simulated clock variable and two queue pointers; in each clock cycle, check whether the instruction at the head of the arithmetic queue and the instruction at the head of the memory access queue satisfy the dependency constraints. If an instruction does, place it in the corresponding slot of the instruction pair.

[0083] The dependency constraints include: the current clock cycle is greater than or equal to the completion time of all read-after-write dependent instructions of this instruction, and the current clock cycle is greater than the issue time of all write-after-read dependent instructions of this instruction.

[0084] Furthermore, the completion time of an instruction is obtained by adding its hardware latency to its issue time; when an instruction is selected for issue, its issue time is recorded as the current clock cycle, and its estimated completion time is calculated for the dependency checks of subsequent instructions.

[0085] Specifically, a time-stepping greedy algorithm is used to map the serial instruction stream to two parallel hardware channels, namely the arithmetic instruction queue Q0 and the memory access instruction queue Q1, as follows:

[0086] Step S3-1: Based on the queue type field in the IR tuple, fill all instruction indices into the arithmetic instruction queue Q0 and the memory access instruction queue Q1 respectively.

[0087] Step S3-2: Simulate clock stepping using a list scheduling algorithm to rearrange instructions;

[0088] Maintain a simulated clock variable curTick (initialized to 0) and two queue pointers n0 and n1. Enter a while loop that runs until both queues have been fully processed. Each iteration performs the following operations:

[0089] (1) Dependency constraint check: In each clock cycle, attempts are made to fetch instructions from the head of the arithmetic instruction queue Q0 and the memory access instruction queue Q1 respectively, and a dependency constraint check is performed:

[0090] Check read-after-write (RAW) dependency constraints: Iterate through the read-after-write dependency list rawlist of this instruction and determine whether the current clock curTick is greater than or equal to the "issue time + hardware delay period" of all dependent instructions in the read-after-write dependency list.

[0091] Check write-after-read (WAR) dependency constraints: Iterate through the write-after-read dependency list warlist of this instruction and determine whether the current clock curTick is greater than the issue time of all dependent instructions in the write-after-read dependency list.

[0092] (2) Dual-launch decision logic:

[0093] Operation channel allocation: If the instruction at pointer n0 of the operation instruction queue Q0 passes the dependency constraint check, it is placed in slot 0 of the current cycle, its issue time Tick is recorded, and pointer n0 is moved forward; otherwise, slot 0 is filled with an empty instruction.

[0094] Memory access channel allocation: If the instruction at pointer Q1 of the memory access instruction queue passes the check, it is placed in slot 1 of the current cycle, its issue time Tick is recorded, and pointer n1 is moved forward; otherwise, slot 1 is filled with an empty instruction.

[0095] Parallel submission: Store the instruction pair (inst0, inst1) determined in this loop into the final dual-channel instruction pair sequence sched_result.

[0096] (3) Clock advance: increment curTick by 1 and enter the next cycle.
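Steps (1) to (3) can be put together in a short sketch of the scheduling loop. The names `cur_tick`, `n0`, `n1`, `rawlist`, `warlist`, and `sched_result` mirror those in the text; everything else (the function signature, using `None` to stand for an empty instruction, the sample latencies) is assumed for illustration and is not the patented implementation.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[Optional[int], Optional[int]]


def schedule(q0: List[int], q1: List[int], latency: Dict[int, int],
             rawlist: Dict[int, List[int]],
             warlist: Dict[int, List[int]]) -> List[Pair]:
    issue_tick: Dict[int, int] = {}
    sched_result: List[Pair] = []
    cur_tick, n0, n1 = 0, 0, 0

    def ready(idx: int) -> bool:
        # Assumption: a dependency that has not been issued yet is unsatisfied.
        raw_ok = all(d in issue_tick and cur_tick >= issue_tick[d] + latency[d]
                     for d in rawlist[idx])
        war_ok = all(d in issue_tick and cur_tick > issue_tick[d]
                     for d in warlist[idx])
        return raw_ok and war_ok

    while n0 < len(q0) or n1 < len(q1):
        inst0 = inst1 = None                      # None stands for an empty instruction
        if n0 < len(q0) and ready(q0[n0]):        # slot 0: arithmetic channel
            inst0 = q0[n0]
            issue_tick[inst0] = cur_tick
            n0 += 1
        if n1 < len(q1) and ready(q1[n1]):        # slot 1: memory access channel
            inst1 = q1[n1]
            issue_tick[inst1] = cur_tick
            n1 += 1
        sched_result.append((inst0, inst1))       # parallel submission
        cur_tick += 1                             # clock advance
    return sched_result


# 0: load a, 1: load b, 2: c = a + b, 3: load d, 4: e = c * d
latency = {0: 4, 1: 4, 2: 1, 3: 4, 4: 2}
rawlist = {0: [], 1: [], 2: [0, 1], 3: [], 4: [2, 3]}
warlist = {i: [] for i in range(5)}
result = schedule([2, 4], [0, 1, 3], latency, rawlist, warlist)
for tick, pair in enumerate(result):
    print(tick, pair)
```

In this toy input, the third load (index 3) is issued at tick 2, while the addition is still waiting for the first two loads, so its latency overlaps with that wait rather than being paid afterwards.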

[0097] This process yields the final dual-channel instruction pair sequence sched_result shown in Figure 4. During reordering, other unrelated instructions are automatically inserted between long-latency instructions and the instructions that depend on them, thereby maximizing pipeline utilization.

[0098] Step S4: Output the dual-channel instruction pair sequence as assembly code according to the dual-issue syntax format of the Sunway assembler for use in the Sunway many-core processor.

[0099] After instruction reordering is complete, the dual-channel instruction pair sequence sched_result is traversed. For each instruction pair (inst0, inst1), the maximum string width is calculated for alignment, and the pair is formatted into the dual-issue syntax supported by the Sunway assembler. The clock cycle Tick at which the pair is issued is appended as a comment, and the complete assembly code is output.
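The formatting step could look roughly like the sketch below. The text does not give the actual dual-issue syntax of the Sunway assembler, so the slot separator `" ; "`, the comment prefix `"# "`, the `nop` mnemonic, and the placeholder instruction strings are all hypothetical stand-ins. The sketch also assumes that the position of a pair in sched_result equals its issue Tick.

```python
from typing import Dict, List, Optional, Tuple

NOP = "nop"      # assumed mnemonic for the empty instruction
SEP = " ; "      # hypothetical separator between the two issue slots
COMMENT = "# "   # hypothetical comment prefix


def format_pairs(sched_result: List[Tuple[Optional[int], Optional[int]]],
                 text: Dict[int, str]) -> List[str]:
    """Render each (inst0, inst1) pair as one aligned line with a tick comment."""
    rows = [(text[a] if a is not None else NOP,
             text[b] if b is not None else NOP) for a, b in sched_result]
    # Maximum string width per slot, used to align the columns.
    w0 = max((len(left) for left, _ in rows), default=0)
    w1 = max((len(right) for _, right in rows), default=0)
    return [f"{left.ljust(w0)}{SEP}{right.ljust(w1)}  {COMMENT}tick {tick}"
            for tick, (left, right) in enumerate(rows)]


# Placeholder instruction text, not real Sunway mnemonics.
text = {0: "LOAD_A", 1: "ADD_C"}
lines = format_pairs([(None, 0), (1, None)], text)
print("\n".join(lines))
```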

[0100] Example 2

[0101] One embodiment of the present invention provides an automatic assembly code generation system for the Sunway many-core processor, comprising:

[0102] The code analysis module is configured to perform line-by-line analysis of the high-level scripting language code segment to be compiled, and obtain a variable register mapping table and a linear instruction list consisting of intermediate representation instructions;

[0103] The dependency building module is configured to: perform data dependency analysis between instructions based on the variable register mapping table and the linear instruction list, and construct a dependency graph that includes write-after-read dependencies and read-after-write dependencies;

[0104] The instruction reordering module is configured to: based on the dependency graph, distribute the instructions in the linear instruction list to the arithmetic instruction queue and the memory access instruction queue; perform dual-channel static reordering of the instructions in the two queues based on the list scheduling algorithm; in each simulated clock cycle, select from each of the two queues one instruction that satisfies the dependency constraints and issue the two as a pair; and fill an empty instruction into any channel whose dependencies are not satisfied, thereby generating a dual-channel instruction pair sequence;

[0105] The assembly output module is configured to output the dual-channel instruction pair sequence as assembly code according to the dual-issue syntax format of the Sunway assembler, for use in the Sunway many-core processor.
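As an illustration of what the dependency building module might compute, the sketch below builds the two dependency lists from a last-defined register table and a last-used register table. Representing each instruction as a `(destination registers, source registers)` pair and the names `build_deps`, `last_def`, and `last_use` are assumptions of this sketch. Like the text, it tracks only read-after-write and write-after-read dependencies.

```python
from typing import Dict, List, Sequence, Tuple

Instr = Tuple[Sequence[str], Sequence[str]]  # (destination regs, source regs)


def build_deps(instrs: Sequence[Instr]) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (rawlist, warlist), indexed by instruction position."""
    last_def: Dict[str, int] = {}   # register -> index of its last definition
    last_use: Dict[str, int] = {}   # register -> index of its last use
    rawlist: List[List[int]] = [[] for _ in instrs]
    warlist: List[List[int]] = [[] for _ in instrs]
    for i, (dests, srcs) in enumerate(instrs):
        for r in srcs:              # RAW: this read follows an earlier write
            if r in last_def:
                rawlist[i].append(last_def[r])
        for r in dests:             # WAR: this write follows an earlier read
            if r in last_use:
                warlist[i].append(last_use[r])
        for r in srcs:              # update the tables only after both checks
            last_use[r] = i
        for r in dests:
            last_def[r] = i
    return rawlist, warlist


program = [
    (["r1"], []),            # 0: load into r1
    (["r2"], ["r1"]),        # 1: r2 = f(r1)
    (["r1"], []),            # 2: reload r1 (must not overtake the read in 1)
    (["r3"], ["r1", "r2"]),  # 3: r3 = g(r1, r2)
]
rawlist, warlist = build_deps(program)
print(rawlist, warlist)
```

Updating the tables after both checks keeps an instruction that reads and writes the same register from depending on itself.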

[0106] Example 3

[0107] One embodiment of the present invention provides a computer program product, including a computer program that, when executed by a processor, implements the aforementioned method for automatically generating assembly code for the Sunway many-core processor.

[0108] Example 4

[0109] In one embodiment of the present invention, a non-transitory computer-readable storage medium is provided for storing computer instructions. When the computer instructions are executed by a processor, they implement the aforementioned method for automatically generating assembly code for the Sunway many-core processor.

[0110] Example 5

[0111] One embodiment of the present invention provides an electronic device, including: a processor, a memory, and a computer program; wherein the processor is connected to the memory, the computer program is stored in the memory, and when the electronic device is running, the processor executes the computer program stored in the memory to enable the electronic device to execute the aforementioned method for automatically generating assembly code for the Sunway many-core processor.

[0112] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0113] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0114] While the specific embodiments of the present invention have been described above in conjunction with the accompanying drawings, this is not intended to limit the scope of protection of the present invention. Those skilled in the art should understand that any modifications or variations they can make without creative effort, based on the technical solutions of the present invention, remain within the scope of protection of the present invention.

Claims

1. A method for automatically generating assembly code for the Sunway many-core processor, characterized in that it comprises: performing line-by-line analysis of a high-level scripting language code segment to be compiled, to obtain a variable register mapping table and a linear instruction list consisting of intermediate representation instructions; performing data dependency analysis between instructions based on the variable register mapping table and the linear instruction list, and constructing a dependency graph that includes write-after-read dependencies and read-after-write dependencies; based on the dependency graph, distributing the instructions in the linear instruction list to an arithmetic instruction queue and a memory access instruction queue, and performing dual-channel static reordering of the instructions in the two queues based on a list scheduling algorithm, wherein in each simulated clock cycle one instruction satisfying the dependency constraints is selected from each of the two queues and the two are issued as a pair, and an empty instruction is filled into any channel whose dependencies are not satisfied, thereby generating a dual-channel instruction pair sequence, which specifically includes: based on the queue type field marked in each intermediate representation instruction, filling all instruction indices into the arithmetic instruction queue and the memory access instruction queue respectively, preserving the order of the instructions in the original linear instruction list as a scheduling priority reference; maintaining a simulated clock variable and two queue pointers; in each clock cycle, checking in turn whether the instruction at the head of the arithmetic instruction queue and the instruction at the head of the memory access instruction queue satisfy the dependency constraints, and if so, filling them into the corresponding instruction pair; wherein the dependency constraints include: the current clock cycle is greater than or equal to the completion time of all instructions on which this instruction has a read-after-write dependency, and the current clock cycle is greater than the issue time of all instructions on which this instruction has a write-after-read dependency; the completion time of an instruction is obtained by adding its hardware latency to its issue time; when an instruction is selected for issue, its issue time is recorded as the current clock cycle, and its estimated completion time is calculated for use in the dependency checks of subsequent instructions; and outputting the dual-channel instruction pair sequence as assembly code according to the dual-issue syntax format of the Sunway assembler, for use on the Sunway many-core processor.

2. The method for automatically generating assembly code for the Sunway many-core processor as described in claim 1, characterized in that the line-by-line analysis specifically includes: identifying variable declaration statements and arithmetic operation statements in each code line; for a variable declaration statement, calling the virtual register dynamic allocation mechanism to allocate a physical register number for the variable, and storing the mapping between the variable name and the register number in the variable register mapping table; for an arithmetic operation statement, calling the corresponding instruction abstraction model interface to generate an intermediate representation instruction.

3. The method for automatically generating assembly code for the Sunway many-core processor as described in claim 1, characterized in that each intermediate representation instruction is encapsulated in a 5-tuple structure comprising: instruction text, queue type, hardware latency, destination operand set, and source operand set.

4. The method for automatically generating assembly code for the Sunway many-core processor as described in claim 1, characterized in that constructing the dependency graph includes: traversing the linear instruction list and recording the source operands and destination operands of each instruction; maintaining a last-defined register table that records the index of the instruction that last defined each register, used to build read-after-write dependencies; maintaining a last-used register table that records the index of the instruction that last used each register, used to build write-after-read dependencies; and storing the indices of the instructions on which each instruction depends into that instruction's read-after-write dependency list and write-after-read dependency list, respectively.

5. An automatic assembly code generation system for the Sunway many-core processor, characterized in that it comprises: a code analysis module configured to perform line-by-line analysis of a high-level scripting language code segment to be compiled, to obtain a variable register mapping table and a linear instruction list consisting of intermediate representation instructions; a dependency building module configured to perform data dependency analysis between instructions based on the variable register mapping table and the linear instruction list, and to construct a dependency graph that includes write-after-read dependencies and read-after-write dependencies; an instruction reordering module configured to: distribute the instructions in the linear instruction list to an arithmetic instruction queue and a memory access instruction queue according to the dependency graph; perform dual-channel static reordering of the instructions in the two queues based on a list scheduling algorithm; in each simulated clock cycle, select from each of the two queues one instruction that satisfies the dependency constraints and issue the two as a pair; and fill an empty instruction into any channel whose dependencies are not satisfied, thereby generating a dual-channel instruction pair sequence, which specifically includes: based on the queue type field marked in each intermediate representation instruction, filling all instruction indices into the arithmetic instruction queue and the memory access instruction queue respectively, preserving the order of the instructions in the original linear instruction list as a scheduling priority reference; maintaining a simulated clock variable and two queue pointers; in each clock cycle, checking in turn whether the instruction at the head of the arithmetic instruction queue and the instruction at the head of the memory access instruction queue satisfy the dependency constraints, and if so, filling them into the corresponding instruction pair; wherein the dependency constraints include: the current clock cycle is greater than or equal to the completion time of all instructions on which this instruction has a read-after-write dependency, and the current clock cycle is greater than the issue time of all instructions on which this instruction has a write-after-read dependency; the completion time of an instruction is obtained by adding its hardware latency to its issue time; when an instruction is selected for issue, its issue time is recorded as the current clock cycle, and its estimated completion time is calculated for use in the dependency checks of subsequent instructions; and an assembly output module configured to output the dual-channel instruction pair sequence as assembly code according to the dual-issue syntax format of the Sunway assembler, for use on the Sunway many-core processor.

6. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium is used to store computer instructions which, when executed by a processor, implement the method for automatically generating assembly code for the Sunway many-core processor as described in any one of claims 1-4.

7. An electronic device, characterized in that it comprises: a processor, a memory, and a computer program; wherein the processor is connected to the memory, the computer program is stored in the memory, and when the electronic device is running, the processor executes the computer program stored in the memory to enable the electronic device to execute the method for automatically generating assembly code for the Sunway many-core processor as described in any one of claims 1-4.