GPDSP assembly transplantation optimization method and system based on countdown buffering
An optimization method and assembly technology, applied in the fields of computing, code compilation, software engineering design, etc., can solve the problems of difficult assembly code development and optimization, wasted human and financial resources, and non-portable code, achieving good code performance and hardware resource utilization.
Pending Publication Date: 2021-07-23
NAT UNIV OF DEFENSE TECH
0 Cites 1 Cited by
AI-Extracted Technical Summary
Problems solved by technology
Although efficient algorithm design and implementation can be completed, the technical expertise, labor cost, and development cycle required are extremely high, and the code is not readily portable.
Specifically: first, for the user's core-level assembly code, the instructions must be arranged in parallel by hand while taking into account the dependencies between instructions, instruction beat counts, delay-slot hiding, functional-unit restrictions, parallel packaging of instructions, and so on. Especially when the core-level code is long, a developer sometimes cannot find a good parallel arrangement under so many constraints, so developing and optimizing assembly code is extremely difficult; second, because assembly code is usually static A...
Method used
[0043] On the basis of completed instruction dependency analysis and new-architecture mapping, efficient assembly code can be generated through parallel packaging and scheduling of instructions, thereby improving the execution efficiency of the entire program. On the directed acyclic graph of instructions generated by dependency analysis, the instructions are topologically sorted, and the root nodes are first added to the scheduling queue in sorted order. The scheduling queue pops one instruction at a time to run the scheduling process. To complete instruction scheduling, an instruction arrangement table is maintained whose horizontal axis is all functional units of the hardware and whose vertical axis is the clock cycle. In addition to retaining its dependencies and executable functional unit list, each instruction also maintains a timestamp of its latest start time. A root node that joins the scheduling queue for the first time has a latest start time of 0. An instruction popped from the scheduling queue searches the instruction arrangement table in row-major order, subject to its functional-unit constraints and its latest start time, for a position where it can be placed. On finding the earliest functional unit on which it can execute, it reserves that functional unit in the instruction arrangement table, and the reservation runs from the clock cycle when the instruction is executed to the completion of the ...
Abstract
The invention discloses a GPDSP assembly transplantation optimization method and system based on countdown buffering. The method comprises the following steps: parsing the assembly code and performing dependency analysis based on a countdown buffer pool to construct an instruction dependency directed acyclic graph; analyzing the differences in instruction set information before and after transplantation, correcting the instructions in the assembly code according to those differences, and mapping them to the instruction information under the new architecture; and, on the basis of the instruction dependency directed acyclic graph, applying parallel optimization packaging and scheduling, based on a long-instruction-priority list scheduling algorithm, to the instruction information mapped to the new architecture, obtaining target assembly code transplanted to the new architecture. With this method, assembly code can be migrated effectively and automatically between different generations of GPDSP platforms, the original assembly code can be re-packaged and re-scheduled in parallel, and better code performance and hardware resource utilization are obtained.
Application Domain
Program code adaption; Code compilation
Technology Topic
Priority list; System structure
Example Embodiment
[0031] As shown in Figure 1, the countdown-buffer-based GPDSP assembly transplantation (porting) optimization method of this embodiment includes:
[0032] 1) Parse the GPDSP assembly code to be ported (denoted by the file extension ".asm" in Figure 1) and, based on the countdown buffer pool, analyze the instruction flow order and the dependencies between instructions to construct the instruction dependency directed acyclic graph;
[0033] 2) Analyze the differences in instruction set information before and after porting, correct the instructions in the assembly code according to those differences, and map them to the instruction information under the new architecture;
[0034] 3) Based on the instruction dependency directed acyclic graph, apply parallel optimization packaging and scheduling, based on the long-instruction-priority list scheduling algorithm, to the instruction information mapped to the new architecture, obtaining the target assembly code ported to the new architecture (still denoted by the file extension ".asm" in Figure 1).
[0035] Each line of the user's assembly code is usually one assembly instruction. Because GPDSP assembly code is usually static, the code alone shows only the beat (clock cycle) in which each instruction starts; the beat in which it completes and takes effect, after its delay slots, is hard to see directly from the code, especially in optimized code that already incorporates instruction arrangement and delay-slot hiding. Therefore, to optimize or port assembly code, the code must first be parsed and the instruction flow order and the dependencies between instructions analyzed; only then can the code be optimized and ported.
[0036] The assembly code shows explicitly the clock cycle in which each instruction is issued; combined with the execution beat count of each instruction from the instruction set, one can infer the beat in which the instruction takes effect. On this principle, this embodiment proposes a countdown-buffer mechanism to analyze the effective order and dependencies of the whole assembly instruction stream. Specifically, as shown in Figure 2, the dependency analysis based on the countdown buffer pool in this embodiment includes: building a buffer pool and an effective-instruction pool, and maintaining a countdown clock outside the buffer pool; pushing the instructions in the assembly code into the buffer pool in their explicit order; popping instructions from the buffer pool in the order in which they take effect; and, for each popped instruction, first searching the effective-instruction pool for the instructions it depends on and establishing the dependencies, then adding it to the effective-instruction pool, thereby obtaining the instruction flow order and the dependencies between instructions. In this way the countdown-buffer analysis covers the whole instruction stream and its dependencies, and the instruction dependency directed acyclic graph is constructed.
[0037] Referring to Figure 2, instruction set information must be provided externally before dependency analysis is performed, namely the specific name of each instruction, its executable functional units, and its specific beat count. The dependency analysis module maintains a clock that records the current clock cycle, reads the assembly instructions in code order, parses them, and pushes them into the buffer pool. For each instruction pushed into the buffer pool, the execution beat count given by the instruction set is added to the clock cycle in which the instruction starts executing, yielding the expected completion time, which is recorded in the instruction. At the same time, the dependency analysis module traverses all instructions currently in the countdown buffer and determines whether any has completed in the current clock cycle; instructions that have taken effect are popped from the buffer pool for subsequent dependency analysis and scheduling.
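The countdown-buffer step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Instr` record, the function name `effect_order`, and the mnemonics in the usage note are hypothetical, and the sketch assumes each instruction's issue cycle (non-negative) and beat count are already known from the parsed code and the instruction set information.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    name: str          # hypothetical instruction label
    issue_cycle: int   # clock cycle in which the source code issues it
    latency: int       # beat count taken from the instruction set information
    done_cycle: int = 0  # expected completion time, filled in on push

def effect_order(instrs):
    """Return instruction names in the order in which they take effect."""
    pending = sorted(instrs, key=lambda i: i.issue_cycle)  # stable: keeps code order
    buffer_pool, effective = [], []
    clock = 0
    while pending or buffer_pool:
        # Push every instruction issued up to the current clock cycle.
        while pending and pending[0].issue_cycle <= clock:
            ins = pending.pop(0)
            ins.done_cycle = ins.issue_cycle + ins.latency
            buffer_pool.append(ins)
        # Pop every instruction whose countdown has expired.
        for ins in [i for i in buffer_pool if i.done_cycle <= clock]:
            buffer_pool.remove(ins)
            effective.append(ins.name)
        clock += 1
    return effective
```

For example, with a 4-beat `LOAD` issued in cycle 0 and a 1-beat `ADD` and a 2-beat `MUL` both issued in cycle 1, `effect_order` returns `["ADD", "MUL", "LOAD"]`: the load is issued first but takes effect last, which is exactly the information that is hard to read off the static code.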
[0038] Instructions popped on taking effect are analyzed for data dependences according to their specific operations (read-after-write, write-after-read, and write-after-write) and for control dependences. In this embodiment, a dependency between instructions in step 1) is represented as a triple (INSTR_X, INSTR_Y, NUM), where INSTR_X is the parent instruction, INSTR_Y is the child instruction, and NUM is the number of beats required between the parent and child instructions. Taking read-after-write as an example, if the operand written by INSTR_X is an operand read by INSTR_Y, and the execution beat count of INSTR_X is CYCLE_X, then the dependency between the two instructions is described as (INSTR_X, INSTR_Y, CYCLE_X).
[0039] The number of beats NUM required between the parent and child instructions may be positive or negative; a negative value indicates that the child instruction starts executing earlier than the parent but takes effect after the parent takes effect. Because operands are usually read in the first beat of an instruction's execution and written in the last beat, NUM can be negative in the write-after-write and write-after-read cases, that is, the write instruction starts earlier but takes effect after the read instruction takes effect. Through the above countdown buffer mechanism, the timing logic of the instruction stream in the source code can be analyzed, dependencies can be established according to the effective times, and the instruction dependency directed acyclic graph can be built, laying the foundation for subsequent instruction scheduling.
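To make the sign of NUM concrete, the sketch below computes the triple for each kind of data dependence. Only the read-after-write case (NUM = CYCLE_X) is stated in the text; the write-after-read and write-after-write formulas are assumptions derived from a simplified timing model in which an instruction reads its operands in its first beat and its result is written at the end of its last beat. The function name `dependency_triple` and the mnemonics are hypothetical.

```python
def dependency_triple(parent, child, kind):
    """Build (INSTR_X, INSTR_Y, NUM) so that start_Y >= start_X + NUM.

    parent and child are (name, beat_count) pairs; kind is "RAW", "WAR" or "WAW".
    """
    (name_x, cycle_x), (name_y, cycle_y) = parent, child
    if kind == "RAW":
        # From the text: the child may read only after the parent's result lands.
        num = cycle_x
    elif kind == "WAR":
        # Assumed model: the child's write (its last beat) must not precede
        # the parent's read (its first beat).
        num = 1 - cycle_y
    elif kind == "WAW":
        # Assumed model: the child's write must land strictly after the parent's.
        num = cycle_x - cycle_y + 1
    else:
        raise ValueError(f"unknown dependence kind: {kind}")
    return (name_x, name_y, num)
```

Under this model, a 1-beat `ADD` that reads a register later overwritten by a 4-beat `LOAD` gives `("ADD", "LOAD", -3)`: the load may start up to three beats before the add and still take effect after the add has read its operand.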
[0040] Within the same series of GPDSP products the architecture generally does not change much, but assembly programs usually cannot be ported directly even between generations: improvements and adjustments driven by application scenarios or process technology often change some of the information related to the instruction set, and these changes, together with architectural changes, prevent the assembly code from being used directly. To migrate code between generations, the differences in instruction set information between the two must be analyzed, and the parsed assembly code is then mapped to the instruction information of the new architecture.
[0041] In this embodiment, the differences in instruction set information before and after porting in step 2) include changes or adjustments to instruction names, changes to instruction beat counts, and changes to the executable functional units.
[0042] In this embodiment, correcting the instructions in the assembly code in step 2) includes: for a change or adjustment of an instruction name, automatically replacing the instruction in the assembly code; for a change in an instruction's beat count, correcting the instruction in the assembly code and adjusting its dependency relationships with other instructions; and for a change in the executable functional units, correcting the executable functional units of the instructions in the assembly code one by one. With the help of the instruction set information difference table, the results of assembly code parsing and dependency analysis can be mapped to the new architecture. The mapping covers the following aspects: 1) traverse all assembly instructions in the source code and, on encountering an instruction whose name has changed, replace it with the new name; 2) modify the data structures related to functional units in the architecture, so that the types and numbers of functional units follow the new architecture and subsequent instruction scheduling refers to the new functional units; 3) with reference to the new instruction set information, update the executable functional units and beat count maintained by each instruction; 4) for an instruction INSTR_X that has dependency relationships, if its execution beat count CYCLE_X has been adjusted, recalculate each dependency triple (INSTR_X, INSTR_Y, NUM) involving it. With these mappings, the original assembly code is mapped instruction by instruction to the new architecture, and the dependencies are adjusted accordingly for subsequent instruction scheduling and code generation. Through this targeted mapping, the source code is ported to the new architecture and assembly code can be migrated automatically between different generations of DSP products, which greatly improves the portability of the software ecosystem and the efficiency of software development.
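The four mapping aspects above can be sketched as a single pass over the parsed instructions. This is only an illustration under simplifying assumptions: the record layout, the function name `port_program`, the difference-table format, and all mnemonics and unit names are hypothetical, and every dependency triple is assumed to be read-after-write, so that NUM equals the parent's beat count and can be recomputed from the new instruction set information.

```python
def port_program(instrs, raw_deps, diff):
    """Map parsed instructions and their dependency triples to a new architecture.

    instrs:   list of {"id", "op", "latency", "units"} records (old architecture)
    raw_deps: list of (parent_id, child_id, num) read-after-write triples
    diff:     {old_op: {"op": ..., "latency": ..., "units": ...}}, all keys optional
    """
    ported = {}
    for ins in instrs:
        change = diff.get(ins["op"], {})
        ported[ins["id"]] = {
            "op": change.get("op", ins["op"]),                 # aspect 1: rename
            "latency": change.get("latency", ins["latency"]),  # aspect 3: beat count
            "units": list(change.get("units", ins["units"])),  # aspects 2 and 3: units
        }
    # Aspect 4: recompute NUM from the parent's (possibly changed) beat count.
    new_deps = [(p, c, ported[p]["latency"]) for p, c, _old_num in raw_deps]
    return ported, new_deps
```

A usage example: if the difference table says the old 4-beat `LDW` becomes a 5-beat `LOADW` that can also run on a second load/store unit, the triple `("i0", "i1", 4)` for a load `i0` feeding an add `i1` becomes `("i0", "i1", 5)`, while instructions absent from the table are carried over unchanged.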
[0043] Once instruction dependency analysis and mapping to the new architecture are complete, efficient assembly code can be generated through parallel packaging and scheduling of the instructions, improving the execution efficiency of the whole program. The instructions are topologically sorted on the instruction dependency directed acyclic graph produced by dependency analysis, and the root nodes are first added to the scheduling queue in sorted order. The scheduling queue pops one instruction at a time and runs the scheduling process on it. To carry out instruction scheduling, an instruction arrangement table is maintained whose horizontal axis is all functional units of the hardware and whose vertical axis is the clock cycle. Besides retaining its dependencies and its list of executable functional units, each instruction also maintains a timestamp of its latest start time, which serves as the lower bound when searching for a slot. A root node that joins the scheduling queue for the first time has a latest start time of 0. An instruction popped from the scheduling queue searches the instruction arrangement table in row-major order, subject to its functional-unit constraints and its latest start time, for a position where it can be placed. On finding the earliest functional unit on which it can execute, it reserves that functional unit in the instruction arrangement table from the clock cycle in which the instruction starts executing until the instruction completes. Taking INSTR_X as an example, suppose its latest start time is beat I and functional unit N is found to be available from beat I+J; functional unit N is then reserved in the instruction arrangement table from beat I+J to beat I+J+CYCLE_X. After the current instruction is scheduled, its dependency edges to other instructions are deleted, and the latest start time of each instruction that depends on it is updated. Continuing with INSTR_X, suppose it has been determined to start executing at beat I+J, INSTR_Y depends on it through (INSTR_X, INSTR_Y, NUM), and the original latest start time of INSTR_Y is Time_Y; the latest start time of INSTR_Y is then updated to max(Time_Y, I+J+NUM). Each time an instruction is scheduled, any newly created root nodes are added to the scheduling queue. The scheduling queue is implemented as a priority queue: when several root nodes are present at the same time, the one with the longest execution time is scheduled first. The scheduling process repeats until all instructions have been scheduled, and the final optimized assembly code is generated from the completed instruction arrangement table.
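The reservation step and the start-time update for one instruction can be illustrated as below. This is a sketch, not the patent's data structure: the arrangement table is modelled as a set of occupied `(unit, cycle)` cells, the reservation is assumed to cover the half-open range from the start beat up to (but not including) start + CYCLE_X, and the names `place`, `updated_start`, and `M0` are hypothetical.

```python
def place(table, units, earliest, latency):
    """Reserve the first free slot at or after `earliest`, scanning cycle by cycle.

    table is a set of occupied (unit, cycle) cells and is updated in place.
    Returns (start_cycle, unit).
    """
    cycle = earliest
    while True:
        for unit in units:  # row-major: all candidate units within one cycle
            cells = [(unit, t) for t in range(cycle, cycle + latency)]
            if not any(cell in table for cell in cells):
                table.update(cells)
                return cycle, unit
        cycle += 1

def updated_start(time_y, start_x, num):
    """New start bound of INSTR_Y after INSTR_X is placed at start_x."""
    return max(time_y, start_x + num)
```

With an empty table, a 2-beat instruction restricted to unit `M0` is placed at `(0, "M0")`. A 1-beat instruction on the same unit whose bound is beat 1 (I = 1) finds that beat occupied and lands at beat 2 (J = 1). A child with NUM = 1 and Time_Y = 0 then gets `updated_start(0, 2, 1) == 3`.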
[0044] As a preferred embodiment, as shown in Figure 3, step 3) of this embodiment includes:
[0045] 3.1) Create a schedule table that records whether each functional unit is occupied in each clock cycle (the resource constraints); besides its list of executable functional units, each instruction also maintains the earliest clock cycle in which it can start executing. Topologically sort the instruction dependency directed acyclic graph, add the parent instructions that no dependency points to into the scheduling queue, and set the initial clock cycle in which they can start executing to 1;
[0046] 3.2) Pop an instruction from the directed acyclic graph as the object to be scheduled;
[0047] 3.3) According to the list of executable functional units of the instruction being scheduled, consult the schedule table, select the earliest functional unit on which it can execute and the clock cycle in which execution starts, and update the occupancy in the schedule table; delete all dependencies pointing from this instruction to its child instructions, and update the earliest start clock cycle of each child instruction according to this instruction's start clock cycle and the dependency;
[0048] 3.4) Determine whether the directed acyclic graph has been fully traversed; if not, jump to step 3.2); otherwise, generate the target assembly code for the new architecture from the schedule table.
[0049] In this embodiment, the parallel optimization packaging and scheduling of instructions based on the long-instruction-priority list scheduling algorithm in step 3) is specifically an improved list scheduling algorithm, oriented to very long instruction word (VLIW) architectures, that gives priority to long instructions. When an instruction is popped from the directed acyclic graph as the object to be scheduled in step 3.2), if several instructions qualify, the one with the longest execution time is selected first.
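Steps 3.1) to 3.4) together with the long-instruction-first rule can be sketched as a small list scheduler. It is a hedged illustration rather than the patented algorithm: the function name `list_schedule`, the input layout, and the unit names are hypothetical; start cycles begin at 0 (paragraph [0043]) rather than 1; ties between equally long instructions are broken by identifier; and a unit is assumed to be occupied for the half-open range [start, start + beat count).

```python
import heapq

def list_schedule(instrs, deps):
    """instrs: {id: (beat_count, [units])}; deps: [(parent, child, num)].

    Returns {id: (start_cycle, unit)}.
    """
    earliest = {i: 0 for i in instrs}
    children = {i: [] for i in instrs}
    indegree = {i: 0 for i in instrs}
    for parent, child, num in deps:
        children[parent].append((child, num))
        indegree[child] += 1
    # Step 3.1: roots enter a priority queue; negated beat count = longest first.
    ready = [(-instrs[i][0], i) for i in instrs if indegree[i] == 0]
    heapq.heapify(ready)
    busy, schedule = set(), {}
    while ready:                                   # step 3.4: loop until drained
        _, i = heapq.heappop(ready)                # step 3.2: pop one instruction
        latency, units = instrs[i]
        cycle = earliest[i]
        while i not in schedule:                   # step 3.3: first free slot
            for unit in units:
                cells = [(unit, t) for t in range(cycle, cycle + latency)]
                if not any(c in busy for c in cells):
                    busy.update(cells)
                    schedule[i] = (cycle, unit)
                    break
            else:
                cycle += 1
        for child, num in children[i]:             # release edges, update bounds
            earliest[child] = max(earliest[child], cycle + num)
            indegree[child] -= 1
            if indegree[child] == 0:
                heapq.heappush(ready, (-instrs[child][0], child))
    return schedule
```

For two independent instructions competing for one unit `U`, a 3-beat `b` and a 1-beat `a`, the scheduler places `b` at cycle 0 and `a` at cycle 3, because the longer instruction is popped first. Python's `heapq` is a min-heap, which is why the beat count is negated in the queue key.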
[0050] In summary, assembly code developed for GPDSPs may already include delay-slot hiding and parallel packaging optimizations, introduced either by the compiler or by manual arrangement. To optimize or port such assembly code, the order in which its instructions take effect and the dependencies between instructions must first be analyzed, and the code is then optimized or ported on the basis of that dependency analysis. To address these problems, this embodiment parses the assembly code and, based on the countdown buffer pool, analyzes the instruction flow order and the dependencies between instructions to construct the instruction dependency directed acyclic graph; analyzes the differences in instruction set information before and after porting and, according to those differences, maps the instructions to the instruction information under the new architecture; and, based on the instruction dependency directed acyclic graph, applies parallel optimization packaging and scheduling, based on the long-instruction-priority list scheduling algorithm, to the instruction information mapped to the new architecture, obtaining target assembly code ported to the new architecture. With instruction dependency analysis and construction of the directed acyclic graph completed by the countdown buffer mechanism, the instruction set information can be mapped to the new architecture and the instructions scheduled efficiently, improving both algorithm performance and portability. This embodiment enables assembly code to be migrated effectively and automatically between different generations of GPDSP platforms, re-packages and re-schedules the original assembly code, and achieves better code performance and hardware resource utilization.
[0051] Further, the present embodiment also provides a countdown-buffer-based GPDSP assembly transplantation optimization system, including a microprocessor and a memory connected to each other, the microprocessor being programmed or configured to perform the steps of the aforementioned countdown-buffer-based GPDSP assembly transplantation optimization method.
[0052] Further, the present embodiment also provides a computer-readable storage medium that stores a computer program programmed or configured to perform the aforementioned countdown-buffer-based GPDSP assembly transplantation optimization method.
[0053] It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention; such modifications and variations fall within the scope of the claims of the present invention.