A GPU performance modeling method combining an analytical model with a cycle-accurate model
By combining an analytical model with a cycle-accurate model, the method builds a cycle-accurate memory access module and an analytically modeled instruction execution flow. This addresses the imbalance between speed and accuracy in existing GPU modeling methods and enables fast, accurate GPU performance modeling.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- 北京天数智芯半导体科技有限公司
- Filing Date
- 2022-10-11
- Publication Date
- 2026-06-19
AI Technical Summary
Existing GPU performance modeling methods struggle to balance speed and accuracy. Cycle-accurate models are highly accurate but slow, analytical models are fast but less accurate, and existing hybrid models do not model the memory access module precisely enough.
By combining analytical models and cycle-accurate models, a cycle-accurate memory access module model is constructed by collecting traces of GPU applications. The analytical model is then used to model the instruction execution flow, generate instruction sequences, and calculate the GPU performance metric IPC.
It enables fast and accurate calculation of GPU performance metrics, balancing modeling speed and accuracy, and improving the efficiency of GPU architecture exploration.
Figure CN115630489B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of GPU performance modeling, and specifically relates to a GPU performance modeling method that combines an analytical model with a cycle-accurate model.
Background Technology
[0002] The rapid innovation of GPU architecture presents new challenges to GPU modeling. To research and develop new processor architectures and quickly evaluate the performance and power consumption of new designs, architecture researchers need to create accurate performance models for CPUs or GPUs and develop corresponding simulators to accelerate the design cycle. Currently, GPU architecture design widely adopts Virtual Instruction Sets (VISA) and Machine Instruction Sets (MISA), giving hardware developers great flexibility in modifying the MISA while maintaining application compatibility without altering the VISA. However, NVIDIA's MISA is typically not publicly available, while AMD's MISA is often significantly modified, making it difficult to implement in open-source simulators. Therefore, achieving accurate performance modeling for the ever-evolving GPU architecture has become a major challenge for current GPU design.
[0003] Current GPU performance modeling mainly falls into the following categories:
[0004] (1) Cycle-accurate modeling. Open-source frameworks for cycle-accurate GPU simulation currently include MacSim, Multi2Sim, gem5-gpu, GPGPU-Sim, and Accel-Sim. These methods typically model the computing architecture directly and then obtain performance estimates through detailed cycle-accurate simulation. Accel-Sim, released by Purdue University in 2020, is the most recent of these open-source GPU simulators. Although Accel-Sim supports more ISAs than earlier open-source simulators, it does not directly model NVIDIA's latest architectures, such as Turing. Instead, it uses trace-driven simulation to support multiple MISAs and establishes a set of micro-benchmarks to infer the parameters of NVIDIA's undisclosed GPU architectures. The inferred parameters are then used to build models that support execution-driven simulation, reducing the error to around 15%. Moreover, cycle-accurate simulation is generally slow because it simulates hardware operations in detail. When searching for optimal design parameters, simulation time can exceed thousands of hours, which is unacceptably slow.
[0005] (2) Analytical models. Analytical models build mathematical models from the key parameters of the computing architecture that affect performance, and can search the design space more than 100 times faster than cycle-accurate simulation. Early analytical models include MWP-CWP and GPUPerf; GPUPerf mainly predicts GPU performance bottlenecks. Interval analysis further improves analytical models: it uses the instruction trace to identify key events that degrade performance, such as cache misses and branch mispredictions, and builds a performance model for each event. GPUMech, proposed by the Georgia Institute of Technology, was the first to apply interval analysis to GPU performance modeling, improving both accuracy and simulation speed. The latest GPU analytical model, MDM, models the GPU on-chip network and GPU memory accesses more accurately, improving simulation accuracy for memory-divergent GPU applications. However, analytical models have larger errors than cycle-accurate models, especially for memory access behavior with resource contention, where the time estimated by the mathematical model differs significantly from the actual hardware time. For production use, their results are therefore less useful as a reference than those of cycle-accurate simulators.
[0006] (3) Hybrid models. Hybrid models combine two different modeling approaches to balance accuracy and speed in GPU performance modeling. The PPT-GPU model proposed by New Mexico State University combines analytical models with event-driven models, dividing GPU modeling into computation modeling and memory access modeling. The computation part uses instruction list-based modeling, while the memory access part uses analytical modeling. For memory-intensive applications, however, the error in modeling memory access behavior seriously degrades the accuracy of the overall GPU performance model.
[0007] In summary, existing GPU performance models can be categorized into cycle-accurate models, analytical models, and hybrid models. Cycle-accurate models offer high modeling precision but are slow: simulation times can exceed thousands of hours for some applications, which makes them impractical for architecture exploration. Analytical models are much faster than cycle-accurate models, but because they describe the behavior of GPU components with mathematical models, they have larger errors and lower modeling precision. Hybrid models take a finer-grained approach, dividing GPU modeling into computation and memory access components to balance modeling speed and accuracy. However, existing hybrid models still use analytical models for memory access, so the memory access module is not modeled precisely enough.
Summary of the Invention
[0008] The technical problem to be solved by the present invention is to address the shortcomings of the prior art by providing a GPU performance modeling method that combines an analytical model and a cycle-accurate model, taking into account both speed and accuracy. The memory access module, which has a significant impact on accuracy, is modeled using a cycle-accurate model, while the instruction execution flow is modeled using an analytical model. This enables fast and accurate modeling of GPU performance, that is, by inputting the application program and the GPU configuration information, the GPU performance metric of instructions per cycle (IPC) can be calculated quickly and accurately.
[0009] To achieve the above-mentioned technical objectives, the technical solution adopted by the present invention is as follows:
[0010] A GPU performance modeling method combining an analytical model and a cycle-accurate model includes:
[0011] Step 1: Collect runtime traces of GPU applications;
[0012] Step 2: Extract GPU architecture information and construct a memory access module model based on a cycle-accurate model;
[0013] Step 3: Analyze the runtime trace of the GPU application and generate an instruction sequence;
[0014] Step 4: Map the instructions of each thread block to the target GPU architecture, and construct the instruction execution timeline using analytical modeling until the instructions in all thread blocks have been modeled, obtaining the number of execution cycles of the GPU application, $cycle_{max}$;
[0015] Step 5: Obtain the total number of instructions executed by the GPU application, $N_{instruction}$, and combine it with $cycle_{max}$ to calculate the GPU performance metric IPC.
[0016] To optimize the above technical solution, the specific measures also include:
[0017] Step 1 above uses a trace collection tool to collect log files that record the instruction state of the GPU application. Each log file records, for every executed instruction, the thread block number, the warp number, the instruction opcode, the count and IDs of the instruction's source registers, the count and IDs of the instruction's destination registers, the memory access type, and the memory access address.
[0018] Step 2 above includes the following steps:
[0019] Step 2.1: Extract GPU architecture information, including the number of GPU streaming multiprocessors, GPU cache configuration, GPU memory configuration, GPU clock frequency, GPU warp scheduling policy, and GPU instruction latency information.
[0020] Step 2.2: Construct a memory access module model based on a cycle-accurate model to obtain a GPU memory access module cycle-accurate simulator:
[0021] Based on GPU architecture information, a cycle-accurate simulator including L1 cache, L2 cache, on-chip network and GPU memory is built;
[0022] Simulator parameters include: the number of L1 and L2 caches, cache mapping method, cache replacement policy, number of miss registers (miss status holding registers, MSHRs), on-chip network bandwidth, GPU memory bandwidth, and number of memory controllers.
[0023] Step 3 above involves parsing and classifying the collected traces, arranging the instructions of each warp in the parsed and classified traces according to the execution order, and generating an instruction sequence.
[0024] Step 4 above includes:
[0025] Step 4.1: Map the instructions of each thread block to the target GPU architecture to model the execution process until all thread blocks have been modeled.
[0026] Step 4.2: Obtain the completion cycle of the last instruction of each warp in all thread blocks, and take the maximum value as the number of execution cycles of the GPU application, $cycle_{max}$.
[0027] Step 4.1 above specifically includes:
[0028] Step 4.1.1: Analyze the dependencies between instructions based on register usage, select the active warp, and model the instruction execution process;
[0029] Step 4.1.2: Construct the instruction execution timeline from the instruction latencies in the GPU architecture configuration and the execution cycles of memory access instructions. For memory access instructions, execute steps 4.1.3 to 4.1.4; for non-memory access instructions, obtain the instruction execution time from the architecture information. Repeat until all instructions have been modeled.
[0030] Step 4.1.3: Generate memory access request information that carries the issue time;
[0031] Step 4.1.4: Pass the memory access request information to the cycle-accurate memory access module, perform cycle-accurate simulation to obtain the execution cycles of the memory access instruction, and return to step 4.1.2.
[0032] The total number of instructions executed by the GPU application in step 5 above refers to the number of instructions extracted from the GPU application runtime trace;
[0033] The formula for calculating the GPU performance metric IPC is:
[0034] $$IPC = \frac{N_{instruction}}{cycle_{max}}$$
[0035] The present invention has the following beneficial effects:
[0036] This invention addresses the difficulty of balancing speed and accuracy in GPU modeling. The memory access module, which strongly affects accuracy, is modeled cycle-accurately, while the instruction execution flow is modeled analytically. Compared with similar GPU performance models, this approach preserves modeling speed while improving accuracy, which benefits GPU architecture exploration.
Description of the Drawings
[0037] Figure 1 is a flowchart of the GPU performance modeling method of this invention, which combines an analytical model and a cycle-accurate model.
[0038] Figure 2 is a flowchart for constructing the instruction execution sequence in the present invention.
Detailed Implementation
[0039] The embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
[0040] As shown in Figure 1, a GPU performance modeling method combining an analytical model and a cycle-accurate model includes:
[0041] Step 1: Collect runtime traces of GPU applications;
[0042] A GPU application runtime trace is a log file that records the state of the GPU application's instructions during execution, including the thread block number, warp number, instruction opcode, the count and IDs of the instruction's source registers, the count and IDs of the instruction's destination registers, memory access type, and memory access address.
[0043] Step 2: Extract GPU architecture information and construct a memory access module model based on a cycle-accurate model.
[0044] Step 2.1: Extract GPU architecture information, including the number of GPU streaming multiprocessors, GPU cache configuration, GPU memory configuration, GPU clock frequency, GPU warp scheduling policy, and GPU instruction latency information.
[0045] Step 2.2: Construct a memory access module model based on a cycle-accurate model to obtain a GPU memory access module cycle-accurate simulator:
[0046] Based on GPU architecture information, build a cycle-accurate simulator that includes L1 cache, L2 cache, on-chip network, and GPU memory;
[0047] The main parameters of the simulator include: the number of L1 and L2 caches, cache mapping method, cache replacement policy, number of miss registers (miss status holding registers, MSHRs), on-chip network bandwidth, GPU memory bandwidth, and number of memory controllers.
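To make the role of these parameters concrete, the following is a minimal sketch of one component such a simulator might contain: a set-associative cache with LRU replacement that returns a latency per access. It is a deliberately simplified stand-in, not the cycle-accurate simulator of the invention: it ignores MSHRs, the on-chip network, bandwidth limits, and memory controllers. The names `CacheConfig` and `ToyCache` and all latency values are hypothetical.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheConfig:
    """Hypothetical parameter set; the field names are assumptions."""
    num_sets: int       # number of sets (mapping method)
    ways: int           # associativity
    line_bytes: int     # cache line size in bytes
    hit_latency: int    # cycles charged on a hit (assumed fixed)
    miss_latency: int   # cycles charged on a miss (assumed fixed)


class ToyCache:
    """Set-associative cache with LRU replacement (illustration only)."""

    def __init__(self, cfg: CacheConfig) -> None:
        self.cfg = cfg
        # One ordered dict per set; insertion order tracks recency.
        self.sets = [OrderedDict() for _ in range(cfg.num_sets)]

    def access(self, address: int) -> int:
        """Return the latency in cycles for one access to `address`."""
        line = address // self.cfg.line_bytes
        index = line % self.cfg.num_sets
        tag = line // self.cfg.num_sets
        entries = self.sets[index]
        if tag in entries:
            entries.move_to_end(tag)      # mark as most recently used
            return self.cfg.hit_latency
        if len(entries) >= self.cfg.ways:
            entries.popitem(last=False)   # evict the least recently used line
        entries[tag] = True
        return self.cfg.miss_latency
```

A real memory access module would also track when each request is issued and model contention between outstanding requests, which is what makes the latency depend on the issue time.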
[0048] Step 3: Analyze and classify the collected traces. Arrange the instructions of each warp in the analyzed and classified traces according to the execution order to generate an instruction sequence.
[0049] This step categorizes each instruction in the trace into its corresponding warp, forming an instruction sequence.
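The classification in Step 3 can be sketched as a grouping of trace records by thread block and warp. This is a minimal sketch under two assumptions: the trace has already been parsed into records, and the records appear in execution order. The `TraceRecord` type and its field names are hypothetical and only mirror the fields listed for the trace in Step 1.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class TraceRecord:
    """One traced instruction; field names are assumptions."""
    block_id: int
    warp_id: int
    opcode: str
    src_regs: Tuple[str, ...] = ()
    dst_regs: Tuple[str, ...] = ()
    mem_type: Optional[str] = None   # e.g. "global"; None for non-memory ops
    mem_addr: Optional[int] = None


def build_instruction_sequences(
    records: Iterable[TraceRecord],
) -> Dict[Tuple[int, int], List[TraceRecord]]:
    """Group records by (block_id, warp_id), preserving trace order."""
    sequences: Dict[Tuple[int, int], List[TraceRecord]] = defaultdict(list)
    for rec in records:
        sequences[(rec.block_id, rec.warp_id)].append(rec)
    return dict(sequences)
```

Keying on the pair (thread block, warp) keeps warps with the same warp number in different thread blocks separate.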
[0050] Step 4: Map the instructions of each thread block to the target GPU architecture, execute the interval construction algorithm, and obtain instruction latency by different methods for memory access instructions and non-memory access instructions, constructing the instruction execution timeline until the instructions in all thread blocks have been modeled, and obtain the number of execution cycles of the GPU application, $cycle_{max}$;
[0051] This step combines GPU architecture configuration information and instruction sequences to model the instruction execution process for each thread block, calculating instruction issue time and completion time.
[0052] When calculating instruction completion time, the execution time of a non-memory access instruction is obtained from the architecture information. For a memory access instruction, the issue time and the request information are passed together to the cycle-accurate GPU memory access module model, which simulates the request cycle-accurately to obtain the execution time of the memory access instruction. Figure 2 shows the flowchart for constructing the instruction execution sequence.
[0053] The first step in constructing the instruction execution sequence is to analyze the dependencies between instructions.
[0054] Then, determine whether all instructions of all warps in all thread blocks have been executed. If they have, output the instruction execution sequence, and the construction process ends.
[0055] If some instructions have not yet been executed, select an instruction from the currently active warp and determine whether it depends on any already executed instruction. If so, the current instruction's issue cycle is the maximum of (completion cycle of the instruction it depends on + 1) and (issue cycle of the previous instruction + 1); otherwise, the current instruction's issue cycle is the issue cycle of the previous instruction + 1. After calculating the issue cycle, determine the instruction type. For a non-memory access instruction, obtain its latency from the GPU configuration information. For a memory access instruction, generate a memory access request carrying the issue cycle and simulate it in the cycle-accurate memory access model to obtain the latency of the memory access instruction. Then calculate the instruction's completion cycle from its issue cycle and its execution latency. Check again whether all instructions of all warps in all thread blocks have been executed, and repeat the above process until they have.
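The issue-cycle rule above can be sketched for a single warp as follows. This is a minimal illustration under stated assumptions, not the full procedure of Figure 2: it models one warp in isolation (no warp scheduling across warps), treats only read-after-write register dependencies as "dependent", starts the first instruction at cycle 0, and computes the completion cycle as issue cycle plus latency. The names `Instr`, `build_timeline`, and the `memory_latency` callback (standing in for the cycle-accurate memory access module) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """One instruction of a warp; field names are assumptions."""
    opcode: str
    src_regs: Tuple[str, ...] = ()
    dst_regs: Tuple[str, ...] = ()
    mem_addr: Optional[int] = None  # set only for memory access instructions


def build_timeline(
    instrs: List[Instr],
    latency_table: Dict[str, int],
    memory_latency: Callable[[int, int], int],
) -> List[Tuple[int, int]]:
    """Return (issue_cycle, completion_cycle) for each instruction."""
    reg_ready: Dict[str, int] = {}  # register -> completion cycle of last writer
    timeline: List[Tuple[int, int]] = []
    prev_issue = -1                 # assumption: first instruction issues at 0
    for ins in instrs:
        issue = prev_issue + 1
        producers = [reg_ready[r] for r in ins.src_regs if r in reg_ready]
        if producers:
            # Dependent: wait until one cycle after the producer completes.
            issue = max(issue, max(producers) + 1)
        if ins.mem_addr is not None:
            # Memory access: ask the memory model, passing the issue cycle.
            latency = memory_latency(issue, ins.mem_addr)
        else:
            # Non-memory access: fixed latency from the GPU configuration.
            latency = latency_table[ins.opcode]
        complete = issue + latency
        for r in ins.dst_regs:
            reg_ready[r] = complete
        timeline.append((issue, complete))
        prev_issue = issue
    return timeline
```

Passing the issue cycle into `memory_latency` reflects the point made in the text: the latency of a memory access depends on when the request reaches the memory system, so it cannot be read from a fixed table.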
[0056] Step 4 specifically includes:
[0057] Step 4.1: Map the instructions of each thread block to the target GPU architecture and model the execution process until all thread blocks have been modeled. This specifically includes:
[0058] Step 4.1.1: Analyze the dependencies between instructions based on register usage, select the active warp, and model the instruction execution process;
[0059] Step 4.1.2: Construct the instruction execution timeline from the instruction latencies in the GPU architecture configuration and the execution cycles of memory access instructions. For memory access instructions, execute steps 4.1.3 to 4.1.4; for non-memory access instructions, obtain the instruction execution time from the architecture information. Repeat until all instructions have been modeled.
[0060] Step 4.1.3: Generate memory access request information that carries the issue time;
[0061] Step 4.1.4: Pass the memory access request information to the cycle-accurate memory access module, perform cycle-accurate simulation to obtain the execution cycles of the memory access instruction, and return to step 4.1.2.
[0062] Step 4.2: Obtain the completion cycle of the last instruction of each warp in all thread blocks, and take the maximum value as the number of execution cycles of the GPU application, $cycle_{max}$.
[0063] Step 5: Obtain the total number of instructions executed by the GPU application, $N_{instruction}$, and combine it with $cycle_{max}$ to calculate the GPU performance metric IPC using the formula:
[0064] $$IPC = \frac{N_{instruction}}{cycle_{max}}$$
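As a worked illustration of Steps 4.2 and 5, the sketch below takes the completion cycle of the last instruction of each warp, uses the maximum as $cycle_{max}$, and divides the total instruction count by it, following the definition of IPC as instructions per cycle. The function name `compute_ipc` and the dictionary layout keyed by (thread block, warp) are assumptions made for the example.

```python
from typing import Dict, Tuple


def compute_ipc(
    last_completion_cycles: Dict[Tuple[int, int], int],
    n_instruction: int,
) -> float:
    """IPC = n_instruction / cycle_max.

    `last_completion_cycles` maps (block_id, warp_id) to the completion
    cycle of that warp's last instruction (hypothetical layout).
    """
    if not last_completion_cycles:
        raise ValueError("no warps were modeled")
    # Step 4.2: the application finishes when the slowest warp finishes.
    cycle_max = max(last_completion_cycles.values())
    if cycle_max <= 0:
        raise ValueError("cycle_max must be positive")
    # Step 5: instructions per cycle.
    return n_instruction / cycle_max
```

For example, with two warps finishing at cycles 110 and 200 and 400 traced instructions in total, $cycle_{max}$ is 200 and the IPC is 400 / 200 = 2.0.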
[0065] The above are merely preferred embodiments of the present invention. The scope of protection of the present invention is not limited to the above embodiments. All technical solutions falling within the scope of the present invention's concept are within the scope of protection of the present invention. It should be noted that for those skilled in the art, any improvements and modifications made without departing from the principles of the present invention should be considered within the scope of protection of the present invention.
Claims
1. A GPU performance modeling method combining an analytical model and a cycle-accurate model, characterized in that it includes:
Step 1: Collect runtime traces of GPU applications;
Step 2: Extract GPU architecture information and construct a memory access module model based on a cycle-accurate model;
Step 3: Analyze the runtime trace of the GPU application and generate an instruction sequence;
Step 4: Map the instructions of each thread block to the target GPU architecture, and construct the instruction execution timeline using analytical modeling until the instructions in all thread blocks have been modeled, obtaining the number of execution cycles of the GPU application, $cycle_{max}$, including:
Step 4.1: Map the instructions of each thread block to the target GPU architecture and model the execution process until all thread blocks have been modeled, including:
Step 4.1.1: Analyze the dependencies between instructions based on register usage, select the active warp, and model the instruction execution process;
Step 4.1.2: Construct the instruction execution timeline from the instruction latencies in the GPU architecture configuration and the execution cycles of memory access instructions; for memory access instructions, execute steps 4.1.3 to 4.1.4; for non-memory access instructions, obtain the instruction execution time from the architecture information; repeat until all instructions have been modeled;
Step 4.1.3: Generate memory access request information that carries the issue time;
Step 4.1.4: Pass the memory access request information to the cycle-accurate memory access module, perform cycle-accurate simulation to obtain the execution cycles of the memory access instruction, and return to step 4.1.2;
Step 4.2: Obtain the completion cycle of the last instruction of each warp in all thread blocks, and take the maximum value as the number of execution cycles of the GPU application, $cycle_{max}$;
Step 5: Obtain the total number of instructions executed by the GPU application, $N_{instruction}$, and combine it with $cycle_{max}$ to calculate the GPU performance metric IPC.
2. The GPU performance modeling method combining an analytical model and a cycle-accurate model according to claim 1, characterized in that Step 1 uses a trace collection tool to collect log files that record the instruction state of the GPU application. Each log file records, for every executed instruction, the thread block number, the warp number, the instruction opcode, the count and IDs of the instruction's source registers, the count and IDs of the instruction's destination registers, the memory access type, and the memory access address.
3. The GPU performance modeling method combining an analytical model and a cycle-accurate model according to claim 1, characterized in that Step 2 includes the following steps: Step 2.1: Extract GPU architecture information, including the number of GPU streaming multiprocessors, GPU cache configuration, GPU memory configuration, GPU clock frequency, GPU warp scheduling policy, and GPU instruction latency information; Step 2.2: Construct a memory access module model based on a cycle-accurate model to obtain a cycle-accurate simulator of the GPU memory access module: based on the GPU architecture information, a cycle-accurate simulator including L1 cache, L2 cache, on-chip network, and GPU memory is built; simulator parameters include: the number of L1 and L2 caches, cache mapping method, cache replacement policy, number of miss registers (miss status holding registers, MSHRs), on-chip network bandwidth, GPU memory bandwidth, and number of memory controllers.
4. The GPU performance modeling method combining an analytical model and a cycle-accurate model according to claim 1, characterized in that Step 3 involves parsing and classifying the collected traces, arranging the instructions of each warp in the parsed and classified traces according to the execution order, and generating an instruction sequence.
5. The GPU performance modeling method combining an analytical model and a cycle-accurate model according to claim 1, characterized in that the total number of instructions executed by the GPU application in step 5 is the number of instructions extracted from the GPU application runtime trace, and the formula for calculating the GPU performance metric IPC is: $$IPC = \frac{N_{instruction}}{cycle_{max}}$$