Performance analysis method, electronic device, and storage medium

CN122309307A · Pending · Publication Date: 2026-06-30 · ZHONGKE SUGUANG INFORMATION IND CHENGDU CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: ZHONGKE SUGUANG INFORMATION IND CHENGDU CO LTD
Filing Date: 2026-03-26
Publication Date: 2026-06-30

Application Information

IPC: G06F11/34; G06F3/06

AI Tagging

Technology Topics

Computational science Graphics

AI Technical Summary

Technical Problem

In existing technologies, performance analysis of target functions running on graphics processors has low accuracy, mainly because the bottleneck type is identified inaccurately: the roofline model depends on theoretical parameters, and the computational intensity is estimated indirectly.

Method used

The per-instruction operation counts, the running data, and the runtime of the graphics processor are obtained and used to calculate the computing speed and the computational intensity for each memory region. These measured values are compared with the roofline model to determine the bottleneck type (computational bottleneck or bandwidth bottleneck) down to the specific memory region.

Benefits of technology

It improves the accuracy of bottleneck identification and performance analysis, enabling the bottleneck to be located to a specific memory area, thus enhancing the precision of performance analysis.

✦ Generated by Eureka AI based on patent content.

Patent Text Reader

Abstract

This disclosure provides a performance analysis method, an electronic device, and a storage medium. The method includes: obtaining the operation count of each instruction supported by a graphics processing unit (GPU), together with the running data and runtime recorded while the GPU executes a target function; determining the GPU's computing speed and its computational intensity for each of multiple memory regions based on the runtime, the running data, and the number of operations of the executed instructions, which is derived from the per-instruction operation counts and the types of instructions executed; combining the computing speed with each computational intensity to form performance indicators for the data read and written in the respective memory regions during execution of the target function; and, when the bottleneck type of the target function is determined to be a computational bottleneck or a bandwidth bottleneck from the position of the performance indicators relative to a roofline model, analyzing the performance of the target function according to that bottleneck type. The method improves the accuracy of performance analysis for target functions running on a GPU.

Description

Technical Field

[0001] This disclosure relates to the field of artificial intelligence technology, and in particular to a performance analysis method, electronic device, and storage medium.

Background Technology

[0002] With the widespread application of heterogeneous computing platforms such as graphics processing units (GPUs) in the field of artificial intelligence, optimizing the performance of target functions running on GPUs has become central to improving computing efficiency. Before optimization, the bottleneck type of the target function usually has to be identified as the basis for performance analysis.

[0003] In related technologies, the bottleneck type of a target function is usually located by comparing the computational intensity and computing power that the GPU achieves when running the target function against a roofline model that represents the hardware's performance limits. However, the computational intensity is usually estimated indirectly from theoretical formulas, and the roofline model is usually constructed from ideal parameters provided by the manufacturer. As a result, the bottleneck type is located with low accuracy, and the performance analysis is correspondingly inaccurate.

[0004] Therefore, the related technologies suffer from low accuracy in the performance analysis of target functions running on graphics processors.

Summary of the Invention

[0005] In view of this, the present disclosure provides a performance analysis method, an electronic device, and a storage medium.

[0006] One aspect of this disclosure provides a performance analysis method, comprising: obtaining the number of operations of instructions supported by the graphics processing unit (GPU), and the running data and runtime during the execution of a target function, wherein the running data includes the type and number of executed instructions, and the amount of data read and written by each of multiple independent memory regions in the GPU; determining the GPU's computing speed and computational intensity for multiple memory regions based on the runtime, running data, and the number of operations of the executed instructions during the execution of the target function obtained using the number of operations and the type of executed instructions, wherein the computational intensity represents the number of operations performed per unit amount of data read and written in the corresponding memory region; combining the computing speed and multiple computational intensities respectively as performance indicators of the amount of data read and written by multiple memory regions during the execution of the target function by the GPU; and, if the bottleneck type of the target function is determined to be a computational bottleneck or a bandwidth bottleneck based on the positional relationship of the performance indicator relative to the roofline model, analyzing the performance of the target function according to the bottleneck type, wherein the roofline model represents the relationship between the GPU's maximum computing power and computational intensity.

[0007] According to embodiments of this disclosure, the computing speed of the graphics processor and its computational intensity for multiple memory regions are calculated from actual data obtained during the execution of the target function, namely the running data and the runtime. The resulting performance indicators are compared with the roofline model to determine the bottleneck type. This avoids the large discrepancies between theoretically derived computational intensity and actual values, which improves the accuracy of bottleneck location, and it also pinpoints the bottleneck to a specific memory region, further improving the accuracy of performance analysis.

[0008] According to embodiments of this disclosure, determining the computing speed of the graphics processor and the computational intensity for multiple memory regions, based on the runtime, the running data, and the number of operations of the executed instructions obtained from the per-instruction operation counts and the types of instructions executed, includes: determining the total number of operations of the instructions executed during the execution of the target function based on the operation count and the execution count of each executed instruction; and determining the computing speed of the graphics processor using the total number of operations and the runtime.

[0009] According to embodiments of this disclosure, the total number of operations is obtained by combining the operation counts of the different instruction types, which quantifies the computational load during the execution of the target function and provides a basis for subsequent performance quantification. The computing speed is quantified as the number of operations per unit time, which measures how much of the graphics processor's computing power the target function uses.

[0010] According to embodiments of this disclosure, the method further includes: calculating the computational intensity of the graphics processor for the multiple memory regions using the total number of operations and the amount of data read and written by each of the multiple independent memory regions.

[0011] According to embodiments of this disclosure, the computational intensity is calculated for different memory areas, taking into account the varying degrees of data reuse in different memory regions, which affect their respective computational intensity. Calculating the computational intensity for each region separately helps in identifying the bottleneck memory region during bottleneck localization.

[0012] According to embodiments of this disclosure, combining the computing speed with each of the multiple computational intensities as performance indicators for the data read and written in the multiple memory regions during the execution of the target function by the graphics processor includes: using each combination of the computing speed and a computational intensity as the target position of the corresponding memory region relative to the roofline model, where the target position indicates the performance associated with the data read and written in that memory region during the execution of the target function.

[0013] According to embodiments of this disclosure, expressing the performance indicators as target positions makes the comparison with the roofline model more efficient, which in turn makes bottleneck location more efficient.

[0014] According to embodiments of this disclosure, the roofline model includes a first roofline and a second roofline, where the computational intensity represented by the first roofline is greater than that represented by the second roofline. Determining the bottleneck type of the target function based on the positional relationship of the performance indicators relative to the roofline model includes: when the target positions of the multiple memory regions match the first roofline, determining that the bottleneck type of the target function is a computational bottleneck; and when the target position of at least one memory region matches the second roofline, determining that the bottleneck type of the target function is a bandwidth bottleneck.

[0015] According to embodiments of this disclosure, higher computing intensity corresponds to a first roofline for comparing computing power, while lower computing intensity corresponds to a second roofline for comparing bandwidth. This allows for the identification of computing bottlenecks and bandwidth bottlenecks at different computing intensities, not only distinguishing between computing bottlenecks and bandwidth bottlenecks but also pinpointing bandwidth bottlenecks to specific memory regions.
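The decision rule described here can be sketched in Python. This is a minimal illustration, not the disclosed implementation. It assumes that a memory region's target position "matches" the second roofline when its computational intensity lies below that region's ridge point (peak computing speed divided by the region's memory bandwidth), and matches the first roofline otherwise. The function name `classify_bottleneck`, the dictionary layout, and all numeric values are hypothetical.

```python
from typing import Dict, List, Tuple


def classify_bottleneck(
    intensities: Dict[str, float],   # operations per byte, per memory region
    bandwidths: Dict[str, float],    # bytes per second, per memory region
    peak_speed: float,               # operations per second (first roofline)
) -> Tuple[str, List[str]]:
    """Return the bottleneck type and the regions that are bandwidth-limited."""
    limited = []
    for region, intensity in intensities.items():
        # Assumed ridge point: the intensity at which this region's slanted
        # (second) roofline meets the flat (first) roofline.
        ridge = peak_speed / bandwidths[region]
        if intensity < ridge:
            limited.append(region)
    if limited:
        # At least one region sits under a second roofline.
        return "bandwidth", limited
    # Every region sits under the first roofline.
    return "compute", []


# Hypothetical numbers chosen only for illustration.
bottleneck, regions = classify_bottleneck(
    intensities={"LDS": 8.0, "L2": 20.0, "HBM": 5.0},
    bandwidths={"LDS": 8e12, "L2": 4e12, "HBM": 1e12},
    peak_speed=2e13,
)
print(bottleneck, regions)
```

With these made-up inputs only the HBM intensity falls below its ridge point, so the sketch reports a bandwidth bottleneck located in HBM.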

[0016] According to embodiments of this disclosure, the roofline model includes a first roofline and a second roofline, where the computational intensity represented by the first roofline is greater than that represented by the second roofline. The method further includes, before determining the bottleneck type of the objective function based on the positional relationship of the performance indicator relative to the roofline model: acquiring test data during the execution of a test program of at least one matrix size by the graphics processor; the test data includes the total number of test operations, the test runtime, and the amount of test read / write data for multiple memory regions; the matrix size represents the amount of data input to the function in the test program; calculating the test computation speed of the graphics processor using the test data when the matrix size is greater than a preset matrix size; and generating a first roofline during the execution of the test program by the graphics processor based on the maximum test computation speed and the corresponding matrix size, where the first roofline represents the relationship between the maximum computing power and computational intensity of the graphics processor when the matrix size is greater than a preset matrix size.

[0017] According to embodiments of this disclosure, the preset matrix size is used to separate test runs by the amount of data input to the function in the test program, and thereby by computational intensity. At high computational intensity, the true maximum computing power of the graphics processor can be measured. On this basis, test programs with different matrix sizes are run to find the maximum test computation speed, from which the first roofline is generated. Because the first roofline is obtained from data measured during actual operation, the roofline model is more accurate and provides a sound basis for bottleneck location.

[0018] According to embodiments of this disclosure, the method further includes: when the matrix size is smaller than the preset matrix size, calculating the test memory bandwidth of each of the multiple memory regions using the test data; calculating the maximum computational intensity of each of the multiple memory regions using the maximum test computation speed and the test memory bandwidth; and generating a second roofline for each of the multiple memory regions during the execution of the test program by the graphics processor based on the maximum computational intensity of that memory region, the maximum test computation speed, and the test memory bandwidth, wherein the second roofline represents the relationship between the maximum computing power and the computational intensity of the graphics processor when the matrix size is smaller than the preset matrix size.

[0019] According to embodiments of this disclosure, a test program with low computational intensity is used to measure the true memory bandwidth of the graphics processor. On this basis, the maximum computational intensity, the maximum test computation speed, and the test memory bandwidth are determined from the measured data of each memory region to generate the second rooflines, which improves the accuracy of the roofline model and provides the basis for bottleneck location.
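A rough Python sketch of how both rooflines might be derived from measured test runs is given below. Several points are assumptions of this sketch rather than statements of the disclosure: the record type `BenchRun` and its field names are hypothetical, the test memory bandwidth of a region is taken as the best value over all small-matrix runs, and the "maximum computational intensity" of a region is interpreted as the ridge point, that is, the maximum test computation speed divided by the region's test memory bandwidth.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class BenchRun:
    matrix_size: int                # amount of data input to the test function
    total_ops: float                # total number of test operations
    runtime_s: float                # test runtime in seconds
    bytes_moved: Dict[str, float]   # test read/write data per memory region


def build_rooflines(
    runs: List[BenchRun], preset_size: int
) -> Tuple[float, Dict[str, float], Dict[str, float]]:
    # Needs at least one run above and one run below the preset size.
    large = [r for r in runs if r.matrix_size > preset_size]
    small = [r for r in runs if r.matrix_size < preset_size]

    # First roofline: the highest measured computation speed (flat ceiling).
    peak_speed = max(r.total_ops / r.runtime_s for r in large)

    # Test memory bandwidth per region (best small-matrix run: an assumption).
    bandwidth: Dict[str, float] = {}
    for r in small:
        for region, nbytes in r.bytes_moved.items():
            bw = nbytes / r.runtime_s
            bandwidth[region] = max(bandwidth.get(region, 0.0), bw)

    # Assumed ridge point per region: where bandwidth * intensity = peak speed.
    ridge = {region: peak_speed / bw for region, bw in bandwidth.items()}
    return peak_speed, bandwidth, ridge


# Hypothetical measurements, for illustration only.
runs = [
    BenchRun(4096, 1e13, 0.5, {"HBM": 1e9}),
    BenchRun(8192, 1.6e13, 1.0, {"HBM": 4e9}),
    BenchRun(64, 1e6, 0.001, {"HBM": 1e9, "L2": 4e9}),
]
peak, bw, ridge = build_rooflines(runs, preset_size=1024)
print(peak, bw, ridge)
```

Each second roofline is then the line `bandwidth[region] * intensity` up to `ridge[region]`, after which the first roofline `peak_speed` applies.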

[0020] According to embodiments of this disclosure, the test program includes a first test program and a second test program. The first test program is used to obtain the total number of test operations, and the second test program is used to obtain the amount of test read / write data for multiple memory regions. The method further includes: obtaining preset instructions for the target computing unit and the address of the graphics processor's registers; using the preset instructions and addresses, generating a first test program that processes data in the graphics processor's registers according to the preset instructions; obtaining the storage capacity and type of multiple memory regions, including any one of shared memory, high-bandwidth memory, and L2 cache; obtaining the amount of data that supports read / write operations in multiple memory regions based on the storage capacity and type; and using the data amount and read / write instructions for each of the multiple memory regions, generating a second test program that reads and writes data in multiple memory regions.

[0021] According to embodiments of this disclosure, the computing power test program is generated using preset instructions and the addresses of the graphics processor's registers. This ensures that data is transferred only within the registers, eliminating memory access latency and bandwidth interference, thereby accurately obtaining the maximum computing power of the computing unit. The bandwidth test program determines the amount of data accessed to different memory regions based on different storage capacities and types, ensuring that accessed data only passes through the target memory region. This eliminates interference between memory regions and improves the accuracy of determining the memory bandwidth of each memory region.

[0022] Another aspect of this disclosure provides an apparatus comprising:

[0023] The acquisition module is used to acquire the number of operations of the instructions supported by the graphics processor, as well as the running data and runtime during the execution of the target function. The running data includes the type and number of instructions executed, as well as the amount of read and write data in each of the multiple independent memory areas in the graphics processor.

[0024] The determination module is used to determine the computing speed of the graphics processor and the computational intensity for multiple memory regions based on the runtime, the running data, and the number of operations of the instructions executed during the execution of the target function, which is obtained from the per-instruction operation counts and the types of instructions executed. The computational intensity represents the number of operations performed per unit amount of data read and written in the corresponding memory region.

[0025] The combination module is used to combine the computing speed with each of the multiple computational intensities as performance indicators for the data read and written in the multiple memory regions during the execution of the target function by the graphics processor.

[0026] The analysis module is used to analyze the performance of the target function according to its bottleneck type when the bottleneck type is determined, based on the positional relationship of the performance indicators relative to the roofline model, to be a computational bottleneck or a bandwidth bottleneck. The roofline model represents the relationship between the maximum computing power and the computational intensity of the graphics processor.

[0027] Another aspect of this disclosure provides an electronic device, including: one or more processors; and a storage device for storing one or more programs, wherein when the one or more programs are executed by the one or more processors, the one or more processors perform the method described above.

[0028] Another aspect of this disclosure provides a computer-readable storage medium having executable instructions stored thereon, which, when executed by a processor, cause the processor to perform the methods described above.

[0029] Another aspect of this disclosure provides a computer program product, including a computer program that, when executed by a processor, implements the above-described method.

[0030] The one or more embodiments above have the following technical effects: the computing speed of the graphics processor and its computational intensity for multiple memory regions are calculated from actual data obtained during the execution of the target function, namely the running data and the runtime. Comparing these values with the roofline model avoids the large deviations between theoretically derived computational intensity and actual values, which improves the accuracy of bottleneck location, and also pinpoints the bottleneck to a specific memory region, further improving the accuracy of performance analysis.

[0031] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.

Description of the Drawings

[0032] The above and other objects, features and advantages of this disclosure will become clearer from the following description of embodiments with reference to the accompanying drawings, in which:

[0033] Figure 1 schematically shows the principle of a performance analysis method according to an embodiment of the present disclosure;

[0034] Figure 2 schematically shows a flowchart of a performance analysis method according to an embodiment of the present disclosure;

[0035] Figure 3 schematically shows a flowchart of determining a roofline model according to an embodiment of the present disclosure;

[0036] Figure 4 schematically shows a flowchart of a performance analysis method according to an embodiment of the present disclosure;

[0037] Figure 5 schematically shows a flowchart of a performance analysis method according to an embodiment of the present disclosure;

[0038] Figure 6 schematically shows a roofline model according to an embodiment of the present disclosure;

[0039] Figure 7 schematically shows a structural block diagram of an apparatus according to an embodiment of the present disclosure;

[0040] Figure 8 schematically shows a block diagram of an electronic device that can be used to implement the methods of embodiments of the present disclosure.

Detailed Implementation

[0041] The embodiments of this disclosure will now be described with reference to the accompanying drawings. Various details of the embodiments of this disclosure are included to aid understanding and should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.

[0042] In the technical solutions disclosed herein, the collection, storage, use, processing, transmission, provision, disclosure, and application of data (including but not limited to user personal information) comply with the provisions of relevant laws and regulations, necessary confidentiality measures have been taken, and they do not violate public order and good morals.

[0043] With the widespread application of heterogeneous computing platforms such as graphics processing units (GPUs) in fields such as artificial intelligence, scientific computing, and massively parallel processing, optimizing the performance of target functions running on GPUs has become central to improving computational efficiency. The performance of the target function directly determines the processing speed and resource utilization of the computational task, and accurate identification of performance bottlenecks is a prerequisite for effective optimization.

[0044] In related technologies, the roofline model is the mainstream tool for performance analysis. The model maps computational intensity to maximum computing power and thereby defines the performance boundary of the hardware. It makes it possible to distinguish intuitively between computational bottlenecks and memory bottlenecks and gives direction for optimizing the target function.

[0045] However, in related technologies the roofline model is often determined from theoretical parameters or indirect derivations, which lowers the accuracy of the model and impairs bottleneck type identification. Furthermore, the key parameter, computational intensity, is typically estimated indirectly, which makes it susceptible to interference from the operating environment. These technologies therefore identify the bottleneck type inaccurately, resulting in low accuracy in the performance analysis of target functions running on the graphics processing unit.

[0046] Figure 1 schematically shows the principle of a performance analysis method according to an embodiment of the present disclosure.

[0047] As shown in Figure 1, the graphics processor 100 includes a computing unit 101 and a memory area 102. The computing unit 101 includes a Vector Arithmetic Logic Unit (VALU) and a Tensor Core. The memory area 102 includes Local Data Share (LDS), High Bandwidth Memory (HBM), Level 1 Cache (L1), and Level 2 Cache (L2).

[0048] While the computing unit 101 executes the target function, data is read from and written to the memory area 102, so that the graphics processor 100 can output the running data, the number of operations, and the runtime, which serve as the basis for obtaining the computational intensity and the computing speed. The computational intensity and computing speed are then compared with the roofline model 103 to determine the bottleneck type of the target function, and the performance of the target function is analyzed accordingly.

[0049] Figure 2 schematically shows a flowchart of a performance analysis method according to an embodiment of the present disclosure.

[0050] As shown in Figure 2, the performance analysis method of this embodiment includes operations S210 to S240.

[0051] In operation S210, the number of operations of the instructions supported by the graphics processor, as well as the running data and runtime during the execution of the target function, are obtained.

[0052] The instructions supported by the graphics processor include computation instructions and memory access instructions. Computation instructions perform calculations, while memory access instructions transfer data between memory spaces.

[0053] It should be understood that each instruction supported by a graphics processing unit (GPU) corresponds to a specific number of operations, namely the number of operations performed by one execution of that instruction. For example, a fused multiply-add (FMA) instruction performs two operations per execution: one multiplication and one addition.

[0054] The running data includes the types and execution counts of the instructions executed, as well as the amount of data read and written in each of the multiple independent memory regions in the graphics processor.

[0055] The execution count is the number of times an instruction called by the target function is executed during the execution of the target function. For example, if the target function calls instruction 'a' and this instruction is executed 4 times during the execution of the target function, then the execution count for instruction type 'a' is 4.

[0056] Multiple independent memory regions represent multiple independent storage spaces on a graphics processing unit (GPU). In practical applications, the storage spaces on a GPU include LDS, HBM, L1, and L2.

[0057] The amount of data read and written refers to the amount of data read from and written to the memory area during the execution of the target function. For example, the amount of data read and written in memory area A is 500 bytes.

[0058] In operation S220, the computing speed of the graphics processor and the computational intensity for multiple memory regions are determined based on the runtime, the running data, and the number of operations of the instructions executed during the execution of the target function, which is obtained from the per-instruction operation counts and the types of instructions executed.

[0059] The number of operations of the executed instructions is the number of operations performed by the instructions called during the execution of the target function.

[0060] The computing speed represents the number of operations performed per unit time during the execution of the target function and reflects how fast the executed instructions carry out their operations.

[0061] The computational intensity represents the number of operations performed per unit amount of data read and written in the corresponding memory region.

[0062] It should be understood that, when a function executes on a graphics processor, data is reused to different degrees at different memory levels, so the memory traffic differs between levels and the computational intensity differs between memory regions. Calculating the computational intensity separately for each memory region improves the accuracy of bottleneck identification for the target function and allows the bottleneck to be pinpointed to a specific memory region.

[0063] In operation S230, the computing speed is combined with each of the multiple computational intensities as performance indicators for the data read and written in the multiple memory regions during the execution of the target function by the graphics processor.

[0064] The performance indicators represent the computing speed and the computational intensity associated with reading and writing data in the multiple memory regions during the execution of the target function, and are used to represent the performance of the target function.

[0065] In operation S240, when the bottleneck type of the target function is determined to be a computational bottleneck or a bandwidth bottleneck based on the positional relationship of the performance indicators relative to the roofline model, the performance of the target function is analyzed according to the bottleneck type.

[0066] A computational bottleneck indicates that the computing speed of the graphics processor has reached its limit.

[0067] A bandwidth bottleneck indicates that the amount of data the graphics processor reads and writes in a given memory region per unit time has reached the transfer limit of that memory region.

[0068] The roofline model represents the relationship between the maximum computing power and computing intensity of a graphics processor.
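In its conventional formulation, a roofline bounds the attainable computing speed by the smaller of the peak computing speed and the product of memory bandwidth and computational intensity. The disclosure does not spell out this formula, so the Python sketch below states it as an assumption; the function name `attainable_speed` and the numbers are hypothetical.

```python
def attainable_speed(intensity: float, peak_speed: float, bandwidth: float) -> float:
    """Upper bound on computing speed (operations/s) at a given intensity.

    intensity  : operations per byte for one memory region
    peak_speed : maximum computing speed of the processor (operations/s)
    bandwidth  : memory bandwidth of that region (bytes/s)
    """
    # Below the ridge point the bandwidth term is smaller (slanted roof);
    # above it the peak computing speed is smaller (flat roof).
    return min(peak_speed, bandwidth * intensity)


# Hypothetical hardware limits, for illustration only.
low = attainable_speed(5.0, peak_speed=2e13, bandwidth=1e12)    # bandwidth-limited
high = attainable_speed(40.0, peak_speed=2e13, bandwidth=1e12)  # compute-limited
print(low, high)
```

Comparing a measured computing speed with this bound at the measured intensity indicates how close the target function runs to the limit of the corresponding memory region.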

[0069] According to embodiments of this disclosure, the computing speed of the graphics processor and its computational intensity for multiple memory regions are calculated from actual data obtained during the execution of the target function, namely the running data and the runtime. Comparing these values with the roofline model avoids the large deviations between theoretically derived computational intensity and actual values, which improves the accuracy of bottleneck location, and also pinpoints the bottleneck to a specific memory region, further improving the accuracy of performance analysis.

[0070] In some exemplary embodiments, the total number of operations of multiple executed instructions during the execution of the target function is first determined based on the number of operations and the number of executions of the executed instructions. Then, the processing speed of the graphics processor is determined using the total number of operations and the runtime.

[0071] The total number of operations refers to the sum of the number of operations performed by the instructions called during the execution of the target function. For example, if the target function calls one floating-point fused multiply-accumulate instruction and three floating-point addition instructions during its execution, and the floating-point fused multiply-accumulate instruction performs two operations and the floating-point addition instruction performs one operation, then the total number of operations is 1×2+3×1=5 floating-point operations (FLOPs), meaning that the target function performed 5 floating-point operations.

[0072] If the runtime is 10 nanoseconds, then the computation speed is 5 ÷ (10 × 10⁻⁹) = 5 × 10⁸ FLOPs/s, which is the number of floating-point operations per second (FLOPS) that the graphics processor performs while running the objective function.
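The arithmetic of this example can be reproduced with a minimal Python sketch. The instruction labels and the names `OPS_PER_INSTRUCTION`, `total_operations`, and `compute_speed` are illustrative assumptions that do not come from the disclosure; only the instruction counts, the operations per instruction, and the 10 ns runtime are taken from the example.

```python
# Operations performed by one execution of each instruction type.
# The values follow the example; the instruction labels are hypothetical.
OPS_PER_INSTRUCTION = {"fma_f32": 2, "add_f32": 1}


def total_operations(executed_counts, ops_per_instruction=OPS_PER_INSTRUCTION):
    """Sum over instruction types of (executions x operations per execution)."""
    return sum(ops_per_instruction[name] * count
               for name, count in executed_counts.items())


def compute_speed(total_ops, runtime_seconds):
    """Operations per second achieved while running the target function."""
    if runtime_seconds <= 0:
        raise ValueError("runtime must be positive")
    return total_ops / runtime_seconds


executed = {"fma_f32": 1, "add_f32": 3}   # one FMA, three additions
ops = total_operations(executed)          # 1*2 + 3*1 = 5
speed = compute_speed(ops, 10e-9)         # runtime of 10 ns
print(f"{ops} FLOPs, {speed:.3e} FLOPs/s")
```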

[0073] According to embodiments of this disclosure, the total number of operations is obtained by combining the number of operations for different instruction types, thereby quantifying the computational load during the execution of the objective function and providing a basis for subsequent performance quantification. The computational speed is quantified as the number of operations per unit time, thereby measuring the utilization rate of the graphics processor's computing power by the objective function.

[0074] In some exemplary embodiments, the computational intensity of the graphics processor for multiple memory regions is calculated using the total number of operations and the amount of data read and written by each of the multiple independent memory regions.

[0075] In practical applications, for each memory region, the computational intensity is calculated by dividing the total number of operations by the amount of data read and written for that region. For example, when the memory region is of type LDS, its computational intensity is the total number of operations divided by the LDS data read and written amount; when the memory region is of type L2, its computational intensity is the total number of operations divided by the L2 data read and written amount; and when the memory region is of type HBM, its computational intensity is the total number of operations divided by the HBM data read and written amount.
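As a rough sketch of this per-region division, the following Python function computes one computational intensity per memory region. The function name, the dictionary layout, and the sample byte counts are assumptions made for illustration; returning infinity for a region with no recorded traffic is also a choice of this sketch, not something the disclosure specifies.

```python
def computational_intensity(total_ops, bytes_per_region):
    """Return operations per byte for each memory region.

    bytes_per_region maps a region name (e.g. "LDS", "L2", "HBM") to the
    amount of data read and written in that region, in bytes.
    """
    intensity = {}
    for region, nbytes in bytes_per_region.items():
        if nbytes <= 0:
            # Assumption of this sketch: no traffic means the region cannot
            # limit performance, so its intensity is treated as infinite.
            intensity[region] = float("inf")
        else:
            intensity[region] = total_ops / nbytes
    return intensity


# Hypothetical measurements for one run of the target function.
traffic = {"LDS": 400_000, "L2": 100_000, "HBM": 25_000}
print(computational_intensity(1_000_000, traffic))
```

The same total number of operations yields a different intensity for each region because the regions see different amounts of traffic, which is what allows a bottleneck to be attributed to one region later on.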

[0076] According to embodiments of this disclosure, the computational intensity is calculated for different memory areas, taking into account the varying degrees of data reuse in different memory regions, which affect their respective computational intensity. Calculating the computational intensity for each region separately helps in identifying the bottleneck memory region during bottleneck localization.

[0077] In some exemplary embodiments, a combination of computational speed and multiple computational intensities is used as the target position of multiple memory regions relative to the roofline model, the target position indicating the performance of the amount of data read and written by the multiple memory regions during the execution of the objective function by the graphics processor.

[0078] It should be understood that the roofline model represents the relationship between the graphics processor's maximum computing power and computational intensity, which can be represented in a two-dimensional coordinate system. In practical applications, coordinates are used to represent the target location. In the two-dimensional coordinate system of the roofline model, the x-axis represents computational intensity, and the y-axis represents computational speed. Therefore, the horizontal coordinate of the target location represents computational intensity, and the vertical coordinate represents computational speed.
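A target position can be sketched as a simple coordinate pair. The `TargetPosition` type and the `target_positions` helper below are hypothetical names introduced for illustration; the only content taken from the disclosure is that the x-coordinate is the computational intensity of a memory region and the y-coordinate is the computation speed, which is shared by all regions.

```python
from typing import Dict, NamedTuple


class TargetPosition(NamedTuple):
    intensity: float  # x-coordinate: operations per byte for one region
    speed: float      # y-coordinate: operations per second


def target_positions(speed: float,
                     intensities: Dict[str, float]) -> Dict[str, TargetPosition]:
    """Pair the single computation speed with each region's intensity."""
    return {region: TargetPosition(x, speed)
            for region, x in intensities.items()}


# Hypothetical values for one run of the target function.
positions = target_positions(5e8, {"LDS": 2.5, "L2": 10.0, "HBM": 40.0})
for region, pos in positions.items():
    print(region, pos)
```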

[0079] According to embodiments of this disclosure, quantifying performance indicators using target locations can improve the efficiency of comparison with roofline models, thus helping to improve the efficiency of bottleneck location.

[0080] In the embodiments of this disclosure, the roofline model includes a first roofline and a second roofline. The computational intensity represented by the first roofline is greater than that represented by the second roofline. The first roofline represents the relationship between computational intensity and computing speed when the bottleneck type is a computing power bottleneck, and the second roofline represents the relationship between computational intensity and computing speed when the bottleneck type is a bandwidth bottleneck.

[0081] In some exemplary embodiments, if the target locations of multiple memory regions match the first roofline, the bottleneck type of the objective function is determined to be a computational bottleneck. If the target location of at least one memory region matches the second roofline, the bottleneck type of the objective function is determined to be a bandwidth bottleneck.

[0082] It should be understood that matching the target location with the first roofline means that the target location is within the region of the first roofline. This is because the position on the first roofline represents the maximum computing power of the graphics processing unit (GPU) under high computational intensity, i.e., the maximum processing speed of the GPU. Therefore, if the target locations of multiple memory regions match the first roofline, it indicates that the processing speed of the objective function has reached its maximum. For example, if the target locations of LDS, L2, and HBM all match the first roofline, this is identified as a computational bottleneck.

[0083] The position on the second roofline represents the maximum processing speed of the corresponding graphics processor under relatively low computational intensity. Therefore, if one of the target locations in multiple memory regions matches the second roofline, it indicates that the target function encounters a bandwidth bottleneck in the matched memory region during runtime.

[0084] For example, if the target position of LDS matches the first roofline, and the target positions of L2 and HBM match the second roofline, then it is determined that the objective function has a bandwidth bottleneck in either L2 or HBM. As another example, if only the target position of HBM matches the second roofline, then it is determined that the objective function has a bandwidth bottleneck in HBM. And as yet another example, if the target position of LDS matches the second roofline, then it is determined that the objective function has a bandwidth bottleneck in LDS.

[0085] According to embodiments of this disclosure, higher computing intensity corresponds to a first roofline for comparing computing power, while lower computing intensity corresponds to a second roofline for comparing bandwidth. This allows for the identification of computing bottlenecks and bandwidth bottlenecks at different computing intensities, not only distinguishing between computing bottlenecks and bandwidth bottlenecks but also pinpointing bandwidth bottlenecks to specific memory regions.
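The decision rule described above can be sketched in Python as follows. The disclosure does not state how close a target position must be to a roofline to "match" it, so the relative `tolerance` parameter is an assumption of this sketch, as are the function names. Units are assumed to be GFLOPs/sec for speed, GB/s for bandwidth, and FLOPs/Byte for intensity; the peak and bandwidth figures are the ones given for Figure 6, while the measured points are hypothetical.

```python
def match_roofline(intensity, speed, peak, bandwidth, tolerance=0.1):
    """Return "first", "second", or None for one memory region.

    A point is compared with the first (compute) roofline when its intensity
    is at or beyond the intersection of the two rooflines, and with the
    second (bandwidth) roofline otherwise. It matches when its speed is
    within `tolerance` of that roofline (assumed matching criterion).
    """
    ridge = peak / bandwidth
    if intensity >= ridge:
        roof, label = peak, "first"
    else:
        roof, label = bandwidth * intensity, "second"
    return label if speed >= (1.0 - tolerance) * roof else None


def bottleneck_type(intensities, speed, peak, bandwidths, tolerance=0.1):
    """Classify a run as "compute", "bandwidth" (with regions), or "none"."""
    matches = {region: match_roofline(x, speed, peak, bandwidths[region], tolerance)
               for region, x in intensities.items()}
    bound_regions = sorted(r for r, m in matches.items() if m == "second")
    if bound_regions:                       # at least one region on a second roofline
        return "bandwidth", bound_regions
    if matches and all(m == "first" for m in matches.values()):
        return "compute", []                # every region on the first roofline
    return "none", []


PEAK = 15338.0                              # VALU line 602, GFLOPs/sec
BANDWIDTHS = {"LDS": 14521.0, "L2": 5415.0, "HBM": 1530.0}  # GB/s

print(bottleneck_type({"LDS": 8.0, "L2": 4.0, "HBM": 1.0}, 1500.0, PEAK, BANDWIDTHS))
print(bottleneck_type({"LDS": 20.0, "L2": 50.0, "HBM": 200.0}, 15000.0, PEAK, BANDWIDTHS))
```

Returning the list of matched regions rather than a single label mirrors the examples above, in which a bandwidth bottleneck may be reported for more than one region at once.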

[0086] Figure 3 schematically shows a flowchart for determining a roofline model according to one embodiment of the present disclosure.

[0087] As shown in Figure 3, this embodiment includes operations S301 to S306, which are performed before determining the bottleneck type of the objective function based on the positional relationship of the performance indicator relative to the roofline model, and which illustrate how the roofline model is determined.

[0088] In operation S301, test data is acquired during the process of the graphics processor executing a test program of at least one matrix size.

[0089] Test data includes the total number of test operations, test runtime, and the amount of data read and written for multiple memory areas.

[0090] The matrix size represents the amount of data input to the function in the test program. It should be understood that the test program consists of a function and the data input to that function; the size of the input data corresponds to the size of the matrix. A larger matrix size indicates a larger amount of data input to the function, resulting in greater computational intensity. Increasing the matrix size according to the gradient is used to test the actual maximum computational speed, i.e., the maximum computing power.

[0091] In operation S302, when the matrix size is larger than the preset matrix size, the test computation speed of the graphics processor is calculated using the test data.

[0092] The preset matrix size is used to divide the data volume of the functions executed when determining the computing speed and memory bandwidth, thereby dividing the test programs for determining computing speed and memory bandwidth. This is because determining computing speed requires test programs with high computational intensity, while determining memory bandwidth requires test programs with low computational intensity; therefore, the preset matrix size is used for this division. A matrix size larger than the preset matrix size corresponds to high computational intensity, while a matrix size smaller than the preset matrix size corresponds to low computational intensity.

[0093] Test computation speed refers to the number of operations performed per unit time during the execution of a test program. In practical applications, the ratio of the total number of test operations to the test runtime obtained using a test program with a matrix size larger than a preset matrix size is used as the test computation speed corresponding to that matrix size.

[0094] In operation S303, the first roofline is generated during the execution of the test program by the graphics processor based on the maximum value of the test computing speed and the corresponding matrix size.

[0095] The maximum test computing speed refers to the maximum test computing speed obtained by the test program under various matrix sizes when the matrix size is larger than the preset matrix size. It is used to reflect the maximum computing power under this condition, that is, the maximum computing power under high computing intensity.

[0096] The first roofline represents the relationship between the maximum computing power and computational intensity of the graphics processor when the matrix size is larger than the preset matrix size.

[0097] In practical applications, the horizontal axis (x-axis) of the roofline model represents computational intensity, and the vertical axis (y-axis) represents maximum computational power. The test computational speed is calculated using actual running test data. Combining the maximum test computational speed with the location of the corresponding computational intensity point, the straight line parallel to the x-axis containing that point is taken as the first roofline of the roofline model, representing the maximum computational power under this condition, and thus serving as the maximum computational power of the graphics processor.
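Operations S302 and S303 can be sketched as a selection of the best measured speed among the large-matrix test runs. The tuple layout, the function name `first_roofline`, and all sample numbers below are hypothetical; the sketch only illustrates taking the ratio of test operations to test runtime for matrix sizes above the preset size and keeping the maximum as the height of the horizontal first roofline.

```python
def first_roofline(test_runs, preset_size):
    """Return (matrix_size, peak_speed) for the horizontal first roofline.

    test_runs is a list of (matrix_size, total_test_ops, runtime_seconds).
    Only runs with a matrix size larger than preset_size are considered.
    """
    speeds = {size: ops / runtime
              for size, ops, runtime in test_runs
              if size > preset_size}
    if not speeds:
        raise ValueError("no test run exceeds the preset matrix size")
    best_size = max(speeds, key=speeds.get)
    return best_size, speeds[best_size]


# Hypothetical test data: (matrix size, operations, runtime in seconds).
runs = [
    (512, 2 * 512**3, 0.001),
    (1024, 2 * 1024**3, 0.004),
    (2048, 2 * 2048**3, 0.02),
    (4096, 2 * 4096**3, 0.15),
]
size, peak = first_roofline(runs, preset_size=1000)
print(f"first roofline: y = {peak:.3e} ops/s (from matrix size {size})")
```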

[0098] In operation S304, when the matrix size is smaller than the preset matrix size, the test memory bandwidth for multiple memory areas is calculated using the test data.

[0099] Test memory bandwidth refers to the amount of data transferred in storage space per unit time. In practical applications, the ratio of the test read / write data volume to the test runtime obtained by a test program with a matrix size smaller than a preset matrix size for multiple memory areas is used as the test memory bandwidth for multiple memory areas corresponding to that matrix size.
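A minimal sketch of this ratio is shown below. The run layout and the name `measured_bandwidth` are illustrative assumptions. The disclosure describes a bandwidth per matrix size; keeping the largest value observed per region across the small-matrix runs is an additional assumption of this sketch.

```python
def measured_bandwidth(test_runs, preset_size):
    """Return the best observed bytes-per-second figure for each region.

    test_runs is an iterable of (matrix_size, runtime_seconds, traffic),
    where traffic maps a region name to the bytes read and written there.
    Only runs with a matrix size smaller than preset_size are used.
    """
    best = {}
    for size, runtime, traffic in test_runs:
        if size >= preset_size:
            continue  # large matrices are reserved for the compute test
        for region, nbytes in traffic.items():
            bandwidth = nbytes / runtime
            best[region] = max(best.get(region, 0.0), bandwidth)
    return best


# Hypothetical test data.
runs = [
    (64, 0.001, {"LDS": 8_000_000, "HBM": 1_000_000}),
    (128, 0.002, {"LDS": 20_000_000, "HBM": 3_000_000}),
    (4096, 0.1, {"LDS": 10**12}),   # ignored: above the preset size
]
print(measured_bandwidth(runs, preset_size=1000))
```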

[0100] In operation S305, the maximum computational intensity of multiple memory regions is calculated using the maximum value of the test computation speed and the test memory bandwidth.

[0101] It should be understood that the first roofline represents the computing power roofline, and the second roofline represents the bandwidth roofline. Both types of rooflines represent the relationship between the graphics processor's maximum computing power and computational intensity. The intersection of the first and second rooflines represents the maximum computational intensity corresponding to the memory area. Therefore, the computational intensity at this intersection is calculated using the maximum tested computing speed and the tested memory bandwidth. In practical applications, the ratio of the maximum tested computing speed to the tested memory bandwidth is used as the maximum computational intensity.
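The intersection can be written out explicitly. The symbols below are notation introduced here for illustration and do not appear in the disclosure: P_max is the maximum test computation speed, B_m is the test memory bandwidth of memory region m, I is the computational intensity, and I_max,m is the maximum computational intensity of region m.

```latex
\begin{aligned}
\text{first roofline:}\quad & y = P_{\max} \\
\text{second roofline of region } m\text{:}\quad & y = B_m \cdot I \\
\text{intersection:}\quad & B_m \cdot I_{\max,m} = P_{\max}
\;\Longrightarrow\; I_{\max,m} = \frac{P_{\max}}{B_m}
\end{aligned}
```

For instance, with the values given for Figure 6 (15338 GFLOPs/sec for the VALU line and 1530 GB/s for HBM), the intersection lies at roughly 15338 ÷ 1530 ≈ 10.02 FLOPs/Byte.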

[0102] In operation S306, the second roofline of each of the multiple memory regions during the execution of the test program by the graphics processor is generated based on the maximum computational intensity of the multiple memory regions, the maximum value of the test computation speed, and the test memory bandwidth.

[0103] The second roofline represents the relationship between the maximum computing power and computational intensity of the graphics processor when the matrix size is smaller than the preset matrix size.

[0104] In practical applications, the x-axis of the roofline model represents computational intensity, and the y-axis represents maximum computational power. After the test memory bandwidth is calculated using test data obtained from actual operation, the second roofline of the roofline model is fitted by using the test memory bandwidth as the slope and using the maximum computational intensity together with the maximum value of the test computation speed as a point on the line.
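One way to picture the second roofline is as a line of slope equal to the test memory bandwidth that ends where it meets the first roofline. The Python sketch below samples such a line and evaluates the resulting roof at a given intensity. The function names and the sampling scheme are assumptions for illustration; units are assumed to be GFLOPs/sec, GB/s, and FLOPs/Byte, with the numbers taken from Figure 6.

```python
def second_roofline(peak, bandwidth, num_points=5):
    """Sample (intensity, speed) points on the bandwidth roofline.

    The line has slope `bandwidth` and runs from intensity 0 up to the
    maximum computational intensity peak / bandwidth, where it reaches
    the first roofline.
    """
    if num_points < 2:
        raise ValueError("need at least two points to describe a line")
    ridge = peak / bandwidth
    xs = [ridge * i / (num_points - 1) for i in range(num_points)]
    return [(x, bandwidth * x) for x in xs]


def attainable_speed(intensity, peak, bandwidth):
    """Height of the combined roof (both rooflines) at a given intensity."""
    return min(peak, bandwidth * intensity)


for x, y in second_roofline(peak=15338.0, bandwidth=1530.0):
    print(f"{x:6.2f} FLOPs/Byte -> {y:9.1f} GFLOPs/sec")
```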

[0105] According to embodiments of this disclosure, a preset matrix size is used to differentiate the data volume of functions in the input test program, thereby differentiating the computational intensity of the test program and determining the true maximum computing power and memory bandwidth of the graphics processor. Based on this, different matrix sizes are used to determine the maximum test computation speed for generating the first roofline. Simultaneously, the maximum computational intensity of multiple memory regions, the maximum test computation speed, and the test memory bandwidth are determined to generate the second roofline. This utilizes data obtained from actual operation to obtain the roofline model, improving the accuracy of the roofline model and serving as the basis for bottleneck localization.

[0106] In the embodiments of this disclosure, the test program includes a first test program and a second test program. The first test program is used to obtain the total number of test operations, and the second test program is used to obtain the amount of test read and write data for multiple memory areas.

[0107] In some exemplary embodiments, firstly, preset instructions for the target computing unit and the addresses of the graphics processor's registers are obtained. Then, using the preset instructions and addresses, a first test program is generated to process data in the graphics processor's registers according to the preset instructions. Next, the storage capacity and type of multiple memory regions are obtained. Then, based on the storage capacity and type, the amount of data that supports reading and writing to multiple memory regions is obtained. Finally, using the data amount and the read / write instructions for each of the multiple memory regions, a second test program for reading and writing data in the multiple memory regions is generated.

[0108] The preset instructions refer to the instructions for the target computing unit being tested. For example, in a test environment where calculations use single-precision floating-point numbers as the basic data unit, if the target computing unit is a VALU, the preset instructions include the FMA instruction from the calculation instructions. In the same environment, if the target computing unit is a Tensor Core, the preset instructions include the matrix operation-related instructions from the calculation instructions.

[0109] It should be understood that when testing the computing power of a graphics processor, data needs to be stored only in registers without involving writing to or from memory. This is to eliminate the impact of memory access bandwidth and latency on the test, eliminate interference from irrelevant instructions, and make the measured computing power closer to the true value.

[0110] In some exemplary embodiments, preset instructions and addresses are converted into assembly format to generate a first test program that processes data in the registers of the graphics processor according to the preset instructions.

[0111] The amount of data that can be read and written to multiple memory areas is used to control the scope of data access when running the second test program. In practical applications, this amount of data is combined with the type of memory area to limit the scope of data access. The type can be any of shared memory, high-bandwidth memory, or L2 cache.

[0112] For example, when the memory region is of type shared memory (LDS), the amount of data must be less than the storage capacity of the LDS. The corresponding second test program focuses on the direct interaction path between the registers and the LDS to test the read and write bandwidth between them. When the memory region is of type high-bandwidth memory (HBM), the amount of data must be greater than the storage capacity of L2 so that the L2 memory region is not hit. When the memory region is of type L2 cache (L2), the amount of data must be less than the storage capacity of L2. For example, if the storage capacity of L2 is 2 megabytes (MB), then the amount of data is set to 1MB for testing.
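The sizing rules in this paragraph can be sketched as a small helper. The function name and the specific factors are assumptions: the disclosure only requires the data amount to be below the LDS capacity, below the L2 capacity, or above the L2 capacity, and gives 1 MB for a 2 MB L2 as its one concrete example. The 64 KB LDS capacity used in the usage lines is hypothetical.

```python
MB = 1024 * 1024


def working_set_bytes(region, lds_capacity, l2_capacity):
    """Pick a data amount that keeps accesses inside the target region."""
    if region == "LDS":
        # Must be smaller than the LDS capacity (half is an assumed margin).
        return lds_capacity // 2
    if region == "L2":
        # Must be smaller than the L2 capacity; half matches the
        # 2 MB -> 1 MB example.
        return l2_capacity // 2
    if region == "HBM":
        # Must exceed the L2 capacity so that L2 is not hit
        # (the factor of 4 is an assumed margin).
        return l2_capacity * 4
    raise ValueError(f"unknown memory region: {region}")


for region in ("LDS", "L2", "HBM"):
    size = working_set_bytes(region, lds_capacity=64 * 1024, l2_capacity=2 * MB)
    print(f"{region}: {size} bytes")
```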

[0113] According to embodiments of this disclosure, the computing power test program is generated using preset instructions and the addresses of the graphics processor's registers. This ensures that data is transferred only within the registers, eliminating memory access latency and bandwidth interference, thereby accurately obtaining the maximum computing power of the computing unit. The bandwidth test program determines the amount of memory access data for different memory regions based on different storage capacities and types, ensuring that memory access data only passes through the target memory region. This eliminates interference between different memory regions and improves the accuracy of determining the memory bandwidth of each memory region.

[0114] In practical applications, when the memory area is HBM, in order to achieve the peak bandwidth during the second test program, it is necessary to read data based on the streaming access mode. In this way, if the cache misses, the data will be directly transmitted to the register through the cache link. At this time, the cache has a very small impact on the measurement result of HBM bandwidth and can be ignored, so as to measure the memory bandwidth corresponding to the real HBM.

[0115] When the memory region is LDS or L2, data access is strictly aligned with the hardware cache bit width, such as a single read of 128 bits (binary digit), comprehensively avoiding bandwidth test errors caused by cache overflow, bit width mismatch, and interference between different memory regions. Simultaneously, when the memory region is L2, assembly instruction flags, such as cache control bits for memory access instructions, must be configured for read / write instructions of this memory region to forcibly bypass the L1 cache's caching function and directly test the read / write bandwidth between the L2 cache and registers via the L1 bypass path.

[0116] Figure 4 schematically shows a flowchart of a performance analysis method according to an embodiment of the present disclosure.

[0117] As shown in Figure 4, the performance analysis method of this embodiment includes operations S410 to S430.

[0118] In operation S410, a test program for the graphics processor is generated, the test computation speed and test memory bandwidth are calculated using the test data obtained by executing the test program, and a roofline model is generated.

[0119] In operation S420, the number of operations, the running data, and the runtime of the objective function are obtained, and the computation speed and computational intensity are calculated.

[0120] In operation S430, the combination of computation speed and computational intensity is used as a performance indicator for comparison with the roofline model, and the performance of the objective function is analyzed.

[0121] Figure 5 schematically shows a flowchart of a performance analysis method according to an embodiment of the present disclosure.

[0122] As shown in Figure 5, the performance analysis method of this embodiment includes operations S510 to S560.

[0123] In operation S510, a test program for the graphics processor is generated.

[0124] In operation S520, the test computation speed and test memory bandwidth are calculated using the test data obtained by executing the test program, and a roofline model is generated.

[0125] In operation S530, the number of operations, the running data during the execution of the objective function, and the runtime are obtained.

[0126] In practical applications, when Heterogeneous System Architecture (HSA) queues are used for asynchronous communication between the CPU and GPU, running data can be acquired by having the HSA queue control unit insert configuration information before and after the Architected Queuing Language (AQL) packet corresponding to the objective function. This configuration information sets up the Performance Monitoring Counter (PMC) to monitor and collect running data. During the execution of the objective function, the PMC acquires running data in real time and sends it back after execution, ensuring data integrity and timestamp synchronization.

[0127] In operation S540, the computation speed and computational intensity are combined as a performance indicator, and the position of the performance indicator relative to the roofline model is determined.

[0128] In operation S550, the performance of the objective function is analyzed and optimization suggestions are obtained.

[0129] In operation S560, the objective function is optimized based on the optimization suggestions, and the optimized objective function is re-executed.

[0130] In practical applications, if the bottleneck type of the objective function is determined to be a computational bottleneck, optimization suggestions can be obtained in combination with the computing power utilization rate: "improve the parallelism of the objective function, replace the split operation instructions with dedicated General Matrix Multiply (GEMM) instructions, and reduce the precision of a single calculation, for example by converting single precision (FP32) to half precision (FP16) or integer (INT8)".

[0131] If it is determined that there is a bandwidth bottleneck in L2 or HBM when the objective function is running, the optimization suggestion is to "optimize the data prefetching strategy and reconstruct memory access to a contiguous address mode".

[0132] If it is determined that the objective function has a bandwidth bottleneck in HBM at runtime, the optimization suggestion is to "simplify the data transport links and increase local data storage".

[0133] If it is determined that the objective function has a bandwidth bottleneck in LDS at runtime, the optimization suggestion is to "improve LDS data reuse rate and optimize cache block size".
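The mapping from analysis result to suggestions can be sketched as a lookup table. The suggestion texts are taken from the paragraphs above; the table layout, the `suggest` function, and the decision to merge suggestions without duplicates when several regions are bandwidth-bound are assumptions of this sketch.

```python
PREFETCH = "optimize the data prefetching strategy"
CONTIGUOUS = "reconstruct memory access to a contiguous address mode"

SUGGESTIONS = {
    "compute": [
        "improve the parallelism of the objective function",
        "replace the split operation instructions with dedicated GEMM instructions",
        "reduce the precision of a single calculation (e.g. FP32 to FP16 or INT8)",
    ],
    "L2": [PREFETCH, CONTIGUOUS],
    "HBM": [PREFETCH, CONTIGUOUS,
            "simplify the data transport links",
            "increase local data storage"],
    "LDS": ["improve LDS data reuse rate", "optimize cache block size"],
}


def suggest(bottleneck, regions=()):
    """Return suggestions for a bottleneck type and its affected regions."""
    if bottleneck == "compute":
        return list(SUGGESTIONS["compute"])
    merged = []
    for region in regions:
        for text in SUGGESTIONS.get(region, []):
            if text not in merged:      # keep order, drop duplicates
                merged.append(text)
    return merged


print(suggest("bandwidth", ["L2", "HBM"]))
```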

[0134] After implementing the optimization recommendations, the established roofline model is reused, the optimized objective function is rerun, and operational data is collected. By comparing the offset of the performance indicator location and the improvement in computing power or bandwidth utilization before and after optimization, the optimization effect is quantitatively verified, forming a complete technical closed loop of "multi-level analysis - targeted optimization - quantitative verification".
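The quantitative verification step can be sketched as a comparison of two target positions against the same roofline. The disclosure does not define the utilization formula; the sketch below assumes utilization is the measured speed divided by the height of the roof at the measured intensity, and all names and sample points are hypothetical. Units are assumed to be GFLOPs/sec, GB/s, and FLOPs/Byte.

```python
def utilization(intensity, speed, peak, bandwidth):
    """Measured speed as a fraction of the roof at this intensity (assumed)."""
    roof = min(peak, bandwidth * intensity)
    return speed / roof


def compare_runs(before, after, peak, bandwidth):
    """Report how far the target position moved and the utilization change.

    before and after are (intensity, speed) pairs for the same memory region.
    """
    (x0, y0), (x1, y1) = before, after
    return {
        "intensity_shift": x1 - x0,
        "speed_shift": y1 - y0,
        "utilization_gain": (utilization(x1, y1, peak, bandwidth)
                             - utilization(x0, y0, peak, bandwidth)),
    }


# Hypothetical HBM positions before and after an optimization.
report = compare_runs(before=(1.0, 765.0), after=(2.0, 2295.0),
                      peak=15338.0, bandwidth=1530.0)
print(report)
```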

[0135] Figure 6 shows a schematic diagram of a roofline model according to an embodiment of the present disclosure.

[0136] As shown in Figure 6, the x-axis of the roofline model represents the computational intensity, in floating-point operations per byte (FLOPs/Byte); the y-axis represents the computation speed, in billions of floating-point operations per second (GFLOPs/sec).

[0137] In this diagram, horizontal lines 601 and 602 represent the first roofline corresponding to the tensor core computational unit and the first roofline corresponding to the vector arithmetic logic unit computational unit, respectively. Diagonal lines 603, 604, and 605 represent the second roofline corresponding to the LDS memory region, the L2 memory region, and the HBM memory region, respectively.

[0138] The horizontal line 601 corresponds to a computation speed of 61395 GFLOPs/sec, meaning that the tensor core computation unit completes an average of 6.1395 × 10¹³ floating-point operations per second. The horizontal line 602 corresponds to a computation speed of 15338 GFLOPs/sec, meaning that the vector arithmetic logic unit (VALU) completes an average of 1.5338 × 10¹³ floating-point operations per second.

[0139] The slope of diagonal line 603 is 14521 gigabytes per second (GB/s), indicating that the memory bandwidth of the LDS memory area is 14521 GB/s, meaning that LDS can transfer 14521 GB of data per second. The slope of diagonal line 604 is 5415 GB/s, indicating that the memory bandwidth of the L2 memory area is 5415 GB/s, meaning that L2 can transfer 5415 GB of data per second. The slope of diagonal line 605 is 1530 GB/s, indicating that the memory bandwidth of the HBM memory area is 1530 GB/s, meaning that HBM can transfer 1530 GB of data per second.
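Using the peak speeds and bandwidths stated for Figure 6, the intersection of each horizontal line with each diagonal line can be computed as the ratio of peak speed to bandwidth. The dictionary keys and function name below are illustrative assumptions; the numeric inputs are the figure's values, and the intersections themselves are derived here rather than stated in the disclosure.

```python
PEAK_GFLOPS = {"tensor_core": 61395, "valu": 15338}        # lines 601, 602
BANDWIDTH_GBS = {"LDS": 14521, "L2": 5415, "HBM": 1530}    # lines 603, 604, 605


def ridge_points(peaks, bandwidths):
    """Intensity (FLOPs/Byte) at which each diagonal meets each horizontal."""
    return {(unit, region): peak / bandwidth
            for unit, peak in peaks.items()
            for region, bandwidth in bandwidths.items()}


for (unit, region), intensity in ridge_points(PEAK_GFLOPS, BANDWIDTH_GBS).items():
    print(f"{unit:12s} {region:4s} {intensity:7.2f} FLOPs/Byte")
```

A target position whose intensity for a region lies to the left of the corresponding intersection is compared against that region's diagonal line; one to the right is compared against the horizontal line.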

[0140] The dot 606 represents the position of the performance indicator corresponding to a certain objective function. If this position does not match the first roofline or the second roofline, it means that the objective function has neither a computational bottleneck nor a bandwidth bottleneck.

[0141] Figure 7 shows a schematic block diagram of an apparatus according to an embodiment of the present disclosure.

[0142] As shown in Figure 7, the apparatus 700 of this embodiment includes...

[0143] The acquisition module 710 is used to acquire the number of operations of the instructions supported by the graphics processor, as well as the running data and runtime during the execution of the target function. The running data includes the type and number of instructions executed, and the amount of read and write data in each of the multiple independent memory areas in the graphics processor. In one embodiment, the acquisition module 710 can be used to execute the operation S210 described above, which will not be repeated here.

[0144] The determining module 720 is used to determine the graphics processor's computation speed and its computational intensity for multiple memory regions based on the runtime, the running data, and the total number of operations performed in executing the target function, where the total number of operations is obtained from the number of operations and the types of instructions executed. The computational intensity represents the number of operations performed per unit amount of data read and written in the corresponding memory region. In one embodiment, the determining module 720 can be used to perform the operation S220 described above, which will not be repeated here.

[0145] The combination module 730 is used to combine the computing speed and multiple computing intensities separately as a performance indicator of the amount of data read and written to multiple memory areas during the execution of the target function by the graphics processor. In one embodiment, the combination module 730 can be used to perform the operation S230 described above, which will not be repeated here.

[0146] Analysis module 740 is used to analyze the performance of the objective function based on its bottleneck type, whereby the bottleneck type is determined to be either a computational bottleneck or a bandwidth bottleneck based on the positional relationship of the performance indicator relative to the roofline model. The roofline model represents the relationship between the maximum computing power and computational intensity of the graphics processor. In one embodiment, analysis module 740 may be used to perform the operation S240 described above, which will not be repeated here.

[0147] Figure 8 shows a schematic block diagram of an electronic device that can be used to implement the methods of embodiments of the present disclosure.

[0148] As shown in Figure 8, an electronic device 800 according to an embodiment of this disclosure includes a processor 801, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage portion 808 into a random access memory (RAM) 803. The processor 801 may include, for example, a general-purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special-purpose microprocessor (e.g., an application-specific integrated circuit (ASIC)), etc. The processor 801 may also include onboard memory for caching purposes. The processor 801 may include a single processing unit or multiple processing units for performing different actions of the method flow according to an embodiment of this disclosure.

[0149] RAM 803 stores various programs and data required for the operation of electronic device 800. Processor 801, ROM 802, and RAM 803 are interconnected via bus 804. Processor 801 performs various operations of the method flow according to embodiments of this disclosure by executing programs in ROM 802 and / or RAM 803. It should be noted that programs may also be stored in one or more memories other than ROM 802 and RAM 803. Processor 801 may also implement the methods provided in embodiments of this disclosure by executing programs stored in one or more memories.

[0150] According to embodiments of this disclosure, the electronic device 800 may further include an input / output (I / O) interface 805, which is also connected to a bus 804. The electronic device 800 may also include one or more of the following components connected to the I / O interface 805: an input section 806 including a keyboard, mouse, etc.; an output section 807 including a cathode ray tube (CRT), liquid crystal display (LCD), etc., and a speaker, etc.; a storage section 808 including a hard disk, etc.; and a communication section 809 including a network interface card such as a LAN card, modem, etc. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I / O interface 805 as needed. A removable medium 811, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., is installed on the drive 810 as needed so that computer programs read from it can be installed into the storage section 808 as needed.

[0151] This disclosure also provides a computer-readable storage medium, which may be included in the device / apparatus / system described in the above embodiments; or it may exist independently and not assembled into the device / apparatus / system. The computer-readable storage medium carries one or more programs that, when executed, implement the method according to the embodiments of this disclosure.

[0152] According to embodiments of this disclosure, the computer-readable storage medium can be a non-volatile computer-readable storage medium, such as including, but not limited to: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof. In this disclosure, the computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. For example, according to embodiments of this disclosure, the computer-readable storage medium may include ROM 802 and / or RAM 803 and / or one or more memories other than ROM 802 and RAM 803 described above.

[0153] Embodiments of this disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowchart. When the computer program product is run on a computer system, the program code is used to cause the computer system to implement the methods provided in the embodiments of this disclosure.

[0154] When the computer program is executed by the processor 801, it performs the functions defined in the system/apparatus of the embodiments of this disclosure. According to embodiments of this disclosure, the systems, apparatuses, modules, units, and the like described above can be implemented by computer program modules.

[0155] In one embodiment, the computer program may be carried on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed as a signal over a network medium, downloaded and installed via the communication section 809, and/or installed from the removable medium 811. The program code contained in the computer program can be transmitted over any suitable network medium, including but not limited to wireless media, wired media, or any suitable combination thereof.

[0156] In such an embodiment, the computer program can be downloaded and installed from a network via the communication section 809, and/or installed from the removable medium 811. When the computer program is executed by the processor 801, it performs the functions defined in the system of the embodiments of this disclosure. According to embodiments of this disclosure, the systems, devices, apparatuses, modules, units, and the like described above can be implemented by computer program modules.

[0157] It should be noted that, in the technical solution of this disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and application of users' personal information comply with relevant laws and regulations, the necessary confidentiality measures are taken, and public order and good morals are not violated. In the technical solution of this disclosure, the user's authorization or consent is obtained before the user's personal information is acquired or collected.

[0158] According to embodiments of this disclosure, program code for executing the computer programs provided in embodiments of this disclosure can be written in any combination of one or more programming languages. Specifically, these computer programs can be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, C, and similar programming languages. The program code can execute entirely on a user's computing device, partially on a user's device, partially on a remote computing device, or entirely on a remote computing device or server. In cases involving a remote computing device, the remote computing device can be connected to the user's computing device via any type of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (e.g., via the Internet using an Internet service provider).

[0159] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0160] Those skilled in the art will understand that the features described in the various embodiments and/or claims of this disclosure can be combined and/or integrated in various ways, even if such combinations or integrations are not explicitly described in this disclosure. In particular, the features described in the various embodiments and/or claims of this disclosure can be combined and/or integrated in various ways without departing from the spirit and teachings of this disclosure. All such combinations and/or integrations fall within the scope of this disclosure.

[0161] The embodiments of this disclosure have been described above. However, these embodiments are for illustrative purposes only and are not intended to limit the scope of this disclosure. Although various embodiments have been described above, this does not mean that the measures in the various embodiments cannot be used advantageously in combination. The scope of this disclosure is defined by the appended claims and their equivalents. Various substitutions and modifications can be made by those skilled in the art without departing from the scope of this disclosure, and all such substitutions and modifications should fall within the scope of this disclosure.

Claims

1. A performance analysis method, characterized in that the method comprises: obtaining the number of operations of each instruction supported by a graphics processor, together with the running data and the runtime of the graphics processor during execution of a target function, wherein the running data includes the types and execution counts of the executed instructions and the amount of data read and written in each of multiple independent memory regions of the graphics processor; determining, based on the runtime, the running data, and the number of operations of the instructions executed during execution of the target function (obtained from the number of operations and the types of the executed instructions), the computing speed of the graphics processor and the computational intensity for each of the multiple memory regions, wherein the computational intensity represents the number of operations performed per unit amount of data read and written in the corresponding memory region; combining the computing speed with each of the multiple computational intensities to obtain performance indicators of the graphics processor for reading and writing data in the multiple memory regions during execution of the target function; and, when the bottleneck type of the target function is determined, based on the positional relationship of the performance indicators relative to a roofline model, to be either a computational bottleneck or a bandwidth bottleneck, analyzing the performance of the target function based on that bottleneck type, wherein the roofline model represents the relationship between the maximum computing power and the computational intensity of the graphics processor.

2. The method according to claim 1, characterized in that determining the computing speed of the graphics processor and the computational intensity for each of the multiple memory regions, based on the runtime, the running data, and the number of operations of the instructions executed during execution of the target function (obtained from the number of operations and the types of the executed instructions), comprises: determining, based on the number of operations of each executed instruction and its number of executions, the total number of operations of the multiple executed instructions during execution of the target function; and determining the computing speed of the graphics processor using the total number of operations and the runtime.

3. The method according to claim 2, characterized in that the method further comprises: calculating the computational intensity of the graphics processor for each of the multiple memory regions using the total number of operations and the amount of data read and written in each of the multiple independent memory regions.
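
The calculation in claims 2 and 3 can be sketched as follows. This is a minimal illustration and not the applicant's implementation: the function names, the instruction types, the region names, and the choice of bytes and seconds as units are all assumptions made for the example.

```python
def total_operations(ops_per_instruction, execution_counts):
    """Sum (operations per instruction) x (times executed) over instruction types."""
    return sum(ops_per_instruction[kind] * count
               for kind, count in execution_counts.items())


def computing_speed(total_ops, runtime_s):
    """Operations per second over the measured runtime."""
    if runtime_s <= 0:
        raise ValueError("runtime must be positive")
    return total_ops / runtime_s


def computational_intensities(total_ops, bytes_by_region):
    """Operations per byte moved, computed separately for each memory region."""
    return {region: total_ops / nbytes
            for region, nbytes in bytes_by_region.items() if nbytes > 0}


# Hypothetical measurements for one run of a target function.
ops_per_instruction = {"fma": 2, "add": 1}        # operations per instruction type
execution_counts = {"fma": 1000, "add": 500}      # times each type was executed
bytes_by_region = {"shared": 250, "l2": 500, "hbm": 1000}  # bytes read + written

total = total_operations(ops_per_instruction, execution_counts)
speed = computing_speed(total, runtime_s=0.5)
# One (intensity, speed) point per memory region, as in claim 4.
points = {region: (intensity, speed)
          for region, intensity in computational_intensities(total, bytes_by_region).items()}
```

Each entry of `points` pairs the single measured computing speed with the intensity of one memory region, which is the kind of per-region position that claim 4 places relative to the roofline model.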

4. The method according to any one of claims 1-3, characterized in that combining the computing speed with each of the multiple computational intensities to obtain the performance indicators of the graphics processor for reading and writing data in the multiple memory regions during execution of the target function comprises: using each combination of the computing speed and one of the multiple computational intensities as the target position of the corresponding memory region relative to the roofline model, wherein the target positions indicate the performance of reading and writing data in the multiple memory regions during execution of the target function by the graphics processor.

5. The method according to claim 4, characterized in that the roofline model includes a first roofline and a second roofline, wherein the computational intensity represented by the first roofline is greater than that represented by the second roofline, and determining the bottleneck type of the target function based on the positional relationship of the performance indicators relative to the roofline model comprises: if the target positions of all of the multiple memory regions match the first roofline, determining that the bottleneck type of the target function is a computational bottleneck; and if the target position of at least one of the memory regions matches the second roofline, determining that the bottleneck type of the target function is a bandwidth bottleneck.

6. The method according to claim 1, characterized in that the roofline model includes a first roofline and a second roofline, wherein the computational intensity represented by the first roofline is greater than that represented by the second roofline, and the method further comprises, before determining the bottleneck type of the target function based on the positional relationship of the performance indicators relative to the roofline model: obtaining test data produced while the graphics processor executes a test program with at least one matrix size, wherein the test data includes the total number of test operations, the test runtime, and the amount of test data read and written for the multiple memory regions, and the matrix size represents the amount of data input to the functions in the test program; when the matrix size is larger than a preset matrix size, calculating the test computing speed of the graphics processor using the test data; and generating, based on the maximum value of the test computing speed and the corresponding matrix size, the first roofline for the graphics processor executing the test program, wherein the first roofline represents the relationship between the maximum computing power and the computational intensity of the graphics processor when the matrix size is larger than the preset matrix size.

7. The method according to claim 6, characterized in that the method further comprises: when the matrix size is smaller than the preset matrix size, calculating the test memory bandwidth for each of the multiple memory regions using the test data; calculating the maximum computational intensity of each of the multiple memory regions using the maximum value of the test computing speed and the test memory bandwidth; and generating, based on the maximum computational intensity of the multiple memory regions, the maximum value of the test computing speed, and the test memory bandwidth, a second roofline for each of the multiple memory regions for the graphics processor executing the test program, wherein the second roofline represents the relationship between the maximum computing power and the computational intensity of the graphics processor when the matrix size is smaller than the preset matrix size.
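
The roofline construction in claims 6 and 7 can be sketched as follows. The sketch makes several assumptions that the claims leave open: test runs are held in a hypothetical `TestRun` record, data volumes are in bytes and runtimes in seconds, the bandwidth of a region is taken as the largest value seen across the small-matrix runs, and the "maximum computational intensity" of a region is read as the ridge point `peak / bandwidth`.

```python
from dataclasses import dataclass


@dataclass
class TestRun:
    matrix_size: int        # amount of input data for the test program
    total_ops: float        # total number of test operations
    runtime_s: float        # test runtime in seconds
    bytes_by_region: dict   # region -> bytes read + written during the test


def build_rooflines(runs, preset_size):
    """Return (peak speed, per-region bandwidth, per-region ridge intensity)."""
    large = [r for r in runs if r.matrix_size > preset_size]
    small = [r for r in runs if r.matrix_size < preset_size]
    if not large or not small:
        raise ValueError("need test runs on both sides of the preset matrix size")

    # First roofline: horizontal line at the best speed seen on large matrices.
    peak = max(r.total_ops / r.runtime_s for r in large)

    # Second rooflines: one sloped line per region, slope = measured bandwidth.
    bandwidth = {}
    for r in small:
        for region, nbytes in r.bytes_by_region.items():
            bw = nbytes / r.runtime_s
            bandwidth[region] = max(bandwidth.get(region, 0.0), bw)

    # Intensity at which each sloped line meets the horizontal line.
    ridge = {region: peak / bw for region, bw in bandwidth.items() if bw > 0}
    return peak, bandwidth, ridge


def attainable_speed(intensity, peak, bandwidth):
    """Roofline bound on speed for a given intensity and region bandwidth."""
    return min(peak, bandwidth * intensity)
```

For a region with ridge intensity `ridge[region]`, `attainable_speed` follows the sloped second roofline below that intensity and the flat first roofline above it.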

8. The method according to claim 6, characterized in that the test program includes a first test program and a second test program, wherein the first test program is used to obtain the total number of test operations and the second test program is used to obtain the amount of test data read and written for the multiple memory regions, and the method further comprises: obtaining preset instructions for a target computing unit and the addresses of registers of the graphics processor; generating, using the preset instructions and the addresses, the first test program, which processes data in the registers of the graphics processor according to the preset instructions; obtaining the storage capacity and the type of each of the multiple memory regions, wherein the type includes any one of shared memory, high-bandwidth memory, and level-2 (L2) cache; obtaining, based on the storage capacity and the type, the amount of data that each of the multiple memory regions supports for reading and writing; and generating, using that amount of data and the read/write instructions for each of the multiple memory regions, the second test program, which reads and writes data in the multiple memory regions.

9. An electronic device, comprising: one or more processors; and a storage device for storing one or more programs, wherein, when the one or more programs are executed by the one or more processors, the one or more processors perform the method according to any one of claims 1 to 8.

10. A computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the method according to any one of claims 1 to 8.