A program analysis method and device, electronic equipment, storage medium and computer program product

By allocating a sampling buffer to the target program and updating the instruction counter using atomic operations, the problems of data overwriting and resource consumption caused by the circular buffer mechanism are solved, achieving efficient performance analysis and accurately locating performance bottlenecks.

CN121934893B · Active · Publication Date: 2026-06-26 · MOORE THREADS TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
MOORE THREADS TECH CO LTD
Filing Date
2026-03-31
Publication Date
2026-06-26


Abstract

The present disclosure relates to a program analysis method and device, electronic equipment, storage medium and computer program product, the method comprising: allocating a sampling buffer for a target program before the target program runs, the sampling buffer comprising a sampling buffer partition allocated for each instruction in the target program; when reaching any sampling moment during the running of the target program, performing instruction counter sampling, determining the sampling instruction, and updating the sampling counter in the target sampling buffer partition corresponding to the sampling instruction through atomic operation; after the running of the target program ends, determining the total sampling number of each instruction in the target program according to the sampling buffer; and performing performance analysis on the target program according to the total sampling number of each instruction in the target program to obtain a performance analysis result. According to the embodiments of the present disclosure, the performance analysis on the target program can be effectively implemented without affecting the running of the target program under the condition of ensuring sampling accuracy.

Description

Technical Field

[0001] This disclosure relates to the field of computer technology, and in particular to a program analysis method and apparatus, electronic equipment, storage medium and computer program product.

Background Technology

[0002] In existing technologies, the instruction counter is periodically sampled on the hardware running the program and written to a circular buffer, which is then periodically moved to system memory by the driver. However, in high-frequency sampling scenarios, the circular buffer mechanism can lead to data overwriting risks, bandwidth consumption, and resource depletion, thus interfering with the performance of the target program.

Summary of the Invention

[0003] In view of this, the present disclosure provides a program analysis method and apparatus, electronic equipment, storage medium and computer program product.

[0004] According to one aspect of this disclosure, a program analysis method is provided, comprising: allocating a sampling buffer for the target program before its execution, wherein the sampling buffer includes a sampling buffer partition allocated for each instruction in the target program; when any sampling moment is reached during the execution of the target program, sampling is performed using an instruction counter to determine the current sampling instruction, and the sampling counter in the target sampling buffer partition corresponding to the current sampling instruction is updated through atomic operations; after the execution of the target program is completed, the total number of samplings for each instruction in the target program is determined based on the sampling buffer; and performance analysis is performed on the target program based on the total number of samplings for each instruction in the target program to obtain performance analysis results.

[0005] In one possible implementation, allocating a sampling buffer for the target program before it runs includes: determining a sampling buffer coefficient based on a preset sampling counter size, the minimum instruction size of the target program, and a preset total number of instruction states; and determining the sampling buffer size of the sampling buffer based on the instruction buffer size of the target program and the sampling buffer coefficient.

[0006] In one possible implementation, the method further includes configuring the sampling counter in each sampling buffer partition to 0 before the target program runs.

[0007] In one possible implementation, the step of sampling the instruction counter at any sampling moment during the execution of the target program to determine the current sampling instruction includes: for any hardware processing unit executing the target program, selecting the sampling thread bundle for the current sampling from the valid thread bundles currently scheduled to the hardware processing unit by the target program; sampling the instruction counter of the sampling thread bundle for the current sampling to determine the instruction counter value and instruction status of the current sampling instruction.

[0008] In one possible implementation, updating the sampling counter in the target sampling buffer partition corresponding to the current sampling instruction via atomic operations includes: determining the address of the target sampling buffer partition corresponding to the current sampling instruction based on the instruction counter value and instruction status of the current sampling instruction, the base address of the sampling buffer, the base address of the instruction buffer of the target program, the sampling buffer coefficient, and the preset sampling counter size; determining the target sampling buffer partition corresponding to the current sampling instruction based on the address of the target sampling buffer partition corresponding to the current sampling instruction; and updating the sampling counter in the target sampling buffer partition corresponding to the current sampling instruction via atomic operations.

[0009] In one possible implementation, updating the sampling counter in the target sampling buffer partition corresponding to the current sampling instruction through atomic operations includes: performing an increment operation of 1 on the sampling counter in the target sampling buffer partition corresponding to the current sampling instruction through an atomic increment operation.

[0010] In one possible implementation, the step of selecting the sampling thread bundle for this sampling from the valid thread bundles currently scheduled to the hardware processing unit executing the target program includes: determining the number of valid thread bundles currently scheduled to the hardware processing unit; generating N pseudo-random numbers based on a preset random number generation algorithm, where N is a positive integer; and taking each of the N pseudo-random numbers modulo the number of valid thread bundles to determine the N sampling thread bundles for this sampling.

[0011] In one possible implementation, the method further includes: determining the sampling period corresponding to the hardware processing unit based on a preset number of sampling clock cycles and the clock frequency corresponding to the hardware processing unit; and determining the sampling time of the hardware processing unit based on the sampling period corresponding to the hardware processing unit.

[0012] In one possible implementation, the method further includes: after the target program finishes running, writing the sampled data in the sampling buffer into the target storage space.

[0013] In one possible implementation, the step of performing performance analysis on the target program based on the total number of samples for each instruction in the target program to obtain performance analysis results includes: determining the high-level language code segment corresponding to each instruction in the target program; determining the total number of samples for each high-level language code segment corresponding to the target program based on the total number of samples for each instruction in the target program; and performing performance analysis on the target program based on the total number of samples for each high-level language code segment corresponding to the target program to obtain performance analysis results.
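The aggregation described in this implementation can be sketched as follows. This is a minimal illustration only: the function name `aggregate_by_source`, the instruction-to-source mapping, the `kernel.src` labels and all sample counts are hypothetical values chosen for the example and do not come from this disclosure.

```python
from collections import Counter
from typing import Dict


def aggregate_by_source(instr_samples: Dict[int, int],
                        instr_to_source: Dict[int, str]) -> Counter:
    """Sum per-instruction sample totals into per-source-segment totals."""
    totals: Counter = Counter()
    for pc, count in instr_samples.items():
        # Instructions without a known mapping are grouped under one label.
        totals[instr_to_source.get(pc, "<unknown>")] += count
    return totals


# Hypothetical per-instruction totals read from the sampling buffer.
instr_samples = {0x1000: 120, 0x1004: 30, 0x1008: 850}
# Hypothetical mapping from instruction address to high-level code segment.
instr_to_source = {
    0x1000: "kernel.src:10",
    0x1004: "kernel.src:10",
    0x1008: "kernel.src:12",
}

source_totals = aggregate_by_source(instr_samples, instr_to_source)
hottest_segment, hottest_count = source_totals.most_common(1)[0]
print(hottest_segment, hottest_count)
```

Under these made-up inputs, the segment with the largest total is the candidate performance bottleneck.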

[0014] In one possible implementation, the method further includes: constructing a visual performance view based on the total number of samples for each instruction in the target program, or based on the total number of samples for each high-level language code segment corresponding to the target program.

[0015] According to another aspect of this disclosure, a program analysis apparatus is provided, comprising: a sampling buffer allocation module, configured to allocate a sampling buffer for the target program before the target program runs, wherein the sampling buffer includes a sampling buffer partition allocated for each instruction in the target program; a sampling module, configured to sample an instruction counter when any sampling moment is reached during the execution of the target program, determine the current sampling instruction, and update the sampling counter in the target sampling buffer partition corresponding to the current sampling instruction through atomic operations; a statistics module, configured to determine the total number of samplings for each instruction in the target program based on the sampling buffer after the target program runs; and an analysis module, configured to perform performance analysis on the target program based on the total number of samplings for each instruction in the target program, and obtain performance analysis results.

[0016] According to another aspect of this disclosure, an electronic device is provided, including a memory, a processor, and a computer program stored in the memory, wherein the processor executes the computer program to implement the steps of the above-described method.

[0017] According to another aspect of this disclosure, a non-volatile computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the steps of the above-described method.

[0018] According to another aspect of this disclosure, a computer program product is provided, including a computer program or a non-volatile computer-readable storage medium carrying the computer program, wherein the computer program, when executed by a processor, implements the steps of the above-described method.

[0019] The program analysis method of this disclosure allocates a sampling buffer for the target program before it runs. The sampling buffer includes a sampling buffer partition allocated for each instruction in the target program. Since the instruction counter is used to store the address of the currently executing instruction, when any sampling moment is reached during the execution of the target program, after the instruction to be sampled is determined by the instruction counter, the sampling counter in the target sampling buffer partition corresponding to the current sampling instruction can be updated through atomic operations. Thus, the sampling buffer can be used to determine the real-time sampling count of each instruction in the target program, and after the target program finishes running, the total sampling count of each instruction in the target program can be directly determined. This ensures that the running of the target program is not affected while ensuring sampling accuracy. Furthermore, based on the total sampling count of each instruction in the target program, the execution time of each instruction in the target program can be analyzed, thereby locating the performance bottleneck in the target program, effectively realizing the performance analysis of the target program, and obtaining accurate performance analysis results.

[0020] Other features and aspects of this disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.

Description of the Drawings

[0021] The accompanying drawings, which are included in and form part of this specification, illustrate exemplary embodiments, features, and aspects of this disclosure together with the specification and serve to explain the principles of this disclosure.

[0022] Figure 1 shows a flowchart of a program analysis method according to an embodiment of the present disclosure.

[0023] Figure 2 shows a flowchart of a program analysis method according to an embodiment of the present disclosure.

[0024] Figure 3 shows a schematic diagram of a sampling architecture according to an embodiment of the present disclosure.

[0025] Figure 4 shows a block diagram of a program analysis apparatus according to an embodiment of the present disclosure.

[0026] Figure 5 shows a block diagram of an electronic device according to an embodiment of the present disclosure.

Detailed Description

[0027] Various exemplary embodiments, features, and aspects of this disclosure will now be described in detail with reference to the accompanying drawings. The same reference numerals in the drawings denote elements that have the same or similar functions. Although various aspects of the embodiments are shown in the drawings, they are not necessarily drawn to scale unless specifically indicated otherwise.

[0028] As used herein, the terms “comprising,” “including,” “having,” or variations thereof are open-ended and include one or more of the stated features, integers, elements, steps, components, or functions, but do not exclude the presence or addition of one or more other features, integers, elements, steps, components, functions, or groups thereof.

[0029] When an element is referred to as “connected,” “coupled,” “responding,” or a variation thereof relative to another element, it may be directly connected, coupled, or responding to another element, or there may be an intermediate element present.

[0030] Although the terms first, second, third, etc., may be used herein to describe various elements / operations, these elements / operations should not be limited by these terms. These terms are only used to distinguish one element / operation from another. Therefore, without departing from the teachings of the inventive concept, a first element / operation in some embodiments may be referred to as a second element / operation in other embodiments.

[0031] The term “exemplary” as used herein means “serving as an example, embodiment, or illustration.” Any embodiment illustrated herein as “exemplary” is not necessarily to be construed as superior to or better than other embodiments.

[0032] Furthermore, to better illustrate this disclosure, numerous specific details are set forth in the following detailed description. Those skilled in the art will understand that this disclosure can be practiced without certain specific details. In some instances, methods, means, components, and circuits well known to those skilled in the art have not been described in detail in order to highlight the main points of this disclosure.

[0033] It should be noted that the information (including but not limited to user device information, user personal information, etc.), data (including but not limited to data used for analysis, data stored, data displayed, etc.) and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of related data must comply with the relevant laws, regulations and standards of the relevant regions.

[0034] In existing technologies, the content of the hardware instruction counter of the program being analyzed is periodically collected (called sampling) on the running hardware, and the collected content is periodically written into a pre-allocated sampling buffer. The driver periodically reads the data from the buffer. After the program being analyzed finishes, the driver summarizes all the read data and uses the instruction counter (assembly instructions) as the independent variable to generate a frequency histogram, thereby analyzing the program's execution characteristics.

[0035] Existing technology analyzes program performance by periodically sampling hardware instruction counters and writing the data into a circular buffer, which is then periodically read and summarized by the driver to generate a frequency histogram. To improve the accuracy of the frequency histogram, the sampling frequency often needs to be increased, but this can lead to the following problems.

[0036] Data overwrite risk: High-frequency sampling can quickly fill a fixed-size buffer. If data is not copied in time, older data may be overwritten and lost. To avoid data loss, either the sampling buffer needs to be increased (sacrificing the available memory space of the analyzed program, affecting program behavior); or a more sophisticated synchronization mechanism needs to be adopted (requiring rapid cooperation between hardware and drivers), increasing system complexity and development difficulty.

[0037] Bandwidth usage and resource consumption: High-frequency sampling means frequently moving data from the buffer to system storage. This not only consumes memory bandwidth that should be used to serve the program being analyzed, but also consumes more driver resources and energy, reducing overall system efficiency and increasing energy consumption. Furthermore, frequent data movement itself can interfere with the performance of the program being analyzed.

[0038] These drawbacks mean that in high-frequency sampling scenarios, the circular buffer method has to make a trade-off between sampling accuracy and system overhead, thus limiting the analysis results.
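The data overwrite risk described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the function names, the buffer capacity, the drain interval and the sample stream are all hypothetical, and `collections.deque` with `maxlen` merely stands in for a fixed-size circular buffer that the driver drains too slowly.

```python
from collections import Counter, deque
from typing import List


def ring_buffer_histogram(samples: List[int], capacity: int,
                          drain_every: int) -> Counter:
    """Model a circular buffer that the driver drains every `drain_every` samples."""
    ring: deque = deque(maxlen=capacity)  # a full deque silently drops its oldest entry
    histogram: Counter = Counter()
    for i, pc in enumerate(samples, start=1):
        ring.append(pc)
        if i % drain_every == 0:
            histogram.update(ring)  # the driver copies out whatever survived
            ring.clear()
    histogram.update(ring)  # final drain after the program ends
    return histogram


def per_instruction_histogram(samples: List[int]) -> Counter:
    """Model one counter per instruction: every sample is counted in place."""
    return Counter(samples)


# Hypothetical stream of sampled instruction counter values.
samples = [0x1000] * 6 + [0x1004] * 2

ring_hist = ring_buffer_histogram(samples, capacity=4, drain_every=8)
direct_hist = per_instruction_histogram(samples)
print(sum(ring_hist.values()), sum(direct_hist.values()))
```

In this toy run the circular buffer retains only the last four samples before the drain, so half of the samples are lost and the histogram is skewed, whereas counting each sample in place loses nothing.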

[0039] To address the aforementioned problems, embodiments of this disclosure provide a program analysis method that simplifies the workflow of transferring sampled data from on-chip to system storage, reducing the risk of data overwriting and loss, minimizing the impact on the target program, and simultaneously improving system execution efficiency and energy efficiency. A flowchart of one such program analysis method is described below.

[0040] Figure 1 shows a flowchart of a program analysis method according to an embodiment of this disclosure. As shown in Figure 1, the method may include:

[0041] In step S11, before the target program runs, a sampling buffer is allocated for the target program, wherein the sampling buffer includes a sampling buffer partition allocated for each instruction in the target program.

[0042] In step S12, when any sampling moment is reached during the execution of the target program, the instruction counter is sampled to determine the current sampling instruction, and the sampling counter in the target sampling buffer partition corresponding to the current sampling instruction is updated through atomic operations.

[0043] In step S13, after the target program finishes running, the total number of samples for each instruction in the target program is determined based on the sampling buffer.

[0044] In step S14, the target program is subjected to performance analysis based on the total number of samplings for each instruction in the target program, and the performance analysis results are obtained.

[0045] The program analysis method of this disclosure allocates a sampling buffer for the target program before it runs. The sampling buffer includes a sampling buffer partition allocated for each instruction in the target program. Since the instruction counter is used to store the address of the currently executing instruction, when any sampling moment is reached during the execution of the target program, after the instruction to be sampled is determined by the instruction counter, the sampling counter in the target sampling buffer partition corresponding to the current sampling instruction can be updated through atomic operations. Thus, the sampling buffer can be used to determine the real-time sampling count of each instruction in the target program, and after the target program finishes running, the total sampling count of each instruction in the target program can be directly determined. This ensures that the running of the target program is not affected while ensuring sampling accuracy. Furthermore, based on the total sampling count of each instruction in the target program, the execution time of each instruction in the target program can be analyzed, thereby locating the performance bottleneck in the target program, effectively realizing the performance analysis of the target program, and obtaining accurate performance analysis results.

[0046] Figure 2 shows a flowchart of a program analysis method according to an embodiment of this disclosure. As shown in Figure 2, the sampling buffer is initialized before the target program runs.

[0047] In one possible implementation, before the target program runs, a sampling buffer is allocated to the target program, including: determining the sampling buffer coefficient based on the preset sampling counter size, the minimum instruction size of the target program, and the preset total number of instruction states; and determining the sampling buffer size of the sampling buffer based on the instruction buffer size of the target program and the sampling buffer coefficient.

[0048] A sampling counter records the number of times one state of an instruction is sampled. The preset sampling counter size (CntSize) is the number of bytes required to store one sampling counter. For example, if the preset sampling counter size (CntSize) is 4 bytes, a single sampling counter can record up to 2^32 - 1 samples, which is sufficient to meet the sampling requirements.

[0049] The preset total number of instruction states (StatusNum) is related to the specific implementation of the system instruction control unit or the instruction states that the user is concerned about. Its specific value is a positive integer greater than or equal to 1, and this disclosure does not make any specific limitation on it.

[0050] In one example, if the only concern is whether an instruction is executed, without distinguishing between different stages of the execution pipeline, the preset total number of instruction states (StatusNum) is 1, specifically including: the instruction state StatusID=0 used to indicate instruction execution.

[0051] In one example, when differentiating instruction execution pipeline stages, the preset total number of instruction states (StatusNum) is a positive integer greater than 1, with the specific value determined based on the specific pipeline stage. For example, the preset total number of instruction states (StatusNum) is 4, specifically including: instruction state StatusID=0 for the Fetch pipeline stage, instruction state StatusID=1 for the Decode pipeline stage, instruction state StatusID=2 for the Execute pipeline stage, and instruction state StatusID=3 for the Writeback pipeline stage.

[0052] The sampling buffer size = the target program's instruction buffer size × sampling buffer coefficient, where the sampling buffer coefficient = (preset sampling counter size × preset instruction status total number) / target program's minimum instruction size = (CntSize × StatusNum) / InstrSize.

[0053] In a fixed-length instruction set (WLS) scenario, each instruction in the target program has the same instruction length. In this case, the minimum instruction size (InstrSize) of the target program is the length of a single instruction. For example, in a WLS scenario, if each instruction in the target program has a length of 4 bytes, then the minimum instruction size (InstrSize) of the target program is 4 bytes.

[0054] In a variable-length instruction set (VLS) scenario, instruction lengths may differ from one instruction to another in the target program. In this case, the minimum instruction size (InstrSize) of the target program is the smallest indivisible addressing unit within a single instruction. For example, in a VLS scenario, the minimum instruction size (InstrSize) of the target program is 1 byte.
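The sizing rule above can be worked through numerically. The following is a minimal sketch: the function names and the 4096-byte instruction buffer are assumptions made for illustration, and the code assumes the instruction buffer size is a whole multiple of InstrSize.

```python
def sampling_buffer_coefficient(cnt_size: int, status_num: int,
                                instr_size: int) -> float:
    """Coefficient = (CntSize x StatusNum) / InstrSize."""
    return (cnt_size * status_num) / instr_size


def sampling_buffer_size(instr_buffer_size: int, cnt_size: int,
                         status_num: int, instr_size: int) -> int:
    """Sampling buffer size in bytes = instruction buffer size x coefficient."""
    if instr_buffer_size % instr_size != 0:
        # Assumption of this sketch, not a requirement stated in the disclosure.
        raise ValueError("instruction buffer size must be a multiple of InstrSize")
    slots = instr_buffer_size // instr_size  # addressable instruction positions
    return slots * status_num * cnt_size     # one counter per state per position


# Fixed-length case: 4-byte instructions, 4 pipeline states, 4-byte counters.
fixed_size = sampling_buffer_size(4096, cnt_size=4, status_num=4, instr_size=4)
# Variable-length case: 1-byte addressing unit, 1 state, 4-byte counters.
variable_size = sampling_buffer_size(4096, cnt_size=4, status_num=1, instr_size=1)
print(fixed_size, variable_size)
```

With these example values both cases need 16384 bytes: the fixed-length case spends its space on four states per instruction, while the variable-length case spends it on one counter per byte of the instruction buffer.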

[0055] The sampling buffer allocated for the target program in this embodiment provides a sampling counter for each instruction state of each instruction in the target program, so as to record the number of times each instruction state of each instruction is sampled during the execution of the target program.

[0056] In one example, the target program can be executed on a target hardware processor, which may include multiple hardware processing units.

[0057] For example, when the target hardware processor is a central processing unit (CPU), the CPU includes multiple processing cores (hardware processing units); or, when the target hardware processor is a graphics processing unit (GPU), the GPU includes multiple stream processors (hardware processing units). The target hardware processor and its multiple hardware processing units can be flexibly changed according to the actual hardware executing the target program, and this disclosure does not impose specific limitations on them.

[0058] The sampling buffer allocated for the target program can be the chip storage (e.g., a buffer) of the target hardware processor, which is shared by multiple hardware processing units included in the target hardware processor, so that each hardware processing unit executing the target program can update the sampling counter in the target sampling buffer partition corresponding to the sampling instruction sampled by the instruction counter through atomic operations.
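The shared-buffer update can be sketched in software as follows. Several points are assumptions of this sketch rather than statements of the disclosure: the counter layout (counters of one instruction stored contiguously, indexed by StatusID), the names `counter_index` and `SamplingBuffer`, the constants, and the sample data. A `threading.Lock` stands in for the hardware atomic increment, and Python threads stand in for hardware processing units.

```python
import threading
from typing import List, Tuple

STATUS_NUM = 4   # assumed number of instruction states
INSTR_SIZE = 4   # assumed fixed instruction length in bytes


def counter_index(pc: int, instr_base: int, status_id: int) -> int:
    """Assumed layout: all states of one instruction sit next to each other."""
    slot = (pc - instr_base) // INSTR_SIZE
    return slot * STATUS_NUM + status_id


class SamplingBuffer:
    """One counter per (instruction, state), shared by all processing units."""

    def __init__(self, instr_buffer_size: int) -> None:
        slots = instr_buffer_size // INSTR_SIZE
        self.counters: List[int] = [0] * (slots * STATUS_NUM)  # initialized to 0
        self._lock = threading.Lock()  # software stand-in for a hardware atomic

    def atomic_increment(self, index: int) -> None:
        with self._lock:
            self.counters[index] += 1


def run_unit(buf: SamplingBuffer, samples: List[Tuple[int, int]],
             instr_base: int) -> None:
    """Simulate one processing unit reporting (instruction counter, state) samples."""
    for pc, status_id in samples:
        buf.atomic_increment(counter_index(pc, instr_base, status_id))


INSTR_BASE = 0x1000
buf = SamplingBuffer(instr_buffer_size=64)
samples = [(0x1000, 2), (0x1004, 0), (0x1000, 2)] * 1000  # hypothetical samples
units = [threading.Thread(target=run_unit, args=(buf, samples, INSTR_BASE))
         for _ in range(4)]
for unit in units:
    unit.start()
for unit in units:
    unit.join()
print(buf.counters[2], buf.counters[4])
```

Because every update is an indivisible read-modify-write on a single counter, no sample is lost when several units hit the same instruction at once, and the totals can be read directly once the run ends.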

[0059] In one possible implementation, the method further includes configuring the sampling counter in each sampling buffer partition to 0 before the target program runs.

[0060] Before the target program runs, the sampling buffer allocated to the target program is initialized, and the sampling counter in each sampling buffer partition in the sampling buffer is configured to 0, in order to prepare for the subsequent recording of the number of samples.

[0061] Then, the base address (SampleBase) of the sampling buffer is notified to each hardware processing unit executing the target program.

[0062] As shown in Figure 2, after the sampling buffer is initialized, the target program begins execution. During the execution of the target program, data can be periodically sampled from the instruction counter corresponding to any hardware processing unit executing the target program.

[0063] As shown in Figure 2, it is determined whether the target program has finished running. If not, it is further determined whether sampling should be triggered. If the sampling time has been reached, sampling is triggered; if the sampling time has not been reached, sampling is not triggered, and the method continues to determine whether the target program has finished running.

[0064] In one possible implementation, the method further includes: for any hardware processing unit executing the target program, determining the sampling period corresponding to the hardware processing unit based on a preset number of sampling clock cycles and the clock frequency corresponding to the hardware processing unit; and determining the sampling time of the hardware processing unit based on the sampling period corresponding to the hardware processing unit.

[0065] By configuring a preset number of sampling clock cycles, the sampling period of the hardware processing unit can be determined based on the preset number of sampling clock cycles and the clock frequency of the hardware processing unit, thereby determining the sampling time of the hardware processing unit. This allows the hardware processing unit to be triggered to perform data sampling when the sampling time is reached.

[0066] In one example, the clock frequency corresponding to a hardware processing unit depends on the clock frequency of the hardware processor in which the hardware processing unit resides. The clock frequency is the same for each hardware processing unit included in the same hardware processor.

[0067] In one example, the clock frequency of the hardware processor housing the hardware processing unit is F Hz, and the preset number of sampling clock cycles is P. The sampling period corresponding to the hardware processing unit is therefore P / F seconds. After the hardware processing unit starts data sampling, one data sampling is performed every P / F seconds.
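As a worked example of the P / F relationship, the sketch below computes the sampling period and the first few sampling times. The function names and the values F = 1 GHz and P = 1,000,000 cycles are illustrative assumptions, as is the convention that time 0 is the moment sampling starts.

```python
from typing import List


def sampling_period_seconds(p_cycles: int, f_hz: float) -> float:
    """Sampling period = P / F, with P in clock cycles and F in Hz."""
    if p_cycles <= 0 or f_hz <= 0:
        raise ValueError("P and F must be positive")
    return p_cycles / f_hz


def sampling_times(p_cycles: int, f_hz: float, count: int) -> List[float]:
    """Times of the first `count` samplings, measured from the start of sampling."""
    return [k * p_cycles / f_hz for k in range(1, count + 1)]


period = sampling_period_seconds(p_cycles=1_000_000, f_hz=1e9)
times = sampling_times(p_cycles=1_000_000, f_hz=1e9, count=3)
print(period, times)
```

With these values one sampling occurs every millisecond; halving P doubles the sampling frequency without any change to the hardware clock.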

[0068] The clock frequency F corresponding to the hardware processing unit depends on the hardware's own properties; the specific value of the preset sampling clock cycle number P can be flexibly configured according to the actual application scenario's requirements for the sampling frequency, and this disclosure does not impose specific limitations on it.

[0069] In one possible implementation, the instruction counter is sampled at any sampling moment during the execution of the target program to determine the sampling instruction. This includes: for any hardware processing unit executing the target program, selecting the sampling thread bundle for this sampling from the valid thread bundles currently scheduled to that hardware processing unit; and sampling the instruction counter of the sampling thread bundle for this sampling to determine the instruction counter value and instruction status of the sampling instruction.

[0070] For any hardware processing unit executing the target program, when the sampling time of the hardware processing unit is reached, the current sampling of the hardware processing unit is initiated, and the sampling thread bundle for this sampling is selected from the valid thread bundles currently scheduled to the hardware processing unit by the target program.

[0071] In one possible implementation, for any hardware processing unit executing the target program, the sampling thread bundle for this sampling is selected from the valid thread bundles currently scheduled to that hardware processing unit by the target program. This includes: determining the number of valid thread bundles currently scheduled to that hardware processing unit by the target program; generating N pseudo-random numbers based on a preset random number generation algorithm, where N is a positive integer; and taking each of the N pseudo-random numbers modulo the number of valid thread bundles to determine the N sampling thread bundles for this sampling.

[0072] For any hardware processing unit executing the target program, when the sampling time of the hardware processing unit is reached, the valid thread bundles currently scheduled to the hardware processing unit are sampled by generating random numbers and performing modulo operations to determine the N sampled thread bundles on the hardware processing unit for this sampling.

[0073] The specific value of N can be flexibly configured according to the actual application scenario's requirements for the amount of sampled data, and this disclosure does not impose specific limitations on it.

[0074] In one example, the preset random number generation algorithm is configured as the linear congruential method, and the N pseudo-random numbers are generated with it. The specific generation process can be found in relevant technologies, and this disclosure does not impose specific limitations on it.

[0075] In addition to the linear congruential method mentioned above, the preset random number generation algorithm can also be configured with other random number generation algorithms according to the actual application scenario. This disclosure does not make any specific limitations on this.

[0076] Each generated pseudo-random number is reduced modulo the number of valid thread bundles currently scheduled to the hardware processing unit to determine the N sampling thread bundles on the hardware processing unit for this sampling. For example, if the remainder of a pseudo-random number modulo the number of valid thread bundles is i, then the i-th valid thread bundle currently scheduled to the hardware processing unit is determined as a sampling thread bundle.
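The selection step described above can be sketched as follows. This is a minimal illustration and not the disclosed hardware logic: the linear congruential constants, the seed handling, and the function names `lcg_next` and `select_sampling_bundles` are assumptions introduced here, since the disclosure does not fix any of them.

```python
from typing import List

# Hypothetical LCG parameters (a commonly cited 32-bit set); the disclosure
# does not specify the multiplier, increment, or modulus.
LCG_A = 1664525
LCG_C = 1013904223
LCG_M = 2**32


def lcg_next(state: int) -> int:
    """Advance the linear congruential generator by one step."""
    return (LCG_A * state + LCG_C) % LCG_M


def select_sampling_bundles(seed: int, n: int, num_valid_bundles: int) -> List[int]:
    """Return the indices of the N sampling thread bundles for one sampling.

    Each pseudo-random number is reduced modulo the number of valid thread
    bundles, so every returned index lies in [0, num_valid_bundles).
    """
    if n <= 0:
        raise ValueError("N must be a positive integer")
    if num_valid_bundles <= 0:
        return []  # nothing is scheduled, so nothing can be sampled
    indices = []
    state = seed
    for _ in range(n):
        state = lcg_next(state)
        indices.append(state % num_valid_bundles)
    return indices
```

Because each index is drawn independently, the same thread bundle may be selected more than once in a single sampling under this sketch.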

[0077] In practical applications, a data sampling architecture can be configured for the hardware processing unit, and each sampling of the hardware processing unit can be completed based on the data sampling architecture. Figure 3 shows a schematic diagram of a sampling architecture according to an embodiment of the present disclosure. As shown in Figure 3, the sampling architecture includes: a sampling trigger counter, an instruction scheduling unit, an upstream data interface, an instruction execution unit (the unit that executes instructions in the hardware processing unit), and a sample address generation unit in a single hardware processing unit; and a sampling buffer allocated in the chip memory of the hardware processor where the hardware processing unit is located.

[0078] In the initial state (when sampling of the hardware processing unit is not initiated), the sampling trigger counter does not count. After configuring the preset sampling clock cycle number P and the start sampling signal through the driver, the sampling trigger counter starts counting.

[0079] When the sampling trigger counter reaches the configured preset sampling clock cycle number P, the sampling time of the hardware processing unit is reached. At this time, the sampling trigger counter sends a sampling signal to the instruction scheduling unit and resets the sampling trigger counter.
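The behavior of the sampling trigger counter can be modeled in software as below. This is only a behavioral sketch under stated assumptions: the class name `SamplingTriggerCounter` and its `start` and `tick` methods are hypothetical, and one `tick` call stands in for one clock cycle.

```python
class SamplingTriggerCounter:
    """Software model of a counter that fires every P clock cycles."""

    def __init__(self) -> None:
        self.period = 0        # preset sampling clock cycle number P
        self.count = 0
        self.enabled = False   # initial state: the counter does not count

    def start(self, period: int) -> None:
        """Model the driver configuring P and issuing the start sampling signal."""
        if period <= 0:
            raise ValueError("P must be a positive number of clock cycles")
        self.period = period
        self.count = 0
        self.enabled = True

    def tick(self) -> bool:
        """Advance one clock cycle; return True when a sampling signal is sent."""
        if not self.enabled:
            return False
        self.count += 1
        if self.count == self.period:
            self.count = 0     # reset after signaling the instruction scheduling unit
            return True
        return False
```

With `start(3)`, every third `tick` returns `True`, which corresponds to one sampling moment per P cycles.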

[0080] After receiving the sampling signal, the instruction scheduling unit generates N pseudo-random numbers based on a preset random number generation algorithm, and determines the number of valid thread bundles currently scheduled to the hardware processing unit through the upstream data interface; then, by taking the remainder of each generated pseudo-random number with respect to the number of valid thread bundles, the N sampling thread bundles on the hardware processing unit for this sampling are determined.

[0081] Then, the instruction scheduling unit samples the instruction counter of each sampling thread bundle in this sampling and sends it to the sample address generation unit to complete this sampling on the hardware processing unit and determine N sample data. Each sample data includes: the instruction counter value (PC) of the sampling thread bundle and the instruction status (StatusID).

[0082] The instruction counter value of the sampling thread bundle indicates the offset address of the currently executing sampled instruction relative to the base address (InstrBase) of the target program's instruction buffer. That is, each sampled data includes: the instruction counter value (PC) of the sampled instruction and the instruction status (StatusID).

[0083] As shown in Figure 2, for this sampling, the sample address is calculated to determine the target sampling buffer partition address corresponding to the current sampling instruction.

[0084] In one possible implementation, updating the sampling counter in the target sampling buffer partition corresponding to the current sampling instruction through atomic operations includes: determining the address of the target sampling buffer partition corresponding to the current sampling instruction based on the instruction counter value and instruction status of the current sampling instruction, the base address of the sampling buffer, the base address of the instruction buffer of the target program, the sampling buffer coefficient, and the preset sampling counter size; determining the target sampling buffer partition corresponding to the current sampling instruction based on the address of the target sampling buffer partition corresponding to the current sampling instruction; and updating the sampling counter in the target sampling buffer partition corresponding to the current sampling instruction through atomic operations.

[0085] For any hardware processing unit executing the target program, after determining the instruction counter value and instruction status of the current sampling instruction on the hardware processing unit, the target sampling buffer partition address corresponding to the current sampling instruction is further determined.

[0086] Taking Figure 3 as an example, after the instruction scheduling unit determines each piece of sample data, it sends the instruction counter value (PC) and instruction status (StatusID) of the current sampling instruction, which are included in the sample data, to the sample address generation unit.

[0087] The sample address generation unit determines the target sampling buffer partition address (Target) corresponding to the current sampling instruction based on the instruction counter value (PC) and instruction status (StatusID) of the current sampling instruction using the following formula.

[0088] Target sampling buffer partition address = base address of sampling buffer + (instruction counter value - base address of instruction buffer) × sampling buffer coefficient + preset sampling counter size × instruction status = SampleBase + (PC - InstrBase) × [(CntSize × StatusNum) / InstrSize] + StatusID × CntSize.
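The formula above can be checked with a small calculation. The sketch below follows the formula as written (PC minus InstrBase gives the instruction offset); the function names are hypothetical, and the requirement that CntSize × StatusNum be an exact multiple of InstrSize is an assumption made here so that the coefficient is an integer.

```python
def sampling_buffer_coefficient(cnt_size: int, status_num: int, instr_size: int) -> int:
    """Sampling buffer bytes reserved per byte of instruction buffer."""
    product = cnt_size * status_num
    if product % instr_size != 0:
        # Assumption of this sketch: the coefficient is a whole number.
        raise ValueError("CntSize * StatusNum must be a multiple of InstrSize")
    return product // instr_size


def target_partition_address(sample_base: int, pc: int, instr_base: int,
                             status_id: int, cnt_size: int,
                             status_num: int, instr_size: int) -> int:
    """Target = SampleBase + (PC - InstrBase) * coefficient + StatusID * CntSize."""
    if not 0 <= status_id < status_num:
        raise ValueError("StatusID must lie in [0, StatusNum)")
    coeff = sampling_buffer_coefficient(cnt_size, status_num, instr_size)
    return sample_base + (pc - instr_base) * coeff + status_id * cnt_size
```

For example, with CntSize = 4, StatusNum = 8, and InstrSize = 16, the coefficient is 2. Taking SampleBase = 0x1000, InstrBase = 0x8000, PC = 0x8020, and StatusID = 3 gives 0x1000 + 32 × 2 + 3 × 4 = 0x104C, that is, the fourth counter of the third instruction's partition.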

[0089] The target sampling buffer partition address determined by the sample address generation unit for the current sampling instruction points to the target sampling buffer partition corresponding to the current sampling instruction in the sampling buffer. Furthermore, the sample address generation unit can update the sampling counter in the target sampling buffer partition corresponding to the current sampling instruction through atomic operations.

[0090] As shown in Figure 2, for this sampling, after the sample address is calculated, the sample of this sampling is written into the sampling buffer.

[0091] In one possible implementation, the sampling counter in the target sampling buffer partition corresponding to the current sampling instruction is updated through atomic operations, including: performing an atomic increment operation on the sampling counter in the target sampling buffer partition corresponding to the current sampling instruction.

[0092] The atomic addition operation increments the sampling counter in the target sampling buffer partition corresponding to the current sampling instruction by 1, thus recording the number of samples for the current sampling instruction.
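Why the increment needs to be atomic can be illustrated with a software model in which several threads update the same counter. This is a sketch only: the class `SamplingBuffer` and the helper `run_concurrent_samples` are hypothetical, and a `threading.Lock` stands in for the hardware atomic addition that the disclosure describes.

```python
import threading


class SamplingBuffer:
    """One counter per (instruction, status) pair, updated atomically."""

    def __init__(self, num_instructions: int, status_num: int) -> None:
        self._status_num = status_num
        self._counters = [0] * (num_instructions * status_num)  # all start at 0
        self._lock = threading.Lock()  # stand-in for hardware atomicity

    def atomic_increment(self, instr_index: int, status_id: int) -> None:
        """Add 1 to the counter without losing concurrent updates."""
        slot = instr_index * self._status_num + status_id
        with self._lock:
            self._counters[slot] += 1

    def count(self, instr_index: int, status_id: int) -> int:
        with self._lock:
            return self._counters[instr_index * self._status_num + status_id]


def run_concurrent_samples(buf: SamplingBuffer, instr_index: int, status_id: int,
                           num_threads: int, samples_per_thread: int) -> None:
    """Have several threads record samples for the same instruction and status."""
    def worker() -> None:
        for _ in range(samples_per_thread):
            buf.atomic_increment(instr_index, status_id)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because every update is serialized, the final count equals the number of samples recorded, regardless of how the threads interleave.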

[0093] Based on the aforementioned sampling and atomic operations, the real-time sampling count for each instruction in the target program can be determined using the sampling buffer, and the total sampling count for each instruction in the target program can be directly determined after the target program finishes running. Since the data in the sampling buffer does not need to be written out frequently, the running of the target program is not affected while sampling accuracy is ensured.

[0094] As shown in Figure 2, after the current sample is written to the sampling buffer, it is determined whether the target program has finished running. If so, the sampled data is written to system storage to release the sampling buffer resources.

[0095] In one possible implementation, the method further includes writing the sampled data in the sampling buffer to the target storage space after the target program finishes running.

[0096] Because the sampling buffer is located in high-speed but expensive chip storage on the hardware processing unit, after the target program finishes running, the sampled data in the sampling buffer is written to the external target storage space to clear and release the buffer for use by other tasks or other programs to be analyzed. Furthermore, the target program's sampled data, after being written to the external target storage space, can be persistently stored for repeated use later.

[0097] As shown in Figure 2, data analysis is performed based on the sampled data to obtain performance analysis results.

[0098] Based on the total number of samples for each instruction in the target program, a performance analysis is performed on the target program to obtain the performance analysis results.

[0099] In one example, a first frequency histogram is plotted based on the total number of samples for each instruction in the target program. The horizontal axis of the first frequency histogram represents each instruction in the target program, and the vertical axis represents the total number of samples for each instruction. Based on the first frequency histogram, the location of performance bottlenecks in the target program can be quickly identified.

[0100] For example, instructions whose total number of samples exceeds the threshold are the location of the performance bottleneck in the target program.
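The threshold check can be sketched as follows. The layout (StatusNum consecutive counters per instruction) follows the address formula in this disclosure, but treating an instruction's total as the sum of its per-status counters is an assumption of this sketch, as are the function names and the choice of threshold.

```python
from typing import List


def total_samples_per_instruction(counters: List[int], status_num: int) -> List[int]:
    """Sum each instruction's per-status counters into one total (assumed rule)."""
    if status_num <= 0 or len(counters) % status_num != 0:
        raise ValueError("counter array must hold status_num entries per instruction")
    return [sum(counters[i:i + status_num])
            for i in range(0, len(counters), status_num)]


def find_bottlenecks(totals: List[int], threshold: int) -> List[int]:
    """Return the indices of instructions whose total exceeds the threshold."""
    return [index for index, total in enumerate(totals) if total > threshold]
```

For counters `[1, 2, 10, 30, 0, 5]` with two statuses per instruction, the totals are `[3, 40, 5]`, and a threshold of 10 flags only instruction 1.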

[0101] In one possible implementation, performance analysis is performed on the target program based on the total number of samples for each instruction in the target program to obtain performance analysis results. This includes: determining the high-level language code segment corresponding to each instruction in the target program; determining the total number of samples for each high-level language code segment corresponding to the target program based on the total number of samples for each instruction in the target program; and performing performance analysis on the target program based on the total number of samples for each high-level language code segment corresponding to the target program to obtain performance analysis results.

[0102] To reduce the professional requirements for program performance analysis, the number of samplings for each high-level language code segment in the target program can be determined based on the mapping relationship between each instruction in the target program and the high-level language code segment. This allows for direct performance analysis of the high-level language code segment corresponding to the target program, yielding performance analysis results.

[0103] In one example, a second frequency histogram is plotted based on the total number of samples for each high-level language code segment corresponding to the target program. The horizontal axis of the second frequency histogram represents each high-level language code segment of the target program, and the vertical axis represents the total number of samples for each high-level language code segment. Based on the second frequency histogram, the location of performance bottlenecks in the target program can be quickly identified.

[0104] For example, the high-level language code segment whose total number of samples exceeds the threshold is the location of the performance bottleneck in the target program.
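Aggregating instruction totals into high-level language code segments can be sketched as below. The mapping from instruction index to a code segment label is assumed to come from compiler debug information; the dictionary format, the `"<unknown>"` fallback label, and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def totals_per_code_segment(instr_totals: List[int],
                            instr_to_segment: Dict[int, str]) -> Dict[str, int]:
    """Sum the totals of all instructions that map to the same code segment."""
    segment_totals: Dict[str, int] = defaultdict(int)
    for index, total in enumerate(instr_totals):
        # Instructions without a known source mapping are grouped together.
        segment = instr_to_segment.get(index, "<unknown>")
        segment_totals[segment] += total
    return dict(segment_totals)
```

For instance, if instructions 1 and 2 both come from the same source line, their totals are added, and the threshold comparison is then applied to the per-segment totals rather than to individual instructions.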

[0105] In one possible implementation, the method further includes: constructing a visual performance view based on the total number of samples for each instruction in the target program, or based on the total number of samples for each high-level language code segment corresponding to the target program.

[0106] To further help program developers intuitively understand performance data, a visual performance view can be built using visualization tools for performance analysis, based on the total number of samples for each instruction in the target program, or the total number of samples for each high-level language code segment corresponding to the target program. This will help program developers quickly identify the location of performance bottlenecks in the target program.

[0107] In one example, a visual performance view could be a flame graph. The specific form of the visual performance view depends on the visualization tools that are flexibly configured in the actual application scenario, and this disclosure does not impose any specific limitations on it.

[0108] Quickly locating the performance bottlenecks in a target program's execution process enables program developers to effectively optimize the program, greatly improving optimization efficiency and effectiveness.

[0109] It is understood that the various method embodiments mentioned above in this disclosure can be combined with each other to form combined embodiments without violating the principle and logic. Due to space limitations, this disclosure will not elaborate further. Those skilled in the art will understand that in the above methods of specific implementation, the specific execution order of each step should be determined by its function and possible internal logic.

[0110] In addition, this disclosure also provides a program analysis apparatus, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any of the program analysis methods provided in this disclosure. The corresponding technical solutions and descriptions are described in the corresponding section of the method and will not be repeated here.

[0111] Figure 4 shows a block diagram of a program analysis apparatus according to an embodiment of the present disclosure. As shown in Figure 4, the device 40 includes:

[0112] The sampling buffer allocation module 41 is used to allocate a sampling buffer for the target program before the target program runs, wherein the sampling buffer includes a sampling buffer partition allocated for each instruction in the target program;

[0113] The sampling module 42 is used to sample the instruction counter when any sampling moment is reached during the execution of the target program, determine the current sampling instruction, and update the sampling counter in the target sampling buffer partition corresponding to the current sampling instruction through atomic operations.

[0114] The statistics module 43 is used to determine the total number of samples for each instruction in the target program based on the sampling buffer after the target program has finished running.

[0115] Analysis module 44 is used to perform performance analysis on the target program based on the total number of samplings for each instruction in the target program, and obtain the performance analysis results.

[0116] In one possible implementation, the sampling buffer allocation module 41 is used for:

[0117] The sampling buffer coefficient is determined based on the preset sampling counter size, the minimum instruction size of the target program, and the preset total number of instruction states;

[0118] The sampling buffer size is determined based on the instruction buffer size and sampling buffer coefficient of the target program.
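The two sizing steps can be sketched numerically. The coefficient expression matches the bracketed term of the address formula given earlier in this disclosure; taking the sampling buffer size as the product of the instruction buffer size and the coefficient is an assumption consistent with that formula, and the function names are hypothetical.

```python
def sampling_buffer_coefficient(cnt_size: int, status_num: int, instr_size: int) -> int:
    """Coefficient = (CntSize * StatusNum) / InstrSize, assumed to be an integer."""
    product = cnt_size * status_num
    if product % instr_size != 0:
        raise ValueError("CntSize * StatusNum must be a multiple of InstrSize")
    return product // instr_size


def sampling_buffer_size(instr_buffer_size: int, cnt_size: int,
                         status_num: int, instr_size: int) -> int:
    """Bytes needed so that every instruction gets StatusNum counters (assumed product rule)."""
    coeff = sampling_buffer_coefficient(cnt_size, status_num, instr_size)
    return instr_buffer_size * coeff
```

With a 4096-byte instruction buffer, 16-byte minimum instructions, 4-byte counters, and 8 instruction states, the buffer holds at most 256 instructions with 32 bytes of counters each, so 8192 bytes are needed.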

[0119] In one possible implementation, device 40 further includes:

[0120] The sampling buffer initialization module is used to configure the sampling counter in each sampling buffer partition to 0 before the target program runs.

[0121] In one possible implementation, the sampling module 42 is used for:

[0122] For any hardware processing unit executing the target program, select the sampling thread bundle for this sampling from the valid thread bundles currently scheduled to that hardware processing unit by the target program;

[0123] Data is sampled from the instruction counter of the sampling thread bundle for this sampling to determine the instruction counter value and instruction status of the sampling instruction.

[0124] In one possible implementation, the sampling module 42 is used for:

[0125] Based on the instruction counter value and instruction status of this sampling instruction, the base address of the sampling buffer, the base address of the instruction buffer of the target program, the sampling buffer coefficient, and the preset sampling counter size, determine the target sampling buffer partition address corresponding to this sampling instruction;

[0126] Based on the target sampling buffer partition address corresponding to this sampling instruction, determine the target sampling buffer partition corresponding to this sampling instruction;

[0127] The sampling counter in the target sampling buffer partition corresponding to this sampling instruction is updated through atomic operations.

[0128] In one possible implementation, the sampling module 42 is used for:

[0129] The sampling counter in the target sampling buffer partition corresponding to this sampling instruction is incremented by 1 through an atomic addition operation.

[0130] In one possible implementation, the sampling module 42 is used for:

[0131] Determine the number of valid thread bundles currently scheduled to this hardware processing unit for the target program;

[0132] Based on a preset random number generation algorithm, N pseudo-random numbers are generated, where N is a positive integer;

[0133] A modulo operation is performed on each of the N pseudo-random numbers with respect to the number of valid thread bundles to determine the N sampling thread bundles for this sampling.

[0134] In one possible implementation, device 40 further includes: a sampling time determination module, used for:

[0135] The sampling period corresponding to the hardware processing unit is determined based on the preset number of sampling clock cycles and the clock frequency corresponding to the hardware processing unit.

[0136] The sampling time of the hardware processing unit is determined based on the sampling period corresponding to the hardware processing unit.
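The relationship between the preset number of sampling clock cycles, the clock frequency, and the sampling times can be illustrated as below. Dividing the cycle count by the frequency and placing sampling times at whole multiples of the resulting period are assumptions of this sketch; the function names are hypothetical.

```python
from typing import List


def sampling_period_seconds(preset_cycles: int, clock_hz: float) -> float:
    """Period = preset sampling clock cycles / clock frequency (assumed relation)."""
    if preset_cycles <= 0 or clock_hz <= 0:
        raise ValueError("cycle count and clock frequency must be positive")
    return preset_cycles / clock_hz


def sampling_times(preset_cycles: int, clock_hz: float, count: int) -> List[float]:
    """First `count` sampling times, measured from the start sampling signal."""
    period = sampling_period_seconds(preset_cycles, clock_hz)
    return [k * period for k in range(1, count + 1)]
```

For example, 1000 cycles at a 1 GHz clock corresponds to a sampling period of about one microsecond.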

[0137] In one possible implementation, device 40 further includes:

[0138] The write data module is used to write the sampled data in the sampling buffer to the target storage space after the target program finishes running.

[0139] In one possible implementation, the analysis module 44 is used for:

[0140] Identify the high-level language code segment corresponding to each instruction in the target program;

[0141] The total number of samples for each high-level language code segment corresponding to the target program is determined based on the total number of samples for each instruction in the target program.

[0142] Based on the total number of samples for each high-level language code segment corresponding to the target program, a performance analysis is performed on the target program to obtain the performance analysis results.

[0143] In one possible implementation, device 40 further includes:

[0144] The visualization module is used to build a visual performance view based on the total number of samples for each instruction in the target program, or based on the total number of samples for each high-level language code segment corresponding to the target program.

[0145] In some embodiments, the functions or modules of the apparatus provided in this disclosure can be used to perform the methods described in the above method embodiments. The specific implementation can be referred to the description of the above method embodiments, and for the sake of brevity, it will not be repeated here.

[0146] This disclosure also provides an electronic device, including a memory, a processor, and a computer program stored in the memory, wherein the processor executes the computer program to implement the steps of the above method.

[0147] This disclosure also provides a non-volatile computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of the above-described method.

[0148] This disclosure also provides a computer program product, including a computer program or a non-volatile computer-readable storage medium carrying the computer program, wherein the computer program, when executed by a processor, implements the steps of the above method.

[0149] Figure 5 shows a block diagram of an electronic device according to an embodiment of the present disclosure. Referring to Figure 5, the electronic device 1900 can be provided as a server or a terminal device. The electronic device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by the processing component 1922. The application programs stored in memory 1932 may include one or more modules, each corresponding to a set of instructions. Furthermore, the processing component 1922 is configured to execute instructions to perform the methods described above.

[0150] Electronic device 1900 may also include a power supply component 1926 configured to perform power management of electronic device 1900, a wired or wireless network interface 1950 configured to connect electronic device 1900 to a network, and an input/output (I/O) interface 1958. Electronic device 1900 can operate on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.

[0151] In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as a memory 1932 including computer program instructions that can be executed by a processing component 1922 of an electronic device 1900 to perform the above-described method.

[0152] Computer-readable storage media can be tangible devices capable of holding and storing programs/instructions used by instruction execution devices. Computer-readable storage media can be, for example, but are not limited to, electrical storage devices, magnetic storage devices, optical storage devices, electromagnetic storage devices, semiconductor storage devices, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. The computer-readable storage media used herein are not to be construed as transitory signals themselves, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber optic cables), or electrical signals transmitted through wires.

[0153] The computer program (or computer-readable program instructions) described herein can be downloaded from a computer-readable storage medium to various computing / processing devices, or downloaded via a network, such as the Internet, local area network, wide area network, and / or wireless network, to an external computer or external storage device. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and / or edge servers. A network adapter card or network interface in each computing / processing device receives the computer-readable program instructions from the network and forwards them to the computer-readable storage medium in the respective computing / processing device.

[0154] The computer program (or computer program instructions) used to perform the operations of this disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++, etc., and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., via the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), is personalized by utilizing state information from the computer-readable program instructions to implement various aspects of this disclosure.

[0155] Various aspects of this disclosure are described herein with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this disclosure. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer-readable program instructions.

[0156] These computer-readable program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that, when executed by the processor of the computer or other programmable data processing apparatus, they create means for implementing the functions / actions specified in one or more blocks of the flowchart and / or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium that causes a computer, programmable data processing apparatus, and / or other device to operate in a particular manner; thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions / actions specified in one or more blocks of the flowchart and / or block diagram.

[0157] Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, thereby causing the instructions executed on the computer, other programmable data processing apparatus, or other device to perform the functions / actions specified in one or more boxes of a flowchart and / or block diagram.

[0158] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of an instruction containing one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions marked in the blocks may occur in a different order than those shown in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.

[0159] The various embodiments of this disclosure have been described above. These descriptions are exemplary and not exhaustive, nor are they limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, their practical application, or their technical improvement over technologies in the market, or to enable others skilled in the art to understand the embodiments disclosed herein.

Claims

1. A program analysis method, characterized in that the method comprises: Before the target program runs, a sampling buffer is allocated to the target program. The sampling buffer includes a sampling buffer partition allocated for each instruction in the target program. The sampling buffer partition allocated for each instruction includes a sampling counter for recording the number of times the state of each instruction is sampled. When the target program reaches any sampling moment during its execution, the instruction counter is sampled to determine the current sampling instruction, and the sampling counter in the target sampling buffer partition corresponding to the current sampling instruction is updated through atomic operations. After the target program finishes running, the total number of samples for each instruction in the target program is determined based on the sampling buffer. Based on the total number of samples for each instruction in the target program, a performance analysis is performed on the target program to obtain the performance analysis results.

2. The method according to claim 1, characterized in that, The step of allocating a sampling buffer for the target program before it runs includes: The sampling buffer coefficient is determined based on the preset sampling counter size, the minimum instruction size of the target program, and the preset total number of instruction states; The sampling buffer size is determined based on the instruction buffer size of the target program and the sampling buffer coefficient.

3. The method according to claim 1, characterized in that, The method further includes: Before the target program runs, the sampling counter in each sampling buffer partition is configured to 0.

4. The method according to claim 2, characterized in that, The step of sampling the instruction counter at any sampling moment during the execution of the target program to determine the sampling instruction includes: For any hardware processing unit executing the target program, select the sampling thread bundle for this sampling from the valid thread bundles currently scheduled to that hardware processing unit by the target program; Data is sampled from the instruction counter of the sampling thread bundle for this sampling to determine the instruction counter value and instruction status of the sampling instruction.

5. The method according to claim 4, characterized in that, The step of updating the sampling counter in the target sampling buffer partition corresponding to the current sampling instruction through atomic operations includes: Based on the instruction counter value and instruction status of this sampling instruction, the base address of the sampling buffer, the base address of the instruction buffer of the target program, the sampling buffer coefficient, and the preset sampling counter size, the target sampling buffer partition address corresponding to this sampling instruction is determined; Based on the target sampling buffer partition address corresponding to this sampling instruction, determine the target sampling buffer partition corresponding to this sampling instruction; The sampling counter in the target sampling buffer partition corresponding to this sampling instruction is updated through atomic operations.

6. The method according to claim 5, characterized in that, The step of updating the sampling counter in the target sampling buffer partition corresponding to the current sampling instruction through atomic operations includes: The sampling counter in the target sampling buffer partition corresponding to this sampling instruction is incremented by 1 through an atomic addition operation.

7. The method according to claim 4, characterized in that the step of selecting the sampling thread bundle for the current sampling from the valid thread bundles of the target program currently scheduled to that hardware processing unit comprises: determining the number of valid thread bundles of the target program currently scheduled to the hardware processing unit; generating N pseudo-random numbers based on a preset random number generation algorithm, where N is a positive integer; and taking each of the N pseudo-random numbers modulo the number of valid thread bundles to determine the N sampling thread bundles for the current sampling.
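
A minimal sketch of the selection in claim 7 follows. The claim does not specify the preset random number generation algorithm, so Python's `random.Random` is used here as an assumed stand-in, and the function name and the 32-bit draw width are hypothetical.

```python
import random


def select_sampling_bundles(valid_bundles, n, seed=None):
    """Pick n sampling thread bundles from the currently valid ones.

    Each pseudo-random draw is reduced modulo the number of valid thread
    bundles to obtain an index into valid_bundles.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    count = len(valid_bundles)
    if count == 0:
        return []  # nothing is scheduled, so nothing can be sampled
    rng = random.Random(seed)
    draws = [rng.getrandbits(32) for _ in range(n)]
    return [valid_bundles[d % count] for d in draws]


# Hypothetical thread bundle identifiers scheduled on one processing unit.
picked = select_sampling_bundles([3, 7, 8, 12, 15], n=2, seed=42)
```

The modulo step can select the same thread bundle more than once when N is greater than 1; the claim as worded does not exclude this.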

8. The method according to claim 4, characterized in that the method further comprises: determining the sampling period corresponding to the hardware processing unit based on a preset number of sampling clock cycles and the clock frequency corresponding to the hardware processing unit; and determining the sampling moments of the hardware processing unit based on the sampling period corresponding to the hardware processing unit.
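
The relationship in claim 8 amounts to dividing the preset number of sampling clock cycles by the clock frequency. The sketch below shows that arithmetic; the function names are hypothetical, and placing the first sampling moment one full period after the start is an assumption.

```python
def sampling_period_seconds(sampling_cycles, clock_hz):
    """Sampling period = preset sampling clock cycles / clock frequency."""
    if sampling_cycles <= 0 or clock_hz <= 0:
        raise ValueError("cycle count and clock frequency must be positive")
    return sampling_cycles / clock_hz


def sampling_moments(sampling_cycles, clock_hz, total_cycles):
    """Sampling moments in seconds, one every sampling_cycles cycles.

    Moments are generated on integer cycle counts and converted to seconds
    afterwards, which avoids accumulating floating-point error.
    """
    if sampling_cycles <= 0 or clock_hz <= 0:
        raise ValueError("cycle count and clock frequency must be positive")
    marks = range(sampling_cycles, total_cycles + 1, sampling_cycles)
    return [cycle / clock_hz for cycle in marks]


# Hypothetical unit clocked at 1 GHz, sampled every 1,000,000 cycles.
period = sampling_period_seconds(1_000_000, 1_000_000_000)
moments = sampling_moments(1_000_000, 1_000_000_000, total_cycles=5_000_000)
```

Hardware processing units running at different clock frequencies get different sampling periods in seconds from the same preset cycle count.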

9. The method according to claim 1, characterized in that the method further comprises: after the target program finishes running, writing the sampled data in the sampling buffer to a target storage space.

10. The method according to claim 1, characterized in that the step of performing performance analysis on the target program according to the total number of samples of each instruction in the target program to obtain the performance analysis result comprises: identifying the high-level language code segment corresponding to each instruction in the target program; determining the total number of samples of each high-level language code segment corresponding to the target program based on the total number of samples of each instruction in the target program; and performing performance analysis on the target program based on the total number of samples of each high-level language code segment to obtain the performance analysis result.
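
The aggregation in claim 10 can be sketched as a two-stage sum. The data layout is an assumption: the sampling buffer is modeled as one list of per-status counters per instruction, and the mapping from instruction to high-level code segment is a plain list of hypothetical "file:line" labels. All names are illustrative.

```python
from collections import defaultdict


def totals_per_instruction(sampling_buffer):
    """Sum the per-status counters of each instruction's partition."""
    return [sum(partition) for partition in sampling_buffer]


def totals_per_segment(instr_totals, instr_to_segment):
    """Fold per-instruction totals into per-code-segment totals."""
    if len(instr_totals) != len(instr_to_segment):
        raise ValueError("every instruction needs exactly one segment label")
    totals = defaultdict(int)
    for index, count in enumerate(instr_totals):
        totals[instr_to_segment[index]] += count
    return dict(totals)


def hottest_segments(segment_totals, top=3):
    """Return the most frequently sampled segments, highest first."""
    ranked = sorted(segment_totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top]


# Hypothetical buffer: 4 instructions, 2 status counters per partition.
buffer = [[3, 1], [0, 0], [10, 5], [2, 0]]
segments = ["kernel.cu:10", "kernel.cu:10", "kernel.cu:12", "kernel.cu:15"]

instr_totals = totals_per_instruction(buffer)
segment_totals = totals_per_segment(instr_totals, segments)
hotspot = hottest_segments(segment_totals, top=1)
```

The segment with the highest total is the first candidate for a performance bottleneck, since it was observed executing at the largest share of sampling moments.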

11. The method according to claim 1 or 10, characterized in that the method further comprises: constructing a visual performance view based on the total number of samples of each instruction in the target program, or based on the total number of samples of each high-level language code segment corresponding to the target program.

12. A program analysis device, characterized by comprising: a sampling buffer allocation module, configured to allocate a sampling buffer for a target program before the target program runs, wherein the sampling buffer includes a sampling buffer partition allocated for each instruction in the target program, and the sampling buffer partition allocated for each instruction includes a sampling counter for recording the number of times each state of that instruction is sampled; a sampling module, configured to perform instruction counter sampling when any sampling moment is reached during the running of the target program, determine the current sampling instruction, and update, through an atomic operation, the sampling counter in the target sampling buffer partition corresponding to the current sampling instruction; a statistics module, configured to determine, after the target program has finished running, the total number of samples of each instruction in the target program according to the sampling buffer; and an analysis module, configured to perform performance analysis on the target program according to the total number of samples of each instruction in the target program to obtain a performance analysis result.

13. An electronic device comprising a memory, a processor, and a computer program stored in the memory, characterized in that the processor executes the computer program to implement the steps of the method according to any one of claims 1 to 11.

14. A non-volatile computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 11.

15. A computer program product comprising a computer program, or a non-volatile computer-readable storage medium carrying a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 11.