A method of collecting micro-architectural event information

By alternating sampling of multiple CPU microarchitecture event groups, dividing time periods according to program performance fluctuations and adjusting the sampling frequency, the problem of low efficiency in existing technologies is solved, and the effect of efficiently collecting CPU microarchitecture event information is achieved.

CN114579291B · Active · Publication Date: 2026-06-16 · SHENZHEN INST OF ADVANCED TECH CHINESE ACAD OF SCI

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: SHENZHEN INST OF ADVANCED TECH CHINESE ACAD OF SCI
Filing Date: 2020-12-01
Publication Date: 2026-06-16

Abstract

The application discloses a method for collecting micro-architecture event information. The method comprises the following steps: setting a counter to collect a micro-architecture event and recording the running time of a program when the program is run for the first time; dividing the running of the program into multiple fluctuation type periods according to the performance fluctuation during the first running of the program; setting corresponding micro-architecture event groups and collection frequencies for each fluctuation type period when the program is run subsequently, and collecting micro-architecture event information by using the corresponding micro-architecture event groups in an alternating frequency mode for different fluctuation type periods. The application sets multiple micro-architecture event groups for each running of the program, and performs alternating sampling, so that the collection efficiency is significantly improved, and the quality of the collected information is not affected.

Description

Technical Field

[0001] This invention relates to the field of computer technology, and more specifically, to a method for collecting microarchitecture event information.

Background Technology

[0002] Modern processors typically provide a small number of hardware performance counters to capture a large number of microarchitectural events. With the rapid development of internet technology, large-scale clusters with thousands of servers operate around the clock ("24/7/365"). Mainstream companies are eager to understand the performance behavior of these clusters because even small performance improvements can save significant amounts of money. The counters can easily collect large amounts (e.g., gigabytes per day) of CPU microarchitectural event information, such as cache and TLB misses. These CPU microarchitectural events often explain the root causes of performance bottlenecks in computer systems. This event information therefore provides a valuable foundation for root cause analysis of performance bottlenecks, architectural and compiler optimizations, and more.

[0003] The current method for collecting CPU microarchitecture events is "one counter per event," meaning that a counter counts only one event throughout the entire program's execution. After the program finishes running, the number of CPU microarchitecture events collected equals the number of counters. Then, with other parameters remaining constant, new CPU microarchitecture events are selected to be collected, and the program is run repeatedly several times until all CPU microarchitecture events have been collected.

[0004] Typically, every modern processor has a logical unit called a Performance Monitoring Unit (PMU), which consists of a set of hardware counters. These counters count how many times a particular event occurs within a time interval of program execution. The number of counters can vary depending on the microarchitecture. The events that can be measured by the hardware counters are predefined by the processor vendor, and the number of measurable events can vary significantly between microarchitectures. For example, a commonly used CPU microarchitecture predefines 236 events but provides only 6 hardware counters. The number of events therefore far exceeds the number of hardware counters.

[0005] Existing methods for collecting CPU microarchitecture events use a one-counter-one-event approach, where a single counter counts only one event throughout the program's execution. While this method provides accurate event counts, each program run yields only as many performance events as there are performance counters. To characterize program performance, two of the performance counters are needed to monitor cycles and instructions, which leaves only four performance counters for the performance events being tested. To measure a complete set of performance events, the program needs to be run 59 times. Since measuring a large number of events is typically necessary to determine the root cause of unknown system performance bottlenecks, the one-counter-one-event method is highly inefficient because of the many program runs required. In real-world production environments, a single program run can take several hours, so repeating it 59 times to collect CPU microarchitecture event information is extremely time-consuming.

Summary of the Invention

[0006] The purpose of this invention is to address the problem of high time cost in collecting CPU microarchitecture event information by providing a new technical solution for collecting microarchitecture event information. This solution involves setting multiple groups of CPU microarchitecture events for each program run and sampling them alternately, thereby significantly shortening the time for collecting CPU microarchitecture events and improving collection efficiency.

[0007] The technical solution of the present invention is to provide a method for collecting microarchitecture event information, the method comprising the following steps:

[0008] When the program runs for the first time, a counter is set to collect a microarchitectural event and the program's runtime is recorded.

[0009] Based on the performance fluctuations during the first program run, the program run is divided into multiple fluctuation-type periods;

[0010] During subsequent program execution, for each fluctuation type period, a corresponding microarchitecture event group and collection frequency are set, and microarchitecture event information is collected in an alternating frequency manner using the corresponding microarchitecture event group for different fluctuation type periods.

[0011] Compared with the prior art, the advantages of this invention are that it proposes a method for alternately collecting multiple CPU microarchitecture event groups when collecting CPU microarchitecture event information. While ensuring the quality of event information, it can effectively reduce the collection time of CPU microarchitecture event information by reducing the number of times the program is repeatedly run. The method of classifying the running time according to the fluctuation of the program's runtime performance and adopting different sampling strategies according to the current fluctuation type of the program at different stages of program execution improves the collection efficiency and reduces system overhead.

[0012] Other features and advantages of the invention will become clear from the following detailed description of exemplary embodiments of the invention with reference to the accompanying drawings.

Description of the Drawings

[0013] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments of the invention and, together with their description, serve to explain the principles of the invention.

[0014] Figure 1 is a flowchart of a method for collecting microarchitecture event information according to an embodiment of the present invention;

[0015] Figure 2 is a schematic diagram of the overall process of a method for collecting microarchitecture event information according to an embodiment of the present invention;

[0016] Figure 3 is a schematic diagram of event group alternation in a CPU microarchitecture according to an embodiment of the present invention;

[0017] Figure 4 is a schematic diagram illustrating the time required to collect all CPU microarchitecture event information according to an embodiment of the present invention;

[0018] Figure 5 is a schematic diagram illustrating the verification of the quality of collected CPU microarchitecture event information according to an embodiment of the present invention.

Detailed Implementation

[0019] Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specifically stated, the relative arrangement, numerical expressions, and values of the components and steps set forth in these embodiments do not limit the scope of the invention.

[0020] The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the invention or its application or use.

[0021] Techniques, methods, and equipment known to those skilled in the art may not be discussed in detail, but where appropriate, such techniques, methods, and equipment should be considered part of the specification.

[0022] In all the examples shown and discussed herein, any specific values should be interpreted as merely exemplary and not as limitations. Therefore, other examples of exemplary embodiments may have different values.

[0023] It should be noted that similar labels and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be discussed further in subsequent figures.

[0024] This invention involves continuously and repeatedly running a program under fixed parameters to obtain corresponding CPU microarchitecture event information. During the first program run, a "one counter, one event" method is used to collect CPU microarchitecture events, and the program execution time is recorded. Simultaneously, based on performance fluctuations during the first program run, the overall execution time is segmented and categorized. Then, an alternating strategy for microarchitecture event groups is used to collect the microarchitecture events generated during program execution.

[0025] This invention uses a set of randomly generated feasible configuration parameters and reduces the time required for event information collection by alternating the collection of CPU microarchitecture event groups. In short, as shown in Figure 1 and Figure 2, the method for collecting microarchitecture event information provided by the present invention includes: step S110, during the first run of the program, setting a counter to collect a microarchitecture event and recording the program run time; step S120, dividing the program run into multiple fluctuation-type time periods based on the performance fluctuations during the first program run; step S130, during subsequent program runs, setting a corresponding microarchitecture event group and collection frequency for each fluctuation-type time period, and collecting microarchitecture event information with the corresponding microarchitecture event groups in an alternating, frequency-varying manner for the different fluctuation-type time periods.

[0026] Specifically, microarchitectural events refer to performance metrics reflecting interactions with a microarchitecture (e.g., a CPU). For example, microarchitectural events include, but are not limited to, the number of instructions, the number of cycles, and the number of macro branch instructions retired due to misprediction. Preferably, the instruction and cycle counts are always collected, so that they can be used to calculate the number of instructions executed per cycle, denoted IPC (instructions per cycle), which is the ratio of instructions to cycles.

[0027] Suppose we have n hardware counters and m predefined CPU microarchitecture events. To characterize program performance, we use IPC as the evaluation metric. According to the IPC calculation formula, IPC = instructions / cycles, each CPU microarchitecture event group needs two fixed events: Cycles and Instructions. Therefore, only (n-2) performance counters remain to monitor the remaining performance events to be measured.
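As a minimal sketch of the IPC calculation described above, the following Python snippet computes IPC for each counter interval from paired instruction and cycle counts. The function name `ipc_series` and the sample values are illustrative assumptions and do not appear in the original disclosure.

```python
def ipc_series(instructions, cycles):
    """Return IPC = instructions / cycles for each counter interval.

    Both arguments are hypothetical per-interval counts taken from the two
    fixed events (instructions, cycles) present in every event group.
    """
    if len(instructions) != len(cycles):
        raise ValueError("instructions and cycles must have the same length")
    # An interval with zero cycles yields IPC 0.0 instead of dividing by zero.
    return [i / c if c else 0.0 for i, c in zip(instructions, cycles)]


# Made-up counts for three counter intervals.
samples = ipc_series([1200, 900, 0], [1000, 1000, 0])
```

The resulting list plays the role of the IPC array used later when the run is divided into periods.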

[0028] To explain CPU microarchitecture event groups more intuitively, take a counter count of 6 as an example. A specific CPU microarchitecture event group can then be represented as (cycles, instructions, event1, event2, event3, event4), where event_i (i = 1, 2, 3, 4) correspond to four different CPU microarchitecture events. It follows that, in order to collect information on all CPU microarchitecture events, the program needs to be run repeatedly, a number of times determined by m and n (the expression is missing from the source text); in the background example with 236 events and 6 counters, this is 59 runs.
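The grouping described above can be sketched as follows. This is an illustration under stated assumptions: the helper name `build_event_groups` is hypothetical, the event names are placeholders, and the 236 events are assumed not to include cycles and instructions, which is the reading that reproduces the 59 runs quoted in the background section.

```python
def build_event_groups(events, num_counters, fixed=("cycles", "instructions")):
    """Split events into groups that each fit the available hardware counters.

    Every group starts with the fixed events, so only
    num_counters - len(fixed) counters remain for the events under test.
    """
    free = num_counters - len(fixed)
    if free <= 0:
        raise ValueError("not enough counters for the fixed events")
    return [tuple(fixed) + tuple(events[i:i + free])
            for i in range(0, len(events), free)]


# Placeholder names for 236 predefined events (assumption: fixed events excluded).
events = [f"event{i}" for i in range(1, 237)]
groups = build_event_groups(events, num_counters=6)
```

With one group collected per run, `len(groups)` is the number of runs the one-counter-one-event method needs.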

[0029] During the first run of the program, a traditional "one counter, one event" method is used to collect information on a set of CPU microarchitecture events. After the program finishes running, the program's running status can be analyzed based on the calculated IPC, and the program's running time can be divided into periods based on the fluctuations in IPC.

[0030] For example, periods of volatility can be divided into stable periods and volatile periods, and the specific division method is as follows:

[0031] Based on the program's execution time t and the pre-defined counter interval, IPC data with a collection length of [length missing] is obtained and written as an array, denoted IPC[]. When judging IPC volatility, a volatility threshold δ_thres is assumed (e.g., δ_thres = 5%); at the same time, a step size is introduced, denoted step. The indices of the current starting point are denoted cur and pre.

[0032] Next, the IPC volatility δ within the interval [cur, cur+step] is calculated. If δ is greater than δ_thres, IPC is considered to fluctuate strongly within [cur, cur+step], and cur is updated to cur+step. These steps are repeated until δ is less than δ_thres, at which point IPC is considered to have stabilized within [cur, cur+step]; the interval [pre, cur] is then classified as an IPC fluctuation period, and pre is updated to cur. The IPC stable period is divided in a similar way, which is not elaborated upon here.
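The segmentation procedure above can be sketched in Python as follows. Several details are assumptions rather than part of the original text: the source does not define how volatility is computed, so the sketch uses the relative spread (max - min) / mean of each window; windows are half-open index ranges; and the names `volatility`, `classify`, and `segment_ipc` are hypothetical.

```python
from statistics import mean


def volatility(window):
    """Assumed volatility measure: (max - min) / mean of the window."""
    avg = mean(window)
    return (max(window) - min(window)) / avg if avg else 0.0


def classify(window, thres):
    """Label one window as fluctuating or stable against the threshold."""
    return "fluctuating" if volatility(window) > thres else "stable"


def segment_ipc(ipc, step, thres=0.05):
    """Divide the IPC array into (start, end, kind) periods.

    cur advances in increments of step while consecutive windows keep the
    same classification; when it changes, [pre, cur) is emitted and pre = cur.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    periods = []
    pre = cur = 0
    while cur < len(ipc):
        kind = classify(ipc[cur:cur + step], thres)
        while cur < len(ipc) and classify(ipc[cur:cur + step], thres) == kind:
            cur += step
        cur = min(cur, len(ipc))
        periods.append((pre, cur, kind))
        pre = cur
    return periods


# Made-up IPC samples: flat, then noisy, then flat again.
ipc = [1.0, 1.0, 1.0, 1.0, 1.0, 1.5, 0.8, 1.2, 1.0, 1.0, 1.0, 1.0]
periods = segment_ipc(ipc, step=4)
```

Handling both period types in one loop is a design choice of this sketch; the text describes the fluctuation case and states only that stable periods are divided similarly.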

[0033] Because the program runs continuously and repeats with the same parameters, even with potential fluctuations in machine performance, the program's running state remains highly similar. Therefore, the time period classification results obtained after the first run are also applicable to subsequent repeated executions.

[0034] Next, two alternation frequencies for the CPU microarchitecture event groups are set for subsequent program runs: freq_low (e.g., freq_low = 1/8) and freq_high (e.g., freq_high = 1/4). During a fluctuation period, the program's running state changes significantly, and the CPU microarchitecture events also change considerably within each counter interval, so the higher alternation frequency, freq_high, is selected to retain as much event information as possible. During a stable period, the lower frequency, freq_low, is selected to reduce the system overhead of switching CPU microarchitecture event groups.

[0035] Next, we describe how CPU microarchitecture event information is collected alternately during the subsequent repeated runs of the program. If the program is currently in a fluctuation period, every (1 / freq_high) seconds is taken as the collection duration, length, of the current CPU microarchitecture event group; that is, the event group being collected is swapped every length seconds. Similarly, if the program is currently in a stable period, every (1 / freq_low) seconds is taken as the collection duration length of the current CPU microarchitecture event group. With the example values freq_high = 1/4 and freq_low = 1/8, length is 4 seconds in a fluctuation period and 8 seconds in a stable period. Within each collection duration, a counter collects event information every 0.5 seconds.
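The relationship between alternation frequency, collection duration, and counter readings can be sketched as below. The constant and function names are hypothetical; the numeric values are the examples given in the text (1/4, 1/8, and a 0.5-second counter interval).

```python
COUNTER_INTERVAL_S = 0.5  # counter read-out interval from the text

# Example alternation frequencies from the text, keyed by period type.
ALTERNATION_FREQ = {"fluctuating": 1 / 4, "stable": 1 / 8}


def collection_duration(kind):
    """Seconds one event group stays active before it is swapped: 1 / freq."""
    return 1.0 / ALTERNATION_FREQ[kind]


def samples_per_duration(kind):
    """Number of counter read-outs that fit into one collection duration."""
    return int(collection_duration(kind) / COUNTER_INTERVAL_S)
```

For example, `collection_duration("fluctuating")` is 4 seconds, which covers 8 counter read-outs.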

[0036] For clarity, a specific example is given with reference to Figure 3:

[0037] CPU microarchitecture event group 1: (cycles, instructions, event1, event2, event3, event4);

[0038] CPU Microarchitecture Event Group 2: (cycles, instructions, event5, event6, event7, event8).

[0039] In odd-numbered collection durations, event information is collected for CPU microarchitecture event group 1, using the collection duration that corresponds to the fluctuation type the program is currently in. In even-numbered collection durations, event information is collected for CPU microarchitecture event group 2, again using the collection duration that corresponds to the current fluctuation type. Specifically, CPU microarchitecture event group 1 is collected in time periods t1, t3, t5, ..., t(2n-1), and CPU microarchitecture event group 2 is collected in time periods t2, t4, t6, ..., t(2n). After one program execution, information on 8 CPU microarchitecture events is obtained, so the collection efficiency is twice that of the traditional "one counter per event" method.
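The alternation described above can be illustrated with a small scheduling sketch. It only simulates which group is active in which time slot and does not program any hardware counters. The function name `build_schedule`, the period boundaries, and the group labels are illustrative assumptions; the durations of 8 and 4 seconds follow from the example frequencies 1/8 and 1/4.

```python
def build_schedule(periods, durations, groups):
    """Return (start, end, group) slots that alternate between event groups.

    periods:   list of (start_s, end_s, kind) from the first-run segmentation
    durations: collection duration in seconds for each period kind
    groups:    event-group labels to cycle through
    """
    schedule = []
    slot = 0
    for start, end, kind in periods:
        length = durations[kind]
        if length <= 0:
            raise ValueError("collection duration must be positive")
        t = start
        while t < end:
            slot_end = min(t + length, end)
            # Odd-numbered slots get the first group, even-numbered the second.
            schedule.append((t, slot_end, groups[slot % len(groups)]))
            slot += 1
            t = slot_end
    return schedule


# Made-up period boundaries in seconds.
periods = [(0.0, 16.0, "stable"), (16.0, 24.0, "fluctuating")]
durations = {"stable": 8.0, "fluctuating": 4.0}
schedule = build_schedule(periods, durations, ["group1", "group2"])
```

In this example the stable period yields two 8-second slots and the fluctuation period two 4-second slots, with the groups alternating across all four.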

[0040] It should be noted that the metrics used to characterize program performance can be selected based on different application scenarios, and the segmentation strategy for program execution phases can also be changed according to business needs, such as including more fluctuation-type time periods and setting corresponding multiple microarchitecture event groups. Furthermore, this invention can not only collect CPU microarchitecture event information under the Spark framework, but can also be used in other types of large-scale multi-parameter computing systems, such as blockchain systems and cloud operating systems, where CPU microarchitecture event information can also be collected.

[0041] To further verify the effectiveness of the present invention, simulation experiments were conducted. Using the HiBench program, CPU microarchitecture event information was collected under fixed Spark configuration parameters. The K-means algorithm from HiBench was selected for feasibility testing. The present invention was compared with existing technologies, and the quality of the CPU microarchitecture event information collected by the present invention was verified.

[0042] Figure 4 shows the time K-means takes to collect CPU microarchitecture events. Both the one-counter-one-event approach (labeled "traditional") and this invention (labeled "alternating, frequency-varying") collect information on all predefined CPU microarchitecture events under the same Spark configuration parameters. Because this invention alternately collects two CPU microarchitecture event groups during each program run, it covers twice as many events per run, which improves collection efficiency. Furthermore, this invention adjusts the alternation frequency of the CPU microarchitecture event groups during collection based on changes in program performance, further reducing system performance overhead. The experimental results in Figure 4 show that the existing technology takes about twice as long as the present invention to collect CPU microarchitecture event information.

[0043] Figure 5 shows, for K-means, the distance between the CPU microarchitecture event information collected using the present invention and that collected using the traditional "one counter, one event" method. For over 90% of the events, the distance between the information collected by the two methods is less than 0.1. This indicates that collecting more CPU microarchitecture events per program run does not compromise the quality of the collected information.
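The acceptance criterion above (share of events whose distance is below 0.1) can be expressed as a short check. The source does not define the distance metric, so this sketch takes precomputed per-event distances as input; the function name `fraction_within` and the sample values are made up for illustration and are not experimental data.

```python
def fraction_within(distances, limit=0.1):
    """Share of events whose distance between the two methods is below limit."""
    if not distances:
        raise ValueError("no distances given")
    return sum(1 for d in distances.values() if d < limit) / len(distances)


# Hypothetical per-event distances, for illustration only.
distances = {"event1": 0.02, "event2": 0.05, "event3": 0.30, "event4": 0.08}
share = fraction_within(distances)
```

A result above 0.9 would correspond to the "over 90% of the events" statement in the text.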

[0044] Experimental results show that setting two CPU microarchitecture event groups during the collection process and adjusting the alternation frequency according to the performance status of the program significantly reduces the time required to collect CPU microarchitecture event information. Compared with existing technologies, this invention collects CPU microarchitecture event information of similar quality while approximately doubling the data collection speed.
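The roughly twofold improvement can be checked with simple arithmetic on the number of program runs. This sketch assumes every run takes the same time and ignores the overhead of switching groups; the function name `runs_needed` is hypothetical, and 59 is the number of event groups implied by the background example.

```python
import math


def runs_needed(num_groups, groups_per_run=1):
    """Program runs required when groups_per_run event groups share one run."""
    return math.ceil(num_groups / groups_per_run)


baseline = runs_needed(59)                       # one group per run
alternating = runs_needed(59, groups_per_run=2)  # two groups alternated per run
speedup = baseline / alternating
```

Because 59 is odd, the last run carries a single group, so the idealized ratio is slightly below 2.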

[0045] In summary, compared with the prior art, the present invention has the following main advantages:

[0046] 1) Existing methods for collecting CPU microarchitecture event information require extensive repetitive program execution. This invention significantly improves the efficiency of CPU microarchitecture event information collection by alternating CPU microarchitecture event groups during program execution, thereby ensuring the quality of collected information while reducing the number of repetitive program executions.

[0047] 2) Based on the variation of IPC, the program execution is divided into several stages. When the program execution state is stable, the number of CPU microarchitecture events occurring within the program execution time interval does not change significantly. Therefore, the alternation frequency of CPU microarchitecture event groups can be reduced, thereby reducing the performance overhead in the CPU microarchitecture event collection process.

[0048] 3) This invention provides an efficient data collection method for finding optimal configuration parameters through program analysis at the microarchitecture level. Compared with other existing artificial intelligence methods for finding optimal configurations, adjusting parameters at the microarchitecture level is more interpretable.

[0049] This invention can be a system, method, and / or computer program product. A computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for causing a processor to implement various aspects of the invention.

[0050] Computer-readable storage media can be tangible devices capable of holding and storing instructions for use by an instruction execution device. Computer-readable storage media can be, for example, but are not limited to, electrical storage devices, magnetic storage devices, optical storage devices, electromagnetic storage devices, semiconductor storage devices, or any suitable combination thereof. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove with instructions recorded thereon, and any suitable combination thereof. The computer-readable storage media used herein are not to be construed as transient signals themselves, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber optic cables), or electrical signals transmitted through wires.

[0051] The computer-readable program instructions described herein can be downloaded from computer-readable storage media to various computing / processing devices, or downloaded via a network, such as the Internet, local area network, wide area network, and / or wireless network, to an external computer or external storage device. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and / or edge servers. A network adapter card or network interface in each computing / processing device receives the computer-readable program instructions from the network and forwards them to the computer-readable storage media in the respective computing / processing device.

[0052] The computer program instructions used to perform the operations of this invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++, etc., and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., via the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), is personalized by utilizing state information from the computer-readable program instructions. This electronic circuitry can execute the computer-readable program instructions to implement various aspects of the invention.

[0053] Various aspects of the present invention are described herein with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer-readable program instructions.

[0054] These computer-readable program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that, when executed by the processor of the computer or other programmable data processing apparatus, they create means for implementing the functions / actions specified in one or more blocks of the flowchart and / or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium that causes a computer, programmable data processing apparatus, and / or other device to operate in a particular manner; thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions / actions specified in one or more blocks of the flowchart and / or block diagram.

[0055] Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, thereby causing the instructions executed on the computer, other programmable data processing apparatus, or other device to perform the functions / actions specified in one or more boxes of a flowchart and / or block diagram.

[0056] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of an instruction containing one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions marked in the blocks may occur in a different order than those marked in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions. It will be known to those skilled in the art that implementation in hardware, implementation in software, and implementation using a combination of software and hardware are equivalent.

[0057] The various embodiments of the present invention have been described above. These descriptions are exemplary and not exhaustive, and are not limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles, practical application, or technical improvements to the embodiments in the market, or to enable others skilled in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims

1. A method for collecting microarchitecture event information, comprising the following steps:
when the program is run for the first time, setting a counter to collect a microarchitecture event and recording the program's running time;
dividing the program run into multiple fluctuation-type periods based on the performance fluctuations during the first program run;
during subsequent program runs, setting a corresponding microarchitecture event group and collection frequency for each fluctuation-type period, and collecting microarchitecture event information with the corresponding microarchitecture event groups in an alternating, frequency-varying manner for the different fluctuation-type periods;
wherein the fluctuation-type periods include a stable period and a fluctuation period, and are divided according to the number of instructions executed per cycle, expressed as IPC = instructions / cycles, where instructions denotes the number of instructions and cycles denotes the number of cycles;
and wherein collecting microarchitecture event information with the corresponding microarchitecture event groups in an alternating, frequency-varying manner for the different fluctuation-type periods includes:
if the program is currently in a fluctuation period, setting every (1 / freq_high) seconds as the collection duration length of the corresponding first microarchitecture event group;
if the program is currently in a stable period, setting every (1 / freq_low) seconds as the collection duration length of the corresponding second microarchitecture event group, where freq_low is less than freq_high;
within each collection duration length, collecting microarchitecture event information once every counter interval.

2. The method according to claim 1, wherein the fluctuation period is determined according to the following steps: based on the execution time t of the program's first run and the counter interval, the number of instructions executed per cycle is recorded with a collection length of [length missing] and written as an array IPC[]; when assessing performance fluctuations, a volatility threshold δ_thres and a step size step are set, and the indices of the current starting point are recorded as cur and pre; the IPC volatility δ within the interval [cur, cur+step] is calculated; if δ is greater than the threshold δ_thres, IPC is considered to fluctuate strongly within [cur, cur+step] and cur is updated to cur+step, until δ is less than the threshold δ_thres, whereupon the interval [pre, cur] is classified as the fluctuation period.

3. The method according to claim 1, wherein the counter interval is set to 0.5 seconds.

4. A computer-readable storage medium having a computer program stored thereon, wherein, when the program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 3.

5. A computer device comprising a memory and a processor, wherein a computer program capable of running on the processor is stored in the memory, characterized in that, when the processor executes the program, it implements the steps of the method according to any one of claims 1 to 3.