Program segment classification method and device, electronic equipment and readable storage medium

By combining the program fragment's own feature information and instruction context information to generate composite vectors for clustering, the problem of inaccurate program fragment classification in the prior art is solved, and the classification accuracy and microprocessor performance testing efficiency are improved.

CN117093926B · Active · Publication Date: 2026-06-16 · LOONGSON TECH CORP

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: LOONGSON TECH CORP
Filing Date: 2023-07-19
Publication Date: 2026-06-16

AI Technical Summary

Technical Problem

In existing technologies, program segments are classified based solely on their own information. This makes it difficult to accurately classify program segments in complex test programs, resulting in poor classification performance and low accuracy.

Method used

By acquiring the self-feature information and instruction context information of program segments, composite vectors are generated for clustering, which uncovers the deep connections between different basic blocks or program segments and improves classification accuracy.

Benefits of technology

It improves the accuracy of program fragment classification and the efficiency of microprocessor performance testing, while reducing the consumption of computing resources.

Generated by Eureka AI based on patent content.


Abstract

Embodiments of the present application provide a program segment classification method and device, electronic equipment and storage medium, which are applied to the field of computers, and the method comprises the following steps: dividing a target program into at least two program segments; obtaining the self feature information of the program segments and the instruction context information corresponding to the program segments; clustering the program segments according to the self feature information of the program segments and the instruction context information, and obtaining a classification result. Through the above method, the deep relationship between different basic blocks or program segments can be mined, and the clustering effect and the accuracy of classification for different program segments can be improved.

Description

Technical Field

[0001] This invention relates to the field of computers, and in particular to a method, apparatus, electronic device, and readable storage medium for classifying program fragments.

Background Technology

[0002] Microprocessor performance is an important factor in measuring computer performance. In the process of analyzing microprocessor performance, it is usually necessary to divide the various program segments in the test program into different categories, and then select program segments of different categories from the test program for use in microprocessor performance testing.

[0003] In related technologies, the characteristic information of a program segment, such as its stored procedures and register usage, is usually used to measure and compare different parts of the test program execution. Then, clustering methods are used to measure the similarity between program segments, and program segments with higher similarity are grouped into one category, thus obtaining the classification results of each program segment in the test program.

[0004] However, relying solely on information from the program fragments themselves to measure the different parts of a test program's execution results in poor classification performance and low accuracy when classifying program fragments in test programs with complex behaviors.

Summary of the Invention

[0005] In view of the above problems, embodiments of the present invention propose a method, apparatus, electronic device, and readable storage medium for classifying program fragments, which saves computer resources and improves the development and testing efficiency of the system.

[0006] To address the above problems, embodiments of the present invention disclose a method for classifying program fragments, the method comprising:

[0007] Divide the target program into at least two program segments;

[0008] Obtain the program segment's own characteristic information and the instruction context information corresponding to the program segment;

[0009] Based on the program fragment's own characteristic information and the instruction context information, the program fragment is clustered to obtain a classification result.

[0010] Optionally, the program segments are clustered based on their own feature information and the instruction context information to obtain classification results, including:

[0011] For each program segment, a first composite vector of the program segment is generated based on the segment's own feature information and the instruction context information.

[0012] Clustering is performed on the at least two program segments based on the first composite vector to obtain the classification result.

[0013] Optionally, the instruction context information includes the first adjacency distance corresponding to the program segment;

[0014] For each program segment, generating a first composite vector for that program segment based on its own feature information and the instruction context information includes:

[0015] Based on the first adjacency distance and the self-feature information, a first composite vector is generated corresponding to each program segment.

[0016] Optionally, obtaining the instruction context information corresponding to the program fragment includes:

[0017] For each program segment, the target vector of the program segment is obtained based on its own feature information;

[0018] If the program segment is the first program segment of the target program, then the warm-up information during the execution of the first program segment is obtained; the warm-up information is used to reflect whether a warm-up process exists during the execution of the first program segment.

[0019] The first adjacency distance corresponding to the first program segment is determined based on the warm-up information;

[0020] If the program fragment is not the first program fragment of the target program, then the distance between the target vector corresponding to the program fragment and the target vector corresponding to the previous program fragment is calculated to obtain the first adjacency distance of the program fragment.

[0021] Optionally, the self-feature information includes the self-attribute information of each basic block in the program segment; the step of generating a first composite vector corresponding to each program segment based on the first adjacency distance and the self-feature information includes:

[0022] The first adjacency distance is weighted to obtain the second adjacency distance;

[0023] For each program segment, the self-attribute information of a basic block and the second adjacency distance corresponding to the program segment are respectively used as information of a single dimension of the vector. Based on the self-attribute information and the second adjacency distance, a first composite vector corresponding to each program segment is generated.

[0024] Optionally, determining the first adjacency distance corresponding to the first program segment based on the warm-up information includes:

[0025] If there is no warm-up process during the execution of the first program segment, then calculate the distance between the target vector corresponding to the first program segment and the all-zero vector to obtain the first adjacency distance corresponding to the program segment;

[0026] If a warm-up process exists during the execution of the first program segment, then the first adjacency distance corresponding to the program segment is determined to be zero.

[0027] Optionally, obtaining the target vector of each program segment based on its own feature information includes:

[0028] For each program segment, a first vector is generated based on the segment's own feature information;

[0029] The first vector is subjected to dimensionality reduction processing to obtain the target vector; the dimension of the target vector is smaller than that of the first vector.

[0030] Optionally, the instruction context information includes: the number of cache lines accessed by each basic block in the program segment and the program counter jump information corresponding to each basic block;

[0031] For each program segment, generating a first composite vector for that program segment based on its own feature information and the instruction context information includes:

[0032] Based on the self-feature information, the cache line number information, and the program counter jump information, a first composite vector corresponding to each program segment is generated.

[0033] Optionally, the self-feature information includes the self-attribute information of each basic block in the program segment; the step of generating a first composite vector corresponding to each program segment based on the self-feature information, the cache line number information, and the program counter jump information includes:

[0034] For each program segment, the self-attribute information of a basic block, the cache line count information of a basic block, and the program counter jump information corresponding to a basic block are respectively used as information of a single dimension of a vector. Based on the self-attribute information, the cache line count information, and the program counter jump information, a second composite vector corresponding to each program segment is generated.

[0035] The second composite vector is subjected to dimensionality reduction processing to obtain the first composite vector corresponding to each program segment.

[0036] Optionally, the second composite vector includes: multiple vector sub-blocks; each vector sub-block is used to record the execution frequency information, cache line number information and program counter jump information of the basic block, and the same basic block is represented at the same position in each vector sub-block.
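The sub-block layout described in [0036] can be read as a concatenation of three equal-length sub-blocks, one per kind of information. The following is a minimal sketch of that reading only; the function name, the ordering of the sub-blocks, and the sample numbers are assumptions made for illustration and are not taken from the patent.

```python
from typing import List, Sequence


def build_second_composite_vector(
    exec_freq: Sequence[float],
    cache_lines: Sequence[float],
    pc_jumps: Sequence[float],
) -> List[float]:
    """Concatenate three sub-blocks of length n (one entry per basic block).

    Basic block j sits at offset j inside every sub-block, i.e. at indices
    j, n + j and 2n + j of the result (assumed sub-block order).
    """
    n = len(exec_freq)
    if len(cache_lines) != n or len(pc_jumps) != n:
        raise ValueError("all sub-blocks must cover the same basic blocks")
    return [float(x) for x in (*exec_freq, *cache_lines, *pc_jumps)]


# Hypothetical segment with two basic blocks.
vec = build_second_composite_vector([10, 5], [3, 1], [2, 0])
print(vec)
```

Keeping each basic block at the same offset in every sub-block means a later dimensionality reduction step sees a fixed layout across all program segments.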

[0037] On the other hand, embodiments of the present invention disclose a program fragment classification device, the device comprising:

[0038] The program segmentation module is used to divide the target program into at least two program segments;

[0039] The information acquisition module acquires the program segment's own characteristic information and the instruction context information corresponding to the program segment.

[0040] The clustering module is used to cluster the program fragments based on their own feature information and the instruction context information to obtain classification results.

[0041] In another aspect, embodiments of the present invention also disclose an electronic device, the electronic device including a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors to perform the aforementioned method for classifying program segments.

[0042] This invention also discloses a readable storage medium, which, when the instructions in the storage medium are executed by the processor of an electronic device, enables the electronic device to execute the aforementioned method for classifying program segments.

[0043] The method, apparatus, electronic device, and storage medium for classifying program fragments provided in this invention have the following advantages:

[0044] This invention provides a method for classifying program fragments. After dividing a target program into at least two program fragments, the method obtains the intrinsic feature information of each program fragment and the corresponding instruction context information. Then, based on the intrinsic feature information and the instruction context information, the program fragments are clustered to obtain a classification result. Compared with clustering using only the intrinsic feature information, clustering program fragments based on both their intrinsic feature information and the corresponding instruction context information helps uncover deeper connections between different basic blocks or program fragments, thus improving the clustering effect and classification accuracy for different program fragments.

Attached Figure Description

[0045] Figure 1 is a flowchart illustrating the steps of an embodiment of a method for classifying program fragments according to the present invention;

[0046] Figure 2 is a structural block diagram of an embodiment of a program fragment classification device according to the present invention;

[0047] Figure 3 is a structural block diagram of an electronic device provided by an example of the present invention.

Detailed Implementation

[0048] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0049] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0050] Method Implementation Examples

[0051] Referring to Figure 1, a flowchart of an embodiment of a method for classifying program fragments according to the present invention is shown, which may specifically include the following steps:

[0052] Step 101: Divide the target program into at least two program segments;

[0053] Step 102: Obtain the self-feature information of the program segment and the instruction context information corresponding to the program segment;

[0054] Step 103: Based on the program segment's own feature information and the instruction context information, cluster the program segment to obtain the classification result.

[0055] Here, the target program refers to a program used for microprocessor performance testing, and a program segment refers to an instruction set obtained by dividing the target program according to a preset number of instructions. For example, if the target program includes 1000 instructions, and 100 instructions are used as a standard, 10 program segments can be divided from the target program. Of course, in the embodiments of this application, the program can also be divided according to different execution functions, dividing the target program into program segments that execute different functions, with one program segment executing one function. This application does not limit this.
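The fixed-size division in the example above (1000 instructions, 100 per segment) can be sketched as follows. This is an illustration only: the function name and the placeholder trace are hypothetical, and a trailing segment shorter than the preset size is simply kept as is, which the text does not specify.

```python
from typing import List, Sequence


def split_into_segments(instructions: Sequence, segment_size: int) -> List[list]:
    """Split an instruction sequence into consecutive segments of at most segment_size."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    return [
        list(instructions[start:start + segment_size])
        for start in range(0, len(instructions), segment_size)
    ]


# Hypothetical trace of 1000 placeholder instructions.
trace = [f"insn_{k}" for k in range(1000)]
segments = split_into_segments(trace, 100)
print(len(segments), len(segments[0]))
```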

[0056] The intrinsic characteristic information of a program segment refers to its own attribute information, including the attribute information reflected by the program segment as a whole and the intrinsic attribute information of each basic block within the program segment. Examples include the function performed by the program segment, the execution frequency of basic blocks within their respective program segments, stored procedures, register usage, instruction combinations, opcodes, and memory access patterns. Furthermore, the intrinsic characteristic information of a program segment does not include the context information between adjacent basic blocks. A basic block refers to a sequentially executed sequence of statements in a program; one or more basic blocks can constitute a program segment.

[0057] Specifically, the characteristic information of a program segment itself can be obtained from the execution logic of the instructions in the program segment, such as instruction combinations, memory access patterns, and other characteristic information; it can also be obtained from the information recorded in the memory such as registers and caches during the execution of the target program, such as the execution frequency of basic blocks in a certain program segment, stored procedures, and other characteristic information.

[0058] Instruction context information refers to the context information between adjacent program segments or adjacent basic blocks in a program. For example, the Euclidean distance between two adjacent program segments calculated based on their own characteristic information, the execution order between two adjacent basic blocks, and the different performance of adjacent basic blocks during execution.

[0059] Accordingly, the instruction context information corresponding to the program segment in step 102 may include the context information between the program segment and adjacent program segments, and the context information between adjacent basic blocks in the program segment; adjacent program segments refer to two program segments that are adjacent to each other, that is, adjacent program segments can be a program segment and another program segment adjacent to the position before the program segment, or a program segment and another program segment adjacent to the position after the program segment; adjacent basic blocks refer to two basic blocks that are adjacent to each other, that is, adjacent basic blocks can be a basic block and another basic block adjacent to the position before the basic block, or a basic block and another basic block adjacent to the position after the basic block.

[0060] Of course, if we want to obtain the context information of a program segment and its preceding segment, and this program segment is the first segment of the target program (meaning the first segment has no preceding segment in the target program), then the instruction context information corresponding to the first segment can be set to empty, or a value can be calculated based on the segment's own characteristic information, and this value can be used as the instruction context information corresponding to the first segment. Similarly, the same applies to obtaining the context information of a program segment and its following segment.

[0061] Specifically, the instruction context information corresponding to a program segment can be obtained during the execution of the target program based on information recorded in registers, caches, and other memories. This includes, for example, the number of cache line accesses for basic blocks within the program segment, and the program counter jump address for each basic block. Furthermore, the instruction context information corresponding to a program segment can also be further calculated from the segment's own characteristic information. For instance, the information entropy of two adjacent program segments can be calculated to measure their complexity; or the difference between two adjacent program segments can be measured by the distance between the execution frequencies of their basic blocks in Euclidean space.
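As one possible illustration of the information entropy mentioned above, the sketch below computes the Shannon entropy of a segment's basic-block execution counts. Treating the execution counts as the probability distribution is an assumption made here; the text does not say which quantity the entropy is computed over, and the function name is hypothetical.

```python
import math
from typing import Sequence


def shannon_entropy(exec_counts: Sequence[int]) -> float:
    """Shannon entropy (in bits) of the basic-block execution distribution."""
    total = sum(exec_counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in exec_counts if c > 0]
    return sum(p * math.log2(1 / p) for p in probs)


# Hypothetical execution counts for two adjacent segments.
entropy_a = shannon_entropy([50, 50])   # execution spread evenly over two blocks
entropy_b = shannon_entropy([100, 0])   # execution concentrated in one block
print(entropy_a, entropy_b)
```

A segment whose execution is spread over many basic blocks yields a higher value than one dominated by a single block, so the pair of values gives a rough complexity comparison between adjacent segments.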

[0062] Clustering program segments based on their own feature information and the instruction context information corresponding to the program segments is more effective than clustering only using the program segments' own feature information. This helps to uncover deeper connections between different basic blocks or program segments, and improves the clustering effect and classification accuracy for different program segments.

[0063] Especially in complex programming environments, relying solely on the inherent characteristics of program fragments is insufficient to accurately deduce the relationships between different program modules. This can easily lead to overly coarse classification of program fragments; for example, program fragments with similar code structures but significantly different performance and functionality might be grouped into the same category, greatly reducing the accuracy of fragment classification. Clustering based on the inherent characteristics of program fragments and their corresponding instruction context information can uncover deeper connections between different basic blocks or program fragments, improving the granularity of fragment classification and thus increasing classification accuracy.

[0064] Furthermore, after obtaining the classification results of program segments in the target program, one or a few representative program segments can be selected from each category for microprocessor performance testing. Using representative program segments corresponding to all categories in the target program to perform microprocessor performance testing is equivalent to using the entire target program to perform microprocessor performance testing, but without executing as many instructions, thus improving the efficiency of microprocessor performance testing. Simultaneously, by improving the accuracy of program segment classification using the above method, it also helps to improve the accuracy of microprocessor performance testing.
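The text does not state how a representative program segment is chosen within a category. A common choice, assumed here purely for illustration, is the segment whose vector lies closest to the cluster centroid. The function name and sample data are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def pick_representatives(
    vectors: List[Sequence[float]], labels: List[int]
) -> Dict[int, int]:
    """For each cluster label, return the index of the segment nearest the centroid."""
    reps: Dict[int, int] = {}
    for label in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == label]
        # Centroid = per-dimension mean of the member vectors.
        centroid = [
            sum(col) / len(members)
            for col in zip(*(vectors[i] for i in members))
        ]
        reps[label] = min(members, key=lambda i: math.dist(vectors[i], centroid))
    return reps


# Hypothetical 2-D vectors for five segments already grouped into two categories.
vectors = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 5.0], [12.0, 5.0]]
labels = [0, 0, 0, 1, 1]
reps = pick_representatives(vectors, labels)
print(reps)
```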

[0065] Optionally, step 103, which involves clustering the program segment based on its own feature information and the instruction context information to obtain a classification result, includes:

[0066] Step S11: For each program segment, generate a first composite vector of the program segment based on the segment's own feature information and the instruction context information;

[0067] Step S12: Cluster the at least two program segments based on the first composite vector to obtain the classification result.

[0068] The first composite vector, also known as the composite code signature vector (CCSV), is used to record the self-characteristic information and corresponding instruction context information of a single program segment. One program segment corresponds to one first composite vector.

[0069] In this embodiment of the invention, the self-feature information of the program segment and the corresponding instruction context information are integrated into a vector to obtain a first composite vector. This facilitates the measurement of the differences between different program segments based on the distance between the different first composite vectors in Euclidean space. Then, cluster analysis of the program segments is performed based on the distance, and program segments with smaller differences are classified into one category to obtain the classification result of the program segments.

[0070] Specifically, K program fragments can be randomly selected as initial cluster centers. The distance between each program fragment and each cluster center (CCSV) is calculated, and each program fragment is then assigned to the cluster center with the closest CCSV distance. Each cluster center and the program fragment assigned to it represent a cluster. During the clustering process, after each program fragment is assigned, the cluster centers are recalculated based on the existing set of program fragments in the cluster. This process is repeated until a termination condition is met. The termination condition can be at least one of the following: no (or a minimum number) program fragments are reassigned to different clusters; no (or a minimum number) cluster centers change; or the sum of squared errors reaches a local minimum. To further improve the clustering effect and efficiency, a maximum value of K—maxK—can be preset, and a clustering scheme with sufficiently small K that also meets the preset requirements can be used for cluster analysis.
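The procedure above resembles k-means clustering. The sketch below is a plain-Python illustration under stated assumptions: it uses Euclidean distance, recomputes centers once per full assignment pass (rather than after every single assignment), and stops when no segment changes cluster. The maxK selection step is not shown. Function names, the seed parameter, and the sample CCSVs are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def kmeans(
    vectors: List[Sequence[float]], k: int, max_iter: int = 100, seed: int = 0
) -> Tuple[List[int], List[List[float]]]:
    """Cluster vectors into k groups; returns (labels, centers)."""
    rng = random.Random(seed)
    # Randomly pick k segments as the initial cluster centers.
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = [-1] * len(vectors)
    for _ in range(max_iter):
        # Assign every segment to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(v, centers[c])) for v in vectors
        ]
        if new_labels == labels:  # termination: no segment was reassigned
            break
        labels = new_labels
        # Recompute each center as the mean of its current members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical CCSVs for four program segments.
ccsvs = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.1], [0.0, 1.0, 0.5], [0.1, 0.9, 0.5]]
labels, centers = kmeans(ccsvs, k=2)
print(labels)
```

The numeric label values depend on which segments are drawn as initial centers, so only the grouping (which segments share a label) is meaningful.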

[0071] Optionally, the instruction context information includes the first adjacency distance corresponding to the program segment;

[0072] Step S11, which involves generating a first composite vector for each program segment based on its own feature information and the instruction context information, includes:

[0073] Step S21: Based on the first adjacency distance and the self-feature information, generate a first composite vector corresponding to each program segment.

[0074] The first adjacency distance, as the name suggests, refers to the distance between adjacent program segments. In essence, it is a spatial mapping of the differences between two adjacent program segments. It is used to measure the degree of difference in their own feature information between the two program segments and belongs to the context information between program segments.

[0075] Of course, the first adjacency distance refers to the distance between a program segment and its preceding program segment. If the program segment is the first program segment in the target program, meaning the first program segment has no preceding program segment in the target program, then the first adjacency distance corresponding to the first program segment can be determined to be zero. Alternatively, a value can be calculated based on the program segment's own characteristic information, and this value can be used as the first adjacency distance corresponding to the first program segment. Similarly, the distance between a program segment and its next program segment is also determined accordingly.

[0076] Based on the first adjacency distance and the self-feature information, a first composite vector is generated for each program segment. Cluster analysis is then performed on these vectors, which helps to uncover the deep differences between adjacent program segments and improves the clustering effect of program segments.

[0077] Optionally, when the instruction context information is the first adjacency distance corresponding to the program segment, step 102, obtaining the instruction context information corresponding to the program segment, includes:

[0078] Step S31: For each program segment, obtain the target vector of the program segment based on its own feature information;

[0079] Step S32: If the program segment is the first program segment of the target program, then obtain the warm-up information during the execution of the first program segment; the warm-up information is used to reflect whether a warm-up process exists during the execution of the first program segment.

[0080] Step S33: Determine the first adjacency distance corresponding to the first program segment based on the warm-up information;

[0081] Step S34: If the program segment is not the first program segment of the target program, calculate the distance between the target vector corresponding to the program segment and the target vector corresponding to the previous program segment of the program segment to obtain the first adjacency distance of the program segment.

[0082] Understandably, when a computing device receives the first program request, the response to the first request will be very slow because some cached data has not yet been loaded into the cache. In order to speed up the response process for the first request, a specific strategy is needed to handle the preloading of cached data for the first request, so as to ensure a fast response for the first request. This process is called the warmup process.

[0083] Here, the target vector is a vector that records the self-feature information of a program segment; each program segment has one target vector. When the self-feature information is the information reflected by the program segment as a whole, the target vector corresponding to a program segment is 1-dimensional. When the self-feature information is the self-attribute information of each basic block in the program segment, one item of self-attribute information of one basic block can be used as a single dimension of the vector, and the target vector is constructed from the obtained self-attribute information. For example, if program segment A includes n basic blocks and only one item of self-attribute information B is obtained, the target vector has n dimensions, which record the self-attribute information B of the n basic blocks respectively. When multiple items of self-attribute information are obtained, the target vector expands accordingly.
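A minimal sketch of the n-dimensional case, using execution count as the single self-attribute. It assumes a fixed, shared ordering of basic-block identifiers so that vectors from different segments are comparable, and it fills in zero for blocks that a segment never executes. Both choices, along with the names used, are assumptions for illustration.

```python
from typing import Dict, List, Sequence


def build_target_vector(
    block_ids: Sequence[str], exec_counts: Dict[str, int]
) -> List[float]:
    """One dimension per basic block, holding that block's execution count."""
    return [float(exec_counts.get(block, 0)) for block in block_ids]


# Hypothetical basic-block ordering and per-segment execution counts.
block_ids = ["bb0", "bb1", "bb2"]
target = build_target_vector(block_ids, {"bb0": 40, "bb2": 60})
print(target)
```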

[0084] In this embodiment of the invention, the first adjacency distance refers to the distance between the target vector corresponding to the program segment and the target vector corresponding to the previous program segment. Because the first program segment has no previous program segment, its first adjacency distance needs special handling. Furthermore, because a warm-up process during the execution of the first program segment may affect its performance, the first adjacency distance corresponding to the first program segment is processed according to the specific warm-up situation, so that the warm-up process of the computing device does not distort the feature information obtained for the first program segment.

[0085] If the program fragment is not the first program fragment of the target program, the first adjacency distance of the program fragment can be calculated as the distance between the target vector corresponding to the program fragment and the target vector corresponding to the previous program fragment. The first adjacency distance includes, but is not limited to, Euclidean distance, Chebyshev distance, and Mahalanobis distance.
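The adjacency distance for segments other than the first can be sketched as below, with Euclidean and Chebyshev distance as interchangeable metrics. Mahalanobis distance is omitted because it additionally needs a covariance estimate. The function names and sample vectors are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Metric = Callable[[Sequence[float], Sequence[float]], float]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def chebyshev(a: Sequence[float], b: Sequence[float]) -> float:
    return max(abs(x - y) for x, y in zip(a, b))


def adjacency_distances(
    target_vectors: List[Sequence[float]], metric: Metric = euclidean
) -> List[float]:
    """Distance of each segment (from the second onward) to its previous segment.

    Entry k of the result belongs to segment k + 1; the first segment is
    handled separately and is not included here.
    """
    return [
        metric(cur, prev)
        for prev, cur in zip(target_vectors, target_vectors[1:])
    ]


# Hypothetical target vectors of three consecutive segments.
targets = [[0.0, 0.0], [3.0, 4.0], [3.0, 1.0]]
print(adjacency_distances(targets))
print(adjacency_distances(targets, chebyshev))
```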

[0086] Optionally, the self-feature information includes the self-attribute information of each basic block in the program segment; step S21, generating a first composite vector corresponding to each program segment based on the first adjacency distance and the self-feature information, includes:

[0087] Step S41: Weight the first adjacency distance to obtain the second adjacency distance;

[0088] Step S42: For each program segment, the self-attribute information of a basic block and the second adjacency distance corresponding to the program segment are respectively used as information of a single dimension of the vector. Based on the self-attribute information and the second adjacency distance, a first composite vector corresponding to each program segment is generated.

[0089] The self-characteristic information includes the self-attribute information of each basic block in the program segment. This means that the self-characteristic information of a single program segment includes the self-attribute information of each basic block that constitutes the program segment, such as the execution frequency and the function performed by the basic block in the program segment.

[0090] To avoid the first adjacency distance being too large, which could negatively impact clustering performance (e.g., program segments that should be clustered might be misclassified), a weighted approach can be applied to the first adjacency distance. This involves multiplying the first adjacency distance by a weight parameter w, where 0.05 ≤ w ≤ 0.5. For example, w could be 0.25. Of course, the specific range and value of the weight parameter can be adjusted based on the clustering results, and this invention does not impose any limitations on this.

[0091] Specifically, taking the execution frequency of each basic block in the program segment as an example, the weighted second adjacency distance can be expressed as:

[0092]

[0093] Where d_i represents the first adjacency distance corresponding to the i-th program segment, b_{i,j} represents the execution frequency of the j-th basic block in the i-th program segment, warmup indicates the presence of a warm-up process during the execution of the first program segment, and nowarmup indicates the absence of a warm-up process during the execution of the first program segment.
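The formula of paragraph [0092] is not reproduced in this text. One form consistent with the surrounding definitions is sketched below, assuming Euclidean distance over n basic blocks. Here w is the weight parameter, d_i the first adjacency distance of the i-th program segment, and b_{i,j} the execution frequency of the j-th basic block in the i-th program segment. The symbol \tilde{d}_i for the second adjacency distance is a hypothetical notation, and the original formula may differ.

```latex
\tilde{d}_i = w \cdot d_i, \qquad
d_i =
\begin{cases}
0, & i = 1,\ \text{warmup} \\[4pt]
\sqrt{\sum_{j=1}^{n} b_{1,j}^{2}}, & i = 1,\ \text{nowarmup} \\[4pt]
\sqrt{\sum_{j=1}^{n} \left( b_{i,j} - b_{i-1,j} \right)^{2}}, & i > 1
\end{cases}
```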

[0094] After the second adjacency distance is obtained, the first composite vector can be constructed. Since the feature information of a program segment can be the attribute information of the basic blocks it contains, the attribute information of each basic block is used as one dimension of the vector, and the second adjacency distance corresponding to the program segment is used as one further dimension. For example, if program segment A consists of n basic blocks, then the first composite vector corresponding to A has n+1 dimensions, where the first n dimensions record the attribute information of the n basic blocks, and the (n+1)-th dimension records the second adjacency distance corresponding to A.
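Steps S41 and S42 can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: the function name `build_first_composite_vector`, the use of execution frequencies as the per-block attribute, and the sample values are assumptions; the default w = 0.25 is the example value given in [0090].

```python
from typing import List, Sequence


def build_first_composite_vector(
    block_attrs: Sequence[float],
    first_adjacency_distance: float,
    w: float = 0.25,
) -> List[float]:
    """Return an (n+1)-dimensional vector: n block attributes plus the weighted distance."""
    # Step S41: weight the first adjacency distance to get the second adjacency distance.
    second_adjacency_distance = w * first_adjacency_distance
    # Step S42: one dimension per basic block, then one extra dimension for the distance.
    return [float(a) for a in block_attrs] + [second_adjacency_distance]


# Hypothetical segment with n = 3 basic blocks (execution frequencies) and distance 8.0.
ccsv = build_first_composite_vector([12, 3, 7], first_adjacency_distance=8.0)
print(ccsv)  # [12.0, 3.0, 7.0, 2.0]
```

Because the distance occupies a single dimension next to n attribute dimensions, the weight w controls how strongly adjacency influences the later clustering relative to the block attributes.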

[0095] Optionally, step S33, determining the first adjacency distance corresponding to the first program segment based on the preheating information, includes:

[0096] Step S51: If there is no warm-up process during the execution of the first program segment, calculate the distance between the target vector corresponding to the first program segment and the all-zero vector to obtain the first adjacency distance corresponding to the first program segment.

[0097] Step S52: If a warm-up process exists during the execution of the first program segment, then the first adjacency distance corresponding to the program segment is determined to be zero.

[0098] A zero vector is a vector whose magnitude is zero, meaning that the value in each dimension is zero.

[0099] Specifically, whether a warm-up (preheating) process exists during the execution of the first program segment can be determined based on the CPU temperature information of the computing device executing the target program. If a warm-up process exists during the execution of the first program segment, the first adjacency distance corresponding to the first program segment can be set to zero to eliminate the influence of the warm-up process on the characteristic information of the first program segment itself. If no warm-up process exists during the execution of the first program segment, the distance between the target vector corresponding to the first program segment and the all-zero vector can be calculated and used as the first adjacency distance corresponding to the first program segment. The distance used for the first adjacency distance includes, but is not limited to, the Euclidean distance, the Chebyshev distance, and the Mahalanobis distance.
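The rules for the first adjacency distance can be sketched as follows. The sketch assumes the Euclidean distance (one of the options listed above) and assumes that a boolean `has_warmup` is supplied by the caller; how that flag is derived is not shown. The function names and sample vectors are hypothetical.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def first_adjacency_distances(
    target_vectors: Sequence[Sequence[float]], has_warmup: bool
) -> List[float]:
    """Compute one first adjacency distance per program segment, in execution order."""
    distances = []
    for i, vec in enumerate(target_vectors):
        if i == 0 and has_warmup:
            # Step S52: warm-up present, so the first segment's distance is zero.
            distances.append(0.0)
        elif i == 0:
            # Step S51: no warm-up, so measure against the all-zero vector.
            distances.append(euclidean(vec, [0.0] * len(vec)))
        else:
            # Later segments: distance to the previous segment's target vector.
            distances.append(euclidean(vec, target_vectors[i - 1]))
    return distances


vectors = [[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]]
print(first_adjacency_distances(vectors, has_warmup=False))  # [5.0, 5.0, 10.0]
print(first_adjacency_distances(vectors, has_warmup=True))   # [0.0, 5.0, 10.0]
```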

[0100] Optionally, step S31, which involves obtaining the target vector of each program segment based on its own feature information, includes:

[0101] Step S61: For each program segment, generate a first vector corresponding to each program segment based on the segment's own feature information;

[0102] Step S62: Perform dimensionality reduction processing on the first vector to obtain the target vector; the dimension of the target vector is smaller than that of the first vector.

[0103] The intrinsic characteristics of a program segment can be described by the attribute information of each basic block within the segment, such as the execution frequency and execution duration of each basic block within its segment. Therefore, the first vector can be divided into different dimensions based on different basic blocks. For example, if program segment A includes n basic blocks, then the first vector is an n-dimensional vector, with each dimension corresponding to the attribute information of one basic block.
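A first vector built from execution frequencies might look like the sketch below. It assumes that the basic blocks are numbered 0 to n-1 and that a segment is available as the list of basic-block IDs executed within it; both the representation and the name `first_vector` are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence


def first_vector(executed_block_ids: Sequence[int], n_blocks: int) -> List[float]:
    """Return an n-dimensional vector whose j-th entry is the execution count of block j."""
    counts = Counter(executed_block_ids)
    # Blocks that never execute in this segment keep a frequency of zero.
    return [float(counts.get(block_id, 0)) for block_id in range(n_blocks)]


# Hypothetical segment in which block 2 executes three times and block 3 never executes.
print(first_vector([0, 2, 2, 1, 2], n_blocks=4))  # [1.0, 1.0, 3.0, 0.0]
```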

[0104] Of course, when the dimension of the first vector is too high, in order to reduce the computational complexity and improve the clustering efficiency, dimensionality reduction processing can be performed on the first vectors corresponding to each program segment, such as principal component analysis, random mapping, etc. The present invention does not limit this.

[0105] Specifically, in order to improve the efficiency, the first vectors corresponding to each program segment can be formed into a matrix, and then dimensionality reduction processing based on the matrix is performed.

[0106] Taking the execution frequency of each basic block in the program segment as the self-characteristic information, after the first vectors corresponding to the program segments are obtained, assume that there are m program segments and that the original first vectors are all n-dimensional, so that an m×n matrix X is formed. To reduce X to an m×n' matrix X', where n' < n, an n×n' matrix P is introduced, each element of which is randomly selected from the interval [-1,1]. Then:

[0107] X' = X×P, (2)

[0108] Here, n' is the dimension of the first vector after dimensionality reduction, that is, the dimension of the target vector.
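The random mapping X' = X×P can be sketched in plain Python as follows. The text only states that the elements of P are randomly selected from [-1,1]; sampling them uniformly, the fixed seed, and the function name `random_projection` are assumptions of this sketch.

```python
import random
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def random_projection(
    X: Sequence[Sequence[float]], n_reduced: int, seed: int = 0
) -> Tuple[Matrix, Matrix]:
    """Project an m x n matrix X onto n_reduced dimensions; return (X', P)."""
    n = len(X[0])
    if not 0 < n_reduced < n:
        raise ValueError("n' must satisfy 0 < n' < n")
    rng = random.Random(seed)
    # P is n x n', with every element drawn from the interval [-1, 1].
    P = [[rng.uniform(-1.0, 1.0) for _ in range(n_reduced)] for _ in range(n)]
    # X' = X x P, computed row by row.
    X_reduced = [
        [sum(row[k] * P[k][j] for k in range(n)) for j in range(n_reduced)]
        for row in X
    ]
    return X_reduced, P


# Hypothetical m = 3 segments with n = 4 basic blocks each, reduced to n' = 2.
X = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0], [2.0, 2.0, 2.0, 2.0]]
X_reduced, P = random_projection(X, n_reduced=2, seed=42)
print(len(X_reduced), len(X_reduced[0]))  # 3 2
```

The same P is applied to every row, so all program segments are mapped into one shared n'-dimensional space and remain comparable.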

[0109] After obtaining the target vector, calculate the first adjacency distance corresponding to each program segment according to the target vector, and extend the calculated first adjacency distance into the target vector to obtain the first composite vector, that is, CCSV.

[0110] Table 1 shows the structure of the CCSV corresponding to a program segment provided by an embodiment of the present invention. Here, di represents the i-th dimension of the target vector, i < n' + 2, fi represents the execution frequency of the mapped basic block recorded in the i-th dimension of the target vector, and Distance represents the first adjacency distance corresponding to this program segment.

[0111] Table 1

[0112]

[0113] Optionally, the instruction context information includes: the number of cache lines accessed by each basic block in the program segment and the program counter jump information corresponding to each basic block;

[0114] Step 103 of generating the first composite vector of each program segment according to the self-characteristic information and the instruction context information of the program segment includes:

[0115] Step S71: Generate the first composite vector corresponding to each program segment based on the self-characteristic information, the cache line number information, and the program counter jump information.

[0116] Considering that instability when executing different program segments may stem from differences in cache access, the number of cache lines accessed by each basic block can be used as the instruction context information for the corresponding program segment to improve the effectiveness of clustering analysis. Furthermore, considering that the execution order of different basic blocks may also affect the performance of program segments, thereby influencing the classification results, the program counter jump information for each basic block can also be used as the instruction context information for the corresponding program segment. This program counter jump information includes, but is not limited to, the jump address identified by the program counter and the jump distance.

[0117] In this case, considering that the characteristic information of the program segment itself can be described by the attribute information of each basic block in the program segment, the first composite vector has a dimension of 3n when the program segment includes n basic blocks. Among them, n dimensions are used to record the attribute information of n basic blocks, n dimensions are used to record the number of cache lines accessed by each of the n basic blocks, and n dimensions are used to record the program counter jump information corresponding to the n basic blocks.

[0118] Of course, if the dimension of the vector corresponding to the program segment generated based on the self-feature information, the cache line count information, and the program counter jump information is too high, in order to reduce the computational load and improve the clustering efficiency, the vector can be dimensionality-reduced to obtain the first composite vector. Whether or not to perform dimensionality reduction can be determined according to the actual situation, and this invention does not impose any limitations.

[0119] Optionally, the self-feature information includes the self-attribute information of each basic block in the program segment; step S71, which generates a first composite vector corresponding to each program segment based on the self-feature information, the cache line number information, and the program counter jump information, includes:

[0120] Step S81: For each program segment, the self-attribute information of a basic block, the cache line count information of a basic block, and the program counter jump information corresponding to a basic block are respectively used as information of a single dimension of a vector. Based on the self-attribute information, the cache line count information, and the program counter jump information, a second composite vector corresponding to each program segment is generated.

[0121] Step S82: Perform dimensionality reduction on the second composite vector to obtain the first composite vector corresponding to each program segment.

[0122] The self-characteristic information includes the self-attribute information of each basic block in the program segment. This means that the self-characteristic information of a single program segment includes the self-attribute information of each basic block that constitutes the program segment, such as the execution frequency and the function performed by the basic block in the program segment.

[0123] Regarding step S81, since the self-characteristic information, cache line number information, and program counter jump information of the above-mentioned program segment can all be recorded in units of basic blocks, each kind of information of a basic block can be used as a single dimension of the vector to construct the second composite vector. Specifically, when the program segment includes n basic blocks, the dimension of the second composite vector corresponding to this program segment is 3n: n dimensions record the attribute information of the n basic blocks, n dimensions record the number of cache lines accessed by the n basic blocks, and n dimensions record the program counter jump information corresponding to the n basic blocks.

[0124] When the dimension of the initial vector (i.e., the second composite vector) generated based on the self-characteristic information, cache line number information, and program counter jump information is too high, dimensionality reduction processing, such as principal component analysis or random mapping, can be performed on the second composite vectors corresponding to the program segments in order to reduce the amount of computation and improve the clustering efficiency. The present invention does not limit this.

[0125] Specifically, after the second composite vectors corresponding to the program segments are obtained, assume that there are m program segments and that the original second composite vectors are all 3n-dimensional, so that an m×3n matrix Y is formed. To reduce Y to an m×n′ matrix Y′, where n′ < 3n, a 3n×n′ matrix P is introduced, each element of which is randomly selected from the interval [-1,1]. Then:

[0126] Y′ = Y×P, (2)

[0127] Among them, the dimension of the second composite vector after dimensionality reduction (i.e., the first composite vector) is n′.

[0128] Table 2 shows the structure of the first composite vector (i.e., CCSV) corresponding to another program segment provided by an embodiment of the present invention. Here, di represents the i-th dimension of the first composite vector, i < n′ + 1, and CCSi represents the information in the i-th dimension of the first composite vector obtained after random mapping.

[0129] Table 2

[0130]
d1 | d2 | ... | d(n′-1) | dn′
CCS1 | CCS2 | ... | CCS(n′-1) | CCSn′

[0131] Optionally, the second composite vector includes multiple vector sub-blocks; the vector sub-blocks are respectively used to record the execution frequency information, cache line number information, and program counter jump information of the basic blocks, and the same position in each vector sub-block represents the same basic block.

[0132] Referring to Table 3, a structural diagram of a second composite vector corresponding to a program segment provided in an embodiment of the present invention is shown. The program segment has n basic blocks, so the dimension of the corresponding second composite vector is 3n. Here, di represents the i-th basic block of the program segment, and di, d(n+i), and d(2n+i) all represent the same basic block. Their corresponding values are used to represent the execution frequency (f), the cache line number (CH), and the jump address (PC) identified by the program counter for that basic block, respectively.

[0133] fi represents the execution frequency of the mapped basic block recorded in the i-th dimension of the target vector, and Distance represents the first adjacency distance corresponding to this program segment.

[0134] Table 3

[0135]

[0136]

[0137] It should be noted that, for the sake of simplicity, the method embodiments are all described as a series of actions. However, those skilled in the art should understand that the embodiments of the present invention are not limited to the described order of actions, because according to the embodiments of the present invention, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily essential to the embodiments of the present invention.

[0138] Referring to Figure 2, a structural block diagram of an embodiment of a program segment classification device according to the present invention is shown. Specifically, the device 200 may include:

[0139] The program segmentation module 201 is used to divide the target program into at least two program segments;

[0140] The information acquisition module 202 is used to acquire the self-characteristic information of the program segments and the instruction context information corresponding to the program segments;

[0141] The clustering module 203 is used to cluster the program segments based on their self-characteristic information and the instruction context information to obtain a classification result.
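The work of the program segmentation module 201 can be sketched as follows, using the rule stated in the claims that a program segment is obtained by dividing the target program according to a preset number of instructions. Representing the target program as a flat list of instructions, keeping a shorter trailing segment, and the name `split_into_segments` are assumptions of this sketch.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_segments(instructions: Sequence[T], preset_count: int) -> List[List[T]]:
    """Divide an instruction sequence into consecutive segments of preset_count instructions."""
    if preset_count <= 0:
        raise ValueError("preset_count must be positive")
    # The last segment may be shorter when the length is not a multiple of preset_count.
    return [
        list(instructions[start:start + preset_count])
        for start in range(0, len(instructions), preset_count)
    ]


# Hypothetical target program of 10 instructions, split every 4 instructions.
segments = split_into_segments(list(range(10)), preset_count=4)
print(segments)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```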

[0142] Optionally, the clustering module may include:

[0143] The first vector generation submodule is used to generate a first composite vector for each program segment based on the segment's own feature information and the instruction context information.

[0144] The clustering submodule is used to cluster the at least two program segments based on the first composite vector to obtain the classification result.

[0145] Optionally, the instruction context information includes the first adjacency distance corresponding to the program segment;

[0146] The first vector generation submodule includes:

[0147] The second vector generation submodule is used to generate a first composite vector corresponding to each program segment based on the first adjacency distance and the self-feature information.

[0148] Optionally, the information acquisition module includes:

[0149] The target vector acquisition submodule is used to acquire the target vector of each program segment based on the segment's own feature information.

[0150] The preheating information acquisition submodule is used to acquire preheating information during the execution of the first program segment if the program segment is the first program segment of the target program; the preheating information is used to reflect whether a preheating process exists during the execution of the first program segment.

[0151] The first distance determination submodule is used to determine the first adjacency distance corresponding to the first program segment based on the preheating information;

[0152] The second distance determination submodule is used to calculate the distance between the target vector corresponding to the program segment and the target vector corresponding to the previous program segment if the program segment is not the first program segment of the target program, and obtain the first adjacency distance of the program segment.

[0153] Optionally, the self-feature information includes the self-attribute information of each basic block in the program segment; the second vector generation submodule includes:

[0154] The weighted submodule is used to weight the first adjacency distance to obtain the second adjacency distance;

[0155] The third vector generation submodule is used to generate a first composite vector for each program segment by taking the self-attribute information of a basic block and the second adjacency distance corresponding to the program segment as information of a single dimension of the vector, respectively, based on the self-attribute information and the second adjacency distance.

[0156] Optionally, the first distance determination submodule includes:

[0157] The third distance determination submodule is used to calculate the distance between the target vector and the all-zero vector corresponding to the first program segment if there is no warm-up process during the execution of the first program segment, and obtain the first adjacency distance corresponding to the program segment.

[0158] The fourth distance determination submodule is used to determine that the first adjacency distance corresponding to the program segment is zero if there is a warm-up process during the execution of the first program segment.

[0159] Optionally, the target vector acquisition submodule includes:

[0160] The first vector generation submodule is used to generate a first vector corresponding to each program segment based on the program segment's own feature information.

[0161] The first dimensionality reduction module is used to perform dimensionality reduction processing on the first vector to obtain a target vector; the dimension of the target vector is smaller than that of the first vector.

[0162] Optionally, the instruction context information includes: the number of cache lines accessed by each basic block in the program segment and the program counter jump information corresponding to each basic block;

[0163] The first vector generation submodule includes:

[0164] The fourth vector generation submodule is used to generate a first composite vector corresponding to each program segment based on the self-feature information, the cache line number information and the program counter jump information.

[0165] Optionally, the self-feature information includes the self-attribute information of each basic block in the program segment; the fourth vector generation submodule includes:

[0166] The fifth vector generation submodule is used to generate a second composite vector corresponding to each program segment by taking the self-attribute information of a basic block, the cache line number information of a basic block, and the program counter jump information corresponding to a basic block as information of a single dimension of the vector, respectively, based on the self-attribute information, the cache line number information, and the program counter jump information.

[0167] The second dimensionality reduction submodule is used to perform dimensionality reduction processing on the second composite vector to obtain the first composite vector corresponding to each program segment.

[0168] Optionally, the second composite vector includes: multiple vector sub-blocks; each vector sub-block is used to record the execution frequency information, cache line number information and program counter jump information of the basic block, and the same basic block is represented at the same position in each vector sub-block.

[0169] In summary, this invention provides a program segment classification device. After dividing a target program into at least two program segments, it obtains the self-feature information of each program segment and the instruction context information corresponding to each program segment. Then, based on the self-feature information and the instruction context information, it clusters the program segments to obtain a classification result. By performing clustering processing on program segments based on their self-feature information and the corresponding instruction context information, compared to clustering processing using only the self-feature information of the program segments, it is beneficial for uncovering deep connections between different basic blocks or program segments, improving the clustering effect and classification accuracy for different program segments.
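The final clustering step over the first composite vectors can be sketched as follows. This passage does not name a clustering algorithm; k-means with a deterministic initialization (the first k vectors as initial centroids) is swapped in purely for illustration, and the function name and sample vectors are hypothetical.

```python
import math
from typing import List, Sequence


def kmeans_labels(
    vectors: Sequence[Sequence[float]], k: int, iterations: int = 20
) -> List[int]:
    """Assign each composite vector to one of k clusters with plain k-means."""
    # Assumed initialization: the first k vectors serve as the starting centroids.
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c])) for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical composite vectors: two block attributes plus a weighted distance.
composite_vectors = [
    [1.0, 1.0, 0.0],
    [9.0, 9.0, 0.5],
    [1.2, 0.9, 0.1],
    [8.8, 9.1, 0.4],
]
print(kmeans_labels(composite_vectors, k=2))  # [0, 1, 0, 1]
```

Segments that share a label would then be treated as one class of program segment.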

[0170] As the device embodiment is basically similar to the method embodiment, the description is relatively simple, and relevant parts can be found in the description of the method embodiment.

[0171] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.

[0172] Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operation has been described in detail in the embodiments related to the method, and will not be elaborated upon here.

[0173] Referring to Figure 3, a schematic diagram of the structure of the electronic device provided in an embodiment of the present invention is shown. As shown in Figure 3, the electronic device includes: a processor, a memory, a communication interface, and a communication bus. The processor, the memory, and the communication interface communicate with each other through the communication bus. The memory is used to store at least one executable instruction, which causes the processor to execute the program segment classification method of the aforementioned embodiment.

[0174] This invention provides a non-transitory computer-readable storage medium. When the instructions in the storage medium are executed by a program or processor of a terminal, the terminal is enabled to execute the program segment classification method of the aforementioned embodiments.

[0175] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.

[0176] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, apparatus, or computer program products. Therefore, embodiments of the present invention can take the form of entirely hardware embodiments, entirely software embodiments, or embodiments combining software and hardware aspects. Furthermore, embodiments of the present invention can take the form of computer program products implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0177] This invention is described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, produce a device for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0178] These computer program instructions may also be stored in a computer-readable storage medium capable of directing a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0179] These computer program instructions can also be loaded onto a computer or other programmable data processing terminal equipment, causing a series of operational steps to be performed on the computer or other programmable terminal equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable terminal equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0180] Although preferred embodiments of the present invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of the present invention.

[0181] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes said element.

[0182] The present invention has provided a detailed description of a method and apparatus for classifying program fragments, an electronic device, and a readable storage medium. Specific examples have been used to illustrate the principles and implementation methods of the present invention. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of the present invention. At the same time, those skilled in the art will recognize that, based on the ideas of the present invention, there will be changes in the specific implementation methods and application scope. Therefore, the content of this specification should not be construed as a limitation of the present invention.

Claims

1. A method for classifying program segments, characterized in that the method includes: dividing a target program into at least two program segments, wherein the target program is a program used for microprocessor performance testing, and each program segment is a set of instructions obtained by dividing the target program according to a preset number of instructions; obtaining the self-characteristic information of the program segments and the instruction context information corresponding to the program segments, wherein the instruction context information includes the first adjacency distance corresponding to the program segment, or the cache line number information accessed by each basic block in the program segment and the program counter jump information corresponding to each basic block; and clustering the program segments based on their self-characteristic information and the instruction context information to obtain a classification result, which includes: for each program segment, generating a first composite vector of the program segment based on its self-characteristic information and the instruction context information; and clustering the at least two program segments based on the first composite vector to obtain the classification result.

2. The method according to claim 1, characterized in that, For each program segment, generating a first composite vector for that program segment based on its own feature information and the instruction context information includes: Based on the first adjacency distance and the self-feature information, a first composite vector is generated corresponding to each program segment.

3. The method according to claim 2, characterized in that obtaining the instruction context information corresponding to the program segment includes: for each program segment, obtaining the target vector corresponding to the program segment based on the segment's own feature information; if the program segment is the first program segment of the target program, obtaining the warm-up information during the execution of the first program segment, wherein the warm-up information is used to reflect whether a warm-up process exists during the execution of the first program segment, and determining the first adjacency distance corresponding to the first program segment based on the warm-up information; and if the program segment is not the first program segment of the target program, calculating the distance between the target vector corresponding to the program segment and the target vector corresponding to the previous program segment to obtain the first adjacency distance of the program segment.

4. The method according to claim 2 or 3, characterized in that, The self-feature information includes the self-attribute information of each basic block in the program segment; the step of generating a first composite vector corresponding to each program segment based on the first adjacency distance and the self-feature information includes: The first adjacency distance is weighted to obtain the second adjacency distance; For each program segment, the self-attribute information of a basic block and the second adjacency distance corresponding to the program segment are respectively used as information of a single dimension of the vector. Based on the self-attribute information and the second adjacency distance, a first composite vector corresponding to each program segment is generated.

5. The method according to claim 3, characterized in that determining the first adjacency distance corresponding to the first program segment based on the warm-up information includes: if there is no warm-up process during the execution of the first program segment, calculating the distance between the target vector corresponding to the first program segment and the all-zero vector to obtain the first adjacency distance corresponding to the first program segment; and if a warm-up process exists during the execution of the first program segment, determining that the first adjacency distance corresponding to the first program segment is zero.

6. The method according to claim 3, characterized in that, The step of obtaining the target vector of each program segment based on its own feature information includes: For each program segment, a first vector is generated based on the segment's own feature information; The first vector is subjected to dimensionality reduction processing to obtain the target vector; the dimension of the target vector is smaller than that of the first vector.

7. The method according to claim 1, characterized in that, For each program segment, generating a first composite vector for that program segment based on its own feature information and the instruction context information includes: Based on the self-feature information, the cache line number information, and the program counter jump information, a first composite vector corresponding to each program segment is generated.

8. The method according to claim 7, characterized in that the self-feature information includes the self-attribute information of each basic block in the program segment; and generating the first composite vector corresponding to each program segment based on the self-feature information, the cache line number information, and the program counter jump information includes: for each program segment, taking the self-attribute information of a basic block, the cache line number information of a basic block, and the program counter jump information corresponding to a basic block each as the information of a single dimension of a vector, and generating a second composite vector corresponding to the program segment based on the self-attribute information, the cache line number information, and the program counter jump information; and performing dimensionality reduction on the second composite vector to obtain the first composite vector corresponding to each program segment.

9. The method according to claim 8, characterized in that the second composite vector includes multiple vector sub-blocks; each vector sub-block is used to record the execution frequency information, cache line number information, and program counter jump information of the basic block, and the same basic block is represented at the same position in each vector sub-block.
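
One possible reading of the sub-block layout in claim 9 is sketched below: three sub-blocks, one per kind of information, with every basic block occupying the same index in each sub-block. This one-sub-block-per-kind arrangement is an assumption, as are the function name, the dictionary inputs, and the use of 0 for blocks that a segment never executes.

```python
from typing import Dict, List, Sequence


def build_second_composite_vector(
    block_ids: Sequence[str],
    exec_frequency: Dict[str, float],
    cache_lines: Dict[str, float],
    pc_jumps: Dict[str, float],
) -> List[float]:
    """Concatenate three sub-blocks that share one basic-block ordering."""
    ordered = list(block_ids)  # fixed ordering shared by all sub-blocks
    vector: List[float] = []
    for table in (exec_frequency, cache_lines, pc_jumps):
        # Basic block i always lands at offset i inside its sub-block.
        vector.extend(float(table.get(block, 0.0)) for block in ordered)
    return vector


blocks = ["bb0", "bb1", "bb2"]
second_vector = build_second_composite_vector(
    blocks,
    exec_frequency={"bb0": 10, "bb2": 4},
    cache_lines={"bb0": 2, "bb1": 1},
    pc_jumps={"bb2": 3},
)
```

Keeping one ordering across sub-blocks means that index `i`, `i + n`, and `i + 2n` (with `n` basic blocks) all describe the same basic block, which keeps vectors from different segments aligned before dimensionality reduction.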

10. A device for classifying program segments, characterized in that the device includes: a program segmentation module, used to divide a target program into at least two program segments, wherein the target program is a program used for microprocessor performance testing, and each program segment is an instruction set obtained by dividing the target program according to a preset number of instructions; an information acquisition module, used to acquire the self-feature information of the program segments and the instruction context information corresponding to the program segments, wherein the instruction context information includes the first adjacency distance corresponding to the program segment, or the cache line number information accessed by each basic block in the program segment and the program counter jump information corresponding to each basic block; and a clustering module, used to cluster the program segments based on the self-feature information and the instruction context information to obtain a classification result; wherein the clustering module includes: a first vector generation submodule, used to generate, for each program segment, a first composite vector based on the self-feature information and the instruction context information; and a clustering submodule, used to cluster the at least two program segments based on the first composite vector to obtain the classification result.
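
The segmentation and clustering modules of claim 10 might be sketched as below. The claim names no clustering algorithm; a small k-means is assumed here only as an example, and the function names, the `segment_size` and `k` parameters, and the toy vectors are all hypothetical.

```python
import math
import random
from typing import List, Sequence


def split_into_segments(instructions: Sequence[str], segment_size: int) -> List[List[str]]:
    """Divide an instruction stream into segments of a preset instruction count."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    return [list(instructions[i:i + segment_size])
            for i in range(0, len(instructions), segment_size)]


def kmeans_labels(vectors: Sequence[Sequence[float]], k: int,
                  iterations: int = 20, seed: int = 0) -> List[int]:
    """Assign each composite vector to one of k clusters (assumed algorithm)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(list(vectors), k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(v, centroids[c]))
                  for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


segments = split_into_segments(["i%d" % n for n in range(10)], segment_size=4)
labels = kmeans_labels([[0, 0], [0, 1], [10, 10], [10, 11]], k=2)
```

In this sketch the last segment may be shorter than the preset size; how a trailing partial segment is handled is not stated in the claim.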

11. The device according to claim 10, characterized in that the first vector generation submodule includes: a second vector generation submodule, used to generate the first composite vector corresponding to each program segment based on the first adjacency distance and the self-feature information.

12. The device according to claim 11, characterized in that the information acquisition module includes: a target vector acquisition submodule, used to acquire the target vector of each program segment based on the self-feature information of the program segment; a preheating information acquisition submodule, used to acquire preheating information during the execution of the first program segment if the program segment is the first program segment of the target program, wherein the preheating information reflects whether a preheating process exists during the execution of the first program segment; a first distance determination submodule, used to determine the first adjacency distance corresponding to the first program segment based on the preheating information; and a second distance determination submodule, used to, if the program segment is not the first program segment of the target program, calculate the distance between the target vector corresponding to the program segment and the target vector corresponding to the previous program segment to obtain the first adjacency distance of the program segment.

13. The device according to claim 11 or 12, characterized in that the self-feature information includes the self-attribute information of each basic block in the program segment; and the second vector generation submodule includes: a weighting submodule, used to weight the first adjacency distance to obtain a second adjacency distance; and a third vector generation submodule, used to, for each program segment, take the self-attribute information of a basic block and the second adjacency distance corresponding to the program segment each as the information of a single dimension of a vector, and generate the first composite vector corresponding to the program segment based on the self-attribute information and the second adjacency distance.

14. The device according to claim 12, characterized in that the first distance determination submodule includes: a third distance determination submodule, used to, if no preheating process exists during the execution of the first program segment, calculate the distance between the target vector corresponding to the first program segment and an all-zero vector to obtain the first adjacency distance corresponding to the first program segment; and a fourth distance determination submodule, used to determine that the first adjacency distance corresponding to the first program segment is zero if a preheating process exists during the execution of the first program segment.

15. The device according to claim 12, characterized in that the target vector acquisition submodule includes: a first vector generation submodule, used to generate a first vector corresponding to each program segment based on the self-feature information of the program segment; and a first dimensionality reduction submodule, used to perform dimensionality reduction on the first vector to obtain the target vector, wherein the dimension of the target vector is smaller than that of the first vector.

16. The device according to claim 10, characterized in that the first vector generation submodule includes: a fourth vector generation submodule, used to generate the first composite vector corresponding to each program segment based on the self-feature information, the cache line number information, and the program counter jump information.

17. The device according to claim 16, characterized in that the self-feature information includes the self-attribute information of each basic block in the program segment; and the fourth vector generation submodule includes: a fifth vector generation submodule, used to, for each program segment, take the self-attribute information of a basic block, the cache line number information of a basic block, and the program counter jump information corresponding to a basic block each as the information of a single dimension of a vector, and generate a second composite vector corresponding to the program segment based on the self-attribute information, the cache line number information, and the program counter jump information; and a second dimensionality reduction submodule, used to perform dimensionality reduction on the second composite vector to obtain the first composite vector corresponding to each program segment.

18. An electronic device, characterized in that the electronic device includes a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors to perform the program segment classification method according to any one of claims 1 to 9.

19. A readable storage medium, characterized in that, when the instructions in the storage medium are executed by a processor of an electronic device, the processor is enabled to perform the program segment classification method according to any one of claims 1 to 9.