An analysis method and system for the influence of a configuration item on the performance of a software system
By identifying the performance operations and configuration item dependencies in the software system and using a random forest model to determine the impact of configuration items, performance problems caused by configuration errors are solved, configuration efficiency and accuracy are improved, and performance analysis overhead is reduced.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SUN YAT SEN UNIV
- Filing Date
- 2022-06-27
- Publication Date
- 2026-06-23
AI Technical Summary
In existing technologies, system failures and performance problems caused by software system configuration errors occur frequently, and users have difficulty understanding the impact of configuration items on system performance, resulting in low configuration efficiency.
By identifying the dependencies between performance operations and configuration items in a software system, constructing feature vectors, and using a random forest classification model and a configuration item dependency detector to determine whether configuration items affect system performance, a white-box analysis method is provided to reduce the overhead of performance analysis.
Without running the software system, it can accurately predict the impact of configuration items on performance, improve configuration efficiency, discover the set of configuration items that truly affect performance, reduce performance analysis overhead, and provide interpretability.
Figure CN114996111B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of software system technology, and in particular to a method and system for analyzing the impact of configuration items on software system performance.
Background Technology
[0002] Computer software systems refer to the various programs, data, and related documentation that a computer runs, including system software, support software, and application software. Many modern software systems are designed to be highly customizable, configurable according to the user's hardware platform, operating system, and user needs, thus meeting the user's requirements in terms of software functionality or performance.
[0003] However, software systems have a large number of configuration items, and some of these items are dependent on each other. The complexity of software system configuration makes adjusting it a significant challenge. Research shows that software system configuration errors have become one of the main causes of system failures and performance problems. Software system configuration errors can have serious consequences; misconfigurations on commercial storage systems and open-source operating systems can lead to difficult-to-diagnose system crashes, hangs, or severe performance degradation.
[0004] Besides common software system configuration errors, users often have difficulty understanding the actual impact of changing a configuration item on the software system. As a result, users are usually forced to adjust the software system configuration through a lot of trial and error, which leads to low efficiency in software system configuration.
Summary of the Invention
[0005] The purpose of this invention is to provide a method and system for analyzing the impact of configuration items on the performance of software systems, so as to solve the technical problems of configuration errors and low configuration efficiency in existing software systems.
[0006] The objective of this invention can be achieved through the following technical solutions:
[0007] A method for analyzing the impact of configuration items on software system performance, comprising:
[0008] Based on the preset code patterns of the software system, all performance operations in the software system are identified and marked. The performance operations are time-intensive and / or space-intensive operations that affect the performance of the software system.
[0009] Identify the dependencies between each performance operation and each configuration item of the software system to obtain a set of performance operations corresponding to each configuration item. Each performance operation in the set of performance operations has a dependency relationship with the configuration item.
[0010] Construct a feature vector corresponding to each configuration item based on the set of performance operations;
[0011] The feature vectors corresponding to each configuration item are input into the trained qualitative performance impact model to determine whether the configuration item affects the software system performance, thereby obtaining a set of configuration items that affect the software system performance. The qualitative performance impact model is trained using the feature vectors corresponding to the configuration items of multiple software systems.
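The first step of the method above, marking performance operations by code pattern, can be illustrated with a minimal sketch. It scans Java source text line by line with regular expressions. The categories and patterns in `PATTERNS`, and the Java snippet itself, are assumptions made for this example only; the actual preset code patterns are the ones the description associates with Figure 6.

```python
import re

# Hypothetical code patterns; the real preset patterns are not given in the text.
PATTERNS = {
    "loop": re.compile(r"\b(for|while)\s*\("),
    "io": re.compile(r"\.(read|write|flush)\s*\("),
    "allocation": re.compile(r"\bnew\s+\w+\s*\["),
    "sleep": re.compile(r"\bThread\.sleep\s*\("),
}


def mark_performance_operations(source: str) -> list[tuple[int, str]]:
    """Return (line_number, category) for every line that matches a pattern."""
    marks = []
    for number, line in enumerate(source.splitlines(), start=1):
        for category, pattern in PATTERNS.items():
            if pattern.search(line):
                marks.append((number, category))
    return marks


# Hypothetical Java fragment used as input.
JAVA_SNIPPET = """\
int n = conf.getInt("buffer.size");
byte[] buf = new byte[n];
while (in.read(buf) > 0) {
    out.write(buf);
}
"""

marks = mark_performance_operations(JAVA_SNIPPET)
for number, category in marks:
    print(number, category)
```

A text-level scan like this is only a stand-in: a static analysis framework would match patterns on an intermediate representation rather than on raw source lines.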
[0012] Optionally, the qualitative performance impact model includes:
[0013] Random forest classification model and configuration item dependency detector;
[0014] The random forest classification model performs binary classification on whether configuration items affect the performance of the software system, and the configuration item dependency detector corrects the classification results of the random forest classification model.
[0015] Optionally, the dependencies include:
[0016] Data dependencies and control dependencies;
[0017] Here, a data dependency is a dependency that arises from data flow, and a control dependency is a dependency that arises from the program's control flow.
[0018] Optionally, identifying the dependencies between each of the performance operations and each configuration item of the software system includes:
[0019] Taint analysis is used to identify the data dependencies between each performance operation and each configuration item of the software system.
[0020] The control dependencies between each performance operation and each configuration item of the software system are identified using a program dependency graph; the program dependency graph is constructed using program dependency analysis techniques and is used to describe the control dependencies and data dependencies of the program.
[0021] Optionally, using taint analysis to identify data dependencies between each of the performance operations and each configuration item of the software system includes:
[0022] Starting from the program entry point of the software system, traverse the control flow, and create a taint as the source point at the location where the configuration item is loaded through its API;
[0023] Record the data propagation path from the source point and the final destination point. The performance operation at the destination point has a data dependency on the configuration item. The destination point is a program statement that the source point is not expected to reach. The destination point is preset before the statement corresponding to the performance operation.
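The two taint-analysis steps above can be sketched on a toy program. The sketch assumes a straight-line list of statements (no branches, no aliasing, no method calls), which is far simpler than a real static taint analysis. The `Statement` class, its fields, and the example program are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Statement:
    target: Optional[str]                 # variable defined by the statement, if any
    uses: tuple[str, ...]                 # variables read by the statement
    loads_option: Optional[str] = None    # option name if this is a config-loading API call
    is_sink: bool = False                 # True for a marked performance operation


def data_dependent_sinks(program: list[Statement]) -> dict[int, set[str]]:
    """Map each sink's index to the configuration items whose taint reaches it."""
    taint: dict[str, set[str]] = {}       # variable -> options it is tainted by
    reached: dict[int, set[str]] = {}
    for index, stmt in enumerate(program):
        incoming: set[str] = set()
        for var in stmt.uses:
            incoming |= taint.get(var, set())
        if stmt.loads_option is not None:
            incoming.add(stmt.loads_option)   # source point: create the taint
        if stmt.is_sink and incoming:
            reached[index] = set(incoming)    # destination point reached
        if stmt.target is not None:
            taint[stmt.target] = incoming     # assignment overwrites old taint
    return reached


# Hypothetical program:
#   0: n = conf.getInt("buffer.size")
#   1: size = n * 2
#   2: label = "x"
#   3: buf = new byte[size]      (performance operation)
#   4: print(label)              (performance operation)
PROGRAM = [
    Statement("n", (), loads_option="buffer.size"),
    Statement("size", ("n",)),
    Statement("label", ()),
    Statement("buf", ("size",), is_sink=True),
    Statement(None, ("label",), is_sink=True),
]

dependencies = data_dependent_sinks(PROGRAM)
print(dependencies)
```

Only statement 3 is reported, because the value loaded for `buffer.size` flows through `n` and `size` into the allocation, while `label` never carries the taint.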
[0024] Optionally, identifying the control dependencies between each performance operation and each configuration item of the software system using a program dependency graph includes:
[0025] Traverse all nodes in the program dependency graph and construct the control region for each configuration item. The control region of each configuration item is a sequence of statements that has a direct control dependency relationship with the configuration item.
[0026] Identify the control dependencies between each performance operation and each configuration item of the software system based on the control region of each configuration item.
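A minimal sketch of the control-region steps above follows. It assumes the program dependency graph is already available as a mapping from each branch statement to the statements directly control-dependent on it, and that the branch statements reading each configuration item are known. The statement labels, item names, and all three tables are hypothetical.

```python
# Hypothetical PDG fragment: branch statement -> statements it directly controls.
CONTROL_EDGES = {
    "s2: if (useCache)": ["s3: cache.load()", "s4: if (verbose)"],
    "s4: if (verbose)": ["s5: log.write(msg)"],
}

# Hypothetical: branch statements whose condition reads each configuration item.
BRANCHES_USING_ITEM = {
    "cache.enabled": ["s2: if (useCache)"],
    "log.verbose": ["s4: if (verbose)"],
}

# Hypothetical set of statements already marked as performance operations.
PERFORMANCE_OPERATIONS = {"s3: cache.load()", "s5: log.write(msg)"}


def control_region(item: str) -> list[str]:
    """Statements directly control-dependent on a branch that reads `item`."""
    region = []
    for branch in BRANCHES_USING_ITEM.get(item, []):
        region.extend(CONTROL_EDGES.get(branch, []))
    return region


def control_dependent_operations(item: str) -> list[str]:
    """Performance operations that lie inside the control region of `item`."""
    return [s for s in control_region(item) if s in PERFORMANCE_OPERATIONS]


print(control_dependent_operations("cache.enabled"))
print(control_dependent_operations("log.verbose"))
```

Because the region follows direct control dependencies only, as the text states, `s5` is attributed to `log.verbose` and not to `cache.enabled`, even though it is nested inside the outer branch.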
[0027] Optionally, the training process of the random forest classification model includes:
[0028] The feature vectors corresponding to the configuration items of multiple software systems are divided into training and test sets;
[0029] The random forest classification model is trained based on the training set and the random forest algorithm.
[0030] Optionally, the correction of the classification results of the random forest classification model by the configuration item dependency detector includes:
[0031] When the first configuration item of the software system depends on the second configuration item, if the random forest classification model determines that the first configuration item affects the performance of the software system and the second configuration item does not affect the performance of the software system, then the configuration item dependency detector will correct the second configuration item to affect the performance of the software system.
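The correction rule above can be sketched as follows. The function name, the item names, and the dictionary layout are hypothetical. Repeating the rule until nothing changes is an assumption of this sketch, made so that chains of dependencies are also covered; the text itself states only the single-step rule.

```python
def correct_with_dependencies(
    predicted: dict[str, bool],
    depends_on: dict[str, set[str]],
) -> dict[str, bool]:
    """Apply the rule: if A affects performance and A depends on B, mark B too."""
    corrected = dict(predicted)
    changed = True
    while changed:                      # assumption: iterate to a fixed point
        changed = False
        for first, seconds in depends_on.items():
            if not corrected.get(first, False):
                continue
            for second in seconds:
                if not corrected.get(second, False):
                    corrected[second] = True
                    changed = True
    return corrected


# Hypothetical classifier output and configuration item dependencies.
predicted = {"a": True, "b": False, "c": False, "d": False}
depends_on = {"a": {"b"}, "b": {"c"}}

corrected = correct_with_dependencies(predicted, depends_on)
print(corrected)
```

Here `b` is corrected because `a` depends on it, `c` follows through the chain, and `d` is left unchanged because no performance-affecting item depends on it.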
[0032] Optionally, before identifying the dependencies between each of the performance operations and each configuration item of the software system, the method further includes:
[0033] Extract the configuration item information of the software system. The configuration item information includes at least the name, quantity, and API used when the configuration item is loaded into the software system.
[0034] This invention also provides an analysis system for the impact of configuration items on software system performance, comprising:
[0035] The performance operation identification module is used to identify and mark all performance operations in the software system according to the preset code patterns of the software system. The performance operations are time-intensive and / or space-intensive operations that affect the performance of the software system.
[0036] A dependency identification module is used to identify the dependency relationship between each of the performance operations and each of the configuration items of the software system, and to obtain a set of performance operations corresponding to each of the configuration items, wherein each performance operation in the set of performance operations has a dependency relationship with the configuration item.
[0037] A feature vector construction module is used to construct feature vectors corresponding to each of the configuration items based on the performance operation set;
[0038] The configuration item set determination module is used to input the feature vectors corresponding to each configuration item into the trained qualitative performance impact model, determine whether the configuration item affects the performance of the software system, and obtain the configuration item set that affects the performance of the software system. The qualitative performance impact model is trained using the feature vectors corresponding to the configuration items of multiple software systems.
[0039] This invention provides a method and system for analyzing the impact of configuration items on software system performance. The method includes: identifying and labeling all performance operations in the software system based on a preset code pattern, wherein the performance operations are time-intensive and / or space-intensive operations that affect the performance of the software system; identifying the dependencies between each performance operation and each configuration item of the software system to obtain a set of performance operations corresponding to each configuration item, wherein each performance operation in the set of performance operations has a dependency relationship with the configuration item; constructing a feature vector corresponding to each configuration item based on the set of performance operations; inputting the feature vectors corresponding to each configuration item into a trained qualitative performance impact model to determine whether the configuration item affects the performance of the software system, thereby obtaining a set of configuration items that affect the performance of the software system. The qualitative performance impact model is trained using feature vectors corresponding to configuration items of multiple software systems.
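One way the feature-vector construction step could look is sketched below. The text does not specify the feature layout, so the choice here (one counter per combination of operation category and dependency kind) is an assumption, as are the category names, the item names, and the `PerformanceOperation` class.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PerformanceOperation:
    category: str      # e.g. "loop", "io" (hypothetical categories)
    dependency: str    # "data" or "control"


# Hypothetical fixed feature order: one counter per (category, dependency) pair.
CATEGORIES = ("loop", "io", "allocation")
DEPENDENCIES = ("data", "control")
FEATURES = [(c, d) for c in CATEGORIES for d in DEPENDENCIES]


def feature_vector(operations: list[PerformanceOperation]) -> list[int]:
    """Count the operations of a configuration item per feature slot."""
    counts = Counter((op.category, op.dependency) for op in operations)
    return [counts[feature] for feature in FEATURES]


# Hypothetical performance operation sets, one per configuration item.
ops_by_item = {
    "buffer.size": [
        PerformanceOperation("allocation", "data"),
        PerformanceOperation("loop", "control"),
        PerformanceOperation("io", "control"),
    ],
    "log.prefix": [],
}

vectors = {item: feature_vector(ops) for item, ops in ops_by_item.items()}
print(vectors)
```

A fixed feature order keeps vectors comparable across software systems, which matters because the model is trained on configuration items from multiple systems.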
[0040] In view of this, the beneficial effects of this invention are:
[0041] This invention employs program analysis techniques to track time-intensive or space-intensive operations that have dependencies on configuration items. Based on the program analysis results, it constructs corresponding feature vectors for each configuration item, achieving fine-grained analysis down to the configuration item level. It is not limited to Boolean types or exhaustively enumerating finite numerical types, supporting configuration items of any type. It uses random forests to build a qualitative performance impact model, eliminating the need for time-consuming local measurement operations. Only a single analysis of the software system's source code is required to determine whether a specific configuration item affects the performance of the configurable system, significantly reducing the overhead of performance analysis. It can accurately predict whether each configuration item affects the software system's performance without running the software system, identifying the set of configuration items that truly impact software system performance, improving the efficiency of software system configuration, and facilitating correct configuration of the software system to improve performance. Furthermore, this invention is interpretable, allowing understanding of the underlying reasons why configuration items affect performance through program analysis results and the classification rules of the performance model.
Attached Figure Description
[0042] Figure 1 is a schematic flowchart of the method of the present invention;
[0043] Figure 2 is a schematic diagram of the system structure of the present invention;
[0044] Figure 3 is the program dependency graph of the example program, where solid lines represent control dependencies and dashed lines represent data dependencies;
[0045] Figure 4 is an example diagram of taint analysis in FlowDroid;
[0046] Figure 5 is a flowchart of an embodiment of the method of the present invention;
[0047] Figure 6 illustrates the categories of performance operations and their code patterns in the method of the present invention;
[0048] Figure 7 illustrates the classification and examples of the dependencies between performance operations and configuration items in the method of the present invention;
[0049] Figure 8 is a schematic diagram of the taint analysis process of the method of the present invention;
[0050] Figure 9 is an example diagram of the configuration item control region of the method of the present invention;
[0051] Figure 10 is a schematic diagram of the qualitative performance impact model in the method of the present invention;
[0052] Figure 11 is a schematic diagram of the program analysis module of the ConfigAnalyzer tool of the present invention.
Detailed Implementation
[0053] Terminology Explanation:
[0054] Option: A special type of input with a type and a defined range of values. For example, a Boolean option might have values in the range {true, false}, while an Integer option might have values in the range {0, 1, 2, 3}. Options allow users to change the internal execution logic of a software system without modifying the code. Therefore, users can change the functionality or performance of the software system by altering option values. In some literature, options are also referred to as features or functionalities.
[0055] Configuration: The complete settings for all configuration items in a software system. All configuration items set to specific values constitute the software system's configuration.
[0056] Configuration Space: All possible configurations in a software system constitute the configuration space.
[0057] Configurable System: A software system that provides users with configuration options for customized operation.
[0058] Misconfiguration: A configuration item is set to an inappropriate value, causing the software system to behave or perform incorrectly. Software system errors caused by misconfiguration are called configuration errors.
[0059] Environment: The overall hardware and software components upon which a software system depends for operation. Generally, the environment in which a software system runs does not change.
[0060] Workload: The amount of tasks a software system needs to complete within a certain timeframe. Users input tasks into the software system to complete predetermined tasks or goals. The greater the amount of tasks the software system needs to complete within a certain timeframe, the greater the workload, and the more computing resources the software system needs to use.
[0061] Performance: This refers to the ability of a software system to function, and is typically directly related to energy consumption and operating costs. Generally, performance is measured differently depending on the software service quality requirements. The most intuitive way to measure software system performance is the runtime required to complete a task.
[0062] Performance-influence model: A type of model that describes the performance of a software system under specific environments and workloads.
[0063] Control flow refers to the execution order of statements, instructions, or function calls during program runtime. For imperative programming languages like Java, programs have a clearly defined control flow (unlike programs written in declarative programming languages).
[0064] Control-flow statements are a type of program statement that influences the actual control flow of a program based on different control flow decisions. For example, in Java, the if statement, switch statement, for loop, and while loop are all control-flow statements.
[0065] Control-flow decision: refers to the actual execution of control flow statements, that is, selecting to execute a specific branch.
[0066] Control-flow Graph (CFG): A type of flowchart used to represent the control flow of a program.
[0067] Data Flow: An abstraction of data dependency chains in a program.
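To make the terms Option, Configuration, and Configuration Space defined above concrete, the following sketch enumerates the configuration space of two options that use the value ranges from the Option definition. The option names `use_cache` and `threads` are hypothetical.

```python
from itertools import product

# Two hypothetical options with the value ranges used in the Option definition.
OPTIONS = {
    "use_cache": [True, False],   # Boolean option
    "threads": [0, 1, 2, 3],      # Integer option with range {0, 1, 2, 3}
}


def configuration_space(options: dict[str, list]) -> list[dict]:
    """Every configuration: one concrete value chosen for each option."""
    names = list(options)
    return [
        dict(zip(names, values))
        for values in product(*(options[name] for name in names))
    ]


space = configuration_space(OPTIONS)
print(len(space))
print(space[0])
```

The space size is the product of the range sizes (here 2 × 4 = 8), which is why it grows so quickly as options are added.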
[0068] This invention provides a method and system for analyzing the impact of configuration items on software system performance, in order to solve the technical problems of configuration errors and low configuration efficiency in existing software systems.
[0069] To facilitate understanding of the present invention, a more complete description will be given below with reference to the accompanying drawings. Preferred embodiments of the invention are shown in the drawings. However, the invention can be implemented in many different forms and is not limited to the embodiments described herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
[0070] Unless otherwise defined, all technical and scientific terms used in this invention have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. The terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and / or" as used herein includes any and all combinations of one or more of the associated listed items.
[0071] (1) Configuration and complexity of modern software systems
[0072] Today, many modern software systems are designed to be highly customizable, configurable based on the user's hardware platform, operating system, and user needs. Software system configuration allows users to easily change the behavior or state of the software system without modifying the code, thereby improving software flexibility and security and meeting user requirements for functionality or performance. Software systems that provide configuration options for users are called configurable systems.
[0073] In simple terms, configuration can be represented as a collection of options, each representing a certain property of the software, such as the hardware platform used, operating system, or whether a certain plugin is loaded. However, while highly customizable configurations can bring potential richness in software functionality or improvement in software performance, they also present significant challenges for users and developers.
[0074] Research indicates that misconfiguration in software systems has become a major cause of system failures and performance issues. Reports indicate that misconfiguration is the second leading cause of service level incidents (SLS) in Google's major production services; at Facebook, misconfiguration accounts for 16% of SLS, considered a critical challenge to Facebook's reliability. Studies of enterprise backup systems show that most task failures are caused by misconfiguration. The consequences of misconfiguration can be quite severe. Research on misconfiguration in commercial storage systems and open-source operating systems shows that a significant proportion of misconfiguration leads to difficult-to-diagnose system crashes, hangs, or severe performance degradation.
[0075] Besides widespread configuration errors, understanding the purpose of configuration is also a major challenge. Users often struggle to grasp the actual impact of changing a configuration item on the software system, forcing them to resort to time-consuming trial-and-error methods to adjust settings. This also costs vendors; reports indicate that configuration issues are a primary source of user support costs for cloud service and data center software providers. Furthermore, configuration issues complicate the development, testing, and maintenance of software systems.
[0076] In summary, while modern software system configurations can meet users' needs in terms of software functionality or performance, the complexity of configurations (reflected in the large number of configuration items and the interactions and dependencies between them) makes adjusting software configurations a huge challenge.
[0077] (2) The impact of configuration on software system performance
[0078] Software system performance (and the often directly related energy consumption and operating costs) is a crucial attribute for both users and developers. From a user's perspective, they typically want a system that provides specific functionality while minimizing energy consumption and operating costs; from a developer's perspective, they aim to create efficiently configurable systems that deliver a high-quality user experience. The inventors' research revealed that in open-source cloud systems, software performance issues lead to approximately 50% of configuration-related patch releases and about 30% of configuration-related forum discussions. In cloud systems, severe performance problems and outages caused by misconfigurations have cost hundreds of millions of dollars.
[0079] Setting configuration options that are sensitive to software system performance is a challenging task that often requires a deep understanding of the system. For example, setting a configuration option may require balancing memory usage and system response time, and making this trade-off requires a thorough understanding of the software system, the hardware used, and the current workload. To make matters worse, software system documentation often lacks clear explanations of these relationships, and even when clear documentation exists, factors such as workload are often complex and variable, making it difficult for users to set appropriate configurations.
[0080] In another example, a specific configuration value might lock every write operation by the user, increasing write latency. However, the documentation only states that there are no restrictions on the objects supported for write operations when this configuration value is set. Unless the user traces the specific implementation logic through the code, it's impossible to understand the reason for the system performance changes.
[0081] (3) Existing technical solutions
[0082] Performance impact models (PIMs) are used to represent how configuration affects system performance. The application of PIMs is an important technical tool for analyzing the relationship between configuration and software system performance. Different configurations are used as inputs to the PIM model to obtain predicted performance values, thus determining whether the configuration affects system performance. Therefore, the differences between existing technical solutions lie in the way they construct the PIM model.
[0083] One related approach involves using a black-box methodology to build performance impact models. The idea behind using a black-box approach is to treat the software system as a black box, sample the configuration space to obtain a subset of configurations, measure the system's performance under a specific workload for each configuration within that subset, and then learn a performance impact model from these observations.
[0084] Existing black-box techniques require a trade-off between modeling cost and model accuracy. Building more accurate models requires more samples, which in turn necessitates sampling a larger subset of configurations. This leads to more measurements of the target software system's performance under specific loads, resulting in a greater time cost. Furthermore, performance impact models built using black-box methods are mostly based on deep learning models, which generally have low interpretability and cannot explain the fundamental reasons why changes in configuration items cause changes in software system performance.
[0085] Another type of related work uses a white-box approach to build performance impact models. The white-box approach no longer treats the software system as a completely black box, but rather divides it into multiple components or modules according to program analysis principles. Each component or module is analyzed and modeled to construct a performance impact model for the entire system. Besides accurately predicting performance, it can also explain the reasons for performance issues, such as which components or modules are responsible for the performance problems.
[0086] However, existing white-box methods for building performance impact models have certain drawbacks. Some white-box methods only support configuration items of Boolean type or of a finite numerical type whose values can be exhaustively enumerated (the latter must be discretized into several Boolean configuration items), which is a significant limitation. Moreover, after discretization, the number of configuration items increases dramatically, and the tool's runtime increases exponentially.
[0087] While some white-box methods can learn a more accurate performance impact model, they still require preparing the software system runtime environment, sampling the configuration space, running the target software system based on the sampled configuration subset, and collecting various performance indicators of the software system during runtime.
[0088] While the various existing technical methods described above differ in their implementation details and thus each has its own drawbacks, they all inevitably incur runtime overhead for the software system. These methods require preparing the software system's runtime environment, selecting and traversing a subset of configurations, and measuring the software system's performance under specific loads on different configuration subsets. The enormous time overhead required to repeatedly run the software system and collect runtime performance metrics far exceeds the time overhead required to analyze and construct performance impact models.
[0089] (4) Program Analysis
[0090] Program analysis is the process of automatically analyzing a program's behavior with respect to properties such as correctness, robustness, security, and operability. In other words, program analysis systematically examines a program to determine its properties.
[0091] Program analysis can be divided into:
[0092] 1) Static program analysis: Program analysis performed without running the program;
[0093] 2) Dynamic program analysis: Run the program on a real or virtual processor and analyze the program based on its runtime performance.
[0094] Although static program analysis cannot obtain runtime information about a program, it saves a significant amount of time and computational resources compared to dynamic program analysis because it does not require actual program execution. Furthermore, more information obtained in program analysis is not necessarily better; a balance must be struck between the benefits and the costs. Therefore, the method proposed in this invention uses static program analysis technology.
[0095] (5) Taint analysis
[0096] Taint analysis, also known as information-flow analysis, is a type of program analysis that detects whether sensitive or private information can be obtained through injection vulnerabilities in source code. Taint analysis is generally used to identify the flow of user input within a system to understand the security impact of the system design. Taint analysis can be divided into static taint analysis and dynamic taint analysis.
[0097] Taint analysis defines a quadruple (P, SO, SI, SA), where:
[0098] 1) P represents the program being analyzed;
[0099] 2) SO represents the set of sources, which are the pieces of information that need to be tracked;
[0100] 3) SI represents the set of sinks, which are program statements that a source should not reach;
[0101] 4) SA represents the set of sanitizers. If a source passes through a sanitizer during its propagation, its harmfulness is eliminated.
[0102] Theorem 1: A program has an information leakage vulnerability or a tainted flow vulnerability if and only if there exists a path from a source to a sink in the program that does not pass through any sanitizer.
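Theorem 1 amounts to a reachability check, sketched below as a breadth-first search that never steps onto a sanitizer. The graph encoding (a dictionary of data-flow successors) and the node names are hypothetical.

```python
from collections import deque


def has_tainted_flow(edges, sources, sinks, sanitizers):
    """True if some source reaches some sink along a path with no sanitizer."""
    queue = deque(s for s in sources if s not in sanitizers)
    seen = set(queue)
    while queue:
        node = queue.popleft()
        if node in sinks:
            return True
        for successor in edges.get(node, ()):
            if successor not in seen and successor not in sanitizers:
                seen.add(successor)
                queue.append(successor)
    return False


# Hypothetical data-flow graph of a program P.
EDGES = {
    "src": ["a", "clean"],
    "a": ["sink1"],
    "clean": ["sink2"],
}

print(has_tainted_flow(EDGES, {"src"}, {"sink1"}, {"clean"}))
print(has_tainted_flow(EDGES, {"src"}, {"sink2"}, {"clean"}))
```

In this graph `sink1` is reachable through `a`, while the only path to `sink2` passes through the sanitizer `clean` and is therefore not a tainted flow.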
[0103] It's important to clarify that "vulnerability" generally refers to all endpoints in a program's code. In information security analysis, a vulnerability represents any point in a program where information can be leaked. Outside of the information security context, a vulnerability refers to any place that uses sensitive information.
[0104] Applied to the performance analysis of configurable systems, this embodiment treats a tainted flow as a path along which a configuration item reaches, and thereby affects, a space-intensive or time-intensive operation.
[0105] (6) Program dependency analysis
[0106] There are multiple ways to define control dependencies and data dependencies in a program; the following uses one of the more intuitive methods:
[0107] (6.1) Control Dependencies
[0108] Definition of control dependency: For any program branch statement S1 and program statement S2, we have:
[0109] If statement S1 is the branch statement that precedes statement S2 and is closest to statement S2, statement S1 has multiple branch targets, and a change in the branch decision of statement S1 may cause statement S2 to not be executed, then statement S2 is said to be control-dependent on statement S1, or statement S2 has a control dependency relationship with statement S1, denoted as S2δc S1.
[0110] (6.2) Data Dependencies
[0111] Data dependencies exist between program statements that access or modify the same resource. Data dependencies include flow dependencies, anti-dependencies, output dependencies, and input dependencies. Among these, flow dependency is the most basic data dependency.
[0112] (6.3) Program Dependency Analysis and Program Dependency Graph
[0113] The purpose of program dependency analysis is to identify control dependencies and data dependencies within a program. In practice, unlike the definition of statements mentioned above, program dependency analysis typically uses basic blocks as the smallest unit.
[0114] A program dependency graph (PDG) is used to describe the control dependencies and data dependencies of a program.
[0115] The example program is shown below:
[0116]
[0117] Figure 3 is the program dependency graph of the example program above, where solid lines represent control dependencies and dashed lines represent data dependencies.
[0118] (7) Random Forest
[0119] Decision trees are a common white-box prediction model used in data mining and machine learning. The structure of a decision tree is a tree-like structure similar to a flowchart, where:
[0120] Each internal node represents a test of a certain attribute;
[0121] Each branch represents the result of the above test;
[0122] Each leaf node represents a class label;
[0123] The path from the root node to the leaf node represents the classification rule. The classification rule of a decision tree is constructed by the decision tree algorithm based on feature vectors and classification labels.
[0124] Decision tree learning is a method for constructing a decision tree from a source dataset. During learning, the dataset is repeatedly split and the tree is grown recursively until no further split is possible or a branch can be assigned to a single class. A decision tree learned this way is prone to overfitting the training set.
[0125] Random forest is a classifier that uses multiple decision trees, and the final output class is determined by the mode of the classes output by the included decision trees. In constructing a random forest, multiple decision trees are randomly built using different portions of the training set.
[0126] Random forests, as a widely used classifier, have the following significant advantages:
[0127] 1) Random forests are less prone to overfitting in many application scenarios;
[0128] 2) When using random forests to process high-dimensional data (i.e., data with many features), feature selection is generally not required;
[0129] 3) For imbalanced classification datasets, random forests can balance the errors.
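The voting rule described above (the output class is the mode of the classes output by the individual trees) can be sketched as follows. The three one-node "trees" and their thresholds are hypothetical stand-ins, not trees learned by any algorithm.

```python
from collections import Counter
from typing import Callable, Sequence

# Each "tree" is modeled as a function from a feature vector to a class label.
Tree = Callable[[Sequence[float]], int]


def forest_predict(trees: Sequence[Tree], x: Sequence[float]) -> int:
    """Return the mode (most common label) of the individual tree outputs."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Three hypothetical one-node trees, each testing a different attribute.
trees = [
    lambda x: 1 if x[0] > 0 else -1,
    lambda x: 1 if x[1] > 2 else -1,
    lambda x: 1 if x[2] > 0 else -1,
]

label = forest_predict(trees, [3, 0, 1])  # individual votes: 1, -1, 1
print(label)
```

Using an odd number of trees avoids ties in a two-class setting such as the one used later in this embodiment.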
[0130] (8) Soot: A framework for analyzing and transforming Java and Android applications
[0131] Soot was initially a Java optimization framework, but it has gradually evolved into a framework for analyzing, instrumenting, optimizing, and visualizing Java and Android applications. Simply put, Soot converts an input program (Java bytecode) into an intermediate representation (IR), analyzes and transforms the IR, and finally converts the processed code into a target form such as Java bytecode for output.
[0132] Using the Soot framework, the following functionalities can be achieved:
[0133] • Construct a call graph;
[0134] • Perform points-to analysis;
[0135] • Build def/use chains (the foundation of data flow analysis, on which data dependencies can be analyzed);
[0136] • Perform template-driven intra-procedural data flow analysis;
[0137] • Perform template-driven inter-procedural data flow analysis;
[0138] • Perform flow-, field-, and context-sensitive pointer analysis.
[0139] (9) FlowDroid: A taint analysis framework for Java and Android applications
[0140] FlowDroid is a context-, flow-, field-, and object-sensitive static taint analysis framework for Java and Android applications. FlowDroid's implementation is based on Soot and Heros, where Heros is a general-purpose multi-threaded solver for IFDS (interprocedural, finite, distributive, subset) problems and IDE (interprocedural distributive environment) problems.
[0141] FlowDroid achieves context and flow sensitivity by constructing a fairly precise call graph, and field and object sensitivity through IFDS-based flow functions. In particular, to ensure context and field sensitivity, FlowDroid implements precise and efficient alias tracking.
[0142] Figure 4 is a practical example of taint analysis in FlowDroid. In Figure 4, the path from 1 to 7 is a path from the source to the sink, found by FlowDroid, that does not pass through a sanitizer. It is easy to see that, in this process, FlowDroid discovered that zgf, agf, and bg are all aliases of xf.
[0143] Because programs often use different names for local variables, class fields, global variables, etc., to refer to the same variable, it's impossible to guarantee which variable name refers to a particular variable when the program is not running. Therefore, static program analysis requires alias analysis to obtain all variable names that refer to a given variable.
[0144] Referring to Figure 1 and Figure 5, this invention provides an embodiment of a method for analyzing the impact of configuration items on software system performance, comprising:
[0145] S100: Identify and mark all performance operations in the software system according to the preset code pattern of the software system, wherein the performance operations are time-intensive and / or space-intensive operations that affect the performance of the software system.
[0146] S200: Identify the dependencies between each performance operation and each configuration item of the software system to obtain a set of performance operations corresponding to each configuration item, wherein each performance operation in the set of performance operations has a dependency relationship with the configuration item;
[0147] S300: Construct a feature vector corresponding to each configuration item based on the set of performance operations;
[0148] S400: Input the feature vectors corresponding to each configuration item into the trained qualitative performance impact model to determine whether the configuration item affects the performance of the software system, and obtain the set of configuration items that affect the performance of the software system. The qualitative performance impact model is trained using the feature vectors corresponding to the configuration items of multiple software systems.
[0149] The method for analyzing the impact of configuration items on software system performance provided in this embodiment is a new white-box configuration performance analysis method. Step S100 identifies and marks all performance operations in the software system according to the preset code pattern of the software system.
[0150] In this embodiment, the performance operation (PerfOp) refers to either a time-intensive operation or a space-intensive operation. The main difference between time-intensive and space-intensive operations is that time-intensive operations require a long time to complete, while space-intensive operations require a large amount of memory, disk, and other resources.
[0151] Understandably, performance operations are strongly correlated with time consumption and space consumption. Note, however, that a time-intensive operation and time complexity are two different concepts; similarly, a space-intensive operation and space complexity are two different concepts. The actual running cost depends on the input size as well as on the complexity. For example, a function f() may have very low time and space complexity, but if an operation o must run f() 1000 times to complete, then operation o may be time-intensive or space-intensive. The time and space complexity of operation o is still determined by the complexity of f(), so it remains low. In other words, an operation with very low time and space complexity may still be a time-intensive or space-intensive operation.
[0152] Referring to Figure 6: based on observation and study of software systems, and taking Java software systems as an example, this embodiment divides performance operations into four categories and summarizes the corresponding code patterns, as shown in Figure 6. In Figure 6, performance operations in Java are divided into Java IO, thread operations, synchronization operations, and array creation. Each type of performance operation has its corresponding code pattern. For example, the code pattern for Java IO is: calling methods in the java.io package and calling methods in the java.nio package.
[0153] It should be noted that different types of software systems involve different performance operations, and the above-mentioned types of performance operations may not cover all software systems. The method proposed in this embodiment has universality and scalability; as long as a new type of performance operation is defined and its code pattern is provided, this new type of performance operation can be supported.
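The idea of marking performance operations by code pattern, and of extending the set of supported categories, can be sketched as a small pattern table. Only the Java IO pattern (calls into java.io and java.nio) is stated in the text; the other three regular expressions, the Jimple-like statement strings, and the names `CODE_PATTERNS`, `classify`, and `register_pattern` are assumptions made for illustration and are not the patterns of Figure 6.

```python
import re
from typing import Optional

# Category -> regular expressions over a simplified, Jimple-like statement text.
# Only the java_io pattern comes from the description; the rest are assumptions.
CODE_PATTERNS = {
    "java_io": [re.compile(r"\bjava\.(io|nio)\.")],
    "thread": [re.compile(r"\bjava\.lang\.Thread\b")],
    "synchronization": [re.compile(r"\b(entermonitor|exitmonitor)\b")],
    "array_creation": [re.compile(r"\bnew(multi)?array\b")],
}


def classify(stmt: str) -> Optional[str]:
    """Return the first performance-operation category whose pattern matches, else None."""
    for category, patterns in CODE_PATTERNS.items():
        if any(p.search(stmt) for p in patterns):
            return category
    return None


def register_pattern(category: str, regex: str) -> None:
    """Support a new kind of performance operation by supplying its code pattern."""
    CODE_PATTERNS.setdefault(category, []).append(re.compile(regex))


statements = [
    "virtualinvoke r1.<java.io.FileInputStream: int read()>()",
    "r2 = newarray (int)[i0]",
    "entermonitor r0",
    "virtualinvoke r3.<java.lang.Thread: void start()>()",
    "i0 = i1 + 1",
]
marked = [(s, classify(s)) for s in statements]

# Extending the table with a hypothetical fifth category.
register_pattern("network", r"\bjava\.net\.")
```

Keeping the patterns in a table rather than in code mirrors the extensibility claim above: a new category only needs a name and a pattern.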
[0154] In step S200, the dependencies between each performance operation and each configuration item of the software system are identified to obtain the set of performance operations corresponding to each configuration item. Every performance operation in a configuration item's set has a dependency relationship with that configuration item.
[0155] In this embodiment, before identifying the dependencies between each performance operation and each configuration item of the software system, the method further includes: extracting configuration item information of the software system, wherein the configuration item information includes at least the name, quantity, and API used when the configuration item is loaded into the software system.
[0156] In this embodiment, any configuration item that affects software system performance has a data dependency or control dependency with certain performance operations. Referring to Figure 7, the configuration item in the program is OptionX, an abstract configuration item that can stand for any configuration item in the program. In Figure 7, the performance operation of creating the array arr has both a data dependency and a control dependency with the configuration item OptionX. Control dependencies include if-branch control dependencies and loop control dependencies, and loop control dependencies in turn include loop-bound control dependencies and loop-step control dependencies.
[0157] It is worth noting that the control dependencies and data dependencies between configuration items and performance operations in a software system are relatively independent, yet they are combined: identifying control dependencies and data dependencies is independent in most cases, but in some cases, a combination of the two is required.
[0158] Taint analysis, as a means of information flow tracing and analysis, is essentially a data flow analysis technique that can be used to identify data dependencies on configuration items in a program. In step S200 of this embodiment, taint analysis is used to trace information flow and identify the data dependencies between performance operations and configuration items in the program. The process for identifying data dependencies is shown in Figure 8; each step is explained in detail below:
[0159] Step 1: Enter the program at its entry point.
[0160] If the program provides multiple program entry points, a virtual entry point is created as the unique program entry point, and there are control flow edges between this virtual entry point and all program entry points.
[0161] Step 2: Traverse the control flow and insert sink markers for performance operations.
[0162] Traverse the control flow, identify the corresponding branch statements and performance operations through code patterns, and insert statements that call the sink function before the corresponding statements.
[0163] Step 3: Return to the program entry point and traverse the control flow again.
[0164] The first two steps are preparatory work (setting the sink point). The third step returns to the program entry point, retraces the control flow, and analyzes the program after the sink point has been inserted.
[0165] Step 4: Create a taint (source) at the configuration loading API.
[0166] The configuration loading API is where the configuration first enters the program, and we will start tracing from there.
[0167] Step 5: Propagate the taint (information flow propagation).
[0168] In simple terms, taint propagation is as follows: the source is the initial taint, and during data propagation, this taint marks other variables as tainted as well. Therefore, the path of taint propagation is the data dependency chain.
[0169] Step 6: Record the sinks reached.
[0170] For a configuration item, the taint created from the configuration loading API eventually reaches the pre-inserted sink through data propagation. This information indicates that the performance operation (or branch statement) at the sink has a data dependency on the configuration item.
[0171] It is understood that the process of identifying the data dependencies between each performance operation and each configuration item of the software system using taint analysis in this embodiment specifically includes: entering the program entry point of the software system, traversing the control flow, creating a taint as the source point at the API loading point of the configuration item; recording the data propagation path of the source point and the final sink point, then the performance operation at the sink point has a data dependency relationship with the configuration item; wherein, the sink point is a program statement that the source point is not expected to reach, and the sink point is pre-set before the statement corresponding to the performance operation.
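The source, propagation, and sink steps above can be illustrated with a deliberately small forward taint propagation over a straight-line list of statements. This is only a sketch under strong simplifying assumptions: the three-form mini-IR, the function name, and the example program are hypothetical, and there is no branching, aliasing, or inter-procedural flow, all of which the embodiment delegates to FlowDroid.

```python
from collections import defaultdict


def taint_analysis(program):
    """Forward taint propagation over a straight-line list of statements.

    Statement forms (a hypothetical mini-IR):
      ("source", var, option)        var is loaded through the configuration loading API
      ("assign", var, [operands])    var is computed from the operands
      ("sink", perf_op, [operands])  sink marker inserted before a performance operation
    Returns {option: set of performance operations with a data dependency on it}.
    """
    taint = defaultdict(set)    # variable -> options whose taint it carries
    reached = defaultdict(set)  # option -> performance operations reached
    for kind, name, arg in program:
        if kind == "source":
            taint[name] = {arg}
        elif kind == "assign":
            # The target carries exactly the taint of its operands (old taint is overwritten).
            taint[name] = set().union(*(taint[v] for v in arg))
        elif kind == "sink":
            for v in arg:
                for option in taint[v]:
                    reached[option].add(name)
    return dict(reached)


program = [
    ("source", "size", "OptionX"),        # size = <config loading API>("OptionX")
    ("assign", "n", ["size"]),            # n = size * 2
    ("assign", "m", []),                  # m = 16 (constant, untainted)
    ("sink", "newarray@L4", ["n"]),       # arr = new int[n]
    ("sink", "java.io.write@L5", ["m"]),  # out.write(m)
]
result = taint_analysis(program)
print(result)
```

In this example the taint created for OptionX reaches only the array creation, so only that performance operation is recorded as data-dependent on the configuration item.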
[0172] It's worth noting that control flow refers to the execution order of statements, instructions, or functions within a codebase, and it's essential to consider control flow when identifying data dependencies. Only through control flow can we understand the execution order of different parts of the program's code.
[0173] In step S200 of this embodiment, the control dependencies between performance operations and configuration items in the program are identified by constructing the control region of each configuration item.
[0174] Specifically, the control region of a configuration item is a sequence of statements that have a control dependency relationship with that configuration item, in which, in control flow order, each statement's successor in the sequence is its immediate postdominator.
[0175] In a control flow graph, the postdominance relationship is defined as follows: for control flow nodes n and m, if every path that starts at the program entry point and passes through n must pass through m to reach the program exit point, then node m is said to postdominate node n. If node m postdominates node n but does not postdominate any other node that postdominates n, then node m is said to directly postdominate node n, and node m is the immediate postdominator of node n.
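To make the postdominance definition concrete, the following sketch computes postdominator sets with the standard iterative data flow formulation and then picks the immediate postdominator exactly as defined above. The graph, node names, and function names are hypothetical; the sketch assumes a single exit node and that every other node has at least one successor.

```python
def postdominators(succ, exit_node):
    """Compute, for every node, the set of nodes that postdominate it.

    succ maps each node to its list of successors in the control flow graph.
    """
    nodes = set(succ)
    pdom = {n: set(nodes) for n in nodes}  # start from "everything postdominates"
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            # A node is postdominated by itself and by whatever postdominates all successors.
            new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom


def immediate_postdominator(pdom, n):
    """Return the postdominator of n that postdominates no other postdominator of n."""
    strict = pdom[n] - {n}
    for m in strict:
        if all(m not in pdom[other] for other in strict - {m}):
            return m
    return None


# A diamond-shaped graph: a branch with two arms that rejoin before the exit.
succ = {
    "entry": ["branch"],
    "branch": ["then", "else"],
    "then": ["join"],
    "else": ["join"],
    "join": ["exit"],
    "exit": [],
}
pdom = postdominators(succ, "exit")
print(immediate_postdominator(pdom, "branch"))
```

For the branch node, neither arm postdominates it because each can be bypassed through the other, so its immediate postdominator is the join node.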
[0176] Intuitively, for a specific configuration item, its control region is the part of the program that is directly controlled by that configuration item. Figure 9 marks the control regions of four configuration items: Option A, Option B, Option C, and Option D.
[0177] In this embodiment, identifying the control dependencies between each performance operation and each configuration item of the software system using the program dependency graph includes: traversing all nodes in the program dependency graph, constructing the control region of each configuration item, where the control region of a configuration item is a sequence of statements that has a direct control dependency relationship with that configuration item; and identifying the control dependencies between each performance operation and each configuration item of the software system based on the control region of each configuration item.
[0178] Specifically, the process of identifying the control dependencies between performance operations and various configuration items of the software system is as follows:
[0179] 1) Construct a program dependency graph using program dependency analysis techniques;
[0180] 2) Traverse all nodes in the program dependency graph and construct the control region of each configuration item;
[0181] 3) For a given configuration item, once its control region has been constructed, the performance operations within the control region fall into two types: those that have both data dependencies and control dependencies on the configuration item, and those that have control dependencies but no data dependencies on the configuration item.
[0182] It's worth noting that while control flow is used to analyze control dependencies, it's only possible to analyze a subset of them. Specifically, simple control flow can only analyze performance operations that have both data and control dependencies with configuration items. Performance operations that have no data dependencies but do have control dependencies with configuration items require constructing the control region of the configuration item to complete the analysis.
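One way to picture the control region and the two types of performance operations inside it is the sketch below. It assumes that the region of a configuration-dependent branch consists of the nodes reachable from the branch before its immediate postdominator is reached; this boundary is an interpretation of paragraph [0174], not a statement of the actual implementation. The graph, node names, and the set of data-dependent operations are hypothetical inputs.

```python
def control_region(succ, branch, ipdom):
    """Collect nodes reachable from `branch` without passing through `ipdom`.

    succ maps each node to its successors; ipdom is the immediate postdominator
    of the branch node and is supplied by the caller (assumed known).
    """
    region, stack = set(), list(succ[branch])
    while stack:
        node = stack.pop()
        if node == ipdom or node in region:
            continue
        region.add(node)
        stack.extend(succ[node])
    return region


# A branch on a hypothetical configuration item, with one arm doing real work.
succ = {
    "if_optA": ["write_log", "skip"],
    "write_log": ["new_buf"],
    "new_buf": ["join"],
    "skip": ["join"],
    "join": [],
}
region = control_region(succ, "if_optA", "join")

perf_ops = {"write_log", "new_buf"}  # statements marked as performance operations
data_dependent = {"new_buf"}         # assumed result of the taint analysis step

both = perf_ops & region & data_dependent          # data and control dependency
control_only = (perf_ops & region) - data_dependent  # control dependency only
print(both, control_only)
```

The second set is the one that taint analysis alone would miss, which is why the control region is needed.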
[0183] In step S300, feature vectors corresponding to each configuration item are constructed based on the performance operation set.
[0184] For any configuration item, there is a known set of performance operations that have data or control dependencies with it. It can be understood that the set of performance operations for a configuration item can be viewed as a specific set of code, identified through specific code patterns.
[0185] After obtaining the set of performance operations corresponding to the configuration item, a corresponding feature vector is constructed for the configuration item based on the set of performance operations. That is, the feature vector is used to describe the dependency relationship between the configuration item and the performance operation.
[0186] Constructing the feature vector is fairly involved; the following explains how some of its dimensions are constructed:
[0187] Let the set of configuration items be Options, and the four different sets of performance operations proposed in this embodiment be PerfOps. For any configuration item option∈Options and performance operation perfOp∈PerfOps, let the function f(option,perfOp) represent the number of performance operations perfOp that have data dependency or control dependency with the configuration item option in the target software system.
[0188] Additionally, let perfOp_data, perfOp_if, and perfOp_loop denote the perfOp instances that have a data dependency, an if-branch control dependency, and a loop control dependency with the configuration item option, respectively. For k∈{data,if,loop}, let the function g(option,perfOp_k) represent the number of performance operations perfOp that have a dependency relationship of kind k with the configuration item option in the target software system.
[0189] The first 22 dimensions of the feature vector v are as follows:
[0190]
[0191]
[0192]
[0193]
[0194]
[0195]
[0196]
[0197] Here, i and j are only used to represent subscript counts. For example, when i = 0, the first formula v0 = f(option, PerfOps0) indicates that the first dimension of the feature vector is the number of operations in which option has a dependency relationship with the first type of performance operation Java IO.
[0198] As for the data dependency term, to illustrate with an example: when i = 0, it indicates a data dependency with the first type of performance operation, Java IO. Similarly, for the if-branch term, when i = 0 it indicates an if-branch control dependency with the first type of performance operation, Java IO; and for the loop term, when i = 0 it indicates a loop control dependency with the first type of performance operation, Java IO.
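The counting functions f and g defined above can be sketched as follows. The sketch covers only the counts that the text defines (four f terms and twelve g terms, 16 values in total); the formulas for the remaining dimensions are not reproduced in this text, and the ordering of the values, the category names, and the input format are assumptions made for illustration.

```python
CATEGORIES = ["java_io", "thread", "synchronization", "array_creation"]
KINDS = ["data", "if", "loop"]


def f(deps, category):
    """Number of performance operations of `category` with any dependency on the option."""
    return sum(1 for cat, kinds in deps.values() if cat == category and kinds)


def g(deps, category, kind):
    """Number of performance operations of `category` with a dependency of `kind`."""
    return sum(1 for cat, kinds in deps.values() if cat == category and kind in kinds)


def feature_prefix(deps):
    """Concatenate the f counts, then the g counts for each dependency kind (assumed layout)."""
    v = [f(deps, c) for c in CATEGORIES]
    for k in KINDS:
        v += [g(deps, c, k) for c in CATEGORIES]
    return v


# Hypothetical analysis result for one option:
# performance operation id -> (category, kinds of dependency on the option)
deps = {
    "op1": ("java_io", {"data"}),
    "op2": ("java_io", {"if", "loop"}),
    "op3": ("array_creation", {"data", "if"}),
}
v = feature_prefix(deps)
print(v)
```

Note that an operation with several kinds of dependency, such as `op3`, is counted once by f but once per kind by g.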
[0199] In step S400, the feature vectors corresponding to each configuration item are input into the trained qualitative performance impact model to determine whether the configuration item affects the performance of the software system, thereby obtaining a set of configuration items that affect the performance of the software system. The qualitative performance impact model is trained using the feature vectors corresponding to the configuration items of multiple software systems.
[0200] In this embodiment, a qualitative performance impact model for configuration items needs to be constructed. Then, for any new target software system, after calculating the corresponding feature vector for each configuration item using the aforementioned method, the feature vector is input into the constructed qualitative performance impact model. This automatically determines whether the configuration item affects the software system's performance, adding all configuration items that affect the software system's performance to the configuration item set, ultimately obtaining all configuration items that affect the software system's performance. The specific process is as follows:
[0201] First, several software systems are collected to construct the training set. For each software system, it is run multiple times under different configurations, and the actual running time is recorded to determine whether each configuration item of the software system actually affects the system's performance. Then, the feature vector corresponding to each configuration item in the software system is constructed, thus obtaining the training set data.
[0202] Then, this embodiment establishes a qualitative performance impact model for configuration items, as shown in Figure 10. This model consists of two parts: a random forest classification model and the cDEP configuration item dependency detector. The random forest model is trained using the random forest algorithm from the sklearn library: the feature vectors are divided into a training set and a test set, which are used to construct the RandomForestClassifier classification model.
[0203] In this embodiment, a feature vector is constructed for each configuration item. The dimension of this feature vector is positively correlated with the number of performance operations of each class that have occurred in the target software system. Therefore, the feature vector corresponding to each configuration item usually has a high dimension. In addition, since each software system has different functions, the frequency and characteristics of performance operations vary significantly for different types of software systems (e.g., compute-intensive or memory-intensive), and the data in the constructed training set may not be balanced.
[0204] Since random forests have advantages such as being less prone to overfitting, being able to handle high-dimensional data, and being able to balance the errors of classification datasets, they can solve the above problems. Furthermore, random forests can be used to learn several interpretable classification rules. Therefore, this embodiment uses a random forest classification model to perform binary classification on "whether configuration items affect the performance of the software system", thus providing a preliminary qualitative answer to the question of the impact of configuration items on the performance of the software system.
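The training step can be sketched with scikit-learn as below. The data are synthetic stand-ins, not the experimental data: the 50-dimensional vectors follow the dimension mentioned in paragraph [0234], while the sample count, the labeling rule, the split ratio, and the hyperparameters (including `class_weight="balanced"` as one possible way to handle imbalanced labels) are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for feature vectors: 200 configuration items x 50 count-valued dimensions.
X = rng.poisson(1.0, size=(200, 50))
# Synthetic labels (rule invented for the demo): 1 = affects performance, -1 = does not.
y = np.where(X[:, 0] + X[:, 1] >= 3, 1, -1)

# Divide the feature vectors into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# class_weight="balanced" reweights classes inversely to their frequency (an assumption here).
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
accuracy = float((y_pred == y_test).mean())
print(f"test accuracy: {accuracy:.2%}")
```

After fitting, `clf.feature_importances_` and the individual trees in `clf.estimators_` can be inspected, which is one way to examine the classification rules the paragraph above refers to.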
[0205] This embodiment has so far not considered the dependencies between configuration items, treating each configuration item as an independent one. However, in reality, there may be dependencies between configuration items in a software system, and when adjusting the configuration, dependent configuration items usually need to be considered together.
[0206] This embodiment assumes that if configuration option A depends on configuration option B, and configuration option A affects the performance of the software system, then option B also affects the performance of the software system.
[0207] cDEP is a tool for detecting dependencies between configuration items, proposed by Qingrong Chen et al. in 2020. To take into account the dependencies between configuration items, this embodiment uses cDEP to detect the dependencies between configuration items in the software system, further refining the classification results of the random forest classification model.
[0208] In this embodiment, the configuration item dependency detector corrects the classification results of the random forest classification model as follows:
[0209] When the first configuration item of the software system depends on the second configuration item, if the random forest classification model determines that the first configuration item affects the performance of the software system and the second configuration item does not affect the performance of the software system, then the cDEP configuration item dependency detector will correct the second configuration item to affect the performance of the software system.
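The correction rule above can be sketched as a single pass over the detected dependency pairs. The function name and the example labels are hypothetical, and the dependency pairs are assumed to have been produced beforehand by the dependency detector; the sketch does not call cDEP itself.

```python
def correct_with_dependencies(rf_labels, depends_on):
    """Apply the correction rule to random forest labels.

    rf_labels:  {configuration item: 1 (affects performance) or -1 (does not)}
    depends_on: iterable of (first, second) pairs meaning "first depends on second"
    """
    corrected = dict(rf_labels)
    for first, second in depends_on:
        # The condition is evaluated on the original random forest labels.
        if rf_labels[first] == 1 and rf_labels[second] == -1:
            corrected[second] = 1
    return corrected


rf_labels = {"A": 1, "B": -1, "C": -1}
depends_on = [("A", "B"), ("C", "A")]  # A depends on B; C depends on A

corrected = correct_with_dependencies(rf_labels, depends_on)
affecting = {item for item, label in corrected.items() if label == 1}
print(sorted(affecting))
```

Because the condition uses the original labels, a correction is not chained through further dependencies; whether such chaining is intended is not stated in the text.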
[0210] The method for analyzing the impact of configuration items on software system performance provided in this invention employs program analysis techniques to trace time-intensive or space-intensive operations that have dependencies on configuration items. Based on the program analysis results, it constructs corresponding feature vectors for each configuration item, achieving fine-grained analysis down to the configuration item level. It is not limited to Boolean types or exhaustively enumerating finite numerical types, supporting configuration items of any type. A qualitative performance impact model is established using random forests, eliminating the need for time-consuming local measurement operations. Only a single analysis of the software system's source code is required to determine whether a specific configuration item affects the performance of the configurable system, significantly reducing the overhead of performance analysis. It can accurately predict whether each configuration item affects the software system's performance without running the software system, identifying the set of configuration items that truly impact software system performance, improving the efficiency of software system configuration, and facilitating correct configuration of the software system to improve its performance. Furthermore, this invention is interpretable, allowing understanding of the underlying reasons for the performance impact of configuration items through program analysis results and the classification rules of the performance model.
[0211] Referring to Figure 2, the present invention also provides an embodiment of a system for analyzing the impact of configuration items on software system performance, comprising:
[0212] The performance operation identification module 11 is used to identify and mark all performance operations in the software system according to the preset code pattern of the software system. The performance operations are time-intensive and / or space-intensive operations that affect the performance of the software system.
[0213] The dependency identification module 22 is used to identify the dependency relationship between each of the performance operations and each of the configuration items of the software system, and to obtain a set of performance operations corresponding to each of the configuration items. Each performance operation in the set of performance operations has a dependency relationship with the corresponding configuration item.
[0214] Feature vector construction module 33 is used to construct feature vectors corresponding to each configuration item based on the performance operation set;
[0215] The configuration item set determination module 44 is used to input the feature vectors corresponding to each configuration item into the trained qualitative performance impact model, determine whether the configuration item affects the performance of the software system, and obtain the configuration item set that affects the performance of the software system. The qualitative performance impact model is trained using the feature vectors corresponding to the configuration items of multiple software systems.
[0216] The system for analyzing the impact of configuration items on software system performance provided in this invention employs program analysis techniques to track time-intensive or space-intensive operations that have dependencies on configuration items. Based on the program analysis results, it constructs corresponding feature vectors for each configuration item, achieving fine-grained analysis down to the configuration item level. It is not limited to Boolean types or exhaustively enumerating finite numerical types, supporting configuration items of any type. A qualitative performance impact model is built using random forests, eliminating the need for time-consuming local measurement operations. Only a single analysis of the software system's source code is required to determine whether a specific configuration item affects the performance of the configurable system, significantly reducing the overhead of performance analysis. It can accurately predict whether each configuration item affects the software system's performance without running the software system, identifying the set of configuration items that truly impact software system performance, improving the efficiency of software system configuration, and facilitating correct configuration of the software system to improve its performance. Furthermore, this invention is interpretable, allowing understanding of the underlying reasons for the performance impact of configuration items through program analysis results and the classification rules of the performance model.
[0217] In addition, based on the white-box performance analysis method based on program analysis proposed above, this invention designs and implements ConfigAnalyzer, a configuration analysis tool for Java applications.
[0218] The ConfigAnalyzer tool implements the white-box performance analysis method proposed in this invention and supports Java software systems. ConfigAnalyzer uses FlowDroid to perform context-, flow-, field-, and object-sensitive static taint analysis. It also makes the necessary custom extensions to the Soot analysis framework, on which FlowDroid is based, to support the analysis logic required by ConfigAnalyzer.
[0219] ConfigAnalyzer consists of the following two main modules:
[0220] 1) Program Analysis Module: Implements the marking of performance operations and the identification of dependencies between configuration items and performance operations;
[0221] 2) Performance Model Module: Implements the construction of feature vectors and of the qualitative performance impact model for configuration items. The performance model module consists of two parts: constructing feature vectors from the results of the program analysis module, and constructing the qualitative performance impact model from the feature vectors.
[0222] The program analysis module of the ConfigAnalyzer tool is shown in Figure 11. It contains three packages, whose functions are as follows:
[0223] 1) edu.sysu.dds.analysis: Implements taint analysis, program dependency analysis, and information extraction, etc.
[0224] 2) edu.sysu.dds.visual: Enables visualization of intermediate results;
[0225] 3) edu.sysu.dds.utility: Provides utilities for the two packages above.
[0226] Information extraction is mainly used to extract configuration item information of the software system, such as what configuration items are available, the number of configuration items, and the APIs used by the configuration items to load into the software system.
[0227] Visualization of intermediate results mainly includes visualization of the control regions of configuration items (as shown in Figure 9), visualization of the inserted sink markers, etc.
[0228] This invention references existing research on white-box performance analysis methods and, based on a balance between evaluation effectiveness and workload, selects six representative software systems from 19 existing real-world software systems to evaluate the qualitative performance impact model established by ConfigAnalyzer. Table 1 provides an overview of the target software systems.
[0229] Table 1
[0230]
[0231] It is worth noting that an effective configuration refers to a configuration that enables the software system to execute correctly without crashing. These configurations allow the software system to complete the corresponding tasks, but the resources required to complete the tasks differ under different configurations.
[0232] Consider 5 configuration items, all of Boolean type, each taking a value from {false, true}. These 5 configuration items alone can constitute 2^5 = 32 configurations. Ten Boolean configuration items can form 2^10 = 1024 configurations. If the configuration items are not Boolean, the range of values will be even larger, resulting in a very large total number of configurations.
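The growth of the configuration space can be checked with a short sketch. The item names (opt0, opt1, ..., level) and the helper functions are hypothetical and only serve to reproduce the arithmetic in the preceding paragraph.

```python
from itertools import product
from math import prod


def configuration_count(domains):
    """Number of configurations: the product of the domain sizes."""
    return prod(len(values) for values in domains.values())


def enumerate_configurations(domains):
    """List every configuration as a dict mapping item name to value."""
    names = list(domains)
    combos = product(*(domains[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Hypothetical Boolean configuration items.
five_booleans = {f"opt{i}": [False, True] for i in range(5)}
ten_booleans = {f"opt{i}": [False, True] for i in range(10)}

# Adding a single hypothetical three-valued item triples the space.
mixed = dict(five_booleans, level=[1, 2, 3])
```

Here configuration_count(five_booleans) is 32 and configuration_count(ten_booleans) is 1024, while the mixed example already has 96 configurations.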
[0233] In the experiment of this invention, the six target software systems were divided into two groups. The data corresponding to all configuration items of four software systems (Batik, H2, Kanzi, and Prevayler) were used as the training set, and the samples corresponding to all configuration items of two software systems (Catena and Sunflow) were used as the test set.
[0234] The feature vector constructed from the program analysis results has 50 dimensions. A random forest classification model is built on these feature vectors, and cDEP is then run to refine the results according to the dependencies between configuration items.
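The correction rule applied by the configuration item dependency detector is stated in claim 1: when a first configuration item depends on a second one, and the classifier labels the first as affecting performance but the second as not affecting it, the second is corrected to "affects performance". A minimal sketch of that rule follows. The input format, the function name, and the decision to repeat the rule until no label changes (so that chains of dependencies are covered) are assumptions of this sketch, not the interface of cDEP.

```python
AFFECTS, NO_EFFECT = 1, -1


def correct_labels(labels, depends_on):
    """Apply the dependency correction rule to classifier output.

    labels:      dict mapping configuration item -> 1 (affects) or -1 (no effect)
    depends_on:  list of (first, second) pairs, meaning `first` depends on `second`
    """
    corrected = dict(labels)
    changed = True
    # Assumption: repeat until stable so corrections propagate along chains.
    while changed:
        changed = False
        for first, second in depends_on:
            if corrected[first] == AFFECTS and corrected[second] == NO_EFFECT:
                corrected[second] = AFFECTS
                changed = True
    return corrected


# Hypothetical classifier output and dependency pairs.
predicted = {"a": 1, "b": -1, "c": -1, "d": -1}
dependencies = [("b", "c"), ("a", "b")]
refined = correct_labels(predicted, dependencies)
```

In this example, item a affects performance and depends on b, so b is corrected; b in turn depends on c, so c is corrected as well, while d is left unchanged.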
[0235] In this embodiment, after the program analysis module of the ConfigAnalyzer tool generates the feature vector of a configuration item, the feature vector is input into the random forest model. The random forest consists of many decision trees; each decision tree processes the feature vector and applies its own classification rules to produce a predicted classification label.
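How a forest of decision trees turns one feature vector into one label can be sketched as follows. The three hand-written trees, their thresholds, and the 3-dimensional feature vectors are hypothetical stand-ins for trained trees and the 50-dimensional vectors; combining the per-tree labels by majority vote is the usual random forest convention and is assumed here, since the text does not spell out the aggregation step.

```python
from collections import Counter


# Hypothetical decision trees, each reduced to a single classification rule.
def tree_a(x):
    return 1 if x[0] > 0 else -1


def tree_b(x):
    return 1 if x[1] >= 2 else -1


def tree_c(x):
    return 1 if x[0] + x[2] > 1 else -1


FOREST = [tree_a, tree_b, tree_c]


def forest_predict(x, forest=FOREST):
    """Let every tree classify x, then return the most common label."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]
```

Because each tree is an explicit rule, the path that produced a label can be read back, which is the basis of the interpretability described for the performance model.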
[0236] The experimental results are shown in Table 2, which compares the predicted and actual impacts of the configuration items of the tested software systems under the qualitative performance model. Here, y represents the actual classification of the configuration item, and y_predict represents the model's predicted classification for this configuration item. A classification of -1 indicates that the configuration item does not affect the software system performance, while a classification of 1 indicates that the configuration item does affect the software system performance.
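The accuracy reported for such a comparison is the fraction of configuration items whose predicted classification equals the actual one. The sketch below shows the calculation; the label lists are hypothetical and do not reproduce Table 2.

```python
def accuracy(y, y_predict):
    """Fraction of items whose predicted label equals the actual label."""
    if not y or len(y) != len(y_predict):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for actual, predicted in zip(y, y_predict) if actual == predicted)
    return correct / len(y)


# Hypothetical labels: 1 = affects performance, -1 = does not.
y_actual = [1, -1, 1, 1, -1]
y_predicted = [1, -1, -1, 1, -1]
score = accuracy(y_actual, y_predicted)  # 4 of 5 labels match
```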
[0237] Table 2
[0238]
[0239] Experimental results show that ConfigAnalyzer accurately predicted whether 84.21% of the configuration items in the tested software system would affect performance without running the program, demonstrating high accuracy. This shows that ConfigAnalyzer can indeed effectively establish a qualitative performance impact model for software systems.
[0240] This invention provides an analysis system for the impact of configuration items on software system performance, and implements the ConfigAnalyzer tool, which is a configuration analysis tool for Java applications.
[0241] ConfigAnalyzer first uses program analysis techniques such as taint analysis and program control analysis to statically track time- or space-intensive operations that have dependencies on configuration items. Then, based on the results of the program analysis, ConfigAnalyzer constructs feature vectors and uses random forests to build a qualitative performance impact model.
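The idea of statically tracking which performance operations depend on a configuration item can be illustrated with a toy taint propagation. The data-flow edges, the source statements, and the SINK: prefix used to mark sinks placed before performance operations are all hypothetical; an actual static taint analysis works on the program's intermediate representation rather than on strings.

```python
from collections import deque

# Hypothetical data-flow edges: value -> values it flows into.
FLOWS = {
    "conf.get('buffer.size')": ["size"],
    "size": ["bufLen"],
    "bufLen": ["SINK:new byte[bufLen]"],
    "conf.get('log.name')": ["name"],
    "name": ["print(name)"],
}

# Hypothetical taint sources: the call that loads each configuration item.
SOURCES = {
    "buffer.size": "conf.get('buffer.size')",
    "log.name": "conf.get('log.name')",
}


def reached_sinks(item):
    """Propagate taint from the item's source and collect the sinks it reaches."""
    start = SOURCES[item]
    seen = {start}
    queue = deque([start])
    sinks = []
    while queue:
        node = queue.popleft()
        if node.startswith("SINK:"):
            sinks.append(node[len("SINK:"):])
        for successor in FLOWS.get(node, []):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return sinks
```

In this sketch, buffer.size reaches the allocation sink and therefore has a data dependency on that performance operation, while log.name reaches no sink.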
[0242] ConfigAnalyzer helps users discover the set of configuration items that truly impact system performance without running the software system. Unlike traditional black-box methods, ConfigAnalyzer is interpretable; users can understand the underlying reasons why configuration items affect performance through the program's analysis results and the performance model's classification rules. Unlike existing white-box methods, ConfigAnalyzer supports any type of configuration item, and because it builds a qualitative performance impact model, it eliminates the need for time-consuming local measurement operations, significantly reducing analysis overhead.
[0243] The ConfigAnalyzer tool of the present invention has the following advantages:
[0244] (1) Explainable
[0245] ConfigAnalyzer first uses program analysis techniques such as taint analysis and program control analysis to statically track time- or space-intensive operations that have dependencies on configuration items. Then, based on the program analysis results, ConfigAnalyzer constructs feature vectors and uses random forests to build a qualitative performance impact model. Unlike traditional black-box methods, ConfigAnalyzer is interpretable; users can understand the underlying reasons why configuration items affect performance through the program analysis results and the classification rules of the performance model.
[0246] (2) Accuracy
[0247] Experimental results show that ConfigAnalyzer accurately predicted whether 84.21% of the configuration items in the tested software system would affect performance without running the program, and it can indeed effectively establish a qualitative performance impact model for the software system.
[0248] (3) Analysis granularity
[0249] This invention can accurately determine whether a specific configuration item affects the performance of a configurable system, unlike other testing methods that treat the software system as a black box, sample the configuration space to obtain a subset of configurations, and measure the system's performance under a specific workload for each configuration within that subset. These testing methods can only determine whether a single configuration affects the performance of a configurable system, but cannot provide fine-grained analysis down to the configuration item level.
[0250] (4) Efficiency and completeness
[0251] Existing white-box performance analysis methods support only configuration items of Boolean type or of enumerable finite numerical type (the latter must be discretized into several Boolean configuration items), which is a significant limitation. Discretization drastically increases the number of configuration items, leading to an exponential increase in tool runtime. This invention, by contrast, requires only a single analysis of the configurable system's source code to determine whether a specific configuration item affects the system's performance. The configuration item types can cover all types allowed by Java programs and are not limited to Boolean or enumerable finite numerical types. The invention does not require hardware to run the software system, does not require building an execution environment for the configurable system, and does not incur the testing overhead of running under different configurations and specific workloads.
[0252] This invention provides the ConfigAnalyzer tool. ConfigAnalyzer analyzes the source code of a target system to discover the set of configuration items that truly affect system performance, regardless of their type. Unlike traditional black-box approaches that construct precise software system performance impact models, ConfigAnalyzer is interpretable. Users can leverage the program analysis results and qualitative performance model classification rules generated by ConfigAnalyzer, combined with the analysis of the relationship between configuration items and specific performance operations, to further understand the root cause of each configuration item's impact on system performance.
[0253] Unlike existing white-box approaches to building performance impact models, ConfigAnalyzer eliminates the overhead of hardware, software environment, time, and energy consumption required to run configurable software systems, thus overcoming the limitations of configuration item types and significantly reducing the cost of analyzing the relationship between configuration items and software system performance.
[0254] The qualitative performance impact model established by ConfigAnalyzer was evaluated through experiments. The results show that ConfigAnalyzer accurately predicted whether 84.21% of the configuration items in the tested software system would affect performance without running the program. It can indeed effectively establish a qualitative performance impact model for the software system, demonstrating that ConfigAnalyzer has superior accuracy.
[0255] It is worth noting that this invention includes, but is not limited to, the specific embodiments described above. All technical solutions that conform to the concept of this invention fall within the protection scope of this invention. For example, the following content also falls within the protection scope of this invention:
[0256] (1) By replacing the static taint analysis used in this invention with dynamic taint analysis, and by combining program testing and achieving a high level of test coverage (80%-99%), the results output by the program analysis module in this invention can also be obtained and used as input to the performance model module.
[0257] (2) Replacing the random forest classification model of the performance model module in this invention with any classification model can classify the configuration items. This may result in the loss of some interpretability of the classification results, but it does not affect the generation of the classification results.
[0258] (3) It is also possible to determine whether a configuration item affects the performance of a configurable system by using only configuration item sampling and program testing. Each configuration item is sampled, and then the sampling results of each configuration item are subjected to a Cartesian product operation to obtain a subset of the configuration space (at this time, there is only one configuration item with a different value among the configurations in the subset, and all other configuration items are set to the same value). Program performance testing is performed on each configuration in the subset. By analyzing the program performance test results between different configurations, it can be determined whether a configuration item affects the performance or behavior of the configurable system.
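Alternative (3) can be sketched as one-at-a-time sampling around a baseline configuration, which matches the statement that configurations in the subset differ in only one configuration item. The baseline, the sampled values, the measure callback, and the 5% tolerance are hypothetical choices made for this sketch; in practice measure would run a real performance test.

```python
def one_at_a_time(baseline, samples):
    """Build configurations that differ from the baseline in exactly one item."""
    configs = [dict(baseline)]
    for item, values in samples.items():
        for value in values:
            if value != baseline[item]:
                config = dict(baseline)
                config[item] = value
                configs.append(config)
    return configs


def item_affects(item, baseline, samples, measure, tolerance=0.05):
    """Report whether changing `item` alone shifts the measurement noticeably."""
    base = measure(baseline)
    for value in samples[item]:
        if value == baseline[item]:
            continue
        config = dict(baseline)
        config[item] = value
        if abs(measure(config) - base) > tolerance * abs(base):
            return True
    return False


# Hypothetical stand-in for a performance test: runtime in milliseconds.
def fake_measure(config):
    return 100 + (50 if config["cache"] else 0)


baseline = {"cache": False, "verbose": False}
samples = {"cache": [False, True], "verbose": [False, True]}
```

With these inputs, the subset contains the baseline plus one configuration per changed item, and only the cache item is reported as affecting performance.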
[0259] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above correspond to the processes in the foregoing method embodiments, and will not be repeated here.
[0260] In the embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection between devices or units through some interfaces, and may be electrical, mechanical, or other forms.
[0261] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0262] Furthermore, the functional units in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0263] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0264] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A method for analyzing the impact of configuration items on software system performance, characterized in that it comprises: identifying and marking all performance operations in the software system based on preset code patterns of the software system, wherein the performance operations are time-intensive and/or space-intensive operations that affect the performance of the software system; identifying the dependencies between each performance operation and each configuration item of the software system to obtain a set of performance operations corresponding to each configuration item, wherein each performance operation in the set of performance operations has a dependency relationship with the configuration item; constructing a feature vector corresponding to each configuration item based on the set of performance operations; and inputting the feature vector corresponding to each configuration item into a trained qualitative performance impact model to determine whether the configuration item affects the performance of the software system, thereby obtaining a set of configuration items that affect the performance of the software system; wherein the qualitative performance impact model is trained using the feature vectors corresponding to the configuration items of multiple software systems; the qualitative performance impact model includes a random forest classification model and a configuration item dependency detector; the random forest classification model performs binary classification on whether configuration items affect the performance of the software system, and the configuration item dependency detector corrects the classification results of the random forest classification model;
the correction of the classification results of the random forest classification model by the configuration item dependency detector includes: when a first configuration item of the software system depends on a second configuration item, if the random forest classification model determines that the first configuration item affects the performance of the software system and the second configuration item does not affect the performance of the software system, the configuration item dependency detector corrects the classification of the second configuration item to affecting the performance of the software system.
2. The method for analyzing the impact of configuration items on software system performance according to claim 1, characterized in that the dependencies include data dependencies and control dependencies, wherein a data dependency is a dependency arising from data flow, and a control dependency is a dependency arising from program control flow.
3. The method for analyzing the impact of configuration items on software system performance according to claim 2, characterized in that identifying the dependencies between each of the performance operations and each configuration item of the software system includes: using taint analysis to identify the data dependencies between each performance operation and each configuration item of the software system; and using a program dependency graph to identify the control dependencies between each performance operation and each configuration item of the software system, wherein the program dependency graph is constructed using program dependency analysis techniques and is used to describe the control dependencies and data dependencies of the program.
4. The method for analyzing the impact of configuration items on software system performance according to claim 3, characterized in that using taint analysis to identify data dependencies between each performance operation and each configuration item of the software system includes: starting from the program entry point of the software system, traversing the control flow and creating a taint as the source point at the location where the API loads the configuration item; and recording the data propagation path from the source point and the destination point (sink) finally reached, wherein the performance operation at the destination point has a data dependency on the configuration item, the destination point is a program statement that the source point is not expected to reach, and the destination point is preset before the statement corresponding to the performance operation.
5. The method for analyzing the impact of configuration items on software system performance according to claim 3, characterized in that identifying the control dependencies between each performance operation and each configuration item of the software system using a program dependency graph includes: traversing all nodes in the program dependency graph and constructing the control region of each configuration item, wherein the control region of each configuration item is a sequence of statements that has a direct control dependency relationship with the configuration item; and identifying the control dependencies between each performance operation and each configuration item of the software system based on the control region of each configuration item.
6. The method for analyzing the impact of configuration items on software system performance according to claim 1, characterized in that the training process of the random forest classification model includes: dividing the feature vectors corresponding to the configuration items of multiple software systems into a training set and a test set; and training the random forest classification model based on the training set and the random forest algorithm.
7. The method for analyzing the impact of configuration items on software system performance according to any one of claims 1-6, characterized in that, before identifying the dependencies between each of the performance operations and the configuration items of the software system, the method further includes: extracting the configuration item information of the software system, wherein the configuration item information includes at least the names of the configuration items, the number of configuration items, and the API used when a configuration item is loaded into the software system.
8. A system for analyzing the impact of configuration items on software system performance, characterized in that it comprises: a performance operation identification module, used to identify and mark all performance operations in the software system according to preset code patterns of the software system, wherein the performance operations are time-intensive and/or space-intensive operations that affect the performance of the software system; a dependency identification module, used to identify the dependency relationship between each of the performance operations and each of the configuration items of the software system, and to obtain a set of performance operations corresponding to each of the configuration items, wherein each performance operation in the set of performance operations has a dependency relationship with the configuration item; a feature vector construction module, used to construct the feature vector corresponding to each of the configuration items based on the set of performance operations; and a configuration item set determination module, used to input the feature vector corresponding to each configuration item into a trained qualitative performance impact model, determine whether the configuration item affects the performance of the software system, and obtain the set of configuration items that affect the performance of the software system; wherein the qualitative performance impact model is trained using the feature vectors corresponding to the configuration items of multiple software systems; the qualitative performance impact model includes a random forest classification model and a configuration item dependency detector; the random forest classification model performs binary classification on whether configuration items affect the performance of the software system, and the configuration item dependency detector corrects the classification results of the random forest classification model;
the correction of the classification results of the random forest classification model by the configuration item dependency detector includes: when a first configuration item of the software system depends on a second configuration item, if the random forest classification model determines that the first configuration item affects the performance of the software system and the second configuration item does not affect the performance of the software system, the configuration item dependency detector corrects the classification of the second configuration item to affecting the performance of the software system.