A software defect prediction method based on clustering ensemble

By employing a cluster ensemble approach, random sampling with replacement, and fusion of multiple clustering results, along with spectral clustering and z-score normalization, the method addresses the shortcomings in accuracy and robustness of individual clustering predictions in existing technologies, thus achieving efficient software defect prediction.

CN115994310B · Active · Publication Date: 2026-06-23 · SHAANXI NORMAL UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: SHAANXI NORMAL UNIV
Filing Date: 2022-10-08
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing clustering-based unsupervised software defect prediction methods rely on the prediction result of a single clustering, which gives low accuracy and poor robustness and makes it difficult to predict software defects accurately.

Method used

An unsupervised software defect prediction model is constructed by using a cluster ensemble approach, which involves random sampling with replacement and fusion of multiple clustering results. The spectral clustering algorithm and z-score normalization are then used to generate ensemble prediction results, thereby improving prediction accuracy and robustness.

Benefits of technology

It enables efficient software defect identification in situations where there is no historical defect data or the data is scarce, improving the accuracy and robustness of prediction and ensuring the quality and reliability of software products.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention provides a software defect prediction method based on clustering ensemble, belonging to the field of software defect prediction technology. The method includes: randomly sampling $N$ software entities $x$ with replacement from the given project data; randomly selecting $m$ metrics from the $N$ sampled software entities $x$ to form a dataset $X^*$, where $N$ is the total number of software entities in the project data; constructing an unsupervised software defect prediction model from dataset $X^*$ using a clustering algorithm; labeling the prediction results as defective and defect-free to obtain the predicted label vector $p$ of the extracted software entities; and repeatedly sampling the dataset $X^*$ and generating the predicted label vector $p$ multiple times, then calculating the integrated prediction result $P$ for each extracted software entity $x_i$. If $P$ is greater than 0.5, the extracted software entity $x_i$ is defective; if $P$ is less than or equal to 0.5, the extracted software entity $x_i$ is defect-free. This method can be used to predict software defects.

Description

Technical Field

[0001] This invention belongs to the field of software defect prediction technology, specifically relating to a software defect prediction method based on clustering ensemble.

Background Technology

[0002] Software defect prediction is a key technology in software quality assurance. Its motivation lies in using cost-effective techniques to discover defects in software, thereby ensuring the quality and reliability of software products. Software defects are inevitable during the software development process. Discovering and fixing these defects as early as possible is extremely important for later software development and maintenance. This makes software defect prediction technology a promising area for application in identifying software product defects.

[0003] Unsupervised software defect prediction does not require prior labeling of data in the software project under test. Instead, it learns and mines the latent structure of its own data to automatically identify software defects. Unsupervised defect prediction can be used when the target project is a newly developed project with little or no historical defect data. This has significant research implications for solving the problem of software defect prediction when historical defect data is scarce or nonexistent.

[0004] In recent years, researchers have proposed several clustering-based unsupervised software defect prediction methods. However, most of these methods rely on the prediction result of a single clustering, which often has low accuracy and poor robustness, making it difficult to predict software defects accurately.

[0005] In summary, existing unsupervised software defect prediction methods based on clustering rely on the prediction result of a single clustering, resulting in low accuracy and poor robustness, making it difficult to predict software defects accurately.

Summary of the Invention

[0006] To overcome the shortcomings of the existing technology, the present invention provides a software defect prediction method based on clustering ensemble.

[0007] To achieve the above objectives, the present invention provides the following technical solution:

[0008] A software defect prediction method based on clustering ensemble includes:

[0009] From the given project data, $N$ software entities $x$ are randomly sampled with replacement, and $m$ metrics are randomly selected from these $N$ sampled software entities $x$ to form a dataset $X^*$, where $N$ is the total number of software entities in the project data;

[0010] Based on dataset $X^*$, a clustering algorithm is used to construct an unsupervised software defect prediction model;

[0011] Defect prediction is performed on software entities based on an unsupervised software defect prediction model, and the prediction results are labeled as defective and defect-free, resulting in a prediction label vector p of software entities containing defective and defect-free labels.

[0012] Remove duplicate sampled software entities from the predicted label vector p;

[0013] The dataset $X^*$ is sampled repeatedly and the predicted label vector $p$ is generated multiple times; for each extracted software entity $x_i$, the mean $P(x_i)$ of its generated predicted labels $p(x_i)$ is calculated, and $P(x_i)$ is taken as its integrated prediction result;

[0014] If $P(x_i)$ is greater than 0.5, the extracted software entity $x_i$ is defective; if $P(x_i)$ is less than or equal to 0.5, the extracted software entity $x_i$ is defect-free.

[0015] Furthermore, the project data refers to the data of the software project under test, and the software entity refers to the instance module extracted from the program code or development process, which can be a method, class, file, package or code change.

[0016] Furthermore, it also includes: using the z-score method to normalize the given project data; the normalization algorithm is as follows:

[0017] $\hat{x}_i = \frac{x_i - \mu_x}{\sigma_x}$

[0018] where $x_i$ is the original value of the $i$-th metric of software entity $x$, $\hat{x}_i$ is the normalized value of $x_i$, $\mu_x$ is the mean value of software entity $x$, and $\sigma_x$ is the standard deviation of software entity $x$.

[0019] Furthermore, the clustering algorithm is spectral clustering, and its algorithm is as follows:

[0020] The algorithm for constructing the adjacency matrix W is as follows:

[0021]

[0022] The algorithm for calculating the Laplacian matrix L is as follows:

[0023] $L = D - W$

[0024] where $D$ is the degree matrix, a diagonal matrix whose diagonal elements are the row sums of the adjacency matrix, $D_{ii} = \sum_j W_{ij}$.

[0025] The Laplacian matrix $L$ is normalized to obtain $L_{sym}$ as follows:

[0026] $L_{sym} = D^{-1/2} L D^{-1/2}$

[0027] Eigenvalue decomposition is performed on $L_{sym}$, the eigenvector $v$ corresponding to the second smallest eigenvalue is selected, and it is standardized to obtain an unsupervised software defect prediction model.

[0028] Furthermore, the labeling of the prediction results as defective and defect-free includes:

[0029] Divide v into two clusters. Label the software entities corresponding to v>0 as defective and the software entities corresponding to v≤0 as defect-free. If the total metric value of the software entities corresponding to v>0 is less than the total metric value of the software entities corresponding to v≤0, then label the software entities corresponding to v>0 as defect-free and the rest as defective.

[0030] Furthermore, the integrated prediction result $P(x_i)$ is calculated as follows:

[0031] $P(x_i) = \frac{1}{T_i} \sum_{j=1}^{T_i} p_j(x_i)$

[0032] where $x_i$ is the extracted software entity under test, $T_i$ is the number of times the software entity $x_i$ was actually extracted, and $p_j(x_i)$ is the label obtained for the extracted software entity $x_i$ in the $j$-th prediction.

[0033] Furthermore, randomly sampling $N$ software entities $x$ with replacement includes: extracting one entity at a time, with replacement, from the given project data $X = \{x_1, x_2, \dots, x_N\} \in \mathbb{R}^{M \times N}$, $N$ times in total, where $M$ is the number of software metrics of the project data $X$ and $N$ is the number of software entities of the data $X$.

[0034] The software defect prediction method based on clustering ensemble provided by this invention has the following beneficial effects:

[0035] This invention employs ensemble learning and unsupervised clustering techniques to design an unsupervised software defect prediction method based on cluster ensemble. The method needs no pre-labeled data to identify defects automatically and can be used in software projects with limited or no historical defect data. By integrating the results of multiple base clusterings, it achieves high defect identification accuracy and robustness, which helps assure software product quality and reliability. It addresses the problem that existing clustering-based unsupervised software defect prediction methods rely on the prediction result of a single clustering, resulting in low accuracy and poor robustness and making accurate prediction of software defects difficult.

Attached Figure Description

[0036] To more clearly illustrate the embodiments and design schemes of the present invention, the accompanying drawings required for this embodiment will be briefly described below. The drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0037] Figure 1 is a schematic diagram of the structure of a software defect prediction method based on clustering ensemble according to an embodiment of the present invention;

[0038] Figure 2 is a comparison chart of the experimental results of the present invention and five other methods on five projects.

Detailed Implementation

[0039] To enable those skilled in the art to better understand and implement the technical solutions of the present invention, the present invention will be described in detail below with reference to the accompanying drawings and specific embodiments. The following embodiments are only used to more clearly illustrate the technical solutions of the present invention and should not be construed as limiting the scope of protection of the present invention.

[0040] Example:

[0041] This invention provides a software defect prediction method based on clustering ensemble which, as shown in Figure 1, includes:

[0042] S1: Data preprocessing, including: normalizing the given project data to eliminate the influence of different dimensions between different metrics of software entities; project data refers to the data of the software project under test, and software entities refer to instance modules extracted from program code or development process, which can be methods, classes, files, packages, or code changes. Specifically, the normalization method used is the z-score method, and its algorithm is as follows:

[0043] $\hat{x}_i = \frac{x_i - \mu_x}{\sigma_x}$

[0044] where $x_i$ is the original value of the $i$-th metric of software entity $x$, $\hat{x}_i$ is the normalized value of $x_i$, $\mu_x$ is the mean value of software entity $x$, and $\sigma_x$ is the standard deviation of software entity $x$.
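The normalization in S1 can be sketched in Python as follows. Paragraph [0044] describes the mean and standard deviation as those of software entity x, so this sketch reads them as statistics over that entity's own metric values. The function name `zscore_per_entity`, the entities-as-rows layout, the use of the population standard deviation, and the guard for a zero standard deviation are illustrative assumptions, not details given in the patent text.

```python
import numpy as np

def zscore_per_entity(X):
    """Z-score each row (one software entity) over its metric values."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=1, keepdims=True)     # mean of each entity's metrics
    sigma = X.std(axis=1, keepdims=True)   # population standard deviation
    sigma[sigma == 0] = 1.0                # assumption: constant rows map to 0
    return (X - mu) / sigma

# Three hypothetical entities described by four metrics each.
X = np.array([[10.0, 2.0, 30.0, 4.0],
              [1.0, 1.0, 1.0, 1.0],
              [5.0, 50.0, 0.0, 7.0]])
Z = zscore_per_entity(X)
```

After this step each non-constant row of `Z` has mean 0 and standard deviation 1, so metrics measured on very different scales contribute comparably to the later similarity computation.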

[0045] S2: From the given project data, randomly sample $N$ software entities $x$ with replacement, and randomly select $m$ metrics from the $N$ sampled software entities $x$ to form a dataset $X^*$;

[0046] Specifically, randomly sampling $N$ software entities $x$ includes: extracting one entity at a time, with replacement, from the given project data $X = \{x_1, x_2, \dots, x_N\} \in \mathbb{R}^{M \times N}$, $N$ times in total, where $M$ is the number of software metrics of the project data $X$ and $N$ is the number of software entities of the data $X$.

[0047] Specifically, randomly sampling m software metrics includes: drawing m (m≤M) metrics from the extracted software entities to generate a new dataset.
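A minimal Python sketch of the sampling in S2 is given below. The text says entities are drawn with replacement but does not say how the m metrics are drawn, so the sketch assumes they are drawn without replacement. It also stores entities as rows (the transpose of the M×N layout in the text); the function name `sample_dataset` and the toy data are hypothetical.

```python
import numpy as np

def sample_dataset(X, m, rng):
    """Draw N entities with replacement, then m of the M metrics."""
    n_entities, n_metrics = X.shape
    rows = rng.integers(0, n_entities, size=n_entities)  # N draws, with replacement
    cols = rng.choice(n_metrics, size=m, replace=False)  # assumed: distinct metrics
    return X[np.ix_(rows, cols)], rows, cols

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 6))   # hypothetical project: 8 entities, 6 metrics
X_star, rows, cols = sample_dataset(X, m=3, rng=rng)
```

Returning `rows` alongside the sampled data keeps track of which original entity each sampled row came from, which the later deduplication and fusion steps need.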

[0048] S3: Based on data $X^*$, a clustering algorithm is used to obtain the clustering result. The clustering algorithm is spectral clustering, and its algorithm is as follows:

[0049] The algorithm for constructing the adjacency matrix W is as follows:

[0050]

[0051] The algorithm for calculating the Laplacian matrix L is as follows:

[0052] $L = D - W$,

[0053] where $D$ is the degree matrix, a diagonal matrix whose diagonal elements are the row sums of the adjacency matrix, $D_{ii} = \sum_j W_{ij}$.

[0054] The Laplacian matrix $L$ is normalized to obtain $L_{sym}$ as follows:

[0055] $L_{sym} = D^{-1/2} L D^{-1/2}$

[0056] Eigenvalue decomposition is performed on $L_{sym}$, the eigenvector $v$ corresponding to the second smallest eigenvalue is selected, and it is standardized to obtain the clustering result;
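The spectral step in S3 can be sketched in Python as shown below. The adjacency formula is not reproduced in this text, so the sketch assumes a common choice: dot-product similarity between entities, with negative values clipped to zero and no self-loops. The degree matrix is taken as the row sums of W and the symmetric normalization as D^(-1/2) L D^(-1/2), both standard definitions. The final rescaling by D^(-1/2) is only one possible reading of "standardized". The function name `spectral_vector` and the toy data are hypothetical.

```python
import numpy as np

def spectral_vector(X_star):
    """Eigenvector of L_sym for the second-smallest eigenvalue."""
    W = X_star @ X_star.T                     # assumed similarity: dot products
    W = np.clip(W, 0.0, None)                 # assumed: drop negative similarities
    np.fill_diagonal(W, 0.0)                  # no self-loops
    d = W.sum(axis=1)                         # degrees: row sums of W
    d[d == 0] = 1e-12                         # guard against isolated entities
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.diag(d) - W                        # L = D - W
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    v = eigvecs[:, 1]                         # second-smallest eigenvalue
    return v * d_inv_sqrt                     # assumed reading of "standardized"

# Two hypothetical groups of three entities with two metrics each.
X_star = np.array([[1.0, 0.1], [0.9, 0.2], [1.1, 0.1],
                   [0.1, 1.0], [0.2, 0.9], [0.1, 1.1]])
v = spectral_vector(X_star)
clusters = v > 0
```

The sign of an eigenvector is arbitrary, so which group ends up with `v > 0` can differ between runs or libraries; the labeling rule in S4 is what resolves that ambiguity.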

[0057] S4: Label the clustering results to obtain the predicted label vector p. The specific labeling process is as follows:

[0058] Existing research has shown that the metric values of defective software entities are generally higher than those of defect-free software entities. Based on this, the predicted label vector $p$ of the data under test $X^*$ is calculated as follows:

[0059] p = v > 0

[0060] The above formula divides v into two clusters: software entities corresponding to v>0 are marked as defective, and software entities corresponding to v≤0 are marked as defect-free. If the overall metric value of the software entities corresponding to v>0 is less than the overall metric value of the software entities corresponding to v≤0, then the software entities corresponding to v>0 should be marked as defect-free, and the rest should be marked as defective.

[0061] The predicted label vector p will contain duplicate sampled software entities, which will be used for the final ensemble prediction after deduplication. Software entities that are not sampled will be given a special label and will not be used for the final ensemble prediction.
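The labeling and deduplication in S4 can be sketched in Python as follows. The text does not define "overall metric value" precisely, so the sketch assumes it means the average, over the entities in a cluster, of each entity's summed metric values. The marker `UNSAMPLED = -1` stands in for the "special label" and, like the function name `label_round`, is hypothetical.

```python
import numpy as np

UNSAMPLED = -1  # hypothetical special label for entities not drawn this round

def label_round(v, X_star, rows, n_entities):
    """Turn the spectral vector into per-entity labels for one round."""
    positive = v > 0
    p = positive.astype(int)                  # p = v > 0 (1 = defective)
    row_sums = X_star.sum(axis=1)             # per-entity summed metric values
    if positive.any() and (~positive).any():
        # Swap labels when the v > 0 cluster has the lower overall metric value.
        if row_sums[positive].mean() < row_sums[~positive].mean():
            p = 1 - p
    labels = np.full(n_entities, UNSAMPLED)
    labels[rows] = p   # deduplication: for a repeated entity the last label wins
    return labels

v = np.array([0.5, 0.2, -0.3, -0.3])
X_star = np.array([[1.0, 1.0], [1.0, 2.0], [5.0, 5.0], [5.0, 5.0]])
rows = np.array([0, 1, 2, 2])   # entity 2 was drawn twice, entities 3 and 4 never
labels = label_round(v, X_star, rows, n_entities=5)
```

In this toy round the `v > 0` cluster has the lower metric values, so the labels are swapped and the result is `[0, 0, 1, -1, -1]`.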

[0062] S5: Repeat steps S2-S4 T times to obtain T predicted label vectors. Fuse the T predicted label vectors to obtain the integrated prediction result P. The algorithm is as follows:

[0063] For the $i$-th software entity under test $x_i$:

[0064] $P(x_i) = \frac{1}{T_i} \sum_{j=1}^{T_i} p_j(x_i)$

[0065] where $x_i$ is the extracted software entity under test, $T_i$ is the number of times the software entity $x_i$ was actually extracted, and $p_j(x_i)$ is the label obtained for the extracted software entity $x_i$ in the $j$-th prediction.
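The fusion in S5 can be sketched in Python as shown below: each entity's labels are averaged over the rounds in which it was actually sampled, and the average is compared against 0.5. The marker `UNSAMPLED = -1` and the function name `fuse` are hypothetical. The text does not say how to treat an entity that is never sampled, so the sketch leaves its score as NaN and, as an assumption, reports it as defect-free.

```python
import numpy as np

UNSAMPLED = -1  # hypothetical marker: entity was not drawn in that round

def fuse(label_rounds, threshold=0.5):
    """Average each entity's labels over the rounds in which it was sampled."""
    votes = np.asarray(label_rounds)               # shape (T, N)
    sampled = votes != UNSAMPLED
    counts = sampled.sum(axis=0)                   # times each entity was drawn
    defect_votes = np.where(sampled, votes, 0).sum(axis=0)
    P = np.full(votes.shape[1], np.nan)            # stays NaN if never sampled
    np.divide(defect_votes, counts, out=P, where=counts > 0)
    return P, (P > threshold).astype(int)          # NaN compares as not defective

rounds = [[1, 0, -1, 1],
          [1, -1, -1, 0],
          [0, 1, -1, -1],
          [1, 0, -1, -1]]
P, labels = fuse(rounds)
```

Here entity 0 scores 3/4 and is flagged defective, entity 1 scores 1/3 and entity 3 scores exactly 0.5, so both are defect-free under the "less than or equal to 0.5" rule.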

[0066] To fully demonstrate the feasibility of the cluster-based software defect prediction method proposed in this invention, verification experiments were conducted.

[0067] Experimental environment configuration: Windows 10 system, MATLAB 2020b

[0068] Experimental Data: This experiment uses five publicly available datasets as experimental data. Details are shown in Table 1.

[0069] Table 1. Datasets used in the experiment

[0070]

[0071] Detailed operation process:

[0072] Input: data of the project to be tested, $X = \{x_1, x_2, \dots, x_N\} \in \mathbb{R}^{M \times N}$;

[0073] Output: Whether each software entity in the project has a defect. 1 indicates that the software entity has a defect, and 0 indicates that the software entity does not have a defect.

[0074] The first step is to perform z-score normalization on the data X of the project to be tested;

[0075] The second step involves conducting experiments using 50 rounds of 2-fold cross-validation for supervised classifier methods, generating a total of 100 prediction results. For unsupervised clustering methods, clustering experiments are performed directly on each fold of data, generating a total of 100 prediction results.

[0076] The third step is to randomly sample $N$ software entities from the data $X$ to be tested, and then randomly sample $m = \log_2(M)$ metrics from them to generate a new dataset $X^*$;

[0077] The fourth step is to cluster $X^*$ with the spectral clustering algorithm to build a prediction model, and to label the results as defective and non-defective to obtain a prediction label vector;

[0078] Fifth, repeat steps three and four multiple times to obtain multiple predicted label vectors, which are then fused to obtain an integrated prediction result.
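The evaluation protocol in the second step above (50 rounds of 2-fold cross-validation, 100 results in total) can be sketched in Python as follows. The text does not say whether the folds are stratified or how odd-sized projects are split, so this sketch assumes plain random halves. The function name `repeated_two_fold` and the seed are hypothetical.

```python
import random

def repeated_two_fold(n_entities, rounds=50, seed=0):
    """Yield (fold_a, fold_b) index lists for `rounds` random 2-fold splits."""
    rng = random.Random(seed)
    indices = list(range(n_entities))
    half = n_entities // 2
    for _ in range(rounds):
        rng.shuffle(indices)                 # new random partition each round
        yield sorted(indices[:half]), sorted(indices[half:])

splits = list(repeated_two_fold(11))
n_results = 2 * len(splits)   # each round contributes two folds
```

For a supervised classifier each fold is tested with a model trained on the other fold, while an unsupervised method is simply run on each fold, so both kinds of method produce the same number of results.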

[0079] The Area Under Curve (AUC) is a commonly used performance metric for evaluation. AUC represents the area under the receiver operating characteristic (ROC) curve, and its value ranges from 0 to 1. The larger the value, the better the algorithm performance. A value of 0.5 indicates the performance of random guessing.
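As an illustration of the metric, AUC can be computed in Python from its pairwise interpretation: the probability that a randomly chosen defective entity receives a higher score than a randomly chosen defect-free one, with ties counted as one half. This is equivalent to the area under the ROC curve. The function name `auc` and the example scores are hypothetical, and the patent's experiments used MATLAB rather than this code.

```python
def auc(y_true, scores):
    """Probability that a defective entity outscores a defect-free one."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one entity of each class")
    # Count pairwise wins; a tie counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

example = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In the example three of the four defective/defect-free pairs are ranked correctly, giving 0.75; scores that cannot separate the classes at all give 0.5.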

[0080] Under the same experimental conditions, the method of this invention (ESC) is compared with three commonly used supervised classifier methods, logistic regression (LR), Naive Bayes (NB), and Random Forest (RF), and with two unsupervised clustering methods, K-means and Spectral Clustering (SC). The experimental results on the AUC metric are compared across five software projects.

[0081] Figure 2 presents the experimental results of all methods on the five projects. The boxplots show the AUC values over 100 runs, displaying the minimum, first quartile, median, third quartile, and maximum. Vertical bars represent the range from the first to the third quartile, horizontal lines below and above the bars represent the minimum and maximum values, dots represent the median, and plus signs represent outliers. The figure shows that, both on individual projects and across all projects, the method of this invention achieves AUC results similar to those of the supervised LR, NB, and RF methods, and superior to those of the unsupervised K-means and SC methods. These results demonstrate that the proposed method can be applied to unsupervised software defect prediction, validating its feasibility and effectiveness.

[0082] The above-described embodiments are merely preferred embodiments of the present invention, and the scope of protection of the present invention is not limited thereto. Any simple changes or equivalent substitutions of the technical solutions that can be obviously obtained by those skilled in the art within the scope of the technology disclosed in the present invention shall fall within the scope of protection of the present invention.

Claims

1. A software defect prediction method based on clustering ensemble, characterized in that it includes: randomly sampling $N$ software entities $x$ with replacement from the given project data, and randomly selecting $m$ metrics from the $N$ sampled software entities $x$ to form a dataset $X^*$, where $N$ is the total number of software entities in the project data; constructing an unsupervised software defect prediction model from dataset $X^*$ using a clustering algorithm, and obtaining the clustering result feature vector $v$; performing defect prediction on the software entities from the feature vector $v$ according to a preset annotation rule, and labeling the prediction results as defective or defect-free to obtain a predicted label vector $p$ of the software entities containing defective and defect-free annotations; the preset annotation rule is to first label the software entities corresponding to $v > 0$ as defective and the software entities corresponding to $v \le 0$ as defect-free, and, if the overall metric value of the software entities corresponding to $v > 0$ is less than the overall metric value of the software entities corresponding to $v \le 0$, to label the software entities corresponding to $v > 0$ as defect-free and the rest as defective; the predicted label vector $p$ takes the value 1 or 0, where 1 indicates defective and 0 indicates defect-free; removing the duplicate sampled software entities from the predicted label vector $p$; repeatedly sampling the dataset $X^*$ and generating the predicted label vector $p$ multiple times, and, for each extracted software entity $x_i$, calculating the mean $P(x_i)$ of its generated predicted labels $p(x_i)$ and taking $P(x_i)$ as its integrated prediction result; if $P(x_i)$ is greater than 0.5, the extracted software entity $x_i$ is defective, and if $P(x_i)$ is less than or equal to 0.5, the extracted software entity $x_i$ is defect-free; randomly sampling $N$ software entities $x$ with replacement includes: extracting one entity at a time, with replacement, from the given project data $X = \{x_1, x_2, \dots, x_N\} \in \mathbb{R}^{M \times N}$, $N$ times in total, where $M$ is the number of software metrics of the project data $X$ and $N$ is the number of software entities of the data $X$; the project data refers to the data of the software project under test; software entities refer to instance modules extracted from program code or the development process, which include methods, classes, files, packages, or code changes.

2. The software defect prediction method based on clustering ensemble as described in claim 1, characterized in that it also includes: using the z-score method to normalize the given project data; the normalization algorithm is as follows: $\hat{x}_i = \frac{x_i - \mu_x}{\sigma_x}$, where $x_i$ is the original value of the $i$-th metric of software entity $x$, $\hat{x}_i$ is the normalized value of $x_i$, $\mu_x$ is the mean value of software entity $x$, and $\sigma_x$ is the standard deviation of software entity $x$.

3. The software defect prediction method based on clustering ensemble as described in claim 1, characterized in that the clustering algorithm is spectral clustering, and its algorithm is as follows: constructing the adjacency matrix $W$ (the formula is not reproduced in this text); calculating the Laplacian matrix $L$ as $L = D - W$, where $D$ is the degree matrix, a diagonal matrix whose diagonal elements are the row sums of the adjacency matrix, $D_{ii} = \sum_j W_{ij}$; normalizing the Laplacian matrix $L$ to obtain $L_{sym} = D^{-1/2} L D^{-1/2}$; performing eigenvalue decomposition on $L_{sym}$, selecting the eigenvector $v$ corresponding to the second smallest eigenvalue, and standardizing it to obtain an unsupervised software defect prediction model.

4. The software defect prediction method based on clustering ensemble as described in claim 1, characterized in that the integrated prediction result $P(x_i)$ is calculated as follows: $P(x_i) = \frac{1}{T_i} \sum_{j=1}^{T_i} p_j(x_i)$, where $x_i$ is the extracted software entity under test, $T_i$ is the number of times the software entity $x_i$ was actually extracted, and $p_j(x_i)$ is the label obtained for the extracted software entity $x_i$ in the $j$-th prediction.