A software defect prediction method based on structural semantic features

By constructing a bidirectional LSTM network to learn specific and shared information between projects, the problem of poor performance of existing defect prediction models in the same project and across projects is solved, and better defect prediction results are achieved.

CN115599691B · Active · Publication Date: 2026-06-19 · KYLIN CORP

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
KYLIN CORP
Filing Date
2022-10-28
Publication Date
2026-06-19


Abstract

This invention discloses a software defect prediction method based on structural semantic features, comprising: constructing a defect prediction model, the defect prediction model including: an input layer for receiving a specific information vector sequence and a common information vector sequence, which are respectively input into corresponding bidirectional LSTM networks, where the specific information vector sequence is a vectorized specific information node token sequence and the common information vector sequence is a vectorized common information node token sequence; an encoding layer for encoding using a bidirectional LSTM network to obtain the corresponding specific information context vector and hidden state, and the common information context vector and hidden state; a logistic regression layer for classifying the output obtained from the encoding layer to obtain prediction results corresponding to specific information and common information, respectively; and an output layer for using a switcher to select between the two prediction results to obtain the final prediction result. This invention addresses the limitation that existing defect prediction models perform well either within a project or across projects, but not both.

Description

Technical Field

[0001] This invention relates to the field of software engineering, and in particular to a method for predicting software defects based on structural semantic features.

Background Technology

[0002] Software defects refer to problems, errors, or improper behaviors in computer software or programs that disrupt their normal operation and fail to meet expected or intended requirements. Research shows that some software defects may not yet show harm, but they can erupt at any time and pose varying degrees of threat. On a small scale, software defects can cause work interruptions, extremely poor user experience, and reputational and financial losses for companies. On a larger scale, attackers can exploit software defects to launch malicious attacks on systems or user accounts, leading to even more serious consequences. Early detection and remediation of software defects are crucial to minimizing losses; therefore, software defect prediction is an indispensable and important part of the software development process.

[0003] With the development of deep learning technology, researchers have begun to apply deep learning to automatically learn features from software source code and build defect prediction models on these features. Some studies have used semantic information as features, but this ignores the strong structure of program code, such as structural statements, method execution order, and nesting relationships. Similar code can have different semantics because of different structures, and existing defect prediction research has not effectively captured this structural information. Furthermore, most current defect prediction models are limited: they perform well either within the same project or across projects, and it is difficult to achieve good prediction results in both settings simultaneously. This is because each project has specific information, while multiple projects share common information, and existing defect prediction models cannot effectively use both types of information for defect prediction.

Summary of the Invention

[0004] The technical problem to be solved by this invention is: in view of the technical problems existing in the prior art, this invention provides a software defect prediction method based on structural semantic features, which can simultaneously learn specific information and common information between projects, and perform defect prediction within the same project and across projects, so as to solve the problem of the limitation of project defect prediction.

[0005] To solve the above-mentioned technical problems, the technical solution proposed by this invention is as follows:

[0006] A software defect prediction method based on structural semantic features includes the following steps:

[0007] Construct a defect prediction model, the defect prediction model including:

[0008] The input layer is used to receive specific information vector sequences and common information vector sequences, and input them into the corresponding bidirectional LSTM networks respectively. The specific information vector sequence is a vectorized specific information node token sequence, and the common information vector sequence is a vectorized common information node token sequence.

[0009] The encoding layer is used to encode using a bidirectional LSTM network to obtain the corresponding specific information context vector and hidden state, as well as the common information context vector and hidden state.

[0010] The logistic regression layer is used to classify the output obtained from the encoding layer, and obtain the prediction results corresponding to specific information and the prediction results corresponding to common information respectively.

[0011] The output layer is used to select between two prediction results using a switcher to obtain the final prediction result.

[0012] Furthermore, after constructing the defect prediction model, the following steps are also included:

[0013] Obtain the project source code and parse it into an abstract syntax tree;

[0014] Traverse the abstract syntax tree to obtain the token sequence of specific information nodes and the token sequence of common information nodes, and add special characters to both ends of the scope of the traversed control flow nodes;

[0015] Eliminate tokens that do not meet the requirements from the specific information node token sequence and the shared information node token sequence;

[0016] Initialize the Embedding word vector matrix and vectorize the specific information node token sequence and the common information node token sequence according to the matrix. Then, input the vectorized specific information node token sequence, the vectorized common information node token sequence, and the corresponding software module labels into the defect prediction model for training.

[0017] Furthermore, the bidirectional LSTM network corresponds one-to-one with the specific information vector sequence and the common information vector sequence. Each bidirectional LSTM network includes two parallel LSTM layers. These two LSTM layers capture sequence features from the corresponding vectorized token sequences in the forward and backward directions, respectively, and fuse the features of the corresponding tokens in the two directions to obtain the context features of each token.

[0018] Furthermore, the formula for the logistic regression layer is as follows:

[0019] $p_t = \mathrm{softmax}(W_t o_t + b_t)$

[0020] $p_v = \mathrm{softmax}(W_v o_v + b_v)$

[0021] Where $p_t$ is the prediction result obtained based on the specific information context vector, $p_v$ is the prediction result obtained based on the shared information context vector, $W_t$, $W_v$, $b_t$, and $b_v$ are all trainable parameter vectors, $o_v$ is the specific information context vector, and $o_t$ is the shared information context vector.

[0022] Furthermore, the formula for the output layer is as follows:

[0023] $s_p = \sigma(W_s [h_v; h'_v] + b_s)$

[0024] $p = [s_p p_v; (1 - s_p) p_t]$

[0025] Where $s_p$ is the balance value calculated by the switcher, $p$ is the final prediction result of whether the module is defective, $p_t$ is the prediction result obtained based on the specific information context vector, $p_v$ is the prediction result obtained based on the shared information context vector, $h_v$ is the forward hidden state, $h'_v$ is the backward hidden state, $\sigma$ is the sigmoid activation function, and $W_s$ and $b_s$ are trainable parameter vectors.

[0026] Furthermore, the specific information nodes include method names or class names related to method calls and instance creation, keywords related to declarations, and keywords related to control flow; the common information nodes include node types related to method calls and instance creation, node types related to declarations, and node types related to control flow.

[0027] Furthermore, adding special characters to both ends of the scope of the traversed control flow nodes specifically involves adding the special character '(' at the beginning of the scope of the control flow node and adding the special character ')' at the end.

[0028] Furthermore, the tokens that do not meet the requirements are tokens that appear less than three times.

[0029] Furthermore, after vectorizing the specific information node token sequence and the shared information node token sequence using the Embedding word vector matrix, the method further includes adding at least one 0 to the shorter sequence in the vectorized specific information node token sequence and the vectorized shared information node token sequence, so that the two sequences have the same length.

[0030] Furthermore, during the training of the defect prediction model, L2 regularization is applied to the weight matrices of the bidirectional LSTM network and the logistic regression layer, the loss function is set to cross-entropy loss, the optimizer is set to Adam, and the Embedding word vector matrix is updated and adjusted during training.

[0031] Compared with the prior art, the advantages of the present invention are as follows:

[0032] 1. The structure of code plays a crucial role in understanding its semantics. Some syntactic details in source code files, such as nested parentheses and assignment operators, are not represented in the abstract syntax tree (AST), but may be implied by the logical structure of the surrounding AST context. When acquiring code features, existing defect prediction methods simply traverse the AST to obtain token sequences, ignoring the structural information the tree contains. This invention uses a novel token sequence representation that adds special characters around control flow nodes. Compared with existing code representations, this token sequence reflects the structural relationships between program statements without destroying the program's syntactic information, enabling the defect prediction model to better learn the code's semantic information and obtain more valuable features for prediction.

[0033] 2. In the abstract syntax tree (AST) of the project source code, some nodes are method- or class-specific, such as assignment declarations, and cannot be generalized to the entire project. Extracting these nodes may dilute the importance of other nodes. Therefore, to better extract code feature information, three types of representative nodes are selected from the AST nodes. Node value information (such as method names and variable names) may exist only in a specific project, which helps the model learn specific feature information. Node types are the same across software projects, so they can represent software modules from different projects in a unified way, helping the model learn common feature information. This invention designs a bidirectional LSTM model as the defect prediction model to learn specific and common information between projects, and provides a selection mechanism to weigh the proportion of specific and common information features, thereby achieving better defect prediction results and solving the problem of limitations in project defect prediction.

Description of the Drawings

[0034] Figure 1 is a schematic diagram of the overall process of an embodiment of the present invention.

[0035] Figure 2 is a schematic diagram of the control flow nodes in an embodiment of the present invention.

[0036] Figure 3 is a schematic diagram of token sequence vectorization according to an embodiment of the present invention.

[0037] Figure 4 is a schematic diagram of a bidirectional LSTM network structure according to an embodiment of the present invention.

Detailed Implementation

[0038] The present invention will be further described below with reference to the accompanying drawings and specific preferred embodiments, but this does not limit the scope of protection of the present invention.

[0039] The objectives of this solution are:

[0040] Provide a token sequence representation based on the abstract syntax tree (AST). This token sequence reflects the structural relationships between program statements without destroying the program's syntactic information, so that a vector representation of the semantics implied by the token sequence can be constructed and applied to improve the performance of the defect prediction model.

[0041] Extract specific and shared information between projects. This information can be represented by one or more token sequences. Both types of information help the defect prediction model make predictions on the target project to some extent.

[0042] Design a defect prediction model that can simultaneously learn specific and shared information features, and provide a selection mechanism to balance the weight of specific and shared information features. This model can then be used to predict defects in target projects, achieving good defect prediction results.

[0043] To achieve the above objectives, this embodiment proposes a software defect prediction method based on structural semantic features. As shown in Figure 1, the specific steps are as follows:

[0044] Step 1: Obtain the source code of the open-source project. Using a code parsing tool (such as javalang for Java code or the ast module for Python code), parse the source code into an abstract syntax tree. Specific information nodes and common information nodes are used as tokens. The specific information nodes mainly include:

[0045] 1) Method names or class names related to method calls and instance creation;

[0046] 2) Keywords related to declarations;

[0047] 3) Keywords related to control flow.

[0048] The common information nodes mainly include:

[0049] 1) Node types related to method calls and instance creation;

[0050] 2) Node types related to declarations;

[0051] 3) Node types related to control flow.
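As a minimal sketch of Step 1 for Python source, the two token views could be collected with the standard `ast` module as follows. The node groupings (`CALL_NODES`, `DECL_NODES`, `FLOW_NODES`), the helper names, and the choice of the declared name as the specific token for a declaration are assumptions made for illustration; the embodiment names the node categories but does not enumerate concrete node classes.

```python
import ast

# Assumed node groupings; illustrative only, not taken from the patent.
CALL_NODES = (ast.Call,)
DECL_NODES = (ast.FunctionDef, ast.ClassDef)
FLOW_NODES = (ast.If, ast.For, ast.While, ast.Try)


def callee_name(call):
    """Return the called function or method name, or None for other callees."""
    func = call.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        return func.attr
    return None


def extract_tokens(source):
    """Return (specific_tokens, common_tokens) in depth-first source order."""
    specific, common = [], []

    def visit(node):
        node_type = type(node).__name__
        if isinstance(node, CALL_NODES):
            name = callee_name(node)
            if name is not None:
                specific.append(name)        # node value: project-specific
                common.append(node_type)     # node type: shared across projects
        elif isinstance(node, DECL_NODES):
            specific.append(node.name)       # assumed: declared name as value
            common.append(node_type)
        elif isinstance(node, FLOW_NODES):
            specific.append(node_type.lower())  # keyword such as "if" or "for"
            common.append(node_type)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return specific, common
```

The two returned lists stay aligned position by position, so each selected node contributes one value token and one type token.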

[0052] Step 2: Traverse the abstract syntax tree obtained in Step 1 to obtain the token sequence of specific information nodes and the token sequence of common information nodes. When the traversal reaches a control flow-related node, add the special character '(' at the beginning of the node's scope and the special character ')' at the end. The control flow-related nodes are shown in Figure 2.

[0053] For example, Figure 3 shows two Java programs. By parsing the source code, the token sequence representation obtained by the traditional method is as follows:

[0054] Program 1: [if,for,foo(),bar()]

[0055] Program 2: [if,for,foo(),bar()]

[0056] The specific information node token sequence obtained by adding special characters to the control flow node through step 2 of this embodiment is as follows:

[0057] Program 1: [if,(,for,(,foo(),bar(),),)]

[0058] Program 2: [if,(,for,(,foo(),),bar(),)]

[0059] In particular, the special character '(' is added to the starting point of the scope of the if and for control flow keywords in the Token sequence, and the special character ')' is added to the ending point of the scope, thus representing the structure of the source code.
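The bracketed traversal of Step 2 can be sketched on a toy tree as follows. The `(token, children)` tuple layout, the `CONTROL_FLOW` set, and the two tree shapes are assumptions inferred from the bracketed sequences above, since Figure 3 is not reproduced here.

```python
# Assumed subset of control flow tokens; Figure 2 lists the actual set.
CONTROL_FLOW = {"if", "for", "while"}


def plain_sequence(node):
    """Traditional preorder traversal: tokens only, structure is lost."""
    token, children = node
    seq = [token]
    for child in children:
        seq.extend(plain_sequence(child))
    return seq


def bracketed_sequence(node):
    """Preorder traversal that wraps each control flow scope in '(' and ')'."""
    token, children = node
    seq = [token]
    is_flow = token in CONTROL_FLOW
    if is_flow:
        seq.append("(")
    for child in children:
        seq.extend(bracketed_sequence(child))
    if is_flow:
        seq.append(")")
    return seq


# Program 1: bar() is inside the for loop. Program 2: bar() follows the loop.
PROGRAM_1 = ("if", [("for", [("foo()", []), ("bar()", [])])])
PROGRAM_2 = ("if", [("for", [("foo()", [])]), ("bar()", [])])

print(plain_sequence(PROGRAM_1) == plain_sequence(PROGRAM_2))  # True
print(bracketed_sequence(PROGRAM_1))
print(bracketed_sequence(PROGRAM_2))
```

Under these assumptions the plain traversal cannot tell the two programs apart, while the bracketed traversal yields the two distinct sequences given in [0057] and [0058].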

[0060] Step 3: Defect data is often noisy and contains mislabeled tokens. To remove noise from the data, Closest List Noise Identification (CLNI), based on edit distance similarity, is used to eliminate tokens with potentially incorrect labels. In this process, infrequent token nodes are also filtered out, because these nodes may be designed for a specific file and are difficult to generalize to other files. For a given project, a token is filtered out if it appears fewer than three times.
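The frequency filter in Step 3 might look like the sketch below. It covers only the "fewer than three occurrences" rule, not CLNI, and it assumes that occurrences are counted over all token sequences of one project; the function name and the `min_count` parameter are illustrative.

```python
from collections import Counter


def filter_rare_tokens(sequences, min_count=3):
    """Drop tokens that occur fewer than min_count times across the project."""
    counts = Counter(token for seq in sequences for token in seq)
    return [[token for token in seq if counts[token] >= min_count]
            for seq in sequences]


project = [["if", "foo", "bar"], ["if", "foo"], ["if", "foo", "baz"]]
print(filter_rare_tokens(project))
```

Here `bar` and `baz` each occur once and are removed, while `if` and `foo` occur three times and are kept.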

[0061] Step 4: Randomly initialize the Embedding word vector matrix. The vector representations of specific information nodes and shared information node token sequences can be obtained by querying this Embedding matrix. Then, the vectorized token sequences and the corresponding software module labels are input into the defect prediction model for training. The Embedding word vector matrix will be updated and adjusted during training.

[0062] It's important to note that the defect prediction model only accepts numeric vectors as input, and all input vectors must have the same length. To generate semantic features using the model, a mapping must first be established between integers and tokens, and the token vectors must be encoded as integer vectors. Each token has a unique integer identifier. Since integer vectors may have different lengths, we append 0s to the integer vectors to make all lengths consistent and equal to the length of the longest vector. Specifically, we add at least one 0 to the shorter of the vectorized specific information node token sequence and the vectorized common information node token sequence, making the two sequences the same length. Adding 0s does not affect the model's learning results; it's merely a representation transformation. The integer vectors correspond to the word vector matrix, with each integer having its corresponding word vector.

[0063] For example, considering only the two code snippets in Figure 3, the token sequences are mapped to [3,1,5,1,4,6,2,2] and [3,1,5,1,4,2,6,2] respectively. Through this mapping process, method call information and statement execution information are represented as integer vectors. Furthermore, since the order of the tokens remains unchanged, the structural information of the program is preserved.
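The integer mapping and zero padding described above can be sketched as follows. The `VOCAB` table is an assumption: it is one token-to-integer assignment consistent with the two vectors in [0063], and the text does not state how the integers are actually assigned.

```python
# Assumed vocabulary that reproduces the vectors in [0063]; 0 is reserved for padding.
VOCAB = {"(": 1, ")": 2, "if": 3, "foo()": 4, "for": 5, "bar()": 6}
PAD = 0


def encode(tokens, vocab=VOCAB):
    """Map each token to its unique integer identifier, preserving order."""
    return [vocab[token] for token in tokens]


def pad_to_same_length(sequences, pad=PAD):
    """Append zeros so every sequence is as long as the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad] * (longest - len(seq)) for seq in sequences]


program_1 = ["if", "(", "for", "(", "foo()", "bar()", ")", ")"]
program_2 = ["if", "(", "for", "(", "foo()", ")", "bar()", ")"]
print(encode(program_1))
print(encode(program_2))
print(pad_to_same_length([[3, 1, 5], [4]]))
```

Each integer then indexes one row of the Embedding word vector matrix, which is how the integer vector is turned into the vector sequence fed to the model.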

[0064] In this embodiment, the steps for obtaining the defect prediction model include:

[0065] Step 5: Construct the defect prediction model. In this embodiment, the defect prediction model is built on a bidirectional LSTM model and consists of an input layer, an encoding layer, a logistic regression layer, and an output layer. The input layer receives the vectorized token sequences. The bidirectional LSTM acts as the encoder of the encoding layer. The logistic regression layer classifies the output of the encoding layer. The output layer uses a switcher to obtain the final prediction result. The vector sequences obtained in Step 4 and the corresponding labels are input into the bidirectional LSTM model. The specific information vector sequence and the common information vector sequence each have their own bidirectional LSTM network, and each bidirectional LSTM network has two parallel LSTM layers. The two LSTM layers process the vectorized token sequence in the forward and backward directions respectively, capture sequence features from these two directions, and fuse the features of the corresponding tokens in the two directions to obtain the context features of each token. The final output is a feature vector sequence with the same length as the input sequence, where each feature vector is the context feature of the corresponding token. The details of each layer of the defect prediction model are as follows:

[0066] 5-1. Input layer: Receives specific information vector sequences and common information vector sequences, and inputs them into the corresponding bidirectional LSTM networks respectively.

[0067] 5-2. Encoding Layer: Encoding is performed using a bidirectional LSTM network, whose structure is shown in Figure 4. Through the forward propagation layer and the backward propagation layer, the specific information context vector and hidden state, and the common information context vector and hidden state, are obtained. The formulas are as follows:

[0068] $h_v = f(w_1 x_v + w_2 h_{v-1})$

[0069] $h'_v = f(w_3 x_v + w_5 h'_{v+1})$

[0070] $o_v = g(w_4 h_v + w_6 h'_v)$

[0071] Where $x_v$ is the input word vector, $h_v$ is the forward hidden state, $h'_v$ is the backward hidden state, $o_v$ is the output specific information context vector, and $w_1$ to $w_6$ are trainable parameter vectors, randomly initialized before training begins and adjusted during training. Similarly, the common information context vector $o_t$ can be obtained.

[0072] During the forward propagation of the LSTM, the current input node value is $x_v$ and the input node value at the previous time step is $x_{v-1}$. $h_{v-1}$ is the hidden state obtained by forward propagation from the previous input node value $x_{v-1}$, and from it the hidden state $h_v$ corresponding to the current input node value $x_v$ is obtained. Similarly, during the backward propagation of the LSTM, the current input node value is $x_v$ and the input node value processed at the previous step is $x_{v+1}$. $h'_{v+1}$ is the hidden state obtained by backward propagation from $x_{v+1}$, and from it the hidden state $h'_v$ corresponding to the current input node value $x_v$ is obtained.
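The recurrences in [0068] to [0070] can be traced with the scalar sketch below. It follows the simplified formulas as written, not the full gated LSTM cell, and it assumes scalar inputs and weights, zero initial hidden states, `tanh` for $f$, and the identity for $g$; none of these choices is specified in the text.

```python
import math


def bidirectional_encode(xs, w, f=math.tanh, g=lambda z: z):
    """Return (forward states, backward states, context outputs) for inputs xs."""
    w1, w2, w3, w4, w5, w6 = w
    n = len(xs)

    # Forward pass: h_v depends on x_v and the previous state h_{v-1}.
    forward = [0.0] * n
    h = 0.0  # assumed initial state
    for v in range(n):
        h = f(w1 * xs[v] + w2 * h)
        forward[v] = h

    # Backward pass: h'_v depends on x_v and the following state h'_{v+1}.
    backward = [0.0] * n
    h = 0.0  # assumed initial state
    for v in reversed(range(n)):
        h = f(w3 * xs[v] + w5 * h)
        backward[v] = h

    # Fuse both directions into one context feature per token.
    outputs = [g(w4 * forward[v] + w6 * backward[v]) for v in range(n)]
    return forward, backward, outputs


fwd, bwd, out = bidirectional_encode([1.0, 2.0], (1.0, 1.0, 1.0, 1.0, 1.0, 1.0))
print(len(out))  # one context feature per input token
```

The output has one entry per input token, which matches the statement that the encoder returns a feature sequence of the same length as its input.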

[0073] 5-3. Logistic Regression Layer: This layer classifies the output from the encoding layer, obtaining prediction results for specific information and prediction results for common information. The formula is as follows:

[0074] $p_t = \mathrm{softmax}(W_t o_t + b_t)$

[0075] $p_v = \mathrm{softmax}(W_v o_v + b_v)$

[0076] Where $p_t$ is the prediction result obtained based on the specific information context vector, $p_v$ is the prediction result obtained based on the shared information context vector, $W_t$, $W_v$, $b_t$, and $b_v$ are all trainable parameter vectors, randomly initialized before training begins and adjusted during training, $o_v$ is the specific information context vector, and $o_t$ is the shared information context vector.

[0077] It should be noted that for each input token, a corresponding context feature vector o is obtained. In the logistic regression layer, the o obtained from the last input token is used to calculate the prediction result p.
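A minimal sketch of the logistic regression layer, applied to the context vector of the last token, is given below. The two-class output, the list-of-lists weight layout, and the function names are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(W, o, b):
    """Compute softmax(W o + b) for weight rows W, context vector o, bias b."""
    logits = [sum(w_ij * o_j for w_ij, o_j in zip(row, o)) + b_i
              for row, b_i in zip(W, b)]
    return softmax(logits)


# Context vector of the last input token (made-up values).
o_last = [0.2, -0.4, 0.7]
W = [[0.5, 0.1, -0.3], [-0.2, 0.4, 0.6]]
b = [0.0, 0.1]
print(classify(W, o_last, b))
```

The same function would be called once per branch, with separate parameters for the specific and the common context vectors.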

[0078] 5-4. Output Layer: A switcher is used to select between two predictions. The switcher is a sigmoid function conditional on the forward and backward hidden states. Finally, the model completes the final prediction by connecting the two prediction distributions. The formula is as follows:

[0079] $s_p = \sigma(W_s [h_v; h'_v] + b_s)$

[0080] $p = [s_p p_v; (1 - s_p) p_t]$

[0081] Where $s_p$ is the balance value calculated by the switcher, $p$ is the final prediction result of whether the module is defective, $p_t$ is the prediction result obtained based on the specific information context vector, $p_v$ is the prediction result obtained based on the shared information context vector, $h_v$ is the forward hidden state, $h'_v$ is the backward hidden state, $\sigma$ is the sigmoid activation function, and $W_s$ and $b_s$ are trainable parameter vectors, randomly initialized before training begins and adjusted during training.
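The switcher of the output layer can be sketched as below, assuming that $[h_v; h'_v]$ denotes concatenation of the forward and backward hidden states and that $W_s$ is a single weight vector producing a scalar $s_p$. The function names and the sample values are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def switch(p_v, p_t, h_fwd, h_bwd, W_s, b_s):
    """Return (s_p, p) with p = [s_p * p_v ; (1 - s_p) * p_t]."""
    hidden = list(h_fwd) + list(h_bwd)          # concatenated hidden states
    s_p = sigmoid(sum(w * h for w, h in zip(W_s, hidden)) + b_s)
    p = [s_p * x for x in p_v] + [(1.0 - s_p) * x for x in p_t]
    return s_p, p


s_p, p = switch(p_v=[0.8, 0.2], p_t=[0.4, 0.6],
                h_fwd=[0.1, -0.3], h_bwd=[0.5, 0.2],
                W_s=[0.0, 0.0, 0.0, 0.0], b_s=0.0)
print(s_p, p)
```

Because $s_p$ lies between 0 and 1 and each branch distribution sums to 1, the entries of the concatenated vector also sum to 1, so $s_p$ acts as the weight given to one branch over the other.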

[0082] Step 6: Model Training. During training, L2 regularization is applied to the weight matrices of the bidirectional LSTM network and the logistic regression layer to alleviate overfitting. The loss function is set to cross-entropy loss, and the optimizer is set to Adam to optimize the training process. In addition, as mentioned earlier, the Embedding word vector matrix is updated and adjusted during training.
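The training objective and a single optimizer update can be sketched as follows. The regularization strength `lam` and the Adam hyperparameters are commonly used defaults chosen here as assumptions; the text gives no values, and a real implementation would normally rely on a deep learning framework rather than hand-written updates.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class index."""
    return -math.log(probs[label])


def l2_penalty(weights, lam):
    """L2 regularization term: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Total loss for one sample: cross-entropy plus the L2 term (lam is assumed).
loss = cross_entropy([0.3, 0.7], label=1) + l2_penalty([0.5, -0.2], lam=0.01)
param, m, v = adam_step(param=1.0, grad=2.0, m=0.0, v=0.0, t=1)
print(loss, param)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly the learning rate.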

[0083] After the defect prediction model is trained, the source code of the test project set is processed according to steps 1 to 4 and then input into the defect prediction model to obtain the corresponding software defect prediction results.

[0084] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention in any way. Although the present invention has been disclosed above with reference to preferred embodiments, it is not intended to limit the invention. Therefore, any simple modifications, equivalent changes, and alterations made to the above embodiments based on the technical essence of the present invention without departing from the scope of the present invention should fall within the protection scope of the present invention.

Claims

1. A software defect prediction method based on structural semantic features, characterized in that it includes the following steps: Construct a defect prediction model, the defect prediction model including: The input layer receives specific information vector sequences and common information vector sequences, which are then input into the corresponding bidirectional LSTM networks. The specific information vector sequences are vectorized specific information node token sequences, and the common information vector sequences are vectorized common information node token sequences. The specific information nodes include method names or class names related to method calls and instance creation, keywords related to declarations, and keywords related to control flow. The common information nodes include node types related to method calls and instance creation, node types related to declarations, and node types related to control flow. The encoding layer is used to encode using a bidirectional LSTM network to obtain the corresponding specific information context vector and hidden state, as well as the common information context vector and hidden state. The logistic regression layer is used to classify the output obtained from the encoding layer, and obtain the prediction results corresponding to specific information and the prediction results corresponding to common information respectively. The output layer is used to select between two prediction results using a switcher to obtain the final prediction result. The formula for the output layer is as follows: $s_p = \sigma(W_s [h_v; h'_v] + b_s)$ and $p = [s_p p_v; (1 - s_p) p_t]$, where $s_p$ is the balance value calculated by the switcher, $p$ is the final prediction result of whether the module is defective, $p_t$ is the prediction result obtained based on the specific information context vector, $p_v$ is the prediction result obtained based on the shared information context vector, $h_v$ is the forward hidden state, $h'_v$ is the backward hidden state, $\sigma$ is the sigmoid activation function, and $W_s$ and $b_s$ are trainable parameter vectors; Obtain the project source code and parse it into an abstract syntax tree; Traverse the abstract syntax tree to obtain the token sequence of specific information nodes and the token sequence of common information nodes, and add special characters to both ends of the scope of the traversed control flow nodes; Eliminate tokens that do not meet the requirements from the specific information node token sequence and the shared information node token sequence; Initialize the Embedding word vector matrix and vectorize the specific information node token sequence and the common information node token sequence according to the matrix. Then, input the vectorized specific information node token sequence, the vectorized common information node token sequence, and the corresponding software module labels into the defect prediction model for training.

2. The software defect prediction method based on structural semantic features according to claim 1, characterized in that the bidirectional LSTM network corresponds one-to-one with specific information vector sequences and common information vector sequences. Each bidirectional LSTM network includes two parallel LSTM layers. These two LSTM layers capture sequence features from the corresponding vectorized token sequences in the forward and backward directions, respectively, and fuse the features of the corresponding tokens in the two directions to obtain the context features of each token.

3. The software defect prediction method based on structural semantic features according to claim 1, characterized in that the formula for the logistic regression layer is as follows: $p_t = \mathrm{softmax}(W_t o_t + b_t)$ and $p_v = \mathrm{softmax}(W_v o_v + b_v)$, where $p_t$ is the prediction result obtained based on the specific information context vector, $p_v$ is the prediction result obtained based on the shared information context vector, $W_t$, $W_v$, $b_t$, and $b_v$ are all trainable parameter vectors, $o_v$ is the specific information context vector, and $o_t$ is the shared information context vector.

4. The software defect prediction method based on structural semantic features according to claim 1, characterized in that adding special characters to both ends of the scope of the traversed control flow node is specifically done by adding the special character '(' at the beginning of the scope of the control flow node and the special character ')' at the end.

5. The software defect prediction method based on structural semantic features according to claim 1, characterized in that the tokens that do not meet the requirements are those that appear fewer than three times.

6. The software defect prediction method based on structural semantic features according to claim 1, characterized in that, after vectorizing the specific information node token sequence and the shared information node token sequence using the Embedding word vector matrix, the method further includes adding at least one 0 to the shorter sequence in the vectorized specific information node token sequence and the vectorized shared information node token sequence, so that the two sequences have the same length.

7. The software defect prediction method based on structural semantic features according to claim 1, characterized in that, during the training of the defect prediction model, L2 regularization is applied to the weight matrices of the bidirectional LSTM network and the logistic regression layer, the loss function is set to cross-entropy loss, the optimizer is set to Adam, and the Embedding word vector matrix is updated and adjusted during training.