Defect report and code commit link recovery enhancement method based on deep semi-supervision

By combining deep semi-supervised learning with time rules and pseudo-labeling techniques, the limitations of data dependency and semantic understanding in defect reporting and code commit link restoration are overcome, achieving efficient link restoration on the GitHub platform.

CN117215626B · Active · Publication Date: 2026-06-12 · NANJING UNIV OF AERONAUTICS & ASTRONAUTICS

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
NANJING UNIV OF AERONAUTICS & ASTRONAUTICS
Filing Date
2023-09-12
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing rule-based and machine learning-based defect reporting and code commit link recovery methods have limitations in understanding semantic relationships, and deep learning methods rely heavily on large amounts of labeled data, which is costly to obtain, resulting in poor performance of defect reporting and code commit link recovery on the GitHub open-source platform.

Method used

A deep semi-supervised learning method is adopted, which trains a deep neural network using labeled and unlabeled data, pairs defect reports and code submissions through time rules, infers pseudo-labels, constructs a class-balanced dataset, and jointly trains a traceability link recovery model to enhance link recovery performance.

Benefits of technology

With limited labeled data, the performance of traceability link recovery is improved, and the model's recovery efficiency and accuracy are enhanced by making full use of unlabeled data information.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a defect report and code submission link recovery enhancement method based on deep semi-supervision. The method comprises: pre-training with limited labeled defect report and code submission data; pairing unlabeled defect report and code submission data based on time rules; inferring pseudo-labels for the unlabeled defect report and code submission data through semi-supervised learning; constructing a class-balanced data set by selecting pseudo-labeled defect report and code submission data; and re-training with the labeled and pseudo-labeled defect report and code submission data. The application selects high-quality pseudo-labeled data together with labeled data to re-train the traceability link recovery model, so that the pseudo-labeled data expands the training data set. When the amount of labeled defect report and code submission data is small, the application achieves better traceability link recovery accuracy than using labeled data alone.

Description

Technical Field

[0001] This invention belongs to the field of traceability link recovery in software engineering, and specifically relates to an enhanced method for recovering links between defect reports and code submissions based on deep semi-supervised learning.

Background Technology

[0002] Software development involves creating and maintaining source code in a version control system, and submitting defect reports by users and testers in a bug tracking system. The traceability link between defect reports and code commits is crucial for software maintenance, helping to provide useful support for various software activities such as program understanding, assessing software quality, and predicting software defects. However, these links are often lost, making effective software maintenance a challenge.

[0003] To recover lost links, researchers have proposed several rule-based and machine learning-based traceability link recovery methods. However, these methods have limitations in understanding the semantic relationships between defect reports and code submissions. To address this issue, researchers have turned to deep learning, which can effectively capture semantic relevance between terms. However, a significant drawback of deep learning is its heavy reliance on large amounts of labeled data for training. Obtaining labeled data is expensive, and building large labeled datasets for each deep learning task is often impractical. Therefore, using readily available unlabeled data has become a promising research direction.

[0004] In software development on the GitHub open-source platform, labeling the link data between defect reports and code commits is costly, often resulting in only a small amount of labeled data being available, while the vast majority remains unlabeled. Deep neural networks exhibit superior performance in supervised tasks when sufficient labeled data is available. Considering the limited amount of labeled data in training datasets on GitHub open-source platforms, research interest in applying semi-supervised learning to deep neural networks (i.e., deep semi-supervised learning) is growing. Deep semi-supervised learning is a method that uses both labeled and unlabeled data to train deep neural networks, and it has already achieved good results in the field of computer vision. This invention combines deep semi-supervised learning with traceability link recovery to achieve good traceability link recovery performance with less labeled data.

Summary of the Invention

[0005] Purpose of the invention: This invention proposes a method for enhancing the recovery of traceability links based on deep semi-supervised learning. By fully utilizing the information in unlabeled defect reports and code submission data through deep semi-supervised learning, pseudo-labels of unlabeled defect reports and code submission data are inferred. These pseudo-labels are then combined with labeled defect reports and code submissions to jointly train a traceability link recovery model, thereby enhancing the performance of traceability link recovery.

[0006] Technical Solution: The present invention provides an enhanced method for defect reporting and code submission link recovery based on deep semi-supervised learning, which specifically includes the following steps:

[0007] (1) Use a limited set of labeled defect reports and code submission data to pre-train a traceable link recovery deep neural network, i.e., a pre-trained model;

[0008] (2) Based on time rules, unlabeled defect reports and code submission data are paired to obtain candidate links;

[0009] (3) Based on the pre-trained model obtained in step (1), the pseudo-labels of the candidate links obtained in step (2) are inferred through self-training or label propagation methods.

[0010] (4) Select samples with positive pseudo-labels from the pseudo-labeled data obtained in step (3), and then generate negative examples through dynamic random negative sample sampling technology to construct a class-balanced dataset;

[0011] (5) Use the labeled defect reports and code submission data and the pseudo-labeled data obtained from step (4) to retrain the traceability link recovery deep neural network until the preset number of training cycles is reached.

[0012] Furthermore, the implementation process of step (2) is as follows:

[0013] The time rule states that among all code commits whose submission time falls between the creation and closure time of a defect report, there must be at least one code commit with a traceable link to that defect report. According to this time rule, defect reports are paired with several code commits, and there is a candidate link in each pair of "defect report - code commit".

[0014] Furthermore, the process of inferring the pseudo-labels of the candidate links obtained in step (2) through the self-training method in step (3) is as follows:

[0015] The candidate link set obtained from step (2) is denoted as $X_u$, and each candidate link is represented as $(s,t)_i$. The probability predicted by the pre-trained model that sample $i$ belongs to class $j$ is expressed as $f_\theta((s,t)_i)_j$. If the class index with the maximum predicted probability is $j$, then the label indicating that sample $i$ belongs to class $j$ is true; otherwise it is false. For a "defect report - code submission" pair there are only two classes: a traceable link exists, or no traceable link exists. The pseudo-label for "link exists" is represented as …, and the pseudo-label for "no link" is represented as …

[0016] A confidence threshold $\beta$ is set to filter out predictions that the model is not confident in; candidate links whose predicted probabilities exceed the confidence threshold are selected for the next training step. The selected candidate links are denoted as …, where $X'_u$ is a subset of $X_u$. The confidence threshold is 0.8.

[0017] Furthermore, the process of inferring the pseudo-labels of the candidate links obtained in step (2) through the label propagation method in step (3) is as follows:

[0018] The pre-trained model constructs a descriptor set $V = (v_1, v_2, \ldots, v_n)$ for both labeled and unlabeled "defect report - code submission" pairs. Then, a k-nearest neighbor graph is constructed based on the descriptor set $V$ and represented by a sparse matrix $A$, whose elements are shown below:

[0019]

[0020] Here, $\mathrm{NN}_k(v_j)$ denotes the $k$ nearest neighbors of $v_j$. The weight matrix $W$ of the k-nearest neighbor graph is represented as $W = A + A^{\top}$, and its symmetric normalization is represented as …, where $D = \mathrm{diag}(W\mathbf{1}_n)$ is the degree matrix and $\mathbf{1}_n$ is the all-ones vector of size $n$. The elements of the label matrix $Y$ are shown below:

[0021]

[0022] Here, $X_l$ denotes the labeled "defect report - code submission" pairs and $y_i$ is the label of sample $i$. Label information is propagated from $X_l$ to $X_u$ through the feature space. Pseudo-labels are obtained by computing the diffusion matrix $Z = (I - \alpha W)^{-1} Y$; the pseudo-label of an unlabeled sample is represented as …, where $z_{ij}$ is the element in the $i$-th row and $j$-th column of $Z$. The probability that a traceable link exists is obtained by normalizing each row of $Z$. A weight … is assigned to each pseudo-label, where … is obtained by normalizing the $i$-th row of $Z$, $H$ is the entropy function, and $c$ is the number of classes. According to the above process, each candidate link is assigned a pseudo-label, and the data with pseudo-labels is called pseudo-labeled data.

[0023] Furthermore, the implementation process of step (4) is as follows:

[0024] First, select samples with positive pseudo-labels from the pseudo-labeled data obtained in step (3); a positive pseudo-label indicates that the model believes there may be a traceable link within the "defect report - code submission" pair. For these samples with positive pseudo-labels, defect reports belong to set $S$ and code submissions belong to set $T$. Next, the $n$ code submissions linked to defect report $s_i$ are removed from set $T$, and then $n$ code submissions are randomly selected from the remaining code submissions in $T$ and paired with $s_i$ to create negative samples that have no traceable links. This operation is performed for each defect report, finally generating a class-balanced dataset with the same number of positive and negative samples.

[0025] Furthermore, the implementation process of step (5) is as follows:

[0026] The loss function for the labeled "defect report - code submission" pairs is expressed as follows:

[0027]

[0028] The loss function of the self-training method is expressed as:

[0029]

[0030] The loss function for label propagation is expressed as:

[0031]

[0032] The overall loss function is expressed as:

[0033] $\mathrm{loss} = \mathrm{loss}_l + \lambda_u \, \mathrm{loss}_u$

[0034] where $H(p,q)$ represents the cross-entropy of distributions $p$ and $q$, and $\lambda_u$ is a hyperparameter.

[0035] Beneficial Effects: Compared with existing technologies, the present invention offers the following advantages: It combines deep semi-supervised learning with traceability link recovery, fully utilizing the abundant unlabeled defect reports and code submissions in the dataset. Pseudo-labels for these unlabeled defect reports and code submissions are inferred through semi-supervised learning, and the model is trained using both labeled and unlabeled defect reports and code submissions. This improves the performance of the traceability link recovery model even with limited amounts of labeled defect reports and code submissions.

Attached Figure Description

[0036] Figure 1 is a flowchart of the present invention;

[0037] Figure 2 is an overall framework diagram of the present invention;

[0038] Figure 3 is a diagram illustrating the pairing of defect reports and code submission data based on time rules;

[0039] Figure 4 is a diagram illustrating dynamic random negative sample sampling.

Detailed Implementation

[0040] The present invention will now be described in further detail with reference to the accompanying drawings.

[0041] As shown in Figure 1 and Figure 2, this invention proposes a method for enhancing defect reporting and code submission link recovery based on deep semi-supervised learning, specifically including the following steps:

[0042] Step 1: Pre-train a traceability link recovery deep neural network, i.e., a pre-trained model, using a limited set of labeled defect reports and code submission data.

[0043] In this step, a traceability link model is trained using a limited set of labeled defect reports and code submissions. This enables the model to learn the latent features of the defect reports and code submissions. After pre-training, the model can predict labels or provide feature representations for unlabeled defect reports and code submissions, providing valuable input for subsequent deep semi-supervised learning processes.

[0044] Step 2: Pair unlabeled defect reports and code submission data based on time rules to obtain candidate links.

[0045] Defect reports and code commits are common software artifacts that play a crucial role in software development. The traceable link between defect reports and code commits is essential for software engineering research and problem-solving. Throughout the software lifecycle, various issues may arise, which users and testers describe and request fixes through defect reports. Upon receiving a defect report, the development team addresses the issue and documents the fix in a code commit, typically referencing the appropriate defect report identifier. Users and testers then validate the solution and close the bug report. This process establishes a link between defect reports and code commits through identifiers.

[0046] In the traceability link recovery task, defect reports and code commits are first paired, with each pair containing a candidate link. A candidate link can be either a genuine link with traceability or a non-genuine link without traceability. The next step is to predict whether the candidate link is labeled as a genuine or non-genuine link. A conventional pairing method is to calculate the Cartesian product of defect reports and code commits, but this leads to a severe class imbalance problem because the number of defect reports / code commits without traceability is far greater than the number of defect reports / code commits with traceability.

[0047] In software development, there's a timeline between defect reports and code submissions: users or testers create defect reports after discovering issues, developers resolve the issues and submit fixes upon receiving the reports, and users or testers close the defect report once they confirm the issue is resolved. Therefore, the submission of bug fix code typically falls between the creation and closure of the defect report.

[0048] Based on this time rule, it is known that among all code commits whose commit times fall between the defect report creation and closure times, there must be at least one code commit with a traceable link to that defect report. By pairing each defect report with several code commits according to this relationship, a candidate link set is constructed. The number of candidate links constructed using this method is far less than that using the Cartesian product method, significantly reducing the number of "defect report-code commit" pairs without traceable links, and improving the efficiency and accuracy of traceable link recovery.

[0049] As shown in Figure 3, there are 2 defect reports and 8 code commits. Calculating their Cartesian product would generate 16 pairs of "defect report - code commit". However, by examining the temporal relationship between the defect reports and code commits, candidate links within the creation and closure time window of each defect report can be identified. For example, code commits 2, 3, 4, and 5 are committed within the creation and closure time window of defect report 1, while code commits 5, 6, and 7 are committed within the creation and closure time window of defect report 2. Therefore, defect report 1 is paired with code commits 2, 3, 4, and 5, and defect report 2 with code commits 5, 6, and 7. Based on the time rule, only 7 pairs of "defect report - code commit" are generated, effectively reducing the number of candidate links compared to the Cartesian product.
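The pairing in this example can be sketched in a few lines of Python. This is only an illustrative sketch: the `DefectReport` and `Commit` types, the `pair_by_time_rule` function, the concrete dates, and the choice of inclusive window boundaries are all assumptions made for the example and are not specified by the patent.

```python
from datetime import datetime
from typing import NamedTuple


class DefectReport(NamedTuple):  # hypothetical record type
    report_id: int
    created: datetime
    closed: datetime


class Commit(NamedTuple):  # hypothetical record type
    commit_id: int
    committed: datetime


def pair_by_time_rule(reports, commits):
    """Return (report_id, commit_id) candidate links whose commit time lies
    inside the report's creation/closure window (inclusive: an assumption)."""
    candidates = []
    for report in reports:
        for commit in commits:
            if report.created <= commit.committed <= report.closed:
                candidates.append((report.report_id, commit.commit_id))
    return candidates


def day(d):
    # Made-up dates; commit i is placed on day i for readability.
    return datetime(2023, 1, d)


# Toy data arranged to mirror the example: 2 reports, 8 commits.
reports = [DefectReport(1, day(2), day(5)), DefectReport(2, day(5), day(7))]
commits = [Commit(i, day(i)) for i in range(1, 9)]

candidates = pair_by_time_rule(reports, commits)
print(len(reports) * len(commits), len(candidates))  # prints "16 7"
```

With these toy dates, commit 5 falls inside both windows, so it appears in two candidate links, as in the example.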

[0050] Step 3: The pre-trained model obtained from Step 1 can directly predict pseudo-labels for candidate links through self-training methods, and can also propagate label information to unlabeled "defect reports - code submissions" through label propagation methods.

[0051] One type of semi-supervised learning method is self-training, a model-based approach. In self-training, a classifier assigns pseudo-labels to unlabeled data, and only the predictions whose confidence exceeds a certain threshold are kept. These pseudo-labeled examples are then used to enrich the labeled training data, and a new classifier is trained on the enlarged training set.

[0052] The candidate link set obtained from step 2 is denoted as $X_u$, and each "defect report - code submission" pair is represented as $(s,t)_i$. The probability predicted by the pre-trained model that sample $i$ belongs to class $j$ is expressed as $f_\theta((s,t)_i)_j$. If the class index with the maximum predicted probability is $j$, then the label indicating that sample $i$ belongs to class $j$ is true; otherwise it is false. For a "defect report - code submission" pair there are only two classes: a traceable link exists, or no traceable link exists. The pseudo-label for "link exists" is represented as …, and the pseudo-label for "no link" is represented as …

[0053] Furthermore, a confidence threshold $\beta$ is set to 0.8 to filter out predictions in which the model lacks confidence. Candidate links whose predicted probabilities exceed the confidence threshold are selected for the next training step. The selected candidate links are denoted as …, where $X'_u$ is a subset of $X_u$.
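The thresholding step can be illustrated with a minimal sketch. The function name `select_confident_pseudo_labels`, the class ordering (index 0 for "no link", index 1 for "link exists"), and the sample probabilities are assumptions for illustration; in practice the probabilities would come from the pre-trained model $f_\theta$.

```python
def select_confident_pseudo_labels(probs, beta=0.8):
    """probs: one [p_no_link, p_link] pair per candidate link (ordering assumed).
    Returns (candidate_index, pseudo_label) for confident predictions only."""
    selected = []
    for index, class_probs in enumerate(probs):
        confidence = max(class_probs)           # maximum predicted probability
        label = class_probs.index(confidence)   # class index attaining it
        if confidence > beta:                   # must exceed the threshold
            selected.append((index, label))
    return selected


# Made-up model outputs for four candidate links.
probs = [[0.10, 0.90], [0.45, 0.55], [0.95, 0.05], [0.20, 0.80]]
selected = select_confident_pseudo_labels(probs, beta=0.8)
print(selected)
```

The comparison is strict here because the text says the probability must exceed the threshold; a prediction of exactly 0.8 is therefore dropped in this sketch.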

[0054] Another semi-supervised learning method is label propagation, a graph-based approach that uses the label information of labeled nodes to predict the label information of unlabeled nodes. A complete graph model is built using the relationships between samples. In this graph, nodes include both labeled and unlabeled data, and edges represent the similarity between two nodes. A node's label is propagated to other nodes based on similarity. Labeled data acts as a source, labeling unlabeled data; the greater the similarity between nodes, the easier it is for the label to propagate. In this graph, nodes represent defect report-code submission pairs, and edges represent the similarity between these pairs. This method propagates label information to unlabeled defect reports and code submissions through the graph.

[0055] The pre-trained model constructs a descriptor set $V = (v_1, v_2, \ldots, v_n)$ for both labeled and unlabeled "defect report - code submission" pairs. Then, a k-nearest neighbor graph is constructed based on the descriptor set $V$ and represented by a sparse matrix $A$, whose elements are shown below:

[0056]

[0057] Here, $\mathrm{NN}_k(v_j)$ denotes the $k$ nearest neighbors of $v_j$. The weight matrix $W$ of the k-nearest neighbor graph is represented as $W = A + A^{\top}$, and its symmetric normalization is represented as …, where $D = \mathrm{diag}(W\mathbf{1}_n)$ is the degree matrix and $\mathbf{1}_n$ is the all-ones vector of size $n$.

[0058] The elements of the label matrix $Y$ are shown below:

[0059]

[0060] Here, $X_l$ denotes the labeled "defect report - code submission" pairs and $y_i$ is the label of sample $i$. Label information is propagated from $X_l$ to $X_u$ through the feature space. Pseudo-labels are obtained by computing the diffusion matrix $Z = (I - \alpha W)^{-1} Y$; the pseudo-label of an unlabeled sample is represented as …, where $z_{ij}$ is the element in the $i$-th row and $j$-th column of $Z$. The probability that a traceable link exists is obtained by normalizing each row of $Z$.

[0061] Pseudo-labels are not entirely accurate, and incorrect pseudo-labels can mislead the model's learning. Therefore, a weight … is assigned to each pseudo-label, where … is obtained by normalizing the $i$-th row of $Z$, $H$ is the entropy function, and $c$ is the number of classes.

[0062] According to the above process, each candidate link is assigned a pseudo-label, and the data with pseudo-labels is called pseudo-labeled data.
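The label propagation procedure can be sketched with NumPy as follows. Several formulas are not reproduced in the text above, so this sketch fills them with common choices that should be read as assumptions rather than as the patent's definitions: non-negative cosine similarity for the entries of $A$, the usual symmetric normalization $D^{-1/2} W D^{-1/2}$ used inside the diffusion, and an entropy-based certainty weight $1 - H(\hat{z}_i)/\log c$. The function name, the values of `k` and `alpha`, and the convention of marking unlabeled samples with `-1` are likewise hypothetical.

```python
import numpy as np


def propagate_labels(V, labels, k=2, alpha=0.99, num_classes=2):
    """V: (n, d) descriptors; labels: length-n ints, -1 marks unlabeled samples."""
    V = np.asarray(V, dtype=float)
    labels = np.asarray(labels)
    n = V.shape[0]
    V = V / np.linalg.norm(V, axis=1, keepdims=True)

    # Assumed similarity: non-negative cosine similarity without self-loops.
    S = np.clip(V @ V.T, 0.0, None)
    np.fill_diagonal(S, 0.0)

    # Sparse affinity A: each row keeps only its k largest similarities.
    A = np.zeros_like(S)
    nearest = np.argsort(-S, axis=1)[:, :k]
    rows = np.arange(n)[:, None]
    A[rows, nearest] = S[rows, nearest]

    W = A + A.T                          # W = A + A^T
    degree = W.sum(axis=1)               # diagonal of D = diag(W 1_n)
    degree[degree == 0] = 1.0            # guard against isolated nodes
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    W_norm = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # assumed normalization

    Y = np.zeros((n, num_classes))       # one-hot rows for labeled samples only
    labeled = labels >= 0
    Y[labeled, labels[labeled]] = 1.0

    # Diffusion Z = (I - alpha * W)^-1 Y, solved without forming the inverse.
    Z = np.linalg.solve(np.eye(n) - alpha * W_norm, Y)
    Z = np.clip(Z, 0.0, None)
    row_sums = Z.sum(axis=1, keepdims=True)
    Z_hat = np.divide(Z, row_sums,
                      out=np.full_like(Z, 1.0 / num_classes),
                      where=row_sums > 0)

    pseudo = Z_hat.argmax(axis=1)        # pseudo-label = column with largest z_ij
    entropy = -(Z_hat * np.log(Z_hat + 1e-12)).sum(axis=1)
    weights = np.clip(1.0 - entropy / np.log(num_classes), 0.0, 1.0)
    return pseudo, weights


# Two well-separated toy clusters with one labeled sample each.
V = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9], [0.05, 0.95]]
labels = [1, -1, -1, 0, -1, -1]
pseudo, weights = propagate_labels(V, labels)
print(pseudo.tolist(), weights.round(3).tolist())
```

Solving the linear system rather than inverting the matrix is a numerical convenience; for large candidate sets an iterative solver over a sparse matrix would be the more realistic choice.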

[0063] Step 4: Select samples with positive pseudo-labels from the pseudo-labeled data obtained in Step 3, and then generate negative examples using dynamic random negative sample sampling technology to construct a class-balanced dataset.

[0064] Dynamic random negative sample sampling (DRSS) can generate class-balanced datasets. Its principle is as follows: given a set of defect reports and code submissions with traceable links, defect reports belong to set $S$, and code submissions belong to set $T$. DRSS removes the $n$ code submissions linked to defect report $s_i$ from set $T$, then randomly selects $n$ code submissions from the remaining code submissions in $T$ and pairs them with $s_i$, producing negative examples that lack traceable links. This process is repeated for each defect report, ultimately generating a class-balanced dataset with an equal number of positive and negative examples.

[0065] As shown in Figure 4, a simple example illustrates the process of dynamic random negative sample sampling. A traceable link exists between defect report S1 and code submission T1. Dynamic random negative sample sampling removes T1 from the code submission set and randomly selects one code submission from the remaining submissions T2, T3, and T4 to generate a negative sample. The same process applies to defect report S2. Furthermore, defect report S3 is linked to both code submissions T3 and T4, so dynamic random negative sample sampling needs to generate two negative samples for S3. After removing T3 and T4 from the code submission set T, S3 must be paired with T1 and T2 to create two negative samples. Therefore, to construct the pseudo-label dataset, samples with positive pseudo-labels are selected, and negative samples are generated using dynamic random negative sample sampling. This process ensures that the number of positive and negative samples in the pseudo-label dataset remains balanced.
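A minimal sketch of this sampling procedure is given below. The function name, the triple format `(report, commit, label)`, and the guard for the case where fewer than $n$ submissions remain are assumptions added for the example. The link S2-T2 is also assumed, since the text does not say which submission S2 is linked to.

```python
import random


def dynamic_random_negative_sampling(positive_links, rng=None):
    """positive_links: (report, commit) pairs with positive (pseudo-)labels.
    Returns (report, commit, label) triples, label 1 = link, 0 = no link."""
    rng = rng or random.Random()
    positive_links = list(positive_links)
    all_commits = sorted({commit for _, commit in positive_links})  # set T

    linked = {}  # report -> set of commits linked to it
    for report, commit in positive_links:
        linked.setdefault(report, set()).add(commit)

    dataset = [(report, commit, 1) for report, commit in positive_links]
    for report, commits in linked.items():
        # Remove the n linked commits from T, then draw n from what remains.
        remaining = [c for c in all_commits if c not in commits]
        n = min(len(commits), len(remaining))  # guard: an added assumption
        for commit in rng.sample(remaining, n):
            dataset.append((report, commit, 0))
    return dataset


# Links from the example; S2-T2 is assumed for illustration.
links = [("S1", "T1"), ("S2", "T2"), ("S3", "T3"), ("S3", "T4")]
dataset = dynamic_random_negative_sampling(links, random.Random(0))
positives = [row for row in dataset if row[2] == 1]
negatives = [row for row in dataset if row[2] == 0]
print(len(positives), len(negatives))
```

Because the negatives are drawn at random, calling the function again at the start of each training cycle yields a fresh set of negative samples, which is one plausible reading of "dynamic".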

[0066] Step 5: Use the labeled defect reports and code submission data, along with the pseudo-labeled data obtained in Step 4, to retrain the traceability link recovery deep neural network until the preset number of training cycles is reached.

[0067] This step uses both labeled and pseudo-labeled data to train the deep neural network. At the beginning of each cycle, pseudo-labeled data is obtained by inferring candidate links in step 3, and a class-balanced dataset is constructed in step 4. Then, the labeled and pseudo-labeled data are combined and the network is trained again. The cycle ends after the combined dataset updates the model. The above steps are repeated until the preset number of training cycles is reached.

[0068] The network is trained in a supervised manner using labeled and pseudo-labeled data, their loss functions are calculated separately, and finally the combined loss is calculated.

[0069] The loss function for labeled data is expressed as:

[0070]

[0071] The loss function of the self-training method is expressed as:

[0072]

[0073] The loss function for label propagation is expressed as:

[0074]

[0075] The overall loss function is expressed as:

[0076] $\mathrm{loss} = \mathrm{loss}_l + \lambda_u \, \mathrm{loss}_u$

[0077] where $H(p,q)$ represents the cross-entropy of distributions $p$ and $q$, and $\lambda_u$ is a hyperparameter.
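The way the two loss terms combine can be illustrated with a small sketch. The individual loss formulas are not reproduced in the text, so the forms used here are assumptions: each term is taken as a mean cross-entropy $H(p,q)$ between a one-hot target and the model's predicted distribution, and the pseudo-labeled term optionally applies a per-sample weight (as in the label propagation variant). All function names and sample values are hypothetical.

```python
import math


def cross_entropy(target, predicted, eps=1e-12):
    """H(p, q) = -sum_j p_j * log(q_j) for two discrete distributions."""
    return -sum(p * math.log(q + eps) for p, q in zip(target, predicted))


def mean_loss(targets, predictions, weights=None):
    """Mean (optionally weighted) cross-entropy over a batch; assumed form."""
    if not targets:
        return 0.0
    if weights is None:
        weights = [1.0] * len(targets)
    total = sum(w * cross_entropy(p, q)
                for w, p, q in zip(weights, targets, predictions))
    return total / len(targets)


def combined_loss(labeled, pseudo, lambda_u=1.0):
    """loss = loss_l + lambda_u * loss_u, with each argument a tuple of
    (targets, predictions) or (targets, predictions, weights)."""
    loss_l = mean_loss(*labeled)
    loss_u = mean_loss(*pseudo)
    return loss_l + lambda_u * loss_u


# One labeled pair (true link) and one pseudo-labeled pair with weight 0.5.
labeled = ([[0.0, 1.0]], [[0.2, 0.8]])
pseudo = ([[1.0, 0.0]], [[0.5, 0.5]], [0.5])
loss = combined_loss(labeled, pseudo, lambda_u=0.5)
print(round(loss, 4))
```

A larger $\lambda_u$ lets the pseudo-labeled data influence training more strongly, which helps only insofar as the pseudo-labels are reliable.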

[0078] The above description provides a detailed account of the traceability link recovery model enhancement method based on deep semi-supervised learning described in this invention. However, it is clear that the specific implementation of this invention is not limited thereto. For those skilled in the art, various obvious modifications made to this invention without departing from the spirit and scope of the claims are within the protection scope of this invention.

Claims

1. A method for enhancing defect reporting and code commit link recovery based on deep semi-supervised learning, characterized in that it includes the following steps: (1) Use a limited set of labeled defect reports and code submission data to pre-train a traceable link recovery deep neural network, i.e., a pre-trained model; (2) Based on time rules, unlabeled defect reports and code submission data are paired to obtain candidate links; (3) Based on the pre-trained model obtained in step (1), the pseudo-labels of the candidate links obtained in step (2) are inferred through self-training or label propagation methods; (4) Select samples with positive pseudo-labels from the pseudo-labeled data obtained in step (3), and then generate negative examples through dynamic random negative sample sampling technology to construct a class-balanced dataset; (5) Use the labeled defect reports and code submission data and the pseudo-labeled data obtained from step (4) to retrain the traceability link recovery deep neural network until the preset number of training cycles is reached.

2. The method for enhancing defect reporting and code submission link recovery based on deep semi-supervised learning according to claim 1, characterized in that, The implementation process of step (2) is as follows: The time rule states that among all code commits whose submission time falls between the creation and closure time of a defect report, there must be at least one code commit with a traceable link to that defect report. According to this time rule, defect reports are paired with several code commits, and there is a candidate link in each "defect report - code commit" pair.

3. The method for enhancing defect reporting and code submission link recovery based on deep semi-supervised learning according to claim 1, characterized in that the process of inferring the pseudo-labels of the candidate links obtained in step (2) through the self-training method in step (3) is as follows: the candidate link set obtained from step (2) is denoted as $X_u$, and each candidate link is represented as $(s,t)_i$; the probability predicted by the pre-trained model that sample $i$ belongs to class $j$ is expressed as $f_\theta((s,t)_i)_j$; if the class index with the maximum predicted probability is $j$, then the label indicating that sample $i$ belongs to class $j$ is true, otherwise it is false; for a "defect report - code submission" pair there are only two classes: a traceable link exists, or no traceable link exists; the pseudo-label for "link exists" is represented as …, and the pseudo-label for "no link" is represented as …; a confidence threshold $\beta$ is set to filter out predictions that the model is not confident in; candidate links whose predicted probabilities exceed the confidence threshold are selected for the next training step; the selected candidate links are denoted as …, where $X'_u$ is a subset of $X_u$.

4. The method for enhancing defect reporting and code submission link recovery based on deep semi-supervised learning according to claim 1, characterized in that the process of inferring the pseudo-labels of the candidate links obtained in step (2) through the label propagation method in step (3) is as follows: the pre-trained model constructs a descriptor set $V = (v_1, v_2, \ldots, v_n)$ for both labeled and unlabeled "defect report - code submission" pairs; then, a k-nearest neighbor graph is constructed based on the descriptor set $V$ and represented by a sparse matrix $A$, whose elements are as follows: …, where $\mathrm{NN}_k(v_j)$ denotes the $k$ nearest neighbors of $v_j$; the weight matrix $W$ of the k-nearest neighbor graph is represented as $W = A + A^{\top}$, and its symmetric normalization is expressed as …, where $D = \mathrm{diag}(W\mathbf{1}_n)$ is the degree matrix and $\mathbf{1}_n$ is the all-ones vector of size $n$; the elements of the label matrix $Y$ are as follows: …, where $X_l$ denotes the labeled "defect report - code submission" pairs and $y_i$ is the label of sample $i$; label information is propagated from $X_l$ to $X_u$ through the feature space; pseudo-labels are obtained by computing the diffusion matrix $Z = (I - \alpha W)^{-1} Y$, and the pseudo-label of an unlabeled sample is represented as …, where $z_{ij}$ is the element in the $i$-th row and $j$-th column of $Z$; the probability that a traceable link exists is obtained by normalizing each row of $Z$; a weight … is assigned to each pseudo-label, where … is obtained by normalizing the $i$-th row of $Z$, $H$ is the entropy function, and $c$ is the number of classes; according to the above process, each candidate link is assigned a pseudo-label, and the data with pseudo-labels is called pseudo-labeled data.

5. The method for enhancing defect reporting and code submission link recovery based on deep semi-supervised learning according to claim 1, characterized in that the implementation process of step (4) is as follows: first, select samples with positive pseudo-labels from the pseudo-labeled data obtained in step (3); a positive pseudo-label indicates that the model believes there may be a traceable link within the "defect report - code submission" pair; for these samples with positive pseudo-labels, defect reports belong to set $S$ and code submissions belong to set $T$; next, the $n$ code submissions linked to defect report $s_i$ are removed from set $T$, and then $n$ code submissions are randomly selected from the remaining code submissions in $T$ and paired with $s_i$ to create negative samples that have no traceable links; the above operation is performed for each defect report, and finally a class-balanced dataset with the same number of positive and negative samples is generated.

6. The method for enhancing defect reporting and code submission link recovery based on deep semi-supervised learning according to claim 1, characterized in that the implementation process of step (5) is as follows: the loss function for the labeled "defect report - code submission" pairs is expressed as: …; the loss function of the self-training method is expressed as: …; the loss function for label propagation is expressed as: …; the overall loss function is expressed as: $\mathrm{loss} = \mathrm{loss}_l + \lambda_u \, \mathrm{loss}_u$, where $H(p,q)$ represents the cross-entropy of distributions $p$ and $q$, and $\lambda_u$ is a hyperparameter.

7. The method for enhancing defect reporting and code submission link recovery based on deep semi-supervised learning according to claim 3, characterized in that, The confidence threshold is 0.8.