Source code vulnerability detection method based on composite program representation

By constructing a composite program representation and combining it with a multilayer perceptron model, the problem of insufficient representation of code syntax and semantic information in existing vulnerability detection methods is solved, achieving more accurate vulnerability detection and reducing false positive and false negative rates.

CN117574375B · Active · Publication Date: 2026-06-26 · SOUTH CHINA UNIV OF TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: SOUTH CHINA UNIV OF TECH
Filing Date: 2023-07-12
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing vulnerability detection methods struggle to fully utilize the local and global contextual structure and semantic information of vulnerable code, resulting in high false positive and false negative rates, and insufficient representation of the code's syntax and semantic information.

Method used

A composite program representation is constructed by fusing the Abstract Syntax Tree (AST), Control Flow Graph (CFG), and Program Dependency Graph (PDG). Combined with the doc2vec model and a multilayer perceptron model, a vulnerability detection model is trained using a regularization loss and a quadruple loss function, which removes the influence of vulnerability-irrelevant statements.

Benefits of technology

It achieves more accurate vulnerability detection, reduces false positives and false negatives, and has scalability and good generalization ability, adapting to different platforms and complex software systems.


Figure CN117574375B_ABST

Abstract

The present application belongs to the field of software vulnerability detection and provides a source code vulnerability detection method based on composite program representation. The method comprises the following steps: first, three intermediate representations are extracted from the source code, and the corresponding paths are extracted from each; then, extra markers are removed from the extracted paths, path sequences are constructed, and all path sequences are concatenated to build a training corpus; then, a code embedding model is trained on the corpus using a doc2vec model; then, the trained embedding model is used to obtain vector representations of the source code, which are divided into a training set, a test set, and a validation set; then, a metric learning model is improved by combining a regularization loss with a quadruple loss function and is trained on the training set to obtain a vulnerability detection model; finally, the trained vulnerability detection model is used to detect whether source code contains a vulnerability. By using a composite source code representation, the present application addresses the problem that existing methods rely on a single specific representation, ignore the complementary relationship between different representations, and therefore represent the semantic and syntactic information of code insufficiently.

Description

Technical Field

[0001] This invention belongs to the field of software vulnerability detection, specifically relating to a source code vulnerability detection method based on composite program representation.

Background Technology

[0002] Computer software is ubiquitous in modern society, and people's lives depend to a great extent on a wide variety of software. Different forms of software run on different platforms, ranging from simple applications on handheld mobile devices to complex distributed enterprise software systems. These software programs are developed based on different technologies and methodologies, each with its own advantages and limitations.

[0003] However, within this vast software industry and computer security field, a significant issue is software security vulnerabilities. Software vulnerabilities refer to errors or defects in software that can be exploited by malicious attackers, threatening the security and stability of the system.

[0004] Software vulnerabilities are usually caused by human error and manifest as defects in the code. For vulnerable software, normal execution operations may not violate security policies. Only when vulnerable code encounters specific data (i.e., exploit code) or random data that meets specific conditions can it violate security policies.

[0005] Software vulnerability detection technologies can be categorized into three types: static, dynamic, and hybrid. Static technologies primarily rely on code analysis, including rule-based, code similarity-based, and symbolic execution-based methods. Dynamic technologies involve methods such as fuzzing and taint analysis. Hybrid methods combine static and dynamic techniques, but are less efficient in practice.

[0006] In recent years, with the development of machine learning and artificial intelligence, data-driven vulnerability detection methods have been widely applied. These methods utilize pattern recognition and machine learning techniques to learn the characteristics of vulnerable code, improving the generalization ability of vulnerability detection. Currently, machine learning-based methods mainly analyze source code, using deep learning techniques to extract high-dimensional features, and have strong generalization capabilities.

[0007] From the development trend of vulnerability detection, early research required experts to abstract vulnerability patterns and define rules for several vulnerability patterns. However, this approach was ineffective against new vulnerabilities that did not conform to these patterns, resulting in a high false negative rate. Machine learning-based vulnerability detection also required experts to identify features that could be considered vulnerabilities. While deep learning-based vulnerability detection can automatically extract vulnerability patterns, these patterns are often not obvious in the code itself, with a large proportion of statements unrelated to the vulnerability, representing significant noise in vulnerability detection. Therefore, researching how to fully utilize the local and global contextual structure and semantic information of vulnerable code, while removing the influence of statements unrelated to the vulnerability, to reduce false positives and false negatives, is an urgent problem to be solved in the field of vulnerability detection.

Summary of the Invention

[0008] This invention proposes a source code vulnerability detection method based on composite program representation. By integrating AST, CFG, and PDG program representations, it comprehensively represents the structural and semantic information of the code, uncovers potential vulnerabilities in the program, makes full use of the local and global context structure and semantic information of the vulnerable code, removes the influence of statements unrelated to the vulnerability, and reduces false positives and false negatives.

[0009] The source code vulnerability detection method based on composite program representation includes the following steps:

[0010] S1. Extract the abstract syntax tree, control flow graph, and program dependency graph from the source code, and extract the paths corresponding to the abstract syntax tree, control flow graph, and program dependency graph respectively;

[0011] S2. Remove extra tags from the extracted paths and construct path sequences. Concatenate all path sequences to obtain the composite program representation of the source code. Construct a training corpus based on the composite program representation of the source code.

[0012] S3. Train the doc2vec model using the constructed training corpus to obtain the code embedding model;

[0013] S4. Obtain the embedding vector of the source code through the code embedding model, construct the dataset for training the vulnerability detection model, and divide the dataset into training set, test set and validation set;

[0014] S5. Based on the multilayer perceptron model, combined with a regularization loss and a quadruple loss function, a vulnerability detection model is trained using the training set.

[0015] S6. Use a vulnerability detection model to detect whether the source code contains vulnerabilities and output the detection results.

[0016] Specifically, step S1 includes:

[0017] Construct a dataset containing different code files, each with a corresponding label; use the code analysis tool Joern to extract intermediate representations of all code files in the dataset, including abstract syntax trees, control flow graphs, and program dependency graphs;

[0018] AST paths, CFG paths, and PDG paths are extracted from the abstract syntax tree, control flow graph, and program dependency graph, respectively.

[0019] Specifically, step S2 includes:

[0020] For AST paths, CFG paths, and PDG paths, remove bracket markers, remove the movement direction of nodes, and use spaces as separators between nodes;

[0021] The AST paths, CFG paths, and PDG paths are each concatenated in extraction order to obtain the abstract syntax tree path sequence, the control flow graph path sequence, and the program dependency graph path sequence.

[0022] The abstract syntax tree path sequence, the control flow graph path sequence, and the program dependency graph path sequence are combined to obtain the composite program representation of the source code, and a corpus is constructed from these composite representations.

[0023] Specifically, step S3 includes:

[0024] Construct a TaggedDocument object for each composite program representation of the source code in the corpus, using the representation as the list of words;

[0025] Create a doc2vec model using the gensim library, set the context window size and minimum word frequency, and obtain the code vector after training;

[0026] The doc2vec vocabulary is constructed by taking a list of all TaggedDocument objects as input.

[0027] Set the model training parameters, train the doc2vec model using all TaggedDocument objects and the vocabulary, obtain the code embedding model and save it.

[0028] Specifically, step S4 includes the following steps:

[0029] The source code in the dataset is converted into a composite program representation of the source code. The composite program representation of the source code is then converted into an embedding vector through a code embedding model. The embedding vectors and the labels corresponding to the composite program representation are used to form the dataset for training the vulnerability detection model.

[0030] The dataset used to train the vulnerability detection model was divided into training, testing, and validation sets in a 6:2:2 ratio. The SMOTE method was used on the training set to generate synthetic samples to increase the number of minority class samples and balance the sample distribution between different classes.

[0031] Specifically, step S5 includes the following steps:

[0032] Two different probability distributions are obtained by using the dropout method, and the regularization loss is calculated by combining the two different probability distributions with the KL divergence.

[0033] Positive and negative samples are obtained from the original input samples. Quadruples are constructed based on the original input samples, positive samples, and negative samples. Quadruple loss is calculated.

[0034] The final loss function is calculated by combining regularization loss and quadruple loss, and the vulnerability detection model is trained.

[0035] This invention has the following technical features:

[0036] 1. This invention proposes a source code vulnerability detection method based on composite program representation. By introducing path representation methods such as AST, CFG, and PDG, a composite program representation is constructed, which fully captures the syntactic and semantic information of the code, achieves more accurate detection results, and solves the problem that existing methods are based on a single specific representation, ignore the complementary relationship between different representations, and are insufficient in representing the semantic and syntactic information of the code.

[0037] 2. The joint path representation is pre-trained using doc2vec. By building a function library and continuously introducing new data, the method becomes scalable. A combined loss function of classification loss and quadruple loss is adopted, which enlarges the distance between positive and negative samples during classification and makes full use of the vector features. Following the R-drop idea, KL divergence is used to avoid overfitting and enhance the model's generalization ability.

Description of the Drawings

[0038] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the structures shown in these drawings without creative effort.

[0039] Figure 1 is a flowchart of the source code vulnerability detection method in an embodiment of the present invention;

[0040] Figure 2 is a flowchart of constructing the composite program representation from the extracted paths in an embodiment of the present invention;

[0041] Figure 3 is a flowchart of calculating the regularization loss of the model in an embodiment of the present invention;

[0042] Figure 4 is a flowchart of calculating the quadruple loss of the model in an embodiment of the present invention.

Detailed Implementation

[0043] To make the technical solutions and details of this invention clearer and easier to understand, the invention will be further described in detail below with reference to embodiments and accompanying drawings. It should be understood that the specific embodiments described herein are only for explaining the invention and are not intended to limit the invention.

[0044] To improve the usability of existing methods and avoid the arduous task of feature extraction for human experts, recent vulnerability detection efforts have focused on applying deep neural networks for vulnerability identification. However, current work has limitations; it either treats source code as a flat sequence similar to natural language or represents source code with only partial information, making it difficult to learn the full semantics of the program. Source code is actually more structured and logical than natural language, possessing specific structural information such as abstract syntax trees, data flow, and control flow. Furthermore, vulnerabilities sometimes only become apparent under specific circumstances, after performing specific operations or receiving specific data, thus requiring a comprehensive investigation from multiple dimensions.

[0045] Example 1

[0046] Figure 1 shows the flowchart of the source code vulnerability detection method. The source code vulnerability detection method based on composite program representation proposed in this invention includes the following steps:

[0047] S1. Extract the abstract syntax tree, control flow graph, and program dependency graph from the source code, and extract the paths corresponding to each of them.

[0048] First, a dataset is constructed containing different code files, each with a corresponding label. The code analysis tool Joern is used to extract intermediate representations from all code files in the dataset: Abstract Syntax Trees (ASTs), Control Flow Graphs (CFGs), and Program Dependency Graphs (PDGs). Open-source datasets available in the field can be used. Each code file is labeled to indicate whether it contains a vulnerability. However, this label information is not used when training the embedding model; only the code files themselves are used. This design makes the model general and scalable, independent of any specific dataset.

[0049] Next, AST paths, CFG paths, and PDG paths are extracted from the abstract syntax tree, control flow graph, and program dependency graph, respectively. These paths are used to construct the final composite representation. The AST paths represent the hierarchical structure and relationships between statements and expressions in the code; the CFG paths represent the order and dependencies between control flows in the code; and the PDG paths represent the dependencies and data flow between variables and functions in the code.

[0050] In this embodiment, the first step is to extract the graph representation of the source code. The code files are prepared first, each containing one or more functions to be extracted. Because vulnerable code makes up only a small proportion of the whole, training directly on the original graph representation yields results that are hard to distinguish and does not achieve the desired effect. Therefore, the corresponding AST paths, CFG paths, and PDG paths are further extracted from the graphs. The graph representation is extracted using Joern version 1.1.99 on the Linux platform.

[0051] AST paths are a set of paths of the form $n_1 n_2 n_3 \ldots n_k n_{k+1}$, where $n$ denotes a node in the AST, $n_1$ and $n_{k+1}$ are terminal nodes, and $n_2$ through $n_k$ are non-terminal nodes. Each node $n$ has two attributes: the node type $t$ and the node's token. An AST path is written using the type $t$ of each node, so its final form is $t_1 t_2 t_3 \ldots t_k t_{k+1}$. Each AST path also carries a context $(s, e)$, where $s$ and $e$ are the tokens of the path's terminal nodes: $s$ is the token of $n_1$ and $e$ is the token of $n_{k+1}$.

[0052] CFG paths are a set of paths of the form $n_1 n_2 n_3 \ldots n_k n_{k+1}$. Each CFG path represents a control flow pattern that may be executed when the program runs. Here, $n$ denotes a node in the CFG with two attributes: its type $t$ and its code token. $n_1$ is a terminal node that marks the start of the path.

[0053] To represent loop structures, CFG paths follow three different patterns: a) the path ignores the loop and continues to the next node; b) the path traverses the loop exactly once and continues to the next node; c) the path traverses the loop exactly once and ends at the loop's starting node. The ending node $n_{k+1}$ of every path is constructed in one of two ways:

[0054] 1) a terminal node, corresponding to patterns a and b;

[0055] 2) a previously visited node, which is an intermediate node of a loop control structure, corresponding to pattern c.
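The three loop patterns can be illustrated with a small path enumerator. This is only a sketch under stated assumptions, not the extraction procedure of the method: the CFG is given as a hypothetical adjacency dictionary `succ`, each edge may be used at most once per path (so a loop is traversed at most once), and a path is also emitted whenever it returns to a node it has already visited.

```python
def enumerate_cfg_paths(succ, entry):
    """Enumerate CFG paths in which every loop is traversed at most once.

    succ  -- dict mapping a node name to the list of its successor names
    entry -- name of the start node n_1
    """
    paths = []

    def walk(node, path, used_edges):
        successors = succ.get(node, [])
        if not successors:
            # Ending node is a terminal node: patterns a and b.
            paths.append(path)
            return
        for nxt in successors:
            edge = (node, nxt)
            if edge in used_edges:
                continue  # never walk the same edge twice, so no second loop pass
            new_path = path + [nxt]
            if nxt in path:
                # Ending node was visited before (loop start): pattern c.
                paths.append(new_path)
            walk(nxt, new_path, used_edges | {edge})

    walk(entry, [entry], frozenset())
    return paths


# Hypothetical CFG with a single loop: entry -> head -> body -> head, head -> exit
cfg = {"entry": ["head"], "head": ["body", "exit"], "body": ["head"], "exit": []}
loop_paths = enumerate_cfg_paths(cfg, "entry")
```

For this toy graph the enumerator yields one path per pattern: the loop skipped, the loop traversed once and then left, and the loop traversed once ending at its starting node.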

[0056] CFG path contexts are defined in the same way as AST path contexts, and PDG paths are defined similarly to AST paths. PDG edges have different types, including control dependencies and data dependencies, and all edges on a given path have the same type. PDG path contexts are likewise defined in the same way as AST path contexts.

[0057] Take the function `example` as an example: `int example(int b){int a=5;if(b<a&&b>0){fun(b);}}` is a C function named `example`. It takes `b` as its input parameter and calls a function `fun` when `b` is less than 5 and greater than 0. The `fun` function does not need a specific implementation. Each node of the CFG extracted from this code has two parts: the node type and the corresponding code content. Based on the path definition above, the token is taken from terminal nodes and the node type from non-terminal nodes.

[0058] Four paths can be extracted from the CFG, each starting and ending with a terminal node. To prevent interference from function names, all function names use "self" as the token. One example path is: "self(METHOD)_(<operator>.assignment)_(<operator>.lessThan)_(<operator>.logicalAnd)_(METHOD_RETURN)int"

[0059] Here, `self` and `int` are the tokens of the terminal nodes, while the others are the node types of the non-terminal nodes: `METHOD` indicates a method node; `<operator>` denotes an operator, so `<operator>.assignment` is the assignment operator, corresponding to "=" in the code, `<operator>.lessThan` corresponds to "<", and `<operator>.logicalAnd` corresponds to "&&"; `METHOD_RETURN` corresponds to the return value; and "_" indicates that the movement direction between nodes in the path is from top to bottom.
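As an illustration of how such a path string might be assembled, the following sketch formats a list of nodes into the notation of the example path. The `Node` type, the `format_path` helper, and the node tokens are hypothetical, and the layout (start token, bracketed types joined by "_", end token) is inferred from the single example above rather than from a full specification.

```python
from typing import List, NamedTuple


class Node(NamedTuple):
    type: str   # node type t, e.g. "METHOD"
    token: str  # code token of the node


def format_path(nodes: List[Node], function_name: str) -> str:
    """Render a path as <start token>(t1)_(t2)_..._(tk+1)<end token>."""

    def token_of(node: Node) -> str:
        # Function names are replaced by "self" to avoid interference.
        return "self" if node.token == function_name else node.token

    types = "_".join(f"({node.type})" for node in nodes)
    return f"{token_of(nodes[0])}{types}{token_of(nodes[-1])}"


# Hypothetical node list for one CFG path of the `example` function
example_nodes = [
    Node("METHOD", "example"),
    Node("<operator>.assignment", "a=5"),
    Node("<operator>.lessThan", "b<a"),
    Node("<operator>.logicalAnd", "b<a&&b>0"),
    Node("METHOD_RETURN", "int"),
]
raw_path = format_path(example_nodes, function_name="example")
```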

[0060] S2. Remove extra tags from the extracted paths and construct path sequences. Concatenate all path sequences to obtain a composite program representation of the source code. Construct a corpus based on the composite program representation of the source code. The corpus can be used to train a code embedding model.

[0061] First, by removing bracket markers, removing the movement direction of nodes, and using spaces as separators between nodes, the interference in the paths can be removed, allowing the model to focus on the differences between paths in terms of type and context token.

[0062] For the path example in step S1, the processed path sequence is:

[0063] self METHOD <operator>.assignment <operator>.lessThan <operator>.logicalAnd METHOD_RETURN int
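The cleaning step can be sketched as follows, assuming the raw path format of the example in [0058]. The helper name `clean_path` is hypothetical. Note that the direction marker "_" is only removed where it sits between two bracketed node types, so that type names such as `METHOD_RETURN` keep their own underscore.

```python
import re


def clean_path(raw: str) -> str:
    """Strip bracket markers and direction markers; separate nodes with spaces."""
    # A direction marker is the "_" between ")" and "(", possibly padded with spaces.
    without_direction = re.sub(r"\)\s*_\s*\(", ") (", raw)
    # Replace the remaining brackets with spaces.
    without_brackets = without_direction.replace("(", " ").replace(")", " ")
    # Collapse runs of whitespace into single separators.
    return " ".join(without_brackets.split())


raw = (
    "self(METHOD)_(<operator>.assignment)_(<operator>.lessThan)"
    "_(<operator>.logicalAnd)_(METHOD_RETURN)int"
)
cleaned = clean_path(raw)
```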

[0064] Then, the AST paths, CFG paths, and PDG paths are each concatenated in the order of path extraction to obtain the abstract syntax tree path sequence, the control flow graph path sequence, and the program dependency graph path sequence.

[0065] A composite program representation is obtained by combining the path sequences of the abstract syntax tree, the control flow graph, and the program dependency graph. This representation contains details of the structure, control flow, and data dependencies within the code file, providing a rich information foundation for subsequent vulnerability detection and analysis. A corpus is constructed based on the composite program representation of the source code. The corpus is defined as follows:

[0066] $\text{Corpus} = \{S_1, S_2, S_3, \ldots, S_n\}$,

[0067] where $n$ is the length of the corpus, which equals the number of functions, and each $S$ is the composite program representation of a function's source code, $S = \text{Concatenate}(A, C, P)$, where $A$, $C$, and $P$ are the path sequences built from the paths extracted from the AST, CFG, and PDG, respectively.
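A minimal sketch of this construction is shown below. The helper names and the input layout (one tuple of cleaned AST, CFG, and PDG path lists per function) are assumptions made for illustration, and the toy paths are invented.

```python
from typing import List, Sequence, Tuple

FunctionPaths = Tuple[Sequence[str], Sequence[str], Sequence[str]]


def path_sequence(paths: Sequence[str]) -> str:
    """Join the cleaned paths of one representation in extraction order."""
    return " ".join(paths)


def composite_representation(ast_paths, cfg_paths, pdg_paths) -> str:
    """S = Concatenate(A, C, P), with spaces as separators."""
    parts = [path_sequence(ast_paths), path_sequence(cfg_paths), path_sequence(pdg_paths)]
    return " ".join(part for part in parts if part)


def build_corpus(functions: List[FunctionPaths]) -> List[str]:
    """Corpus = {S_1, ..., S_n}: one composite representation per function."""
    return [composite_representation(a, c, p) for a, c, p in functions]


# Two toy functions with invented, already cleaned paths
toy_functions = [
    (["self METHOD BLOCK int"], ["self METHOD METHOD_RETURN int"], ["b CALL a"]),
    (["self METHOD int"], ["self METHOD METHOD_RETURN void"], []),
]
toy_corpus = build_corpus(toy_functions)
```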

[0068] S3. Train the doc2vec model using the constructed training corpus to obtain the code embedding model.

[0069] The code embedding model used in step S3 is based on the doc2vec model architecture, an unsupervised learning model for learning document and word vector representations, and an extension of word2vec. Similar to word2vec, which captures semantic information of words by learning word vectors, doc2vec is further extended to the document level, enabling it to learn the vector representation of each document. This method treats each sequence of compound program representations as a document.

[0070] The doc2vec model has two variants: PV-DM (Paragraph Vector-Distributed Memory) and PV-DBOW (Paragraph Vector-Distributed Bag of Words). The PV-DM model is an inferential model that attempts to learn the vector representations of documents and words by predicting target words. The PV-DBOW model, on the other hand, is a direct model that learns document vectors by randomly sampling a word from the document as the target. This method uses the PV-DM approach, and specifically includes the following steps:

[0071] S31. Construct a TaggedDocument object from each composite program representation S of the source code in the corpus, using S as the list of words.

[0072] The purpose of constructing TaggedDocument objects is to convert text data into a format suitable for the doc2vec model. Each TaggedDocument object consists of a tag and a list of words. This method uses the composite program representation S of the source code from step S2 as the list of words. Each line in the corpus is one composite program representation S, and its position in the corpus is used as an integer tag to distinguish the different representations.

[0073] S32. Create a doc2vec model using the gensim library, set the context window size to 3, the minimum word frequency to 1, and obtain the code vector after training.

[0074] A doc2vec model was created using methods provided in the gensim library, with a context window size of 3 and a minimum word frequency of 1. After training, the resulting code vectors had a dimension of 512, which effectively represented the semantic features of the code.

[0075] S33. Construct the doc2vec vocabulary using a list of all TaggedDocument objects as input.

[0076] Building the vocabulary is a crucial step before training the Doc2Vec model. The purpose of the vocabulary is to provide the model with an internal dictionary containing all words appearing in all documents. The doc2vec vocabulary is constructed using a list of all TaggedDocument objects created in S31 as input. During vocabulary construction, the model performs preprocessing operations on each word in the document set, such as frequency counting and removing low-frequency words. The constructed vocabulary provides the model with a way to organize and manage the vocabulary in the document set, assigning a unique identifier to each word and providing a foundation for subsequent training. By building the vocabulary, the model can better understand and represent the lexical information in the documents.

[0077] S34. Set the model training parameters, train the doc2vec model using all TaggedDocument objects and the vocabulary, obtain the code embedding model and save it.

[0078] Once the TaggedDocument objects and the vocabulary have been prepared through the above steps, model training can begin. The training parameters must be set: the total_examples parameter, which specifies the total number of training samples seen by the model, is set to the number of composite program representations S in the corpus; and the epochs parameter, which specifies the number of training epochs, is set to a chosen value, for example 130.

[0079] After training the doc2vec model, you can save it to disk for later use. Saving the model avoids the time and resource consumption of retraining, and also allows you to load and use the model in other applications or environments. When saving the model, all its parameters, vocabulary, document vectors, word vectors, and other information are stored in a specified file.

[0080] S4. Obtain the embedding vector of the source code through the code embedding model, construct the dataset for training the vulnerability detection model, and divide the dataset into training set, test set and validation set.

[0081] Specifically, the source code in the dataset is first converted into a composite program representation S of the source code as described in step S2. Then, the code embedding model is loaded to convert the composite program representation S of the source code into a vector, i.e., the embedding vector is obtained. In order to train the vulnerability detection model, the embedding vector is combined with its corresponding label according to the label information of the dataset to form the dataset for training the vulnerability detection model.

[0082] The dataset used to train the vulnerability detection model was divided into training, test, and validation sets in a 6:2:2 ratio. The SMOTE method was applied to the training set to generate synthetic samples, increasing the number of minority class samples and balancing the sample distribution across different classes. The training set served as input for subsequent model training, while the test and validation sets were used to adjust model parameters to obtain the optimal model.

[0083] Specifically, in machine learning tasks, when the number of samples from different classes differs significantly, the classifier may perform poorly when handling minority class samples. To address the class imbalance problem, the SMOTE method is used on the training set to increase the number of minority class samples by generating synthetic samples, thus balancing the sample distribution between different classes. It selects a minority class sample in the feature space based on the similarity between minority class samples, and then randomly selects several sample points from its nearest neighbors, generating new synthetic samples through linear interpolation. By using SMOTE, the number of minority class samples can be increased, thereby improving the classifier's performance on the minority class. This helps to solve the class imbalance problem and improve the model's generalization ability. The SMOTE method is not used on the test and validation sets, maintaining the original data label distribution.
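The split and the oversampling idea can be sketched with the standard library alone. The function names, the seeds, and the neighbor count `k` are assumptions for illustration; in practice a library implementation such as the `SMOTE` class of imbalanced-learn would typically be used instead of this simplified version.

```python
import math
import random


def split_6_2_2(samples, seed=0):
    """Shuffle and split into training, test, and validation sets (6:2:2)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10
    n_test = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority vectors by linear interpolation.

    Requires at least two minority vectors.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the chosen sample (Euclidean distance).
        neighbours = sorted(
            (v for v in minority if v is not base),
            key=lambda v: math.dist(base, v),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


train, test, validation = split_6_2_2(range(10))
# Oversampling is applied to the training set only.
minority_vectors = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
synthetic_vectors = smote(minority_vectors, n_new=4, k=2)
```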

[0084] S5. Based on a multilayer perceptron model combined with a regularization loss and a quadruple loss function, a vulnerability detection model is trained on the training set.

[0085] Specifically, a vulnerability detection model is trained using a metric learning approach. This model employs a multilayer perceptron (MLP) as its backbone network, comprising an input layer, multiple intermediate hidden layers, and an output layer. The network layers use embedding vectors composed of source code representations as input data. The intermediate hidden layers perform non-linear processing on the input data through activation functions and weight parameter transformations. The model is trained by combining regularization loss and quadruple loss. Finally, the output layer generates the vulnerability detection results.

[0086] The model's loss function consists of two parts: Regularized Dropout (R-drop) and quadruple loss. R-drop uses dropout to avoid overfitting: it randomly drops some neurons to construct different sub-models and applies a regularization constraint to the network's output predictions. R-drop comprises a KL divergence term and a cross-entropy loss. The KL divergence constrains the outputs of the different dropout sub-models to be consistent, while the cross-entropy loss serves the binary classification task of vulnerability detection. Quadruple loss is a metric learning loss function that requires four input samples: the original sample $x_i$, a positive sample $p_i$, and two negative samples. Positive samples are samples similar to the original input, while negative samples are samples dissimilar to it. The training objective is to ensure that, in the new encoding space, samples with the same label are close together and samples with different labels are far apart.

[0087] Specifically, step S5 includes the following steps:

[0088] S51. Two different probability distributions are obtained through the dropout method. The regularization loss is calculated by combining the two different probability distributions with the KL divergence.

[0089] This method employs dropout to avoid overfitting, randomly discarding some neurons to construct different sub-models, and applies regularization constraints to the network's output predictions. The regularization loss includes cross-entropy loss and KL divergence. Cross-entropy loss is used for the binary classification task of vulnerabilities.

[0090]

[0091] Here, the left-hand side is the cross-entropy loss function, with the superscript i indicating the loss for the i-th sample; $y_i$ is the true label of the i-th sample, and the remaining symbol is the label predicted by the model for the i-th sample.

[0092] Figure 3 shows the flowchart for calculating the regularization loss of the model. Dropout is applied twice to obtain two different probability distributions, P1 and P2. KL divergence is then used to constrain P1 and P2. KL divergence is an asymmetric measure of the difference between two probability distributions.
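The two dropout passes described above can be sketched as follows. This is a minimal pure-Python illustration, not the patent's implementation: the layer sizes, the ReLU activation, the softmax output, the dropout rate of 0.3, and the names `mlp_forward`, `w1`, and `w2` are all assumptions made for the example.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mlp_forward(x, w1, w2, drop_rate, rng):
    """One stochastic forward pass: ReLU hidden layer with inverted dropout, then softmax."""
    hidden = []
    for row in w1:
        h = max(0.0, sum(wi * xi for wi, xi in zip(row, x)))
        keep = rng.random() >= drop_rate
        # Dropped units output 0; kept units are rescaled by 1 / (1 - drop_rate).
        hidden.append(h / (1.0 - drop_rate) if keep else 0.0)
    logits = [sum(wi * hi for wi, hi in zip(row, hidden)) for row in w2]
    return softmax(logits)

rng = random.Random(0)
x = [0.5, -1.0, 2.0]  # toy stand-in for a source code embedding vector
w1 = [[rng.uniform(-1, 1) for _ in x] for _ in range(8)]        # hidden layer weights
w2 = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(2)]  # two output classes

# The same input and weights are passed through the network twice; each pass
# draws its own dropout mask, giving the two distributions P1 and P2.
p1 = mlp_forward(x, w1, w2, 0.3, rng)
p2 = mlp_forward(x, w1, w2, 0.3, rng)
```

Because each pass samples an independent dropout mask, the two outputs generally differ even though the weights are shared; with a dropout rate of 0 the passes would coincide.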

[0093] $$D_{KL}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$

[0094] Here, p and q denote the original probability distribution and the approximate probability distribution, respectively. The KL divergence quantifies how much information is lost when one distribution is used to approximate another. In this method, its role is to make the outputs of the different dropout sub-models as similar as possible. The KL divergence loss of P1 and P2 is:

[0095]

[0096] The final regularization loss combines the cross-entropy loss and the KL divergence loss, and adds a parameter λ to control the weight of the loss. The regularization loss is:

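[0097] $$L_{R}^{i} = L_{CE}^{i} + \lambda L_{KL}^{i}$$

[0098] where $L_{R}^{i}$ denotes the regularization loss for the i-th sample, $\lambda$ is the parameter that controls the weight of the loss, $L_{CE}^{i}$ denotes the cross-entropy loss function, and $L_{KL}^{i}$ denotes the KL divergence between the two different probability distributions P1 and P2.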
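A minimal sketch of how such a regularization loss could be computed from the two dropout outputs is given below. It rests on stated assumptions rather than on the patent's exact formula: the cross-entropy is averaged over both passes, the KL term is symmetrised (as in the published R-Drop technique), and the function names and the example value of `lam` are hypothetical.

```python
import math

def cross_entropy(probs, label, eps=1e-12):
    """Negative log-likelihood of the true class (label is 0 or 1)."""
    return -math.log(max(probs[label], eps))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; asymmetric in p and q."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def regularization_loss(p1, p2, label, lam=1.0):
    """Cross-entropy plus lam times the KL consistency term."""
    # Assumption: average the cross-entropy of the two dropout passes.
    ce = 0.5 * (cross_entropy(p1, label) + cross_entropy(p2, label))
    # Assumption: symmetrised KL so that neither pass is treated as the reference.
    kl = 0.5 * (kl_divergence(p1, p2) + kl_divergence(p2, p1))
    return ce + lam * kl

p1 = [0.30, 0.70]  # output of the first dropout pass (toy values)
p2 = [0.20, 0.80]  # output of the second dropout pass (toy values)
loss = regularization_loss(p1, p2, label=1, lam=0.5)
```

When the two passes agree exactly, the KL term vanishes and only the classification term remains; a larger `lam` penalises disagreement between the passes more heavily.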

[0099] S52. Construct positive and negative samples based on the original input samples. Construct a quadruple based on the original input samples, positive samples, and negative samples, and calculate the quadruple loss.

[0100] In addition to the input data x, positive and negative samples must be constructed to form a quadruple, that is, a sample tuple (x, p, n1, n2). Figure 4 shows the flowchart for calculating the quadruple loss of the model, where x is the original input sample, p is a positive sample of the same class as x, and n1 and n2 are negative samples of a different class from x.

[0101] All data in the dataset are divided into two classes: a label of 1 represents a vulnerability, and a label of 0 represents no vulnerability. Indices are stored according to the different labels. For an input x, a sample different from x is randomly selected from the list of indices with the same label as x as a positive sample, and two different samples are randomly selected from the list of indices with different labels as negative samples.
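The index-based sampling described above can be sketched as follows. The function names `build_label_index` and `sample_quadruple` and the toy label list are hypothetical; the sketch also assumes each class contains enough samples (at least two with the anchor's label and at least two with the other label).

```python
import random

def build_label_index(labels):
    """Store sample indices per label: 1 = vulnerable, 0 = not vulnerable."""
    index = {0: [], 1: []}
    for i, y in enumerate(labels):
        index[y].append(i)
    return index

def sample_quadruple(anchor, labels, index, rng):
    """Return (x, p, n1, n2) as dataset indices for the given anchor index."""
    y = labels[anchor]
    # Positive: a different sample drawn from the list with the same label.
    same_label = [i for i in index[y] if i != anchor]
    positive = rng.choice(same_label)
    # Negatives: two distinct samples drawn from the list with the other label.
    neg1, neg2 = rng.sample(index[1 - y], 2)
    return anchor, positive, neg1, neg2

labels = [1, 0, 1, 0, 0, 1]
index = build_label_index(labels)
x, p, n1, n2 = sample_quadruple(0, labels, index, random.Random(0))
```

Drawing both negatives with `rng.sample` guarantees they are distinct, which the second term of the quadruple loss relies on.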

[0102] The goal of using quadruples is to ensure that samples with the same label are closer together and samples with different labels are farther apart in the new encoding space. The formula for calculating quadruple loss is:

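[0103] $$L_{quad}^{i} = \max\left(0,\; D(x_i, p_i) - D(x_i, n_{1,i}) + \alpha\right) + \max\left(0,\; D(x_i, p_i) - D(n_{1,i}, n_{2,i}) + \beta\right)$$

[0104] where D denotes the distance metric between samples, here the cosine distance, which measures the similarity or difference between samples; $i$ denotes the i-th sample; $x_i$, $p_i$, $n_{1,i}$, and $n_{2,i}$ denote the i-th input sample, positive sample, negative sample 1, and negative sample 2, respectively; and $\alpha$ and $\beta$ are manually set constants, usually with $\alpha > \beta$.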

[0105] In the formula for the quadruple loss, the first term is called the strong push. It aims to make the distance between the original sample and the positive sample as small as possible and smaller than the distance between the original sample and the negative sample, with α as the margin parameter. The second term is called the weak push. It aims to make the distance between the original sample and the positive sample as small as possible and smaller than the distance between the first negative sample and the second negative sample, with β as the margin parameter.
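The strong-push and weak-push terms can be illustrated with a short sketch. It assumes the common hinge form of the quadruple loss with unsquared cosine distances; the default margins `alpha=0.5` and `beta=0.25` are hypothetical values chosen only so that α > β.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; 0 for parallel vectors, 2 for opposite vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def quadruple_loss(x, p, n1, n2, alpha=0.5, beta=0.25):
    """Hinge-style quadruple loss for one (x, p, n1, n2) tuple of embeddings."""
    d_xp = cosine_distance(x, p)
    # Strong push: anchor-positive distance vs. anchor-negative distance.
    strong = max(0.0, d_xp - cosine_distance(x, n1) + alpha)
    # Weak push: anchor-positive distance vs. distance between the two negatives.
    weak = max(0.0, d_xp - cosine_distance(n1, n2) + beta)
    return strong + weak

# Well-separated tuple: positive coincides with x, negatives are far away.
easy = quadruple_loss([1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [1.0, 0.0])
# Badly arranged tuple: positive is orthogonal to x, negatives coincide with x.
hard = quadruple_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0])
```

A tuple that already satisfies both margins contributes zero loss, so training effort concentrates on tuples whose embeddings are still poorly arranged.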

[0106] S53. The final loss function is calculated by combining the regularization loss and the quadruple loss, and the vulnerability detection model is trained.

[0107] By using two dropouts and calculating with quadruples, the final loss function of the vulnerability detection model is:

[0108] $$L = \sum_{i} \left( L_{CE}^{i} + \lambda_1 L_{KL}^{i} + \lambda_2 L_{quad}^{i} \right)$$

[0109] where $\lambda_1$ is a tunable parameter that controls the weight of the KL divergence within the regularization loss, and $\lambda_2$ is a tunable parameter that controls the ratio of the quadruple loss to the regularization loss. In actual training, these parameters are adjusted to find the best model. $L_{CE}^{i}$ denotes the classification loss, $L_{KL}^{i}$ denotes the KL divergence loss, $L_{quad}^{i}$ denotes the quadruple loss, and $i$ denotes the i-th sample.
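As a worked illustration of how the two weights act, assume the total loss for one sample takes the additive form "classification loss + λ1 × KL divergence + λ2 × quadruple loss". All numbers below are hypothetical and chosen only for the arithmetic: a classification loss of 0.40, a KL divergence of 0.05, a quadruple loss of 0.30, λ1 = 0.5, and λ2 = 0.2.

```latex
L^{i} = 0.40 + 0.5 \times 0.05 + 0.2 \times 0.30
      = 0.40 + 0.025 + 0.06
      = 0.485
```

Under these assumed values, the classification term dominates; raising λ2 would shift the optimisation toward arranging the embedding space, while raising λ1 would emphasise consistency between the two dropout passes.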

[0110] Preferably, the maximum number of training epochs is set to 200 and an early stopping strategy is adopted, with the patience parameter set to 3. Training stops when the model's performance has not improved for 3 consecutive evaluations or when the maximum number of epochs is reached. The best model is saved as the vulnerability detection model, which can then be used for vulnerability detection.
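The early stopping rule can be sketched as below. To keep the example self-contained, a list of validation scores stands in for evaluating the model after each epoch; the function name `run_early_stopping` and the score values are hypothetical, and a real training loop would save a checkpoint where the comment indicates.

```python
def run_early_stopping(val_scores, max_epochs=200, patience=3):
    """Return (best_epoch, best_score, last_epoch_run) under early stopping."""
    best_score = float("-inf")
    best_epoch = -1
    last_epoch = -1
    stale = 0  # consecutive epochs without improvement
    for epoch, score in enumerate(val_scores[:max_epochs]):
        last_epoch = epoch
        if score > best_score:
            # Improvement: remember it (a real loop would save the model here).
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_score, last_epoch

scores = [0.70, 0.75, 0.74, 0.76, 0.75, 0.76, 0.755, 0.80]
best_epoch, best_score, last_epoch = run_early_stopping(scores)
```

In this toy run the best score is reached at epoch 3 and training halts at epoch 6, so the later score of 0.80 is never seen; this is the usual trade-off of a small patience value.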

[0111] Compared to traditional models that use cross-entropy loss, the vulnerability detection model described in this method is improved in the following ways:

[0112] (1) Use dropout twice to obtain different sub-models and calculate KL divergence;

[0113] (2) Introduce a quadruple comprising the original input, a positive sample, and negative samples, so as to train an embedding space in which the distance between samples of the same class is as small as possible and the distance between samples of different classes is as large as possible.

[0114] S6. Use a vulnerability detection model to detect whether the source code contains vulnerabilities and output the detection results.

[0115] Specifically, for a given source code file, it is first converted into vector form using the code embedding model in step S3. Then, the vector is input into the vulnerability detection model trained in step S5 for discrimination. The model outputs either 0 or 1, where 0 represents no vulnerability and 1 represents a vulnerability.

[0116] In summary, this invention first processes the dataset to construct a corpus for training a code embedding model, then trains a doc2vec-based embedding model, and finally uses the embedding model to convert the source code into embedding vectors to train a vulnerability detection model. Because the embedding model used in this invention is based on doc2vec and does not depend on specific labels or datasets, it has broad applicability and can address the problem of insufficient source code files, few vulnerabilities, and difficulty in detection in some real-world projects. Furthermore, the vulnerability detection model, based on regularization loss and quadruple loss, can effectively detect subtle differences between vulnerabilities and exhibits good robustness. In practical use, it reduces the requirements for the dataset and meets the needs of the application.

[0117] The above embodiments are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to the above embodiments. Any changes, modifications, substitutions, combinations, or simplifications made without departing from the spirit and principle of the present invention shall be considered equivalent substitutions and shall be included within the protection scope of the present invention.

Claims

1. A source code vulnerability detection method based on composite program representation, characterized in that it includes the following steps:
S1. Extract the abstract syntax tree, code flowchart, and program dependency graph from the source code, and extract the paths corresponding to the abstract syntax tree, code flowchart, and program dependency graph, respectively;
S2. Remove extra tags from the extracted paths and construct path sequences; concatenate all path sequences to obtain the composite program representation of the source code; and construct a training corpus based on the composite program representation of the source code. For the AST paths, CFG paths, and PDG paths, bracket markers and the movement directions of nodes are removed, and spaces are used as separators between nodes. The AST paths, CFG paths, and PDG paths are each concatenated in extraction order to obtain the abstract syntax tree path sequence, the code flowchart path sequence, and the program dependency graph path sequence. The abstract syntax tree path sequence, code flowchart path sequence, and program dependency graph path sequence are combined to obtain the composite source code representation, and the corpus is constructed based on the composite program representation of the source code;
S3. Train the doc2vec model using the constructed training corpus to obtain the code embedding model. The composite program representation of the source code in the corpus is used, as a list of words, to construct a TaggedDocument object; a doc2vec model is created using the gensim library, with the context window size and minimum word frequency set, so that the code vector is obtained after training; the doc2vec vocabulary is constructed by taking the list of all TaggedDocument objects as input; the model training parameters are set, the doc2vec model is trained using all TaggedDocument objects and the vocabulary, and the resulting code embedding model is saved;
S4. Obtain the embedding vector of the source code through the code embedding model, construct the dataset for training the vulnerability detection model, and divide the dataset into a training set, a test set, and a validation set;
S5. Based on the multilayer perceptron model, combined with the regularization loss and the quadruple loss function, train a vulnerability detection model using the training set;
S6. Use the vulnerability detection model to detect whether the source code contains vulnerabilities and output the detection results.

2. The source code vulnerability detection method based on composite program representation according to claim 1, characterized in that step S1 includes: constructing a dataset containing different code files, each with a corresponding label; using the code analysis tool Joern to extract intermediate representations of all code files in the dataset, including abstract syntax trees, code flowcharts, and program dependency graphs; and extracting the AST paths, CFG paths, and PDG paths from the abstract syntax trees, code flowcharts, and program dependency graphs, respectively.

3. The source code vulnerability detection method based on composite program representation according to claim 2, characterized in that the AST path represents the hierarchical structure and relationships between different statements and expressions in the code; the CFG path represents the order and dependencies between different control flows in the code; and the PDG path represents the dependencies and data flow between different variables and functions in the code.

4. The source code vulnerability detection method based on composite program representation according to claim 1, characterized in that setting the model training parameters includes: specifying the total number of training samples seen by the model by setting the total_examples parameter to the number of composite program representations in the corpus; and specifying the number of training epochs by setting the epochs parameter to a chosen value.

5. The source code vulnerability detection method based on composite program representation according to claim 1, characterized in that step S4 includes the following steps: the source code in the dataset is converted into the composite program representation of the source code; the composite program representation of the source code is converted into an embedding vector through the code embedding model; the embedding vectors and the labels corresponding to the composite program representations form the dataset for training the vulnerability detection model; the dataset for training the vulnerability detection model is divided into a training set, a test set, and a validation set in a 6:2:2 ratio; and the SMOTE method is applied to the training set to generate synthetic samples, increasing the number of minority class samples and balancing the sample distribution between classes.

6. The source code vulnerability detection method based on composite program representation according to claim 1, characterized in that step S5 includes the following steps: two different probability distributions are obtained using the dropout method, and the regularization loss is calculated by combining the two different probability distributions with the KL divergence; positive and negative samples are obtained from the original input samples, quadruples are constructed from the original input samples, positive samples, and negative samples, and the quadruple loss is calculated; and the final loss function is calculated by combining the regularization loss and the quadruple loss, and the vulnerability detection model is trained.

7. The source code vulnerability detection method based on composite program representation according to claim 6, characterized in that the formula for calculating the regularization loss is as follows: $$L_{R}^{i} = L_{CE}^{i} + \lambda L_{KL}^{i}$$ where $L_{R}^{i}$ denotes the regularization loss, $\lambda$ is the parameter used to control the weight of the loss, $L_{CE}^{i}$ denotes the cross-entropy loss function, and $L_{KL}^{i}$ denotes the KL divergence between the two different probability distributions; the formula for calculating the quadruple loss is as follows: $$L_{quad}^{i} = \max\left(0,\; D(x_i, p_i) - D(x_i, n_{1,i}) + \alpha\right) + \max\left(0,\; D(x_i, p_i) - D(n_{1,i}, n_{2,i}) + \beta\right)$$ where $L_{quad}^{i}$ denotes the quadruple loss, D denotes the distance metric between samples, $i$ denotes the i-th sample, $x_i$, $p_i$, $n_{1,i}$, and $n_{2,i}$ denote the i-th input sample, positive sample, negative sample 1, and negative sample 2, respectively, and $\alpha$ and $\beta$ are preset constants with $\alpha > \beta$; the final loss function is: $$L = \sum_{i} \left( L_{CE}^{i} + \lambda_1 L_{KL}^{i} + \lambda_2 L_{quad}^{i} \right)$$ where $\lambda_1$ is a variable parameter used to control the effect of the KL divergence on the regularization loss, and $\lambda_2$ is a variable parameter used to control the ratio of the quadruple loss to the regularization loss.

8. The source code vulnerability detection method based on composite program representation according to claim 6, characterized in that calculating the final loss function by combining the regularization loss and the quadruple loss and training the vulnerability detection model includes: setting the training parameters epoch and patience, with patience set to 3; and stopping training and saving the optimal model when the model's performance has not improved for three consecutive training epochs, thereby obtaining the vulnerability detection model.