A multi-sample comparison and fusion multi-type vulnerability detection method and system
By symbolizing and vectorizing the samples, a proprietary word vector model is constructed and a deep learning model is trained, which solves the problems of low accuracy and insufficient variety in vulnerability detection in existing technologies, and achieves efficient detection of multiple types of vulnerabilities.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING CITY UNIVERSITY
- Filing Date
- 2022-03-01
- Publication Date
- 2026-06-23
AI Technical Summary
Existing deep learning-based vulnerability detection technologies suffer from low accuracy, complex model architecture, and a limited range of vulnerability types that can be detected.
By processing the original dataset, datasets of various vulnerability types are generated. The samples are symbolized and a proprietary word vector model is constructed to generate sample vectors. A fusion matrix for samples of the same type and a comparison matrix for samples of different types are constructed respectively. These vectors are used to train a deep learning model to extract rich and diverse features.
It reduces the complexity of the model, improves the accuracy and variety of vulnerability identification, and can detect up to 180 types of CWE vulnerabilities.
Figure CN114756861B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of network information security technology, and in particular to a method and system for detecting multiple types of vulnerabilities through multi-sample comparison and fusion.
Background Technology
[0002] Vulnerabilities in current software are a major cause of software security incidents. Therefore, identifying vulnerabilities through static and dynamic detection techniques at each stage of software development is crucial. However, because static and dynamic detection techniques heavily rely on the experience of experts and senior software developers, vulnerability discovery is time-consuming and lacks precision.
[0003] Currently, with the continuous application of deep learning technology in areas such as malicious code detection and spam filtering, the academic and industrial communities are also constantly trying to apply deep learning technology to the discovery and analysis of software vulnerabilities. The aim is to achieve automated discovery and analysis of software vulnerabilities and significantly improve the efficiency and accuracy of vulnerability discovery.
[0004] The idea behind deep learning-based source code vulnerability detection originates from natural language processing (NLP) technology. In NLP, natural language can be treated as a temporal sequence of strings, and a neural network model can be trained to capture features within this sequence for applications such as language recognition and machine translation. Similarly, programming languages are also temporal languages, executed sequentially according to a set time order. However, compared to other languages (such as English), programming languages are markup languages, or "hard languages," whose syntax uniquely interprets the meaning of the code, allowing computers to analyze and execute the code based on defined rules. In contrast, the meaning and form of natural languages like English can vary flexibly; for example, articles with the same meaning can have different expressions, or there may be ambiguity. From the above analysis, it's clear that learning features in programming languages using deep learning is simpler than learning features in natural languages because the meaning of words in programming languages is uniquely determined and unambiguous.
[0005] Currently, many researchers have used deep learning technology to extract features of defective parts in source code programs and train corresponding neural network models for software vulnerability detection. For example, Russell et al. [Russell R, Kim L, Hamilton L, et al. Automated vulnerability detection in source code using deep representation learning[C]//2018 17th IEEE international conference on machine learning and applications (ICMLA). IEEE, 2018: 757-762] proposed a function-level defect detection method based on deep learning for C/C++ open source software code. This method directly uses the function body as the basic unit to identify whether there are corresponding defects in the function. However, this method is not suitable for detecting defects in code that contains cross-function data dependencies. Zhou et al. [Zhou Y, Liu S, Siow J, et al. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks[J]. arXiv preprint arXiv:1909.03496, 2019] implemented Devign, a source code vulnerability detection system based on graph neural networks. This system is also a binary classification system: given a piece of source code, it determines whether there is a vulnerability, but it cannot identify the specific vulnerability type. Duan et al. [Duan X, Wu J, Ji S, et al. VulSniper: Focus Your Attention to Shoot Fine-Grained Vulnerabilities[C]//IJCAI. 2019: 4665-4671] implemented VulSniper, a source code defect detection method based on an attention mechanism. It can detect multiple types of source code defects, but it covers only a limited number of source code vulnerability types, namely CWE119 and CWE399. Xu et al. [Xu D, Jingzheng WU, Tianyue LUO. Vulnerability mining method based on code property graph and attention BiLSTM[J]. Journal of Software, 2020, 31(11): 3404-3420] implemented a vulnerability detection method based on code property graphs and deep learning technology. However, the method still performs vulnerability detection function by function or program by program; it remains a binary classification method and cannot detect multiple vulnerability types. In addition, Zou et al. [Zou, Deqing, et al. "μVulDeePecker: A Deep Learning-Based System for Multiclass Vulnerability Detection." IEEE Transactions on Dependable and Secure Computing 18.5 (2019): 2224-2236] implemented a multi-vulnerability type detection system based on code slicing technology, which can detect 40 vulnerability types, but the detected vulnerability types do not cover most of the CWE vulnerability types.
[0006] Therefore, current deep learning-based vulnerability detection technologies suffer from shortcomings such as low accuracy, complex deep learning model architecture, and a limited number of vulnerability types that can be detected.
Summary of the Invention
[0007] To address the aforementioned problems, the present invention aims to provide a method and system for detecting various types of vulnerabilities through multi-sample comparison and fusion. The core idea is to find samples of the same type and samples of different types for each sample in the deep learning training set. During model training, the model learns a large number of more subtle features from samples of the same type, as well as very obvious differences between samples of different types, and then fuses and compares these features. In this way, the deep learning model can extract richer and more diverse features during training, thereby reducing model complexity and improving the accuracy and variety of vulnerabilities identified.
[0008] To address the aforementioned technical problems, embodiments of the present invention provide the following solutions:
[0009] On the one hand, a multi-sample comparison and fusion method for detecting various types of vulnerabilities is provided, including the following steps:
[0010] S1. Process the original dataset to generate datasets with various vulnerability types;
[0011] S2. Perform a single-code symbolization operation on the samples in the dataset to obtain symbolized samples;
[0012] S3. Construct a proprietary word vector model for a specific programming language, and perform vectorization operations on the symbolic samples to obtain sample vectors;
[0013] S4. Construct a fusion matrix for samples of the same type and a comparison matrix for samples of different types for the sample vectors respectively, and add the fusion matrix for samples of the same type and the comparison matrix for samples of different types and take the average value respectively to obtain the fusion vector for samples of the same type and the comparison vector for samples of different types for each sample in the dataset;
[0014] S5. The deep learning model is trained using the fusion vector of samples of the same type and the comparison vector of samples of different types, and the trained deep learning model is used for vulnerability detection.
[0015] Preferably, step S3 specifically includes:
[0016] S31. Perform word statistics on the symbolized samples in the dataset that have undergone symbolization operations to obtain a corpus;
[0017] S32. Use the CBOW method in word2vec to train a proprietary word vector model;
[0018] S33. Use the trained proprietary word vector model to vectorize each symbolic sample in the dataset to generate a sample vector.
[0019] Preferably, step S33 specifically includes:
[0020] S331. Read a symbolic sample from the dataset;
[0021] S332. Use the word segmentation tool NLTK to divide the symbolized sample into several words;
[0022] S333. For each word, perform vectorization operation using a pre-trained proprietary word vector model to generate word vectors with fixed dimensions.
[0023] S334. Add up the word vectors corresponding to each word and take the average to obtain the sample vector corresponding to the entire symbolic sample.
[0024] S335. Repeat steps S331-S334 to obtain the sample vectors corresponding to each symbolized sample in the dataset.
[0025] Preferably, step S4 specifically includes:
[0026] S41. Statistically analyze the number of samples for each vulnerability type in the dataset;
[0027] S42. Based on the results of step S41, find samples with the same vulnerability type and samples with different vulnerability types for each sample in the dataset.
[0028] S43. Based on the set of samples of the same type and the set of samples of different types returned in step S42, construct the fusion matrix of samples of the same type and the comparison matrix of samples of different types for the target sample;
[0029] S44. Add the vectors corresponding to each sample of the same type in the same type fusion matrix and take the average value to obtain the final same type sample fusion vector; add the vectors corresponding to each sample of different types in the different type sample comparison matrix and take the average value to obtain the final different type sample comparison vector.
[0030] Preferably, in step S5, the deep learning model includes an input layer module, a feature extraction layer module, and a classification layer module.
[0031] The input layer module consists of a fully connected network and an activation function. The feature extraction layer module consists of multiple fully connected networks; between the fully connected networks, an activation function is applied and a layer randomly discards neurons in the network. The classification layer module consists of a fully connected layer that outputs a score for each sample; a function is then applied to the scores to output the predicted probability value for each vulnerability category. The category with the highest probability value is the final classification result.
[0032] On the other hand, a multi-sample comparison and fusion system for detecting multiple types of vulnerabilities is provided, including:
[0033] The multi-type vulnerability dataset processing unit is used to process the raw dataset and generate datasets of various vulnerability types.
[0034] A single code symbolization unit is used to perform a single code symbolization operation on samples in the dataset to obtain symbolized samples;
[0035] The proprietary word vector model vectorization unit is used to construct a proprietary word vector model for a specific programming language, and to perform vectorization operations on the symbolic samples to obtain sample vectors.
[0036] The sample comparison and fusion matrix unit is used to construct a fusion matrix of samples of the same type and a comparison matrix of samples of different types for the sample vectors, and to add the fusion matrix of samples of the same type and the comparison matrix of samples of different types and take the average value to obtain the fusion vector of samples of the same type and the comparison vector of samples of different types for each sample in the dataset.
[0037] The deep learning model unit is used to train the deep learning model using the fusion vector of samples of the same type and the comparison vector of samples of different types, and to use the trained deep learning model to perform vulnerability detection.
[0038] Preferably, the proprietary word vector model vectorization unit is specifically used to perform the following steps:
[0039] Word statistics are performed on the symbolic samples in the dataset that have undergone symbolization operations to obtain a corpus;
[0040] Train a proprietary word vector model using the CBOW method in word2vec;
[0041] The trained proprietary word vector model is used to vectorize each symbolic sample in the dataset to generate a sample vector.
[0042] Preferably, the step of vectorizing each symbolic sample in the dataset using a trained proprietary word vector model specifically includes:
[0043] Read a symbolic sample from the dataset;
[0044] Use the word segmentation tool NLTK to divide the symbolic sample into several words;
[0045] For each word, a pre-trained proprietary word vector model is used to perform vectorization operations to generate word vectors with fixed dimensions;
[0046] The word vectors corresponding to each word are added together and averaged to obtain the sample vector corresponding to the entire symbolic sample.
[0047] Repeat the above steps to obtain the sample vectors corresponding to each symbolized sample in the dataset.
[0048] Preferably, the sample comparison and fusion matrix unit is specifically used to perform the following steps:
[0049] Count the number of samples of each vulnerability type in the dataset;
[0050] Based on the results of the previous step, find samples with the same vulnerability type and samples with different vulnerability types for each sample in the dataset;
[0051] Based on the set of samples of the same type and the set of samples of different types returned in the previous step, construct the fusion matrix of samples of the same type and the comparison matrix of samples of different types for the target sample;
[0052] The vectors corresponding to each sample of the same type in the same type fusion matrix are added together and the average value is taken to obtain the final fusion vector of the same type of samples; the vectors corresponding to each sample of different types in the comparison matrix of different types of samples are added together and the average value is taken to obtain the final comparison vector of different types of samples.
[0053] Preferably, the deep learning model includes an input layer module, a feature extraction layer module, and a classification layer module;
[0054] The input layer module consists of a fully connected network and an activation function. The feature extraction layer module consists of multiple fully connected networks; between the fully connected networks, an activation function is applied and a layer randomly discards neurons in the network. The classification layer module consists of a fully connected layer that outputs a score for each sample; a function is then applied to the scores to output the predicted probability value for each vulnerability category. The category with the highest probability value is the final classification result.
[0055] The beneficial effects of the technical solutions provided in the embodiments of the present invention include at least the following:
[0056] (1) The method of the present invention can generate proprietary word vector models for specific programming languages, which can greatly improve the richness and completeness of features after the vectorization of program code.
[0057] (2) The method of the present invention can generate a fusion vector of the same type of samples and a comparison vector of different types of samples for each sample in the dataset. In this way, the deep learning model can easily extract very obvious distinguishable features when learning, thereby greatly improving the model's ability and accuracy to identify various vulnerabilities.
[0058] (3) The method of the present invention does not require a complex neural network model. It can achieve a high accuracy rate of vulnerability identification using only a simple neural network model.
Attached Figure Description
[0059] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0060] Figure 1 is a flowchart of a multi-sample comparison and fusion method for detecting multiple types of vulnerabilities provided in an embodiment of the present invention;
[0061] Figure 2 is a schematic diagram of the structure of the deep learning model provided in an embodiment of the present invention;
[0062] Figure 3 is a schematic diagram of the structure of a multi-sample comparison and fusion system for detecting multiple types of vulnerabilities provided in an embodiment of the present invention.
Detailed Implementation
[0063] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0064] Embodiments of the present invention provide a method for detecting multiple types of vulnerabilities through multi-sample comparison and fusion. As shown in Figure 1, the method includes the following steps:
[0065] S1. Process the original dataset to generate datasets with various vulnerability types;
[0066] S2. Perform a single-code symbolization operation on the samples in the dataset to obtain symbolized samples;
[0067] S3. Construct a proprietary word vector model for a specific programming language, and perform vectorization operations on the symbolic samples to obtain sample vectors;
[0068] S4. Construct a fusion matrix for samples of the same type and a comparison matrix for samples of different types for the sample vectors respectively, and add the fusion matrix for samples of the same type and the comparison matrix for samples of different types and take the average value respectively to obtain the fusion vector for samples of the same type and the comparison vector for samples of different types for each sample in the dataset;
[0069] S5. The deep learning model is trained using the fusion vector of samples of the same type and the comparison vector of samples of different types, and the trained deep learning model is used for vulnerability detection.
[0070] In this embodiment of the invention, the samples of each vulnerability type in the dataset are first counted. Then, for each sample, samples of the same type and samples of different types are selected. By extracting the fusion features of samples of the same type and the comparative features of samples of different types, vulnerability features are fused and compared for learning. Finally, a deep learning model performs classification learning and outputs the identified vulnerability type. In this way, the deep learning model learns not only the subtle features shared by samples of the same type, but also the clearly distinguishable features between samples of different types. This reduces the complexity of the deep learning model architecture and significantly improves vulnerability identification capability and accuracy. The method can also detect up to 180 types of CWE vulnerabilities.
[0071] Specifically, in step S1, the vulnerability dataset used comes from the NIST Software Assurance Reference Dataset (SARD) project. This dataset provides researchers with some known software security vulnerabilities, which can be used to train source code vulnerability detection models. The source code in SARD mainly includes three categories: (1) "Fix" type, indicating that the vulnerability in the source code has been fixed and no longer contains vulnerabilities; (2) "Flaw" type, indicating that the source code contains vulnerabilities; (3) "Mixed" type, indicating that the relevant source code not only contains vulnerabilities but also contains corresponding patches. To date, the SARD dataset contains 251,336 test cases, including 96,494 C code examples, 34,133 C++ code examples, 46,438 Java code examples, 42,253 PHP code examples, and 32,018 C# code examples. This invention selects the C code examples as the instance dataset to illustrate the specific implementation process of this invention.
[0072] Once the instance dataset is prepared, the sample labeling process can begin. The instance dataset of this invention consists of 96,494 C program codes, with three types: "Fix", "Flaw", and "Mixed". The code in the "Fix" type is the fixed code, which has no vulnerabilities. The code in the "Flaw" type has vulnerabilities. The code in the "Mixed" type not only has vulnerabilities but also has corresponding fixed code. Based on these characteristics, this invention creates instance samples in the following way: (1) The code in the "Fix" type has no vulnerabilities and is directly added to the sample set; (2) The code in the "Flaw" type has a single nature and no corresponding fix, and each vulnerable program has a definite CWE type description, so it is also directly added to the sample set of this invention, and its CWE type description can be used as the vulnerability type label of the sample; (3) The code in the "Mixed" type not only has vulnerable code but also has a corresponding fix. Based on these characteristics, this invention first extracts the vulnerable code and adds it to the sample set (the corresponding CWE type description is used as the sample label); then, it extracts the fix corresponding to each vulnerable program and adds it to the sample set.
[0073] It's important to note that the number of corresponding fixes for vulnerable code in "Mixed" type code is not limited to one. Sometimes, a single vulnerable code has multiple corresponding fixes, and the number varies, ranging from a minimum of one to a maximum of 12. After processing using the three methods described above, this invention obtained a sample dataset of 280,894, including 95,926 pieces of code containing vulnerabilities of 180 different vulnerability types, and 184,968 pieces of code that do not contain vulnerabilities. Thus, the sample dataset required by this invention has been created.
[0074] In step S2, by symbolizing each sample in the dataset, the interference of user-defined variable names and function names on the deep learning model can be reduced. Specifically, all statements in the program are placed on a single line, treated as a complete statement. This preserves the program's complete syntactic structure, semantics, and temporal information between statements. During symbolization, the keywords, standard function libraries, API function libraries, and commonly used header files of the example program are unique to the example program and reflect its rich semantic information; therefore, this invention also preserves these elements during symbolization.
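The symbolization described above can be illustrated with a minimal Python sketch. It is not the patent's implementation: the placeholder names `FUN1`, `VAR1`, ... and the small `PRESERVED` set are assumptions made for illustration, and the sketch does not handle string literals, comments, or preprocessor lines.

```python
import re

# Assumption: a tiny illustrative subset of names to keep unchanged. A real
# list would cover all C keywords, standard/API functions and common headers.
PRESERVED = {
    "int", "char", "void", "if", "else", "for", "while", "return", "sizeof",
    "printf", "strcpy", "memcpy", "malloc", "free", "main",
}

IDENTIFIER = re.compile(r"\b[A-Za-z_]\w*")


def symbolize(source):
    """Map user-defined names to FUNn/VARn and flatten the code to one line."""
    mapping = {}
    counters = {"FUN": 0, "VAR": 0}

    def rename(match):
        name = match.group(0)
        if name in PRESERVED:
            return name
        if name not in mapping:
            # Heuristic: an identifier followed by "(" is a function name.
            rest = match.string[match.end():].lstrip()
            kind = "FUN" if rest.startswith("(") else "VAR"
            counters[kind] += 1
            mapping[name] = f"{kind}{counters[kind]}"
        return mapping[name]

    renamed = IDENTIFIER.sub(rename, source)
    # Collapse all whitespace so the whole program sits on a single line.
    return " ".join(renamed.split())


code = "void copy_data(char *dst, char *src) {\n    char buf[10];\n    strcpy(buf, src);\n}"
symbolized = symbolize(code)
```

Here the user-defined names `copy_data`, `dst`, `src` and `buf` are replaced, while the keyword `char` and the library function `strcpy` are kept, which mirrors the preservation rule stated in the paragraph.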
[0075] In step S3, the reason for creating a new proprietary word vector model in this invention is that there is currently no word vector model specifically designed for a particular instance language. Most existing word2vec models are word vector models for natural language and are not suitable for existing programming languages. Programming languages have their own syntax, semantics, and temporal features, which are the biggest differences from natural languages. By creating a proprietary word vector model specifically for programming languages, deep learning models can better understand the essential characteristics and rich inherent semantics of programming languages, and learn more distinctive vulnerability features.
[0076] The specific steps for constructing a language-specific word vector model in step S3 are as follows:
[0077] S31. Perform word statistics on the symbolized samples in the dataset that have undergone symbolization operations to obtain a corpus;
[0078] S32. Use the CBOW method in word2vec to train a proprietary word vector model;
[0079] S33. Use the trained proprietary word vector model to vectorize each symbolic sample in the dataset to generate a sample vector.
[0080] Specifically, step S33 includes:
[0081] S331. Read a symbolic sample from the dataset;
[0082] S332. Use the word segmentation tool NLTK to divide the symbolized sample into several words;
[0083] S333. For each word, perform vectorization operation using a pre-trained proprietary word vector model to generate word vectors with fixed dimensions.
[0084] S334. Add up the word vectors corresponding to each word and take the average to obtain the sample vector corresponding to the entire symbolic sample.
[0085] S335. Repeat steps S331-S334 to obtain the sample vectors corresponding to each symbolized sample in the dataset.
[0086] Specifically, word statistics were first performed on the 96,494 symbolized C programs in the dataset, resulting in a corpus with a total of 97,425,010 words and a vocabulary of 24,610 distinct words. Then, a proprietary word vector model was trained using the CBOW method in word2vec. Finally, the trained proprietary word vector model was used to vectorize each sample in the dataset, generating a fully vectorized dataset.
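Steps S332 to S334 (tokenize, look up each word vector, average) can be sketched as follows. This is an illustration under stated assumptions: a plain whitespace split stands in for the NLTK tokenizer, the dictionary `toy_vectors` is a hypothetical stand-in for the trained CBOW model, and skipping words that are missing from the model is an assumption the text does not address.

```python
def tokenize(symbolized):
    # Stand-in for the NLTK tokenizer named in S332: plain whitespace split.
    return symbolized.split()


def sample_vector(symbolized, word_vectors, dim):
    """Average the word vectors of one symbolized sample (S332 to S334)."""
    total = [0.0] * dim
    count = 0
    for word in tokenize(symbolized):
        vec = word_vectors.get(word)
        if vec is None:  # assumption: words unknown to the model are skipped
            continue
        for i, value in enumerate(vec):
            total[i] += value
        count += 1
    if count == 0:
        return total  # no known words: return the zero vector
    return [value / count for value in total]


# Hypothetical 2-dimensional vectors; a trained model would supply real ones.
toy_vectors = {"char": [1.0, 0.0], "VAR1": [0.0, 1.0], ";": [1.0, 1.0]}
vector = sample_vector("char VAR1 ;", toy_vectors, dim=2)
```

Averaging gives every sample a vector of the same fixed dimension regardless of program length, which is what allows the later steps to stack sample vectors into matrices.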
[0087] In step S4, firstly, a fusion matrix for samples of the same type and a comparison matrix for samples of different types are constructed for each sample in the dataset. Then, a fusion vector for samples of the same type and a comparison vector for samples of different types are generated. The specific steps are as follows:
[0088] S41. Count the number of samples of each vulnerability type in the data set, that is, determine which vulnerability type each sample belongs to, as shown in Algorithm 1 in Table 1. In Algorithm 1, the input is the set of labels corresponding to each sample in the dataset, and the total number of vulnerability types in the dataset. The output represents the number of samples of each vulnerability type.
[0089] Table 1
[0090]
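A minimal Python sketch of the counting in step S41 is given here for illustration; it is not a reproduction of Algorithm 1. The function name `group_by_type` and the label strings are hypothetical. The sketch returns both the per-type counts and the sample indices of each type, since the later selection step needs to know which samples belong to each type.

```python
from collections import defaultdict


def group_by_type(labels):
    """Return ({type: [sample indices]}, {type: sample count})."""
    members = defaultdict(list)
    for index, label in enumerate(labels):
        members[label].append(index)
    counts = {label: len(indices) for label, indices in members.items()}
    return dict(members), counts


# Hypothetical labels: one entry per sample in the dataset.
labels = ["CWE-121", "CWE-190", "CWE-121", "none", "CWE-121"]
members, counts = group_by_type(labels)
```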
[0091] S42. Based on the results of step S41, find samples with the same vulnerability type and samples with different vulnerability types for each sample in the dataset, as shown in Algorithm 2 in Table 2.
[0092] Table 2
[0093]
[0094] In Algorithm 2, the inputs are the dataset, the set of sample labels in the dataset, the result returned by S41 (which samples belong to each vulnerability type), and the number of same-type samples and the number of different-type samples to be selected for each sample; these two numbers can be the same or different. Algorithm 2 first selects a sample from the dataset; this is referred to as the target sample in this invention. The target sample is then removed from the candidates, because the target sample itself cannot be used as a sample for comparison and learning. Next, the number of samples of the same type as the target sample is obtained. If this number is greater than the number to be selected, that many samples are selected at random as the same-type samples of the target sample. Otherwise, all of the samples actually available are selected as the same-type samples. Finally, the same steps are used to select samples of types different from that of the target sample. The algorithm outputs the selected set of same-type samples and the selected set of different-type samples.
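The selection logic described for step S42 can be sketched in Python as follows. This is an illustrative reading of the prose, not a copy of Algorithm 2: the names `select_samples`, `k_same` and `k_diff` are hypothetical, and samples are identified by their index in the label list.

```python
import random


def select_samples(target, labels, members, k_same, k_diff, rng=random):
    """Pick up to k_same same-type and k_diff different-type samples."""
    target_type = labels[target]
    # The target itself is excluded from the same-type candidates.
    same_pool = [i for i in members[target_type] if i != target]
    diff_pool = [
        i
        for label, indices in members.items()
        if label != target_type
        for i in indices
    ]
    # Sample at random only when more candidates exist than are needed;
    # otherwise take every candidate that is actually available.
    same = rng.sample(same_pool, k_same) if len(same_pool) > k_same else list(same_pool)
    diff = rng.sample(diff_pool, k_diff) if len(diff_pool) > k_diff else list(diff_pool)
    return same, diff


# Hypothetical toy data: five samples with three vulnerability types.
labels = ["A", "A", "A", "B", "C"]
members = {"A": [0, 1, 2], "B": [3], "C": [4]}
same, diff = select_samples(0, labels, members, k_same=1, k_diff=5, rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps the selection reproducible between runs, which is a convenience of this sketch rather than something the text requires.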
[0095] S43. Based on the set of samples of the same type and the set of samples of different types returned in step S42, construct the fusion matrix of samples of the same type and the comparison matrix of samples of different types of the target sample, as shown in Algorithm 3 in Table 3.
[0096] Table 3
[0097]
[0098] S44. Add the vectors corresponding to each sample of the same type in the same type fusion matrix and take the average value to obtain the final same type sample fusion vector; add the vectors corresponding to each sample of different types in the different type sample comparison matrix and take the average value to obtain the final different type sample comparison vector.
[0099] In Algorithm 3, the inputs are the dataset, the dimension of the sample vectors, and the set of same-type samples or the set of different-type samples returned in S42. Algorithm 3 uses the set of samples of the same type as the target sample to create the fusion matrix of same-type samples. The vectors corresponding to each same-type sample in the matrix are then summed and averaged to obtain the final fusion vector of same-type samples. The set of different-type samples is processed in the same way to obtain the comparison vector of different-type samples. Repeating these steps creates a fusion vector of same-type samples and a comparison vector of different-type samples for each sample in the dataset.
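Steps S43 and S44 amount to stacking the selected sample vectors into a matrix and taking the column-wise mean. The Python sketch below illustrates this under stated assumptions: `mean_vector` is a hypothetical name, the vectors are toy values, and returning a zero vector for an empty selection is an assumption the text does not cover.

```python
def mean_vector(sample_ids, vectors, dim):
    """Stack the selected sample vectors and average them column by column."""
    matrix = [vectors[i] for i in sample_ids]  # one row per selected sample
    if not matrix:
        return [0.0] * dim  # assumption: empty selection gives a zero vector
    return [sum(row[j] for row in matrix) / len(matrix) for j in range(dim)]


# Hypothetical 2-dimensional sample vectors keyed by sample index.
vectors = {1: [1.0, 2.0], 2: [3.0, 4.0], 3: [0.0, 8.0]}
fusion_vector = mean_vector([1, 2], vectors, dim=2)    # same-type samples
comparison_vector = mean_vector([3], vectors, dim=2)   # different-type samples
```

The same function serves both cases because the fusion vector and the comparison vector differ only in which set of samples is averaged.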
[0100] In step S5, as shown in Figure 2, the deep learning model includes an input layer module, a feature extraction layer module, and a classification layer module. Because this invention uses fused features from samples of the same type and comparative features from samples of different types for learning, even a simple deep learning model can achieve high vulnerability identification accuracy. Therefore, this embodiment employs a simple deep learning architecture.
[0101] The input layer module consists of a fully connected network and an activation function. The feature extraction layer module consists of multiple fully connected networks; between the fully connected networks, an activation function is applied and a layer randomly discards neurons in the network. The classification layer module consists of a fully connected layer that outputs a score for each sample; a function is then applied to the scores to output the predicted probability value for each vulnerability category, and the category with the highest probability value is the final classification result. In this embodiment of the invention, up to 180 vulnerability categories can be identified and predicted.
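The final classification step (scores to probabilities to predicted category) can be sketched as below. The name of the probability function is not legible in the text, so the use of softmax here is an assumption; it is the usual choice for multi-class classifiers trained with cross-entropy loss. The function names and the scores are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores):
    """Return the index of the most probable category and all probabilities."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Hypothetical scores for three vulnerability categories.
best, probs = predict([1.0, 3.0, 2.0])
```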
[0102] The loss function used in the model is the cross-entropy loss function, as shown in formula (1).
[0103]
[0104] In addition, the model also uses the cosine similarity formula to calculate the similarity loss for the fusion vectors of same-type samples and the similarity loss for the comparison vectors of different-type samples, as shown in formulas (2) and (3).
[0105]
[0106]
[0107] Formula (1) involves three quantities: the number of vulnerability categories; an indicator (sign) function that takes the value 1 if the true category of a sample equals the given category and 0 otherwise; and the predicted probability that the sample belongs to that category. Formulas (2) and (3) give, respectively, the loss value calculated from the fusion similarity between the target sample and samples of the same type, and the loss value calculated from the comparison similarity between the target sample and samples of different types. The three vector symbols in these formulas denote the components of the target sample vector, the same-type sample vector, and the different-type sample vector, respectively. Finally, these three loss values are added together to obtain the final model loss value, as shown in formula (4).
[0108]
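For illustration only, the combination of the three loss terms can be sketched as follows. The cross-entropy and cosine similarity computations are the standard ones. The exact way in which formulas (2) and (3) turn a cosine similarity into a loss value is not recoverable from this text, so the forms used here (one minus the similarity for samples of the same type, and the similarity clipped at zero for samples of different types) are assumptions, as are all function names.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], true_class: int) -> float:
    """Cross-entropy for one sample with a one-hot true label.

    With an indicator of 1 for the true category and 0 elsewhere, the sum
    over categories reduces to minus the log of the true-class probability.
    """
    return -math.log(max(probs[true_class], 1e-12))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Assumption: a zero vector is treated as dissimilar.
    return dot / (norm_a * norm_b)


def same_type_loss(target: Sequence[float], fusion: Sequence[float]) -> float:
    # Assumption: the loss falls as the target aligns with the fusion vector.
    return 1.0 - cosine_similarity(target, fusion)


def diff_type_loss(target: Sequence[float], comparison: Sequence[float]) -> float:
    # Assumption: only positive similarity to other types is penalized.
    return max(0.0, cosine_similarity(target, comparison))


def total_loss(probs, true_class, target, fusion, comparison) -> float:
    """Sum of the three loss values, as described for formula (4)."""
    return (cross_entropy(probs, true_class)
            + same_type_loss(target, fusion)
            + diff_type_loss(target, comparison))


# Hypothetical example: target aligned with its own type, orthogonal to others.
loss = total_loss([0.5, 0.5], 0, [1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
print(loss)
```

In this example the two similarity terms are both zero, so the total equals the cross-entropy term alone.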
[0109] Accordingly, embodiments of the present invention also provide a multi-sample comparison and fusion multi-type vulnerability detection system. As shown in Figure 3, the system includes:
[0110] A multi-type vulnerability dataset processing unit, used to process the original dataset and generate datasets of various vulnerability types;
[0111] A single code symbolization unit, used to perform a single code symbolization operation on the samples in the dataset to obtain symbolized samples;
[0112] A proprietary word vector model vectorization unit, used to construct a proprietary word vector model for a specific programming language and to perform vectorization operations on the symbolized samples to obtain sample vectors;
[0113] A sample comparison and fusion matrix unit, used to construct, for the sample vectors, a fusion matrix of samples of the same type and a comparison matrix of samples of different types, and to sum and average each matrix to obtain the fusion vector of samples of the same type and the comparison vector of samples of different types for each sample in the dataset;
[0114] A deep learning model unit, used to train the deep learning model using the fusion vector of samples of the same type and the comparison vector of samples of different types, and to use the trained deep learning model to perform vulnerability detection.
[0115] Furthermore, in the multi-type vulnerability dataset processing unit, the original dataset is taken from the NIST Software Assurance Reference Dataset (SARD) project.
[0116] Furthermore, in the single code symbolization unit, all the statements of the program are placed on one line and treated as a complete statement for symbolization.
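As a minimal sketch of placing all statements of a program on one line, the following function collapses every run of whitespace, including line breaks, into a single space. The function name `to_single_line` is hypothetical. The sketch assumes that comments have already been removed and that whitespace inside string literals need not be preserved; the symbolization rules themselves are not reproduced here.

```python
import re


def to_single_line(source: str) -> str:
    """Join all statements of a program onto one line."""
    # Collapse newlines, tabs, and repeated spaces into single spaces.
    return re.sub(r"\s+", " ", source).strip()


# Hypothetical two-statement C fragment.
program = "int a = 1;\n    char buf[10];\n"
print(to_single_line(program))  # int a = 1; char buf[10];
```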
[0117] Furthermore, the proprietary word vector model vectorization unit is specifically used to perform the following steps:
[0118] Perform word statistics on the symbolized samples in the dataset to obtain a corpus;
[0119] Train a proprietary word vector model using the CBOW method in word2vec;
[0120] Use the trained proprietary word vector model to vectorize each symbolized sample in the dataset to generate a sample vector.
[0121] Furthermore, the step of using the trained proprietary word vector model to vectorize each symbolized sample in the dataset specifically includes:
[0122] Read a symbolized sample from the dataset;
[0123] Divide the symbolized sample into several words using the word segmentation tool NLTK;
[0124] For each word, use the trained proprietary word vector model to perform a vectorization operation and generate a word vector of fixed dimension;
[0125] Add together the word vectors corresponding to each word and take the average to obtain the sample vector corresponding to the entire symbolized sample;
[0126] Repeat the above steps to obtain the sample vectors corresponding to each symbolized sample in the dataset.
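The steps above can be sketched, in a non-limiting way, as follows. To keep the example self-contained, a plain dictionary `word_vectors` stands in for the trained proprietary word vector model and a regular expression stands in for the NLTK word segmentation tool; both substitutions, the function names, the token names `VAR1` and `FUN1`, and the choice to skip words missing from the model are assumptions made for this sketch.

```python
import re
from typing import Dict, List


def tokenize(sample: str) -> List[str]:
    """Stand-in tokenizer: identifiers, integers, or single punctuation marks."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", sample)


def sample_vector(sample: str, word_vectors: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the word vectors of all known words in one symbolized sample."""
    totals = [0.0] * dim
    count = 0
    for word in tokenize(sample):
        vec = word_vectors.get(word)
        if vec is None:
            continue  # Assumption: words absent from the model are skipped.
        for k in range(dim):
            totals[k] += vec[k]
        count += 1
    if count == 0:
        return totals  # Assumption: no known words yields a zero vector.
    return [t / count for t in totals]


# Hypothetical 2-dimensional word vectors for three tokens.
word_vectors = {"VAR1": [1.0, 0.0], "=": [0.0, 1.0], "FUN1": [2.0, 2.0]}
dataset = ["VAR1 = FUN1", "FUN1"]
vectors = [sample_vector(s, word_vectors, dim=2) for s in dataset]
print(vectors)  # [[1.0, 1.0], [2.0, 2.0]]
```

The list comprehension over `dataset` corresponds to repeating the steps for every symbolized sample.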
[0127] Furthermore, the sample comparison and fusion matrix unit is specifically used to perform the following steps:
[0128] Count the number of samples of each vulnerability type in the dataset;
[0129] Based on the results of the previous step, find samples with the same vulnerability type and samples with different vulnerability types for each sample in the dataset;
[0130] Based on the set of samples of the same type and the set of samples of different types returned in the previous step, construct the fusion matrix of samples of the same type and the comparison matrix of samples of different types for the target sample;
[0131] The vectors corresponding to each sample of the same type in the same type fusion matrix are added together and the average value is taken to obtain the final fusion vector of the same type of samples; the vectors corresponding to each sample of different types in the comparison matrix of different types of samples are added together and the average value is taken to obtain the final comparison vector of different types of samples.
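As an illustrative sketch of the first two steps, the counting of samples per vulnerability type and the search for samples of the same and of different types can be written as follows. The function names, the use of list indices to identify samples, the example CWE labels, and the exclusion of the target sample from its own same-type set are assumptions made for this sketch.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def count_by_type(labels: Sequence[str]) -> Counter:
    """Count the number of samples of each vulnerability type."""
    return Counter(labels)


def split_same_and_different(labels: Sequence[str], target: int) -> Tuple[List[int], List[int]]:
    """Return the indices of same-type and different-type samples for one target."""
    target_type = labels[target]
    # Assumption: the target sample is not part of its own same-type set.
    same = [i for i, lab in enumerate(labels) if lab == target_type and i != target]
    diff = [i for i, lab in enumerate(labels) if lab != target_type]
    return same, diff


# Hypothetical labels for four samples.
labels = ["CWE-121", "CWE-121", "CWE-190", "CWE-121"]
print(count_by_type(labels)["CWE-121"])     # 3
print(split_same_and_different(labels, 0))  # ([1, 3], [2])
```

The two index lists returned for each target sample select the rows of the fusion matrix and of the comparison matrix, which are then summed and averaged as described in the final step.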
[0132] Furthermore, the deep learning model includes an input layer module, a feature extraction layer module, and a classification layer module;
[0133] The input layer module consists of a fully connected network and an activation function. The feature extraction layer module consists of multiple fully connected networks, which are connected to one another through an activation function and a layer that randomly discards neurons in the network (dropout). The classification layer module consists of a fully connected layer that outputs a score for each sample; a function then converts the scores into a predicted probability value for each vulnerability category, and the category with the highest probability value is the final classification result.
[0134] The system in this embodiment can be used to execute the technical solution of the method embodiment shown in Figure 1. Its implementation principle and technical effects are similar and will not be described again here.
[0135] Compared with existing technologies, the method and system provided by this invention can generate a proprietary word vector model for a specific programming language, thereby greatly improving the richness and completeness of the features obtained after program code vectorization. They can also generate, for each sample in the dataset, a fusion vector of samples of the same type and a comparison vector of samples of different types. In this way, the deep learning model can more easily extract clearly distinguishable features during learning, thereby greatly improving the model's ability to identify various vulnerabilities and its accuracy in doing so.
[0136] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A method for detecting multiple types of vulnerabilities through multi-sample comparison and fusion, characterized in that it includes the following steps:
S1. Process the original dataset to generate datasets of various vulnerability types;
S2. Perform a single-code symbolization operation on the samples in the dataset to obtain symbolized samples;
S3. Construct a proprietary word vector model for a specific programming language, and perform vectorization operations on the symbolized samples to obtain sample vectors;
S4. Construct, for the sample vectors, a fusion matrix of samples of the same type and a comparison matrix of samples of different types, and sum and average the fusion matrix of samples of the same type and the comparison matrix of samples of different types respectively to obtain the fusion vector of samples of the same type and the comparison vector of samples of different types for each sample in the dataset;
Step S4 specifically includes:
S41. Count the number of samples of each vulnerability type in the dataset;
S42. Based on the results of step S41, find samples with the same vulnerability type and samples with different vulnerability types for each sample in the dataset;
S43. Based on the set of samples of the same type and the set of samples of different types returned in step S42, construct the fusion matrix of samples of the same type and the comparison matrix of samples of different types for the target sample;
S44. Add together the vectors corresponding to each sample of the same type in the fusion matrix of samples of the same type and take the average value to obtain the final fusion vector of samples of the same type; add together the vectors corresponding to each sample of different types in the comparison matrix of samples of different types and take the average value to obtain the final comparison vector of samples of different types;
S5. Train the deep learning model using the fusion vector of samples of the same type and the comparison vector of samples of different types, and use the trained deep learning model for vulnerability detection.
2. The multi-sample comparison and fusion method for detecting multiple types of vulnerabilities according to claim 1, characterized in that step S3 specifically includes:
S31. Perform word statistics on the symbolized samples in the dataset to obtain a corpus;
S32. Use the CBOW method in word2vec to train a proprietary word vector model;
S33. Use the trained proprietary word vector model to vectorize each symbolized sample in the dataset to generate a sample vector.
3. The multi-sample comparison and fusion method for detecting multiple types of vulnerabilities according to claim 2, characterized in that step S33 specifically includes:
S331. Read a symbolized sample from the dataset;
S332. Use the word segmentation tool NLTK to divide the symbolized sample into several words;
S333. For each word, perform a vectorization operation using the trained proprietary word vector model to generate a word vector of fixed dimension;
S334. Add together the word vectors corresponding to each word and take the average to obtain the sample vector corresponding to the entire symbolized sample;
S335. Repeat steps S331-S334 to obtain the sample vectors corresponding to each symbolized sample in the dataset.
4. The multi-sample comparison and fusion method for detecting multiple types of vulnerabilities according to claim 1, characterized in that, in step S5, the deep learning model includes an input layer module, a feature extraction layer module, and a classification layer module. The input layer module consists of a fully connected network and an activation function. The feature extraction layer module consists of multiple fully connected networks, which are connected to one another through an activation function and a layer that randomly discards neurons in the network. The classification layer module consists of a fully connected layer that outputs a score for each sample; a function then converts the scores into a predicted probability value for each vulnerability category, and the category with the highest probability value is the final classification result.
5. A multi-sample comparison and fusion multi-type vulnerability detection system, characterized in that it includes:
A multi-type vulnerability dataset processing unit, used to process the original dataset and generate datasets of various vulnerability types;
A single code symbolization unit, used to perform a single code symbolization operation on the samples in the dataset to obtain symbolized samples;
A proprietary word vector model vectorization unit, used to construct a proprietary word vector model for a specific programming language and to perform vectorization operations on the symbolized samples to obtain sample vectors;
A sample comparison and fusion matrix unit, used to construct, for the sample vectors, a fusion matrix of samples of the same type and a comparison matrix of samples of different types, and to sum and average each matrix to obtain the fusion vector of samples of the same type and the comparison vector of samples of different types for each sample in the dataset;
The sample comparison and fusion matrix unit is specifically used to perform the following steps:
Count the number of samples of each vulnerability type in the dataset;
Based on the results of the previous step, find samples with the same vulnerability type and samples with different vulnerability types for each sample in the dataset;
Based on the set of samples of the same type and the set of samples of different types returned in the previous step, construct the fusion matrix of samples of the same type and the comparison matrix of samples of different types for the target sample;
Add together the vectors corresponding to each sample of the same type in the fusion matrix of samples of the same type and take the average value to obtain the final fusion vector of samples of the same type; add together the vectors corresponding to each sample of different types in the comparison matrix of samples of different types and take the average value to obtain the final comparison vector of samples of different types;
A deep learning model unit, used to train the deep learning model using the fusion vector of samples of the same type and the comparison vector of samples of different types, and to use the trained deep learning model to perform vulnerability detection.
6. The multi-sample comparison and fusion multi-type vulnerability detection system according to claim 5, characterized in that the proprietary word vector model vectorization unit is specifically used to perform the following steps:
Perform word statistics on the symbolized samples in the dataset to obtain a corpus;
Train a proprietary word vector model using the CBOW method in word2vec;
Use the trained proprietary word vector model to vectorize each symbolized sample in the dataset to generate a sample vector.
7. The multi-sample comparison and fusion multi-type vulnerability detection system according to claim 6, characterized in that the step of using the trained proprietary word vector model to vectorize each symbolized sample in the dataset specifically includes:
Read a symbolized sample from the dataset;
Divide the symbolized sample into several words using the word segmentation tool NLTK;
For each word, use the trained proprietary word vector model to perform a vectorization operation and generate a word vector of fixed dimension;
Add together the word vectors corresponding to each word and take the average to obtain the sample vector corresponding to the entire symbolized sample;
Repeat the above steps to obtain the sample vectors corresponding to each symbolized sample in the dataset.
8. The multi-sample comparison and fusion multi-type vulnerability detection system according to claim 5, characterized in that the deep learning model includes an input layer module, a feature extraction layer module, and a classification layer module. The input layer module consists of a fully connected network and an activation function. The feature extraction layer module consists of multiple fully connected networks, which are connected to one another through an activation function and a layer that randomly discards neurons in the network. The classification layer module consists of a fully connected layer that outputs a score for each sample; a function then converts the scores into a predicted probability value for each vulnerability category, and the category with the highest probability value is the final classification result.