Vulnerability detection model training method, script vulnerability detection method and related device

By training a vulnerability detection model and utilizing extreme learning machine and autoencoder algorithms to automatically adjust network parameters, the problems of low coverage and efficiency in existing TCL script vulnerability detection methods are solved. This enables adaptation to dynamic changes and complexity of scripts, thereby improving detection efficiency and accuracy.

CN119397529B · Active · Publication Date: 2026-06-23 · GUANGZHOU ZHONO ELECTRONICS TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: GUANGZHOU ZHONO ELECTRONICS TECH CO LTD
Filing Date: 2024-10-10
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing TCL script vulnerability detection methods rely on static code analysis and rule matching, which have limited coverage and depth, making it difficult to adapt to the dynamic changes and complexity of scripts, and are also inefficient.

Method used

By training a vulnerability detection model and utilizing extreme learning machine and autoencoder algorithms, combined with variability and vulnerability prediction loss, the network parameters are automatically adjusted to adapt to the dynamic changes and complexity of the script.

Benefits of technology

It improves the coverage and efficiency of vulnerability detection, can adapt to the dynamic changes and complexity of scripts, and enhances the generalization ability of detection.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides a vulnerability detection model training method, a script vulnerability detection method and related devices, and relates to the field of machine learning. In the method, an electronic device obtains the vulnerability prediction loss of a sample script set according to the vulnerability prediction information produced for the sample script set by the current first model to be trained; obtains the variability of the sample script set; and updates the current first model to be trained according to the variability and the vulnerability prediction loss, wherein the variability represents the complexity of the sample script set. The above steps are iterated until the iteration stopping condition of the vulnerability detection model is met. In this way, in the training process of the first model to be trained, the parameters of the network are automatically adjusted according to the complexity of the sample script set, so that the trained vulnerability detection model has good generalization ability and can adapt to the dynamic changes and complexity of scripts.

Description

Technical Field

[0001] This application relates to the field of machine learning, and more specifically, to a vulnerability detection model training method, a script vulnerability detection method, and related apparatus.

Background Technology

[0002] Scripting languages are widely popular due to their convenience and efficiency. They greatly simplify the programming process, enabling developers to quickly implement various functions with less code. Furthermore, scripting languages typically possess excellent automation and text processing capabilities. However, current scripting languages still have shortcomings in vulnerability detection, mainly manifested in limited detection coverage and low efficiency.

[0003] Take TCL (Tool Command Language) scripts as an example. These scripts are widely used in digital back-end processing, such as placement, routing, and timing analysis. Their security and correctness directly affect the quality and reliability of the entire chip design. With the increasing complexity of integrated circuit design and the shortening of design cycles, traditional manual inspection methods can no longer meet the demands of rapid development. Therefore, developing an automated and efficient TCL script vulnerability detection method is particularly important.

[0004] However, existing TCL script vulnerability detection methods typically rely on static code analysis and rule matching. These methods not only have limited coverage and depth of vulnerability detection, but also struggle to adapt to the dynamic changes and complexity of scripts. Furthermore, traditional methods are inefficient when processing large-scale script data and have a weak ability to identify novel vulnerabilities.

Summary of the Invention

[0005] To overcome at least one deficiency in the prior art, this application provides a vulnerability detection model training method, a script vulnerability detection method, and related apparatus, specifically including:

[0006] Firstly, this application provides a method for training a vulnerability detection model, the method comprising:

[0007] Based on the vulnerability prediction information of the current first model to be trained on the sample script set, the vulnerability prediction loss for the sample script set is obtained.

[0008] Obtain the variability of the sample script set, and update the current first model to be trained based on the variability and the vulnerability prediction loss, wherein the variability characterizes the complexity of the sample script set.

[0009] Iterate through the above steps until the iteration stopping condition for the vulnerability detection model is met.

[0010] In conjunction with the optional implementation of the first aspect, obtaining the variability of the sample script set includes:

[0011] Calculate the variance of the sample script set;

[0012] The variance is used to obtain the variability of the sample script set.

[0013] In conjunction with the optional implementation of the first aspect, the current first model to be trained is updated based on the variability and the vulnerability prediction loss, including:

[0014] Based on the vulnerability prediction loss, the initial parameter update amount of the first model to be trained is obtained;

[0015] The initial parameter update amount is adjusted using the variability to obtain the target parameter update amount of the first model to be trained.

[0016] The first model to be trained is updated based on the target parameter update amount.

[0017] In conjunction with the optional implementation of the first aspect, the sample script set includes script features of multiple first original scripts, the script features of the multiple first original scripts are obtained by dimensionality reduction of the initial features of the multiple first original scripts using a feature compression model, and the method further includes:

[0018] Based on the reconstructed feature set of the sample feature set by the current second model to be trained, the reconstruction error of the sample feature set is obtained, wherein the sample feature set is processed by the encoder and decoder of the second model to be trained in sequence to obtain the reconstructed feature set;

[0019] The encoded feature set output by the encoder is sparsified to obtain the sparse self-expression error;

[0020] The sparse self-expression error is used as a regularization term to constrain the reconstruction error, and the current second model to be trained is updated.

[0021] Iterate the above steps until the iteration stopping condition is met, and use the resulting second model to be trained as the feature compression model.

[0022] In conjunction with the optional implementation of the first aspect, the encoded feature set output by the encoder is sparsified to obtain a sparse self-expression error, including:

[0023] A sparse self-expression matrix is generated based on the encoded feature set, wherein the encoded feature set includes multiple encoded vectors, and the sparse self-expression matrix carries similarity information among the multiple encoded vectors;

[0024] Calculate the vector distance between each encoded vector and its own sparse vector, wherein the sparse vector of each encoded vector is the vector after the encoded vector is mapped by the sparse self-expression matrix;

[0025] The sum of all the vector distances is taken as the sparse self-expression error.

[0026] In conjunction with the optional implementation of the first aspect, a sparse self-expression matrix is generated based on the encoded feature set, including:

[0027] Calculate the Euclidean distance between the feature matrix formed by the multiple encoded vectors and the transpose of the feature matrix to obtain the initial self-expression matrix;

[0028] The optimized self-expression matrix is obtained by adjusting the initial self-expression matrix using a preset matrix density coefficient.

[0029] The optimized self-expression matrix is normalized to obtain the sparse self-expression matrix.
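The steps in [0023] to [0029] can be illustrated with a minimal Python sketch. The source does not state how distances become similarities, how the density coefficient is applied, or which normalization is used, so the choices below are assumptions made only for illustration: distances are converted with exp(-d), the diagonal is zeroed so a vector is not represented by itself, the density coefficient keeps the top fraction of entries in each row, and rows are normalized to sum to 1. The function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sparse_self_expression_matrix(Z, density=0.5):
    """Build a row-normalized, sparsified similarity matrix over encoded vectors Z.

    ASSUMPTIONS (not from the source): exp(-distance) as similarity, zero diagonal,
    top-k selection per row with k derived from `density`, rows summing to 1.
    Requires at least two encoded vectors.
    """
    n = len(Z)
    # Initial self-expression matrix from pairwise Euclidean distances.
    S = [[0.0 if i == j else math.exp(-euclidean(Z[i], Z[j])) for j in range(n)]
         for i in range(n)]
    keep = max(1, math.ceil(density * (n - 1)))  # entries kept per row
    C = []
    for row in S:
        threshold = sorted(row, reverse=True)[keep - 1]
        kept = [v if v >= threshold and v > 0.0 else 0.0 for v in row]
        total = sum(kept)
        C.append([v / total for v in kept])  # normalization step
    return C

def self_expression_error(Z, C):
    """Sum of distances between each encoded vector and its mapped (sparse) vector."""
    total = 0.0
    for i, z in enumerate(Z):
        mapped = [sum(C[i][j] * Z[j][k] for j in range(len(Z))) for k in range(len(z))]
        total += euclidean(z, mapped)
    return total

Z = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
C = sparse_self_expression_matrix(Z, density=0.3)
error = self_expression_error(Z, C)
```

With a small density coefficient each vector is expressed only by its nearest neighbour, so the error reflects how far apart neighbouring encoded vectors are.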

[0030] Secondly, this application also provides a script vulnerability detection method, the method comprising:

[0031] Obtain the script to be tested;

[0032] The vulnerability detection model trained using the aforementioned vulnerability detection model training method processes the script to be detected to obtain vulnerability prediction information for the script to be detected.

[0033] Thirdly, this application provides a vulnerability detection model training device, the device comprising:

[0034] The model loss module is used to obtain the vulnerability prediction loss for the sample script set based on the vulnerability prediction information of the current first model to be trained on the sample script set.

[0035] The model training module is used to obtain the variability of the sample script set and update the current first model to be trained based on the variability and the vulnerability prediction loss, wherein the variability represents the complexity of the sample script set.

[0036] Iterate through the above steps until the iteration stopping condition for the vulnerability detection model is met.

[0037] In conjunction with the optional implementation of the third aspect, the model training module is further specifically used for:

[0038] Calculate the variance of the sample script set;

[0039] The variance is used to obtain the variability of the sample script set.

[0040] In conjunction with the optional implementation of the third aspect, the model training module is also specifically used for:

[0041] Based on the vulnerability prediction loss, the initial parameter update amount of the first model to be trained is obtained;

[0042] The initial parameter update amount is adjusted using the variability to obtain the target parameter update amount of the first model to be trained.

[0043] The first model to be trained is updated based on the target parameter update amount.

[0044] In conjunction with the optional implementation of the third aspect, the sample script set includes script features of multiple first original scripts, which are obtained by dimensionality reduction of the initial features of the multiple first original scripts using a feature compression model. The model loss module is further used for:

[0045] Based on the reconstructed feature set of the sample feature set by the current second model to be trained, the reconstruction error of the sample feature set is obtained, wherein the sample feature set is processed by the encoder and decoder of the second model to be trained in sequence to obtain the reconstructed feature set;

[0046] The encoded feature set output by the encoder is sparsified to obtain the sparse self-expression error;

[0047] The model training module is also used to update the current second model to be trained by using the sparse self-expression error as a regularization term to constrain the reconstruction error.

[0048] Iterate the above steps until the iteration stopping condition is met, and use the resulting second model to be trained as the feature compression model.

[0049] In conjunction with the optional implementation method of the third aspect, the model loss module is also specifically used for:

[0050] A sparse self-expression matrix is generated based on the encoded feature set, wherein the encoded feature set includes multiple encoded vectors, and the sparse self-expression matrix carries similarity information among the multiple encoded vectors;

[0051] Calculate the vector distance between each encoded vector and its own sparse vector, wherein the sparse vector of each encoded vector is the vector after the encoded vector is mapped by the sparse self-expression matrix;

[0052] The sum of all the vector distances is taken as the sparse self-expression error.

[0053] In conjunction with the optional implementation method of the third aspect, the model loss module is also specifically used for:

[0054] Calculate the Euclidean distance between the feature matrix formed by the multiple encoded vectors and the transpose of the feature matrix to obtain the initial self-expression matrix;

[0055] The optimized self-expression matrix is obtained by adjusting the initial self-expression matrix using a preset matrix density coefficient.

[0056] The optimized self-expression matrix is normalized to obtain the sparse self-expression matrix.

[0057] Fourthly, this application also provides a storage medium storing a computer program, which, when executed by a processor, implements the vulnerability detection model training method or the script vulnerability detection method.

[0058] Fifthly, this application also provides an electronic device, which includes a processor and a memory. The memory stores a computer program, and when the computer program is executed by the processor, it implements the vulnerability detection model training method or the script vulnerability detection method.

[0059] Compared with the prior art, this application has the following beneficial effects:

[0060] This application provides a vulnerability detection model training method, a script vulnerability detection method, and related apparatus. The electronic device obtains a vulnerability prediction loss for a sample script set based on vulnerability prediction information from a current first model to be trained; it acquires the variability of the sample script set and updates the current first model to be trained based on the variability and the vulnerability prediction loss, whereby the variability characterizes the complexity of the sample script set. The above steps are iterated until the iteration stopping condition for the vulnerability detection model is met. Thus, during the training process of the first model to be trained, the network parameters are automatically adjusted according to the complexity of the sample script set, enabling the trained vulnerability detection model to have good generalization ability and adapt to the dynamic changes and complexity of scripts.

Attached Figure Description

[0061] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of this application and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0062] Figure 1 is a flowchart of the vulnerability detection model training method provided in the embodiments of this application;

[0063] Figure 2 is a flowchart of the training method for the feature compression model provided in the embodiments of this application;

[0064] Figure 3 is a schematic diagram of the structure of the vulnerability detection model training device provided in the embodiments of this application;

[0065] Figure 4 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0066] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. The components of the embodiments of this application described and shown in the accompanying drawings can generally be arranged and designed in various different configurations.

[0067] Therefore, the following detailed description of the embodiments of this application provided in the accompanying drawings is not intended to limit the scope of the claimed application, but merely to illustrate selected embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of this application without inventive effort are within the scope of protection of this application.

[0068] It should be noted that similar labels and letters in the following figures indicate similar items. Therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures.

[0069] In the description of this application, it should be noted that the terms "first," "second," "third," etc., are used only for distinguishing descriptions and should not be construed as indicating or implying relative importance. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0070] Based on the above statement, as introduced in the background technology, existing TCL script vulnerability detection methods usually rely on static code analysis and rule matching. These methods not only have limited coverage and depth of vulnerability detection, but also have difficulty adapting to the dynamic changes and complexity of scripts.

[0071] Based on the discovery of the aforementioned technical problems, the inventors, through creative labor, proposed the following technical solutions to solve or improve these problems. It should be noted that the deficiencies in the solutions of the prior art are the result of the inventors' practical experience and careful research. Therefore, the discovery process of the aforementioned problems and the solutions proposed in the embodiments of this application below should be considered contributions made by the inventors to this application during the inventive process, and should not be construed as technical content known to those skilled in the art.

[0072] Therefore, this embodiment provides a vulnerability detection model training method. In this method, an electronic device obtains a vulnerability prediction loss for a sample script set based on the vulnerability prediction information of the current first model to be trained; it acquires the variability of the sample script set and updates the current first model to be trained based on the variability and the vulnerability prediction loss, whereby the variability characterizes the complexity of the sample script set. The above steps are iterated until the iteration stopping condition for the vulnerability detection model is met. Thus, during the training process of the first model to be trained, the network parameters are automatically adjusted according to the complexity of the sample script set, enabling the trained vulnerability detection model to have good generalization ability and adapt to the dynamic changes and complexity of the scripts.

[0073] It should be noted that the electronic device implementing this method can be a desktop computer, server, or similar device capable of providing sufficient computing power. The server can be a single server or a group of servers. The server group can be centralized or distributed (e.g., the servers can be a distributed system). In some embodiments, the server can be local or remote relative to the user terminal. In some embodiments, the server can be implemented on a cloud platform; by way of example only, the cloud platform can include private cloud, public cloud, hybrid cloud, community cloud, distributed cloud, inter-cloud, multi-cloud, or any combination thereof. In some embodiments, the server can be implemented on an electronic device having one or more components.

[0074] Furthermore, the method provided in this embodiment is applicable not only to TCL scripts, but also to scripting languages such as Python, JavaScript, PHP, Java, and Golang, as well as non-scripting languages.

[0075] To make the solution provided in this embodiment clearer, a server is used as the electronic device below, and each step of the method is described in detail in conjunction with Figure 1. However, it should be understood that the operations in the flowchart need not be implemented in sequence, and steps without a logical dependency may be reversed in order or implemented simultaneously. Furthermore, those skilled in the art, guided by the content of this application, may add one or more other operations to the flowchart, or remove one or more operations from the flowchart. As shown in Figure 1, the method includes:

[0076] S1A: Based on the vulnerability prediction information of the current first model to be trained on the sample script set, obtain the vulnerability prediction loss for the sample script set.

[0077] In some implementations of the aforementioned sample script set, the sample script set may include multiple first original scripts, in which script code is recorded in text form. In this case, the first model to be trained directly processes the first original scripts in text form. In other implementations, the sample script set may also include script features extracted from the multiple first original scripts by a preprocessing model. In this case, the first model to be trained no longer processes the first original scripts in text form directly, but instead processes the feature vectors extracted from the original scripts.

[0078] Regardless of the method used, the text-based script code needs to be converted into vector data that the model can accept. This can be achieved using word embedding techniques. Taking the original TCL script as an example, the server maps each original TCL script and its parameters into a vector in a high-dimensional space. These vectors represent the semantic relationships between words; for example, TCL keywords such as "set" and "global" are converted into unique vector representations. Then, a "Bag-of-Words" model is used to count the frequency of each word in the script, thereby generating a fixed-length vector.
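As a minimal sketch of the bag-of-words step described in [0078], the following Python example counts vocabulary words in a script and returns a fixed-length vector. It does not cover the word-embedding step. The vocabulary, the tokenizer, and the function names are hypothetical and chosen only for illustration; a practical vocabulary would presumably be built from the collected scripts.

```python
import re
from collections import Counter

# Hypothetical, tiny vocabulary used only for this illustration.
VOCAB = ["set", "global", "proc", "puts", "if", "foreach"]

def tokenize(script: str) -> list[str]:
    """Split a script into word-like tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", script)

def bag_of_words(script: str, vocab: list[str] = VOCAB) -> list[int]:
    """Count how often each vocabulary word occurs, giving a fixed-length vector."""
    counts = Counter(tokenize(script))
    return [counts[word] for word in vocab]

# Two "set" commands, one "global", one "puts"; the other entries stay at zero.
vector = bag_of_words("global x\nset x 1\nset y 2\nputs $x")
```

The vector length depends only on the vocabulary, not on the script, which is what allows scripts of different sizes to be fed to the same model.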

[0079] Regarding the first raw script mentioned above, continuing with the TCL raw script as an example, it can be collected from various digital backends. These raw scripts involve functions such as script automation processing, configuration management, and environment settings. The collected TCL raw scripts are stored in plain text format. For example, the specific storage format can be JSON, which includes metadata such as the original script's source code, file path, and modification time.

[0080]-[0084]
{
  "script": "ax1settiming_constraints{-ax2-ax3}",
  "path": " / ax4 / project / ax5 / ",
  "last_modified": "2022-01-01T12:00:00"
}

[0085] Then, the collected TCL original scripts were labeled manually. That is, professional software engineers classified and labeled the TCL original scripts according to their functions and potential vulnerability categories. Based on its potential risks and error types, each original script was assigned one of four tags: "no vulnerability", "syntax error", "execution error", or "security risk".
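One possible way to hold a stored record together with its manual label is sketched below in Python. The three JSON keys and the four tags come from the text above; the class names, the `label` field, and the example record values are assumptions introduced only for illustration.

```python
import json
from dataclasses import dataclass
from enum import Enum

class Label(Enum):
    """The four manual tags named in the text."""
    NO_VULNERABILITY = "no vulnerability"
    SYNTAX_ERROR = "syntax error"
    EXECUTION_ERROR = "execution error"
    SECURITY_RISK = "security risk"

@dataclass
class LabeledScript:
    """A stored script record plus its manual label (hypothetical container)."""
    script: str
    path: str
    last_modified: str
    label: Label

def load_labeled(record_json: str, label: Label) -> LabeledScript:
    """Parse one JSON record and attach the label chosen by the annotator."""
    record = json.loads(record_json)
    return LabeledScript(record["script"], record["path"], record["last_modified"], label)

# Example record with made-up values, used only to exercise the loader.
raw = '{"script": "set x 1", "path": "/tmp/example.tcl", "last_modified": "2022-01-01T12:00:00"}'
sample = load_labeled(raw, Label.NO_VULNERABILITY)
```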

[0086] Based on the above examples of sample script sets, it was found in practice that conventional machine learning mechanisms require adjustments to the parameters of each network layer, resulting in long training times.

[0087] To address this, the first model to be trained in this embodiment can use an Extreme Learning Machine (ELM), a single-hidden-layer feedforward network in which each input sample script passes through a hidden layer whose parameters are not trained before reaching the output layer. Therefore, the ELM primarily adjusts the weights of the output layer during training. Its core idea is to randomly set the network's input weights and hidden layer biases, and these parameters remain unchanged once determined. Finally, the output layer weights can be directly calculated using the least squares method to minimize the network's output error.

[0088] Therefore, unlike traditional neural networks, the training process of the Extreme Learning Machine does not involve iteratively adjusting the weights and biases of the hidden layers, but only adjusting the weights of the output layer, which can significantly improve the training speed.
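The conventional ELM procedure described in [0087] and [0088] can be sketched in plain Python as follows. This is a minimal sketch of a standard ELM, not the improved loss of this embodiment. The sigmoid activation, the small ridge term added for numerical stability, the hidden-layer size, and all function names are assumptions made for illustration.

```python
import math
import random

def hidden_output(X, W_in, b):
    """Hidden-layer output matrix H = sigmoid(X @ W_in + b); W_in and b stay fixed."""
    return [[1.0 / (1.0 + math.exp(-(sum(x * w for x, w in zip(row, col)) + bj)))
             for col, bj in zip(zip(*W_in), b)] for row in X]

def solve(A, B):
    """Solve A @ X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

def train_elm(X, T, n_hidden=8, ridge=1e-3, seed=0):
    """Draw random input weights and biases once, then fit only the output weights."""
    rng = random.Random(seed)
    d = len(X[0])
    W_in = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(d)]
    b = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    H = hidden_output(X, W_in, b)
    Ht = list(zip(*H))
    # Least squares via the normal equations (H^T H + ridge * I) W_out = H^T T.
    A = [[sum(a * c for a, c in zip(Ht[i], Ht[j])) + (ridge if i == j else 0.0)
          for j in range(n_hidden)] for i in range(n_hidden)]
    B = [[sum(a * t[k] for a, t in zip(Ht[i], T)) for k in range(len(T[0]))]
         for i in range(n_hidden)]
    return W_in, b, solve(A, B)

def predict(X, W_in, b, W_out):
    """Forward pass: hidden layer followed by the linear output layer."""
    H = hidden_output(X, W_in, b)
    return [[sum(h * w for h, w in zip(row, col)) for col in zip(*W_out)] for row in H]

X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
T = [[0.0], [1.0], [1.0], [0.0]]
model = train_elm(X, T)
preds = predict(X, *model)
```

Because only the output weights are solved for, training reduces to one linear solve instead of repeated backpropagation through the hidden layer.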

[0089] Based on the above embodiments' description of the first model to be trained and the sample script set, this embodiment also provides the following loss function for the first model to be trained:

[0090] As described in the above embodiments, in traditional Extreme Learning Machines (ELMs), the output layer weights are typically obtained by solving a least-squares problem. Similar to traditional ELMs, the output layer weights qW in this embodiment are also obtained by minimizing the loss function qL:

[0091]

[0092] In the formula, qH represents the output matrix of the hidden layer, qt represents the target output matrix, and $qH^{T}$ represents the transpose of qH. In this embodiment, the loss function qL is an improved loss function derived from the Green's function qG(qy,qt), and its expression is as follows:

[0093]

[0094] In the formula, $qy_i$ represents the vulnerability prediction information for the i-th sample script, $qt_i$ represents the true label of the i-th sample script, and γ represents the regularization coefficient, which is used to control the influence of the Green's function term. Thus, the vulnerability prediction loss can be calculated using the loss function described above.

[0095] The Green's function term ∫qG(qy,qt)dq is used to handle the optimization of complex decision boundaries, where qy represents the predicted output matrix. The expression for ∫qG(qy,qt)dq is as follows:

[0096]

[0097] In the formula, $\sigma_{vy}$ represents the scaling parameter based on the data distribution, used to enhance the model's sensitivity to output errors, especially near the error boundary. After integration, the result is expressed as:

[0098]

[0099] Based on the above introduction to the loss function and the vulnerability prediction loss, and referring again to Figure 1, the vulnerability detection model training method provided in this embodiment also includes:

[0100] S2A: Obtain the variability of the sample script set, and update the current first model to be trained based on the variability and the vulnerability prediction loss.

[0101] Here, variability characterizes the complexity of the sample script set. In a specific implementation, the server calculates the variance of the sample script set and obtains the variability from that variance, for example by mapping the variance to a value between 0 and 1 and using that value as the variability. The variance can be calculated using the following expression:

[0102] $q_v = \mathrm{Var}(q_x)$

[0103] In the formula, $q_v$ represents the variance, $q_x$ represents the sample scripts in the sample script set, and the function Var() calculates the variance. The calculated variance can then be mapped to a variability between 0 and 1 using the following expression:

[0104]

[0105] In the formula, $q_\lambda$ indicates the variability of the sample script set.
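The two-stage computation in [0101] to [0105] (variance first, then a mapping into the range 0 to 1) can be sketched in Python as follows. The mapping formula is not legible in the text above, so the sketch substitutes v / (1 + v) purely as an assumed example of a monotonic map from a non-negative variance into [0, 1). Pooling all feature values into a single population variance is likewise an assumption, as are the function names.

```python
from statistics import pvariance

def variance_of_set(features: list[list[float]]) -> float:
    """Population variance over all feature values of the sample set.

    ASSUMPTION: the values of all sample vectors are pooled into one variance.
    """
    values = [v for row in features for v in row]
    return pvariance(values)

def variability(variance: float) -> float:
    """Map a non-negative variance into [0, 1).

    ASSUMPTION: v / (1 + v) stands in for the mapping, which is not shown here.
    """
    return variance / (1.0 + variance)

uniform_set = [[1.0, 1.0], [1.0, 1.0]]   # identical values, no spread
varied_set = [[0.0, 2.0], [0.0, 2.0]]    # values spread around a mean of 1

lam_uniform = variability(variance_of_set(uniform_set))
lam_varied = variability(variance_of_set(varied_set))
```

Under this assumed mapping, a set with no spread yields a variability of 0 and a more varied set yields a larger value that never reaches 1.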

[0106] Regarding the variability of the aforementioned sample script set, it should be noted that this embodiment differs from the traditional Extreme Learning Machine in that, to adapt to the variability of the scripts, this embodiment employs an adaptive network redundancy adjustment mechanism based on the loss function to automatically adjust the network parameters according to the complexity of the training data. Therefore, as an optional implementation, the server can obtain the initial parameter update amount of the first model to be trained based on the vulnerability prediction loss; adjust the initial parameter update amount using the variability to obtain the target parameter update amount of the first model to be trained; and update the current first model to be trained based on the target parameter update amount.

[0107] In a specific implementation, the server calculates the vulnerability prediction loss based on the aforementioned loss function qL, and then calculates the weight update amount for each weight and the bias update amount for each bias. Finally, the weight update amount and bias update amount are fine-tuned using variability. The specific expressions are as follows:

[0108] $qW' = qW + q_\lambda \, \Delta qW$

[0109] $qb' = qb + q_\lambda \, \Delta qb$

[0110] In the formulas, qW′ represents the updated weight, qW represents the weight to be updated, ΔqW represents the weight update amount, and $q_\lambda$ represents the variability of the sample script set; qb′ represents the updated bias, qb represents the bias to be updated, and Δqb represents the bias update amount.
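The update expressions in [0108] and [0109] amount to scaling each update amount by the variability before adding it to the parameter. A small worked example in Python, with made-up numbers and hypothetical function names, is given below; how the update amounts themselves are computed from the loss is outside this sketch.

```python
def update_weights(W, delta_W, lam):
    """Apply W' = W + lam * delta_W element-wise to a weight matrix."""
    return [[w + lam * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta_W)]

def update_biases(b, delta_b, lam):
    """Apply b' = b + lam * delta_b element-wise to a bias vector."""
    return [bi + lam * di for bi, di in zip(b, delta_b)]

# Made-up values: a 1x2 weight matrix, one bias, and a variability of 0.5.
W_new = update_weights([[1.0, 2.0]], [[0.5, -1.0]], lam=0.5)
b_new = update_biases([0.0], [2.0], lam=0.5)
```

With a variability of 0.5, only half of each initial update amount is applied; a variability close to 0 leaves the parameters nearly unchanged.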

[0111] S3A: Determine whether the iteration stopping condition for the vulnerability detection model is met. If not, return to step S1A for execution; otherwise, execute step S4A.

[0112] S4A: Use the current first model to be trained as the vulnerability detection model.

[0113] In this way, the network parameters are automatically adjusted according to the complexity of the sample script set, so that the trained vulnerability detection model has good generalization ability and can adapt to the dynamic changes and complexity of the scripts.
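The control flow of steps S1A to S4A can be sketched as a short Python loop. The source does not specify the iteration stopping condition, so the loss threshold and iteration cap below are assumptions, as are the function names and the toy objective used to exercise the loop.

```python
from typing import Callable, List

Params = List[float]

def train(params: Params,
          loss_fn: Callable[[Params], float],
          update_fn: Callable[[Params], Params],
          lam: float,
          max_iters: int = 100,
          tol: float = 1e-6) -> Params:
    """Iterate S1A to S3A until an (assumed) stopping condition holds, then S4A."""
    for _ in range(max_iters):
        loss = loss_fn(params)                  # S1A: vulnerability prediction loss
        if loss < tol:                          # S3A: assumed stopping condition
            break
        delta = update_fn(params)               # initial parameter update amount
        params = [p + lam * d for p, d in zip(params, delta)]  # S2A: scaled update
    return params                               # S4A: current model is the result

# Toy objective (p - 3)^2, used only to show that the loop converges.
toy_loss = lambda p: (p[0] - 3.0) ** 2
toy_update = lambda p: [3.0 - p[0]]
trained = train([0.0], toy_loss, toy_update, lam=0.5)
```

In the toy run each iteration halves the distance to the optimum, so the loop stops well before the iteration cap.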

[0114] Based on the vulnerability detection model obtained in the above embodiments, this embodiment also provides a script vulnerability detection method. In this method, an electronic device acquires a script to be detected; a vulnerability detection model trained using the vulnerability detection model training method processes the script to obtain vulnerability prediction information for the script. This vulnerability prediction information is one of "no vulnerability," "syntax error," "execution error," or "security risk." This improves the efficiency of script detection.
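As a minimal sketch of the final classification step, the following Python function turns a model's four output scores into one of the four prediction labels. The argmax rule and the label ordering are assumptions; the source only states that the prediction is one of these four values.

```python
# Label order is an assumption made for this illustration.
LABELS = ["no vulnerability", "syntax error", "execution error", "security risk"]

def to_label(scores: list[float]) -> str:
    """Return the label whose score is highest (assumed decision rule)."""
    best = max(range(len(LABELS)), key=lambda i: scores[i])
    return LABELS[best]

prediction = to_label([0.1, 0.7, 0.1, 0.1])
```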

[0115] Furthermore, in practice, it has been found that as dimensionality increases, data becomes increasingly sparse in space, and even similar instances can be far apart due to slight differences in some dimensions. This causes originally closely related semantic information to become scattered. Therefore, as described in the embodiments above, in some implementations, the sample script set includes script features of multiple first original scripts. However, unlike conventional techniques, the script features of multiple first original scripts are obtained by dimensionality reduction of the initial features of multiple first original scripts using a feature compression model.

[0116] For this feature compression model, this embodiment employs an autoencoder algorithm for feature dimensionality reduction. However, unlike traditional autoencoders, this embodiment uses an autoencoder neural network with dual loss as the feature dimensionality reduction algorithm. In this algorithm, the autoencoder neural network compresses the input feature vector into a low-dimensional representation, and the decoder attempts to reconstruct the input features from this low-dimensional representation. Traditional autoencoders typically use reconstruction error as the optimization target. Inspired by the ability of regularization terms to sparsify the model structure, this embodiment uses regularization terms to sparsify the features, enabling the feature compression model to retain maximum information content while becoming more sparse. This allows for the revelation of more structural information in the encoded feature representation, enabling a deeper understanding of the inherent structural characteristics of the data.

[0117] Therefore, this embodiment uses a model with an encoder and a decoder as the second model to be trained. The training method for the second model to be trained is explained in detail below with reference to Figure 2:

[0118] S1B: Obtain the reconstruction error of the sample feature set based on the reconstructed feature set that the current second model to be trained produces for the sample feature set.

[0119] The sample feature set includes script features of multiple second original scripts. These multiple second original scripts and multiple first original scripts can be the same original script or different original scripts; this embodiment does not specifically limit this.

[0120] It should be noted that the second model to be trained is initialized before the sample feature set is input. In this embodiment, the encoder is denoted as $E$ and the decoder as $D$; the weights and biases of the encoder are denoted as $\theta_E$, and the weights and biases of the decoder are denoted as $\theta_D$. The weights and biases of both the encoder and the decoder are initialized randomly.

[0121] After initialization, the script features in the sample feature set are processed sequentially by the encoder and decoder of the second model to be trained to obtain the corresponding reconstructed features. Let the sample feature set be denoted as $x_a$. The encoder $E$ maps $x_a$ into a low-dimensional space to obtain the encoded feature set $z$, expressed as $z = E(x_a, \theta_E)$. The decoder $D$ then reconstructs the encoded feature set $z$ into the reconstructed feature set, expressed as $\hat{x}_a = D(z, \theta_D)$.
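The initialization and forward pass described above can be sketched as follows. This is a minimal illustration, assuming a single fully connected layer for each of the encoder and decoder, a `tanh` activation, and the layer sizes shown; none of these choices is specified by this embodiment, and the dictionary layout of `theta_E` and `theta_D` is hypothetical.

```python
import numpy as np

def init_params(in_dim, code_dim, rng):
    """Randomly initialise encoder parameters theta_E and decoder parameters theta_D."""
    theta_E = {"W": rng.normal(0.0, 0.1, (in_dim, code_dim)),
               "b": rng.normal(0.0, 0.1, code_dim)}
    theta_D = {"W": rng.normal(0.0, 0.1, (code_dim, in_dim)),
               "b": rng.normal(0.0, 0.1, in_dim)}
    return theta_E, theta_D

def E(x_a, theta_E):
    """Encoder: map features to the low-dimensional code z = E(x_a, theta_E)."""
    return np.tanh(x_a @ theta_E["W"] + theta_E["b"])

def D(z, theta_D):
    """Decoder: reconstruct features from the code, x_hat = D(z, theta_D)."""
    return z @ theta_D["W"] + theta_D["b"]

rng = np.random.default_rng(0)
x_a = rng.normal(size=(8, 16))        # 8 script features, 16 dimensions each (toy data)
theta_E, theta_D = init_params(16, 4, rng)
z = E(x_a, theta_E)                   # encoded feature set, shape (8, 4)
x_hat = D(z, theta_D)                 # reconstructed feature set, shape (8, 16)
```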

[0122] Since the reconstruction error is used to measure the difference between the sample feature set and the reconstructed feature set, it is calculated as follows:

[0123]

[0124] In the formula, the terms denote, respectively, the reconstruction error, the i-th script feature in the sample feature set, and the reconstructed feature of the i-th script feature; n denotes the number of script features in the sample feature set.
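One form consistent with the quantities described above, assuming the reconstruction error is a mean squared error, would be the following. The symbol names $L_{rec}$, $x_i$ and $\hat{x}_i$ are chosen here for illustration only, and the exact expression used in this embodiment may differ.

```latex
L_{rec} = \frac{1}{n} \sum_{i=1}^{n} \left\lVert x_i - \hat{x}_i \right\rVert_2^2
```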

[0125] S2B: Sparsify the encoded feature set output by the encoder to obtain the sparse self-expression error.

[0126] The encoded feature set includes multiple encoded vectors. In this embodiment, the encoded feature set is sparsified by using a sparse self-expression matrix that represents the similarity information between the multiple encoded vectors. To this end, this embodiment also provides an optional implementation method for step S2B:

[0127] S2B-1: Generate a sparse self-expression matrix based on the encoded feature set.

[0128] In a specific implementation, the server can calculate the Euclidean distance between the feature matrix composed of the multiple encoded vectors and the transpose of the feature matrix to obtain an initial self-expression matrix; adjust the initial self-expression matrix by a preset matrix density coefficient to obtain an optimized self-expression matrix; and normalize the optimized self-expression matrix to obtain the sparse self-expression matrix.

[0129] S2B-2: Calculate the vector distance between each encoded vector and its own sparse vector.

[0130] Here, the sparse vector of each encoded vector is the vector obtained after the encoded vector is mapped by the sparse self-expression matrix.

[0131] S2B-3: Use the sum of all vector distances as the sparse self-expression error.

[0132] For example, the sparse self-expression error can be expressed as follows:

[0133]

[0134] In the formula, $n$ represents the number of script features in the sample feature set, $\alpha_{ed}$ represents the regularization parameter, $Z_i$ represents the i-th encoded feature in the encoded feature set, and $S$ represents the self-expression matrix. The expression for the self-expression matrix is:

[0135] $S = \mathrm{softmax}\left(-\gamma_{pe}\,\lVert Z Z^{T} \rVert\right)$

[0136] In the formula, $Z$ represents the matrix consisting of all encoded features, $Z^{T}$ represents the transpose of $Z$, $\gamma_{pe}$ represents the preset matrix density coefficient, $\lVert\cdot\rVert$ represents the calculation of the Euclidean distance, and $\mathrm{softmax}(\cdot)$ represents the function used for normalization.
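Steps S2B-1 to S2B-3 can be sketched in Python as follows. This is one possible reading, not a definitive implementation: it assumes that the Euclidean distance term is the matrix of pairwise distances between encoded vectors, that the softmax is applied row by row, and that the regularization parameter multiplies the summed distances. The function names are hypothetical; `gamma_pe` and `alpha_ed` stand for the matrix density coefficient and the regularization parameter.

```python
import numpy as np

def self_expression_matrix(Z, gamma_pe=1.0):
    """S2B-1: build the (n, n) self-expression matrix from encoded vectors Z of shape (n, k)."""
    diff = Z[:, None, :] - Z[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))      # initial matrix: pairwise Euclidean distances
    logits = -gamma_pe * dist                     # adjusted by the matrix density coefficient
    logits -= logits.max(axis=1, keepdims=True)   # shift for numerical stability
    S = np.exp(logits)
    return S / S.sum(axis=1, keepdims=True)       # row-wise softmax normalisation

def sparse_self_expression_error(Z, gamma_pe=1.0, alpha_ed=1.0):
    """S2B-2 and S2B-3: sum of distances between each vector and its mapped (sparse) vector."""
    S = self_expression_matrix(Z, gamma_pe)
    Z_sparse = S @ Z                                   # row i: encoded vector i mapped by S
    distances = np.linalg.norm(Z - Z_sparse, axis=1)   # one vector distance per encoded vector
    return alpha_ed * distances.sum()
```

Under these assumptions, a larger `gamma_pe` concentrates each row of the matrix on the nearest encoded vectors, so the matrix becomes sparser and the mapped vectors stay closer to the originals.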

[0137] S3B: Use the sparse self-expression error as a regularization term to constrain the reconstruction error, and update the current second model to be trained.

[0138] The reconstruction error and the sparse self-expression error described above are weighted together to obtain the weighted error $L_A$, whose expression is:

[0139]

[0140] Based on the above embodiments' description of reconstruction error, sparse self-expression error, and weighted error, and referring to the figures, the method provided in this embodiment further includes:

[0141] S4B: Determine whether the iteration stopping condition for the feature compression model is met. If not, return to step S1B for execution; otherwise, execute step S5B.

[0142] S5B: Use the current second model to be trained as the feature compression model.
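The loop formed by steps S1B to S5B can be sketched as follows. This is a toy illustration under several assumptions that are not taken from this embodiment: a linear encoder and decoder without biases, a mean squared reconstruction error, an unweighted sum of the two errors as the weighted error, a finite-difference gradient, a step-halving rule that rejects updates which do not reduce the error, and a stopping condition based on a maximum iteration count or a small improvement. All function and parameter names are hypothetical.

```python
import numpy as np

def weighted_error(theta, X, code_dim, gamma_pe, alpha_ed):
    """Assumed weighted error: reconstruction error plus sparse self-expression error."""
    d = X.shape[1]
    W_e = theta[: d * code_dim].reshape(d, code_dim)
    W_d = theta[d * code_dim:].reshape(code_dim, d)
    Z = X @ W_e                                        # encoder output
    X_hat = Z @ W_d                                    # decoder output
    rec = np.mean(np.sum((X - X_hat) ** 2, axis=1))    # S1B: reconstruction error
    dist = np.sqrt(((Z[:, None] - Z[None]) ** 2).sum(-1))
    logits = -gamma_pe * dist
    S = np.exp(logits - logits.max(axis=1, keepdims=True))
    S /= S.sum(axis=1, keepdims=True)
    se = alpha_ed * np.linalg.norm(Z - S @ Z, axis=1).sum()   # S2B: sparse self-expression error
    return rec + se

def numerical_grad(f, theta, eps=1e-5):
    """Central finite-difference gradient (slow, only suitable for tiny models)."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

def train_feature_compressor(X, code_dim=2, gamma_pe=1.0, alpha_ed=0.01,
                             lr=0.05, max_iter=200, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    theta = rng.normal(0.0, 0.1, 2 * d * code_dim)     # random initialisation
    f = lambda t: weighted_error(t, X, code_dim, gamma_pe, alpha_ed)
    history = [f(theta)]
    for _ in range(max_iter):
        candidate = theta - lr * numerical_grad(f, theta)   # S3B: update the model
        loss = f(candidate)
        if loss >= history[-1]:
            lr *= 0.5            # step too large: shrink it and try again
            continue
        theta = candidate
        history.append(loss)
        if history[-2] - history[-1] < tol:            # S4B: assumed stopping condition
            break
    return theta, history                              # S5B: current model is the result
```

A call such as `train_feature_compressor(X)` on a small feature matrix `X` returns the flattened parameters and the history of accepted weighted errors, which decreases by construction because of the step-halving rule.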

[0143] Thus, when evaluating the feature extraction performance of the second model to be trained, the sparse self-expression layer forces more structural information to be revealed in the encoded feature representation, thereby enabling a deeper exploration of the inherent structural characteristics of the data. This allows the trained feature compression model to reduce the feature dimension while minimizing information loss.

[0144] It should be understood that although the collection, annotation, and preprocessing of sample data are time-consuming and labor-intensive processes, insufficient sample data can easily lead to poor model generalization ability and affect model accuracy. Therefore, the original scripts used during the training of the first and second models mentioned above, in addition to the scripts written by the developers, also include some extended scripts generated based on the scripts written by the developers.

[0145] Therefore, in this embodiment, the server also generates samples based on a deep feature space-extended generative adversarial network (GAN) algorithm to achieve data augmentation. The deep feature space-extended GAN consists of two parts: a generator (G) and a discriminator (D). Through the cooperation of the generator (G) and the discriminator (D), sufficiently realistic augmented scripts can be obtained. Continuing with the TCL script as an example, during model training, the generator is responsible for generating near-realistic script code data, while the discriminator is responsible for distinguishing whether the input data comes from the real dataset or the data generated by the generator. After sufficient training, the augmented script generated by the generator can successfully deceive the discriminator.
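The interplay between the generator and the discriminator can be illustrated with the standard adversarial losses below. This embodiment does not give the objective of the deep feature space-extended GAN, so the binary cross-entropy discriminator loss and the non-saturating generator loss shown here are an assumption borrowed from the conventional GAN formulation; `d_real` and `d_fake` are hypothetical arrays of discriminator outputs in the range (0, 1).

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy: real scripts should score 1, generated scripts 0."""
    d_real = np.clip(d_real, eps, 1.0 - eps)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

def generator_loss(d_fake, eps=1e-12):
    """Non-saturating generator loss: generated scripts should be scored as real."""
    return -np.mean(np.log(np.clip(d_fake, eps, 1.0 - eps)))
```

Under this formulation, the generator loss falls as the discriminator assigns higher scores to generated scripts, which corresponds to the augmented scripts deceiving the discriminator.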

[0146] Based on the same inventive concept as the vulnerability detection model training method provided in this embodiment, this embodiment also provides a vulnerability detection model training device. This device includes at least one software functional module that can be stored in a memory or embedded in software. A processor in an electronic device is used to execute the executable modules stored in the memory, for example, the software functional modules and computer programs included in this device. Referring to Figure 3, the device may be functionally divided into:

[0147] The model loss module 11 is used to obtain the vulnerability prediction loss of the sample script set based on the vulnerability prediction information of the current first model to be trained on the sample script set.

[0148] The model training module 12 is used to obtain the variability of the sample script set and update the current first model to be trained based on the variability and vulnerability prediction loss. The variability represents the complexity of the sample script set.

[0149] Iterate through the above steps until the iteration stopping condition for the vulnerability detection model is met.

[0150] In this embodiment, the model loss module 11 is used to implement step S1A in Figure 1, and the model training module 12 is used to implement steps S2A-S4A in Figure 1. Therefore, for a detailed description of each module, please refer to the specific implementation of the corresponding step described above. It should be noted that the vulnerability detection model training device can also implement other steps or sub-steps of the method through the above modules or other modules, which will not be repeated in this embodiment.
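The division of work between the two modules can be sketched as follows. This is a rough illustration only: the cross-entropy loss, the use of the overall variance as the variability, and in particular the damping rule that divides the initial update by one plus the variability are assumptions made for this example, and the class and method names are hypothetical.

```python
import numpy as np

class ModelLossModule:
    """Module 11 (step S1A): vulnerability prediction loss of the sample script set."""

    def loss(self, probs, labels):
        # Cross-entropy over predicted class probabilities (assumed choice of loss).
        picked = probs[np.arange(len(labels)), labels]
        return float(-np.mean(np.log(np.clip(picked, 1e-12, 1.0))))

class ModelTrainingModule:
    """Module 12 (steps S2A-S4A): variability-aware parameter update."""

    def variability(self, sample_script_set):
        # Variance of the sample script set, used here directly as the variability.
        return float(np.var(sample_script_set))

    def update(self, params, initial_update, variability):
        # Hypothetical adjustment rule: damp the update as the variability grows.
        target_update = initial_update / (1.0 + variability)
        return params - target_update
```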

[0151] In addition, the functional modules in the various embodiments of this application can be integrated together to form an independent part, or each module can exist independently, or two or more modules can be integrated to form an independent part.

[0152] It should also be understood that if the above embodiments are implemented as software functional modules and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application.

[0153] Therefore, this embodiment also provides a storage medium, which is a computer-readable storage medium. This storage medium stores a computer program, which, when executed by a processor, implements the vulnerability detection model training method or script vulnerability detection method provided in this embodiment. The storage medium can be any medium capable of storing program code, such as a USB flash drive, external hard drive, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disk.

[0154] This embodiment provides an electronic device for implementing a vulnerability detection model training method. As shown in Figure 4, the electronic device may include a processor 22 and a memory 21. The memory 21 stores a computer program, and the processor 22 reads and executes the computer program in the memory 21 corresponding to the above embodiments to implement the vulnerability detection model training method or script vulnerability detection method provided in this embodiment.

[0155] Referring again to Figure 4, the electronic device also includes a communication unit 23. The memory 21, processor 22 and communication unit 23 are electrically connected to each other directly or indirectly through a system bus 24 to realize data transmission or interaction.

[0156] The memory 21 can be an information recording device based on any electronic, magnetic, optical, or other physical principle, used to record execution instructions, data, etc. In some embodiments, the memory 21 can be, but is not limited to, volatile memory, non-volatile memory, a storage drive, etc.

[0157] In some embodiments, the volatile memory may be random access memory (RAM); in some embodiments, the non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory, etc.; in some embodiments, the storage drive may be a disk drive, solid-state drive, any type of storage disk (such as optical disc, DVD, etc.), or similar storage media, or a combination thereof.

[0158] The communication unit 23 is used to send and receive data over a network. In some embodiments, the network may include a wired network, a wireless network, a fiber optic network, a telecommunications network, an intranet, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public switched telephone network (PSTN), a Bluetooth network, a ZigBee network, or a near field communication (NFC) network, or any combination thereof. In some embodiments, the network may include one or more network access points. For example, the network may include wired or wireless network access points, such as base stations and / or network switching nodes, through which one or more components of the service request processing system can connect to the network to exchange data and / or information.

[0159] The processor 22 may be an integrated circuit chip with signal processing capabilities, and may include one or more processing cores (e.g., a single-core processor or a multi-core processor). By way of example only, the processor described above may include a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), an Application Specific Instruction-set Processor (ASIP), a Graphics Processing Unit (GPU), a Physics Processing Unit (PPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), a Programmable Logic Device (PLD), a controller, a microcontroller unit, a Reduced Instruction Set Computing (RISC) computer, or a microprocessor, or any combination thereof.

[0160] It should be understood that the structure shown in Figure 4 is for illustrative purposes only. The electronic device may also include more or fewer components than those shown in Figure 4, or have a configuration different from that shown in Figure 4. The components shown in Figure 4 can be implemented using hardware, software, or a combination thereof.

[0161] It should be understood that the apparatus and methods disclosed in the above embodiments can also be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the accompanying drawings show the architecture, functionality, and operation of possible implementations of apparatus, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions marked in the blocks may occur in a different order than those marked in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram and / or flowchart, and combinations of blocks in block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.

[0162] The above descriptions are merely various embodiments of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A method for training a vulnerability detection model, characterized in that the method includes: obtaining a vulnerability prediction loss for a sample script set based on the vulnerability prediction information produced for the sample script set by a current first model to be trained; obtaining the variability of the sample script set, and updating the current first model to be trained based on the variability and the vulnerability prediction loss, wherein the variability characterizes the complexity of the sample script set; and iterating the above steps until the current first model to be trained satisfies the iteration stopping condition for the vulnerability detection model; wherein the sample script set includes script features of multiple first original scripts, and the script features of the multiple first original scripts are obtained by dimensionality reduction of the initial features of the multiple first original scripts using a feature compression model; the method further includes: obtaining the reconstruction error of a sample feature set based on the reconstructed feature set produced for the sample feature set by a current second model to be trained, wherein the sample feature set is processed sequentially by the encoder and decoder of the second model to be trained to obtain the reconstructed feature set; sparsifying the encoded feature set output by the encoder to obtain a sparse self-expression error; updating the current second model to be trained by using the sparse self-expression error as a regularization term to constrain the reconstruction error; and iterating the above steps until the current second model to be trained satisfies the iteration stopping condition for the feature compression model.

2. The vulnerability detection model training method according to claim 1, characterized in that obtaining the variability of the sample script set includes: calculating the variance of the sample script set; and obtaining the variability of the sample script set from the variance.

3. The vulnerability detection model training method according to claim 1, characterized in that updating the current first model to be trained based on the variability and the vulnerability prediction loss includes: obtaining an initial parameter update amount of the first model to be trained based on the vulnerability prediction loss; adjusting the initial parameter update amount using the variability to obtain a target parameter update amount of the first model to be trained; and updating the first model to be trained based on the target parameter update amount.

4. The vulnerability detection model training method according to claim 1, characterized in that sparsifying the encoded feature set output by the encoder to obtain the sparse self-expression error includes: generating a sparse self-expression matrix based on the encoded feature set, wherein the encoded feature set includes multiple encoded vectors, and the sparse self-expression matrix carries similarity information among the multiple encoded vectors; calculating the vector distance between each encoded vector and its own sparse vector, wherein the sparse vector of each encoded vector is the vector obtained after the encoded vector is mapped by the sparse self-expression matrix; and taking the sum of all the vector distances as the sparse self-expression error.

5. The vulnerability detection model training method according to claim 4, characterized in that generating the sparse self-expression matrix based on the encoded feature set includes: calculating the Euclidean distance between the feature matrix formed by the multiple encoded vectors and the transpose of the feature matrix to obtain an initial self-expression matrix; adjusting the initial self-expression matrix using a preset matrix density coefficient to obtain an optimized self-expression matrix; and normalizing the optimized self-expression matrix to obtain the sparse self-expression matrix.

6. A script vulnerability detection method, characterized in that the method includes: obtaining a script to be detected; and processing the script to be detected using a vulnerability detection model trained by the vulnerability detection model training method according to any one of claims 1-5, to obtain vulnerability prediction information of the script to be detected.

7. A vulnerability detection model training device, characterized in that the device includes: a model loss module, used to obtain the vulnerability prediction loss for a sample script set based on the vulnerability prediction information produced for the sample script set by a current first model to be trained; and a model training module, used to obtain the variability of the sample script set and update the current first model to be trained based on the variability and the vulnerability prediction loss, wherein the variability characterizes the complexity of the sample script set, and the above steps are iterated until the current first model to be trained satisfies the iteration stopping condition for the vulnerability detection model; wherein the sample script set includes script features of multiple first original scripts, which are obtained by dimensionality reduction of the initial features of the multiple first original scripts through a feature compression model; the model loss module is further used to obtain the reconstruction error of a sample feature set based on the reconstructed feature set produced for the sample feature set by a current second model to be trained, wherein the sample feature set is processed sequentially by the encoder and decoder of the second model to be trained to obtain the reconstructed feature set; the encoded feature set output by the encoder is sparsified to obtain a sparse self-expression error; and the model training module is further used to update the current second model to be trained by using the sparse self-expression error as a regularization term to constrain the reconstruction error, the above steps being iterated until the current second model to be trained satisfies the iteration stopping condition for the feature compression model.

8. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the vulnerability detection model training method according to any one of claims 1-5 or the script vulnerability detection method according to claim 6.

9. An electronic device, characterized in that the electronic device includes a processor and a memory, the memory stores a computer program, and the computer program, when executed by the processor, implements the vulnerability detection model training method according to any one of claims 1-5 or the script vulnerability detection method according to claim 6.