Method, apparatus, device, and storage medium for expanding large model vulnerability detection data
By identifying and mutating only the non-data-processing code in the original vulnerability detection code, the method generates new vulnerability detection code with the same data processing logic as the original. This addresses the low accuracy of expanded vulnerability detection data and improves the accuracy of vulnerability detection evaluation for large models.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ELECTRIC POWER RES INST OF GUANGDONG POWER GRID CO LTD
- Filing Date
- 2026-04-08
- Publication Date
- 2026-06-19
AI Technical Summary
Vulnerability detection data in existing technologies has low accuracy; in particular, vulnerability labels become inaccurate when vulnerability detection benchmark test data is expanded.
The data processing logic code in the original vulnerability detection code is identified, and the remaining code is designated as code to be mutated. The input and output data types are parsed, applicable mutation operations are selected, and a target mutation operation is applied to the code to be mutated to generate the mutated code. Finally, the mutated code and the data processing logic code are integrated.
The method improves the accuracy of vulnerability detection code, ensures that the generated vulnerability detection code retains the original data processing logic, and enhances the reliability of evaluating the vulnerability detection capability of large models.
Figure CN122241719A_ABST
Description
Technical Field
[0001] This invention relates to the field of vulnerability detection technology, and in particular to a method, apparatus, device, and storage medium for expanding large model vulnerability detection data.
Background Technology
[0002] With advancements in AI hardware and software technologies, Large Language Models (LLMs) have demonstrated strong programming capabilities, and the security knowledge they possess has also endowed them with certain vulnerability detection capabilities. To provide an effective reference for applying LLMs to code auditing, it is necessary to accurately evaluate their vulnerability detection capabilities. Testing these capabilities requires the use of vulnerability detection benchmark datasets.
[0003] However, high-quality vulnerability detection benchmark data is scarce. Currently, there are two methods for expanding vulnerability detection benchmark data. The first collects vulnerability-related code from CVE libraries and from commit records on open-source code platforms using automated methods (such as web crawlers). Vulnerabilities are then labeled based on the code before and after each modification, forming the data in the vulnerability detection benchmark set. However, since open-source projects often contain multiple vulnerability-related commits, it is impossible to determine which specific vulnerability a modification relates to, which reduces the accuracy of the vulnerability labels in the vulnerability detection data. The second method applies natural language processing to existing vulnerability detection benchmark data to generate new vulnerability detection test data. However, simple semantic processing risks modifying key data processing logic in the benchmark data, so the accuracy of the newly generated vulnerability detection test data is unreliable.
Summary of the Invention
[0004] This invention provides a method, apparatus, device, and storage medium for expanding large model vulnerability detection data, which can solve the problem of low accuracy of vulnerability detection data in the prior art.
[0005] To address the aforementioned technical problems, this invention provides a method for expanding large model vulnerability detection data, comprising: identifying the data processing logic code in the original vulnerability detection code, and determining the code other than the data processing logic code as code to be mutated; parsing the original vulnerability detection code to determine the input data type and the output data type; selecting several applicable mutation operations from a preset mutation rule library based on the input data type and the output data type; randomly selecting a target mutation operation from the applicable mutation operations; mutating the code to be mutated based on the target mutation operation to obtain the mutated code; and integrating the mutated code and the data processing logic code to generate target vulnerability detection code for testing the vulnerability detection capability of large models.
[0006] As a preferred embodiment, the step of identifying the data processing logic code in the original vulnerability detection code and determining the code other than the data processing logic code as code to be mutated includes: identifying several code behavior features in the original vulnerability detection code; comparing each code behavior feature with preset data processing code features and preset non-data processing code features to select the first code behavior feature, which characterizes data processing; locating the first code behavior feature in the original vulnerability detection code, determining the code segment corresponding to the first code behavior feature as the data processing logic code, and recording the line number data of the data processing logic code; based on the line number data, determining all code in the original vulnerability detection code located before the starting line of the data processing logic code as preprocessing code, and all code located after the ending line of the data processing logic code as postprocessing code; and combining the preprocessing code and the postprocessing code to generate the code to be mutated.
[0007] As a preferred embodiment, the step of parsing the original vulnerability detection code to determine the input data type and output data type includes: identifying the input parameter type and the output parameter type in the function signature of the original vulnerability detection code; when neither the input parameter type nor the output parameter type is null, determining the input parameter type as the input data type and the output parameter type as the output data type; when the input parameter type or the output parameter type is null, inputting the original vulnerability detection code into a preset parameter type recognition model so that the model outputs a predicted input parameter type and a predicted output parameter type; if the predicted input parameter type is not null, using it as the input data type, and if it is null, determining the input data type to be an unrecognized type; and if the predicted output parameter type is not null, using it as the output data type, and if it is null, determining the output data type to be an unrecognized type.
[0008] As a preferred embodiment, the step of selecting several applicable mutation operations from a preset mutation rule library based on the input data type and the output data type includes: when neither the input data type nor the output data type is an unrecognized type, determining the input processing mutation, output processing mutation, and parameter expansion mutation in the preset mutation rule library as applicable mutation operations; when the input data type is an unrecognized type and the output data type is not, determining the input processing mutation and output processing mutation in the preset mutation rule library as applicable mutation operations; when the input data type is not an unrecognized type and the output data type is, determining the input processing mutation and parameter expansion mutation in the preset mutation rule library as applicable mutation operations; and when both the input data type and the output data type are unrecognized types, determining the input processing mutation in the preset mutation rule library as the applicable mutation operation.
[0009] As a preferred embodiment, when the target mutation operation is input processing mutation, the step of performing mutation processing on the code to be mutated based on the target mutation operation to obtain the mutated code includes: when the input data type is not an unrecognized type, determining several type-oriented preprocessing operations based on the input data type, randomly selecting a target type-oriented preprocessing operation from them, generating first new code based on the target type-oriented preprocessing operation, and adding the first new code to the preprocessing code in the code to be mutated to generate the mutated code, wherein the type-oriented preprocessing includes data type conversion, data cleaning, or data verification; and when the input data type is an unrecognized type, randomly selecting a target type-independent preprocessing operation from several preset type-independent preprocessing operations, generating first new code based on the target type-independent preprocessing operation, and adding the first new code to the preprocessing code in the code to be mutated to generate the mutated code, wherein the preset type-independent preprocessing includes business rule verification, general identifier addition, or data anonymization.
[0010] As a preferred embodiment, when the target mutation operation is output processing mutation, the step of performing mutation processing on the code to be mutated based on the target mutation operation to obtain the mutated code includes: determining several data post-processing operations based on the output data type, wherein the data post-processing operations include data structure format conversion, data deduplication, or data type conversion; randomly selecting a target data post-processing operation from the several data post-processing operations, and generating second new code based on the target data post-processing operation; and adding the second new code to the postprocessing code in the code to be mutated to generate the mutated code.
[0011] As a preferred embodiment, when the target mutation operation is parameter expansion mutation, the step of performing mutation processing on the code to be mutated based on the target mutation operation to obtain the mutated code includes: identifying the business meaning of the input parameters in the original vulnerability detection code; determining the business execution logic based on the data processing logic code; determining the business-related parameters of the input parameters based on the business meaning of the input parameters and the business execution logic; combining the input parameters and the business-related parameters to generate new input parameters; and modifying the preprocessing code in the code to be mutated based on the new input parameters to generate the mutated code.
[0012] Accordingly, the present invention provides a device for expanding large model vulnerability detection data, including: a code extraction module, a type identification module, a mutation operation filtering module, a mutation operation selection module, a mutation processing module, and a code integration module. The code extraction module is used to identify the data processing logic code in the original vulnerability detection code, and to determine the code other than the data processing logic code as code to be mutated. The type identification module is used to parse the original vulnerability detection code and determine the input data type and output data type. The mutation operation filtering module is used to select several applicable mutation operations from a preset mutation rule library based on the input data type and the output data type. The mutation operation selection module is used to randomly select a target mutation operation from the applicable mutation operations. The mutation processing module is used to perform mutation processing on the code to be mutated based on the target mutation operation to obtain the mutated code. The code integration module is used to integrate the mutated code and the data processing logic code to generate target vulnerability detection code for testing the vulnerability detection capability of large models.
[0013] The present invention also provides a terminal device, comprising: a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein when the processor executes the computer program, it implements the steps of the method for expanding large model vulnerability detection data as described in the present invention.
[0014] The present invention also provides a computer-readable storage medium comprising a stored computer program, wherein, when the computer program runs, the device on which the computer-readable storage medium is located executes the steps of the method for expanding large model vulnerability detection data of the present invention.
[0015] Compared with the prior art, the embodiments of the present invention have the following beneficial effects: This invention provides a method for expanding large model vulnerability detection data. The method identifies the data processing logic code in the original vulnerability detection code and determines the code other than the data processing logic code as the code to be mutated; parses the original vulnerability detection code to determine the input and output data types; selects several applicable mutation operations from a preset mutation rule library based on the input and output data types, and randomly selects a target mutation operation; mutates the code to be mutated based on the target mutation operation to obtain the mutated code; and integrates the mutated code and the data processing logic code to generate target vulnerability detection code for testing the vulnerability detection capability of large models. Because only the code other than the data processing logic code is mutated, the core data processing logic of the original vulnerability detection code is retained, and the resulting target vulnerability detection code has the same data processing logic as the original, thereby improving both the quality and accuracy of the vulnerability detection code.
Attached Figure Description
[0016] To more clearly illustrate the technical solution of this application, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0017] Figure 1 is a flowchart of an embodiment of the method for expanding large model vulnerability detection data provided by the present invention; Figure 2 is a schematic diagram of an embodiment of the device for expanding large model vulnerability detection data provided by the present invention.
Detailed Implementation
[0018] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be clearly and completely described below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0019] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains; the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the application; the terms “comprising” and “having”, and any variations thereof, in the specification, claims, and foregoing description of the drawings are intended to cover non-exclusive inclusion.
[0020] In the description of the embodiments of this application, technical terms such as "first" and "second" are used only to distinguish different objects and should not be construed as indicating or implying relative importance or implicitly specifying the number, specific order, or primary and secondary relationship of the indicated technical features. In the description of the embodiments of this application, "multiple" means two or more, unless otherwise explicitly defined.
[0021] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor to a separate or alternative embodiment that is mutually exclusive with other embodiments. Those skilled in the art will understand, both explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
[0022] In the description of the embodiments in this application, the term "and / or" is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, or B existing alone. Additionally, the character " / " in this document generally indicates that the preceding and following related objects have an "or" relationship.
[0023] In the description of the embodiments of this application, the term "multiple" refers to two or more (including two), similarly, "multiple sets" refers to two or more (including two sets), and "multiple pieces" refers to two or more (including two pieces).
[0024] In the description of the embodiments of this application, unless otherwise expressly specified and limited, technical terms such as "installation," "connection," "joining," and "fixing" should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral part; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; they can refer to the internal communication of two components or the interaction between two components. For those skilled in the art, the specific meaning of the above terms in the embodiments of this application can be understood according to the specific circumstances.
[0025] Referring to Figure 1, to address the issue of low accuracy in vulnerability detection data in existing technologies, an embodiment of the present invention provides a method for expanding large model vulnerability detection data. The method includes steps 101 to 106, each of which is detailed below:
Step 101: Identify the data processing logic code in the original vulnerability detection code, and determine the code other than the data processing logic code as the code to be mutated.
[0026] In this embodiment of the invention, the original vulnerability detection code is used to test the vulnerability detection capability of a large language model. The original vulnerability detection code consists of a pre-processing section, a core detection section, and a post-processing section. The pre-processing section contains all the code before the execution of the core detection logic, and is mainly responsible for functions such as input parameter preprocessing and detection environment preparation. The core detection section is the original core logic (core business / computation logic) of vulnerability detection, responsible for core functions such as vulnerability feature matching and risk assessment; it is equivalent to the data processing logic code. The post-processing section contains all the code after the execution of the core detection logic, and is mainly responsible for functions such as post-processing of detection results, output format adjustment, and data return. To retain the core detection section of the original vulnerability detection code, the data processing logic code must first be separated from the original vulnerability detection code before mutation processing, and the remaining part is determined as the code to be mutated.
[0027] As a preferred embodiment, identifying the data processing logic code in the original vulnerability detection code and determining the code other than the data processing logic code as code to be mutated includes: identifying several code behavior features in the original vulnerability detection code; comparing each code behavior feature with preset data processing code features and preset non-data processing code features to select the first code behavior feature, which characterizes data processing; locating the first code behavior feature in the original vulnerability detection code, determining the code segment corresponding to the first code behavior feature as the data processing logic code, and recording the line number data of the data processing logic code; based on the line number data, determining all code in the original vulnerability detection code located before the starting line of the data processing logic code as preprocessing code, and all code located after the ending line of the data processing logic code as postprocessing code; and combining the preprocessing code and the postprocessing code to generate the code to be mutated.
[0028] In this embodiment of the invention, the data processing logic code is the part of the code that transforms, calculates, filters, and aggregates data, used to change the form and content of the data. Code behavior characteristics are the external manifestations of the code that directly reflect its function, equivalent to the fingerprint of the code. Therefore, by identifying the code behavior characteristics in the original vulnerability detection code, the data processing logic code can be extracted.
[0029] First, all code behavior features in the original vulnerability detection code are identified. These include features representing data processing and features representing non-data processing. Each extracted code behavior feature is compared with the preset data processing code features and the preset non-data processing code features, and its similarity to each set is calculated. A code behavior feature that is more similar to the preset data processing code features is designated as the first code behavior feature, which represents data processing. The first code behavior feature is then located in the original vulnerability detection code, which allows the data processing logic code to be extracted. The line numbers of the data processing logic code give its starting and ending lines, and these divide the original vulnerability detection code into three parts: all code before the starting line of the data processing logic code is identified as preprocessing code, and all code after the ending line of the data processing logic code is identified as postprocessing code. The code to be mutated is generated from the preprocessing code and the postprocessing code.
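The partitioning described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the two feature vocabularies, the token-level Jaccard similarity, and the function names (`jaccard`, `is_data_processing`, `split_code`) are all assumptions chosen for illustration, since the text does not specify how features are extracted or compared.

```python
import re

# Hypothetical preset feature vocabularies; the source does not list concrete features.
DATA_PROCESSING_FEATURES = {"sum", "sorted", "filter", "map", "join", "replace", "append"}
NON_DATA_PROCESSING_FEATURES = {"print", "open", "close", "logging", "return", "import"}


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_data_processing(line):
    """Treat a line as data processing if it is more similar to that feature set."""
    tokens = set(re.findall(r"[A-Za-z_]+", line))
    return jaccard(tokens, DATA_PROCESSING_FEATURES) > jaccard(tokens, NON_DATA_PROCESSING_FEATURES)


def split_code(source):
    """Split source into preprocessing, data processing, and postprocessing lines."""
    lines = source.splitlines()
    hits = [i for i, line in enumerate(lines) if is_data_processing(line)]
    if not hits:
        raise ValueError("no data processing logic found")
    start, end = hits[0], hits[-1]
    return {
        "line_range": (start + 1, end + 1),  # 1-based, inclusive line numbers
        "preprocessing": lines[:start],
        "data_processing": lines[start:end + 1],
        "postprocessing": lines[end + 1:],
    }


# Toy input, invented for this example.
EXAMPLE = '''def handle(items):
    print("start")
    cleaned = filter(None, items)
    total = sum(cleaned)
    print("done")
    return total
'''

parts = split_code(EXAMPLE)
code_to_mutate = parts["preprocessing"] + parts["postprocessing"]
```

In this toy input, lines 3 and 4 are classified as data processing, so the function header and first `print` become preprocessing code and the last two lines become postprocessing code.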
[0030] Step 102: Parse the original vulnerability detection code to determine the input data type and output data type.
[0031] In this embodiment of the invention, the mutation operations in the preset mutation rule base include mutation operations that change the data type and mutation operations that do not change the data type. Therefore, in order to determine the applicable mutation operation for the original vulnerability detection code, it is necessary to first determine the input data type and output data type of the original vulnerability detection code. The input data type and output data type can be obtained by machine parsing the original vulnerability detection code.
[0032] As a preferred embodiment, parsing the original vulnerability detection code to determine the input data type and output data type includes: identifying the input parameter type and the output parameter type in the function signature of the original vulnerability detection code; when neither the input parameter type nor the output parameter type is null, determining the input parameter type as the input data type and the output parameter type as the output data type; when the input parameter type or the output parameter type is null, inputting the original vulnerability detection code into a preset parameter type recognition model so that the model outputs a predicted input parameter type and a predicted output parameter type; if the predicted input parameter type is not null, using it as the input data type, and if it is null, determining the input data type to be an unrecognized type; and if the predicted output parameter type is not null, using it as the output data type, and if it is null, determining the output data type to be an unrecognized type.
[0033] In this embodiment of the invention, the input data type and output data type can generally be identified from the function signature of the original vulnerability detection code. Therefore, parsing the original vulnerability detection code first involves obtaining the function signature and identifying the input parameter type and output parameter type in the function signature. If neither the identified input parameter type nor the output parameter type is null, the identified input parameter type is determined as the input data type, and the identified output parameter type is determined as the output data type. If either the identified input parameter type or the output parameter type contains null values, the original vulnerability detection code needs to be input into a preset parameter type recognition model so that the model can predict the input parameter type and the output parameter type.
[0034] The training samples for the parameter type recognition model include vulnerability detection code whose function signature does not contain input parameter type and output parameter type, vulnerability detection code whose function signature contains input parameter type but does not contain output parameter type, and vulnerability detection code whose function signature does not contain input parameter type but contains output parameter type, as well as the input parameter type marker and output parameter type marker corresponding to each vulnerability detection code.
[0035] Once the parameter type recognition model has inferred the predicted input and output parameter types from the original vulnerability detection code, it is necessary to check whether the predicted types are null. If the predicted input parameter type is not null, it is used as the input data type; if it is null, the input data type is determined to be an unrecognized type. Similarly, if the predicted output parameter type is not null, it is used as the output data type; if it is null, the output data type is determined to be an unrecognized type. If the input or output data type is unrecognized, mutation operations that would change that data type are not performed on the original vulnerability detection code.
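The type resolution flow of Step 102 can be sketched in Python as follows. This is a minimal illustration under several assumptions: the vulnerability detection code is Python with optional type annotations, only the first parameter of the first function is inspected, and the parameter type recognition model is represented by a caller-supplied `predict` callable. The names `signature_types`, `resolve_types`, and `UNRECOGNIZED` are hypothetical.

```python
import ast

UNRECOGNIZED = "unrecognized"  # marker for a type that could not be determined


def signature_types(source):
    """Return (input type, output type) annotations of the first function, or None."""
    tree = ast.parse(source)
    func = next((n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)), None)
    if func is None:
        return None, None
    params = func.args.args
    in_type = None
    if params and params[0].annotation is not None:
        in_type = ast.unparse(params[0].annotation)
    out_type = ast.unparse(func.returns) if func.returns is not None else None
    return in_type, out_type


def resolve_types(source, predict):
    """Use the signature when complete; otherwise fall back to the model's prediction.

    `predict` stands in for the preset parameter type recognition model and must
    return a (predicted input type, predicted output type) pair, either of which
    may be None.
    """
    in_type, out_type = signature_types(source)
    if in_type is None or out_type is None:
        in_type, out_type = predict(source)
    return (in_type or UNRECOGNIZED, out_type or UNRECOGNIZED)


typed = "def f(x: str) -> int:\n    return len(x)\n"
untyped = "def f(x):\n    return len(x)\n"

full = resolve_types(typed, lambda src: (None, None))
partial = resolve_types(untyped, lambda src: ("str", None))
```

Here `full` comes straight from the signature, while `partial` relies on the stand-in model and ends up with an unrecognized output type because the model returned no prediction for it.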
[0036] Step 103: Based on the input data type and the output data type, select several applicable mutation operations from the preset mutation rule library.
[0037] In this embodiment of the invention, the preset mutation rule library contains a large number of effective mutation rules, each corresponding to a specific mutation operation. Mutation operations follow two basic principles. First, the introduced mutations should be easy to understand and implement, so that overly complex mutation content does not cause the large language model to systematically misunderstand the functions corresponding to the mutation rules and thereby undermine the reliability of the evaluation. Second, all introduced mutations must have sound engineering practice significance and clear semantic interpretability, reflecting common needs in real software development, so that the mutated target code corresponds to realistic potential requirements.
[0038] In this embodiment of the invention, the mutation operations in the preset mutation rule library include mutation operations that do not change the data type, mutation operations that change the input data type, and mutation operations that change the output data type. If the input data type or output data type is an unrecognized type, some mutation operations in the preset mutation rule library are therefore not applicable. Before determining the target mutation operation, it is necessary to first filter out the applicable mutation operations based on the input data type and output data type.
[0039] As a preferred embodiment, selecting several applicable mutation operations from a preset mutation rule library based on the input data type and the output data type includes: when neither the input data type nor the output data type is an unrecognized type, determining the input processing mutation, output processing mutation, and parameter expansion mutation in the preset mutation rule library as applicable mutation operations; when the input data type is an unrecognized type and the output data type is not, determining the input processing mutation and output processing mutation in the preset mutation rule library as applicable mutation operations; when the input data type is not an unrecognized type and the output data type is, determining the input processing mutation and parameter expansion mutation in the preset mutation rule library as applicable mutation operations; and when both the input data type and the output data type are unrecognized types, determining the input processing mutation in the preset mutation rule library as the applicable mutation operation.
[0040] In this embodiment of the invention, the mutation operations in the preset mutation rule library include input processing mutation, output processing mutation, and parameter expansion mutation. Output processing mutation requires changing the output data type, and parameter expansion mutation requires changing the input data type. Therefore, parameter expansion mutation is not applicable when the input data type is unrecognized, and output processing mutation is not applicable when the output data type is unrecognized.
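The four cases of Step 103 reduce to two independent checks, which the following Python sketch makes explicit. The string constants and the function name `applicable_mutations` are assumptions introduced for illustration; the source names the three mutation operations but not how they are represented.

```python
UNRECOGNIZED = "unrecognized"  # marker for a type that could not be determined

INPUT_PROCESSING = "input_processing"
OUTPUT_PROCESSING = "output_processing"
PARAMETER_EXPANSION = "parameter_expansion"


def applicable_mutations(input_type, output_type):
    """Filter the three mutation operations by which data types were recognized."""
    # Input processing mutation is always applicable: it has a type-independent variant.
    ops = [INPUT_PROCESSING]
    # Output processing mutation changes the output type, so that type must be known.
    if output_type != UNRECOGNIZED:
        ops.append(OUTPUT_PROCESSING)
    # Parameter expansion mutation changes the input type, so that type must be known.
    if input_type != UNRECOGNIZED:
        ops.append(PARAMETER_EXPANSION)
    return ops


both_known = applicable_mutations("str", "int")
neither_known = applicable_mutations(UNRECOGNIZED, UNRECOGNIZED)
```

Writing the rule as two checks rather than a four-row lookup table keeps it aligned with the stated rationale: each operation is excluded only when the type it depends on is unrecognized.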
[0041] Step 104: Randomly select a target mutation operation from among the applicable mutation operations.
[0042] In this embodiment of the invention, once the applicable mutation operations for the original vulnerability detection code are determined, a target mutation operation is randomly selected from them, and the original vulnerability detection code is mutated according to the target mutation operation, thereby producing new target vulnerability detection code.
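A minimal Python sketch of the random selection in Step 104 follows. The optional `seed` parameter is an assumption added here so that an expanded dataset can be regenerated identically; the source only states that the selection is random.

```python
import random


def select_target_mutation(applicable, seed=None):
    """Pick one applicable mutation operation uniformly at random."""
    if not applicable:
        raise ValueError("no applicable mutation operations")
    # A dedicated Random instance makes the choice reproducible for a given seed
    # without touching the global random state.
    return random.Random(seed).choice(applicable)


ops = ["input_processing", "output_processing", "parameter_expansion"]
target = select_target_mutation(ops, seed=7)
```

Calling the function again with the same seed and the same list returns the same operation, which can help when comparing evaluation runs.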
[0043] Step 105: Perform mutation processing on the code to be mutated based on the target mutation operation to obtain the mutated code.
[0044] In this embodiment of the invention, after a target mutation operation is randomly determined, the code to be mutated is mutated based on that target mutation operation. The target mutation operation is input processing mutation, output processing mutation, or parameter expansion mutation. Each corresponds to a different mutation rule, so the mutation processing performed on the code to be mutated differs accordingly.
[0045] As a preferred embodiment, when the target mutation operation is input processing mutation, the step of performing mutation processing on the code to be mutated based on the target mutation operation to obtain the mutated code includes: When the input data type is not an unrecognized type, several type-oriented preprocessing steps are determined based on the input data type; a target type-oriented preprocessing step is randomly selected from the several type-oriented preprocessing steps, and a first new code is generated based on the target type-oriented preprocessing step; the first new code is added to the preprocessing code in the code to be mutated to generate the mutated code; wherein, the type-oriented preprocessing includes data type conversion, data cleaning or data verification; When the input data type is an unrecognized type, a target type-independent preprocessing is randomly selected from several preset type-independent preprocessing methods, and a first new code is generated based on the target type-independent preprocessing. The first new code is added to the preprocessing code in the code to be mutated to generate the mutated code. The preset type-independent preprocessing includes business rule verification, general identifier addition, or data anonymization.
[0046] In this embodiment of the invention, when the target mutation operation is input processing mutation, the preprocessing code in the code to be mutated is mutated. Specifically, a reasonable preprocessing procedure for the input data is introduced before the data processing logic code. This preprocessing procedure includes preprocessing related to the input data type, such as data type conversion, data cleaning, or data validation, as well as preprocessing unrelated to the input data type, such as business rule validation, adding general identifiers, or data anonymization. Therefore, the input data type needs to be clearly defined before determining the preprocessing to be introduced.
[0047] In this embodiment of the invention, when the input data type is not an unrecognized type, several type-oriented preprocessing steps related to the input data type are determined, and a target type-oriented preprocessing step is randomly selected from them. A first new code is then generated according to the target type-oriented preprocessing step, and the first new code is added to the preprocessing code in the code to be mutated.
[0048] Data type conversion refers to converting the input data to another data type; the data types include strings, numbers, booleans, and lists. When the target type-oriented preprocessing is data type conversion, a data type different from the current input data type is randomly selected, and the first new code implementing that conversion is generated.
[0049] Data cleaning operations vary depending on the data type. For example, cleaning operations for strings include removing spaces, removing special characters, and unifying capitalization; cleaning operations for numeric data include precision correction and outlier replacement; and cleaning operations for lists and tuples include filtering out empty elements.
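The cleaning operations listed above might look like the following Python sketch. The helper names, the clamping bounds, and the choice of clamping as the outlier replacement strategy are assumptions of this example.

```python
import re


def clean_string(value: str) -> str:
    """Remove surrounding spaces and special characters, unify capitalization."""
    stripped = value.strip()
    no_special = re.sub(r"[^\w\s]", "", stripped)  # drop punctuation and symbols
    return no_special.lower()


def clean_number(value: float, low: float, high: float, digits: int = 2) -> float:
    """Replace outliers by clamping to [low, high], then correct precision."""
    clamped = min(max(value, low), high)
    return round(clamped, digits)


def drop_empty(items):
    """Filter out empty elements (None and empty containers) from a list or tuple."""
    def is_empty(x):
        return x is None or (isinstance(x, (str, list, tuple, dict, set)) and len(x) == 0)
    return [x for x in items if not is_empty(x)]
```

Note that `drop_empty` keeps values such as `0` and `False`, since these are valid data rather than empty elements.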
[0050] The data validation operations vary depending on the data type. For example, validation operations for string data types include length validation and content validation; validation operations for numeric data types include range validation and non-empty validation; and validation operations for list and dictionary data types include length validation and non-empty validation.
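A possible shape for these validation operations is sketched below. The function names, the default length limits, and the use of "printable characters only" as the content check are assumptions chosen for illustration.

```python
def validate_string(value: str, max_length: int = 64) -> bool:
    """Length validation plus a simple content validation (printable characters only)."""
    return 0 < len(value) <= max_length and value.isprintable()


def validate_number(value, low: float, high: float) -> bool:
    """Non-empty validation followed by range validation."""
    return value is not None and low <= value <= high


def validate_container(value, max_length: int = 100) -> bool:
    """Non-empty and length validation for lists and dictionaries."""
    return value is not None and 0 < len(value) <= max_length
```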
[0051] In this embodiment of the invention, when the input data type is an unrecognized type, a target type-independent preprocessing is randomly selected from the preset type-independent preprocessing methods, a first new code is generated based on the target type-independent preprocessing, and the first new code is added to the preprocessing code in the code to be mutated.
[0052] The business rule validation includes disabled value filtering, business blacklist validation, and business threshold validation. The disabled value filtering logic is: determine if the input data is a disabled value; if so, replace the input data with the default value. The business blacklist validation logic is: determine if the input data is in the business blacklist; if so, intercept or replace the input data. The business threshold validation logic is: determine if the input data exceeds a preset business threshold; if so, modify the data to the maximum business threshold.
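The three business rule validations described above can be sketched as follows. The function names and parameters are hypothetical, and the blacklist check is shown with the "replace" behavior rather than interception.

```python
def filter_disabled(value, disabled_values, default):
    """Disabled value filtering: replace a disabled value with the default."""
    return default if value in disabled_values else value


def check_blacklist(value, blacklist, replacement=None):
    """Business blacklist validation: replace a blacklisted value."""
    return replacement if value in blacklist else value


def cap_threshold(value, max_threshold):
    """Business threshold validation: cap the value at the maximum threshold."""
    return max_threshold if value > max_threshold else value
```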
[0053] The general identifier addition includes adding metadata, adding status markers, and adding unique identifiers. The logic for adding metadata is to encapsulate the input as a dictionary and add a processing time or source field. The logic for adding status markers is to tag the input data with labels such as "preprocessed" or "source system." The logic for adding unique identifiers is to generate a unique ID for the input data.
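For illustration, the three forms of general identifier addition could be written as below. The dictionary field names (`data`, `processed_at`, `source`, `status`, `id`) are assumptions of this sketch and are not specified in the disclosure.

```python
import uuid
from datetime import datetime, timezone


def add_metadata(value, source: str = "upstream"):
    """Encapsulate the input as a dictionary with processing time and source fields."""
    return {
        "data": value,
        "processed_at": datetime.now(timezone.utc).isoformat(),
        "source": source,
    }


def add_status_marker(value, status: str = "preprocessed"):
    """Tag the input data with a status label."""
    return {"data": value, "status": status}


def add_unique_id(value):
    """Attach a generated unique ID to the input data."""
    return {"data": value, "id": str(uuid.uuid4())}
```

Because the original value is carried unchanged under `data`, code that follows such a wrapper must unwrap it before passing the value to the data processing logic.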
[0054] Data anonymization includes fixed-rule anonymization, substitution anonymization, and keyword anonymization. Fixed-rule anonymization hides the middle part of sensitive content (such as phone numbers / ID cards). Substitution anonymization replaces sensitive values with a mask. Keyword anonymization detects and hides sensitive keywords.
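A minimal sketch of the three anonymization variants follows. The helper names, the number of characters kept at each end, and the mask strings are assumptions chosen for this example.

```python
import re


def mask_middle(value: str, keep_head: int = 3, keep_tail: int = 4,
                mask_char: str = "*") -> str:
    """Fixed-rule anonymization: hide the middle part of a sensitive string."""
    if len(value) <= keep_head + keep_tail:
        return mask_char * len(value)  # too short to keep any part visible
    hidden = len(value) - keep_head - keep_tail
    return value[:keep_head] + mask_char * hidden + value[len(value) - keep_tail:]


def substitute_value(value, mask: str = "***") -> str:
    """Substitution anonymization: replace the whole sensitive value with a mask."""
    return mask


def mask_keywords(text: str, keywords, mask: str = "***") -> str:
    """Keyword anonymization: detect and hide sensitive keywords (case-insensitive)."""
    if not keywords:
        return text
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return pattern.sub(mask, text)
```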
[0055] As a preferred embodiment, when the target mutation operation is output processing mutation, the step of performing mutation processing on the code to be mutated based on the target mutation operation to obtain the mutated code includes: Several data post-processing operations are determined based on the output data type; wherein, the data post-processing operations include data structure format conversion, data deduplication, or data type conversion; Randomly select a target data post-processing operation from among several data post-processing operations, and generate a second new code based on the target data post-processing operation; The second new code is added to the post-processing code of the code to be mutated to generate the mutated code.
[0056] In this embodiment of the invention, when the target mutation operation is output processing mutation, the post-processing code in the code to be mutated is mutated. Specifically, a reasonable post-processing operation for the output data is introduced after the data processing logic code. Different data types correspond to different post-processing operations, so it is necessary to first determine several data post-processing operations based on the output data type, then randomly select a target data post-processing operation from among these operations, and then generate a second new code based on the target data post-processing operation. This second new code is then added to the post-processing code in the code to be mutated. The data post-processing operations include data structure transformation, data deduplication, or data type conversion.
[0057] Data structure transformation (the data structure format conversion mentioned above) refers to transforming the structure of the output data. The structures include a single value, a structured dictionary, a nested container, and so on. When the target data post-processing operation is data structure transformation, a structure different from the current structure of the output data is randomly selected, and a second new code implementing that transformation is generated.
[0058] The deduplication operations vary depending on the data type. For example, deduplication for lists includes ordered deduplication; deduplication for dictionaries includes key-based deduplication (keeping the first / last value); deduplication for nested lists includes deduplication after flattening; and deduplication is not required for non-container data types.
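The deduplication variants above can be sketched in Python as follows. Since a Python dictionary cannot hold duplicate keys, key-based deduplication is shown on a sequence of key-value pairs; this, along with the function names, is an assumption of the example.

```python
def dedupe_ordered(items):
    """Ordered deduplication: keep the first occurrence of each element."""
    return list(dict.fromkeys(items))


def flatten(nested):
    """Recursively flatten a nested list."""
    out = []
    for item in nested:
        if isinstance(item, list):
            out.extend(flatten(item))
        else:
            out.append(item)
    return out


def dedupe_nested(nested):
    """Deduplication after flattening, for nested lists."""
    return dedupe_ordered(flatten(nested))


def dedupe_pairs(pairs, keep: str = "last"):
    """Key-based deduplication of (key, value) pairs, keeping the first or last value."""
    result = {}
    for key, value in pairs:
        if keep == "first" and key in result:
            continue
        result[key] = value
    return result
```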
[0059] Data type conversion refers to the conversion of the output data type, which includes strings, numbers, booleans, and lists. When the target data post-processing operation is a data type conversion, a data type different from the current output data type can be randomly selected to generate a second new code corresponding to the data type conversion.
[0060] As a preferred embodiment, when the target mutation operation is parameter expansion mutation, the step of performing mutation processing on the code to be mutated based on the target mutation operation to obtain the mutated code includes: identifying the business meaning of the input parameters in the original vulnerability detection code; determining the business execution logic based on the data processing logic code; determining the business-related parameters of the input parameters based on the business meaning of the input parameters and the business execution logic; combining the input parameters and the business-related parameters to generate new input parameters; and modifying the preprocessing code in the code to be mutated based on the new input parameters to generate the mutated code.
[0061] In this embodiment of the invention, when the target mutation operation is parameter expansion mutation, the preprocessing code in the code to be mutated is mutated. Specifically, the original input data and output data are retained, reasonable additional parameters related to the input data are introduced together with their associated processing, and the preprocessing code in the code to be mutated is modified accordingly to generate the mutated code.
[0062] In this embodiment of the invention, parameters related to the input data are introduced as follows. First, the business meaning of the input parameters in the original vulnerability detection code is determined, and the business execution logic is determined based on the data processing logic code. Then, the business-related parameters of the input parameters are determined according to the business meaning and the business execution logic. The input parameters and the business-related parameters are combined to generate new input parameters. Finally, the preprocessing code and the content related to the input parameters in the code to be mutated are modified according to the new input parameters, and the mutated code is generated.
[0063] For example, suppose the business meaning of the input parameters is the unit price and quantity of the product, and the business execution logic of the data processing logic code is to calculate the product price. Then the parameters related to the input data can be the discount rate, tax rate, or minimum consumption amount. Therefore, one parameter can be randomly selected from the discount rate, tax rate, and minimum consumption amount as the business-related parameter.
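The pricing example above might be sketched as follows, assuming the discount rate is the randomly chosen business-related parameter. Both functions and their names are hypothetical; the point illustrated is that the new parameter is handled in the preprocessing code while the data processing logic line stays identical.

```python
def compute_price(unit_price: float, quantity: int) -> float:
    """Original code: the single line below is the data processing logic."""
    total = unit_price * quantity
    return total


def compute_price_expanded(unit_price: float, quantity: int,
                           discount_rate: float = 0.0) -> float:
    """Mutated code: input parameters expanded with a business-related parameter."""
    # Preprocessing introduced by the mutation (assumed form).
    if not 0.0 <= discount_rate < 1.0:
        raise ValueError("discount_rate must be in [0, 1)")
    unit_price = unit_price * (1.0 - discount_rate)
    # Data processing logic, unchanged from the original code.
    total = unit_price * quantity
    return total
```

With the default `discount_rate` of 0.0, the mutated function returns the same result as the original.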
[0064] Step 106: Integrate the mutated code and the data processing logic code to generate target vulnerability detection code for detecting the vulnerability detection capability of large models.
[0065] In this embodiment of the invention, after obtaining the mutated code according to the above-described mutation operation, it is integrated with the data processing logic code to obtain target vulnerability detection code for detecting large model vulnerabilities. Specifically, the preprocessing code in the mutated code is integrated before the starting line of the data processing logic code, and the postprocessing code in the mutated code is integrated after the ending line of the data processing logic code.
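The line-number-based integration described above can be illustrated with the following sketch. The helper names `split_code` and `integrate` and the 1-indexed, inclusive line-number convention are assumptions of this example.

```python
from typing import List, Tuple


def split_code(source: str, start_line: int,
               end_line: int) -> Tuple[List[str], List[str], List[str]]:
    """Split source into preprocessing, data processing logic, and postprocessing.

    start_line and end_line are the 1-indexed, inclusive line numbers
    recorded for the data processing logic code.
    """
    lines = source.splitlines()
    pre = lines[: start_line - 1]
    core = lines[start_line - 1 : end_line]
    post = lines[end_line:]
    return pre, core, post


def integrate(mutated_pre: List[str], core: List[str],
              mutated_post: List[str]) -> str:
    """Place mutated preprocessing before, and mutated postprocessing after, the core."""
    return "\n".join([*mutated_pre, *core, *mutated_post])
```

Because the data processing logic lines are copied verbatim, only their position in the file may shift when the preprocessing code grows.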
[0066] The mutations introduced in this invention are all relatively simple, basic operations and do not introduce software vulnerabilities. Furthermore, since the data processing logic in the original vulnerability detection code is not modified during mutation, the data processing logic of the generated target vulnerability detection code is identical to that of the original vulnerability detection code. Consequently, when the target vulnerability detection code is used to evaluate the vulnerability detection capability of a large language model, the evaluation metrics and benchmarks of the original vulnerability detection code are reused, together with its vulnerable and non-vulnerable labels.
[0067] Implementing the above embodiments has the following effects: This invention provides a method for expanding large model vulnerability detection data. The method identifies the data processing logic code in the original vulnerability detection code and determines the code other than the data processing logic code as the code to be mutated; parses the original vulnerability detection code to determine its input and output data types; selects several applicable mutation operations from a preset mutation rule base based on the input and output data types and randomly selects a target mutation operation from them; mutates the code to be mutated based on the target mutation operation to obtain the mutated code; and integrates the mutated code and the data processing logic code to generate target vulnerability detection code for evaluating the vulnerability detection capability of large models. Because only the code other than the data processing logic code is mutated, the core data processing logic of the original vulnerability detection code is retained, and the generated target vulnerability detection code has the same data processing logic as the original vulnerability detection code, thereby improving both the quality and the accuracy of the vulnerability detection code.
[0068] As shown in Figure 2, corresponding apparatus embodiments are provided based on the above method embodiments. One embodiment of the present invention provides a device for expanding large model vulnerability detection data, including: a code extraction module, a type identification module, a mutation operation filtering module, a mutation operation selection module, a mutation processing module, and a code integration module. The code extraction module is used to identify the data processing logic code in the original vulnerability detection code and to determine the code other than the data processing logic code as code to be mutated. The type identification module is used to parse the original vulnerability detection code and determine the input data type and output data type. The mutation operation filtering module is used to select several applicable mutation operations from a preset mutation rule base based on the input data type and the output data type. The mutation operation selection module is used to randomly select a target mutation operation from the applicable mutation operations. The mutation processing module is used to perform mutation processing on the code to be mutated based on the target mutation operation to obtain the mutated code. The code integration module is used to integrate the mutated code and the data processing logic code to generate target vulnerability detection code for evaluating the vulnerability detection capability of large models.
[0069] As a preferred embodiment, identifying the data processing logic code in the original vulnerability detection code and determining the code other than the data processing logic code as code to be mutated includes: identifying several code behavior features in the original vulnerability detection code; comparing each code behavior feature with preset data processing code features and preset non-data processing code features to select the first code behavior feature, which characterizes data processing; locating the first code behavior feature in the original vulnerability detection code, determining the code segment corresponding to the first code behavior feature as the data processing logic code, and recording the line number data of the data processing logic code; based on the line number data, determining all code in the original vulnerability detection code located before the starting line of the data processing logic code as preprocessing code, and all code located after the ending line of the data processing logic code as postprocessing code; and combining the preprocessing code and the postprocessing code to generate the code to be mutated.
[0070] As a preferred embodiment, parsing the original vulnerability detection code to determine the input data type and output data type includes: Identify the input and output parameter types in the function signature of the original vulnerability detection code; When neither the input parameter type nor the output parameter type is null, the input parameter type is determined as the input data type, and the output parameter type is determined as the output data type. When the input parameter type or the output parameter type is empty, the original vulnerability detection code is input into the preset parameter type recognition model so that the preset parameter type recognition model outputs the predicted input parameter type and the predicted output parameter type. If the predicted input parameter type is not null, then the predicted input parameter type is used as the input data type; if the predicted input parameter type is null, then the input data type is determined to be an unrecognized type. If the predicted output parameter type is not null, then the predicted output parameter type is used as the output data type; if the predicted output parameter type is null, then the output data type is determined to be an unrecognized type.
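As a hedged illustration of the first step above, the annotated parameter and return types of a Python function can be read from its signature with the standard `ast` module. The helper name `signature_types` is an assumption, `None` stands for an empty (missing) type, and the fallback to the preset parameter type recognition model is not shown.

```python
import ast
from typing import List, Optional, Tuple


def signature_types(source: str) -> Tuple[List[Optional[str]], Optional[str]]:
    """Return (input parameter types, output type) of the first function found.

    A None entry means the annotation is missing, in which case the
    method would fall back to the type recognition model.
    Requires Python 3.9+ for ast.unparse.
    """
    tree = ast.parse(source)
    func = next(node for node in ast.walk(tree)
                if isinstance(node, ast.FunctionDef))
    inputs = [ast.unparse(arg.annotation) if arg.annotation is not None else None
              for arg in func.args.args]
    output = ast.unparse(func.returns) if func.returns is not None else None
    return inputs, output
```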
[0071] As a preferred embodiment, based on the input data type and the output data type, several applicable mutation operations are selected from a preset mutation rule base, including: When there are no unrecognized types in the input data type and the output data type, the input processing mutation, output processing mutation and parameter expansion mutation in the preset mutation rule library will be determined as applicable mutation operations; When the input data type is an unrecognized type and the output data type is not an unrecognized type, the input processing mutation and output processing mutation in the preset mutation rule library are determined as applicable mutation operations; When the input data type is not an unrecognized type and the output data type is an unrecognized type, the input processing mutation and parameter expansion mutation in the preset mutation rule library will be determined as applicable mutation operations; When the input data type is an unrecognized type and the output data type is an unrecognized type, the input processing mutation in the preset mutation rule base is determined as the applicable mutation operation.
[0072] As a preferred embodiment, when the target mutation operation is input processing mutation, the step of performing mutation processing on the code to be mutated based on the target mutation operation to obtain the mutated code includes: When the input data type is not an unrecognized type, several type-oriented preprocessing steps are determined based on the input data type; a target type-oriented preprocessing step is randomly selected from the several type-oriented preprocessing steps, and a first new code is generated based on the target type-oriented preprocessing step; the first new code is added to the preprocessing code in the code to be mutated to generate the mutated code; wherein, the type-oriented preprocessing includes data type conversion, data cleaning or data verification; When the input data type is an unrecognized type, a target type-independent preprocessing is randomly selected from several preset type-independent preprocessing methods, and a first new code is generated based on the target type-independent preprocessing. The first new code is added to the preprocessing code in the code to be mutated to generate the mutated code. The preset type-independent preprocessing includes business rule verification, general identifier addition, or data anonymization.
[0073] As a preferred embodiment, when the target mutation operation is output processing mutation, the step of performing mutation processing on the code to be mutated based on the target mutation operation to obtain the mutated code includes: Several data post-processing operations are determined based on the output data type; wherein, the data post-processing operations include data structure format conversion, data deduplication, or data type conversion; Randomly select a target data post-processing operation from among several data post-processing operations, and generate a second new code based on the target data post-processing operation; The second new code is added to the post-processing code of the code to be mutated to generate the mutated code.
[0074] As a preferred embodiment, when the target mutation operation is parameter expansion mutation, the step of performing mutation processing on the code to be mutated based on the target mutation operation to obtain the mutated code includes: identifying the business meaning of the input parameters in the original vulnerability detection code; determining the business execution logic based on the data processing logic code; determining the business-related parameters of the input parameters based on the business meaning of the input parameters and the business execution logic; combining the input parameters and the business-related parameters to generate new input parameters; and modifying the preprocessing code in the code to be mutated based on the new input parameters to generate the mutated code.
[0075] Implementing the above embodiments has the following effects: This invention provides a device for expanding large model vulnerability detection data. The device identifies the data processing logic code in the original vulnerability detection code and determines the code other than the data processing logic code as the code to be mutated; parses the original vulnerability detection code to determine its input and output data types; selects several applicable mutation operations from a preset mutation rule base based on the input and output data types and randomly selects a target mutation operation from them; mutates the code to be mutated based on the target mutation operation to obtain the mutated code; and integrates the mutated code and the data processing logic code to generate target vulnerability detection code for evaluating the vulnerability detection capability of large models. Because only the code other than the data processing logic code is mutated, the core data processing logic of the original vulnerability detection code is retained, and the generated target vulnerability detection code has the same data processing logic as the original vulnerability detection code, thereby improving both the quality and the accuracy of the vulnerability detection code.
[0076] It is understood that the above-described device embodiments correspond to the method embodiments of the present invention and can implement the method for expanding large model vulnerability detection data provided in any of the above-described method embodiments of the present invention.
[0077] It should be noted that the device embodiments described above are merely illustrative, and some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Furthermore, in the accompanying drawings of the device embodiments provided by this invention, the connection relationships between modules indicate that they have communication connections, which can specifically be implemented as one or more communication buses or signal lines. Those skilled in the art can understand and implement this without any creative effort.
[0078] Based on the above embodiments of the method for expanding large model vulnerability detection data, another embodiment of the present invention provides a terminal device, which includes a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor. When the processor executes the computer program, it implements the method for expanding large model vulnerability detection data of any embodiment of the present invention.
[0079] For example, in this embodiment, the computer program can be divided into one or more modules, which are stored in the memory and executed by the processor to complete the present invention. The one or more modules may be a series of computer program instruction segments capable of performing a specific function, which describe the execution process of the computer program in the terminal device.
[0080] The terminal device may be a desktop computer, laptop, handheld computer, or cloud server, etc. The terminal device may include, but is not limited to, a processor and a memory.
[0081] The processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor can be a microprocessor or any conventional processor. The processor is the control center of the terminal device, connecting all parts of the terminal device via various interfaces and lines.
[0082] Based on the above-described method embodiments, another embodiment of the present invention provides a computer-readable storage medium including a stored computer program, wherein, when the computer program is executed, it controls the device where the computer-readable storage medium is located to execute the method for expanding large model vulnerability detection data as described in any of the above-described method embodiments of the present invention.
[0083] The modules / units integrated in the device / terminal equipment, if implemented as software functional units and sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the above embodiments of the present invention can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. The computer-readable medium can include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a portable hard drive, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, etc.
[0084] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the above descriptions are merely specific embodiments of the present invention and are not intended to limit its scope of protection. In particular, any modifications, equivalent substitutions, improvements, etc., made by those skilled in the art within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.
Claims
1. A large model vulnerability detection data expansion method, characterized in that it comprises: identifying the data processing logic code in the original vulnerability detection code, and determining the code other than the data processing logic code as code to be mutated; parsing the original vulnerability detection code to determine the input data type and the output data type; selecting several applicable mutation operations from a preset mutation rule base based on the input data type and the output data type; randomly selecting a target mutation operation from the several applicable mutation operations; performing mutation processing on the code to be mutated based on the target mutation operation to obtain the mutated code; and integrating the mutated code and the data processing logic code to generate target vulnerability detection code for evaluating the vulnerability detection capability of large models.
2. The extension method of large model vulnerability detection data according to claim 1, characterized in that the process of identifying the data processing logic code in the original vulnerability detection code and determining the code other than the data processing logic code as code to be mutated includes: identifying several code behavior features in the original vulnerability detection code; comparing each code behavior feature with preset data processing code features and preset non-data processing code features to select the first code behavior feature, which characterizes data processing; locating the first code behavior feature in the original vulnerability detection code, determining the code segment corresponding to the first code behavior feature as the data processing logic code, and recording the line number data of the data processing logic code; based on the line number data, determining all code in the original vulnerability detection code located before the starting line of the data processing logic code as preprocessing code, and all code located after the ending line of the data processing logic code as postprocessing code; and combining the preprocessing code and the postprocessing code to generate the code to be mutated.
3. The extension method of large model vulnerability detection data according to claim 2, characterized in that, The process of parsing the original vulnerability detection code to determine the input and output data types includes: Identify the input and output parameter types in the function signature of the original vulnerability detection code; When neither the input parameter type nor the output parameter type is null, the input parameter type is determined as the input data type, and the output parameter type is determined as the output data type. When the input parameter type or the output parameter type is empty, the original vulnerability detection code is input into the preset parameter type recognition model so that the preset parameter type recognition model outputs the predicted input parameter type and the predicted output parameter type. If the predicted input parameter type is not null, then the predicted input parameter type is used as the input data type; if the predicted input parameter type is null, then the input data type is determined to be an unrecognized type. If the predicted output parameter type is not null, then the predicted output parameter type is used as the output data type; if the predicted output parameter type is null, then the output data type is determined to be an unrecognized type.
4. The extension method of large model vulnerability detection data according to claim 3, characterized in that the step of selecting several applicable mutation operations from a preset mutation rule library based on the input data type and the output data type includes: when neither the input data type nor the output data type is an unrecognized type, determining the input processing mutation, the output processing mutation and the parameter expansion mutation in the preset mutation rule library as the applicable mutation operations; when the input data type is an unrecognized type and the output data type is not an unrecognized type, determining the input processing mutation and the output processing mutation in the preset mutation rule library as the applicable mutation operations; when the input data type is not an unrecognized type and the output data type is an unrecognized type, determining the input processing mutation and the parameter expansion mutation in the preset mutation rule library as the applicable mutation operations; when both the input data type and the output data type are unrecognized types, determining the input processing mutation in the preset mutation rule library as the applicable mutation operation.
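The four cases of claim 4 amount to a small decision table, sketched below in Python. The string constants and the function names `applicable_mutations` and `pick_target` are hypothetical labels chosen for illustration.

```python
import random
from typing import List

UNRECOGNIZED = "unrecognized"  # hypothetical marker for an unrecognized type
INPUT_PROCESSING = "input_processing"
OUTPUT_PROCESSING = "output_processing"
PARAMETER_EXPANSION = "parameter_expansion"


def applicable_mutations(input_type: str, output_type: str) -> List[str]:
    """Filter the mutation rule library according to which types are recognized."""
    input_known = input_type != UNRECOGNIZED
    output_known = output_type != UNRECOGNIZED
    if input_known and output_known:
        return [INPUT_PROCESSING, OUTPUT_PROCESSING, PARAMETER_EXPANSION]
    if not input_known and output_known:
        return [INPUT_PROCESSING, OUTPUT_PROCESSING]
    if input_known and not output_known:
        return [INPUT_PROCESSING, PARAMETER_EXPANSION]
    return [INPUT_PROCESSING]


def pick_target(operations: List[str], rng: random.Random) -> str:
    """Randomly select the target mutation operation from the applicable ones."""
    return rng.choice(operations)
```

Input processing mutation is applicable in every case, so the target selection always has at least one candidate.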
5. The extension method of large model vulnerability detection data according to claim 4, characterized in that, when the target mutation operation is the input processing mutation, the step of performing mutation processing on the code to be mutated based on the target mutation operation to obtain the mutated code includes: when the input data type is not an unrecognized type, determining several type-oriented preprocessing operations based on the input data type, randomly selecting a target type-oriented preprocessing operation from the several type-oriented preprocessing operations, generating a first new code based on the target type-oriented preprocessing operation, and adding the first new code to the preprocessing code in the code to be mutated to generate the mutated code, wherein the type-oriented preprocessing operations include data type conversion, data cleaning or data verification; when the input data type is an unrecognized type, randomly selecting a target type-independent preprocessing operation from several preset type-independent preprocessing operations, generating a first new code based on the target type-independent preprocessing operation, and adding the first new code to the preprocessing code in the code to be mutated to generate the mutated code, wherein the preset type-independent preprocessing operations include business rule verification, general identifier addition or data anonymization.
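One way to picture the input processing mutation is as a template lookup followed by a random choice, as in the sketch below. The template strings, the single `"str"` entry, the `indent` parameter, and the decision to append the first new code at the end of the preprocessing code are all assumptions made for illustration; the claim does not specify how the new code is generated or where exactly it is placed.

```python
import random
from typing import Dict, List

UNRECOGNIZED = "unrecognized"  # hypothetical marker for an unrecognized type

# Hypothetical templates: one per type-oriented preprocessing operation.
TYPE_ORIENTED: Dict[str, List[str]] = {
    "str": [
        "{name} = str({name})  # data type conversion",
        "{name} = {name}.strip()  # data cleaning",
        "assert isinstance({name}, str)  # data verification",
    ],
}

# Hypothetical templates: one per type-independent preprocessing operation.
TYPE_INDEPENDENT: List[str] = [
    "assert {name} is not None  # business rule verification",
    "request_id = id({name})  # general identifier addition",
    "masked_{name} = '***'  # data anonymization placeholder",
]


def input_processing_mutation(
    pre_lines: List[str],
    param: str,
    input_type: str,
    rng: random.Random,
    indent: str = "    ",
) -> List[str]:
    """Append one randomly chosen preprocessing line to the preprocessing code."""
    if input_type != UNRECOGNIZED:
        candidates = TYPE_ORIENTED[input_type]  # KeyError if no templates exist
    else:
        candidates = TYPE_INDEPENDENT
    first_new_code = indent + rng.choice(candidates).format(name=param)
    return list(pre_lines) + [first_new_code]
```

Passing a seeded `random.Random` makes a generated variant reproducible, which is convenient when the expanded benchmark needs to be regenerated.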
6. The extension method of large model vulnerability detection data according to claim 4, characterized in that, when the target mutation operation is the output processing mutation, the step of performing mutation processing on the code to be mutated based on the target mutation operation to obtain the mutated code includes: determining several data post-processing operations based on the output data type, wherein the data post-processing operations include data structure format conversion, data deduplication or data type conversion; randomly selecting a target data post-processing operation from the several data post-processing operations, and generating a second new code based on the target data post-processing operation; and adding the second new code to the post-processing code in the code to be mutated to generate the mutated code.
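The output processing mutation can be sketched in the same spirit. Here the second new code is placed at the start of the post-processing code so that it runs before any `return` statement; that placement, the templates for a `"list"` output, and the name `output_processing_mutation` are assumptions for illustration only.

```python
import random
from typing import Dict, List

# Hypothetical templates: one per data post-processing operation for list outputs.
POST_PROCESSING: Dict[str, List[str]] = {
    "list": [
        "{name} = tuple({name})  # data structure format conversion",
        "{name} = list(dict.fromkeys({name}))  # data deduplication",
        "{name} = [str(v) for v in {name}]  # data type conversion",
    ],
}


def output_processing_mutation(
    post_lines: List[str],
    result_name: str,
    output_type: str,
    rng: random.Random,
    indent: str = "    ",
) -> List[str]:
    """Prepend one randomly chosen post-processing line to the post-processing code."""
    template = rng.choice(POST_PROCESSING[output_type])
    second_new_code = indent + template.format(name=result_name)
    return [second_new_code] + list(post_lines)


# Integration with untouched preprocessing and data processing logic code.
pre = ["def collect(items):"]
core = ["    result = [i.lower() for i in items]"]
post = output_processing_mutation(["    return result"], "result", "list", random.Random(1))
mutated_source = "\n".join(pre + core + post)
```

In this example the data processing line is carried over unchanged, which is the property the method relies on to keep the original vulnerability label valid.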
7. The extension method of large model vulnerability detection data according to claim 4, characterized in that, when the target mutation operation is the parameter expansion mutation, the step of performing mutation processing on the code to be mutated based on the target mutation operation to obtain the mutated code includes: identifying the business meaning of the input parameters in the original vulnerability detection code; determining the business execution logic based on the data processing logic code; determining business-related parameters of the input parameters based on the business meaning of the input parameters and the business execution logic; combining the input parameters and the business-related parameters to generate new input parameters; and modifying the preprocessing code in the code to be mutated based on the new input parameters to generate the mutated code.
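A rough sketch of the parameter expansion mutation is given below, assuming Python source and a hypothetical lookup table `RELATED` that maps a parameter's business meaning (approximated here by its name) to business-related parameters. The claim does not say how business meaning or business execution logic is determined, so that step is replaced by the table. For parseability the sketch rewrites the whole function rather than only the preprocessing lines; the signature it changes lies in the preprocessing code.

```python
import ast
from typing import Dict, List

# Hypothetical mapping from a parameter name to business-related parameters.
RELATED: Dict[str, List[str]] = {
    "username": ["tenant_id"],
    "file_path": ["base_dir"],
}


def expand_parameters(source: str, related: Dict[str, List[str]] = RELATED) -> str:
    """Add business-related parameters (defaulting to None) to the first function."""
    tree = ast.parse(source)
    func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
    existing = [a.arg for a in func.args.args]
    for name in list(existing):
        for extra in related.get(name, []):
            if extra in existing:
                continue
            # Appending one argument and one default keeps defaults aligned
            # with the trailing positional parameters.
            func.args.args.append(ast.arg(arg=extra, annotation=None))
            func.args.defaults.append(ast.Constant(value=None, kind=None))
            existing.append(extra)
    return ast.unparse(ast.fix_missing_locations(tree))
```

Giving each added parameter a default value means existing callers of the function keep working, and the body, including the data processing logic code, is not modified.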
8. An extension device of large model vulnerability detection data, characterized in that it comprises a code extraction module, a type identification module, a mutation operation filtering module, a mutation operation selection module, a mutation processing module, and a code integration module; the code extraction module is used to identify the data processing logic code in the original vulnerability detection code and to determine the code other than the data processing logic code as the code to be mutated; the type identification module is used to parse the original vulnerability detection code and determine the input data type and the output data type; the mutation operation filtering module is used to select several applicable mutation operations from a preset mutation rule library based on the input data type and the output data type; the mutation operation selection module is used to randomly select a target mutation operation from the several applicable mutation operations; the mutation processing module is used to perform mutation processing on the code to be mutated based on the target mutation operation to obtain the mutated code; the code integration module is used to integrate the mutated code and the data processing logic code to generate target vulnerability detection code for evaluating the vulnerability detection capability of large models.
9. A terminal device, characterized in that it comprises a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the method for expanding large model vulnerability detection data as described in any one of claims 1-7.
10. A computer-readable storage medium, characterized in that it comprises a stored computer program, wherein, when the computer program is executed, the device in which the computer-readable storage medium is located is controlled to perform the method for expanding large model vulnerability detection data as described in any one of claims 1-7.