Sample generation method, security detection method, device, equipment and storage medium
By decomposing and generalizing the initial sample code of cybersecurity incidents, candidate sample code simulating successful attacks is generated, solving the problem of insufficient training data and improving the performance of the security detection model.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SANGFOR TECH INC
- Filing Date
- 2024-12-31
- Publication Date
- 2026-06-30
AI Technical Summary
In existing technologies, the scarcity of cybersecurity incidents makes it difficult to obtain large amounts of high-quality training data, which affects the detection performance of large-scale security models.
By decomposing the initial sample code into sample functional unit code, performing generalization processing to generate candidate sample code, and conducting simulated attack tests, the code that successfully simulated the attack is selected as the training sample to enrich the sample library of the security detection model.
It enables the generation of a large number of reliable training samples based on a small number of real sample codes, improving the stability and reliability of the security detection model and enabling it to better resist various forms of attacks.
Description
Technical Field
[0001] This application relates to the field of computer technology, and more specifically, to a sample generation method, a security detection method, an apparatus, a device, and a storage medium.
Background Technology
[0002] With the development of network technology, network security technology has emerged. Network security technology primarily aims to maintain the security of computer communication networks, mainly including the normal operation of network hardware and software, and the security of data exchange. In practical applications, the frequent occurrence of network attacks often poses a threat to the network security of systems.
[0003] With the development of deep learning and machine learning technologies, large-scale security models (such as intrusion detection systems and malware classifiers) have gradually become important tools for defending against cyberattacks. The effectiveness of these models largely depends on the quality of the training data, especially the quality of adversarial examples. However, because actual cybersecurity incidents are relatively few, it is difficult to obtain large amounts of high-quality training data. Consequently, the trained large-scale security models cannot fully learn the characteristics of cybersecurity incidents and therefore cannot accurately detect them.
Summary of the Invention
[0004] In view of this, embodiments of this application propose a sample generation method, a security detection method, an apparatus, a device, and a storage medium, which can generate a large number of reliable target sample codes for training a security detection model based on a small number of real initial sample codes, greatly enriching the content of the sample library of the security detection model, thereby improving the reliability of the security detection model trained based on the sample library.
[0005] In a first aspect, embodiments of this application provide a sample generation method, which includes: acquiring multiple initial sample codes, wherein the initial sample codes are program codes corresponding to network security events; decomposing the initial sample codes according to code functions to obtain at least one sample functional unit code corresponding to each initial sample code; generating multiple candidate sample codes based on the sample functional unit codes corresponding to each of the multiple initial sample codes; performing simulated attack tests on each candidate sample code to obtain simulated attack test data and simulated attack test results for the candidate sample codes; and using the simulated attack test data corresponding to the candidate sample codes whose simulated attack results indicate successful simulated attacks as training samples to train a security detection model based on the training samples.
[0006] In a second aspect, embodiments of this application provide a security detection method, comprising: acquiring data to be detected; using a security detection model trained with training samples generated by the above sample generation method to perform attack prediction on the data to be detected, thereby obtaining an attack prediction result for the data to be detected.
[0007] In a third aspect, embodiments of this application provide a sample generation apparatus, the apparatus comprising: a code acquisition module for acquiring multiple initial sample codes, wherein the initial sample codes are program codes corresponding to network security events; a code decomposition module for decomposing the initial sample codes according to code functions to obtain at least one sample functional unit code corresponding to each initial sample code; a code generation module for generating multiple candidate sample codes based on the sample functional unit codes corresponding to each of the multiple initial sample codes; and a code testing module for performing simulated attack tests on each candidate sample code to obtain simulated attack test data and simulated attack test results for the candidate sample codes, and using the simulated attack test data corresponding to the candidate sample codes whose simulated attack results indicate successful simulated attacks as training samples, so as to train a security detection model based on the training samples.
[0008] In one possible implementation, the code generation module includes a generalization processing submodule and a code generation submodule. The generalization processing submodule is used to perform generalization processing on the sample functional unit code to obtain generalized sample functional unit code. The code generation submodule is used to generate multiple candidate sample codes based on multiple generalized sample functional unit codes, each candidate sample code including at least one generalized sample functional unit code.
[0009] In one possible implementation, the generalization processing submodule is further configured to generate generalized sample functional unit code based on the generalization instructions and the sample functional unit code using a generative language model.
[0010] In one possible implementation, the generalization processing submodule is further used to call a generalization tool to perform generalization processing on the sample functional unit code, thereby obtaining the generalized sample functional unit code.
[0011] In one possible implementation, the sample generation apparatus further includes an attack prediction module, a second loss acquisition module, and a second parameter adjustment module. The attack prediction module is used to perform attack prediction on the training samples using an initial security detection model to obtain an attack prediction result, which indicates whether the attack was successful. The second loss acquisition module is used to obtain a second model loss based on the attack prediction result of the training samples and the simulated attack test result corresponding to the training samples. The second parameter adjustment module is used to adjust the model parameters of the initial security detection model based on the second model loss so as to minimize the second model loss until the second iteration termination condition is met, thereby obtaining the target security detection model.
[0012] In one possible implementation, the code generation module includes a function determination submodule, a code partitioning submodule, a code group determination submodule, and a code generation submodule. The function determination submodule is used to determine the functions required for various network security event types. The code partitioning submodule is used to partition the generalized sample functional unit codes according to their functions to obtain multiple code groups. The code group determination submodule is used to determine at least one candidate code group from the multiple code groups based on the functions required for the network security event type, wherein the functions corresponding to the candidate code groups are the functions required for the network security event type. The code generation submodule is used to generate candidate sample code corresponding to the network security event type based on the code in the candidate code group corresponding to the network security event type.
[0013] In a fourth aspect, embodiments of this application provide a security detection apparatus, including a data acquisition module for acquiring data to be detected; and a security detection module for using a security detection model trained with training samples generated by the sample generation apparatus to perform attack prediction on the data to be detected, thereby obtaining an attack prediction result for the data to be detected.
[0014] In a fifth aspect, embodiments of this application provide an electronic device, including a processor and a memory; one or more programs are stored in the memory and configured to be executed by the processor to implement the above-described method.
[0015] In a sixth aspect, embodiments of this application provide a computer-readable storage medium storing program code, wherein the above-described method is executed when the program code is run by a processor.
[0016] In a seventh aspect, embodiments of this application provide a computer program product or computer program that includes computer instructions stored in a computer-readable storage medium. A processor of a computer device retrieves the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the method described above.
[0017] This application provides a sample generation method, security detection method, apparatus, device, and storage medium. The method includes: acquiring multiple initial sample codes, wherein the initial sample codes are program codes corresponding to network security events; decomposing the initial sample codes according to their functions to obtain at least one sample functional unit code corresponding to each initial sample code; generating multiple candidate sample codes based on the sample functional unit codes corresponding to each of the multiple initial sample codes; performing simulated attack tests on each candidate sample code to obtain the simulated attack test results for the candidate sample codes; and determining the candidate sample codes whose simulated attack test results indicate successful simulated attacks as target sample codes for training a security detection model. By employing the above method, each initial sample code is decomposed into smaller functional units (i.e., sample functional unit codes) to generate multiple candidate sample codes. Subsequently, simulated attack tests are performed on the candidate sample codes, and the candidate sample codes whose simulated attacks were successful are selected as target sample codes. This achieves the generation of a large number of virtual sample codes (i.e., target sample codes) based on a small number of real initial sample codes. Subsequent training of the security detection model based on these target sample codes enables the security detection model to learn to resist various forms of attacks, while also improving its stability and reliability.
Attached Figure Description
[0018] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0019] Figure 1 shows a schematic flowchart of a sample generation method provided in an embodiment of this application;
[0020] Figure 2 shows a flowchart of step S130 in Figure 1;
[0021] Figure 3 shows a flowchart of step S134 in Figure 2;
[0022] Figure 4 shows another flowchart of a sample generation method provided in an embodiment of this application;
[0023] Figure 5 shows another flowchart of a sample generation method provided in an embodiment of this application;
[0024] Figure 6 shows a schematic flowchart of a security detection method proposed in an embodiment of this application;
[0025] Figure 7 shows a flowchart of a sample generation method provided in an embodiment of this application;
[0026] Figure 8 shows a connection block diagram of a sample generation apparatus according to an embodiment of this application;
[0027] Figure 9 shows a connection block diagram of a security detection apparatus according to an embodiment of this application;
[0028] Figure 10 shows a structural block diagram of an electronic device for performing the methods of embodiments of this application.
Detailed Implementation
[0029] Exemplary embodiments will now be described more fully with reference to the accompanying drawings. However, these exemplary embodiments can be implemented in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided to make this application more comprehensive and complete, and to fully convey the concept of the exemplary embodiments to those skilled in the art.
[0030] Furthermore, the described features, structures, or characteristics can be combined in any suitable manner in one or more embodiments. Numerous specific details are provided in the following description to give a thorough understanding of embodiments of this application. However, those skilled in the art will recognize that the technical solutions of this application can be practiced without one or more of the specific details, or other methods, components, apparatuses, steps, etc., can be employed. In other instances, well-known methods, apparatuses, implementations, or operations are not shown or described in detail to avoid obscuring various aspects of this application.
[0031] The block diagrams shown in the accompanying drawings are merely functional entities and do not necessarily correspond to physically independent entities. That is, these functional entities can be implemented in software, in one or more hardware modules or integrated circuits, or in different network and / or processor devices and / or microcontroller devices.
[0032] The flowcharts shown in the accompanying drawings are merely illustrative and do not necessarily include all content and operations / steps, nor do they necessarily have to be performed in the described order. For example, some operations / steps can be broken down, while others can be combined or partially combined; therefore, the actual execution order may change depending on the specific circumstances.
[0033] It should be noted that "multiple" in this application refers to two or more. "And / or" describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A alone, A and B simultaneously, or B alone. The character " / " generally indicates that the preceding and following related objects have an "or" relationship.
[0034] It should also be noted that in this embodiment of the application, the collection, use, processing and storage of application information are all subject to the user's permission and must comply with the regulations of the region.
[0035] This application provides a sample generation method that can be applied to electronic devices, which may be servers, terminal devices, vehicles, or a combination of the above.
[0036] In some embodiments, the server may be an independent physical server, a server cluster or distributed system consisting of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms.
[0037] Terminal devices can be smartphones, tablets, laptops, desktop computers, smart voice interaction devices, smart home appliances, in-vehicle terminals, etc., but are not limited to these.
[0038] Referring to Figure 1, the present application provides a sample generation method that can be applied to electronic devices. The method includes:
[0039] Step S110: Obtain multiple initial sample codes, wherein the initial sample codes are program codes corresponding to network security incidents.
[0040] Network security incidents refer to events in which network information systems are affected by malicious acts. These incidents may include, but are not limited to, events caused by malware, vulnerability exploitation tools, and attack proxy software.
[0041] The program code corresponding to the cybersecurity incident may include code snippets or complete applications directly related to the aforementioned cybersecurity incident, as well as scripts or tools used to cause the aforementioned cybersecurity incident. This type of code serves as an initial sample, providing basic material for the generation of subsequent samples.
[0042] The initial sample codes may belong to a variety of cybersecurity incident types.
[0043] Multiple initial sample codes can be obtained by retrieving them from a publicly available database storing malware samples, or by retrieving multiple initial sample codes uploaded by a user. The above methods are merely illustrative and are not specifically limited in this embodiment.
[0044] Step S120: Decompose the initial sample code according to its function to obtain at least one sample function unit code corresponding to each initial sample code.
[0045] Decomposition refers to the process of breaking down a large, complex code segment or program into multiple smaller, more manageable, and understandable parts according to its functional characteristics. Sample functional unit code refers to a small code snippet extracted from the initial sample code, representing a specific function. Each sample functional unit code does only one thing and can work in conjunction with other sample functional unit codes to accomplish a larger task.
[0046] Specifically, step S120 above can be achieved by using static code analysis tools or refactoring tools to identify functional units that are suitable for decomposition into independent units, thereby decomposing the initial sample code according to its function and obtaining at least one sample functional unit code corresponding to each initial sample code.
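As an illustration of the decomposition in step S120, the following is a minimal sketch that assumes the initial sample code is Python source and treats each top-level function as one sample functional unit code. The helper name decompose_into_units and the benign placeholder sample are hypothetical and do not come from this application; the application itself only states that static code analysis or refactoring tools can be used.

```python
import ast


def decompose_into_units(source: str) -> list[str]:
    """Return the source text of every top-level function in `source`.

    Assumption: one top-level function == one sample functional unit code.
    """
    tree = ast.parse(source)
    units = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # get_source_segment returns the exact source text spanned by the node
            units.append(ast.get_source_segment(source, node))
    return units


# Benign placeholder standing in for an initial sample code
initial_sample = '''
def fetch(url):
    return "payload from " + url

def store(data, path):
    return (path, data)
'''

units = decompose_into_units(initial_sample)
for unit in units:
    print(unit.splitlines()[0])
```

Each element of units is a self-contained snippet that performs a single task, which matches the description of a sample functional unit code in paragraph [0045].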
[0047] Step S130: Generate multiple candidate sample codes based on the sample functional unit codes corresponding to each of the multiple initial sample codes.
[0048] In one possible implementation, step S130 may involve dividing the sample functional unit codes corresponding to the multiple initial sample codes according to their functions to obtain multiple code groups; and constructing multiple candidate sample codes based on the functional requirements corresponding to various types of network security events. Each candidate sample code corresponds to a type of network security event, and the sample functional unit codes of each candidate sample code satisfy the functional requirements of the corresponding network security event.
[0049] Step S140: Perform a simulated attack test on each candidate sample code to obtain the simulated attack test data and simulated attack test results of the candidate sample code. Use the simulated attack test data corresponding to the candidate sample code whose simulated attack was successful as a training sample so as to train the security detection model based on the training sample.
[0050] Specifically, a pre-deployed BAS (Breach and Attack Simulation) system can be used to conduct a simulated attack test on each candidate sample code. The security and resistance capabilities exhibited by the BAS system while each candidate sample code is tested are obtained, thereby yielding the simulated attack test data and simulated attack test results of the candidate sample code.
[0051] The simulated attack test data may include one or more of the following: network traffic data (e.g., data packets entering and leaving the network, which may be data streams of various protocols such as HTTP, HTTPS, FTP, etc.), system logs (e.g., log files generated by the operating system or application), user behavior data (e.g., user login patterns, operating habits, etc.), or files (e.g., suspicious executable files, documents, or other types of files).
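The selection in step S140 can be sketched as a simple filter over test records. The record layout below (SimulatedAttackRecord with candidate_id, attack_succeeded, and test_data fields) is a hypothetical structure chosen for illustration; the application does not specify how a BAS system reports its output.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedAttackRecord:
    """Hypothetical container for one simulated attack test."""
    candidate_id: str
    attack_succeeded: bool                          # simulated attack test result
    test_data: dict = field(default_factory=dict)   # e.g. traffic, logs, files


def select_training_samples(records):
    """Keep the test data of candidates whose simulated attack succeeded."""
    return [r.test_data for r in records if r.attack_succeeded]


records = [
    SimulatedAttackRecord("c1", True, {"system_log": "login ok", "traffic": "HTTP"}),
    SimulatedAttackRecord("c2", False, {"system_log": "blocked", "traffic": "FTP"}),
    SimulatedAttackRecord("c3", True, {"system_log": "file written", "traffic": "HTTPS"}),
]

training_samples = select_training_samples(records)
print(len(training_samples))
```

Only the test data of c1 and c3 is retained; the data from the failed simulation c2 is discarded.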
[0052] By employing the above method, each initial sample code is decomposed into smaller functional units (i.e., sample functional unit codes), and after generalization processing, multiple candidate sample codes are generated based on the multiple generalized sample functional unit codes. This effectively expands the number of candidate sample codes and also improves their diversity. Subsequently, by using the simulated attack test data corresponding to the candidate sample codes that indicate successful simulated attacks as training samples, the security detection model can be trained based on the training samples. This achieves the generation of a large number of virtual sample codes (i.e., target sample codes) based on a small number of real initial sample codes. Subsequently, by training the security detection model based on the aforementioned target sample codes, the security detection model can learn to resist various forms of attacks, while also improving its stability and reliability.
[0053] Referring to Figure 2, in one possible implementation, step S130 includes: step S132: generalizing the sample functional unit code to obtain the generalized sample functional unit code.
[0054] Generalizing sample functional unit code refers to using specific techniques or tools to make code snippets originally written for a specific task applicable to a wider range of scenarios or similar tasks. The goal of this generalization is to enhance the adaptability and reusability of the code, reducing the workload of rewriting code for each new problem. In cybersecurity incident simulation or other fields, the generalized sample functional unit code should be able to flexibly respond to changing requirements in different environments while maintaining its original functionality.
[0055] In one possible implementation, step S132 above can be: calling a generalization tool to perform generalization processing on the sample functional unit code to obtain the generalized sample functional unit code.
[0056] The generalization tool can be ANTLR (Another Tool for Language Recognition) or a reflection tool, as long as it can generalize the code.
[0057] One or more generalization tools can be used to generalize the same sample functional unit code to obtain one or more generalized sample functional unit codes.
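As one hedged illustration of tool-based generalization, the sketch below abstracts identifiers in a Python snippet using the standard ast module. It is a stand-in for illustration only: it is not ANTLR or a reflection tool, the class name IdentifierAbstractor is hypothetical, and ast.unparse requires Python 3.9 or later.

```python
import ast


class IdentifierAbstractor(ast.NodeTransformer):
    """Rename arguments and assigned local names to neutral placeholders."""

    def __init__(self):
        self.mapping = {}

    def _alias(self, name):
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_arg(self, node):
        # Function parameters are always renamed
        node.arg = self._alias(node.arg)
        return node

    def visit_Name(self, node):
        # Rename names that are assigned here or already mapped;
        # builtins and globals such as `sum` are left untouched.
        if isinstance(node.ctx, ast.Store) or node.id in self.mapping:
            node.id = self._alias(node.id)
        return node


unit = (
    "def scale(values, factor):\n"
    "    total = sum(values)\n"
    "    return total * factor\n"
)

tree = IdentifierAbstractor().visit(ast.parse(unit))
generalized = ast.unparse(tree)
print(generalized)
```

The generalized snippet keeps the original behavior while its parameter and local variable names are replaced, which gives a functionally equivalent unit with a different surface form.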
[0058] In another possible implementation, step S132 above can also be: using a generative language model to generate generalized sample functional unit code based on the generalization instructions and the sample functional unit code.
[0059] A generalization instruction is a piece of text that guides a generative language model to transform specific functional unit code into more broadly applicable code. For example, a generalization instruction may include text indicating requirements for abstraction of identifiers such as specific variable names, function names, or class names in the code. It may also include text indicating adjustments to the code structure, such as changing control flow logic, refactoring loop structures, or simplifying conditional statements.
[0060] Generative language models, in this context, refer to generative language models based on the Transformer architecture. They utilize large-scale pre-trained data and autoregressive models to generate coherent and natural conversational text. Their pre-training data includes billions of words of text data, such as encyclopedia entries, news articles, and novels. During pre-training, the model learns the probability distribution of the text data and uses this distribution to generate new text data. The generation process is context-based, generating responses based on given prompts. This allows the model to generate coherent and natural text, adapting to different scenarios and topics. A prompt is a piece of text or a set of keywords used to guide the generative language model in generating text. The design of the prompt is crucial to the performance and effectiveness of the generative language model. A good prompt should guide the model to generate text relevant to the prompt and should contain sufficient information so that the model can generate accurate and coherent text.
[0061] In the embodiments of this application, the prompt may include at least the generalization instructions and the sample functional unit code, which guide the model to generate text related to the prompt, that is, the generalized sample functional unit code.
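A prompt of this kind can be assembled as plain text. The following is a minimal sketch; the function name build_generalization_prompt and the section labels in the template are assumptions made for illustration, and no particular generative language model or API is implied.

```python
def build_generalization_prompt(instruction: str, unit_code: str) -> str:
    """Combine a generalization instruction and a functional unit into one prompt."""
    return (
        "Generalization instruction:\n"
        f"{instruction.strip()}\n\n"
        "Sample functional unit code:\n"
        f"{unit_code.strip()}\n\n"
        "Generalized code:\n"
    )


instruction = "Replace specific variable and function names with abstract identifiers."
unit_code = "def fetch(url):\n    return 'payload from ' + url\n"

prompt = build_generalization_prompt(instruction, unit_code)
print(prompt)
```

The resulting string would be passed to whichever generative language model is used, and the text the model returns would be taken as the generalized sample functional unit code.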
[0062] In one possible implementation, the generative language model used for generalization in this application embodiment can be the aforementioned network model trained using text data at the level of billions of words, such as encyclopedias, news articles, and novels. In this case, the generative language model can generate generalized sample functional unit codes with similar functions but different expressions based on the given sample functional unit codes and generalization instructions.
[0063] It's worth noting that because pre-trained generative language models are not limited to domain-specific rules or patterns, they exhibit diversity in the generalized sample functional unit code they generate. This diversity is particularly important in the cybersecurity field. Attackers often rely on understanding existing code patterns to design exploit strategies; introducing diverse sample functional unit code increases the difficulty for attackers. Even if they master a common coding style, they cannot easily predict other possible variations.
[0064] In another possible implementation, the generative language model used for generalization in this application embodiment can also be trained using sample data based on the pre-trained generative language model described above. The sample data includes a first sample code, a sample generalization instruction corresponding to the first sample code, and a reference generalization code. Therefore, the generative language model used for generalization can significantly improve the performance in the code generalization domain while maintaining its original advantages, thereby generating more accurate and demand-compliant code.
[0065] By using the above method, a sample functional unit code can be generalized to obtain one or more generalized sample functional unit codes.
[0066] Step S134: Generate multiple candidate sample codes based on the multiple generalized sample functional unit codes, each candidate sample code including at least one generalized sample functional unit code.
[0067] In one possible implementation, a sample functional unit code is generalized to obtain multiple generalized sample functional units. Step S134 includes: dividing the multiple generalized sample functional units corresponding to the same sample functional unit into the same unit group; constructing a candidate sample code from one generalized sample functional unit in each unit group corresponding to the initial sample code, thereby obtaining at least one candidate sample code corresponding to each initial sample code.
[0068] For example, if decomposing an initial sample code yields two sample functional unit codes, A and B, and generalizing sample functional unit code A yields three generalized sample functional unit codes, A1, A2, and A3, and generalizing sample functional unit code B yields three generalized sample functional unit codes, B1, B2, and B3, then the combinations of generalized sample functional unit codes included in the multiple candidate sample codes can be: A1+B1, A1+B2, A1+B3, A2+B1, A2+B2, A2+B3, A3+B1, A3+B2, and A3+B3. By using the above method, the candidate sample codes can be expanded.
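The combination in this example is a Cartesian product over the unit groups. A minimal sketch using the labels A1 to A3 and B1 to B3 from the example follows; the variable names are illustrative, and the strings stand in for the actual generalized code.

```python
from itertools import product

# One unit group per original sample functional unit code
unit_groups = {
    "A": ["A1", "A2", "A3"],
    "B": ["B1", "B2", "B3"],
}

# Pick exactly one generalized unit from each group
candidates = ["+".join(combo) for combo in product(*unit_groups.values())]
print(candidates)
```

With two groups of three generalized units each, this yields 3 x 3 = 9 candidate sample codes from a single initial sample code.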
[0069] In one possible implementation, as shown in Figure 3, step S134 above includes:
[0070] Step S1341: Determine the required functions for each of the various network security incident types.
[0071] Different types of cybersecurity incidents require different functionalities. For example, malware includes software functions such as downloaders, installers, and C&C communication; vulnerability exploitation tools include functions such as scanning, automated attacks, zero-day exploitation, and backdoor installation; and attack proxy software includes functions such as traffic forwarding, protocol spoofing, log balancing, and session hijacking.
[0072] Specifically, step S1341 above may involve obtaining the functions required for each of the various network security event types.
[0073] Step S1342: Divide the generalized sample functional unit codes according to their functions to obtain multiple code groups.
[0074] Specifically, the generalized sample functional unit code is grouped according to the functional characteristics it represents. The key here is to ensure that the members within each code group work around the same or similar functions. For example, when dealing with SQL injection, all code related to database operations (such as connections and queries) can be grouped together; while when considering malware propagation, separate code groups should be established for functions such as downloading, installation, and remote control. This not only helps in accurately locating the required resources later, but also simplifies the management and retrieval process.
[0075] It is worth mentioning that the multiple sample functional units obtained from the decomposition process can also be grouped according to the functional characteristics they represent, so that the multiple code groups include the generalized sample functional unit code as well as the decomposed sample functional unit code.
[0076] Step S1343: Determine at least one candidate code group from the plurality of code groups based on the functionality required for the network security incident type.
[0077] The functions corresponding to the candidate code groups are the functions required for the network security incident type.
[0078] By employing the above step S1343, at least one candidate code group corresponding to each type of network security incident can be determined.
[0079] Step S1344: Generate candidate sample code corresponding to the network security event type based on the code in the candidate code group corresponding to the network security event type.
[0080] Specifically, candidate sample codes corresponding to the network security incident type can be generated from the candidate code groups corresponding to the network security incident type. Each candidate sample code is generated based on a sample functional unit code selected from each candidate code group.
[0081] For example, taking multiple code groups including six code groups: E, F, G, H, I, and J, if the candidate code groups corresponding to a certain network security incident type are E, F, and G, and each candidate code group includes two generalized sample functional unit codes, such as candidate code group E including e1 and e2, candidate code group F including f1 and f2, and candidate code group G including g1 and g2, then the combination of the generalized sample functional unit codes included in the multiple candidate sample codes can be e1+f1+g1, e1+f1+g2, e1+f2+g1, e1+f2+g2, e2+f1+g1, e2+f1+g2, e2+f2+g1, and e2+f2+g2.
[0082] By adopting the above settings to group the generalized sample functional unit codes according to their functional characteristics and to determine candidate code groups for different types of cybersecurity events, the quality of the candidate code groups can be ensured. This ensures that the generated candidate code groups have functions related to the type of cybersecurity event, thus ensuring that they meet the functional requirements of specific cybersecurity events. Therefore, using the candidate sample codes generated above can more accurately simulate actual attack behaviors or defense mechanisms.
[0083] In one possible implementation, step S130 specifically involves using a generative language model to generate generalized sample functional unit code based on generalization instructions and the sample functional unit code. In this case, as shown in Figure 4, the generative language model is trained as follows:
[0084] Step S160: Obtain multiple sets of sample data, each set of sample data including a first sample code, a sample generalization instruction corresponding to the first sample code, and a reference generalization code.
[0085] In one possible implementation, step S160 may specifically involve: acquiring multiple first sample codes; calling a generalization tool to perform generalization processing on the first sample codes to obtain reference generalized code; and generating sample generalization instructions based on the functional characteristics of the generalization tool.
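One way to picture how a set of sample data is assembled in step S160 is sketched below. The generalization tool here is a deliberately harmless stand-in that renames a single identifier; the `SampleData` class, the `rename_identifier` function, and the instruction template are assumptions made for illustration and are not named in the application.

```python
import re
from dataclasses import dataclass

@dataclass
class SampleData:
    first_sample_code: str
    generalization_instruction: str
    reference_generalized_code: str

def rename_identifier(code: str, old: str, new: str) -> str:
    """Toy stand-in for a generalization tool: rename one identifier."""
    return re.sub(rf"\b{re.escape(old)}\b", new, code)

# Instruction text derived from what the (hypothetical) tool does.
TOOL_INSTRUCTION = "Rename the identifier '{old}' to '{new}' without changing behavior."

def build_sample(code: str, old: str, new: str) -> SampleData:
    """Pair the original code with the tool's output and a matching instruction."""
    return SampleData(
        first_sample_code=code,
        generalization_instruction=TOOL_INSTRUCTION.format(old=old, new=new),
        reference_generalized_code=rename_identifier(code, old, new),
    )

sample = build_sample("total = count + 1\nprint(total)", "total", "result")
print(sample.generalization_instruction)
print(sample.reference_generalized_code)
```

Because the instruction is generated from the tool's own behavior, each tool yields instructions that describe exactly the transformation that produced the reference generalized code.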
[0086] Different generalization tools use different sample generalization instructions, which guide how to convert the initial sample code into a more general and adaptive form. These instructions specify the concrete rules for generalizing the initial sample code.
[0087] Step S170: Use a pre-trained generative language model to generalize the code based on the first sample code and the sample generalization instructions corresponding to the sample code to obtain the generalized prediction code.
[0088] Among them, the pre-trained generative language model has been extensively trained on a large amount of text data and has a certain ability to understand and generate language.
[0089] By utilizing a pre-trained generative language model to generalize code based on the first sample code and the sample generalization instructions corresponding to the sample code, the pre-trained generative language model can attempt to generate a more reasonable generalized prediction code based on the knowledge it has learned.
[0090] Step S180: Based on the generalized prediction code and the reference generalization code, obtain the first model loss.
[0091] The first model loss measures the difference between the generalized prediction code and the reference generalized code. This can be achieved by defining a loss function, such as using cross-entropy loss or mean squared error loss, based on the generalized prediction code and the reference generalized code.
[0092] The loss function quantifies the degree of mismatch between the generalized prediction code and the reference generalized code; a lower value indicates better model performance. When the generalized prediction code is very close to the reference generalized code, the loss value is small; conversely, a larger difference leads to a higher loss value. This gives subsequent adjustments of the model parameters a clear objective, namely minimizing the first model loss.
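As an illustration of a cross-entropy form of the first model loss, the sketch below assumes the model outputs one probability distribution per token position and that the reference generalized code has already been tokenized. The token lists and probabilities are invented for the example.

```python
import math

def cross_entropy(predicted_probs, reference_tokens):
    """Mean negative log-likelihood of the reference tokens."""
    eps = 1e-12  # guards against log(0) when a token has no probability mass
    total = 0.0
    for probs, token in zip(predicted_probs, reference_tokens):
        total += -math.log(probs.get(token, 0.0) + eps)
    return total / len(reference_tokens)

reference = ["result", "=", "count"]

# A prediction close to the reference puts high probability on each reference token.
close_prediction = [
    {"result": 0.9, "total": 0.1},
    {"=": 0.99, "+": 0.01},
    {"count": 0.95, "n": 0.05},
]
# A prediction far from the reference puts most probability elsewhere.
far_prediction = [
    {"result": 0.2, "total": 0.8},
    {"=": 0.5, "+": 0.5},
    {"count": 0.1, "n": 0.9},
]

loss_close = cross_entropy(close_prediction, reference)
loss_far = cross_entropy(far_prediction, reference)
print(loss_close < loss_far)
```

The closer prediction produces the smaller loss, which is the behavior described in the paragraph above.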
[0093] Step S190: Adjust the model parameters of the generative language model based on the first model loss to minimize the first model loss until the first iteration termination condition is met, and obtain the trained generative language model.
[0094] The backpropagation algorithm updates the model parameters based on the rate of change (gradient) of the first model's loss relative to each parameter, aiming to find an optimal set of model parameters that minimizes the average loss across the entire training set. This process often involves multiple iterations, where the loss is recalculated and parameters are adjusted accordingly with new data, until a predetermined stopping condition is met, such as reaching the maximum number of iterations or the loss decreasing by less than a certain threshold, resulting in the trained generative language model.
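The iteration and the two stopping conditions mentioned above (a maximum number of iterations, or a loss decrease below a threshold) can be sketched with plain gradient descent. The quadratic `toy_loss` merely stands in for the first model loss, and `lr`, `max_iters`, and `min_delta` are assumed hyperparameter names; a real generative language model would compute gradients by backpropagation through the network.

```python
def train(params, loss_and_grad, lr=0.1, max_iters=1000, min_delta=1e-6):
    """Gradient descent; stops at max_iters or when the loss improves by < min_delta."""
    prev_loss = float("inf")
    for step in range(1, max_iters + 1):
        loss, grads = loss_and_grad(params)
        if prev_loss - loss < min_delta:
            return params, loss, step  # improvement too small: terminate
        params = [p - lr * g for p, g in zip(params, grads)]
        prev_loss = loss
    return params, loss, max_iters  # iteration budget exhausted

# Toy loss standing in for the first model loss: (w - 3)^2 with gradient 2(w - 3).
def toy_loss(params):
    (w,) = params
    return (w - 3.0) ** 2, [2.0 * (w - 3.0)]

final_params, final_loss, steps = train([0.0], toy_loss)
print(final_params, final_loss, steps)
```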
[0095] By adopting the above methods, the generative language model can learn knowledge about the functional requirements, code structure and patterns, and application of generalization instructions for various types of network security incidents. This ensures that the generalized code generated when using the trained generative language model for code generalization is more in line with the needs of actual application scenarios.
[0096] As shown in Figure 5, after obtaining multiple target sample codes, the method further includes:
[0097] Step S220: Use the initial security detection model to perform attack prediction on the training samples to obtain attack prediction results, which are used to indicate whether the attack is successful.
[0098] By using the initial security detection model to perform attack prediction on the training samples, the model attempts to predict attack outcomes based on the knowledge it has already learned.
[0099] Step S230: Obtain the second model loss based on the attack prediction results of the training samples and the simulated attack test results corresponding to the training samples.
[0100] The second model loss measures the difference between the attack prediction result and the sample label. This can be achieved by defining a loss function, such as using cross-entropy loss or mean squared error loss based on the attack prediction result and the sample label.
[0101] The loss function quantifies the degree of mismatch between the attack prediction result and the sample label, with a lower value indicating better model performance. When the attack prediction result is very close to the sample label, the loss value is small; conversely, a larger difference leads to a higher loss value. This gives subsequent adjustments of the model parameters a clear objective, namely minimizing the second model loss.
[0102] Step S240: Adjust the model parameters of the initial security detection model based on the second model loss to minimize the second model loss until the second iteration termination condition is met, and obtain the target security detection model.
[0103] The backpropagation algorithm updates the model parameters based on the rate of change (gradient) of the second model loss relative to each parameter, aiming to find an optimal set of model parameters that minimizes the average loss across the entire training set. This process often involves multiple iterations, where the loss is recalculated and parameters are adjusted accordingly with new data, until a predetermined stopping condition is met, such as reaching the maximum number of iterations or the loss decreasing by less than a certain threshold, resulting in the trained target security detection model.
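Steps S220 to S240 can be illustrated with a very small stand-in for the security detection model. The sketch below assumes a logistic regression trained with binary cross-entropy, numeric feature vectors already extracted from the simulated attack test data, and labels taken from the simulated attack test results; none of these choices is specified by the application.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, features):
    """Predicted probability that the attack succeeds."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

def bce_loss(weights, bias, samples):
    """Binary cross-entropy between predictions and labels (second model loss)."""
    eps = 1e-12
    total = 0.0
    for features, label in samples:
        p = predict(weights, bias, features)
        total += -(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps))
    return total / len(samples)

def train_detector(samples, lr=0.5, max_iters=500, min_delta=1e-7):
    n = len(samples[0][0])
    weights, bias = [0.0] * n, 0.0
    prev = float("inf")
    for _ in range(max_iters):
        loss = bce_loss(weights, bias, samples)
        if prev - loss < min_delta:  # second iteration termination condition
            break
        prev = loss
        grad_w, grad_b = [0.0] * n, 0.0
        for features, label in samples:
            err = predict(weights, bias, features) - label
            for i, x in enumerate(features):
                grad_w[i] += err * x / len(samples)
            grad_b += err / len(samples)
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

# Hypothetical features; label 1 = simulated attack succeeded, 0 = failed.
training_samples = [([1.0, 0.9], 1), ([0.8, 1.0], 1), ([0.1, 0.2], 0), ([0.0, 0.1], 0)]
weights, bias = train_detector(training_samples)
initial_loss = bce_loss([0.0, 0.0], 0.0, training_samples)
final_loss = bce_loss(weights, bias, training_samples)
print(initial_loss, final_loss)
```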
[0104] By obtaining training sample code with explicit labels (indicating whether an attack was successful) from a sample library and using these samples to train an initial security detection model, the model can learn the characteristics and patterns of different types of cybersecurity incidents. The resulting target security detection model will have higher detection accuracy and can more reliably identify potential security threats in practical applications. Furthermore, because the sample library contains a rich and diverse collection of training sample code, including target sample code, it ensures that the target security detection model trained using the sample library is effective not only against known types of attacks but also against unknown or variant attacks. This allows the target security detection model to maintain a high detection rate and a low false positive rate even when encountering new or complex attack methods. Therefore, by adopting the above-described method steps of this application, the performance of the target security detection model is greatly improved.
[0105] As shown in Figure 6, an embodiment of this application also provides a security detection method, the method comprising:
[0106] Step S310: Obtain the data to be detected.
[0107] The data to be detected can be program code (e.g., application source code, script files, etc., used to detect whether there is malicious coding or potential vulnerabilities in the code), network traffic data (e.g., data packets entering and leaving the network, which can be data streams of various protocols such as HTTP, HTTPS, FTP, etc. This type of data is usually used to detect abnormal activities in the network, such as DDoS attacks, SQL injection attempts, etc.), system logs (e.g., log files generated by the operating system or application, which record the system's running status and events that occur, and can be used to detect abnormal behavior patterns, such as unauthorized access attempts or system configuration changes), user behavior data (e.g., user login patterns, operating habits, etc., used to identify behaviors that do not conform to known normal behavior patterns, which may indicate internal threats or account intrusion), or files (e.g., suspicious executable files, documents, or other types of files, which may contain malware, such as viruses, Trojans, or spyware).
[0108] The data to be detected can be obtained in real time directly from network devices (such as routers and switches) or servers; it can also be obtained by calling application programming interfaces, or by deploying proxies or middleware on the client or server side; or it can be obtained by receiving data uploaded by users, without specific limitations here.
[0109] Step S320: Use the security detection model trained by the training samples generated by the sample generation method to perform attack prediction on the data to be detected, and obtain the attack prediction result of the data to be detected.
[0110] By adopting the above method, the trained security detection model can be used for security detection, thereby ensuring the reliability of security detection of the data to be detected.
[0111] As shown in Figure 7, this application provides a sample generation method, which includes the following stages:
[0112] 1. Sample Collection
[0113] Initial sample code can be collected through open-source intelligence (such as websites like GitHub); or by collecting programs from cybersecurity incidents that occur at customer sites. This initial sample code includes software program code corresponding to malware, exploit tools, attack agents, and backdoors.
[0114] 2. Sample generalization
[0115] The initial sample code is generalized using the steps S120-S140 described above to obtain multiple candidate sample codes.
[0116] 3. Sample Attacks and Simulation
[0117] A Breach and Attack Simulation (BAS) system and a high-quality sample library are built around the network attack chain. Candidate sample code is deployed to the BAS system, and the BAS system is run to evaluate network security effectiveness, producing simulated attack test results in real time. Candidate sample code whose simulated attack test result indicates a successful simulated attack is determined to be target sample code for training the security detection model and is added to the sample library.
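The selection rule in this stage amounts to a filter over the simulation results. A minimal sketch, assuming each result is recorded with a candidate identifier, a reference to its test data, and a success flag (the `SimulationResult` type and the trace names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class SimulationResult:
    candidate_id: str       # which candidate sample code was run
    test_data: str          # reference to the simulated attack test data
    attack_succeeded: bool  # simulated attack test result

def select_training_samples(results):
    """Keep only the results whose simulated attack succeeded."""
    return [r for r in results if r.attack_succeeded]

results = [
    SimulationResult("e1+f1+g1", "trace-001", True),
    SimulationResult("e1+f1+g2", "trace-002", False),
    SimulationResult("e2+f2+g1", "trace-003", True),
]

sample_library = select_training_samples(results)
print([r.candidate_id for r in sample_library])
```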
[0118] 4. Training of a large-scale security model
[0119] The large security model is trained using samples from the sample library. Because only high-quality adversarial samples are retained after the sample data passes through the BAS system, the large model is trained on high-quality adversarial samples centered on common attack scenarios (such as APT attacks and ransomware attacks), thereby improving the practical cybersecurity effectiveness of the large security detection model and reducing large-model hallucination and misjudgment.
[0120] It should be understood that although the steps in the flowcharts of the above embodiments are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the above embodiments may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.
[0121] As shown in Figure 8, another embodiment of this application provides a sample generation device 400, which includes: a code acquisition module 410 for acquiring multiple initial sample codes, wherein the initial sample codes are program codes corresponding to network security events; a code decomposition module 420 for decomposing the initial sample codes according to code functions to obtain at least one sample functional unit code corresponding to each initial sample code; a code generation module 440 for generating multiple candidate sample codes based on the sample functional unit codes corresponding to each of the multiple initial sample codes; and a code testing module 450 for performing simulated attack tests on each candidate sample code to obtain simulated attack test data and simulated attack test results for the candidate sample codes, and using the simulated attack test data corresponding to the candidate sample codes whose simulated attack test results indicate successful simulated attacks as training samples to train a security detection model based on the training samples.
[0122] In one possible implementation, the code generation module 440 includes a generalization processing submodule and a code generation submodule. The generalization processing submodule is used to perform generalization processing on the sample functional unit code to obtain generalized sample functional unit code. The code generation submodule is used to generate multiple candidate sample codes based on multiple generalized sample functional unit codes, and each candidate sample code includes at least one generalized sample functional unit code.
[0123] In one possible implementation, the generalization processing submodule is further configured to generate generalized sample functional unit code based on generalization instructions and the sample functional unit code using a generative language model.
[0124] In one possible implementation, the sample generation device 400 further includes a first sample acquisition module, a generalization prediction module, a first loss acquisition module, and a first parameter adjustment module. The first sample acquisition module is used to acquire multiple sets of sample data, each set of sample data including a first sample code, a sample generalization instruction corresponding to the first sample code, and a reference generalization code. The generalization prediction module is used to perform code generalization based on the first sample code and the sample generalization instruction corresponding to the sample code using a pre-trained generative language model to obtain a generalized prediction code. The first loss acquisition module is used to obtain a first model loss based on the generalized prediction code and the reference generalization code. The first parameter adjustment module is used to adjust the model parameters of the generative language model based on the first model loss to minimize the first model loss until a first iteration termination condition is met, thus obtaining a trained generative language model.
[0125] In one possible implementation, the first sample acquisition module is further configured to acquire multiple first sample codes; call a generalization tool to perform generalization processing on the first sample codes to obtain reference generalized code; and generate sample generalization instructions according to the functional characteristics of the generalization tool.
[0126] In one possible implementation, the generalization processing submodule is further used to call a generalization tool to perform generalization processing on the sample functional unit code, thereby obtaining the generalized sample functional unit code.
[0127] In one possible implementation, the sample generation device 400 further includes an attack prediction module, a second loss acquisition module, and a second parameter adjustment module; the attack prediction module is used to perform attack prediction on the training samples using an initial security detection model to obtain an attack prediction result, the attack prediction result being used to indicate whether the attack was successful; the second loss acquisition module is used to obtain a second model loss based on the attack prediction result of the training samples and the simulated attack test result corresponding to the training samples; the second parameter adjustment module is used to adjust the model parameters of the initial security detection model based on the second model loss to minimize the second model loss until the second iteration termination condition is reached, thereby obtaining the target security detection model.
[0128] In one possible implementation, the code generation module 440 includes a function determination submodule, a code partitioning submodule, a code group determination submodule, and a code generation submodule. The function determination submodule is used to determine the functions required for each of various network security event types. The code partitioning submodule is used to partition the generalized sample functional unit codes according to their functions to obtain multiple code groups. The code group determination submodule is used to determine at least one candidate code group from the multiple code groups based on the functions required for the network security event type, wherein the functions corresponding to the candidate code groups are the functions required for the network security event type. The code generation submodule is used to generate candidate sample code corresponding to the network security event type based on the code in the candidate code group corresponding to the network security event type.
[0129] As shown in Figure 9, another embodiment of this application provides a security detection device 500, which includes: a data acquisition module 510 for acquiring data to be detected; and a security detection module 520 for using a security detection model trained with training samples generated by the above-mentioned sample generation device to perform attack prediction on the data to be detected, and obtain the attack prediction result of the data to be detected.
[0130] Each module in the above-described device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device in hardware form, or stored in the memory of a computer device in software form, so that the processor can call and execute the operations corresponding to each module. It should be noted that the device embodiments in this application correspond to the foregoing method embodiments. The specific principles of the device embodiments can be found in the foregoing method embodiments, and will not be repeated here.
[0131] An electronic device provided by this application is described below with reference to Figure 10.
[0132] Referring to Figure 10, based on the sample generation method provided in the above embodiments, an embodiment of this application also provides an electronic device 100 that includes a processor 102 capable of executing the aforementioned method. The electronic device 100 can be a server, a terminal device, or a vehicle. The terminal device can be a smartphone, tablet computer, computer, or portable computer, etc.
[0133] The electronic device 100 also includes a memory 104. The memory 104 stores a program that can execute the contents of the foregoing embodiments, and the processor 102 can execute the program stored in the memory 104.
[0134] The processor 102 may include one or more cores for data processing and message matrix units. The processor 102 connects to various parts within the electronic device 100 using various interfaces and lines, and performs various functions and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 104, and by calling data stored in the memory 104. Optionally, the processor 102 may be implemented using at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), or Programmable Logic Array (PLA). The processor 102 may integrate one or more of the following: Central Processing Unit (CPU), Graphics Processing Unit (GPU), and modem. The CPU primarily handles the operating system, user interface, and applications; the GPU is responsible for rendering and drawing the displayed content; and the modem handles wireless communication. It is understood that the modem may also not be integrated into the processor 102 and may be implemented separately using a communication chip.
[0135] The memory 104 may include random access memory (RAM) or read-only memory (ROM). The memory 104 can be used to store instructions, programs, code, code sets, or instruction sets. The memory 104 may include a program storage area and a data storage area. The program storage area may store instructions for implementing an operating system, instructions for implementing at least one function, instructions for implementing the various method embodiments described below, etc. The data storage area may also store data acquired by the electronic device 100 during use (e.g., parking location, environmental information, and images).
[0136] The electronic device 100 may also include a network module and a screen. The network module is used to receive and transmit electromagnetic waves, converting electromagnetic waves into electrical signals, thereby enabling communication with communication networks or other devices, such as audio playback devices. The network module may include various existing circuit elements used to perform these functions, such as antennas, radio frequency transceivers, digital signal processors, encryption / decryption chips, SIM cards, memory, etc. The network module can communicate with various networks such as the Internet, corporate intranets, and wireless networks, or communicate with other devices via wireless networks. The aforementioned wireless networks may include cellular telephone networks, wireless local area networks, or metropolitan area networks. The screen can display interface content and perform data interaction, such as displaying the aforementioned interface and triggering operations through the screen.
[0137] In some embodiments, the electronic device 100 may further include a peripheral interface 106 and at least one peripheral device. The processor 102, memory 104, and peripheral interface 106 can be connected via a bus or signal line. Each peripheral device can be connected to the peripheral interface via a bus, signal line, or circuit board. Specifically, the peripheral device includes at least one of the following: a radio frequency component 108, a positioning component 112, a camera 114, an audio component 116, a display screen 118, and a power supply 122.
[0138] Peripheral interface 106 can be used to connect at least one I / O (Input / Output) related peripheral device to processor 102 and memory 104. In some embodiments, processor 102, memory 104 and peripheral interface 106 are integrated on the same chip or circuit board; in some other embodiments, any one or two of processor 102, memory 104 and peripheral interface 106 can be implemented on separate chips or circuit boards, and this application embodiment does not limit this.
[0139] The radio frequency (RF) component 108 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The RF component 108 communicates with communication networks and other communication devices via electromagnetic signals. The RF component 108 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals back into electrical signals. Optionally, the RF component 108 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a user identity module card, etc. The RF component 108 can communicate with other terminals via at least one wireless communication protocol. This wireless communication protocol includes, but is not limited to: the World Wide Web, metropolitan area networks, intranets, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and / or WiFi (Wireless Fidelity) networks. In some embodiments, the RF component 108 may also include circuitry related to NFC (Near Field Communication), which is not limited in this application.
[0140] Positioning component 112 is used to locate the current geographic location of an electronic device to enable navigation or LBS (Location Based Service). Positioning component 112 can be a positioning component based on the US GPS (Global Positioning System), BeiDou system, or Galileo system.
[0141] Camera 114 is used to capture images or videos. Optionally, camera 114 includes a front-facing camera and a rear-facing camera. Typically, the front-facing camera is located on the front panel of the electronic device 100, and the rear-facing camera is located on the back of the electronic device 100. In some embodiments, there are at least two rear-facing cameras, which are any one of a main camera, a depth-sensing camera, a wide-angle camera, and a telephoto camera, to achieve background blurring by fusion of the main camera and the depth-sensing camera, panoramic shooting by fusion of the main camera and the wide-angle camera, VR (Virtual Reality) shooting, or other fusion shooting functions. In some embodiments, camera 114 may also include a flash. The flash can be a single-color temperature flash or a dual-color temperature flash. A dual-color temperature flash refers to a combination of a warm light flash and a cool light flash, which can be used for light compensation at different color temperatures.
[0142] Audio component 116 may include a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, and convert the sound waves into electrical signals that are input to processor 102 for processing, or input to radio frequency component 108 for voice communication. For stereo acquisition or noise reduction purposes, there may be multiple microphones, each located at a different part of electronic device 100. The microphone may also be an array microphone or an omnidirectional microphone. The speaker is used to convert electrical signals from processor 102 or radio frequency component 108 into sound waves. The speaker may be a conventional diaphragm speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert electrical signals not only into sound waves that humans can hear, but also into sound waves that humans cannot hear for purposes such as ranging. In some embodiments, audio component 116 may also include a headphone jack.
[0143] Display screen 118 is used to display a UI (User Interface). This UI may include graphics, text, icons, videos, and any combination thereof. When display screen 118 is a touch display screen, it also has the ability to collect touch signals on or above its surface. These touch signals can be input as control signals to processor 102 for processing. In this case, display screen 118 can also be used to provide virtual buttons and / or a virtual keyboard, also known as soft buttons and / or a soft keyboard. In some embodiments, there may be one display screen 118, which serves as the front panel of electronic device 100; in other embodiments, there may be at least two display screens, respectively disposed on different surfaces of electronic device 100 or in a folded design; in still other embodiments, display screen 118 may be a flexible display screen, disposed on a curved or folded surface of electronic device 100. Furthermore, display screen 118 may be configured as a non-rectangular irregular shape, i.e., a non-rectangular screen. Display screen 118 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
[0144] Power supply 122 is used to supply power to various components in electronic device 100. Power supply 122 can be alternating current, direct current, a disposable battery, or a rechargeable battery. When power supply 122 includes a rechargeable battery, the rechargeable battery can be a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is a battery that is charged via a wired line, while a wireless rechargeable battery is a battery that is charged via a wireless coil. The rechargeable battery can also be used to support fast charging technology.
[0145] This application also provides a computer-readable storage medium. The computer-readable storage medium stores program code, which can be called by a processor to execute the methods described in the above method embodiments.
[0146] Computer-readable storage media can be electronic storage devices such as flash memory, EEPROM (Electrically Erasable Programmable Read-Only Memory), EPROM, hard disk, or ROM. Optionally, computer-readable storage media includes non-transitory computer-readable storage medium. The computer-readable storage medium has storage space for program code that performs any of the method steps described above. This program code can be read from or written to one or more computer program products. The program code can be compressed, for example, in a suitable form.
[0147] This application also provides a computer program product or computer program that includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the methods described in the various optional implementations above.
[0148] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.
Claims
1. A sample generation method, characterized in that, The method includes: Obtain multiple initial sample codes, which are program codes corresponding to network security incidents; The initial sample code is decomposed according to its function to obtain at least one sample function unit code corresponding to each initial sample code; Multiple candidate sample codes are generated based on the sample functional unit codes corresponding to the multiple initial sample codes; Each candidate sample code is subjected to a simulated attack test to obtain simulated attack test data and simulated attack test results. The simulated attack test data corresponding to the candidate sample code whose simulated attack was successful, as indicated by the simulated attack test results, is used as a training sample to train the security detection model based on the training sample.
2. The method according to claim 1, characterized in that generating multiple candidate sample codes based on the sample functional unit codes corresponding to each of the multiple initial sample codes comprises: generalizing the sample functional unit codes to obtain generalized sample functional unit codes; and generating multiple candidate sample codes based on the generalized sample functional unit codes, each candidate sample code including at least one generalized sample functional unit code.
3. The method according to claim 2, characterized in that generalizing the sample functional unit codes to obtain generalized sample functional unit codes comprises: using a generative language model to generate the generalized sample functional unit codes based on a generalization instruction and the sample functional unit codes.
4. The method according to any one of claims 1-3, characterized in that, after using the simulated attack test data corresponding to each candidate sample code whose simulated attack test result indicates a successful simulated attack as a training sample, the method further comprises: performing attack prediction on the training samples using an initial security detection model to obtain attack prediction results, the attack prediction results indicating whether the attack succeeded; obtaining a second model loss based on the attack prediction results of the training samples and the simulated attack test results corresponding to the training samples; and adjusting model parameters of the initial security detection model based on the second model loss, so as to minimize the second model loss, until a second iteration termination condition is met, thereby obtaining a target security detection model.
5. The method according to any one of claims 1-3, characterized in that generating multiple candidate sample codes based on the generalized sample functional unit codes comprises: determining the functions required by each of multiple network security incident types; dividing the generalized sample functional unit codes by function to obtain multiple code groups; determining, from the multiple code groups, at least one candidate code group based on the functions required by a network security incident type, wherein the functions corresponding to the candidate code group are the functions required by that network security incident type; and generating candidate sample code corresponding to the network security incident type based on the code in the candidate code group corresponding to that network security incident type.
6. A security detection method, characterized in that the method comprises: acquiring data to be detected; and performing attack prediction on the data to be detected using a security detection model trained with training samples generated by the sample generation method according to any one of claims 1-5, to obtain an attack prediction result of the data to be detected.
7. A sample generation device, characterized in that the device comprises: a code acquisition module, used to acquire multiple initial sample codes, the initial sample codes being program codes corresponding to network security incidents; a code decomposition module, used to decompose each initial sample code according to code function to obtain at least one sample functional unit code corresponding to each initial sample code; a code generation module, used to generate multiple candidate sample codes based on the sample functional unit codes corresponding to each of the multiple initial sample codes; and a code testing module, used to perform a simulated attack test on each candidate sample code to obtain simulated attack test data and a simulated attack test result of the candidate sample code, and to use the simulated attack test data corresponding to each candidate sample code whose simulated attack test result indicates a successful simulated attack as a training sample, so as to train a security detection model based on the training samples.
8. A security detection device, characterized in that the device comprises: a data acquisition module, used to acquire data to be detected; and a security detection module, used to perform attack prediction on the data to be detected using a security detection model trained with training samples generated by the sample generation device according to claim 7, to obtain an attack prediction result of the data to be detected.
9. An electronic device, characterized in that it comprises: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the method according to any one of claims 1-5 or claim 6.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores program code that can be invoked by a processor to perform the method according to any one of claims 1-5 or claim 6.
11. A computer program product, comprising a computer program / instructions, characterized in that, when executed by a processor, the computer program / instructions implement the steps of the method according to any one of claims 1-5 or claim 6.
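By way of illustration only, and not forming part of the claims, the sample generation flow of claims 1, 2 and 5 can be sketched in Python as follows. Everything in the sketch is an assumption made for readability: the `function:code` text format, the names `Unit`, `decompose`, `generalize`, `group_by_function`, `generate_candidates`, `simulate_attack` and `build_training_samples`, the string suffix that stands in for the generative language model of claim 3, and the toy success rule that stands in for a real simulated attack test. The unit "code" is inert placeholder text.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Unit:
    function: str  # functional role of the unit (hypothetical label)
    code: str      # inert placeholder text standing in for unit code


def decompose(initial_sample):
    """Split one initial sample into functional units.
    Assumed format: 'function:code' segments separated by ';'."""
    units = []
    for segment in initial_sample.split(";"):
        function, code = segment.split(":", 1)
        units.append(Unit(function.strip(), code.strip()))
    return units


def generalize(unit):
    """Stand-in for the generative language model: the original unit plus one variant."""
    return [unit, Unit(unit.function, unit.code + "_variant")]


def group_by_function(units):
    """Divide generalized units into code groups keyed by function."""
    groups = {}
    for unit in units:
        group = groups.setdefault(unit.function, [])
        if unit not in group:
            group.append(unit)
    return groups


def generate_candidates(groups, required_functions):
    """Build candidates that take one unit from each required function's group."""
    if any(f not in groups for f in required_functions):
        return []  # a required function has no code group
    pools = [groups[f] for f in required_functions]
    return [list(combo) for combo in product(*pools)]


def simulate_attack(candidate):
    """Stub for the simulated attack test; returns (test_data, succeeded).
    Assumption: an arbitrary toy rule replaces a real test environment."""
    test_data = " | ".join(unit.code for unit in candidate)
    succeeded = any(unit.code.endswith("_variant") for unit in candidate)
    return test_data, succeeded


def build_training_samples(initial_samples, required_functions):
    """Keep the test data of every candidate whose simulated attack succeeded."""
    units = [g for s in initial_samples for u in decompose(s) for g in generalize(u)]
    groups = group_by_function(units)
    samples = []
    for candidate in generate_candidates(groups, required_functions):
        test_data, succeeded = simulate_attack(candidate)
        if succeeded:
            samples.append(test_data)
    return samples
```

Calling `build_training_samples(["scan:a; exploit:b", "scan:c; exploit:d"], ["scan", "exploit"])` combines the two code groups into candidates and returns only the test data of those the stub marks as successful. A real embodiment would replace `generalize` and `simulate_attack` with the generative language model and the simulated attack test described in this application.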
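Likewise for illustration only, the training loop of claim 4 and the detection step of claim 6 can be sketched with a small logistic-regression model in plain Python. The model family, the features in `featurize`, the learning rate, the loss threshold used as the iteration termination condition, and all function names are assumptions; the application does not prescribe them. The sketch also assumes that labeled test data are available for both successful and unsuccessful simulated attacks.

```python
import math


def featurize(test_data):
    """Hypothetical features: bias term, scaled token count, keyword flag."""
    tokens = test_data.split()
    return [1.0, len(tokens) / 10.0, 1.0 if "payload" in tokens else 0.0]


def predict(weights, features):
    """Attack prediction: estimated probability that the attack succeeded."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def model_loss(weights, data):
    """Mean binary cross-entropy between predictions and simulated test results."""
    eps = 1e-12
    total = 0.0
    for features, label in data:
        p = predict(weights, features)
        total -= label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return total / len(data)


def train(data, lr=0.5, max_iters=2000, target_loss=0.1):
    """Adjust parameters by gradient descent until the termination condition
    (loss at or below target_loss, or max_iters reached) is met."""
    weights = [0.0] * len(data[0][0])
    for _ in range(max_iters):
        if model_loss(weights, data) <= target_loss:
            break
        grads = [0.0] * len(weights)
        for features, label in data:
            error = predict(weights, features) - label
            for i, x in enumerate(features):
                grads[i] += error * x / len(data)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


def detect(weights, data_to_be_detected):
    """Attack prediction on new data with the trained model."""
    return predict(weights, featurize(data_to_be_detected)) >= 0.5


# Toy labeled data: (simulated attack test data, simulated attack test result).
labeled = [
    ("GET /index payload", 1),
    ("GET /index", 0),
    ("POST /login payload token", 1),
    ("POST /login token", 0),
]
training_data = [(featurize(text), label) for text, label in labeled]
target_model = train(training_data)
```

After training, `detect(target_model, "GET /home payload")` and `detect(target_model, "GET /home")` give the attack prediction for two unseen inputs. The loss here plays the role of the second model loss, and the returned weights play the role of the target security detection model.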