A message generation method, program product, electronic device and storage medium
By using a targeted message generation method guided by reinforcement learning agents, the problem of low efficiency in traditional fuzz testing is solved, and more efficient deep vulnerability discovery is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HUBEI TIANRONGXIN NETWORK SECURITY TECH CO LTD
- Filing Date
- 2026-04-13
- Publication Date
- 2026-06-19
AI Technical Summary
Traditional fuzz testing methods for network protocols are inefficient, generate relatively simple test cases, and struggle to discover deep vulnerabilities; rule-based approaches rely on expert experience, and the rule base cannot update or evolve on its own.
A reinforcement learning agent selects the action to perform based on the test state. A generative model generates targeted malformed messages, and the policy network is optimized using feedback information. The data pool is then updated to optimize the generative model.
This improves the targeting and diversity of fuzz testing, generates more messages that can trigger deeper issues, reduces resource waste, and enhances testing efficiency.
Figure CN122247730A_ABST
Description
Technical Field
[0001] This application relates to the field of network security technology, and more specifically, to a message generation method, program product, electronic device, and storage medium.
Background Technology
[0002] In the fields of cybersecurity and software testing, fuzzing is a commonly used vulnerability discovery technique. Its basic idea is to input a large amount of unexpected and distorted test data into the target object and observe whether the target exhibits crashes, abnormal behaviors, or other issues, thereby discovering potential security vulnerabilities. Traditional network protocol fuzzing mainly uses random mutation or packet generation based on predefined rule bases, which is inefficient and the generated test cases remain relatively simple.
Summary of the Invention
[0003] The purpose of this application is to provide a message generation method, program product, electronic device, and storage medium to alleviate the above-mentioned problems.
[0004] In a first aspect, embodiments of this application provide a message generation method, comprising: pre-training a generative model using an initial data pool and initializing a reinforcement learning agent; wherein the generative model is used to generate corresponding messages based on input condition vectors; the reinforcement learning agent selects an action to execute from an action space based on the current test state; wherein the executed action is used to describe the features of the message to be generated; encoding the selected executed action into a condition vector and inputting it into the generative model; the generative model generating a message that conforms to the description of the executed action based on the condition vector; sending the generated message to a target under test for testing and obtaining feedback information; calculating the reward value of the current test based on the feedback information, updating the current test state, and updating the policy network of the reinforcement learning agent; filtering messages from the test process to update the data pool; optimizing the generative model using the updated data pool, and generating test messages by the optimized generative model.
[0005] In the above implementation process, by using a reinforcement learning agent to select actions based on the test state, the generation direction of messages can be focused on the most valuable test areas, reducing the resource waste of blind randomness in traditional methods. By encoding the actions as condition vectors and inputting them into the generative model, targeted message generation is achieved, and the generated messages accurately reflect the distortion characteristics expected by the agent. By calculating reward values based on feedback information and updating the policy network, the agent can continuously learn from test experience and gradually optimize its decision-making capabilities. By selecting high-value messages from the testing process to update the data pool, and using the updated data pool to optimize the generative model, the generative model can continuously evolve, generating more messages that can trigger deeper problems. This also improves the targeting and diversity of test messages, enhancing the overall effectiveness of fuzz testing.
[0006] Optionally, in this embodiment, the generative model is a conditional generative adversarial network, which includes a generator and a discriminator. The generator is used to generate messages based on the input conditional vector and random noise, and the discriminator is used to determine the authenticity of the messages. The generative model is pre-trained using an initial data pool, which includes: using legitimate messages and seed abnormal messages in the initial data pool to perform adversarial training on the generator and discriminator.
[0007] In the above implementation process, adversarial training between the generator and discriminator enables the generative model to learn the basic structure of legitimate messages and the abnormal patterns of seed abnormal messages from the initial data pool, making subsequent messages more compliant with requirements. Based on this, combined with the decision-making guidance of the reinforcement learning agent, the generated messages can accurately reflect the desired distortion characteristics, avoiding the blindness of random generation in traditional methods. By calculating reward values based on feedback information and updating the policy network, the agent's decision-making ability continuously improves during testing, enabling more targeted selection of generation directions.
[0008] Optionally, in this embodiment of the application, the current test state includes at least one of the following information: coverage change vector, mutation field heatmap, depth reward decay factor, and exploration degree.
[0009] In the above implementation process, by constructing the current test state including coverage change vector, mutation field heatmap, depth reward decay factor and exploration degree, the reinforcement learning agent can perceive the dynamic changes of the test environment, select the action to be executed according to these state information, so that the message generation direction can keep up with the test progress, focus on high-value fields and deep paths, and maintain message diversity.
[0010] Optionally, in the embodiments of this application, the action space includes discrete actions and / or continuous actions; discrete actions include target field selection and mutation type selection, wherein the mutation type is at least one of boundary value, random byte flip, dependency field conflict and state machine violation; continuous actions include mutation intensity, which is used to control the degree of message distortion.
[0011] In the aforementioned implementation, by constructing an action space containing discrete and continuous actions, the reinforcement learning agent can finely control the direction of message generation from multiple dimensions, including target field selection, mutation type selection, and mutation intensity. Target field selection in discrete actions allows the agent to focus on testing specific fields in the protocol; mutation type selection provides various distortion methods to adapt to different testing needs; and mutation intensity in continuous actions allows the agent to adjust the severity of distortion. By combining the perception and feedback information of the current test state with reward calculation, the agent can dynamically adjust action selection according to the test progress, making the generated messages more targeted in exploring high-value regions.
[0012] Optionally, in this embodiment, the action space includes discrete actions and / or continuous actions; encoding the selected execution action into a condition vector includes: converting the discrete actions in the execution action into one-hot encoding or embedding vectors, and / or normalizing the continuous actions in the execution action; concatenating the encoded discrete actions and / or continuous actions to generate a condition vector of a preset dimension.
[0013] In the above implementation process, through adversarial training between the generator and the discriminator, the generative model learns the basic structure and abnormal patterns of the messages from the initial data pool; by constructing the current test state, which includes the coverage change vector, the mutation field heatmap, the depth reward decay factor, and the exploration degree, the reinforcement learning agent can fully perceive the dynamic changes of the test environment; by designing an action space that includes discrete and continuous actions, the agent can finely control the direction of message generation from multiple dimensions such as target field, mutation type, and mutation intensity.
[0014] Optionally, in this embodiment, the feedback information includes at least code coverage change, crash type, execution path depth, and / or test duration. Based on the feedback information, the reward value for the current test is calculated, specifically including: extracting the code coverage change from the feedback information and calculating the coverage reward component; and / or, determining the crash severity reward component based on the crash type; and / or, calculating the depth reward component based on the execution path depth; and / or, calculating the cost penalty component based on the test duration; and using the coverage reward component, crash severity reward component, depth reward component, and / or cost penalty component to calculate the reward value for the current test.
[0015] In the above implementation process, by extracting code coverage changes, crash types, execution path depths and test times from the feedback information, and calculating reward values containing multiple components, the agent can obtain comprehensive and detailed evaluation signals, guiding it to optimize towards covering core code, discovering high-risk vulnerabilities, and exploring deep paths.
[0016] Optionally, in this embodiment of the application, updating the policy network of the reinforcement learning agent includes: acquiring experience data from the test; the experience data includes the state before the update, the selected action, the reward value, and the state after the update; storing the experience data in a buffer; and using the experience data in the buffer, calculating the loss function through a reinforcement learning algorithm and backpropagating to update the parameters of the policy network.
[0017] In the above implementation process, by acquiring and storing experience data including the state before the update, the selected action, the reward value, and the state after the update into a buffer, and then using the experience data in the buffer to update the policy network through a reinforcement learning algorithm, the agent can continuously learn from the test experience and improve its optimization decision-making ability.
[0018] Optionally, in this embodiment of the application, the generated message is sent to the target under test for testing and feedback information is obtained, including: sending the message to the protocol stack of the device under test through the network interface; and monitoring the execution process of the target under test using instrumentation, debugger or sandbox technology to obtain feedback information.
[0019] In the above implementation process, the generated messages are accurately sent to the protocol stack of the device under test via the network interface, ensuring that the test messages can be normally received and processed by the target program. Simultaneously, instrumentation, debuggers, or sandboxing techniques are used to comprehensively monitor the execution process of the target device, enabling real-time capture of key feedback information such as changes in code coverage, program crashes, execution path depth, and test latency. This feedback information provides an accurate data foundation for subsequent reward calculation and state updates, allowing the reinforcement learning agent to learn and optimize based on real test results.
[0020] Optionally, in this embodiment of the application, filtering messages from the test process to update the data pool includes: filtering messages with reward values higher than a threshold from the test process, updating the data pool based on the filtered messages, and generating an updated data pool.
[0021] In the above implementation process, packets with reward values higher than a threshold are selected from the testing process. The data pool is then updated based on these high-value packets, allowing it to continuously absorb new attack knowledge learned by the system during testing. The updated data pool is used to subsequently optimize the generative model, enabling it to generate more packets with similar high-value characteristics.
[0022] In a second aspect, embodiments of this application also provide a message generation apparatus, comprising: a pre-training module for pre-training a generative model using an initial data pool and initializing a reinforcement learning agent; wherein the generative model is used to generate corresponding messages based on input condition vectors; an execution module for the reinforcement learning agent to select an action to execute from an action space based on the current test state; wherein the execution action is used to describe the features of the message to be generated; an initial message generation module for encoding the selected execution action into a condition vector and inputting it into the generative model; the generative model generates a message that conforms to the description of the execution action based on the condition vector; a testing module for sending the generated message to a target for testing and obtaining feedback information; an updating agent module for calculating the reward value of the current test based on the feedback information, updating the current test state, and updating the policy network of the reinforcement learning agent; and an optimization model module for filtering messages from the test process to update the data pool; optimizing the generative model using the updated data pool, and generating test messages from the optimized generative model.
[0023] In a third aspect, embodiments of this application also provide a computer program product, including computer program instructions which, when executed by a processor, perform the method provided in the first aspect or any implementation thereof.
[0024] In a fourth aspect, embodiments of this application also provide an electronic device, including: a processor and a memory, the memory storing computer program instructions which, when executed by the processor, perform the method provided in the first aspect or any implementation thereof.
[0025] In a fifth aspect, embodiments of this application also provide a computer-readable storage medium storing computer program instructions which, when executed by a processor, perform the method provided in the first aspect or any implementation thereof.
[0026] This application provides a message generation method, program product, electronic device, and storage medium. By employing a reinforcement learning agent that selects actions based on the test state, message generation is focused on the most valuable test areas, reducing the resource waste of blind randomness in traditional methods. By encoding the actions as condition vectors and inputting them into the generative model, targeted message generation is achieved, and the generated messages accurately reflect the distortion characteristics desired by the agent. By calculating reward values based on feedback information and updating the policy network, the agent can continuously learn from test experience and gradually optimize its decision-making capabilities. By selecting high-value messages from the testing process to update the data pool and using the updated data pool to optimize the generative model, the model can continuously evolve, generating more messages that trigger deeper problems. This also improves the targeting and diversity of test messages, enhancing the overall effectiveness of fuzz testing.
Description of the Drawings
[0027] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments of this application will be briefly introduced below. It should be understood that the following drawings only show some embodiments of this application and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.
[0028] Figure 1 is a flowchart of a message generation method provided in an embodiment of this application; Figure 2 is a schematic diagram of a system workflow provided in an embodiment of this application; Figure 3 is a schematic diagram of the structure of the message generation apparatus provided in an embodiment of this application; Figure 4 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0029] The embodiments of the technical solution of this application will now be described in detail with reference to the accompanying drawings. These embodiments are only used to more clearly illustrate the technical solution of this application and are therefore merely examples, and should not be used to limit the scope of protection of this application.
[0030] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit this application.
[0031] In the description of the embodiments of this application, technical terms such as "first" and "second" are used only to distinguish different objects and should not be construed as indicating or implying relative importance or implicitly specifying the number, specific order, or primary and secondary relationship of the indicated technical features. In the description of the embodiments of this application, "multiple" means two or more, unless otherwise explicitly defined.
[0032] Traditional network protocol fuzzing primarily employs two methods: one is based on random mutation, which involves randomly modifying certain bytes on top of normal data packets; the other is based on a predefined rule base, which modifies data packets according to known anomalous patterns (such as excessively long fields, incorrect checksums, etc.). However, both methods have significant shortcomings. Random mutation-generated test data lacks directionality, and many test cases fail to trigger the program's deeper logic, resulting in low testing efficiency. In the rule base approach, the rule base's construction heavily relies on the experience of cybersecurity experts, only covering known anomalous types and struggling to discover unknown vulnerabilities; furthermore, the rule base cannot self-update and evolve, making it ineffective against new protocols.
[0033] This application provides a message generation method, program product, electronic device, and storage medium. By using a reinforcement learning agent to select actions based on the test state, the message generation direction is focused on the most valuable test areas, reducing the resource waste of blind randomness in traditional methods. By encoding the actions as conditional vectors and inputting them into the generative model, targeted message generation is achieved, and the generated messages accurately reflect the distortion characteristics desired by the agent. By calculating reward values based on feedback information and updating the policy network, the agent can continuously learn from test experience and gradually optimize its decision-making capabilities. By selecting high-value messages from the testing process to update the data pool and using the updated data pool to optimize the generative model, the generative model can continuously evolve, generating more messages that can trigger deeper problems. This also improves the targeting and diversity of test messages, enhancing the overall effectiveness of fuzz testing.
[0034] Please refer to Figure 1, which shows a flowchart of a message generation method provided in an embodiment of this application. The message generation method provided in this application can be applied to electronic devices, which may include physical devices such as servers, PCs, tablets, or smartphones, or virtual devices such as virtual machines or containers. The electronic device can be a single device, a combination of multiple devices, or a cluster of a large number of devices. The message generation method may include:
Step S110: Pre-train the generative model using the initial data pool and initialize the reinforcement learning agent; wherein the generative model is used to generate the corresponding message based on the input condition vector.
[0035] Step S120: The reinforcement learning agent selects an action to execute from the action space based on the current test state; wherein the action to execute is used to describe the message features to be generated.
[0036] Step S130: Encode the selected execution action into a condition vector and input it into the generative model; the generative model generates a message that conforms to the description of the execution action based on the condition vector.
[0037] Step S140: Send the generated message to the target under test for testing and obtain feedback information.
[0038] Step S150: Based on the feedback information, calculate the reward value for the current test, update the current test state, and update the policy network of the reinforcement learning agent.
[0039] Step S160: Filter messages from the test process to update the data pool; optimize the generative model using the updated data pool, and generate test messages from the optimized generative model.
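Steps S120 to S160 form a closed loop, which can be outlined as follows. This is only a minimal sketch under stated assumptions: the `hooks` object and all of its attribute names (`select_action`, `encode`, `generate`, `run_target`, `reward`, `next_state`, `update_policy`, `retrain`) are hypothetical placeholders for the components described in this application, and pre-training (S110) is assumed to have happened before the loop is entered.

```python
def fuzz_loop(hooks, pool, state, rounds, threshold, retrain_every):
    """Run the S120-S160 loop with caller-supplied (hypothetical) components."""
    pool = list(pool)          # work on a copy of the initial data pool
    high_value = []            # messages whose reward exceeded the threshold
    for round_no in range(1, rounds + 1):
        action = hooks.select_action(state)               # S120
        message = hooks.generate(hooks.encode(action))    # S130
        feedback = hooks.run_target(message)              # S140
        reward = hooks.reward(feedback)                   # S150: reward value
        new_state = hooks.next_state(state, feedback)     # S150: state update
        hooks.update_policy(state, action, reward, new_state)
        state = new_state
        if reward > threshold:                            # S160: filter messages
            high_value.append(message)
        if round_no % retrain_every == 0 and high_value:
            pool.extend(high_value)                       # S160: update data pool
            high_value.clear()
            hooks.retrain(pool)                           # S160: optimize generator
    return pool
```

Passing the components in as callables keeps the sketch independent of any particular reinforcement learning algorithm or generative model.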
[0040] In step S110, an initial data pool needs to be constructed first. The initial data pool contains basic legitimate packets and / or seed anomalous packets. Basic legitimate packets are normal communication data packets that conform to protocol standards and are captured from the network or generated according to protocol specifications, such as a complete TCP three-way handshake packet. Seed anomalous packets are manually defined or known malformed packet samples extracted from historical vulnerability databases, such as packets with checksum errors or illegal combinations of flag bits. The data in the initial data pool together constitute the initial sample space for generative model learning.
[0041] The generative model employs a conditional generative adversarial network (CGAN), comprising two neural networks: a generator and a discriminator. During pre-training, packets from the initial data pool are fed to the discriminator as real samples; simultaneously, the generator receives a random noise vector and a randomly generated condition vector and generates forged packets, which are also fed to the discriminator. The discriminator attempts to distinguish between real and forged packets, while the generator attempts to generate packets capable of deceiving the discriminator. Through this adversarial training, the generator gradually learns to introduce controllable anomalies in specified fields based on the condition vector while maintaining the basic protocol structure of the packets. After pre-training, the generator possesses the ability to generate corresponding packets based on the condition vector. Simultaneously, the reinforcement learning agent needs to be initialized, including defining its state space, action space, and reward function, and randomly initializing the parameters of the policy network to prepare for subsequent online learning.
[0042] In step S120, the reinforcement learning agent can select the next generation direction based on the current test state. The current test state is a multi-dimensional feature vector used to describe the real-time situation of the test environment. For example, it may include the slope of coverage changes in recent rounds of testing, the statistical distribution of various crash events, the identifiers of frequently tested code modules, and the information entropy of the currently generated batch of messages. This state information comprehensively reflects the speed of test progress, the quality of vulnerability discovery, the distribution of test focus, and the diversity of generated messages.
[0043] The action space defines the range of decisions an agent can make. In this application, the action space adopts a hybrid discrete and continuous design, comprising three parts: target field selection, mutation type selection, and mutation intensity. Target field selection is a discrete action, choosing one or more fields from multiple fields in the protocol header, such as source port, destination port, sequence number, acknowledgment number, flags, window size, and checksum, as the key field for mutation in this round. Mutation type selection is also a discrete action, defining how to distort the selected field, specifically including boundary values, random byte flips, dependency field conflicts, state machine violations, and other methods. Mutation intensity is a continuous action, ranging from 0 to 1, used to control the degree of distortion, such as the proportion of byte flips or the multiple of numerical out-of-bounds errors. The agent's policy network outputs the probability distribution of actions based on the current state and samples it to select a specific action to execute.
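Sampling from such a hybrid action space might look like the sketch below. It assumes the policy network has already produced a probability vector for each discrete choice and a mean for the intensity; the field names, the mutation type labels, and the Gaussian sampling of the continuous part are illustrative assumptions, not details given in this application.

```python
import random

# Illustrative names based on the examples in the text (assumptions).
TARGET_FIELDS = ["src_port", "dst_port", "seq", "ack", "flags", "window", "checksum"]
MUTATION_TYPES = ["boundary_value", "random_byte_flip",
                  "dependency_conflict", "state_machine_violation"]

def sample_action(field_probs, type_probs, intensity_mean,
                  intensity_std=0.1, rng=random):
    """Sample one hybrid action from assumed policy-network outputs."""
    # Discrete parts: draw one index per categorical distribution.
    field = rng.choices(TARGET_FIELDS, weights=field_probs, k=1)[0]
    mutation = rng.choices(MUTATION_TYPES, weights=type_probs, k=1)[0]
    # Continuous part: Gaussian draw clipped to the [0, 1] range from the text.
    intensity = min(1.0, max(0.0, rng.gauss(intensity_mean, intensity_std)))
    return {"field": field, "mutation": mutation, "intensity": intensity}
```

Passing a seeded `random.Random` instance as `rng` makes a test run reproducible.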
[0044] In step S130, after the reinforcement learning agent selects an action to perform, it needs to convert the action into a condition vector that the generative model can understand. The encoding process includes: for discrete information in the action, such as the selected target field and mutation type, converting it into a one-hot encoded or embedded vector; for continuous information in the action, i.e., mutation intensity, normalizing it. Then, the encoded discrete and / or continuous information is concatenated to form a fixed-dimensional condition vector.
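The encoding described above can be illustrated with one-hot vectors. In this sketch the field list, the mutation type list, and the function names are assumptions chosen for illustration; an embedding vector could be used in place of the one-hot encoding.

```python
# Illustrative vocabularies (assumptions, based on the examples in the text).
TARGET_FIELDS = ["src_port", "dst_port", "seq", "ack", "flags", "window", "checksum"]
MUTATION_TYPES = ["boundary_value", "random_byte_flip",
                  "dependency_conflict", "state_machine_violation"]

def one_hot(index, size):
    """Return a list of `size` floats with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def encode_action(field, mutation, intensity, lo=0.0, hi=1.0):
    """Concatenate the encoded discrete parts with the normalized intensity."""
    clipped = min(max(intensity, lo), hi)
    normalized = (clipped - lo) / (hi - lo)     # min-max normalization
    return (one_hot(TARGET_FIELDS.index(field), len(TARGET_FIELDS))
            + one_hot(MUTATION_TYPES.index(mutation), len(MUTATION_TYPES))
            + [normalized])

# The dimension is fixed: 7 fields + 4 mutation types + 1 intensity = 12.
vector = encode_action("flags", "state_machine_violation", 0.7)
```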
[0045] This condition vector, along with random noise, is input into the pre-trained generative model. The generator in the generative model receives these two inputs, performs forward computation through a neural network, and outputs a complete message byte sequence. Because the condition vector encodes the desired features of the action, the generated message focuses on the specified target field, distorting it according to the specified mutation type and intensity, while maintaining the basic structure of other fields. For example, if the action specifies the flags field as the target field, a state machine violation as the mutation type, and an intensity of 0.7, the generated message might contain an illegal combination in which SYN, ACK, and RST are set simultaneously.
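To make the flag example concrete, the sketch below builds the TCP flags byte for the SYN, ACK, and RST combination using the standard TCP flag bit positions. The `looks_illegal` check is a deliberately simplified heuristic introduced here for illustration; it is not a rule stated in this application.

```python
# Standard TCP flag bit values in the flags byte of the TCP header.
TCP_FLAGS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04,
             "PSH": 0x08, "ACK": 0x10, "URG": 0x20}

def flags_byte(*names):
    """Combine the named flags into a single flags byte."""
    value = 0
    for name in names:
        value |= TCP_FLAGS[name]
    return value

def looks_illegal(value):
    """Simplified heuristic (assumption): SYN together with RST or FIN."""
    syn = bool(value & TCP_FLAGS["SYN"])
    rst = bool(value & TCP_FLAGS["RST"])
    fin = bool(value & TCP_FLAGS["FIN"])
    return syn and (rst or fin)

malformed = flags_byte("SYN", "ACK", "RST")   # 0x02 | 0x10 | 0x04 == 0x16
normal = flags_byte("SYN", "ACK")             # second handshake step, 0x12
```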
[0046] In step S140, a batch of malformed packets generated in step S130 are sent to the target under test via the network interface. The target under test can be a protocol stack on a real device or a software implementation in a simulated environment. The sending method varies depending on the test interface; for example, TCP or UDP packets can be sent to a specified port of the target device via a raw socket.
[0047] While sending messages, various techniques can be used to monitor the execution process of the target under test in order to collect comprehensive feedback information. For example, instrumentation techniques can be used to record code coverage in real time to determine which basic blocks are newly executed; debuggers or sandboxing techniques can be used to capture abnormal events in the target program, including call stack information at the time of crash; and the execution path depth and test time of each test can also be recorded. This feedback information together constitutes the basic data for evaluating the effectiveness of this round of testing.
[0048] In step S150, after obtaining feedback information, the reward value for this round of testing needs to be calculated to provide learning signals for the reinforcement learning agent. The reward value is calculated using a hierarchical multi-objective function, including multiple components: the coverage reward component assigns different weights based on whether the newly covered code region is located in a preset key function, such as the core processing function of the protocol stack; newly covered core functions receive high scores, while newly covered ordinary regions receive basic scores; the crash severity reward component assesses the risk level based on the crash call stack; if the crash involves memory operation functions, it is judged as high-risk and given a high score; if it is located within a protocol state machine function, it receives a medium score; and ordinary crashes receive a low score; the execution path depth reward component is calculated based on the depth of the execution path to encourage exploration of deeper logic; and the cost penalty component is calculated based on the test time to control resource consumption. These components are weighted and summed to obtain the comprehensive reward value for this round of testing.
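The weighted sum described above might be sketched as follows. All numeric scores and weights, the `Feedback` structure, and the crash category labels are assumptions chosen for illustration; the application does not specify concrete values.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Feedback:
    new_core_blocks: int        # newly covered blocks inside preset key functions
    new_ordinary_blocks: int    # newly covered blocks elsewhere
    crash_kind: Optional[str]   # "memory", "state_machine", "other", or None
    path_depth: int             # execution path depth of this test
    duration_s: float           # test duration in seconds

# Assumed scores: high for memory-related crashes, medium for state machine, low otherwise.
CRASH_SCORES = {"memory": 10.0, "state_machine": 5.0, "other": 1.0}

def compute_reward(fb, w_cov=1.0, w_crash=1.0, w_depth=0.1, w_cost=0.5):
    """Weighted sum of coverage, crash severity, and depth minus a cost penalty."""
    coverage = 3.0 * fb.new_core_blocks + 1.0 * fb.new_ordinary_blocks
    crash = CRASH_SCORES.get(fb.crash_kind, 0.0)   # no crash contributes 0
    depth = float(fb.path_depth)
    cost = fb.duration_s
    return w_cov * coverage + w_crash * crash + w_depth * depth - w_cost * cost
```

For instance, two new core blocks, three new ordinary blocks, a memory-related crash, depth 10, and a 2 second run give 9 + 10 + 1 - 1 = 19 under these assumed weights.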
[0049] Then, update the current test state. Based on the feedback from this round of testing, recalculate each dimension in the state vector, such as updating the coverage change slope, updating the crash statistics distribution, updating the module identifiers for the current key tests, and calculating the information entropy of the messages generated in this round, to form a new state as the input for the next round of testing.
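Two of these state dimensions can be computed with a few lines of standard-library code. The sketch below assumes that the "information entropy" is the Shannon entropy of the byte distribution across the generated messages and that the coverage slope is the average per-round increase over a recent window; both interpretations are assumptions, since the application does not define the formulas.

```python
import math
from collections import Counter

def byte_entropy(messages):
    """Shannon entropy (bits per byte) over all bytes in a batch of messages."""
    counts = Counter(b for msg in messages for b in msg)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def coverage_slope(history):
    """Average coverage increase per round over the given window."""
    if len(history) < 2:
        return 0.0
    return (history[-1] - history[0]) / (len(history) - 1)
```

A batch of identical bytes yields an entropy of 0, which would signal low message diversity to the agent.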
[0050] When updating the policy network of a reinforcement learning agent, the experience data from the current test round, including the state before the update, the selected action, the calculated reward value, and the state after the update, is first stored in an experience replay buffer. When the amount of experience in the buffer reaches a certain threshold, a batch of experience data is randomly sampled from it. The loss function is calculated using a reinforcement learning algorithm, and the parameters of the policy network are updated through backpropagation, enabling the agent to gradually learn to select actions that can obtain higher rewards.
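A minimal experience replay buffer along these lines could be written as below. The class name, the capacity, and the method names are illustrative assumptions; the loss computation is omitted because it depends on the reinforcement learning algorithm chosen.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity=10000):
        # A deque with maxlen drops the oldest experience when full.
        self.items = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.items.append((state, action, reward, next_state))

    def ready(self, min_size):
        """True once enough experience has accumulated to start training."""
        return len(self.items) >= min_size

    def sample(self, batch_size, rng=random):
        """Draw a random batch without replacement."""
        return rng.sample(list(self.items), batch_size)
```

Random sampling breaks the correlation between consecutive test rounds, which is the usual motivation for replay buffers.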
[0051] In step S160, during the testing process, the system continuously filters out packets with high testing value. Specifically, a reward value threshold can be set, and packets with reward values exceeding this threshold in each round of testing are marked as high-value packets. These packets typically trigger new code coverage or cause crashes, representing new attack knowledge learned autonomously by the system. These high-value packets and their associated testing context information are collected to form a high-value packet library.
[0052] At regular intervals (e.g., every 100 test rounds), the system updates the initial data pool using messages from the high-value message library, creating an updated data pool. This update can be achieved by adding high-value messages to the original data pool according to a preset ratio, replacing or supplementing existing seed anomalous data. Then, the updated data pool is used to retrain or fine-tune the generative model, enabling the generator to learn more feature distributions from high-value messages. After optimization, in subsequent test iterations, when the reinforcement learning agent selects new actions, the generative model can generate more anomalous messages with similar high-value features, thereby continuously improving the targeting and effectiveness of the test messages.
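The filtering and pool update of step S160 might be sketched as follows. Here the "preset ratio" is interpreted as a cap on how many high-value messages are added relative to the current pool size, and duplicates are skipped; both choices, like the function names, are assumptions made for illustration.

```python
def select_high_value(tested, threshold):
    """Keep messages whose reward exceeds the threshold.

    `tested` is a list of (message_bytes, reward) pairs.
    """
    return [msg for msg, reward in tested if reward > threshold]

def update_pool(pool, high_value, ratio=0.2):
    """Return a new pool with at most ratio * len(pool) new messages added."""
    limit = max(1, int(len(pool) * ratio))
    seen = set(pool)
    added = []
    for msg in high_value:
        if msg not in seen and len(added) < limit:
            seen.add(msg)
            added.append(msg)
    return pool + added
```

The returned pool would then be used to retrain or fine-tune the generative model.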
[0053] In the implementation of the above embodiments: By using a reinforcement learning agent to select actions based on the test state, the generation direction of messages can be focused on the most valuable test areas, reducing the resource waste of blind randomness in traditional methods. By encoding the actions as conditional vectors and inputting them into the generative model, targeted message generation is achieved, and the generated messages accurately reflect the distortion characteristics desired by the agent. By calculating reward values based on feedback information and updating the policy network, the agent can continuously learn from test experience and gradually optimize its decision-making capabilities. By selecting high-value messages from the testing process to update the data pool and using the updated data pool to optimize the generative model, the generative model can continuously evolve and generate more messages that can trigger deeper problems. This also improves the relevance and diversity of test messages and enhances the overall effect of fuzz testing.
[0054] Optionally, in this embodiment, the generative model is a conditional generative adversarial network, which includes a generator and a discriminator; the generator is used to generate messages based on the input conditional vector and random noise, and the discriminator is used to determine the authenticity of the messages.
[0055] The generator is a neural network whose input consists of two parts: a random noise vector and a condition vector; the output is a complete sequence of message bytes. The condition vector guides the generator to produce messages with specific characteristics, such as requiring the generated messages to modify flag fields or adopt a certain mutation type. The random noise ensures the diversity of the generated messages; even with the same condition vector, different noises will produce different messages. The discriminator is another neural network whose input is a message, and whose output is the probability or score that the message is genuine, used to determine whether the message comes from the real data pool or is forged by the generator.
[0056] The construction of the initial data pool is fundamental to pre-training. The initial data pool contains legitimate packets and seed anomalous packets. Legitimate packets are normal communication data packets that conform to protocol standards, captured from the network or generated according to protocol specifications, such as complete TCP three-way handshake packets and normal data transmission packets. Their purpose is to allow the generator to learn the normal structure of packets and protocol specifications. Seed anomalous packets are manually defined or known malformed packet samples extracted from historical vulnerability databases, such as packets with checksum errors, packets with illegal combinations of flag bits, and packets with overflowing length fields. Their purpose is to serve as initial bad examples, guiding the generator to learn anomalous patterns.
[0057] The generative model is pre-trained using an initial data pool, including: adversarial training of the generator and discriminator using legitimate packets and seed anomalous packets from the initial data pool.
[0058] The pre-training process employs an adversarial training approach. In each training round, a batch of real packets is randomly sampled from the initial data pool as positive samples input to the discriminator. The generator receives randomly generated noise vectors and randomly generated condition vectors, generating a batch of forged packets to input to the discriminator. The discriminator attempts to distinguish between real and forged input packets and outputs the judgment result; the generator attempts to deceive the discriminator as much as possible, making it unable to distinguish between them. Through this adversarial game, the discriminator continuously improves its discrimination ability, and the generator continuously improves the quality of its generated packets. Specifically, an alternating optimization approach is used during training: first, the generator parameters are fixed, and the discriminator parameters are updated to better distinguish between real and forged packets; then, the discriminator parameters are fixed again, and the generator parameters are updated to make the generated packets more deceptive to the discriminator. After multiple rounds of iterative training, the generator gradually learns to introduce controllable anomalies in specified fields based on the condition vector while maintaining the basic protocol structure of the packets, thus gaining the ability to generate corresponding packets based on the condition vector. After pre-training, the generator can be used for packet generation tasks in subsequent steps.
[0059] In the implementation of the above embodiments: through adversarial training between the generator and discriminator, the generative model learns the basic structure of legitimate messages and the abnormal patterns of seed abnormal messages from the initial data pool, making subsequent messages more compliant with requirements. Based on this, combined with the decision-making guidance of the reinforcement learning agent, the generated messages accurately reflect the desired distortion characteristics, avoiding the blindness of random generation in traditional methods. By calculating reward values based on feedback information and updating the policy network, the agent's decision-making ability continuously improves during testing, enabling more targeted selection of generation directions.
[0060] Optionally, in this embodiment, when the reinforcement learning agent makes a decision, it needs to input the current test state to perceive the real-time situation of the test environment. The current test state is a multi-dimensional feature vector used to describe the system's test progress and quality distribution over a past period. The current test state may include at least one of the following four pieces of information: coverage change vector, mutation field heatmap, depth reward decay factor, and exploration degree. The specific composition can be selected or combined according to the test objective.
[0061] The coverage change vector reflects the coverage trend of the test target code. In practice, instrumentation techniques are used to monitor the execution of the test target code in real time, recording the execution count of each basic block or function. At fixed intervals of test rounds, the newly covered code areas are counted, and the change in coverage is calculated. The coverage changes from the most recent rounds are combined into a vector, for example, the slope of coverage growth over the last 5 rounds. The coverage change vector reflects the speed of test progress; if the slope continues to decrease, it indicates that the test may have reached a bottleneck, requiring the agent to adjust its strategy and explore new directions.
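As an illustrative, non-limiting sketch of the coverage change vector, the following Python code derives per-round coverage changes and a growth slope over the most recent 5 rounds. The function names, the use of cumulative covered-block counts as input, and the least-squares fit are assumptions made for illustration; the embodiment does not prescribe a specific slope calculation.

```python
from typing import List


def coverage_deltas(cumulative: List[int]) -> List[int]:
    """Newly covered blocks per round, derived from cumulative coverage counts."""
    return [b - a for a, b in zip(cumulative, cumulative[1:])]


def coverage_slope(cumulative: List[int], window: int = 5) -> float:
    """Least-squares slope of cumulative coverage over the last `window` rounds.

    The least-squares fit is an assumption; any trend estimate could be used.
    """
    ys = cumulative[-window:]
    n = len(ys)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(ys))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Hypothetical cumulative counts of covered basic blocks after each round.
history = [100, 140, 165, 175, 180, 182]
deltas = coverage_deltas(history)   # shrinking deltas suggest a bottleneck
slope = coverage_slope(history)
```

A steadily shrinking slope across successive windows would correspond to the bottleneck condition described above.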
[0062] Mutation field heatmaps are used to record the frequency with which mutations of various protocol fields produce new effects. In implementation, a list of key fields for the protocol under test is first defined. For example, for TCP, this could include source port, destination port, sequence number, acknowledgment number, flags, window size, checksum, and urgent pointer. After each test, the fields that were the focus of the mutation are recorded, and it is determined whether the mutation produced positive feedback. A vector of the same length as the field list is maintained, with each element corresponding to the cumulative number or frequency of effective mutations for a field, forming the mutation field heatmap. The mutation field heatmap can reflect which fields are currently high-value attack surfaces, guiding the agent to focus on these fields.
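One possible way to maintain the mutation field heatmap is sketched below in Python. The class name, method names, and the choice to report the ratio of effective mutations to attempts are assumptions for illustration; the embodiment only requires a vector of the same length as the field list holding a cumulative count or frequency.

```python
TCP_FIELDS = [
    "source_port", "destination_port", "sequence_number",
    "acknowledgment_number", "flags", "window_size",
    "checksum", "urgent_pointer",
]


class MutationFieldHeatmap:
    """Tracks how often mutating each field produced positive feedback."""

    def __init__(self, fields):
        self.fields = list(fields)
        self.effective = [0] * len(self.fields)  # mutations with positive feedback
        self.attempts = [0] * len(self.fields)   # all mutations of the field

    def record(self, mutated_fields, positive_feedback):
        """Update the counters after one test."""
        for name in mutated_fields:
            i = self.fields.index(name)
            self.attempts[i] += 1
            if positive_feedback:
                self.effective[i] += 1

    def as_vector(self):
        """Frequency of effective mutations per field (0.0 when untested)."""
        return [e / a if a else 0.0 for e, a in zip(self.effective, self.attempts)]


heatmap = MutationFieldHeatmap(TCP_FIELDS)
heatmap.record(["flags"], positive_feedback=True)
heatmap.record(["flags", "checksum"], positive_feedback=False)
vector = heatmap.as_vector()
```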
[0063] The depth reward decay factor measures the execution depth reached by the current test, reflecting whether the test has touched upon the deeper logic of the code. In practice, during each test, the function call stack depth at the time of execution of the target under test is obtained through instrumentation or debugging techniques. The average call stack depth of the current test round or the most recent several rounds is calculated and compared with the historical maximum depth to obtain the depth reward decay factor, whose value is typically between 0 and 1. A larger factor indicates that the current test is closer to the historical deepest path, and the agent should continue to explore deeper logic; a smaller factor may indicate that the test is superficial, requiring an adjustment of the strategy to push deeper.
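A minimal Python sketch of this ratio is given below, assuming the factor is the average recent call stack depth divided by the historical maximum depth and clamped to 1. The function name and the clamping are assumptions for illustration.

```python
def depth_decay_factor(recent_depths, historical_max_depth):
    """Ratio of average recent call stack depth to the historical maximum.

    Returns a value in [0, 1]; 0.0 is returned when no data is available.
    """
    if not recent_depths or historical_max_depth <= 0:
        return 0.0
    average = sum(recent_depths) / len(recent_depths)
    return min(average / historical_max_depth, 1.0)


# Hypothetical depths from the most recent rounds, historical maximum of 40.
factor = depth_decay_factor([10, 20, 30], 40)
```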
[0064] The exploration degree is used to assess the diversity of currently generated messages, preventing the agent from getting trapped in local optima and repeatedly generating similar messages. In practice, statistical analysis is performed on each batch of generated malformed messages to calculate their information entropy. Information entropy can be measured based on the distribution of message byte values, the distribution of field values, or the degree of variation in message structure. A higher exploration degree indicates better diversity in generated messages, suggesting the system is still conducting extensive exploration; a lower exploration degree indicates that generated messages tend to be homogeneous, requiring the agent to be encouraged to try new combinations of actions.
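By way of example only, the exploration degree could be computed from the byte-value distribution of a batch using Shannon entropy, as in the Python sketch below. The function names and the normalization by 8 bits (the maximum entropy of a byte) are assumptions; entropy over field values or message structure would follow the same pattern.

```python
import math
from collections import Counter


def byte_entropy(messages):
    """Shannon entropy (bits per byte) of byte values across a batch of messages."""
    counts = Counter(b for msg in messages for b in msg)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def exploration_degree(messages):
    """Entropy normalized to [0, 1] by the 8-bit maximum (an assumed scaling)."""
    return byte_entropy(messages) / 8.0


# A batch of identical bytes has zero entropy; varied bytes score higher.
homogeneous = [b"\x00\x00\x00\x00"]
varied = [b"\x00\x01", b"\x02\x03"]
```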
[0065] As one implementation, this state information can be achieved by integrating a data collection module into the fuzzing framework. The coverage change vector is calculated and updated in real-time by the coverage monitoring module; the mutation field heatmap is recorded by the feedback analysis module, which records the effective mutation fields for each test and maintains a cumulative vector; the depth reward decay factor is calculated by the execution path tracing module, which obtains the call stack depth and calculates the ratio; and the exploration degree is calculated by the message analysis module, which calculates the information entropy of each batch of generated messages. The calculated values are combined into a multi-dimensional state vector, which serves as the input to the reinforcement learning agent's policy network.
[0066] In the implementation of the above embodiments: by constructing the current test state from the coverage change vector, mutation field heatmap, depth reward decay factor, and exploration degree, the reinforcement learning agent can perceive dynamic changes in the test environment and select the action to be executed according to this state information, so that the message generation direction keeps up with the test progress, focuses on high-value fields and deep paths, and maintains the diversity of messages.
[0067] When a reinforcement learning agent makes decisions, the action space defines all possible choices the agent can make. In this embodiment, the action space adopts a design that combines discrete and continuous actions, enabling the agent to finely control the direction of message generation. Discrete actions include two parts, target field selection and mutation type selection, while continuous actions include mutation strength.
[0068] Optionally, the action space includes discrete actions and / or continuous actions; discrete actions include target field selection and mutation type selection, wherein the mutation type is at least one of boundary value, random byte flip, dependency field conflict and state machine violation; continuous actions include mutation strength, which is used to control the degree of message distortion.
[0069] Target field selection refers to the agent choosing one or more fields from multiple fields in the protocol header as the key mutation fields for this round. Taking the TCP protocol as an example, the selectable fields include source port, destination port, sequence number, acknowledgment number, data offset, flags, window size, checksum, urgent pointer, etc.
[0070] Mutation type selection refers to how the agent decides to distort a selected target field. Random byte flipping involves randomly flipping bits in the binary representation of a field, such as changing some bits from 0 to 1 or from 1 to 0. This mutation is used to test the tolerance of the target to random noise. Dependency field conflict involves creating logical contradictions between fields, such as setting the data offset field to 5 (indicating a TCP header length of 20 bytes), but actually constructing a 40-byte TCP header in the sent message, causing a discrepancy between the length field and the actual length. This mutation is used to test the robustness of the target in handling dependencies between fields. State machine violation refers to sending messages that do not conform to the protocol state transition rules, such as sending a packet with the ACK flag without completing the three-way handshake, or setting the SYN and FIN flags simultaneously when a connection has been established. This mutation is used to test defects in the protocol state machine implementation.
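The dependency field conflict and state machine violation examples above can be illustrated with the following Python sketch, which packs a TCP header using the standard `struct` module. The function name, port numbers, and padding with NOP option bytes are assumptions for illustration; the checksum is left at zero and no packet is actually sent.

```python
import struct

FIN, SYN = 0x01, 0x02  # TCP flag bits


def build_tcp_header(src_port, dst_port, seq, ack, data_offset_words, flags,
                     window=65535, checksum=0, urgent=0, options=b""):
    """Pack a TCP header; the declared data offset is NOT checked against
    the real header length, so contradictions can be created on purpose."""
    offset_byte = (data_offset_words & 0x0F) << 4  # upper 4 bits hold the offset
    fixed = struct.pack("!HHIIBBHHH", src_port, dst_port, seq, ack,
                        offset_byte, flags, window, checksum, urgent)
    return fixed + options


# Dependency field conflict: data offset 5 claims 20 bytes, header is 40 bytes.
conflict = build_tcp_header(12345, 80, 1, 0, data_offset_words=5, flags=SYN,
                            options=b"\x01" * 20)

# State machine violation: SYN and FIN set in the same segment.
violation = build_tcp_header(12345, 80, 1, 0, data_offset_words=5,
                             flags=SYN | FIN)

declared_length = (conflict[12] >> 4) * 4  # bytes implied by the data offset
actual_length = len(conflict)
```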
[0071] The mutation strength in continuous actions is a continuous numerical value ranging from 0 to 1, used to control the severity of distortion. The specific meaning of mutation strength varies depending on the mutation type: for boundary value mutations, the strength controls the deviation from the boundary value; for example, a strength of 0.5 indicates using an intermediate value instead of an extreme boundary. For random byte flips, the strength controls the proportion of bits flipped; a strength of 0.1 indicates flipping 10% of the bits, and a strength of 0.9 indicates flipping 90% of the bits. For dependency field conflicts, the strength controls the severity of the conflict. For state machine violations, the strength controls the obviousness of the violation. Mutation strength provides agents with finer control granularity, enabling smooth adjustment from slight distortion to severe distortion.
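For the random byte flip case, the relationship between mutation strength and the proportion of flipped bits could be sketched in Python as follows. The function name and the rounding of the bit count are assumptions; a seeded `random.Random` instance is used only to make the example reproducible.

```python
import random


def flip_bits(field: bytes, strength: float, rng: random.Random) -> bytes:
    """Flip a fraction `strength` of the bits in `field`, chosen at random."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    total_bits = len(field) * 8
    n_flip = round(total_bits * strength)
    out = bytearray(field)
    # Sample distinct bit positions so exactly n_flip bits change.
    for pos in rng.sample(range(total_bits), n_flip):
        out[pos // 8] ^= 1 << (pos % 8)
    return bytes(out)


rng = random.Random(0)
original = bytes(10)                       # 80 zero bits
slight = flip_bits(original, 0.1, rng)     # 8 bits flipped
severe = flip_bits(original, 1.0, rng)     # every bit flipped
```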
[0072] One implementation approach is to encode discrete and continuous actions into a unified output format. Discrete actions are typically represented using one-hot encoding or embedded vectors. For example, for selecting a target field with N optional fields, the selected field can be represented by an N-dimensional one-hot vector; for selecting M mutation types, the selected mutation type can be represented by an M-dimensional one-hot vector. Continuous actions directly output a floating-point number between 0 and 1 as the mutation strength. These outputs serve as inputs to the subsequent steps for generating conditional vectors, guiding the generative model to generate malformed messages that conform to the action description.
[0073] In the implementation of the above embodiments: by constructing an action space containing discrete and continuous actions, the reinforcement learning agent can finely control the direction of message generation from multiple dimensions, including target field selection, mutation type selection, and mutation intensity. Target field selection in discrete actions allows the agent to focus on testing specific fields in the protocol; mutation type selection provides various distortion methods to adapt to different testing needs; and mutation intensity in continuous actions allows the agent to adjust the severity of the distortion. Combining the perception and feedback information of the current test state with reward calculation, the agent can dynamically adjust action selection according to the test progress, making the generated messages more targeted in exploring high-value regions.
[0074] Optionally, in this embodiment, the action space includes discrete actions and / or continuous actions; encoding the selected action into a condition vector includes: converting the discrete actions in the executed action into one-hot encoded or embedded vectors, and / or normalizing the continuous actions in the executed action; and concatenating the encoded discrete actions and / or continuous actions to generate a condition vector of a preset dimension.
[0075] After the reinforcement learning agent selects an action to perform based on the current test state, the action needs to be converted into an input form that the generative model can understand, namely a condition vector. Generating the condition vector requires encoding the different types of actions contained in the executed action and concatenating the encoded results into a fixed-dimensional vector.
[0076] First, the discrete actions in the execution process are encoded. Discrete actions include target field selection and mutation type selection. For example, target field selection can be choosing one or more fields from multiple candidate fields, while mutation type selection is choosing one from multiple mutation types (such as boundary value, random byte flip, dependency field conflict, state machine violation). Technically, converting discrete actions into one-hot encoding is a common approach: for single-selection discrete actions, a vector of length equal to the number of options can be created, where the selected position is 1 and the rest are 0. For example, if "boundary value" is selected from 4 mutation types, the one-hot vector corresponding to the mutation type is [1,0,0,0]. For multi-selection discrete actions, such as selecting multiple target fields simultaneously, multi-hot encoding can be used, i.e., multiple positions are 1. Another option is to use embedding vectors, which map discrete actions to low-dimensional dense vectors through an embedding layer. This is often used in deep learning to handle categorical features. Embedding vectors can better express the potential relationships between categories; for example, similar fields can have similar embedding representations.
[0077] Secondly, the continuous actions in the executed action are normalized. The continuous action, namely the mutation strength, already ranges from 0 to 1 and is therefore usually an already normalized value. The purpose of normalization is to make the numerical range of continuous actions consistent with that of the discretely encoded vectors, preventing certain features from dominating the model input due to excessively large magnitudes.
[0078] Finally, the encoded discrete and / or continuous actions are concatenated to generate a condition vector of a preset dimension. Concatenation involves joining multiple vectors end-to-end to form a longer vector. For example, if the target field is encoded with a vector length of N, the mutation type with a vector length of M, and the normalized mutation strength is a scalar, then the length of the concatenated condition vector will be N+M+1. The preset dimension can be determined based on the actual design; for example, it can be set to a fixed 128 or 256 dimensions. If the length of the concatenated vector is insufficient, it can be padded with zeros; if it exceeds the limit, dimensionality reduction can be achieved using a fully connected layer. The generated condition vector will serve as input to the generator in the generative model, used together with random noise to generate malformed messages that conform to the description of the executed actions.
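The encoding and concatenation steps above can be sketched in Python as follows. The field and mutation type identifiers, the function names, and the small preset dimension of 16 are assumptions chosen for readability; the case in which the concatenated vector exceeds the preset dimension would need a learned projection (for example a fully connected layer), which is outside this sketch.

```python
FIELDS = [
    "source_port", "destination_port", "sequence_number",
    "acknowledgment_number", "data_offset", "flags",
    "window_size", "checksum", "urgent_pointer",
]
MUTATION_TYPES = [
    "boundary_value", "random_byte_flip",
    "dependency_field_conflict", "state_machine_violation",
]


def multi_hot(selected, options):
    """1.0 at every selected option, 0.0 elsewhere (multi-selection)."""
    return [1.0 if o in selected else 0.0 for o in options]


def one_hot(selected, options):
    """1.0 at the single selected option, 0.0 elsewhere."""
    if selected not in options:
        raise ValueError(f"unknown option: {selected}")
    return [1.0 if o == selected else 0.0 for o in options]


def encode_condition(target_fields, mutation_type, strength, dim=16):
    """Concatenate encoded actions (length N + M + 1) and zero-pad to `dim`."""
    strength = min(max(strength, 0.0), 1.0)  # keep the strength within [0, 1]
    vec = (multi_hot(target_fields, FIELDS)
           + one_hot(mutation_type, MUTATION_TYPES)
           + [strength])
    if len(vec) > dim:
        raise ValueError("vector exceeds preset dimension; projection needed")
    return vec + [0.0] * (dim - len(vec))


condition = encode_condition(["flags"], "boundary_value", 0.5)
```

With N = 9 fields and M = 4 mutation types, the concatenated length is 14, and two zeros are appended to reach the assumed preset dimension of 16.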
[0079] In the implementation of the above embodiments: through adversarial training between the generator and the discriminator, the generative model learns the basic structure and abnormal patterns of the message from the initial data pool; by constructing the current test state including the coverage change vector, the mutation field heatmap, the depth reward decay factor and the exploration degree, the reinforcement learning agent can fully perceive the dynamic changes of the test environment; by designing an action space including discrete actions and continuous actions, the agent can finely control the generation direction of the message from multiple dimensions such as the target field, mutation type and mutation intensity.
[0080] After the fuzz test is completed, the system obtains feedback information from the tested target. Based on this feedback information, the reward value for this round of testing needs to be calculated to provide learning signals for the reinforcement learning agent. In this step, the reward value is composed of multiple components, each reflecting a different aspect of the test effect. Optionally, in this embodiment, the feedback information includes at least one of code coverage changes, crash types, execution path depth, and test duration. Calculating the reward value for the current test based on the feedback information specifically includes: extracting the code coverage change from the feedback information and calculating the coverage reward component; and / or, determining the crash severity reward component based on the crash type; and / or, calculating the depth reward component based on the execution path depth; and / or, calculating the cost penalty component based on the test duration.
[0081] Code coverage change refers to the number or proportion of newly covered code regions compared to previous tests. Technically, instrumentation techniques are used to monitor the execution of the target code in real time, recording the execution count of each basic block or function. After each test, the code regions executed in this round are compared with historical execution records to calculate the number of newly covered basic blocks or functions, which is the code coverage change. The coverage reward component is calculated based on this change, and a tiered reward system can be used: if the newly covered code is located in a preset critical function region, a multiplied reward is given, such as a base score multiplied by a factor of 3; if the newly covered code is in a regular function region, a base score is given, such as 1 point per newly covered block. This design aims to guide tests to delve deeper into core code regions.
[0082] A crash type refers to the specific manifestation of an anomaly in the target under test, which can be identified from the crash call stack. As one implementation method, when the target under test crashes, the call stack information at the time of the crash is captured using a debugger or sandbox, and the function names contained in the call stack are parsed. Based on whether the call stack contains specific functions, the crash type can be divided into different levels: if the crash call stack contains memory operation functions, it is considered a high-risk crash; if the crash call stack is located within a protocol state machine processing function, it is considered a state machine crash; other cases are considered ordinary crashes. The crash severity reward component is assigned based on the crash type; the higher the crash severity, the higher the reward value. For example, a high-risk crash is given 10 points, a state machine crash is given 5 points, an ordinary crash is given 1 point, and no crash is given 0 points.
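The call stack based classification above could be sketched in Python as follows. The function name sets are hypothetical placeholders (the embodiment does not list specific functions), and the point values follow the example figures given in the text.

```python
# Hypothetical function name sets; a real deployment would define its own.
MEMORY_FUNCTIONS = {"memcpy", "memmove", "strcpy", "malloc", "free"}
STATE_MACHINE_FUNCTIONS = {"tcp_state_process"}

CRASH_REWARD = {"high_risk": 10, "state_machine": 5, "ordinary": 1, "none": 0}


def classify_crash(call_stack):
    """Map a crash call stack (list of function names, or None) to a crash type."""
    if call_stack is None:
        return "none"
    frames = set(call_stack)
    if frames & MEMORY_FUNCTIONS:
        return "high_risk"
    if frames & STATE_MACHINE_FUNCTIONS:
        return "state_machine"
    return "ordinary"


def crash_reward(call_stack):
    """Crash severity reward component for one test."""
    return CRASH_REWARD[classify_crash(call_stack)]


reward = crash_reward(["main", "parse_packet", "memcpy"])
```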
[0083] Execution path depth refers to the function call stack depth or basic block level reached by the program during this round of testing. For example, instrumentation techniques can be used to record each function call and return, calculate the maximum call stack depth during execution, or record the number of basic blocks traversed in the execution path. A larger execution path depth indicates that the test message is more likely to touch upon the program's deeper logic. The depth reward component is calculated based on the execution path depth; for example, it can be calculated using a logarithmic function such as 0.5 × log(current call stack depth + 1). This form ensures that the deeper the exploration, the higher the reward, but the rate of reward growth gradually slows down to avoid over-encouraging unlimited depth exploration.
[0084] Test duration refers to the time consumed from sending a message to the completion of the test. Technically, the system records the start time before sending the message and the end time after collecting feedback, calculating the difference between the two as the test duration. The cost penalty component is calculated based on the test duration; for example, a linear penalty may be applied, such as the current round's duration (in seconds) multiplied by 0.1, with larger penalties for longer execution times. The purpose is to control the consumption of test resources and prevent the agent from choosing generation strategies with excessively high computational complexity.
[0085] The reward value for the current test is calculated by combining the coverage reward component, the crash severity reward component, the depth reward component, and / or the cost penalty component.
[0086] When calculating the reward value for the current test, the above components are summed according to preset weights. The weight coefficients can be adjusted according to the test objectives. For example, if the focus is on discovering high-risk vulnerabilities, the weight of the crash severity reward component can be increased; if the focus is on improving code coverage, the weight of the coverage reward component can be increased. The final reward value will serve as a quantitative evaluation of this round of testing and will be used for subsequent policy network updates of the reinforcement learning agent.
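A minimal Python sketch of the weighted combination is given below. The function names, the default weights of 1.0, and the reading of the example figures as a 0.5 scale on the logarithmic depth term and a 0.1 per-second penalty rate are assumptions for illustration rather than values fixed by the embodiment.

```python
import math


def coverage_reward(new_regular, new_critical, base=1.0, critical_factor=3.0):
    """Base score per regular block; multiplied score per critical block."""
    return base * new_regular + base * critical_factor * new_critical


def depth_reward(call_stack_depth, scale=0.5):
    """Logarithmic reward: grows with depth, at a gradually slowing rate."""
    return scale * math.log(call_stack_depth + 1)


def cost_penalty(duration_seconds, rate=0.1):
    """Linear penalty on the test duration."""
    return rate * duration_seconds


def total_reward(coverage, crash, depth, cost, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of reward components minus the weighted cost penalty."""
    w_cov, w_crash, w_depth, w_cost = weights
    return w_cov * coverage + w_crash * crash + w_depth * depth - w_cost * cost


# Hypothetical round: 2 regular blocks, 1 critical block, a 10-point crash,
# call stack depth 0, and a 2-second test.
r = total_reward(coverage_reward(2, 1), 10.0, depth_reward(0), cost_penalty(2.0))
```

Raising the crash weight relative to the others would bias the agent toward high-risk crashes, while raising the coverage weight would bias it toward new code regions.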
[0087] In the implementation of the above embodiments: by extracting code coverage changes, crash types, execution path depths and test times from the feedback information, and calculating reward values containing multiple components accordingly, the agent can obtain comprehensive and detailed evaluation signals, guiding it to optimize towards covering core code, discovering high-risk vulnerabilities, and exploring deep paths.
[0088] After completing a round of testing and obtaining a reward, the policy network of the reinforcement learning agent needs to be updated using this experience data, enabling the agent to learn from the testing experience and gradually optimize its decision-making capabilities. Optionally, in this embodiment, updating the policy network of the reinforcement learning agent includes: Acquire experience data from the test; experience data includes the state before the update, the selected action, the reward value, and the state after the update; store the experience data in a buffer; use the experience data in the buffer to calculate the loss function through a reinforcement learning algorithm and backpropagate to update the parameters of the policy network.
[0089] Experience data refers to the key information generated in a complete test iteration. The state before the update refers to the current test state used by the reinforcement learning agent for decision-making at the start of this test round, such as the vector composed of the coverage change vector, mutation field heatmap, depth reward decay factor, and exploration degree described in step S310. The selected action refers to the action chosen by the agent based on the state before the update, including target field selection, mutation type selection, and mutation strength. The reward value refers to the quantitative evaluation score of this test round calculated according to step S610. The state after the update refers to the new state recalculated based on feedback information after the completion of this test round, which will serve as the input for the next test round. These four elements together constitute a complete experience tuple, recording the result obtained by the agent after taking a specific action in a specific state.
[0090] The buffer is a data storage structure managed in first-in-first-out (FIFO) order or accessed by random sampling. Each time a test round is completed and experience data is obtained, the experience tuple is added to the buffer. The purpose of experience replay is to break the temporal correlation between experience data, preventing the agent from learning in a biased manner from only the most recent consecutive experiences.
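As a non-limiting illustration, a buffer of this kind could be sketched in Python with a bounded `collections.deque` for FIFO eviction and uniform random sampling. The class name, method names, and capacity are assumptions for illustration.

```python
import random
from collections import deque, namedtuple

# One tuple per test round: state before, action, reward, state after.
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])


class ReplayBuffer:
    """FIFO-bounded buffer with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self._items = deque(maxlen=capacity)  # oldest entries are evicted first
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self._items.append(Experience(state, action, reward, next_state))

    def ready(self, min_size):
        """True once enough experience has accumulated to start updating."""
        return len(self._items) >= min_size

    def sample(self, batch_size):
        """Random batch, which breaks the temporal order of the experiences."""
        return self._rng.sample(list(self._items), batch_size)

    def __len__(self):
        return len(self._items)


buffer = ReplayBuffer(capacity=3, seed=0)
for round_index in range(5):
    buffer.add([round_index], "action", float(round_index), [round_index + 1])
batch = buffer.sample(2)
```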
[0091] The parameters of the policy network are then updated using experience data from the buffer. This update process can be triggered every fixed number of test rounds or when the buffer has accumulated sufficient samples. During an update, a batch of experience data is randomly sampled from the buffer, and the sampled state before the update is input into the current policy network to calculate the action probability distribution or action value output by the network. Simultaneously, based on the reward value and the state after the update, a reinforcement learning algorithm, such as the Bellman equation in the DQN algorithm or the advantage function in the PPO algorithm, is used to calculate the target value. Then, the loss function between the policy network's output and the target value is calculated, such as mean squared error loss or cross-entropy loss. The gradient of the loss function with respect to the network parameters is calculated using the backpropagation algorithm, and the optimizer is used to update the policy network's parameters, gradually bringing the network's output closer to the target value. After multiple such updates, the policy network gradually learns to select actions that yield higher rewards in different states.
[0092] In the implementation of the above embodiments: by acquiring experience data including the state before the update, the selected action, the reward value, and the state after the update and storing it in a buffer, and then using the experience data in the buffer to update the policy network through a reinforcement learning algorithm, the agent can continuously learn from the test experience and improve its decision-making ability.
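For the DQN-style case, the target value and mean squared error loss could be sketched in Python as follows. This covers only the target and loss computation for discrete actions; the discount factor `gamma`, the callable `q_next` standing in for the network's action value output, and the function names are assumptions, and the gradient step itself is omitted.

```python
def td_targets(batch, q_next, gamma=0.99):
    """Bellman targets r + gamma * max_a' Q(s', a') for a sampled batch.

    batch:  iterable of (state, action, reward, next_state) tuples
    q_next: callable returning a list of action values for a state
    """
    return [reward + gamma * max(q_next(next_state))
            for (_state, _action, reward, next_state) in batch]


def mse_loss(predictions, targets):
    """Mean squared error between predicted action values and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Hypothetical stand-in for the network: fixed action values for any state.
def q_next(_state):
    return [2.0, 4.0]


batch = [([0.1], 0, 1.0, [0.2])]
targets = td_targets(batch, q_next, gamma=0.5)  # 1.0 + 0.5 * 4.0
loss = mse_loss([2.0], targets)
```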
[0093] Optionally, in this embodiment of the application, the generated message is sent to the target under test for testing, and feedback information is obtained, including: The message is sent to the protocol stack of the device under test through the network interface; the execution process of the target under test is monitored by instrumentation, debugger or sandbox technology to obtain feedback information.
[0094] First, determine the communication method of the target device. For example, the device under test might be an embedded device running a TCP / IP protocol stack, or a host running network services. The system constructs raw sockets based on the test protocol type, such as TCP, UDP, or ICMP, or encapsulates messages into data link layer or network layer packets using a standard network programming interface. These packets are then sent via the network interface card (NIC) to the IP address and specified port of the device under test. The sending rate and batch size can be controlled according to test requirements, for example, sending 100 messages per test round.
[0095] Instrumentation, debuggers, or sandboxing techniques are used to monitor the execution process of the target under test and obtain feedback information. Instrumentation refers to inserting additional monitoring instructions into the executable code of the target under test to record code execution paths and coverage information in real time. Specifically, dynamic binary instrumentation tools can be used to insert monitoring code during the target program's runtime. Debugging techniques involve attaching to the target process through a debugging interface to capture program exceptions and obtain call stack information at the time of a crash.
[0096] Sandbox technology refers to running the target under test in an isolated environment and monitoring its system calls, memory accesses, and other behaviors through virtualization or container technology. Feedback collected during monitoring includes changes in code coverage, crash types, execution path depth, and test duration. This information serves as the basis for subsequent reward calculations and status updates.
[0097] In the implementation of the above embodiments: the generated messages are accurately sent to the protocol stack of the device under test via the network interface, enabling the test messages to be normally received and processed by the target program; simultaneously, instrumentation, debugger, or sandbox techniques are used to comprehensively monitor the execution process of the target under test, enabling real-time capture of key feedback information such as changes in code coverage, program crash events, execution path depth, and test time. This feedback information provides an accurate data foundation for subsequent reward value calculation and state updates, allowing the reinforcement learning agent to learn and optimize based on real test results.
[0098] Optionally, in this embodiment of the application, filtering messages from the testing process to update the data pool includes: selecting, from the testing process, messages whose reward values are higher than a threshold, and updating the data pool based on the selected messages to generate the updated data pool.
[0099] A pre-defined reward threshold is obtained. This threshold can be a fixed value or a dynamically adjusted value. For dynamically adjusted values, for example, the top 20% quantile can be used based on the historical reward distribution. After each round of testing, all packets generated in that round are compared with this threshold based on the calculated reward value, and packets with reward values higher than the threshold are selected. These packets usually have high testing value, such as triggering new code coverage areas, causing crashes, or having a large execution path depth, representing new attack knowledge learned autonomously by the system.
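The dynamic threshold and the filtering step above can be sketched as follows. The function names and the exact way the top 20% boundary is computed are assumptions made for illustration; the text above only fixes the idea of comparing each reward value against a threshold derived from the historical reward distribution.

```python
def dynamic_threshold(history, top_fraction=0.2):
    """Return the reward at the lower boundary of the top `top_fraction`
    of the historical rewards."""
    if not history:
        raise ValueError("reward history is empty")
    ranked = sorted(history, reverse=True)
    keep = max(1, int(len(ranked) * top_fraction))
    return ranked[keep - 1]


def select_high_value(round_results, threshold):
    """Keep (message, reward) pairs whose reward is strictly higher
    than the threshold."""
    return [(msg, reward) for msg, reward in round_results if reward > threshold]


history = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
threshold = dynamic_threshold(history)
selected = select_high_value([(b"a", 9.5), (b"b", 9.0), (b"c", 3.0)], threshold)
```

A fixed threshold is the degenerate case: skip `dynamic_threshold` and pass a constant to `select_high_value`.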
[0100] The selected messages are referred to as high-value messages, and the system collects and stores them in a temporary high-value message library. At a set interval of test rounds (e.g., every 100 rounds), the initial data pool is updated using messages from the high-value message library, generating an updated data pool. The update can be achieved by: adding high-value messages to the initial data pool according to a preset ratio, replacing or supplementing existing seed anomaly data; or by weighting the samples in the data pool according to each message's reward value, so that messages with higher rewards have a greater probability of being sampled in subsequent training. The updated data pool contains more message samples that can trigger deeper problems, providing richer learning material for subsequent generative model optimization.
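Both update options above (adding by a preset ratio, and reward-weighted sampling) can be sketched as follows. The helper names, the 10% ratio, and the weight floor used to keep sampling weights positive are assumptions for illustration only.

```python
import random


def merge_into_pool(pool, high_value, ratio=0.1):
    """Append the highest-reward messages to the pool, limited to
    ratio * len(pool) entries. Both arguments are lists of
    (message, weight) pairs."""
    budget = max(1, int(len(pool) * ratio))
    best = sorted(high_value, key=lambda pair: pair[1], reverse=True)[:budget]
    return pool + best


def sample_training_batch(pool, k, rng=random):
    """Draw k messages with probability proportional to their weight.
    Weights are floored at a small positive value (an assumption) because
    rewards may be zero or negative."""
    messages = [msg for msg, _ in pool]
    weights = [max(weight, 1e-6) for _, weight in pool]
    return rng.choices(messages, weights=weights, k=k)


seed_pool = [(bytes([i]), 1.0) for i in range(20)]
library = [(b"hv1", 4.0), (b"hv2", 12.0), (b"hv3", 7.5)]
updated_pool = merge_into_pool(seed_pool, library)
training_batch = sample_training_batch(updated_pool, 8, random.Random(0))
```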
[0101] In the implementation of the above embodiments: packets with reward values higher than a threshold are selected from the testing process. The data pool is then updated based on these high-value packets, enabling it to continuously absorb new attack knowledge learned by the system during testing. The updated data pool is used to subsequently optimize the generative model, allowing the model to generate more packets with similar high-value characteristics.
[0102] Please refer to Figure 2, which is a schematic diagram of a system workflow provided in an embodiment of this application.
[0103] In an optional embodiment, the overall operating mechanism of this application is a closed-loop system from initialization to continuous iteration. The core objective is to continuously generate high-value malformed messages for fuzz testing by leveraging the decision-making guidance of reinforcement learning agents and the message generation capabilities of generative models.
[0104] When the system starts, it first constructs an initial data pool containing legitimate messages and seed abnormal messages. It then uses this data pool to pre-train a conditional generative adversarial network, enabling the generator to generate messages based on conditional vectors. At the same time, it initializes the policy network of the reinforcement learning agent and defines its state space, action space, and reward function.
[0105] After entering the test iteration, each round of the loop runs according to the following process: The reinforcement learning agent first obtains the current test state, which includes multi-dimensional information such as coverage change vector, mutation field heatmap, depth reward decay factor, and exploration degree, comprehensively reflecting the test progress and quality distribution. Based on this state information, the agent selects an action to execute from its action space. This action includes decisions in three dimensions: target field selection, mutation type selection, and mutation strength, accurately describing the desired message characteristics.
[0106] The system then encodes the selected actions: discrete actions are converted into one-hot codes or embedding vectors, and continuous actions are normalized. The encoded results are then concatenated into a conditional vector of a preset dimension. This conditional vector, along with random noise, is input into a pre-trained generative model, which then outputs a batch of malformed messages that match the action descriptions.
[0107] The generated messages are sent to the protocol stack of the device under test via the network interface for testing. During the test, the system uses instrumentation, debugger, or sandboxing techniques to comprehensively monitor the execution process of the target under test, and collect feedback information such as changes in code coverage, crash types, execution path depth, and test time.
[0108] Based on this feedback, the system calculates the reward value for this round of testing. The reward value is composed of multiple components: a coverage reward component based on the change in code coverage, a crash severity reward component based on the crash type, a depth reward component based on the execution path depth, and a cost penalty component based on the test duration. These components are weighted and summed to obtain the final reward value, which quantitatively evaluates the effectiveness of this round of testing.
[0109] Based on the reward value and test feedback, the system updates the current test state, forming a new state vector for the next round of decision-making. Simultaneously, the experience data from this round of testing (including the state before the update, the selected action, the reward value, and the state after the update) is stored in an experience replay buffer. At a set interval of rounds, the system randomly samples a batch of experience data from the buffer, calculates the loss function using a reinforcement learning algorithm, and backpropagates it to update the parameters of the policy network, enabling the agent to gradually learn to select actions that yield higher rewards.
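The experience replay buffer described above can be sketched as follows. The class name, the capacity, and the eviction of the oldest entries are assumptions; the text above only requires that the four-element experience data be stored and randomly sampled in batches. The loss computation and policy network update are not shown.

```python
import random
from collections import deque, namedtuple

# One piece of experience data: state before the update, selected action,
# reward value, and state after the update.
Transition = namedtuple("Transition", "state action reward next_state")


class ReplayBuffer:
    def __init__(self, capacity=10000):
        # The oldest experience is dropped once capacity is reached.
        self._items = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self._items.append(Transition(state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        """Return a random batch without replacement."""
        size = min(batch_size, len(self._items))
        return rng.sample(list(self._items), size)

    def __len__(self):
        return len(self._items)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.push(state=step, action=0, reward=float(step), next_state=step + 1)
minibatch = buffer.sample(2, random.Random(0))
```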
[0110] The system continuously filters messages with reward values exceeding a threshold from the testing process as high-value messages, periodically adding these messages to the data pool to generate an updated data pool. The updated data pool is then used to retrain or fine-tune the generative model, enabling the generator to learn more feature distributions from high-value messages.
[0111] After optimization, the generative model, in subsequent test iterations, can generate more malformed messages with similar high-value characteristics when the reinforcement learning agent chooses a new action. This process repeats continuously, allowing the system to learn and evolve during testing. The generated messages become increasingly targeted and effective with each iteration, enabling it to more effectively reach the deeper logic of the target and discover potential security vulnerabilities.
[0112] The following detailed description is provided in conjunction with specific examples.
[0113] 1. Test objectives and environment configuration: Target under test: A lightweight TCP / IP protocol stack of an embedded device.
[0114] Test interface: Send raw TCP packets to a specified port of the target device via a network socket.
[0115] Monitoring methods: Use code instrumentation (such as QEMU-based dynamic binary instrumentation) to collect code coverage and execution paths in real time; use crash monitoring processes to capture exceptions in the target program (such as segmentation faults and assertion failures).
[0116] 2. Data preparation and model initialization: Initial training set: 10,000 valid TCP packets are collected (covering states such as normal connection, data transmission, and disconnection). A seed exception pool is additionally introduced, comprising rule-generated exception packets (such as checksum errors, illegal flag combinations (SYN+FIN), and length field overflow) and PoC packets from publicly available vulnerability databases (packet fragments extracted from historical CVE vulnerabilities that can trigger known crashes). Data representation: Each TCP packet is normalized into a fixed-length byte sequence (e.g., a maximum of 60 bytes for the IP + TCP header), with padding for any shortfall.
[0117] Generative model pre-training: Using the above packets, a conditional Wasserstein GAN (cWGAN) is pre-trained. This teaches the generator G to introduce controlled anomalies in specific fields (such as flags and length) while keeping the protocol framework generally valid.
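The fixed-length data representation can be sketched as follows. The 60-byte length comes from the example above; zero padding, truncation of longer inputs, and scaling bytes to the range [0, 1] are assumptions added for illustration.

```python
FIXED_LEN = 60  # bytes, from the example above


def normalize_packet(raw, length=FIXED_LEN, pad_byte=0):
    """Truncate or right-pad the raw header bytes to exactly `length` bytes.
    Truncation of longer inputs is an assumption; the text only mentions padding."""
    clipped = bytes(raw[:length])
    return clipped + bytes([pad_byte]) * (length - len(clipped))


def to_model_input(raw, length=FIXED_LEN):
    """Scale each byte to [0, 1] so the sequence can be fed to a neural network."""
    return [byte / 255.0 for byte in normalize_packet(raw, length)]


short_packet = normalize_packet(b"\x45" * 40)
long_packet = normalize_packet(b"\x01" * 80)
features = to_model_input(b"\xff")
```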
[0118] 3. Specific configuration of reinforcement learning components: 1) State space: S_t = [C, M, D, E]. C: Coverage change vector. Records the basic block IDs (hash values) of newly covered blocks in the last 5 batches, and identifies whether these blocks are located in key functions (such as tcp_input, tcp_process).
[0119] M: Mutation field heatmap. A vector of length n (the number of key fields in the TCP header) that records which fields, when frequently mutated over a period of time, have led to new coverage or crashes. For example, the fields may correspond to: [source port, destination port, sequence number, acknowledgment number, data offset, flags (URG / ACK / PSH / RST / SYN / FIN), window size, checksum, urgent pointer].
[0120] D: Depth reward decay factor. The ratio of the average depth of the function call stack reached in the current test to the historical maximum depth.
[0121] E: Exploration degree. Computed from the Shannon entropy of recently generated data packets; it helps prevent the policy from getting trapped in local optima.
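The two scalar state components D and E defined above can be sketched as follows. Computing the entropy over the byte values of the recent packets is an assumption; the text above does not specify which distribution the Shannon entropy is taken over. The function names are also hypothetical.

```python
import math
from collections import Counter


def depth_factor(recent_depths, historical_max):
    """D: average call stack depth of the current test divided by the
    historical maximum depth."""
    if not recent_depths or historical_max <= 0:
        return 0.0
    return (sum(recent_depths) / len(recent_depths)) / historical_max


def exploration_degree(packets):
    """E: Shannon entropy (in bits) of the byte values across recent packets."""
    counts = Counter(byte for packet in packets for byte in packet)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


d_value = depth_factor([4, 6], historical_max=10)
e_uniform = exploration_degree([b"\x00\x01"])
e_constant = exploration_degree([b"\x00\x00"])
```

A low E value indicates that recent packets look alike, which the agent can use as a signal to explore different fields or mutation types.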
[0122] 2) Action space: A_t = {target_field, mutation_type, intensity}. target_field: Target field selection. Select 1-3 fields from the 9 TCP header fields above as the key fields for this round of mutation. For example: [flags, window size].
[0123] mutation_type: Mutation type selection. A discrete action that defines how to distort the selected field. 0: Boundary values (e.g., setting the window size to 0 or 65535); 1: Random byte flipping; 2: Dependency field conflict (e.g., the data offset field is set to 5 (meaning the header is 20 bytes long), but the actual sent packet header length is 40 bytes). 3: State machine violation (e.g., sending the ACK flag when SYN is not sent).
[0124] intensity: Mutation intensity. A continuous value (0.1-1.0) that controls the degree of mutation or the magnitude of noise addition.
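Encoding an action from the action space above into a condition vector can be sketched as follows. The field identifiers, the multi-hot encoding used for selecting several fields at once, and the resulting dimension of 14 are assumptions for illustration; the embodiments only state that discrete actions become one-hot codes or embedding vectors, that continuous actions are normalized, and that the results are concatenated.

```python
# Hypothetical identifiers for the 9 TCP header fields listed above.
FIELDS = ["src_port", "dst_port", "seq", "ack", "data_offset",
          "flags", "window", "checksum", "urgent_ptr"]
NUM_MUTATION_TYPES = 4          # 0..3 as defined above
INTENSITY_MIN, INTENSITY_MAX = 0.1, 1.0


def encode_action(target_fields, mutation_type, intensity):
    """Return the condition vector: 9 field slots + 4 type slots + 1 intensity."""
    if not 1 <= len(target_fields) <= 3:
        raise ValueError("select between 1 and 3 target fields")
    if any(f not in FIELDS for f in target_fields):
        raise ValueError("unknown target field")
    if mutation_type not in range(NUM_MUTATION_TYPES):
        raise ValueError("unknown mutation type")
    field_vec = [1.0 if f in target_fields else 0.0 for f in FIELDS]
    type_vec = [1.0 if i == mutation_type else 0.0
                for i in range(NUM_MUTATION_TYPES)]
    clipped = min(max(intensity, INTENSITY_MIN), INTENSITY_MAX)
    norm = (clipped - INTENSITY_MIN) / (INTENSITY_MAX - INTENSITY_MIN)
    return field_vec + type_vec + [norm]


condition = encode_action(["flags", "window"], mutation_type=0, intensity=1.0)
```

The condition vector would then be concatenated with random noise and passed to the generator.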
[0125] 3) Reward function: R_t = R_cov + R_crash + R_depth - R_cost. Coverage reward R_cov: if a newly covered basic block is located inside the key function `tcp_input`, R_cov = base score (2.0) × key factor (3.0) = 6.0; otherwise, if the newly covered basic block is located inside the function `tcp_process`, R_cov = 2.0 × 2.0 = 4.0; otherwise, R_cov = number of newly covered blocks × 1.0. Crash reward R_crash: if a crash is triggered and the stack trace contains `memcpy` / `memmove`, R_crash = 10.0 (high-risk memory operation); otherwise, if a crash is triggered and the stack trace is inside a protocol state machine function, R_crash = 5.0 (state machine error); otherwise, R_crash = 1.0 (ordinary error). Depth reward R_depth: R_depth = log(current call stack depth + 1) × 0.5. Cost penalty R_cost: R_cost = time spent in this round (seconds) × 0.1.
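The reward function of this example can be sketched as follows. Several points are assumptions: the logarithm is taken as the natural logarithm, a round without a crash is given R_crash = 0, and the set of protocol state machine function names is passed in by the caller because the example does not name them.

```python
import math

KEY_FACTORS = {"tcp_input": 3.0, "tcp_process": 2.0}  # from the example above
MEMORY_FUNCS = {"memcpy", "memmove"}


def coverage_reward(new_block_functions):
    """new_block_functions: enclosing function name of each newly covered block."""
    for name in ("tcp_input", "tcp_process"):  # checked in priority order
        if name in new_block_functions:
            return 2.0 * KEY_FACTORS[name]
    return len(new_block_functions) * 1.0


def crash_reward(crashed, stack, state_machine_funcs=frozenset()):
    if not crashed:
        return 0.0  # assumption: only rounds that crash are scored
    if MEMORY_FUNCS & set(stack):
        return 10.0
    if state_machine_funcs & set(stack):
        return 5.0
    return 1.0


def total_reward(new_block_functions, crashed, stack, depth, seconds,
                 state_machine_funcs=frozenset()):
    r_cov = coverage_reward(new_block_functions)
    r_crash = crash_reward(crashed, stack, state_machine_funcs)
    r_depth = math.log(depth + 1) * 0.5  # natural logarithm assumed
    r_cost = seconds * 0.1
    return r_cov + r_crash + r_depth - r_cost


reward = total_reward(["tcp_input"], True, ["memcpy"], depth=0, seconds=10.0)
```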
[0126] The following is a workflow example simulation using Table 1, assuming the initial strategy is random exploration.
[0127] Table 1. Workflow Example Simulation
[0128] Please refer to Figure 3, which is a structural schematic of a message generation apparatus provided in an embodiment of this application. This application provides a message generation apparatus 200, including: a pre-training module 210, used to pre-train the generative model using the initial data pool and initialize the reinforcement learning agent, wherein the generative model is used to generate the corresponding message based on the input condition vector; an execution module 220, used for the reinforcement learning agent to select an action to execute from the action space according to the current test state, wherein the action to execute is used to describe the message features to be generated; an initial message generation module 230, used to encode the selected action as a condition vector and input it into the generative model, the generative model generating a message that conforms to the description of the action based on the condition vector; a test module 240, used to send the generated message to the target under test for testing and to obtain feedback information; an agent update module 250, used to calculate the reward value of the current test based on the feedback information, update the current test state, and update the policy network of the reinforcement learning agent; and an optimization model module 260, used to filter messages from the testing process to update the data pool, optimize the generative model using the updated data pool, and generate test messages with the optimized generative model.
[0129] Optionally, in this embodiment, the message generation device 200 uses a conditional generative adversarial network as its generative model, which includes a generator and a discriminator. The generator generates messages based on the input conditional vector and random noise, and the discriminator determines the authenticity of the messages. The pre-training module 210 is specifically used to perform adversarial training on the generator and discriminator using legitimate messages and seed abnormal messages in the initial data pool.
[0130] Optionally, in this embodiment of the application, in the message generation device 200, the current test state includes at least one of the following: coverage change vector, mutation field heatmap, depth reward decay factor, and exploration degree.
[0131] Optionally, in this embodiment of the application, the message generation device 200 has an action space including discrete actions and / or continuous actions; the discrete actions include target field selection and mutation type selection, and the mutation type is at least one of boundary value, random byte flip, dependency field conflict and state machine violation; the continuous actions include mutation intensity, which is used to control the degree of message distortion.
[0132] Optionally, in this embodiment, the message generation device 200 has an action space including discrete actions and / or continuous actions; the initial message generation module 230 is used to convert discrete actions in the execution actions into one-hot codes or embedded vectors, and / or normalize the continuous actions in the execution actions; and to concatenate the encoded discrete actions and / or continuous actions to generate a condition vector of a preset dimension.
[0133] Optionally, in this embodiment, in the message generation device 200, the feedback information includes at least code coverage change, crash type, execution path depth, and / or test duration; the agent update module 250 is used to extract the code coverage change from the feedback information and calculate the coverage reward component; and / or, determine the crash severity reward component based on the crash type; and / or, calculate the depth reward component based on the execution path depth; and / or, calculate the cost penalty component based on the test duration; and calculate the reward value of the current test using the coverage reward component, crash severity reward component, depth reward component, and / or cost penalty component.
[0134] Optionally, in this embodiment, in the message generation device 200, the agent update module 250 is used to acquire experience data in the test, the experience data including the state before the update, the selected action, the reward value, and the state after the update; store the experience data in a buffer; and use the experience data in the buffer to calculate a loss function through a reinforcement learning algorithm and backpropagate it to update the parameters of the policy network.
[0135] Optionally, in this embodiment, in the message generation device 200, the test module 240 is used to send messages to the protocol stack of the device under test through a network interface, and to monitor the execution process of the target under test using instrumentation, debugger, or sandbox technology to obtain feedback information.
[0136] Optionally, in this embodiment of the application, in the message generation device 200, the optimization model module 260 is used to select messages with reward values higher than a threshold from the testing process, and update the data pool based on the selected messages to generate an updated data pool.
[0137] It should be understood that this device corresponds to the above-described message generation method embodiment and is capable of performing the various steps involved in the above method embodiment. The specific functions of this device can be found in the description above, and detailed descriptions are omitted here to avoid repetition. The device includes at least one software functional module that can be stored in memory or embedded in the device's operating system (OS) in the form of software or firmware.
[0138] Please refer to Figure 4, which is a structural schematic of an electronic device provided in an embodiment of this application. An electronic device 300 provided in this application includes a processor 310 and a memory 320. The memory 320 stores machine-readable instructions executable by the processor 310. When the machine-readable instructions are executed by the processor 310, the method described above is performed.
[0139] The components shown in Figure 4 can be implemented using hardware, software, or a combination thereof. Electronic device 300 may be a physical device, such as a server or PC, or a virtual device, such as a virtual machine or virtualization container. Furthermore, electronic device 300 is not limited to a single device; it can be a combination of multiple devices or a cluster of numerous devices.
[0140] This application also provides a storage medium storing a computer program, which is executed by a processor to perform the above-described method.
[0141] The storage medium can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk.
[0142] This application also provides a computer program product, including computer program instructions, which are executed by a processor to perform the method described above.
[0143] It should be understood that the disclosed apparatus and methods can also be implemented in other ways, given the several embodiments provided in this application. The apparatus embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code, which contains one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions marked in the blocks may occur in a different order than those marked in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, or they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram and / or flowchart, and combinations of blocks in block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.
[0144] In addition, the functional modules in the various embodiments of this application can be integrated together to form an independent part, or each module can exist independently, or two or more modules can be integrated to form an independent part.
[0145] The above description is only an optional implementation of the embodiments of this application, but the protection scope of the embodiments of this application is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the embodiments of this application should be covered within the protection scope of the embodiments of this application.
Claims
1. A message generation method, characterized in that it comprises: pre-training a generative model using an initial data pool, and initializing a reinforcement learning agent, wherein the generative model is used to generate corresponding messages based on an input condition vector; selecting, by the reinforcement learning agent, an action to execute from the action space based on the current test state, wherein the action to execute is used to describe the message features to be generated; encoding the selected action as a condition vector and inputting it into the generative model, the generative model generating a message that conforms to the description of the action based on the condition vector; sending the generated message to the target under test for testing, and obtaining feedback information; based on the feedback information, calculating the reward value for the current test, updating the current test state, and updating the policy network of the reinforcement learning agent; and filtering messages from the testing process to update the data pool, optimizing the generative model using the updated data pool, and generating test messages with the optimized generative model.
2. The method according to claim 1, characterized in that, The generative model is a conditional generative adversarial network, which includes a generator and a discriminator. The generator is used to generate messages based on the input conditional vector and random noise, and the discriminator is used to determine the authenticity of the messages. Pre-training a generative model using an initial data pool includes: performing adversarial training on the generator and the discriminator using legitimate packets and seed abnormal packets from the initial data pool.
3. The method according to claim 1, characterized in that, The current test status includes at least one of the following: coverage change vector, mutation field heatmap, depth reward decay factor, and exploration degree.
4. The method according to claim 1, characterized in that, The action space includes discrete actions and / or continuous actions; the discrete actions include target field selection and mutation type selection, wherein the mutation type is at least one of boundary value, random byte flip, dependency field conflict and state machine violation; the continuous actions include mutation intensity, which is used to control the degree of message distortion.
5. The method according to claim 1, characterized in that, The action space includes discrete actions and / or continuous actions; the selected action to be executed is encoded as a condition vector, including: The discrete actions in the execution actions are converted into one-hot codes or embedding vectors, and / or the continuous actions in the execution actions are normalized; The encoded discrete actions and / or continuous actions are concatenated to generate the condition vector of the preset dimension.
6. The method according to claim 1, characterized in that, The feedback information includes at least changes in code coverage, crash type, execution path depth, and / or test duration; based on the feedback information, the reward value for the current test is calculated, specifically including: Extract the code coverage change from the feedback information and calculate the coverage reward component; and / or, determine the crash severity reward component based on the crash type; and / or, calculate the depth reward component based on the execution path depth; and / or, calculate the cost penalty component based on the test duration. The reward value for the current test is calculated using the coverage reward component, the crash severity reward component, the depth reward component, and / or the cost penalty component.
7. The method according to claim 1, characterized in that updating the policy network of the reinforcement learning agent includes: acquiring experience data from the test, the experience data including the state before the update, the selected action, the reward value, and the state after the update; storing the experience data in a buffer; and using the experience data in the buffer, calculating a loss function through a reinforcement learning algorithm and backpropagating it to update the parameters of the policy network.
8. The method according to claim 1, characterized in that, The generated message is sent to the target under test for testing, and feedback information is obtained, including: The message is sent to the protocol stack of the device under test via the network interface; The execution process of the target under test is monitored using instrumentation, debuggers, or sandbox technology to obtain the feedback information.
9. The method according to claim 1, characterized in that, Updating the data pool by filtering messages during the testing process includes: Messages with reward values higher than a threshold are selected from the test process, and the data pool is updated based on the selected messages to generate the updated data pool.
10. A computer program product, characterized in that, It includes computer program instructions that are executed by a processor to perform the method as described in any one of claims 1 to 9.
11. An electronic device, characterized in that it comprises: a processor and a memory, the memory storing computer program instructions that, when executed by the processor, perform the method as described in any one of claims 1 to 9.
12. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores computer program instructions that, when executed by a processor, perform the method as described in any one of claims 1 to 9.