Network attack traffic dataset generation method, system, medium, and device

By integrating simulated APT attack traffic into real network traffic and using timestamp and IP mapping algorithms to generate high-quality network attack datasets, this technology solves the problem of generating APT attack traffic in existing technologies and achieves effective dataset generation and detection capabilities.

CN118199917B · Active · Publication Date: 2026-06-26 · SHANGHAI JIAOTONG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: SHANGHAI JIAOTONG UNIV
Filing Date: 2024-01-26
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing technologies struggle to generate high-quality datasets of network attack traffic, especially real network traffic simulating APT attacks, making it difficult to effectively study and detect advanced security threats.

Method used

By using real campus network traffic as background traffic and fusing it with simulated APT attack traffic, and employing timestamp and IP mapping algorithms, the APT attack traffic is integrated into the background traffic, generating a high-quality mixed network traffic dataset.

Benefits of technology

The generated dataset can simulate APT attacks in real network environments, possessing the ability to deceive and evade detection by defense systems, and provides a data generation solution for advanced security threat research.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides a method, system, medium, and device for generating a network attack traffic dataset, comprising the following steps: Step 1: setting the network traffic collected from a real campus network as background traffic, setting the APT attack traffic simulated with Caldera as attack traffic, and recording the attack time and attack method; Step 2: recording the start time of each attack step and constructing a timestamp sequence; Step 3: extracting the start and end times from the background traffic and fusing the attack traffic into the background traffic according to a preset rule to obtain a network attack traffic dataset; and Step 4: generating, based on the network attack traffic dataset, malicious traffic samples that deceive and evade detection by a defense system. The application can fuse real network traffic with simulated attack traffic, thereby obtaining a high-quality network traffic dataset mixed with attack traffic and providing a data generation scheme for research directions, such as APT attacks, in which real data is difficult to obtain.

Description

Technical Field

[0001] This invention relates to the field of data processing technology, specifically to a method, system, medium, and device for generating network attack traffic datasets. More particularly, it relates to a method for generating network attack traffic datasets based on data fusion.

Background Technology

[0002] As cyberattack patterns grow increasingly complex, research into advanced security threats such as APT attacks faces significant challenges. Real-world APT attacks and other advanced threats often target critical enterprises and organizations, and the offensive and defensive confrontations frequently involve nation-states. For researchers, obtaining real-world network traffic data from such organizations is extremely difficult.

[0003] Patent document CN113158390A discloses a method for generating network attack traffic based on an auxiliary classification generative adversarial network. This method utilizes the principles of generative adversarial networks to generate malicious traffic samples that can deceive and evade detection by defense systems based on existing network attack traffic datasets. However, this patent cannot completely solve the existing technical problems, nor can it meet the needs of this invention.

Summary of the Invention

[0004] In view of the deficiencies in the prior art, the purpose of this invention is to provide a method, system, medium and device for generating network attack traffic datasets.

[0005] The network attack traffic dataset generation method provided by the present invention includes:

[0006] Step 1: Set the network traffic collected from the real campus network as background traffic B_t, and set the APT attack traffic simulated by Caldera as attack traffic A_t; for each phase of the APT attack, record the time and attack method;

[0007] Step 2: Record the start time of each attack step and construct a timestamp sequence T = {t_1, t_2, ..., t_n}, where n is the number of attack phases, i is the attack phase index, and t_i is the start time of the i-th attack phase;

[0008] Step 3: From B_t, extract the start and end times t_min and t_max. The attack traffic of the i-th attack phase, lying between t_i and t_{i+1}, is merged into the background traffic according to a preset rule to obtain a network attack traffic dataset;

[0009] Step 4: Based on the network attack traffic dataset, generate malicious traffic samples that deceive and evade the detection of the defense system.

[0010] Preferably, the timestamp of each flow in the attack traffic is mapped so that the mapped timestamp fits the timeline of the background traffic B_t; that is, the time axis of A_t is mapped proportionally onto the time axis of B_t. If the start timestamp of a given flow in the attack traffic is t_h, then the corresponding mapped timestamp t_map is:

[0011]

[0012] Preferably, after the timestamp mapping is completed, the source IP and destination IP in the attack traffic are mapped. For the actual attack path P = {IP_1, IP_2, ..., IP_n} of the simulated APT attack, a mapped attack path MP = {MIP_1, MIP_2, ..., MIP_n} is first randomly constructed from the background traffic;

[0013] MP satisfies the following: any two adjacent IPs in MP communicate in the background traffic; among all flows between any two adjacent IPs in MP, extract a set of flows such that the source and destination IPs of these flows are each pair of adjacent elements in MP, and the timestamps of these flows satisfy an increasing relationship.

[0014] Preferably, for each flow in the attack traffic A_t, the type of the flow's source IP is determined. If it is an IP in P, it is mapped to the corresponding IP in MP: if the source IP of the flow is IP_i, then the mapped source IP is MIP_i;

[0015] If the source IP is an external IP that is not in P, the mapped IP of the destination IP is obtained first, denoted MIP_i. The source IP is then mapped to a random external IP in the background traffic B_t that satisfies the following condition: the random IP has communicated with MIP_i in B_t;

[0016] Finally, for each flow in A_t, the timestamp, source IP, and destination IP are replaced with the mapped values, and the flow is then added to B_t.

[0017] The network attack traffic dataset generation system provided by the present invention includes:

[0018] Module M1: Sets the network traffic collected from the actual campus network as background traffic B_t, and sets the APT attack traffic simulated by Caldera as attack traffic A_t; for each phase of the APT attack, records the time and attack method;

[0019] Module M2: Records the start time of each attack step and constructs a timestamp sequence T = {t_1, t_2, ..., t_n}, where n is the number of attack phases, i is the attack phase index, and t_i is the start time of the i-th attack phase;

[0020] Module M3: From B_t, extracts the start and end times t_min and t_max. The attack traffic of the i-th attack phase, lying between t_i and t_{i+1}, is merged into the background traffic according to a preset rule to obtain a network attack traffic dataset;

[0021] Module M4: Generates malicious traffic samples that deceive and evade detection by defense systems based on the network attack traffic dataset.

[0022] Preferably, the timestamp of each flow in the attack traffic is mapped so that the mapped timestamp fits the timeline of the background traffic B_t; that is, the time axis of A_t is mapped proportionally onto the time axis of B_t. If the start timestamp of a given flow in the attack traffic is t_h, then the corresponding mapped timestamp t_map is:

[0023]

[0024] Preferably, after the timestamp mapping is completed, the source IP and destination IP in the attack traffic are mapped. For the actual attack path P = {IP_1, IP_2, ..., IP_n} of the simulated APT attack, a mapped attack path MP = {MIP_1, MIP_2, ..., MIP_n} is first randomly constructed from the background traffic;

[0025] MP satisfies the following: any two adjacent IPs in MP communicate in the background traffic; among all flows between any two adjacent IPs in MP, extract a set of flows such that the source and destination IPs of these flows are each pair of adjacent elements in MP, and the timestamps of these flows satisfy an increasing relationship.

[0026] Preferably, for each flow in the attack traffic A_t, the type of the flow's source IP is determined. If it is an IP in P, it is mapped to the corresponding IP in MP: if the source IP of the flow is IP_i, then the mapped source IP is MIP_i;

[0027] If the source IP is an external IP that is not in P, the mapped IP of the destination IP is obtained first, denoted MIP_i. The source IP is then mapped to a random external IP in the background traffic B_t that satisfies the following condition: the random IP has communicated with MIP_i in B_t;

[0028] Finally, for each flow in A_t, the timestamp, source IP, and destination IP are replaced with the mapped values, and the flow is then added to B_t.

[0029] According to the present invention, a computer-readable storage medium storing a computer program is provided, wherein when the computer program is executed by a processor, it implements the steps of the network attack traffic dataset generation method.

[0030] The electronic device provided by the present invention includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the computer program is executed by the processor, it implements the steps of the network attack traffic dataset generation method.

[0031] Compared with the prior art, the present invention has the following beneficial effects:

[0032] This invention designs a data fusion algorithm that can fuse real network traffic with simulated attack traffic to obtain a high-quality network traffic dataset mixed with attack traffic; it simulates advanced security threats such as APT attacks and captures network traffic, and then integrates it into normal traffic collected from real large-scale networks through the data fusion algorithm to obtain labeled fused traffic data; it provides a data generation scheme for research directions such as APT attacks where it is difficult to obtain real data.

Attached Figure Description

[0033] Other features, objects, and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:

[0034] Figure 1 is a flowchart of the data fusion;

[0035] Figure 2 is a diagram of the data fusion algorithm pseudocode.

Detailed Implementation

[0036] The present invention will now be described in detail with reference to specific embodiments. These embodiments will help those skilled in the art to further understand the present invention, but do not limit the invention in any way. It should be noted that those skilled in the art can make several changes and improvements without departing from the concept of the present invention. These all fall within the protection scope of the present invention.

[0037] Example 1

[0038] This invention provides a method for generating a network attack traffic dataset based on data fusion, in which network traffic collected from a real campus network is set as background traffic B_t and the APT attack traffic simulated by Caldera is set as attack traffic A_t. For each stage of the APT attack, the time and attack method are recorded. The attack methods are derived from the MITRE adversary emulation library referenced during the Caldera simulation, which describes in detail the methods used in each step of the APT attack.

[0039] Specifically, the start time of each attack step is recorded to construct a timestamp sequence T = {t_1, t_2, ..., t_n}, where n is the number of attack phases, i is the attack phase index, and t_i is the start time of the i-th attack phase. The start and end times t_min and t_max are extracted from B_t. The attack traffic of the i-th attack phase, lying between t_i and t_{i+1}, is processed and merged into the background traffic according to certain rules.
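The role of the timestamp sequence T can be illustrated with a minimal Python sketch that assigns a flow to an attack phase by its start time. The function name `phase_of`, the half-open interval convention [t_i, t_{i+1}), and the treatment of the last phase as open-ended are assumptions made for illustration; the text only states that phase i lies between t_i and t_{i+1}.

```python
from bisect import bisect_right


def phase_of(flow_start, phase_starts):
    """Return the 1-based phase index i with t_i <= flow_start < t_{i+1}.

    phase_starts is the sorted sequence T = [t_1, ..., t_n].
    Assumption: the last phase has no upper bound; flows that start
    before t_1 belong to no phase and yield None.
    """
    i = bisect_right(phase_starts, flow_start)
    return i if i > 0 else None


# Hypothetical phase start times in seconds (not taken from the patent).
T = [100.0, 160.0, 400.0]
labels = [phase_of(t, T) for t in (50.0, 100.0, 159.9, 160.0, 999.0)]
print(labels)
```

Keeping the phase index alongside each flow is what later allows the fused dataset to carry per-phase labels.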

[0040] As shown in Figure 1, the core idea of data fusion is to integrate the attack traffic A_t into the background traffic B_t. During this process, the behavior of each attack phase is preserved while the rationality of the fusion is ensured. First, the timestamp of each flow in the attack traffic is mapped so that the mapped timestamp fits the timeline of the background traffic B_t; that is, the time axis of A_t is mapped proportionally onto the time axis of B_t. For example, if the start timestamp of a flow in the attack traffic is t_h, then the corresponding mapped timestamp t_map is:

[0041]
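The mapping formula is not reproduced in this text. As a hedged illustration only, one plausible reading of "mapped proportionally" is a linear rescaling of the attack time axis onto [t_min, t_max]. In the sketch below, `a_start` and `a_end` (the earliest and latest timestamps in A_t) and the function name `map_timestamp` are assumptions, not symbols taken from the patent.

```python
def map_timestamp(t_h, a_start, a_end, t_min, t_max):
    """Linearly rescale t_h from [a_start, a_end] onto [t_min, t_max].

    Assumed form: t_map = t_min + (t_h - a_start) / (a_end - a_start)
                                  * (t_max - t_min)
    This is an illustrative guess at the proportional mapping, not the
    patent's own formula.
    """
    if a_end == a_start:
        # Degenerate attack capture with a single instant: pin to t_min.
        return t_min
    ratio = (t_h - a_start) / (a_end - a_start)
    return t_min + ratio * (t_max - t_min)


# A flow halfway through the attack capture lands halfway through the
# background capture under this assumed mapping.
print(map_timestamp(150.0, 100.0, 200.0, 1000.0, 3000.0))
```

A rescaling of this kind preserves the relative order and relative spacing of the attack phases, which matches the stated goal of keeping each phase's behavior intact.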

[0042] After the timestamp mapping is completed, the source IP and destination IP in the attack traffic are mapped. Specifically, for the actual attack path P = {IP_1, IP_2, ..., IP_n} of the simulated APT attack, a mapped attack path MP = {MIP_1, MIP_2, ..., MIP_n} is first randomly constructed from the background traffic according to the following rules.

[0043] MP needs to satisfy the following properties:

[0044] 1. Any two adjacent IPs in MP communicate in the background traffic;

[0045] 2. In all flows between any two adjacent IPs in MP, a set of flows can be extracted such that the source and destination IPs of these flows are each pair of adjacent elements in MP in sequence, and the timestamps of these flows satisfy an increasing relationship. That is, a path that satisfies the temporal causal relationship can be found among all nodes of MP in the background traffic. This ensures the rationality of IP mapping, that is, it does not change the APT behavior before and after data fusion.
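The two properties of MP can be sketched as a randomized search over the background flows. This is a minimal illustration under stated assumptions: flows are modeled as `(timestamp, src_ip, dst_ip)` tuples, the retry strategy and the names `build_mapped_path` and `is_valid_path` are hypothetical, and the patent's own construction procedure (Figure 2) may differ.

```python
import random
from collections import defaultdict


def build_mapped_path(background, n, rng, max_tries=1000):
    """Randomly pick n distinct IPs so that consecutive IPs communicate
    in the background with strictly increasing timestamps."""
    out_edges = defaultdict(list)
    for ts, src, dst in background:
        out_edges[src].append((ts, dst))
    sources = sorted(out_edges)
    if not sources:
        return None
    for _ in range(max_tries):
        path, last_ts = [rng.choice(sources)], float("-inf")
        while len(path) < n:
            candidates = [(ts, dst) for ts, dst in out_edges[path[-1]]
                          if ts > last_ts and dst not in path]
            if not candidates:
                break  # dead end: restart from a new random source
            last_ts, nxt = rng.choice(candidates)
            path.append(nxt)
        if len(path) == n:
            return path
    return None


def is_valid_path(background, path):
    """Check both MP properties by greedily taking the earliest usable flow."""
    last_ts = float("-inf")
    for src, dst in zip(path, path[1:]):
        later = [ts for ts, s, d in background
                 if s == src and d == dst and ts > last_ts]
        if not later:
            return False
        last_ts = min(later)
    return True
```

For example, `build_mapped_path(flows, 3, random.Random(0))` returns a three-hop candidate path, or `None` if no restart succeeds within `max_tries`; `is_valid_path` can then confirm that a temporally causal chain of flows exists along it.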

[0046] Specifically, for each flow in the attack traffic A_t, the mapped timestamp is calculated first, and then the type of the flow's source IP is determined.

[0047] If it is an IP in P, it is mapped to the corresponding IP in MP. For example, if the source IP of the flow is IP_i, then the mapped source IP is MIP_i.

[0048] If the source IP is an external IP that is not in P, its destination IP must be in P. Therefore, the mapped IP of the destination IP is obtained first (denoted MIP_i). The source IP is then mapped to a random external IP in the background traffic B_t that satisfies the following condition: the random IP has communicated with MIP_i in B_t.

[0049] Finally, for each flow in A_t, the timestamp, source IP, and destination IP are replaced with the mapped values, and the flow is then added to B_t.

[0050] If neither the source IP nor the destination IP is in P, the flow is unrelated to the simulated attack and is not considered by the data fusion algorithm. This is why the source IP check above can state that if the source IP is not in P, the destination IP must be in P.
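The per-flow IP mapping rules can be sketched as follows. Several details are assumptions for illustration: "external" is modeled as "outside a given campus prefix" via the hypothetical `internal_net` parameter, the destination IP is handled symmetrically to the source IP (the text spells out only the source case), and all function names are hypothetical.

```python
import ipaddress
import random


def peers_of(ip, background):
    """IPs that exchanged at least one flow with `ip` in the background."""
    peers = set()
    for _ts, src, dst in background:
        if src == ip:
            peers.add(dst)
        elif dst == ip:
            peers.add(src)
    return peers


def map_endpoint(ip, other_ip, to_mapped, background, internal_net, rng):
    """Map one endpoint; `other_ip` must be in P when `ip` is not."""
    if ip in to_mapped:
        return to_mapped[ip]  # IP_i -> MIP_i
    anchor = to_mapped[other_ip]
    external = sorted(p for p in peers_of(anchor, background)
                      if ipaddress.ip_address(p) not in internal_net)
    # None signals that no external peer of MIP_i exists in the background.
    return rng.choice(external) if external else None


def map_flow_ips(src, dst, P, MP, background, internal_net, rng):
    """Return (mapped_src, mapped_dst), or None if the flow is unrelated."""
    to_mapped = dict(zip(P, MP))
    if src not in to_mapped and dst not in to_mapped:
        return None  # neither endpoint is on the attack path: skip the flow
    return (
        map_endpoint(src, dst, to_mapped, background, internal_net, rng),
        map_endpoint(dst, src, to_mapped, background, internal_net, rng),
    )
```

Drawing the replacement from the peers of MIP_i keeps the fused flow consistent with communication that already occurs in the background, which is the stated purpose of the condition.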

[0051] In fact, if IP mapping were skipped and the attack traffic A_t were merged directly with the background traffic B_t, the APT attack characteristics in the merged traffic would be too obvious and unrealistic (for example, the simulated IPs used in the attack traffic A_t obviously do not appear in the background traffic B_t). Using such fused traffic to test the performance of an APT attack detection system would clearly be unconvincing.

[0052] As shown in Figure 2, the detailed process of the data fusion algorithm is described in pseudocode. The algorithm first generates a mapped attack path MP from the input attack path P using the rules described above (lines 1-4). Next, it determines the mapped timestamp using the proportional scaling formula (lines 6-7). IP mapping is the core part of the entire algorithm: using P, MP, and the rules described above, the source and destination IPs of each flow in the attack traffic A_t are mapped (lines 8-19). Finally, the original timestamps and IPs in the attack traffic are replaced with the mapped timestamps and source and destination IPs, and the result is added to the background traffic B_t (lines 20-23).
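Since the pseudocode of Figure 2 is not reproduced in the text, the overall flow can only be sketched under assumptions. The Python sketch below is a simplified stand-in, not the patent's algorithm: the `Flow` record, the precomputed `ip_map` (P to MP), the single fixed `external_ip` replacing the random external-peer selection, the linear timestamp rescaling, and the `label` field are all hypothetical choices made to keep the example short.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Flow:
    ts: float
    src: str
    dst: str
    label: str = "benign"


def fuse(attack, background, ip_map, external_ip):
    """Merge attack flows into background flows and return them time-sorted.

    ip_map maps each IP_i in P to MIP_i in MP. Endpoints outside P are
    replaced by `external_ip`, a simplification of the random choice of
    an external peer described in the text.
    """
    t_min = min(f.ts for f in background)
    t_max = max(f.ts for f in background)
    a_min = min(f.ts for f in attack)
    a_max = max(f.ts for f in attack)
    span = (a_max - a_min) or 1.0  # avoid division by zero
    fused = list(background)
    for f in attack:
        if f.src not in ip_map and f.dst not in ip_map:
            continue  # unrelated to the simulated attack
        ts = t_min + (f.ts - a_min) / span * (t_max - t_min)
        src = ip_map.get(f.src, external_ip)
        dst = ip_map.get(f.dst, external_ip)
        fused.append(replace(f, ts=ts, src=src, dst=dst, label="attack"))
    return sorted(fused, key=lambda f: f.ts)


background = [Flow(1000.0, "10.1.1.5", "10.1.1.9"),
              Flow(3000.0, "10.1.1.9", "10.1.1.5")]
attack = [Flow(0.0, "192.168.56.10", "192.168.56.11"),
          Flow(50.0, "192.168.56.11", "198.51.100.1"),
          Flow(100.0, "7.7.7.7", "8.8.8.8")]
ip_map = {"192.168.56.10": "10.1.1.5", "192.168.56.11": "10.1.1.9"}
out = fuse(attack, background, ip_map, "203.0.113.7")
```

Tagging the injected flows with a label is one way to obtain the labeled fused traffic data mentioned in the beneficial effects; the flow with neither endpoint in P is dropped, as paragraph [0050] describes.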

[0053] Example 2

[0054] The present invention also provides a network attack traffic dataset generation system, which can be implemented by executing the process steps of the network attack traffic dataset generation method. That is, those skilled in the art can understand the network attack traffic dataset generation method as a preferred embodiment of the network attack traffic dataset generation system.

[0055] The network attack traffic dataset generation system provided by the present invention includes: Module M1: sets network traffic collected from a real campus network as background traffic B_t, sets the APT attack traffic simulated by Caldera as attack traffic A_t, and records the time and attack method for each APT attack phase; Module M2: records the start time of each attack step and constructs a timestamp sequence T = {t_1, t_2, ..., t_n}, where n is the number of attack phases, i is the attack phase index, and t_i is the start time of the i-th attack phase; Module M3: extracts the start and end times t_min and t_max from B_t, and merges the attack traffic of the i-th attack phase, lying between t_i and t_{i+1}, into the background traffic according to preset rules to obtain a network attack traffic dataset; Module M4: generates, based on the network attack traffic dataset, malicious traffic samples that deceive and evade the detection of the defense system.

[0056] The timestamp of each flow in the attack traffic is mapped so that the mapped timestamp fits the timeline of the background traffic B_t; that is, the time axis of A_t is mapped proportionally onto the time axis of B_t. If the start timestamp of a given flow in the attack traffic is t_h, then the corresponding mapped timestamp t_map is:

[0057]

[0058] After the timestamp mapping is completed, the source IP and destination IP in the attack traffic are mapped. For the actual attack path P = {IP_1, IP_2, ..., IP_n} of the simulated APT attack, a mapped attack path MP = {MIP_1, MIP_2, ..., MIP_n} is first randomly constructed from the background traffic;

[0059] MP satisfies the following: any two adjacent IPs in MP communicate in the background traffic; among all flows between any two adjacent IPs in MP, extract a set of flows such that the source and destination IPs of these flows are each pair of adjacent elements in MP, and the timestamps of these flows satisfy an increasing relationship.

[0060] For each flow in the attack traffic A_t, the type of the flow's source IP is determined. If it is an IP in P, it is mapped to the corresponding IP in MP: if the source IP of the flow is IP_i, then the mapped source IP is MIP_i;

[0061] If the source IP is an external IP that is not in P, the mapped IP of the destination IP is obtained first, denoted MIP_i. The source IP is then mapped to a random external IP in the background traffic B_t that satisfies the following condition: the random IP has communicated with MIP_i in B_t;

[0062] Finally, for each flow in A_t, the timestamp, source IP, and destination IP are replaced with the mapped values, and the flow is then added to B_t.

[0063] Those skilled in the art will understand that, in addition to implementing the system, apparatus, and their modules provided by this invention in purely computer-readable program code, the same program can be implemented in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, and embedded microcontrollers by logically programming the method steps. Therefore, the system, apparatus, and their modules provided by this invention can be considered a hardware component, and the modules included therein for implementing various programs can also be considered structures within the hardware component; alternatively, modules for implementing various functions can be considered both software programs implementing the method and structures within the hardware component.

[0064] Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the specific embodiments described above, and those skilled in the art can make various changes or modifications within the scope of the claims, which do not affect the essence of the present invention. Unless otherwise specified, the embodiments and features described in this application can be arbitrarily combined with each other.

Claims

1. A method for generating a network attack traffic dataset, characterized in that it includes:

Step 1: Set the network traffic collected from the actual campus network as background traffic B_t, and set the APT attack traffic simulated using Caldera as attack traffic A_t; for each phase of the APT attack, record the time and attack method;

Step 2: Record the start time of each attack step and construct a timestamp sequence T = {t_1, t_2, ..., t_n}, where n is the number of attack phases, i is the attack phase index, and t_i is the start time of the i-th attack phase;

Step 3: From B_t, extract the start and end times t_min and t_max; the attack traffic of the i-th attack phase, lying between t_i and t_{i+1}, is merged into the background traffic according to a preset rule to obtain a network attack traffic dataset;

Step 4: Based on the network attack traffic dataset, generate malicious traffic samples that deceive and evade the detection of the defense system, to test the performance of the APT attack detection system;

The timestamp of each flow in the attack traffic is mapped so that the mapped timestamp fits the timeline of the background traffic B_t, that is, the time axis of A_t is mapped proportionally onto the time axis of B_t; if the start timestamp of a given flow in the attack traffic is t_h, the corresponding mapped timestamp t_map is determined by this proportional mapping;

After the timestamp mapping is completed, the source IP and destination IP in the attack traffic are mapped; for the actual attack path P = {IP_1, IP_2, ..., IP_n} of the simulated APT attack, a mapped attack path MP = {MIP_1, MIP_2, ..., MIP_n} is first randomly constructed from the background traffic;

MP satisfies the following: any two adjacent IPs in MP communicate in the background traffic; among all flows between any two adjacent IPs in MP, a set of flows can be extracted such that the source and destination IPs of these flows are each pair of adjacent elements in MP, and the timestamps of these flows satisfy an increasing relationship;

For each flow in the attack traffic A_t, the type of the source IP is determined; if it is an IP in P, it is mapped to the corresponding IP in MP: if the source IP of the flow is IP_i, then the mapped source IP is MIP_i;

If the source IP is an external IP that is not in P, the mapped IP of the destination IP is obtained first, denoted MIP_i; the source IP is then mapped to a random external IP in the background traffic B_t that satisfies the following condition: the random IP has communicated with MIP_i in B_t;

Finally, for each flow in A_t, the timestamp, source IP, and destination IP are replaced with the mapped values, and the flow is then added to B_t.

2. A network attack traffic dataset generation system, characterized in that it includes:

Module M1: Sets the network traffic collected from the actual campus network as background traffic B_t, and sets the APT attack traffic simulated using Caldera as attack traffic A_t; for each phase of the APT attack, records the time and attack method;

Module M2: Records the start time of each attack step and constructs a timestamp sequence T = {t_1, t_2, ..., t_n}, where n is the number of attack phases, i is the attack phase index, and t_i is the start time of the i-th attack phase;

Module M3: From B_t, extracts the start and end times t_min and t_max; the attack traffic of the i-th attack phase, lying between t_i and t_{i+1}, is merged into the background traffic according to a preset rule to obtain a network attack traffic dataset;

Module M4: Based on the network attack traffic dataset, generates malicious traffic samples that deceive and evade the detection of defense systems, used to test the performance of APT attack detection systems;

The timestamp of each flow in the attack traffic is mapped so that the mapped timestamp fits the timeline of the background traffic B_t, that is, the time axis of A_t is mapped proportionally onto the time axis of B_t; if the start timestamp of a given flow in the attack traffic is t_h, the corresponding mapped timestamp t_map is determined by this proportional mapping;

After the timestamp mapping is completed, the source IP and destination IP in the attack traffic are mapped; for the actual attack path P = {IP_1, IP_2, ..., IP_n} of the simulated APT attack, a mapped attack path MP = {MIP_1, MIP_2, ..., MIP_n} is first randomly constructed from the background traffic;

MP satisfies the following: any two adjacent IPs in MP communicate in the background traffic; among all flows between any two adjacent IPs in MP, a set of flows can be extracted such that the source and destination IPs of these flows are each pair of adjacent elements in MP, and the timestamps of these flows satisfy an increasing relationship;

For each flow in the attack traffic A_t, the type of the source IP is determined; if it is an IP in P, it is mapped to the corresponding IP in MP: if the source IP of the flow is IP_i, then the mapped source IP is MIP_i;

If the source IP is an external IP that is not in P, the mapped IP of the destination IP is obtained first, denoted MIP_i; the source IP is then mapped to a random external IP in the background traffic B_t that satisfies the following condition: the random IP has communicated with MIP_i in B_t;

Finally, for each flow in A_t, the timestamp, source IP, and destination IP are replaced with the mapped values, and the flow is then added to B_t.

3. A computer-readable storage medium storing a computer program, characterized in that when the computer program is executed by a processor, it implements the steps of the network attack traffic dataset generation method of claim 1.

4. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that when the computer program is executed by the processor, it implements the steps of the network attack traffic dataset generation method of claim 1.