Network traffic packet header trajectory synthesis method and device, server and storage medium
By constructing a network traffic packet header trajectory synthesis model consisting of a preprocessor, generator, discriminator, and training controller, and combining reinforcement learning and data binning techniques, the problem of reliance on human resources and professional knowledge in existing technologies is solved, achieving efficient and flexible network traffic packet header trajectory synthesis and improving the consistency between the synthesized trajectory and the real trajectory features.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: TSINGHUA UNIVERSITY
- Filing Date: 2025-05-27
- Publication Date: 2026-06-12
AI Technical Summary
In existing technologies, network traffic packet header trajectory synthesis methods rely on human resources and expertise, are costly and inflexible, and are difficult to adapt to complex network environments. Furthermore, machine learning-driven models struggle to maintain consistency between flow-granular time series and statistical features when generating categorical features.
By constructing a network traffic packet header trajectory synthesis model consisting of a preprocessor, generator, discriminator, and training controller, and training it with real network traffic data packets and real traffic, the model finally generates network traffic packet header trajectories that meet the preset feature distribution. Combining reinforcement learning and data binning techniques, the numerical features and class features are unified, and the model is optimized using Wasserstein loss.
It enables efficient synthesis of network traffic packet header trajectories that conform to the real feature distribution in complex network environments, saving manual feature extraction costs, improving the consistency between the synthesized trajectory and the real trajectory features, and adapting to different network scenarios.
Figure CN120567488B_ABST
Description
Technical Field
[0001] This application relates to the field of electronic digital data processing technology, and in particular to a method, apparatus, server, and storage medium for synthesizing network traffic packet header trajectories.
Background Technology
[0002] Modern networks increasingly rely on supervised machine learning to assist in network security and quality of service (QoS) management tasks, such as classification-based network intrusion detection and application type identification under encrypted traffic. Network traffic header traces with category labels can be used to develop network security and QoS management tools based on supervised machine learning models. For example, network traffic header traces with category labels containing attack traffic such as Distributed Denial of Service (DDoS), port scanning, or brute-force attacks can be used for training and testing network intrusion detection systems. By identifying attack categories, network intrusion detection systems can determine the severity of the attack and take appropriate action. Network traffic header traces with category labels containing traffic from various applications can help identify application service types under encrypted traffic (TLS, etc.) and anonymous protocols (VPN, Tor, etc.), providing more granular network supervision and traffic control for different application services to ensure their security and quality.
[0003] Network operators and organizations with proprietary networks can capture large amounts of recent, real-world network traffic packet headers and provide traffic labels using existing classification models combined with manual annotation. However, due to potential policy, privacy, and commercial restrictions, internet entities such as network operators, which possess large amounts of labeled network traffic data, cannot directly share this data, which may contain sensitive user information (such as IP addresses). Among related technologies, network traffic trajectory synthesis technology can address the problem of sharing labeled network traffic trajectory datasets.
[0004] However, in related technologies, the selection of parameters in rule-based and expert knowledge-driven network traffic header trajectory synthesis requires a large amount of human resources and expert knowledge, which limits its ability to simulate complex network environments. In network model-driven network traffic header trajectory synthesis, the generation effect depends on the degree of adaptation between the model and network traffic, making it difficult to flexibly cope with different network scenarios and complex traffic distributions, which urgently needs improvement.
Summary of the Invention
[0005] This application provides a method, apparatus, server, and storage medium for synthesizing network traffic packet header trajectories, in order to solve the technical problems in related technologies, where the synthesis of network traffic packet header trajectories driven by rules and expert knowledge has high labor costs and relies on professional knowledge, thus restricting the simulation capability in complex network environments, and the synthesis of network traffic packet header trajectories driven by network models relies on the adaptability between the model and network traffic, resulting in poor flexibility.
[0006] The first aspect of this application provides a method for synthesizing network traffic packet header trajectories, applied to a server. The method includes the following steps: acquiring real network traffic data packets and real traffic; training a pre-constructed initial network traffic packet header trajectory synthesis model using the real network traffic data packets and the real traffic to obtain a final network traffic packet header trajectory synthesis model, wherein the initial network traffic packet header trajectory synthesis model consists of a preprocessor, a generator, a discriminator, and a training controller; and using the final network traffic packet header trajectory synthesis model to obtain a network traffic packet header trajectory that satisfies preset real trajectory feature packet granularity and flow granularity feature distribution conditions.
[0007] Optionally, in one embodiment of this application, the step of training a pre-constructed initial network traffic header trajectory synthesis model using the real network traffic data packets and the real traffic to obtain a final network traffic header trajectory synthesis model includes: using the preprocessor to reversibly bin the sequence of the real network traffic data packets to obtain a binning sequence that meets preset conditions, and unifying the attribute values of the binning sequence into category features; obtaining the length of the traffic from the real traffic, and using a preset category label and the length as input to the generator to obtain a generated sequence; processing the binning sequence and the generated sequence with word vectors, and inputting them into the discriminator to obtain a discrimination result; using the training controller to receive the discrimination result and calculate the corresponding loss value, so as to use the loss value to guide the generator and the discriminator to optimize, so as to obtain the final network traffic header trajectory synthesis model.
[0008] Optionally, in one embodiment of this application, the step of using the loss value to guide the generator and the discriminator to optimize in order to obtain the final network traffic packet header trajectory synthesis model includes: using the training controller to sample different length prefixes of the generated sequence to obtain a sampled generated sequence; inputting the sampled generated sequence into the discriminator to obtain an optimized discrimination result, and using the optimized discrimination result to update the loss value until the loss value meets a preset iteration termination condition to obtain the final network traffic packet header trajectory synthesis model.
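The prefix sampling step in [0008] can be illustrated with a minimal sketch. The function name `sample_prefixes`, the uniform choice of prefix length, and the token values are assumptions made for illustration; the application does not state how prefix lengths are chosen.

```python
import random

def sample_prefixes(sequence, num_samples, rng=None):
    """Return `num_samples` prefixes of `sequence` with randomly chosen lengths.

    Assumption: prefix lengths are drawn uniformly from 1..len(sequence).
    """
    rng = rng or random.Random()
    if not sequence:
        return []
    lengths = [rng.randint(1, len(sequence)) for _ in range(num_samples)]
    return [sequence[:n] for n in lengths]

# Hypothetical generated sequence of binned packet tokens.
generated = ["b3", "b3", "b7", "b1", "b7", "b2"]
prefixes = sample_prefixes(generated, num_samples=4, rng=random.Random(0))
```

Scoring prefixes of several lengths, rather than only the full sequence, gives the discriminator feedback on partially generated flows as well as on complete ones.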
[0009] Optionally, in one embodiment of this application, the step of reversibly binning the sequence of real network traffic data packets using the preprocessor to obtain a binned sequence that meets preset conditions, and unifying the attribute values of the binned sequence into category features, includes: obtaining attribute values of real network traffic data packets with category labels under multiple categories from the sequence of real network traffic data packets; calculating the chi-square value of the attribute values of the real network traffic data packets; and using the chi-square value to bin the value range of the attribute values of the real network traffic data packets, so as to unify the attribute values of the real network traffic data packets into the category features.
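One way to read the chi-square binning in [0009] is as a ChiMerge-style procedure that merges adjacent value ranges whose category-label distributions are statistically similar. The sketch below makes that assumption; the names `chi2_adjacent` and `chimerge`, the threshold of 3.84 (roughly the 95% critical value for one degree of freedom), and the toy data are all illustrative rather than taken from the application.

```python
from collections import Counter

def chi2_adjacent(a, b, classes):
    """Chi-square statistic of the 2 x k table formed by two adjacent bins."""
    total = sum(a.values()) + sum(b.values())
    chi2 = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for cls in classes:
            expected = row_total * (a[cls] + b[cls]) / total
            if expected > 0:
                chi2 += (row[cls] - expected) ** 2 / expected
    return chi2

def chimerge(values, labels, threshold=3.84, max_bins=None):
    """Merge adjacent bins until every neighbouring pair differs by more than
    `threshold` (and, if given, at most `max_bins` bins remain).
    Returns a list of inclusive (low, high) ranges."""
    classes = sorted(set(labels))
    counts = {}
    for v, cls in zip(values, labels):
        counts.setdefault(v, Counter())[cls] += 1
    # Start with one bin per distinct attribute value.
    bins = [([v, v], counts[v]) for v in sorted(counts)]
    while len(bins) > 1:
        scores = [chi2_adjacent(bins[i][1], bins[i + 1][1], classes)
                  for i in range(len(bins) - 1)]
        i = min(range(len(scores)), key=scores.__getitem__)
        over_budget = max_bins is not None and len(bins) > max_bins
        if scores[i] > threshold and not over_budget:
            break
        (lo, _), c1 = bins[i]
        (_, hi), c2 = bins[i + 1]
        bins[i:i + 2] = [([lo, hi], c1 + c2)]
    return [tuple(rng) for rng, _ in bins]

lengths = [40, 40, 41, 41, 1400, 1400, 1401, 1401]
labels = ["dns"] * 4 + ["video"] * 4
ranges = chimerge(lengths, labels)
```

In this toy input the two small lengths carry only the `dns` label and the two large lengths only the `video` label, so each pair collapses into one bin while the boundary between them is kept.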
[0010] Optionally, in one embodiment of this application, obtaining the length of the traffic from the real traffic, using a preset category label and the length as input to the generator to obtain a generated sequence, includes: obtaining the previous complete sequence value, wherein the complete sequence value is a sequence value with all target attribute values embedded, and the sequence value is a vector composed of target attribute values extracted from any data table in the real network traffic data packet; combining the previous complete sequence value, the preset category label, and the length, using the sequence feature probability to generate a path, embedding any target attribute value into the current sequence value; based on the target attribute values already present in the current sequence value, using conditional probability features to generate a bypass, embedding all unembedded target attribute values into the current sequence value, until the generated sequence is obtained.
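The two-stage generation in [0010] can be sketched as a loop that first embeds one attribute through a sequence path (conditioned on the previous complete sequence value, the category label, and the flow length) and then fills in the remaining attributes through a conditional bypass. Everything concrete below is hypothetical: the attribute names, the probability tables in `toy_sequence_path` and `toy_conditional_bypass`, and the fixed attribute order stand in for learned networks that the application does not spell out.

```python
import random

def sample(dist, rng):
    """Draw one key from a {value: probability} mapping."""
    values = list(dist)
    return rng.choices(values, weights=[dist[v] for v in values], k=1)[0]

def toy_sequence_path(prev, label, length):
    """Hypothetical stand-in for the sequence feature probability path:
    distribution of the first attribute given the previous complete value."""
    if prev is None:
        return {"small": 0.9, "large": 0.1}
    return {"small": 0.3, "large": 0.7}

def toy_conditional_bypass(current, attr):
    """Hypothetical stand-in for the conditional probability bypass:
    distribution of `attr` given the attributes already embedded."""
    if attr == "tcp_flag":
        if current["length_bin"] == "small":
            return {"SYN": 0.5, "ACK": 0.5}
        return {"ACK": 0.2, "PSH": 0.8}
    return {"short_gap": 0.5, "long_gap": 0.5}

def generate_sequence(label, length, attributes, rng):
    prev, sequence = None, []
    for _ in range(length):
        first, *rest = attributes
        # Stage 1: embed one attribute using the sequence path.
        current = {first: sample(toy_sequence_path(prev, label, length), rng)}
        # Stage 2: embed every remaining attribute using the bypass.
        for attr in rest:
            current[attr] = sample(toy_conditional_bypass(current, attr), rng)
        sequence.append(current)
        prev = current  # becomes the "previous complete sequence value"
    return sequence

flow = generate_sequence("video", 5, ["length_bin", "tcp_flag", "gap_bin"],
                         random.Random(0))
```

Splitting generation this way lets the first attribute carry the temporal dependency between packets, while the remaining attributes only need to be consistent with values already chosen for the same packet.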
[0011] Optionally, in one embodiment of this application, the expression for calculating the loss value of the discriminator includes:
[0012] $$L_D = L_{real} + L_{fake} + \lambda \cdot GP, \qquad L_{real} = -\mathbb{E}_{x \sim P_x}\left[D(x)\right], \qquad L_{fake} = \mathbb{E}_{\tilde{x} \sim P_G}\left[D(\tilde{x})\right]$$
[0013] where $L_D$ represents the final loss value of the discriminator; $L_{real}$ represents the loss value of the discriminator's judgment result on the real data; $L_{fake}$ represents the loss value of the discriminator's judgment result on the generated sequence; $GP$ represents the gradient penalty term; $\lambda$ represents the weight; $x$ represents the real data; $\tilde{x}$ represents the generated sequence; $\mathbb{E}$ represents the expected value; $P_x$ represents the distribution of the real data; $D(x)$ represents the discrimination result of the discriminator on the real data; $P_G$ represents the distribution of the generated sequences; and $D(\tilde{x})$ represents the discrimination result of the discriminator on the generated sequence.
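The terms listed in [0013] (a loss on real data, a loss on generated sequences, and a weighted gradient penalty) match the usual Wasserstein critic loss with gradient penalty. The sketch below assumes that form. The function name, the penalty shape `(norm - 1)^2`, and the default weight of 10 are conventional choices rather than values stated in the application, and the gradient norms are passed in precomputed so the example needs no autodiff library.

```python
from statistics import fmean

def discriminator_loss(real_scores, fake_scores, grad_norms, lam=10.0):
    """Wasserstein-style critic loss with gradient penalty (assumed form).

    real_scores: D(x) for real sequences
    fake_scores: D(x~) for generated sequences
    grad_norms:  norms of the critic's gradient at interpolated samples
    """
    loss_real = -fmean(real_scores)           # pushes scores on real data up
    loss_fake = fmean(fake_scores)            # pushes scores on generated data down
    penalty = fmean([(g - 1.0) ** 2 for g in grad_norms])
    return loss_real + loss_fake + lam * penalty

loss = discriminator_loss([1.0, 3.0], [0.0, 1.0], [1.0, 1.0])
```

With gradient norms exactly 1 the penalty vanishes and the loss reduces to the difference between the mean generated score and the mean real score.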
[0014] A second aspect of this application provides a network traffic header trajectory synthesis apparatus applied to a server. The apparatus includes: an acquisition module for acquiring real network traffic data packets and real traffic; a training module for training a pre-constructed initial network traffic header trajectory synthesis model using the real network traffic data packets and the real traffic to obtain a final network traffic header trajectory synthesis model, wherein the initial network traffic header trajectory synthesis model consists of a preprocessor, a generator, a discriminator, and a training controller; and a synthesis module for using the final network traffic header trajectory synthesis model to obtain a network traffic header trajectory that satisfies preset real trajectory feature packet granularity and flow granularity feature distribution conditions.
[0015] Optionally, in one embodiment of this application, the training module includes: a preprocessing unit, configured to reversibly bin the sequence of the real network traffic data packets using the preprocessor to obtain a binning sequence that meets preset conditions, and to unify the attribute values of the binning sequence into category features; a generation unit, configured to obtain the length of the traffic from the real traffic, and to use a preset category label and the length as input to the generator to obtain a generated sequence; a discrimination unit, configured to process the binning sequence and the generated sequence with word vectors, and then input them into the discriminator to obtain a discrimination result; and an optimization unit, configured to receive the discrimination result using the training controller, and to calculate the corresponding loss value, so as to use the loss value to guide the generator and the discriminator to optimize, so as to obtain the final network traffic packet header trajectory synthesis model.
[0016] Optionally, in one embodiment of this application, the optimization unit includes: a sampling subunit, used to sample different length prefixes of the generated sequence using the training controller to obtain a sampled generated sequence; and an optimization subunit, used to input the sampled generated sequence into the discriminator to obtain an optimized discrimination result, and use the optimized discrimination result to update the loss value until the loss value meets a preset iteration termination condition to obtain the final network traffic packet header trajectory synthesis model.
[0017] Optionally, in one embodiment of this application, the preprocessing unit includes: a first acquisition subunit, configured to acquire attribute values of real network traffic data packets with category labels under multiple categories from the sequence of real network traffic data packets; a calculation subunit, configured to calculate the chi-square value of the attribute values of the real network traffic data packets; and a processing subunit, configured to bin the value range of the attribute values of the real network traffic data packets using the chi-square value, so as to unify the attribute values of the real network traffic data packets into the category features.
[0018] Optionally, in one embodiment of this application, the generation unit includes: a second acquisition subunit, configured to acquire the previous complete sequence value, wherein the complete sequence value is a sequence value with all target attribute values embedded, and the sequence value is a vector composed of target attribute values extracted from any data table in the real network traffic data packet; a first embedding subunit, configured to combine the previous complete sequence value, the preset category label, and the length, and use the sequence feature probability to generate a path to embed any target attribute value into the current sequence value; and a second embedding subunit, configured to generate a bypass based on the existing target attribute values in the current sequence value, using conditional probability features to embed all unembedded target attribute values into the current sequence value, until the generated sequence is obtained.
[0019] Optionally, in one embodiment of this application, the expression for calculating the loss value of the discriminator includes:
[0020] $$L_D = L_{real} + L_{fake} + \lambda \cdot GP, \qquad L_{real} = -\mathbb{E}_{x \sim P_x}\left[D(x)\right], \qquad L_{fake} = \mathbb{E}_{\tilde{x} \sim P_G}\left[D(\tilde{x})\right]$$
[0021] where $L_D$ represents the final loss value of the discriminator; $L_{real}$ represents the loss value of the discriminator's judgment result on the real data; $L_{fake}$ represents the loss value of the discriminator's judgment result on the generated sequence; $GP$ represents the gradient penalty term; $\lambda$ represents the weight; $x$ represents the real data; $\tilde{x}$ represents the generated sequence; $\mathbb{E}$ represents the expected value; $P_x$ represents the distribution of the real data; $D(x)$ represents the discrimination result of the discriminator on the real data; $P_G$ represents the distribution of the generated sequences; and $D(\tilde{x})$ represents the discrimination result of the discriminator on the generated sequence.
[0022] A third aspect of this application provides a server, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the network traffic packet header trajectory synthesis method as described in the above embodiments.
[0023] A fourth aspect of this application provides a computer-readable storage medium storing computer instructions for causing the computer to perform the network traffic packet header trajectory synthesis method as described in the above embodiments.
[0024] A fifth aspect of this application provides a computer program product, including a computer program, which, when executed, is used to implement the above-described method for synthesizing network traffic packet header trajectories.
[0025] This application embodiment can train a pre-constructed initial network traffic header trajectory synthesis model using real network traffic data packets and real traffic to obtain a final network traffic header trajectory synthesis model. This final model then yields network traffic header trajectories that satisfy preset real trajectory feature packet granularity and flow granularity feature distribution conditions. Through machine learning-driven network traffic header trajectory synthesis, features are automatically learned from existing real network traffic. While retaining real features, variations are introduced to synthesize new traffic, saving the cost of manual feature extraction and enabling better learning of complex feature distributions. This improves the consistency between the synthesized network traffic header trajectory with category labels and the real trajectory feature distribution. Therefore, this solves the technical problems in related technologies: rule-based and expert knowledge-driven network traffic header trajectory synthesis suffers from high labor costs and reliance on specialized knowledge, thus limiting its simulation capabilities in complex network environments; and network model-driven network traffic header trajectory synthesis relies on the model's adaptability to network traffic, resulting in poor flexibility.
[0026] Additional aspects and advantages of this application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of this application.
Attached Figure Description
[0027] The above and / or additional aspects and advantages of this application will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, wherein:
[0028] Figure 1 is a schematic diagram illustrating how synthesizing a single attribute purely as a numerical or categorical feature, as done in related technologies, destroys flow-granular time series or statistical features;
[0029] Figure 2 is a flowchart of a method for synthesizing network traffic packet header trajectories according to an embodiment of this application;
[0030] Figure 3 is a schematic diagram illustrating the principle of a network traffic packet header trajectory synthesis method according to an embodiment of this application;
[0031] Figure 4 is a schematic diagram illustrating the effect of merging and splitting bins according to an embodiment of this application;
[0032] Figure 5 is a schematic diagram illustrating the generator architecture and attribute value generation principle according to an embodiment of this application;
[0033] Figure 6 is a schematic diagram illustrating the discriminator architecture and loss value calculation principle according to an embodiment of this application;
[0034] Figure 7 is a schematic diagram of a network traffic packet header trajectory synthesis device according to an embodiment of this application;
[0035] Figure 8 is a schematic diagram of the structure of a server provided according to an embodiment of this application.
Detailed Implementation
[0036] The embodiments of this application are described in detail below. Examples of these embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and intended to explain this application, and should not be construed as limiting this application.
[0037] The following describes, with reference to the accompanying drawings, a method, apparatus, server, and storage medium for synthesizing network traffic packet header trajectories according to embodiments of this application. The background section identified two problems: rule-based and expert knowledge-driven synthesis has high labor costs and relies on specialized knowledge, which limits its ability to simulate complex network environments; and network model-driven synthesis depends on how well the model fits the network traffic, which makes it inflexible. To address them, this application provides a method that trains a pre-constructed initial network traffic packet header trajectory synthesis model using real network traffic data packets and real traffic to obtain a final model, and then uses the final model to obtain network traffic packet header trajectories that satisfy preset packet-granularity and flow-granularity feature distribution conditions of real trajectories. Because the synthesis is machine learning-driven, features are learned automatically from existing real network traffic, and variations are introduced while real features are retained in order to synthesize new traffic. This saves the cost of manual feature extraction, enables better learning of complex feature distributions, and improves the consistency between the feature distribution of the synthesized category-labeled trajectories and that of real trajectories.
[0038] Network traffic header traces are sequences of header fields extracted from network communication data packets and recorded in a specific format. They record routing and transport layer information at the packet level, along with timestamps and packet length information, reflecting protocol behavior and traffic characteristics during communication.
[0039] Category-labeled network traffic header traces can be used to develop network security and quality of service management tools based on supervised machine learning models. To ensure the usability of these tools in real-world network environments, the features of the category-labeled network traffic header traces used for development need to match real traffic. To classify network traffic accurately, the supervised machine learning model needs header traces that resemble real traffic, from which it learns the joint values, statistical values, and time-varying characteristics of the header fields for each traffic category.
[0040] However, developers of network security and quality of service management tools may not have access to category-labeled network traffic header trace datasets that accurately reflect real-world network traffic. Some internet organizations, such as the Center for Applied Internet Data Analysis (CAIDA) and the Information Security Center of Excellence (ISCX), provide publicly available network traffic header trace datasets to developers worldwide. However, these datasets are static and cover only a portion of network attack types, application categories, and encryption protocols, so they struggle to reflect changes in current internet traffic.
[0041] Conversely, network operators and organizations with proprietary networks can obtain a large amount of recent, real-world network traffic header data through network traffic capture and provide traffic labels using existing classification models combined with manual annotation. However, due to potential policy, privacy, and commercial restrictions, internet entities such as network operators that possess large amounts of categorized network traffic data cannot directly share this data, which may contain sensitive user information (such as IP addresses).
[0042] Network traffic trajectory synthesis technology can, to some extent, solve the problem of sharing network traffic trajectory datasets with category labels. This technology aims to synthesize packet header trajectories that do not contain sensitive information but conform to the characteristics of real traffic, based on communication events or real network traffic trajectories in real network traffic scenarios. Existing network traffic packet header trajectory synthesis technologies mainly fall into the following categories:
[0043] (1) Rule-based and expert knowledge-driven synthesis of network traffic header trajectories. Existing network simulation tools (such as NS-3 and OPNET) can generate network traffic header trajectories in an event-driven manner by configuring a series of parameters to simulate network environments. However, the selection of parameters requires a large amount of human effort and expert knowledge, which limits their ability to simulate complex network environments.
[0044] (2) Network model-driven network traffic header trajectory synthesis. These tools establish statistical models of network traffic and extract parameters from real traffic to simulate network traffic distribution under specific conditions. Harpoon generates traffic similar to real traffic distribution by extracting parameters from real traffic. Swing extracts traffic feature distributions at different levels to generate network traffic. LiTGen statistically models wireless traffic by user and application to simulate P2P and email wireless traffic. Tmix generates corresponding TCP connections by simulating socket-level behavior of source applications. These works possess, to some extent, data-driven traffic generation capabilities and avoid excessive human intervention. However, because their generation quality depends on how well the model fits the network traffic, they struggle to handle different network scenarios and complex traffic distributions flexibly.
[0045] (3) Machine learning-driven network traffic header trajectory synthesis. These tools automatically learn features from existing real network traffic using machine learning models, introducing variations while preserving the real features to synthesize new traffic. This approach can save the cost of manual feature extraction and also better learn complex feature distributions.
[0046] However, machine learning-driven network traffic packet header trajectory synthesis models in related technologies still have problems.
[0047] The machine learning-driven network traffic header trajectory synthesis model takes as input PCAP format header trajectories collected and labeled in real network environments. Packets observed within a certain period that share the same transport layer protocol and the same pair of IP addresses and port numbers, in either direction, are treated as one bidirectional network flow. Given the requirements of downstream network security and quality of service management tasks that use network traffic header trajectories, and following existing network traffic header trajectory generation work, the synthesized network traffic header trajectories for each category need to be consistent with the corresponding real data in the following packet-level and flow-level feature value distributions:
[0048] (1) Packet-level header attributes. Packet header attributes are defined as the header fields (port, TCP flags, etc.) and measured values (packet length, timestamp) required by downstream classification tasks. For machine learning-driven network traffic header trajectory synthesis models, the header attributes of the synthesized trajectory need to maintain a marginal and joint distribution consistent with the real trajectory.
[0049] (2) Flow-level time series and statistics. The time series composed of the packet header trajectory of the same flow can serve as a direct reflection of its protocol status and application interaction process, including frequency domain characteristics and sequence segment characteristics; the statistical values obtained after aggregation at the flow granularity, such as the distribution characteristics of the number of bytes in the flow, average packet length, and minimum packet interval, can also serve as an important basis for classifying the application and protocol to which the traffic belongs. Therefore, the synthetic data needs to restore the distribution characteristics of the real network traffic packet header trajectory in the flow-level time series and statistical values.
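The flow definition and the flow-level statistics described above can be sketched as follows. The packet dictionary layout, the field names, and the 60-second idle timeout that closes a flow are assumptions made for illustration; the text only says that packets within "a certain period" belong to the same flow.

```python
def flow_key(pkt):
    """Direction-independent key: both endpoints sorted, plus the protocol."""
    a = (pkt["src_ip"], pkt["src_port"])
    b = (pkt["dst_ip"], pkt["dst_port"])
    lo, hi = sorted((a, b))
    return (lo, hi, pkt["proto"])

def group_flows(packets, idle_timeout=60.0):
    """Group packets into bidirectional flows, starting a new flow when the
    same key has been idle for longer than `idle_timeout` seconds."""
    flows, open_flows = [], {}
    for pkt in sorted(packets, key=lambda p: p["ts"]):
        key = flow_key(pkt)
        idx = open_flows.get(key)
        if idx is None or pkt["ts"] - flows[idx][-1]["ts"] > idle_timeout:
            flows.append([])
            idx = open_flows[key] = len(flows) - 1
        flows[idx].append(pkt)
    return flows

def flow_statistics(flow):
    """Aggregate the flow-level values named in the text."""
    lengths = [p["length"] for p in flow]
    times = sorted(p["ts"] for p in flow)
    gaps = [b - a for a, b in zip(times, times[1:])]
    return {
        "total_bytes": sum(lengths),
        "mean_packet_length": sum(lengths) / len(lengths),
        "min_interarrival": min(gaps) if gaps else None,
    }

def pkt(src, sport, dst, dport, ts, length):
    return {"src_ip": src, "src_port": sport, "dst_ip": dst,
            "dst_port": dport, "proto": "tcp", "ts": ts, "length": length}

packets = [
    pkt("10.0.0.1", 5000, "10.0.0.2", 443, 0.0, 60),
    pkt("10.0.0.2", 443, "10.0.0.1", 5000, 0.5, 1500),   # reverse direction
    pkt("10.0.0.1", 5000, "10.0.0.2", 443, 0.75, 40),
    pkt("10.0.0.1", 5000, "10.0.0.2", 443, 500.0, 60),   # after the idle gap
]
flows = group_flows(packets)
stats = flow_statistics(flows[0])
```

The per-flow packet length and timestamp lists are the flow-level time series, and the dictionary returned by `flow_statistics` holds the aggregated statistics whose distributions the synthetic data is expected to reproduce.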
[0050] Early machine learning-driven network traffic header trajectory synthesis models were mostly based on packet-level synthesis. This involved treating each data packet in the network traffic header trajectory as a row in a table, and different header attributes as different columns, synthesizing a new network traffic header trajectory while only considering the attribute distribution at the packet level. While such techniques (e.g., PAC-GAN, PacketCGAN) can maintain good consistency between the header attributes in the synthesized network traffic header trajectory and the actual traffic distribution, they completely ignore the time-series and statistical characteristics at the flow granularity level.
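A small experiment makes this limitation concrete. The sketch below caricatures packet-level synthesis as shuffling the rows of the packet table and cutting them back into flows; the packet lengths and the shuffle are illustrative assumptions, not the behaviour of PAC-GAN or PacketCGAN specifically.

```python
import random
from collections import Counter

# Each real flow opens with the same two signalling packets (60, 62).
real_flows = [[60, 62, 1400], [60, 62, 1380], [60, 62, 1420]]
pool = [length for flow in real_flows for length in flow]

# Packet-level view: rows are independent, so their order carries no meaning.
rows = pool[:]
random.Random(0).shuffle(rows)
synthetic_flows = [rows[i:i + 3] for i in range(0, len(rows), 3)]

# The packet-level distribution is preserved exactly...
same_marginal = Counter(pool) == Counter(l for f in synthetic_flows for l in f)

# ...but nothing forces the flow-level opening pattern to survive.
real_prefixes = Counter(tuple(f[:2]) for f in real_flows)
synthetic_prefixes = Counter(tuple(f[:2]) for f in synthetic_flows)
```

Comparing `real_prefixes` with `synthetic_prefixes` shows whether the (60, 62) opening survived; the packet-level length histogram is identical either way, which is why a packet-level check alone cannot detect the loss.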
[0051] Flow-granular network traffic header trajectory synthesis models can generate both packet-granular header attributes and flow-granular time series and statistical values. However, unlike unlabeled traffic generation, which only needs to focus on the overall feature distribution of the dataset, labeled traffic generation needs to maintain the feature distribution of each category. This requires network traffic generation models to have more refined modeling and processing of the different types of feature values contained in network traffic. Existing flow-granular network traffic header trajectory synthesis models generally classify attribute values into either numerical features or categorical features based on whether the values have metric significance. Numerical features have clear metric significance, and similar feature values have similar properties. Categorical features do not have obvious metric significance, and similar feature values may correspond to very different properties. However, for network traffic header trajectories with category labels, simply classifying them into numerical or categorical features according to attributes may disrupt the feature distribution of the synthesized header trajectory, especially in terms of flow-granular time series and statistical features.
[0052] In network traffic generation scenarios with category labels, the same feature dimension may exhibit both numerical and categorical characteristics. First, the same feature dimension may behave differently in different categories. For example, port numbers, which are often treated as categorical features, may be allocated within a continuous range in some P2P applications (such as BitTorrent), which gives them certain numerical characteristics. Second, the same feature dimension may also behave differently across value ranges within the same category. For instance, in SMTP protocol applications (such as email), smaller packet lengths correspond to signaling data packets and take only a few specific values, exhibiting categorical characteristics, while larger packet lengths correspond to payload data packets and take more continuous and random values, exhibiting numerical characteristics.
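A binned vocabulary is one way to let a single attribute behave categorically in one value range and numerically in another. In the sketch below, the bin boundaries are hypothetical and decoding by drawing uniformly inside a bin is an assumption made for illustration: single-value bins reproduce exact signalling lengths, while a wide bin stands for a continuous payload range.

```python
import bisect
import random

# Hypothetical bins for packet length: three exact values, one wide range.
BINS = [(40, 40), (52, 52), (60, 60), (61, 1500)]

def encode(value, bins=BINS):
    """Map a raw value to the index of the bin that contains it."""
    starts = [lo for lo, _ in bins]
    i = bisect.bisect_right(starts, value) - 1
    if i < 0 or value > bins[i][1]:
        raise ValueError(f"{value} is not covered by any bin")
    return i

def decode(token, rng, bins=BINS):
    """Map a bin index back to a value (uniform within the bin, by assumption)."""
    lo, hi = bins[token]
    return rng.randint(lo, hi)

rng = random.Random(0)
tokens = [encode(v) for v in (40, 60, 1200)]
restored = [decode(t, rng) for t in tokens]
```

Every attribute value becomes a category token, so the generator handles one feature type, while decoding still yields exact values for narrow bins and varied values for wide ones.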
[0053] Figure 1 takes packet length as an example to demonstrate how synthesizing a single attribute purely as a numerical feature or purely as a categorical feature disrupts flow-granular time series or statistical features. In this example, smaller packet lengths exhibit categorical characteristics, so that some flows contain packet length sequence fragments with specific patterns. Larger packet lengths exhibit numerical characteristics, so that the maximum packet length is continuously distributed within a certain range. The generative model is trained on a dataset sampled from the network and generates data treating the attribute as either a categorical or a numerical feature. (i) Treating the attribute values as numerical features leads the model to tend to generate values such as 61, lying between the packet lengths 60 and 62, even though no such attribute value exists in the real data. (ii) Conversely, treating the attribute values as categorical features only generates values present in the sampled data, causing the originally continuous value range to become discrete. This problem is particularly prominent for sparsely distributed features such as the maximum packet length. Since the characteristics of these features are often directly related to the implementation of the protocols, scripts, or applications corresponding to different categories of network traffic, disrupting them will affect the accuracy of downstream classification tasks.
[0054] As a commonly used model for data generation, Generative Adversarial Networks (GANs) have been introduced into the problem of network traffic generation in numerous existing works. However, the GAN architectures used in existing network traffic models are mostly designed for continuous numerical features, which makes it difficult to generate attribute values with categorical features. In existing GAN models, the generator directly generates feature sequences through parameter-controlled deterministic transformations. This output is then passed to the discriminator to generate a loss value, guiding the model parameters for minor optimization. However, for discrete categorical features, minor parameter optimizations may not cause changes in the generated data, thus preventing model optimization.
[0055] The sequence generative adversarial network (SeqGAN), which is based on reinforcement learning, is a model proposed for generating categorical feature sequences. Compared to general GAN models, its core changes are: (i) at each step of the sequence, it generates the probability distribution of the feature value given the existing prefix, rather than the feature sequence itself; (ii) it introduces reinforcement learning, using Markov chain Monte Carlo sampling from the existing sequence prefix to obtain categorical feature sequences and using the discriminator's judgment result as the reward value to evaluate the quality of the currently generated probability distribution. This transforms the discrete categorical feature generation problem into a continuous probability value generation problem, avoiding the optimization problem that arises when generating discrete values. In addition, this method generates each new data packet conditioned on the existing flow-granular packet header trajectory sequence prefix, which allows it to simulate the sequence feature changes caused by state changes of the network traffic packet header trajectory at flow granularity. The model can therefore effectively generate sequences whose attribute values have categorical characteristics in network traffic. Finally, the reinforcement learning-based loss calculation can also prioritize the generation quality of network traffic sequence prefixes, which is particularly important for downstream applications that require real-time classification.
[0056] To address the aforementioned challenges, this application introduces reinforcement learning into the field of network traffic header trajectory synthesis and unifies the mixed numerical and categorical features in the attribute values of network traffic header trajectories using data binning technology. Furthermore, based on the original sequence generative adversarial model, this application adds a conditional probability feature generation bypass to the generator to solve the problem of excessive output layer parameters when generating multi-attribute network traffic header trajectories. Finally, this application implements a discriminator based on Wasserstein loss value using a pre-trained word vector model, significantly enhancing the stability of model training. Compared to state-of-the-art network traffic header trajectory synthesis models, this application improves the consistency between the synthesized network traffic header trajectories with category labels and the true trajectory feature distribution.
[0057] In summary, the embodiments of this application provide a network traffic header trajectory synthesis model with category labels that can be deployed, trained, and used for synthesis on a general-purpose server. The model learns the feature distribution of each category from the provided real network traffic header trajectories and synthesizes network traffic header trajectories that conform to the packet-granular and flow-granular feature distributions of the real trajectories. Furthermore, users can adjust the proportion of network traffic header trajectories corresponding to different category labels as needed.
[0058] Specifically, Figure 2 is a flowchart of a method for synthesizing network traffic packet header trajectories provided in an embodiment of this application.
[0059] As shown in Figure 2, this method for synthesizing network traffic packet header trajectories is applied to a server. The method includes the following steps:
[0060] In step S201, real network traffic data packets and real traffic are obtained.
[0061] Real network traffic data packets are the smallest unit of data transmission in network communication. They carry specific information from the source device to the destination device and are transmitted in binary signal form through physical media (such as optical fibers and cables). They are the cornerstone of Internet communication. Traffic data is a macroscopic quantitative indicator of network activity, reflecting the total amount of data transmitted, the rate, and the distribution characteristics per unit time. It is used to evaluate network performance, security status, and user behavior.
[0062] This application embodiment can obtain real network traffic data packets and real traffic to automatically learn features from existing real network traffic in subsequent model training, introduce changes while retaining real features, and synthesize new traffic.
[0063] In step S202, a pre-built initial network traffic header trajectory synthesis model is trained using real network traffic data packets and real traffic to obtain the final network traffic header trajectory synthesis model. The initial network traffic header trajectory synthesis model consists of a preprocessor, a generator, a discriminator, and a training controller.
[0064] Furthermore, in this embodiment of the application, the pre-constructed initial network traffic header trajectory synthesis model can be trained using real network traffic data packets and real traffic, so that the final network traffic header trajectory synthesis model can automatically learn features from existing real network traffic.
[0065] Optionally, in one embodiment of this application, a pre-built initial network traffic header trajectory synthesis model is trained using real network traffic data packets and real traffic to obtain a final network traffic header trajectory synthesis model. This includes: using a preprocessor to reversibly bin the sequence of real network traffic data packets to obtain binning sequences that meet preset conditions, and unifying the attribute values of the binning sequences into category features; obtaining the length of the traffic from the real traffic, and using preset category labels and length as input to the generator to obtain a generated sequence; processing the binning sequence and the generated sequence with word vectors, and then inputting them into a discriminator to obtain a discrimination result; using a training controller to receive the discrimination result and calculate the corresponding loss value, so as to use the loss value to guide the generator and discriminator to optimize, thereby obtaining the final network traffic header trajectory synthesis model.
[0066] In actual implementation, the embodiments of this application can model a network traffic packet header trajectory with category labels as a multi-attribute time series plus a category, as follows:
[0067] (1) The network traffic packet header trajectory dataset is $D = \{(x, y)\}$, where $(x, y)$ is a sample, $x$ is the real data (feature values of interest to downstream applications extracted from real network traffic, such as protocol fields and measurement values), whose data type is an aggregated bidirectional flow, and $y$ is the corresponding category;
[0068] (2) $x = \{(f_1, f_2, \ldots, f_l), l\}$, where $l$ is the length of the bidirectional flow (the number of packets within the flow) and $f_i$ is the multidimensional feature value of the $i$-th data packet within the flow; together these form a multidimensional feature value sequence;
[0069] (3) $f_i = (a_1, \ldots, a_m)$, where $a_j$ is a feature value that may be used for classification by downstream applications, including packet length, direction, time information, field values, etc.
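By way of a non-limiting illustration, the sample structure in (1) to (3) can be sketched as a small Python data model. The class name `FlowSample`, the three-attribute layout (packet length, direction, inter-arrival time), and the example values are assumptions made for this sketch only and are not taken from the application.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical attribute layout (m = 3): (packet_length, direction, inter_arrival_ms).
PacketFeatures = Tuple[int, int, float]


@dataclass
class FlowSample:
    """One sample (x, y): a bidirectional flow and its category label."""
    packets: List[PacketFeatures]  # f_1 ... f_l, one tuple of m attributes per packet
    label: str                     # category y

    @property
    def length(self) -> int:
        # l: number of packets within the bidirectional flow
        return len(self.packets)


# Dataset D = {(x, y)} with made-up values for illustration.
dataset = [
    FlowSample(packets=[(60, 0, 0.0), (60, 1, 1.2), (1448, 1, 0.4)], label="smtp"),
    FlowSample(packets=[(74, 0, 0.0), (74, 1, 0.9)], label="p2p"),
]
```

Here `length` is derived from the packet list, so the pair $\{(f_1, \ldots, f_l), l\}$ cannot become inconsistent.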
[0070] Figure 3 illustrates the framework and training process of the network traffic packet header trajectory synthesis model according to an embodiment of this application. The model has four core components: a preprocessor, a generator, a discriminator, and a training controller. The training process consists of the following steps:
[0071] Step S1: The preprocessor reversibly bins the real network traffic data packet sequence into a bin sequence that preserves numerical characteristics while significantly reducing the value space, and unifies the attribute values into category features.
[0072] Step S2: The generator takes the category label and the length of the traffic to be generated as input and, through the sequence probability feature generation path and the conditional probability feature generation bypass, generates the conditional distribution of each attribute of a multi-attribute sequence value given the sequence prefix and the attribute values already generated. The generated sequence is obtained by sampling from these distributions. The proportion of category labels can be adjusted by the user, and the length distribution can be sampled from real traffic.
[0073] Step S3: After word vector processing, the binned real data and the sampled generated sequence are input to the discriminator, which judges whether its input is real data that matches the corresponding label.
[0074] Step S4: The training controller receives the discrimination result and uses its Wasserstein loss value both as the reward value for reinforcement learning and as the loss value of the discriminator, guiding the optimization of the generator and the discriminator.
[0075] When optimizing the generator and discriminator, the training controller can apply Markov chain Monte Carlo (MCMC) sampling to prefixes of different lengths of the generated sequence to obtain multiple sampled generated sequences, i.e., multiple synthesized network traffic data packet sequences.
[0076] Optionally, in one embodiment of this application, a preprocessor is used to reversibly bin the sequence of real network traffic data packets to obtain a binned sequence that meets preset conditions, and the attribute values of the binned sequences are unified as category features. This includes: obtaining attribute values of real network traffic data packets with category labels under multiple categories from the sequence of real network traffic data packets; calculating the chi-square value of the attribute values of the real network traffic data packets; and using the chi-square value to bin the range of attribute values of the real network traffic data packets, so as to unify the attribute values of the real network traffic data packets as category features.
[0077] To unify the attribute values of network traffic packet header trajectories, which mix numerical and categorical features, this application proposes a reversible feature value binning method based on ChiMerge and improved for sparsely distributed intervals. By binning the range of attribute values (the attribute values of real network traffic packets), the attribute values are transformed into categorical features, which can then be processed uniformly by a sequence generative adversarial network based on reinforcement learning. Furthermore, the ChiMerge method ensures that the attribute values are approximately uniformly distributed within each binned interval, so that attribute values with the same distribution as before binning can be reversibly recovered through random sampling. This guarantees that the binned sequence synthesized by the model can be restored to a packet header trajectory attribute value sequence, i.e., a sequence of network traffic packet attribute values, where the packet header trajectory is the captured data packets.
[0078] Chi-square merging (ChiMerge) is a statistical method that merges a series of disjoint intervals based on the chi-square test, which tests whether an actual distribution matches an expected distribution. For $k$ disjoint intervals, assume that the merged interval conforms to the distribution $P_E$. Let $O_i$ be the actual observed frequency of the $i$-th interval and $E_i$ be the frequency of the $i$-th interval under the expected distribution $P_E$. The chi-square value $\chi^2$, which measures how closely the actual distribution matches the expected distribution (smaller is closer), is calculated as follows:
[0079] $$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
[0080] When the chi-square value is sufficiently small, the actual distribution can be approximated as the expected distribution. Therefore, the data from the k disjoint intervals can be replaced with data conforming to the merged expected distribution.
[0081] To facilitate the calculation of the chi-square value and the reversible recovery of data after merging, the embodiments of this application set the expected distribution $P_E$ after merging to a uniform distribution. The chi-square value of the attribute values of category-labeled network traffic trajectories (attribute values of real network traffic data packets) under $n$ categories is calculated as follows:
[0082] For two adjacent intervals $I_1$ and $I_2$ that may have a blank area between them, let the blank area be $I_e$. Let the lengths of these intervals be $L(I_1)$, $L(I_2)$, and $L(I_e)$, and let the frequencies of the corresponding attribute values under category label $i$ be $C(I_1, i)$, $C(I_2, i)$, and 0 (the blank interval contains no data). Then the expected distribution frequency $E_i$ of the data under category label $i$ within the merged interval is:
[0083]
[0084] The chi-square value $\chi^2$ of the distribution before and after interval merging can then be expressed as:
[0085]
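The expressions themselves are not reproduced in this text. As a hedged illustration only, one form consistent with the definitions above (uniform expected distribution over the merged interval, zero observed frequency in the blank area) spreads the total count of category $i$ over the three sub-intervals in proportion to their lengths. The per-sub-interval notation $E_i(I)$ and $C(I_e, i) = 0$ is introduced for this sketch and is an assumption, not the application's own formula.

```latex
% Assumed reconstruction: expected frequency of category i in sub-interval I
E_i(I) = \bigl(C(I_1, i) + C(I_2, i)\bigr)\,
         \frac{L(I)}{L(I_1) + L(I_e) + L(I_2)},
\qquad I \in \{I_1, I_e, I_2\}

% Assumed reconstruction: chi-square over n categories and the three sub-intervals,
% with C(I_e, i) = 0 and terms with E_i(I) = 0 omitted
\chi^2 = \sum_{i=1}^{n} \; \sum_{I \in \{I_1, I_e, I_2\}}
         \frac{\bigl(C(I, i) - E_i(I)\bigr)^2}{E_i(I)}
```

Under this reading, a long blank area $I_e$ contributes a term equal to its full expected count, which is why two sparse intervals separated by a wide gap produce a large chi-square value.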
[0086] However, ChiMerge performs poorly in intervals where the numerical distribution is too sparse. When the attribute values are too sparsely distributed within a specific interval (e.g., there is only one data point in two adjacent intervals, with a very long blank interval in between), the calculated chi-square value will be too large. To address this, this embodiment adds a frequency-based merging step before chi-square merging, directly merging intervals with frequencies below a specific threshold (default is 1).
[0087] Figure 4 demonstrates the effect of the improved chi-square binning on sparse data for a single category and a single attribute. Initially, every numerical point at which the attribute takes a value is placed in its own interval. For sparsely distributed regions, threshold binning is performed first to avoid overly fine binning and too many blank intervals. Then, the chi-square binning algorithm merges adjacent intervals whose chi-square values are below a specific threshold (determined by the confidence level of the chi-square test) until the chi-square values of all adjacent intervals are above the threshold.
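The two-stage procedure described above can be sketched as follows for a single category and a single integer-valued attribute. This is a minimal sketch under stated assumptions, not the application's implementation: the names `pair_chi2`, `merge`, and `bin_values` are hypothetical, "frequency below the threshold" is read as "at most `freq_threshold`", intervals are half-open integer ranges, and the chi-square value of a pair assumes a uniform expected distribution over the merged span including the blank gap.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int, int]  # (start, end_exclusive, count)


def pair_chi2(a: Interval, b: Interval) -> float:
    """Chi-square of a, the blank gap, and b against a uniform merged interval."""
    total = a[2] + b[2]
    span = b[1] - a[0]
    parts = [(a[1] - a[0], a[2]), (b[0] - a[1], 0), (b[1] - b[0], b[2])]
    chi2 = 0.0
    for length, observed in parts:
        expected = total * length / span
        if expected > 0:  # skip a zero-length gap
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def merge(a: Interval, b: Interval) -> Interval:
    return (a[0], b[1], a[2] + b[2])


def bin_values(values: List[int], chi_threshold: float,
               freq_threshold: int = 1) -> List[Interval]:
    # Start with one unit-width interval per distinct value.
    counts: Dict[int, int] = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    intervals = [(v, v + 1, c) for v, c in sorted(counts.items())]

    # Stage 1: frequency-based pre-merge of runs of sparse neighbours.
    merged: List[Interval] = []
    prev_sparse = False
    for iv in intervals:
        sparse = iv[2] <= freq_threshold
        if merged and prev_sparse and sparse:
            merged[-1] = merge(merged[-1], iv)
        else:
            merged.append(iv)
        prev_sparse = sparse

    # Stage 2: chi-square merging of the most similar adjacent pair.
    while len(merged) > 1:
        scores = [pair_chi2(merged[i], merged[i + 1]) for i in range(len(merged) - 1)]
        best = min(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= chi_threshold:
            break
        merged[best:best + 2] = [merge(merged[best], merged[best + 1])]
    return merged


# Two frequent small packet lengths plus four sparse large ones (made-up data).
sample = [60] * 50 + [66] * 50 + [1000, 1100, 1250, 1400]
bins = bin_values(sample, chi_threshold=3.84)  # threshold chosen for illustration
dense = bin_values([10] * 5 + [11] * 5, chi_threshold=3.84)
```

In this example the frequent lengths 60 and 66 stay in separate bins, the four sparse large values are fused into one bin by the frequency-based step, and the two equally frequent neighbouring values 10 and 11 are merged by the chi-square step.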
[0088] By employing the improved chi-square merging reversible feature value binning method for sparsely distributed intervals, the attribute values of network traffic packet header trajectories, which combine categorical and numerical features, are binned attribute by attribute and used as input to the categorical feature model for training and generation. At the same time, this method ensures that a generated binned sequence can be restored to an attribute value sequence through random sampling within the binned intervals, thus completing the synthesis of network traffic packet header trajectories.
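The reversibility described above can be illustrated with a short sketch: a value is encoded as the index of its bin, and a bin index is decoded by sampling uniformly within the bin. The function names `encode` and `decode`, the integer half-open bins, and the example bin boundaries are assumptions for illustration.

```python
import bisect
import random
from typing import List, Tuple

Bin = Tuple[int, int]  # (start, end_exclusive); hypothetical integer-valued bins


def encode(value: int, bins: List[Bin]) -> int:
    """Return the index of the bin that contains value."""
    starts = [b[0] for b in bins]
    idx = bisect.bisect_right(starts, value) - 1
    if idx < 0 or value >= bins[idx][1]:
        raise ValueError(f"{value} is outside every bin")
    return idx


def decode(idx: int, bins: List[Bin], rng: random.Random) -> int:
    """Restore an attribute value by sampling uniformly inside the bin."""
    start, end = bins[idx]
    return rng.randrange(start, end)


example_bins = [(60, 61), (66, 67), (1000, 1401)]
rng = random.Random(0)
packet_lengths = [60, 66, 1250, 60]
tokens = [encode(v, example_bins) for v in packet_lengths]
restored = [decode(t, example_bins, rng) for t in tokens]
```

Single-value bins restore exactly, while a wide bin restores a value with the same (approximately uniform) distribution rather than the identical value, which matches the distribution-level reversibility claimed in the text.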
[0089] In summary, the embodiments of this application can use a sequence conditional generative adversarial network based on reinforcement learning and a data binning algorithm designed for network traffic packet header trajectories to uniformly convert network traffic packet header trajectories with both numerical and categorical features of a single attribute into categorical feature generation, thus avoiding the problem of flow granular feature destruction caused by existing methods that synthesize single attributes as numerical or categorical features.
[0090] Optionally, in one embodiment of this application, obtaining the length of the traffic from the real traffic and using a preset category label and the length as input to the generator to obtain a generated sequence includes: obtaining the previous complete sequence value, wherein a complete sequence value is a sequence value in which all target attribute values have been embedded, and a sequence value is a vector composed of target attribute values extracted from any data packet in the real network traffic data packets; combining the previous complete sequence value, the preset category label, and the length, and using the sequence probability feature generation path to embed any one target attribute value into the current sequence value; and, based on the target attribute values already in the current sequence value, using the conditional probability feature generation bypass to embed all not-yet-embedded target attribute values into the current sequence value, until the generated sequence is obtained.
[0091] To address the multi-attribute nature of sequence values in network traffic packet header trajectories, this application proposes a reinforcement learning generator with an added conditional probability feature generation bypass. Unlike text sequences with a single attribute, network traffic sequence values contain multiple attributes simultaneously. For $m$ attributes $(a_1, \ldots, a_m)$, let the numbers of possible values of the individual attributes be $(v_1, \ldots, v_m)$. Then the number of values that the joint distribution of all attributes can take is $\prod_{j=1}^{m} v_j$.
[0092] Therefore, even with prior binning and merging of attribute value ranges, the number of possible combinations of all attributes could still exceed one billion. Since the number of parameters in the model's output layer is proportional to the length of the probability distribution vector, directly outputting the probability distribution over these value combinations would make the model difficult to train due to the excessive number of parameters in the output layer.
[0093] In this embodiment, the joint distribution $p(a_1, \ldots, a_m)$ of the different attributes of a sequence value (a vector composed of target attribute values extracted from a data packet) can be decomposed into the conditional distributions $p(a_1), p(a_2 \mid a_1), \ldots, p(a_m \mid a_1, \ldots, a_{m-1})$. During a round of sampling, the generator generates the probability distribution of the remaining attributes of the current sequence value based on the existing sequence prefix and the existing attributes of the current sequence value.
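A small worked calculation shows why the decomposition matters. The per-attribute bin counts below are made-up numbers chosen only so that the product exceeds one billion; they are not figures from the application.

```python
from math import prod

# Hypothetical numbers of bins per attribute after binning (illustrative only).
values_per_attribute = [200, 2, 150, 64, 300]

# A single joint output needs one probability per value combination.
joint_outputs = prod(values_per_attribute)

# Conditional outputs need one probability per value of one attribute at a time.
conditional_outputs = sum(values_per_attribute)
largest_single_output = max(values_per_attribute)
```

With these numbers a joint output vector would have 1,152,000,000 entries, whereas the conditional outputs total 716 entries and no single output vector exceeds 300. The conditioning itself still requires parameters in the bypass, so this compares output vector lengths only.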
[0094] The generator architecture and the generation process of one attribute value are shown in Figure 5. The generator can be divided into two main data processing and generation paths: a sequence probability feature generation path, which starts from the previous complete sequence value (the sequence value for which attribute generation has finished), and a conditional probability feature generation bypass, which starts from the attribute values already generated for the current sequence value. In addition, the sequence length and the category label are input into the generator as generation conditions.
[0095] (1) Sequence probability feature generation path. The previous complete sequence value is first split into its attribute values, each of which is processed by the word embedding layer of the corresponding attribute. The embedded sequence value is concatenated with the processed sequence length value, and the linear layer selected by the category label transforms the result into a hidden feature. This hidden feature is input into an LSTM (Long Short-Term Memory) network, which also receives the hidden state from the previous step. After the state transition, the LSTM generates the current sequence feature from the input hidden feature, and its current hidden state is retained for the state transition when generating the next sequence value feature. The sequence feature is merged with the output of the conditional probability feature generation bypass and, after probability processing, the output is the conditional probability distribution vector of the next attribute value. Finally, this vector is sampled to obtain the next attribute value.
[0096] (2) Conditional probability feature generation bypass. The generator uses the existing attribute values of the current sequence value as conditions. The existing attribute values are likewise first processed by the word embedding layer of the corresponding attribute. The embedded attribute values are then processed by a linear layer separate from that of the sequence probability feature generation path. The result is concatenated with the sequence features and merged into the sequence probability feature generation path.
[0097] The above process describes how the generator produces an attribute value. The generated attribute value is concatenated with existing attribute values and used as a condition for generating the next sequence value or attribute value. This process is repeated until all attribute values for all sequence values under the expected sequence length are generated.
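The attribute-by-attribute loop described above can be sketched independently of the network architecture. In the sketch below the trained generator is replaced by a callable `cond_dist` that returns a probability vector for the next attribute; this callable, the function `generate_flow`, and the uniform stand-in `uniform_stub` are all hypothetical and exist only to make the control flow runnable.

```python
import random
from typing import Callable, List, Sequence

# cond_dist(prefix, partial, attr_index, label, length) -> probabilities over
# the bins of attribute attr_index (stand-in for the generator's output layer).
CondDist = Callable[[List[List[int]], List[int], int, int, int], Sequence[float]]


def generate_flow(cond_dist: CondDist, num_attrs: int, label: int, length: int,
                  rng: random.Random) -> List[List[int]]:
    flow: List[List[int]] = []        # completed sequence values (the prefix)
    for _ in range(length):
        partial: List[int] = []       # attributes already generated for this packet
        for j in range(num_attrs):
            probs = cond_dist(flow, partial, j, label, length)
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
            partial.append(token)     # becomes a condition for the next attribute
        flow.append(partial)          # becomes part of the prefix for the next packet
    return flow


def uniform_stub(prefix, partial, attr_index, label, length):
    # Stand-in for a trained model: uniform over a hypothetical bin count.
    bins_per_attr = [4, 2, 3]
    n = bins_per_attr[attr_index]
    return [1.0 / n] * n


flow = generate_flow(uniform_stub, num_attrs=3, label=0, length=5, rng=random.Random(1))
```

The inner loop corresponds to the conditional probability bypass (conditioning on `partial`), and the outer loop corresponds to the sequence path (conditioning on `flow`).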
[0098] In summary, the embodiments of this application can transform the joint probability generation problem of multiple attribute sequence values of network traffic packet header trajectory into a conditional probability generation problem of the current attribute under existing attribute values by introducing a conditional probability feature generation bypass. This significantly reduces the length of the probability distribution vector output by the generator, thereby avoiding the problem of too many parameters in the generator output layer under multiple attribute sequence values.
[0099] Optionally, in one embodiment of this application, the loss value is used to guide the generator and discriminator to perform optimization in order to obtain the final network traffic packet header trajectory synthesis model, including: sampling different length prefixes of the generated sequence using a training controller to obtain a sampled generated sequence; inputting the sampled generated sequence into the discriminator to obtain an optimized discrimination result, and using the optimized discrimination result to update the loss value until the loss value meets a preset iteration termination condition to obtain the final network traffic packet header trajectory synthesis model.
[0100] The generator training process is based on reinforcement learning, and the reward value of reinforcement learning is the classification result of the classifier. To ensure the stability of training, this embodiment uses the Wasserstein loss of the classification result as the reward value. The generator optimization process in one round of training is as follows:
[0101] Step S1, Sequence generation: Obtain the complete generated sequence.
[0102] Step S2, Sequence prefix sampling: For each prefix of the generated sequence (prefix lengths 0 to l), the generator is used to complete the prefix by sampling up to length l; this process is repeated N times to obtain N(l+1) sampled sequences.
[0103] Step S3, Classification: Input the N(l+1) sampled sequences into classifier D for classification.
[0104] Step S4, Probability sequence generation: Without sampling, directly obtain from the output layer the probability value sequence $s = (p_1, p_2, \ldots, p_l)$ for each step of the generated sequence.
[0105] Step S5, Loss calculation: The generator loss value is calculated as follows:
[0106]
[0107] Step S6: Gradient calculation and model parameter update to complete one round of training.
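Steps S2 to S5 can be sketched as below. The loss expression is not reproduced in this text, so the sketch substitutes a REINFORCE-style loss of the kind commonly used with SeqGAN-like models; that choice, the use of the empty-prefix reward as a baseline, and every name (`rollout_rewards`, `generator_loss`, `pad_with_zeros`, `mean_critic`) are assumptions of this sketch. The completion function and the classifier are deterministic stand-ins so that the example is reproducible.

```python
import math
import random
from typing import Callable, List

Seq = List[int]


def rollout_rewards(seq: Seq,
                    complete: Callable[[Seq, int, random.Random], Seq],
                    critic: Callable[[Seq], float],
                    n_rollouts: int,
                    rng: random.Random) -> List[float]:
    """Mean critic score of N completions for every prefix length 0..l."""
    l = len(seq)
    rewards = []
    for t in range(l + 1):
        scores = [critic(complete(seq[:t], l, rng)) for _ in range(n_rollouts)]
        rewards.append(sum(scores) / n_rollouts)
    return rewards


def generator_loss(step_probs: List[float], rewards: List[float]) -> float:
    """Assumed REINFORCE-style loss; rewards[0] (empty prefix) acts as a baseline."""
    l = len(step_probs)
    baseline = rewards[0]
    total = sum(math.log(p) * (rewards[t + 1] - baseline)
                for t, p in enumerate(step_probs))
    return -total / l


# Deterministic stand-ins (both hypothetical).
calls: List[Seq] = []


def pad_with_zeros(prefix: Seq, length: int, rng: random.Random) -> Seq:
    return prefix + [0] * (length - len(prefix))


def mean_critic(seq: Seq) -> float:
    calls.append(seq)
    return sum(seq) / len(seq)


generated = [1, 1, 1, 1]
rewards = rollout_rewards(generated, pad_with_zeros, mean_critic,
                          n_rollouts=3, rng=random.Random(0))
loss = generator_loss([0.5, 0.5, 0.5, 0.5], rewards)
```

With l = 4 and N = 3, the classifier is called N(l+1) = 15 times, matching the count in step S2. Because rewards are attached to prefixes, errors early in a sequence are penalized directly rather than only through the score of the completed sequence.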
[0108] Through the above process, the trained generator can learn, for a specific category label, the conditional distribution of the current attribute value of the network traffic header trajectory given the existing sequence value prefix and the attribute values already generated. This not only ensures that the synthesized network traffic header trajectory conforms to the joint distribution of the different packet-granular attribute values within the same data packet, but also allows the generator to use the existing sequence value prefix as a state to simulate the attribute value changes that occur when the state of the network traffic header trajectory changes within the flow.
[0109] Optionally, in one embodiment of this application, the expression for calculating the loss value of the discriminator includes:
[0110]
[0111] where $L_D$ represents the final loss value of the discriminator, $L_{\text{real}}$ represents the loss value of the discriminator's judgment result on the real data, $L_{\text{gen}}$ represents the loss value of the discriminator's judgment result on the generated sequence, $GP$ represents the gradient penalty term, $\lambda$ represents its weight, $x$ represents the real data, $\tilde{x}$ represents the generated sequence, $\mathbb{E}$ represents the expected value, $P_x$ represents the distribution of the real data, $D(x)$ represents the discriminator's judgment result on the real data, $P_G$ represents the distribution of the generated sequences, and $D(\tilde{x})$ represents the discriminator's judgment result on the generated sequence.
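The expression itself is not reproduced in this text. For orientation, the standard Wasserstein discriminator loss with gradient penalty, which is consistent with the terms defined above, can be written as follows. The symbol names $L_D$, $L_{\text{real}}$, $L_{\text{gen}}$, and $GP$ and the sign convention are assumptions of this sketch and may differ from the application's own formula.

```latex
% Assumed standard WGAN-GP form (discriminator minimizes L_D)
L_{\text{real}} = -\,\mathbb{E}_{x \sim P_x}\bigl[D(x)\bigr],
\qquad
L_{\text{gen}} = \mathbb{E}_{\tilde{x} \sim P_G}\bigl[D(\tilde{x})\bigr]

L_D = L_{\text{real}} + L_{\text{gen}} + \lambda \cdot GP
```

Minimizing $L_D$ pushes the discriminator's score up on real data and down on generated sequences, while the penalty term keeps its gradients bounded.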
[0112] This application implements a discriminator for binned network traffic packet header trajectories based on Wasserstein loss by combining it with a pre-trained word vector model. Wasserstein generative adversarial networks (WGANs), which are based on the Wasserstein loss, are designed to address the vanishing gradient and mode collapse problems that may occur when training generative adversarial network (GAN) models. For network traffic datasets with uneven class distributions, vanishing gradients or mode collapse can cause the model to synthesize data that does not match the real data for classes with few samples, or to generate repetitive data patterns. Training with the Wasserstein loss can significantly improve the stability of model training.
[0113] Using the Wasserstein loss requires the discriminator to satisfy the K-Lipschitz condition. This is generally achieved using gradient penalty, which involves adding a gradient penalty term to the discriminator's loss value according to the following formula.
[0114]
[0115] where $\hat{x}$ is a random interpolation between the real data $x$ and the synthetic data $\tilde{x}$. However, interpolation is only meaningful for numerical features with metric significance and has no meaning for the categorical features produced by binning. The embodiments of this application resolve this with a pre-trained word vector model, which maps categorical feature sequences one-to-one to measurable numerical vectors. The closer the properties of two sequence values are, the closer their corresponding word vectors are in space. This gives the generated word vector sequences metric significance, so they can be used directly for interpolation.
[0116] Figure 6 shows the architecture of the discriminator in this embodiment and the calculation process of the Wasserstein loss during training. Before formal training, this embodiment trains the word vector model using binned network traffic sequences. During training, the binned real data and the generated sequences synthesized by the generator are input into the word vector model, which maps each feature value sequence to a word vector sequence. Subsequently, three types of word vector sequences are input into the discriminator: the real vector sequence, the generated vector sequence, and the interpolated sequence obtained by random interpolation of the two (used to calculate the gradient penalty term). The word vector sequences are passed through the sequence feature extractor, and the extracted features, together with the processed sequence length, are used by the input layer corresponding to the sequence category for discrimination. After receiving the discrimination results, the training controller calculates the Wasserstein loss values of the discriminator for the discrimination results on the real data and on the generated sequence, together with the corresponding gradient penalty term and its weight λ. The final loss value of the discriminator is calculated as follows:
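The penalty expression is not reproduced in this text. The commonly used gradient penalty for the Wasserstein loss, stated here as an assumption rather than as the application's own formula, interpolates between a real and a generated sample and penalizes deviation of the discriminator's gradient norm from 1. In this setting $x$ and $\tilde{x}$ would be word vector sequences; $\hat{x}$ and $\epsilon$ are notation introduced for this sketch.

```latex
% Assumed standard interpolation and gradient penalty (WGAN-GP)
\hat{x} = \epsilon\, x + (1 - \epsilon)\, \tilde{x},
\qquad \epsilon \sim U[0, 1]

GP = \mathbb{E}_{\hat{x}}\Bigl[\bigl(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\bigr)^2\Bigr]
```

The interpolation step is the reason a metric representation is needed: a convex combination of two bin indices has no meaning, whereas a convex combination of two word vectors is again a point in the embedding space.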
[0117]
[0118] where $\mathbb{E}_{x \sim P_x}[\cdot]$ represents the expected value of a function of the random variable $x$ under the distribution $P_x$, $\mathbb{E}_{\tilde{x} \sim P_G}[\cdot]$ represents the expected value of a function of the random variable $\tilde{x}$ under the distribution $P_G$, $x$ is the real data, and $\tilde{x}$ is the generated sequence, i.e., the synthesized data.
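The role of the word vector model in making interpolation possible can be illustrated with a toy lookup table. The two-dimensional vectors, the table contents, and the function names `embed` and `interpolate` are assumptions for this sketch; a real embodiment would use vectors from the pre-trained word vector model, and the two sequences are assumed here to have equal length.

```python
from typing import Dict, List

Vector = List[float]


def embed(tokens: List[int], table: Dict[int, Vector]) -> List[Vector]:
    """Map each binned (categorical) token to its word vector."""
    return [table[t] for t in tokens]


def interpolate(real: List[Vector], fake: List[Vector], eps: float) -> List[Vector]:
    """Element-wise eps * real + (1 - eps) * fake over two equal-length sequences."""
    return [[eps * r + (1.0 - eps) * f for r, f in zip(rv, fv)]
            for rv, fv in zip(real, fake)]


# Hypothetical pre-trained vectors for three bin tokens.
table = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [1.0, 2.0]}
real_vecs = embed([0, 1, 2], table)
fake_vecs = embed([2, 1, 0], table)
mixed = interpolate(real_vecs, fake_vecs, eps=0.25)
```

The interpolated sequence `mixed` is the third input to the discriminator described above and is used only to evaluate the gradient penalty term.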
[0119] In summary, this embodiment achieves flexible conversion between categorical and numerical features through a pre-trained word vector model, avoiding the interpolation problem when calculating the Wasserstein loss value for categorical features. By introducing the Wasserstein loss value, the discriminator can significantly improve the training stability of this embodiment on network traffic header trajectories with multiple labels and uneven distribution, ensuring the authenticity and diversity of the synthesized network traffic header trajectories. By combining the pre-trained word vector model, this embodiment can realize a discriminator and training control based on the Wasserstein loss value, greatly improving the stability of model training.
[0120] In step S203, the final network traffic header trajectory synthesis model is used to obtain the network traffic header trajectory that satisfies the preset conditions of real trajectory feature packet granularity and flow granularity feature distribution.
[0121] Using the trained final network traffic packet header trajectory synthesis model, the embodiments of this application can synthesize network traffic packet header trajectories that conform to the distribution of real trajectory feature packet granularity and flow granularity features.
[0122] The network traffic header trajectory synthesis method proposed in this application can train a pre-constructed initial network traffic header trajectory synthesis model using real network traffic data packets and real traffic to obtain a final network traffic header trajectory synthesis model. This final model then yields network traffic header trajectories that satisfy preset real trajectory feature packet granularity and flow granularity feature distribution conditions. Through machine learning-driven network traffic header trajectory synthesis, features are automatically learned from existing real network traffic. While retaining real features, variations are introduced to synthesize new traffic, saving the cost of manual feature extraction and enabling better learning of complex feature distributions. This improves the consistency between the synthesized network traffic header trajectory with category labels and the real trajectory feature distribution. Therefore, this solves the technical problems in related technologies: rule-based and expert knowledge-driven network traffic header trajectory synthesis suffers from high labor costs and reliance on professional knowledge, thus limiting its simulation capabilities in complex network environments; and network model-driven network traffic header trajectory synthesis relies on the model's adaptability to network traffic, resulting in poor flexibility.
[0123] Next, the network traffic packet header trajectory synthesis apparatus proposed according to the embodiments of this application is described with reference to the accompanying drawings.
[0124] Figure 7 is a block diagram of a network traffic packet header trajectory synthesis device according to an embodiment of this application.
[0125] As shown in Figure 7, the network traffic packet header trajectory synthesis device 10 includes an acquisition module 100, a training module 200, and a synthesis module 300.
[0126] Specifically, the acquisition module 100 is used to acquire real network traffic data packets and real traffic.
[0127] The training module 200 is used to train a pre-built initial network traffic header trajectory synthesis model using real network traffic data packets and real traffic to obtain the final network traffic header trajectory synthesis model. The initial network traffic header trajectory synthesis model consists of a preprocessor, a generator, a discriminator, and a training controller.
[0128] The synthesis module 300 is used to obtain, using the final network traffic packet header trajectory synthesis model, a network traffic packet header trajectory that satisfies preset conditions on the packet-granularity and flow-granularity feature distributions of real trajectories.
[0129] Optionally, in one embodiment of this application, the training module 200 includes: a preprocessing unit, a generation unit, a discrimination unit, and an optimization unit.
[0130] The preprocessing unit is used to reversibly bin the sequence of real network traffic data packets using the preprocessor to obtain binning sequences that meet preset conditions, and to unify the attribute values of the binning sequences into category features.
[0131] The generation unit is used to obtain the length of the traffic from the real traffic. It takes the preset category label and length as input to the generator to obtain the generated sequence.
[0132] The discrimination unit is used to process the binning sequence and the generated sequence with word vectors and then input them into the discriminator to obtain the discrimination result.
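The word-vector step of the discrimination unit can be illustrated with a minimal sketch. The token names, the two-dimensional vectors, and the nearest-neighbour decoding below are assumptions made for illustration only; the source states merely that a pre-trained word vector model converts between categorical and numerical features.

```python
import math

# Hypothetical pre-trained word vectors for bin tokens (toy 2-D values).
WORD_VECTORS = {
    "len_bin_0": (0.9, 0.1),
    "len_bin_1": (0.1, 0.8),
    "proto_tcp": (0.7, 0.7),
}


def embed(tokens):
    """Turn a sequence of category tokens into numeric vectors for the discriminator."""
    return [WORD_VECTORS[t] for t in tokens]


def nearest_token(vector):
    """Map a numeric vector back to the closest category token (assumed decoding rule)."""
    return min(WORD_VECTORS, key=lambda t: math.dist(WORD_VECTORS[t], vector))


def decode(vectors):
    """Convert a sequence of vectors back into category tokens."""
    return [nearest_token(v) for v in vectors]
```

Because every token lives in a continuous vector space, points lying between a real and a generated embedding are well defined, which is what a gradient-penalty style loss needs.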
[0133] The optimization unit is used to receive the discrimination results from the training controller and calculate the corresponding loss value. The loss value is then used to guide the generator and discriminator to perform optimization in order to obtain the final network traffic packet header trajectory synthesis model.
[0134] Optionally, in one embodiment of this application, the optimization unit includes a sampling subunit and an optimization subunit.
[0135] The sampling subunit is used to sample different length prefixes of the generated sequence using the training controller to obtain the sampled generated sequence.
[0136] The optimization subunit is used to input the sampled generated sequence into the discriminator to obtain an optimized discrimination result, and to update the loss value using the optimized discrimination result until the loss value meets the preset iteration termination condition, so as to obtain the final network traffic packet header trajectory synthesis model.
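The prefix sampling performed by the sampling subunit might look like the following sketch. The function names, the choice of distinct random prefix lengths, and the `discriminator` callable are hypothetical; the source does not specify how many prefixes are drawn or how their lengths are chosen.

```python
import random


def sample_prefixes(sequence, num_prefixes, rng):
    """Return prefixes of distinct random lengths, shortest first."""
    k = min(num_prefixes, len(sequence))
    lengths = sorted(rng.sample(range(1, len(sequence) + 1), k))
    return [sequence[:n] for n in lengths]


def prefix_scores(sequence, discriminator, num_prefixes, rng):
    """Score each sampled prefix with the discriminator (any callable here)."""
    return [discriminator(p) for p in sample_prefixes(sequence, num_prefixes, rng)]
```

Scoring prefixes rather than only complete sequences gives the generator feedback on partially generated trajectories, which is one plausible reading of why the training controller samples them.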
[0137] Optionally, in one embodiment of this application, the preprocessing unit includes: a first acquisition subunit, a calculation subunit, and a processing subunit.
[0138] The first acquisition subunit is used to acquire attribute values of real network traffic data packets with category labels under multiple categories from the sequence of real network traffic data packets.
[0139] The calculation subunit is used to calculate the chi-square value of the attribute values of real network traffic data packets.
[0140] The processing subunit is used to bin the range of attribute values of real network traffic data packets using the chi-square value, so as to unify the attribute values of real network traffic data packets into category features.
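The chi-square binning carried out by these subunits can be sketched as follows, assuming a ChiMerge-style procedure in which the adjacent intervals with the lowest chi-square value are merged until a target bin count is reached. The merging rule, the `max_bins` stopping condition, and the use of the lower bin edge as the representative value for the reverse mapping are assumptions; the source only states that the chi-square value is used to bin the attribute value range reversibly.

```python
from bisect import bisect_right
from collections import Counter


def chi2_adjacent(a, b, classes):
    """Chi-square statistic of the class counts in two adjacent intervals."""
    total = sum(a.values()) + sum(b.values())
    chi2 = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for c in classes:
            expected = row_total * (a[c] + b[c]) / total
            if expected > 0:
                chi2 += (row[c] - expected) ** 2 / expected
    return chi2


def chimerge(values, labels, max_bins):
    """Return the sorted lower edges of at most max_bins bins."""
    classes = sorted(set(labels))
    counts = {}
    for v, lab in zip(values, labels):
        counts.setdefault(v, Counter())[lab] += 1
    edges = sorted(counts)                     # start with one interval per distinct value
    tables = [counts[e] for e in edges]
    while len(edges) > max_bins:
        scores = [chi2_adjacent(tables[i], tables[i + 1], classes)
                  for i in range(len(tables) - 1)]
        i = scores.index(min(scores))          # most similar adjacent pair
        tables[i] = tables[i] + tables[i + 1]  # merge the pair into one interval
        del tables[i + 1], edges[i + 1]
    return edges


def to_bin(value, edges):
    """Map a numeric attribute value to its bin index (the category feature)."""
    return max(bisect_right(edges, value) - 1, 0)


def from_bin(index, edges):
    """Map a bin index back to a representative value (assumed: the lower edge)."""
    return edges[index]
```

For example, packet lengths labeled with two traffic categories would collapse into two bins whose boundary falls where the label mix changes, and the stored `edges` allow a bin index to be turned back into a value at bin resolution.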
[0141] Optionally, in one embodiment of this application, the generation unit includes: a second acquisition subunit, a first embedding subunit, and a second embedding subunit.
[0142] The second acquisition subunit is used to acquire the previous complete sequence value, wherein the complete sequence value is a sequence value with all target attribute values embedded, and the sequence value is a vector composed of target attribute values extracted from any data table in the real network traffic data packet.
[0143] The first embedding subunit is used to combine the previous complete sequence value, the preset category label and length, and use the sequence feature probability to generate a path to embed any target attribute value into the current sequence value.
[0144] The second embedding subunit is used to generate a bypass using conditional probability features, based on the target attribute values already present in the current sequence value, and to embed all target attribute values that have not yet been embedded into the current sequence value until the generated sequence is obtained.
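The two-stage embedding described for the generation unit can be outlined in a short sketch. Everything named here is hypothetical: `path` stands in for the component that uses the previous complete sequence value, the category label, and the length to produce one attribute value, and `bypass` stands in for the component that fills each remaining attribute conditioned on those already present. Choosing the first listed attribute for the initial step is also an assumption.

```python
def generate_sequence(label, length, attributes, path, bypass):
    """Build a generated sequence of `length` sequence values.

    Each sequence value is a dict holding one value per target attribute.
    """
    sequence = []
    previous = None                      # no complete sequence value exists yet
    for _ in range(length):
        first = attributes[0]
        # Main path: one attribute from the previous complete value, label, and length.
        current = {first: path(previous, label, length)}
        # Bypass: remaining attributes, each conditioned on what is already embedded.
        for name in attributes[1:]:
            current[name] = bypass(name, current)
        sequence.append(current)
        previous = current               # now a complete sequence value
    return sequence
```

In a trained model both callables would be learned networks that sample from probability distributions; plain functions are used here only so the control flow is easy to follow.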
[0145] Optionally, in one embodiment of this application, the expression for calculating the loss value of the discriminator includes:
[0146] $$L_D = L_{real} + L_{fake} + \lambda \cdot GP, \qquad L_{real} = -\mathbb{E}_{x \sim P_x}\left[D(x)\right], \qquad L_{fake} = \mathbb{E}_{\tilde{x} \sim P_G}\left[D(\tilde{x})\right]$$
[0147] where $L_D$ represents the final loss value of the discriminator; $L_{real}$ represents the loss value of the discriminator's judgment result on the real data; $L_{fake}$ represents the loss value of the discriminator's judgment result on the generated sequence; $GP$ represents the gradient penalty term; $\lambda$ represents the weight; $x$ represents the real data; $\tilde{x}$ represents the generated sequence; $\mathbb{E}$ represents the expected value; $P_x$ represents the distribution of the real data; $D(x)$ represents the discriminator's judgment result on the real data; $P_G$ represents the distribution of the generated sequences; and $D(\tilde{x})$ represents the discriminator's judgment result on the generated sequence.
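The discriminator loss described above combines a term for real data, a term for generated sequences, and a weighted gradient penalty. A minimal sketch follows, assuming the widely used WGAN-GP sign convention and penalty form, with the penalty evaluated at random interpolations between real and generated embeddings. The toy `tanh` critic, its weights `W` and `v`, and all function names are illustrative assumptions and not the discriminator of this application.

```python
import math
import random


def critic(x, W, v):
    """Toy critic D(x) = sum_j v[j] * tanh(W[j] . x)."""
    return sum(vj * math.tanh(sum(w * xi for w, xi in zip(Wj, x)))
               for Wj, vj in zip(W, v))


def critic_grad(x, W, v):
    """Analytic gradient of the toy critic with respect to its input x."""
    grad = [0.0] * len(x)
    for Wj, vj in zip(W, v):
        t = math.tanh(sum(w * xi for w, xi in zip(Wj, x)))
        for i, w in enumerate(Wj):
            grad[i] += vj * (1.0 - t * t) * w
    return grad


def discriminator_loss(real, fake, W, v, lam, rng):
    """Real-data term + generated-data term + lam * gradient penalty."""
    loss_real = -sum(critic(x, W, v) for x in real) / len(real)
    loss_fake = sum(critic(x, W, v) for x in fake) / len(fake)
    penalty = 0.0
    for x, x_tilde in zip(real, fake):
        eps = rng.random()
        # Interpolate in embedding space, where intermediate points are meaningful.
        x_hat = [eps * a + (1.0 - eps) * b for a, b in zip(x, x_tilde)]
        norm = math.sqrt(sum(g * g for g in critic_grad(x_hat, W, v)))
        penalty += (norm - 1.0) ** 2
    penalty /= min(len(real), len(fake))
    return loss_real + loss_fake + lam * penalty
```

A real implementation would normally rely on an automatic differentiation library for the gradient; the analytic gradient is used here only to keep the sketch dependency-free.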
[0148] It should be noted that the foregoing explanation of the network traffic header trajectory synthesis method embodiment also applies to the network traffic header trajectory synthesis device of this embodiment, and will not be repeated here.
[0149] The network traffic packet header trajectory synthesis device proposed in this application trains a pre-constructed initial network traffic packet header trajectory synthesis model using real network traffic data packets and real traffic to obtain a final network traffic packet header trajectory synthesis model, and uses the final model to obtain network traffic packet header trajectories that satisfy preset conditions on the packet-granularity and flow-granularity feature distributions of real trajectories. Because the synthesis is driven by machine learning, features are learned automatically from existing real network traffic, and new traffic is synthesized by introducing variations while retaining the real features. This saves the cost of manual feature extraction, allows complex feature distributions to be learned better, and improves the consistency between the feature distribution of the synthesized, category-labeled network traffic packet header trajectories and that of the real trajectories. This addresses the technical problems in the related art: rule-based, expert-knowledge-driven synthesis incurs high labor costs and depends on professional knowledge, which limits its ability to simulate complex network environments, while network-model-driven synthesis depends on how well the model adapts to the network traffic, which results in poor flexibility.
[0150] Figure 8 is a schematic diagram of the structure of a server provided in an embodiment of this application. The server may include:
[0151] a memory 801, a processor 802, and a computer program stored in the memory 801 and executable on the processor 802.
[0152] When the processor 802 executes the program, it implements the network traffic packet header trajectory synthesis method provided in the above embodiments.
[0153] Furthermore, the server also includes:
[0154] a communication interface 803, used for communication between the memory 801 and the processor 802.
[0155] The memory 801 is used to store computer programs that can run on the processor 802.
[0156] The memory 801 may include high-speed RAM memory, and may also include non-volatile memory, such as at least one disk storage device.
[0157] If the memory 801, the processor 802, and the communication interface 803 are implemented independently, they can be interconnected via a bus to communicate with one another. The bus can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like, and buses can be divided into address buses, data buses, control buses, and so on. For ease of representation, the bus is shown in Figure 8 as a single thick line, but this does not mean that there is only one bus or one type of bus.
[0158] Optionally, in a specific implementation, if the memory 801, processor 802, and communication interface 803 are integrated on a single chip, then the memory 801, processor 802, and communication interface 803 can communicate with each other through an internal interface.
[0159] The processor 802 may be a central processing unit (CPU), an application specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of this application.
[0160] This embodiment also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described method for synthesizing network traffic packet header trajectories.
[0161] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the network traffic packet header trajectory synthesis method provided in this application.
[0162] In the description of this specification, the references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of this application. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.
[0163] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this application, "N" means at least two, such as two, three, etc., unless otherwise explicitly specified.
[0164] Any process or method described in the flowchart or otherwise herein can be understood as representing a module, segment, or portion of code comprising one or N executable instructions for implementing custom logic functions or processes, and the scope of the preferred embodiments of this application includes additional implementations in which functions may be performed not in the order shown or discussed, including substantially simultaneously or in reverse order depending on the functions involved, as should be understood by those skilled in the art to which embodiments of this application pertain.
[0165] The logic and/or steps represented in the flowchart or otherwise described herein can, for example, be considered a sequenced list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit programs for use by, or in conjunction with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection having one or more wires (electronic device), a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CD-ROM). Alternatively, the computer-readable medium may be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically by optically scanning the paper or other medium, followed by editing, interpreting, or otherwise processing as necessary, and then stored in a computer memory.
[0166] It should be understood that the various parts of this application can be implemented using hardware, software, firmware, or a combination thereof. In the above embodiments, the N steps or methods can be implemented using software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware as in another embodiment, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.
[0167] Those skilled in the art will understand that all or part of the steps of the methods in the above embodiments can be implemented by a program instructing related hardware. The program can be stored in a computer-readable storage medium, and when executed, the program performs one or a combination of the steps of the method embodiments.
[0168] Furthermore, the functional units in the various embodiments of this application can be integrated into a processing module, or each unit can exist physically separately, or two or more units can be integrated into a module. The integrated module can be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium.
[0169] The storage medium mentioned above can be a read-only memory, a disk, or an optical disk, etc. Although embodiments of this application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting this application. Those skilled in the art can make changes, modifications, substitutions, and variations to the above embodiments within the scope of this application.
Claims
1. A method for synthesizing network traffic packet header trajectories, characterized in that the method is applied to a server and includes the following steps: obtaining real network traffic data packets and real traffic; training a pre-constructed initial network traffic packet header trajectory synthesis model using the real network traffic data packets and the real traffic to obtain a final network traffic packet header trajectory synthesis model, wherein the initial network traffic packet header trajectory synthesis model consists of a preprocessor, a generator, a discriminator, and a training controller; and using the final network traffic packet header trajectory synthesis model to obtain a network traffic packet header trajectory that satisfies preset conditions on the packet-granularity and flow-granularity feature distributions of real trajectories; wherein the step of training the pre-constructed initial network traffic packet header trajectory synthesis model using the real network traffic data packets and the real traffic to obtain the final network traffic packet header trajectory synthesis model includes: using the preprocessor to reversibly bin the sequence of the real network traffic data packets to obtain binning sequences that meet preset conditions, and unifying the attribute values of the binning sequences into category features; obtaining the length of the traffic from the real traffic, and using preset category labels and the length as inputs to the generator to obtain a generated sequence; processing the binning sequence and the generated sequence with word vectors, and inputting them into the discriminator to obtain a discrimination result; and using the training controller to receive the discrimination result and calculate the corresponding loss value, so as to use the loss value to guide the optimization of the generator and the discriminator, thereby obtaining the final network traffic packet header trajectory synthesis model.
2. The method according to claim 1, characterized in that the step of using the loss value to guide the optimization of the generator and the discriminator to obtain the final network traffic packet header trajectory synthesis model includes: using the training controller to sample prefixes of different lengths of the generated sequence to obtain a sampled generated sequence; and inputting the sampled generated sequence into the discriminator to obtain an optimized discrimination result, and using the optimized discrimination result to update the loss value until the loss value meets a preset iteration termination condition, thereby obtaining the final network traffic packet header trajectory synthesis model.
3. The method according to claim 1, characterized in that the step of using the preprocessor to reversibly bin the sequence of the real network traffic data packets to obtain binning sequences that meet preset conditions, and unifying the attribute values of the binning sequences into category features, includes: obtaining, from the sequence of the real network traffic data packets, attribute values of the real network traffic data packets with category labels under multiple categories; calculating the chi-square value of the attribute values of the real network traffic data packets; and using the chi-square value to bin the range of the attribute values of the real network traffic data packets, so as to unify the attribute values of the real network traffic data packets into the category features.
4. The method according to claim 1, characterized in that the step of obtaining the length of the traffic from the real traffic, and using the preset category labels and the length as inputs to the generator to obtain the generated sequence, includes: obtaining the previous complete sequence value, wherein a complete sequence value is a sequence value in which all target attribute values have been embedded, and a sequence value is a vector composed of target attribute values extracted from any data table in the real network traffic data packets; combining the previous complete sequence value, the preset category label, and the length, generating a path using the sequence feature probability, and embedding any target attribute value into the current sequence value; and, based on the target attribute values already present in the current sequence value, generating a bypass using conditional probability features and embedding all target attribute values that have not yet been embedded into the current sequence value until the generated sequence is obtained.
5. The method according to claim 1, characterized in that the expression for calculating the loss value of the discriminator includes: $$L_D = L_{real} + L_{fake} + \lambda \cdot GP, \qquad L_{real} = -\mathbb{E}_{x \sim P_x}\left[D(x)\right], \qquad L_{fake} = \mathbb{E}_{\tilde{x} \sim P_G}\left[D(\tilde{x})\right]$$ where $L_D$ represents the final loss value of the discriminator; $L_{real}$ represents the loss value of the discriminator's judgment result on the real data; $L_{fake}$ represents the loss value of the discriminator's judgment result on the generated sequence; $GP$ represents the gradient penalty term; $\lambda$ represents the weight; $x$ represents the real data; $\tilde{x}$ represents the generated sequence; $\mathbb{E}$ represents the expected value; $P_x$ represents the distribution of the real data; $D(x)$ represents the discriminator's judgment result on the real data; $P_G$ represents the distribution of the generated sequences; and $D(\tilde{x})$ represents the discriminator's judgment result on the generated sequence.
6. A network traffic packet header trajectory synthesis device, characterized in that the device is applied to a server and includes: an acquisition module, used to acquire real network traffic data packets and real traffic; a training module, used to train a pre-constructed initial network traffic packet header trajectory synthesis model using the real network traffic data packets and the real traffic to obtain a final network traffic packet header trajectory synthesis model, wherein the initial network traffic packet header trajectory synthesis model consists of a preprocessor, a generator, a discriminator, and a training controller; and a synthesis module, used to obtain, using the final network traffic packet header trajectory synthesis model, a network traffic packet header trajectory that satisfies preset conditions on the packet-granularity and flow-granularity feature distributions of real trajectories; wherein the training module includes: a preprocessing unit, used to reversibly bin the sequence of the real network traffic data packets using the preprocessor to obtain binning sequences that meet preset conditions, and to unify the attribute values of the binning sequences into category features; a generation unit, used to obtain the length of the traffic from the real traffic, and to use preset category labels and the length as inputs to the generator to obtain a generated sequence; a discrimination unit, used to process the binning sequence and the generated sequence with word vectors and then input them into the discriminator to obtain a discrimination result; and an optimization unit, used to receive the discrimination result using the training controller and calculate the corresponding loss value, so as to use the loss value to guide the optimization of the generator and the discriminator, thereby obtaining the final network traffic packet header trajectory synthesis model.
7. A server, characterized in that it includes: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the network traffic packet header trajectory synthesis method as described in any one of claims 1-5.
8. A computer-readable storage medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the network traffic packet header trajectory synthesis method as described in any one of claims 1-5.
9. A computer program product, comprising a computer program, characterized in that the computer program, when executed, implements the network traffic packet header trajectory synthesis method as described in any one of claims 1-5.