Method and apparatus for generating stream metadata
By extracting and recombining the feature and load data of the information flow data from the multi-mode information flow template, the problem of inaccurate network flow data transmission in the existing technology is solved, and economical and efficient network flow data generation is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING UNIV OF POSTS & TELECOMM
- Filing Date
- 2023-02-24
- Publication Date
- 2026-06-16
Figure CN116366291B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of network security, and in particular to a method and apparatus for generating information flow metadata.
Background Technology
[0002] Maintaining cybersecurity requires the development and continuous maintenance of anomaly and intrusion detection tools. Many such cybersecurity tools use machine learning (ML) algorithms to learn and differentiate between various forms of traffic, including normal and anomalous traffic. For example, supervised ML classifiers rely on acquiring examples of different traffic types (including malicious traffic), each appropriately labeled so that the ML classifier can learn how to identify these traffic flows. Therefore, the design and training of such intrusion detection systems often require a large amount of labeled traffic data. In addition, testing the network performance or security capabilities of certain hardware devices also requires substantial amounts of network flow data. A challenge in cybersecurity R&D is generating large amounts of realistic, labeled traffic data using cost-effective methods.
[0003] Existing methods for generating network streams mostly target isolated network stream characteristics; that is, these methods do not directly generate network stream data. Furthermore, existing methods for generating network stream metadata can easily lead to the generated network stream data deviating from the fixed fields of the network protocol, thus failing to achieve true transmission.
Summary of the Invention
[0004] In view of this, embodiments of this application provide a method and apparatus for generating information flow metadata to eliminate or improve one or more defects existing in the prior art.
[0005] One aspect of this application provides a method for generating information flow metadata, the method comprising:
[0006] Obtain the current target information flow template selected by the user from the preset multi-mode information flow templates;
[0007] Based on the target information flow template, the feature data and load data corresponding to the original information flow data are randomly recombined to generate corresponding information flow metadata. The feature data and load data are extracted from the original information flow data in advance, and the load data is data obtained after cutting the protocol fields of the original information flow data.
[0008] In some embodiments of this application, before obtaining the current target information flow template selected by the user from preset multi-mode information flow templates, the method further includes:
[0009] The feature data is extracted from the original information flow data based on the key feature extraction algorithm, and the load data is extracted from the original information flow data through the protocol fixed-length separation algorithm.
[0010] In some embodiments of this application, before obtaining the current target information flow template selected by the user from preset multi-mode information flow templates, the method further includes:
[0011] The feature data is stored in the feature database, and the load data is stored in the load database.
[0012] In some embodiments of this application, the step of randomly recombining the feature data and load data corresponding to the original information flow data based on the target information flow template to generate corresponding information flow metadata includes:
[0013] Random sampling is performed on the feature database and the load database to obtain sampled feature data corresponding to the feature database and sampled load data corresponding to the load database. The network parameters of the target link are configured through the network configurator, the link is started based on the network parameters, and the information flow metadata is generated according to the sampled feature data and the sampled load data.
[0014] In some embodiments of this application, after randomly recombining the feature data and load data corresponding to the original information flow data based on the target information flow template to generate the corresponding information flow metadata, the method further includes:
[0015] The aforementioned information flow metadata is replayed for verification.
[0016] In some embodiments of this application, the step of extracting the load data from the original information flow data using a protocol fixed-length separation algorithm includes:
[0017] Cut the header of the packet capture file corresponding to the original information flow data, read the data length of the payload data from the packet header of the packet capture file, and cut the Ethernet header, IP header and TCP header of the data content part of the packet capture file according to the data length to obtain the payload data.
[0018] In some embodiments of this application, storing the load data in a load database includes:
[0019] The load data is converted from a bitstream to a hexadecimal value and stored in the load database.
[0020] Another aspect of this application provides an apparatus for generating information flow metadata, the apparatus comprising:
[0021] The target information flow template selection module is used to obtain the current target information flow template selected by the user from the preset multi-mode information flow templates;
[0022] The information flow metadata generation module is used to randomly reorganize each feature data and each load data corresponding to the original information flow data based on the target information flow template to generate corresponding information flow metadata. The feature data and load data are extracted from the original information flow data in advance, and the load data is data obtained after cutting the protocol fields of the original information flow data.
[0023] A third aspect of this application provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the information flow metadata generation method described in the first aspect above.
[0024] A fourth aspect of this application provides a computer-readable storage medium having a computer program stored thereon that, when executed by a processor, implements the information flow metadata generation method described in the first aspect above.
[0025] This application provides a method and apparatus for generating network flow metadata. The method includes: obtaining a current target network flow template selected by the user from preset multi-mode network flow templates; and, based on the target network flow template, randomly recombining the feature data and load data corresponding to the original network flow data to generate corresponding network flow metadata. The feature data and load data are extracted beforehand from the original network flow data, and the load data is data obtained after cutting off the protocol fields of the original network flow data. This application can directly generate network flow data that can actually be transmitted over a network, improving transmissibility.
[0026] Additional advantages, objectives, and features of this application will be set forth in part in the description which follows, and will in part become apparent to those skilled in the art upon review of the following description, or may be learned by practice of the application. The objectives and other advantages of this application can be realized and obtained by means of the structures specifically pointed out in the specification and drawings.
[0027] Those skilled in the art will understand that the purposes and advantages that can be achieved with this application are not limited to those specifically described above, and that the above and other purposes that this application can achieve will be more clearly understood from the following detailed description.
Attached Figure Description
[0028] The accompanying drawings, which are included to provide a further understanding of this application and form part of this application, do not constitute a limitation thereof. The components in the drawings are not drawn to scale but are merely for illustrating the principles of this application. For ease of illustration and description of certain parts of this application, corresponding portions in the drawings may be enlarged, i.e., may appear larger relative to other components in an exemplary device actually manufactured according to this application. In the drawings:
[0029] Figure 1 is a flowchart illustrating the information flow metadata generation method in one embodiment of this application.
[0030] Figure 2 is a schematic diagram of the structure of an information flow metadata generation device in another embodiment of this application.
[0031] Figure 3 is a schematic diagram illustrating the basic principles and process steps of this application.
Detailed Implementation
[0032] To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the embodiments and accompanying drawings. Here, the illustrative embodiments and their descriptions are used to explain this application, but are not intended to limit it.
[0033] It should also be noted that, in order to avoid obscuring this application with unnecessary details, only the structures and / or processing steps closely related to the solution according to this application are shown in the accompanying drawings, while other details that are not closely related to this application are omitted.
[0034] It should be emphasized that the term "including/comprising" as used herein refers to the presence of a feature, element, step, or component, but does not exclude the presence or addition of one or more other features, elements, steps, or components.
[0035] It should also be noted that, unless otherwise specified, the term "connection" as used herein can refer not only to a direct connection, but also to an indirect connection involving an intermediary.
[0036] In the following description, embodiments of the present application will be illustrated with reference to the accompanying drawings. In the drawings, the same reference numerals represent the same or similar parts, or the same or similar steps.
[0037] The following examples will provide a detailed description.
[0038] This application provides a method for generating information flow metadata that can be executed by an information flow metadata generation device. Referring to Figure 1, the method specifically includes the following:
[0039] Step 110: Obtain the current target information flow template selected by the user from the preset multi-mode information flow templates.
[0040] Step 120: Based on the target information flow template, randomly reorganize each feature data and each load data corresponding to the original information flow data to generate corresponding information flow metadata. The feature data and load data are extracted from the original information flow data in advance, and the load data is data obtained after cutting the protocol fields of the original information flow data.
[0041] Specifically, the server first uses the improved WGAN algorithm to obtain, from the multi-mode information flow templates, the target information flow template selected by the user according to the network protocol. Then, based on the target information flow template, it randomly recombines the feature data and load data corresponding to the original information flow data, thereby directly generating the corresponding information flow metadata. Based on this information flow metadata, network intrusion detection and testing of the network performance or security performance of hardware devices can be performed.
[0042] The improved WGAN (Wasserstein Generative Adversarial Network) algorithm may be WGAN-LP, WGAN-DIV, WGAN-IDR, or another algorithm that improves on WGAN. Because different improved WGAN algorithms differ in complexity, the transmissibility index of the generated information flow must also be considered when choosing one. The choice of WGAN algorithm has a significant impact on the quality of the information flow generated by this application, but little impact on the generation speed.
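For orientation, the standard WGAN objective that these variants modify can be written as below. The application does not state which formulation or hyperparameters it uses, so the symbols here are the conventional ones and are assumptions with respect to this application: D is the critic, G the generator, P_r the distribution of real samples, p(z) the noise prior, and lambda a penalty weight. The second expression shows the critic loss of WGAN-LP, given only as one example of an improved variant; x-hat denotes the points at which the penalty is evaluated, for example interpolates between real and generated samples.

```latex
\min_{G}\;\max_{D \in \mathcal{D}_{1\text{-Lip}}}\;
  \mathbb{E}_{x \sim P_r}\big[D(x)\big]
  - \mathbb{E}_{z \sim p(z)}\big[D(G(z))\big]

L_D^{\text{WGAN-LP}} =
  \mathbb{E}_{z \sim p(z)}\big[D(G(z))\big]
  - \mathbb{E}_{x \sim P_r}\big[D(x)\big]
  + \lambda\,\mathbb{E}_{\hat{x}}\Big[\big(\max\{0,\ \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\}\big)^{2}\Big]
```

The penalty term is one-sided: it is zero wherever the gradient norm of the critic is at most 1 and grows quadratically above that.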
[0043] Information flow data refers to network flow data, while multi-mode information flow templates correspond to different network protocols. Feature data and payload data are extracted in advance from the original information flow data. Feature data includes network flow characteristics such as packet size, packet interval (packet rate), packet duration, and flow length. Payload data is data obtained after cutting the protocol fields of the original information flow data. Therefore, it can realize the real transmission of network flow data and improve transmissibility.
[0044] To further and more effectively separate the feature and load components from the original information flow data, the following steps are included before step 110:
[0045] Step 010: Extract the feature data from the original information flow data based on the key feature extraction algorithm, and extract the load data from the original information flow data through the protocol fixed-length separation algorithm.
[0046] In step 010, the server uses the traffic feature extraction tool CICFlowMeter to extract features from the original information flow data. However, CICFlowMeter extracts 81 features; using all of them would introduce information redundancy and slow down generation, and some of the features are correlated with one another and therefore redundant. A key feature extraction algorithm is therefore used to filter the 81 extracted features to obtain the feature data. The load data is extracted from the original information flow data using a protocol fixed-length separation algorithm, thereby effectively separating the feature part and the load part of the original information flow data.
[0047] CICFlowMeter can process pcap files, which is consistent with the information flow format used in this application. The tool can also perform feature extraction on previously captured pcap files in offline mode, which supports the extraction of key features and the construction of the feature database. Moreover, the tool is fast and can extract a large number of features.
[0048] The specific key feature extraction algorithm is as follows:
[0049] 1. Select the target information flow template
[0050] 2. Input the original information flow data into CICFlowMeter for offline feature extraction and output 81-dimensional features in CSV format.
[0051] 3. Parse the above CSV file. Each row represents the features of a stream. Extract key features such as packet rate, packet size, and packet duration for each row.
[0052] 4. Store the key features extracted in the first three steps to form a feature database.
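Steps 2 to 4 above can be sketched in Python as follows. This is a minimal illustration, not the implementation of this application: the CSV column names in KEY_COLUMNS are assumptions (real CICFlowMeter headers differ between versions), and the in-memory list feature_db merely stands in for the feature database.

```python
import csv
import io

# Assumed column names; real CICFlowMeter headers differ between versions.
KEY_COLUMNS = {
    "packet_rate": "Flow Pkts/s",
    "packet_size": "Pkt Size Avg",
    "duration": "Flow Duration",
}


def extract_key_features(csv_text, key_columns=KEY_COLUMNS):
    """Return one dict of key features per flow (one CSV row per flow)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    features = []
    for row in reader:
        # Keep only the selected columns and discard the remaining features.
        features.append(
            {name: float(row[column]) for name, column in key_columns.items()}
        )
    return features


# Hypothetical two-flow excerpt of a feature CSV (only a few columns shown).
sample_csv = (
    "Flow ID,Flow Duration,Flow Pkts/s,Pkt Size Avg,Fwd IAT Mean\n"
    "f1,1500,20.0,512.5,75.0\n"
    "f2,300,100.0,64.0,3.0\n"
)

# The list plays the role of the feature database in this sketch.
feature_db = extract_key_features(sample_csv)
```

Keeping the mapping from feature name to column name in one dictionary makes it easy to adapt the sketch to the header names of a particular CICFlowMeter release.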
[0053] To further improve the efficiency of retrieving feature data and load data, the following step is also included after step 010:
[0054] Step 020: Store the feature data in the feature database and store the load data in the load database.
[0055] Specifically, the server stores feature data in a feature database and load data in a load database, thereby improving the efficiency of accessing feature data and load data.
[0056] To further achieve direct generation of information flow metadata, step 120 involves randomly recombining the feature data and load data corresponding to the original information flow data based on the target information flow template to generate the corresponding information flow metadata, including:
[0057] Step 121: Randomly sample the feature database and the load database to obtain sampled feature data corresponding to the feature database and sampled load data corresponding to the load database. Configure the network parameters of the target link through the network configurator, start the link based on the network parameters, and generate the information flow metadata according to the sampled feature data and the sampled load data.
[0058] Specifically, the server randomly samples the feature database and the load database to obtain the sampled feature data corresponding to the feature database and the sampled load data corresponding to the load database. Then, it configures the specified source IP, destination IP, port, data bandwidth (0 bit/s to 50000 bit/s), and link duration of the target link through the network configurator. Based on the aforementioned network parameters, the link is started and the information flow metadata is generated according to the sampled feature data and sampled load data, thereby enabling the direct generation of information flow metadata.
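The sampling and recombination in step 121 can be sketched as follows. This is an illustrative sketch under stated assumptions: the LinkConfig class, the recombine function, and the dictionary layout of a generated flow are hypothetical names introduced here, and the two small lists stand in for the feature database and the load database. Only the parameter set (source IP, destination IP, port, bandwidth, link duration) and the bandwidth range come from the description.

```python
import random
from dataclasses import dataclass


@dataclass
class LinkConfig:
    """Hypothetical container for the network configurator parameters."""
    src_ip: str
    dst_ip: str
    port: int
    bandwidth_bps: int  # the description gives a range of 0 to 50000 bit/s
    duration_s: float

    def __post_init__(self):
        if not 0 <= self.bandwidth_bps <= 50000:
            raise ValueError("bandwidth outside the 0 to 50000 bit/s range")


def recombine(feature_db, load_db, config, n_flows, rng=None):
    """Pair independently sampled features and payloads for each new flow."""
    rng = rng or random.Random()
    flows = []
    for _ in range(n_flows):
        flows.append({
            "features": rng.choice(feature_db),    # sampled feature data
            "payload_hex": rng.choice(load_db),    # sampled load data
            "link": config,
        })
    return flows


# Stand-ins for the two databases (values are made up for illustration).
feature_db = [
    {"packet_rate": 20.0, "packet_size": 512.5, "duration": 1500.0},
    {"packet_rate": 100.0, "packet_size": 64.0, "duration": 300.0},
]
load_db = ["47455420", "48545450"]

config = LinkConfig("10.0.0.1", "10.0.0.2", 8080, 20000, 5.0)
flows = recombine(feature_db, load_db, config, 3, random.Random(0))
```

Sampling the two databases independently is what allows a payload to be combined with features from a different original flow; passing a seeded random.Random makes a run reproducible.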
[0059] After step 120, which involves randomly recombining the feature data and load data corresponding to the original information flow data based on the target information flow template to generate the corresponding information flow metadata, the method further includes:
[0060] Step 122: Replay and verify the information flow metadata.
[0061] Specifically, the server uses tools such as tcpreplay, goreplay, or tcpcopy to replay and verify the metadata of the information flow, thereby proving that the information flow generated by this application can meet the requirements of real network transmission and has effective transmissibility.
[0062] The specific replay verification steps are as follows:
[0063] 1. Establish the network topology for information flow replay, specifying the sending and receiving ends.
[0064] 2. At the sending end, replay the information flow generated in the above process using the tcpreplay tool.
[0065] 3. After the replay ends at the sending end, a replay log is generated showing the transmission success rate of the information flow. At the receiving end, packets are captured using Wireshark to verify that the replay succeeded.
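The sending-end replay in step 2 can be driven from a script along the following lines. This is a sketch only: the interface name and capture path passed in are hypothetical, running tcpreplay normally requires elevated privileges, and the function returns the raw tool output rather than parsing it, since the description does not specify the log format.

```python
import shutil
import subprocess


def build_replay_command(interface, pcap_path):
    """Build the tcpreplay command line for one capture file."""
    return ["tcpreplay", "-i", interface, pcap_path]


def replay(interface, pcap_path):
    """Run tcpreplay and return (exit_code, log_text)."""
    if shutil.which("tcpreplay") is None:
        raise FileNotFoundError("tcpreplay is not installed on this host")
    result = subprocess.run(
        build_replay_command(interface, pcap_path),
        capture_output=True,
        text=True,
        check=False,  # let the caller inspect a non-zero exit code
    )
    return result.returncode, result.stdout + result.stderr
```

A call such as replay("eth0", "generated.pcap") would be made on the sending end, with the capture at the receiving end checked separately.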
[0066] To further obtain load data more accurately and effectively, step 010 involves extracting the load data from the original information flow data using a protocol fixed-length separation algorithm, including:
[0067] Step 011: Cut the header of the packet capture file corresponding to the original information flow data, read the data length of the payload data from the packet header of the packet capture file, and cut the Ethernet header, IP header and TCP header of the data content part of the packet capture file according to the data length to obtain the payload data.
[0068] Specifically, since both the original information flow and the generated information flow metadata are stored in pcap (packet capture) format, a protocol fixed-length separation algorithm is used to parse the pcap file. A pcap file consists of three parts: a pcap header, a packet header, and the data content. The payload is a portion of the data content; therefore, to obtain the payload, the pcap header is first trimmed. The length of the data content is then read from the packet header, and the packet header is trimmed to obtain the data content. Taking the TCP protocol as an example, the first 14 bytes of the data content are the Ethernet header, the next 20 bytes are the IP header, and the next 32 bytes are the TCP header; all three need to be trimmed. Therefore, for a TCP pcap packet, 14 + 20 + 32 = 66 bytes are trimmed from the data content to obtain the payload, which enables accurate and efficient acquisition of the payload data.
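The fixed-length separation described above can be sketched in Python as follows. The sketch assumes the classic pcap layout (a 24-byte file header followed by 16-byte per-packet record headers) and uses the fixed 14/20/32-byte header lengths from the description; real IP and TCP headers can vary in length, which this sketch deliberately does not handle. The function name extract_payloads and the synthetic capture are illustrative.

```python
import struct

PCAP_HEADER_LEN = 24     # pcap file header
RECORD_HEADER_LEN = 16   # per-packet header: ts_sec, ts_usec, incl_len, orig_len
ETH_LEN, IP_LEN, TCP_LEN = 14, 20, 32    # fixed lengths from the description
STRIP_LEN = ETH_LEN + IP_LEN + TCP_LEN   # 66 bytes


def extract_payloads(pcap_bytes):
    """Return the payload of every packet in a classic pcap byte string."""
    magic = pcap_bytes[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"   # file written little-endian
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"   # file written big-endian
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")

    payloads = []
    offset = PCAP_HEADER_LEN                      # trim the pcap header
    while offset + RECORD_HEADER_LEN <= len(pcap_bytes):
        _ts_sec, _ts_usec, incl_len, _orig_len = struct.unpack_from(
            endian + "IIII", pcap_bytes, offset
        )
        offset += RECORD_HEADER_LEN               # trim the packet header
        data = pcap_bytes[offset:offset + incl_len]
        offset += incl_len
        payloads.append(data[STRIP_LEN:])         # trim Ethernet + IP + TCP
    return payloads


# Synthetic one-packet capture: 66 header bytes followed by a 5-byte payload.
file_header = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
packet = bytes(STRIP_LEN) + b"hello"
record = struct.pack("<IIII", 0, 0, len(packet), len(packet)) + packet
payloads = extract_payloads(file_header + record)
```

A packet shorter than 66 bytes yields an empty payload here, because slicing past the end of a bytes object returns an empty result.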
[0069] To further and more effectively improve the storage capacity of the load database, step 020, which involves storing the load data in the load database, includes:
[0070] Step 021: Convert the load data from bit stream to hexadecimal value and store it in the load database.
[0071] Specifically, the server converts the payload data from a bitstream to a hexadecimal value and stores it in the payload database, which can effectively improve the storage capacity of the payload database while preserving the bitstream of the payload data.
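The conversion in step 021 can be sketched as follows. The use of SQLite, the table name payloads, and the helper names are assumptions made for illustration; the description only states that the payload is converted to a hexadecimal value before being stored. The round trip shows that the original bytes are recoverable from the stored value.

```python
import sqlite3


def store_payload(conn, payload):
    """Store a payload as a hexadecimal string and return its row id."""
    cur = conn.execute(
        "INSERT INTO payloads (hex_value) VALUES (?)", (payload.hex(),)
    )
    conn.commit()
    return cur.lastrowid


def load_payload(conn, row_id):
    """Read a stored hexadecimal string and convert it back to bytes."""
    (hex_value,) = conn.execute(
        "SELECT hex_value FROM payloads WHERE id = ?", (row_id,)
    ).fetchone()
    return bytes.fromhex(hex_value)


# In-memory database standing in for the load database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payloads (id INTEGER PRIMARY KEY, hex_value TEXT NOT NULL)"
)
row_id = store_payload(conn, b"GET ")
restored = load_payload(conn, row_id)
```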
[0072] The basic principles and procedural steps of this application are shown in Figure 3. The server first separates the load and features of the raw traffic to form a feature library and a load library. Then, the network configurator configures the network parameters to recombine the features and generate a stream generation file. The client device receives the generated stream metadata, i.e., the stream generation file.
[0073] From a software perspective, this application also provides an apparatus for performing all or part of the aforementioned information flow metadata generation method. Referring to Figure 2, the information flow metadata generation device specifically includes the following components:
[0074] Module 10: Target Information Flow Template Selection Module, used to obtain the current target information flow template selected by the user from the preset multi-mode information flow templates.
[0075] Module 20: Information flow metadata generation module, used to randomly reorganize each feature data and each load data corresponding to the original information flow data based on the target information flow template to generate corresponding information flow metadata, wherein the feature data and load data are extracted from the original information flow data in advance, and the load data is data obtained after cutting the protocol fields of the original information flow data.
[0076] The embodiments of the information flow metadata generation device provided in this application can be used to execute the processing flow of the information flow metadata generation method in the above embodiments. Its functions are not repeated here; refer to the detailed description of the method embodiments above.
[0077] This application provides a method and apparatus for generating network flow metadata. The method includes: obtaining a current target network flow template selected by the user from preset multi-mode network flow templates; and, based on the target network flow template, randomly recombining the feature data and load data corresponding to the original network flow data to generate corresponding network flow metadata. The feature data and load data are extracted beforehand from the original network flow data, and the load data is data obtained after cutting off the protocol fields of the original network flow data. This application can directly generate network flow data that can actually be transmitted over a network, improving transmissibility.
[0078] This application also provides an electronic device, such as a central server. This electronic device may include a processor, a memory, a receiver, and a transmitter. The processor is used to execute the information flow metadata generation method mentioned in the above embodiments. The processor and memory can be connected via a bus or other means; a bus connection is taken as an example. The receiver can be connected to the processor and memory via wired or wireless means.
[0079] The processor can be a central processing unit (CPU). The processor can also be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or combinations of the above types of chips.
[0080] Memory, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions / modules corresponding to the information flow metadata generation method in the embodiments of this application. The processor executes various functional applications and data processing by running the non-transitory software programs, instructions, and modules stored in the memory, thereby implementing the information flow metadata generation method in the above method embodiments.
[0081] The memory may include a program storage area and a data storage area. The program storage area may store the operating system and applications required for at least one function; the data storage area may store data created by the processor, etc. Furthermore, the memory may include high-speed random access memory and non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory may optionally include memory remotely located relative to the processor, which can be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
[0082] The one or more modules are stored in the memory, and when executed by the processor, the information flow metadata generation method in the embodiment is executed.
[0083] In some embodiments of this application, the user equipment may include a processor, a memory, and a transceiver unit. The transceiver unit may include a receiver and a transmitter. The processor, memory, receiver, and transmitter may be connected via a bus system. The memory is used to store computer instructions, and the processor is used to execute the computer instructions stored in the memory to control the transceiver unit to send and receive signals.
[0084] As one implementation method, the functions of the receiver and transmitter in this application can be implemented by transceiver circuits or dedicated transceiver chips, and the processor can be implemented by dedicated processing chips, processing circuits or general-purpose chips.
[0085] As another implementation approach, the server provided in this application embodiment can be implemented using a general-purpose computer. That is, the program code implementing the processor, receiver, and transmitter functions is stored in memory, and the general-purpose processor implements the processor, receiver, and transmitter functions by executing the code in memory.
[0086] This application also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of the aforementioned information flow metadata generation method. The computer-readable storage medium can be a tangible storage medium, such as random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, floppy disks, hard disks, removable storage disks, CD-ROMs, or any other form of storage medium known in the art.
[0087] Those skilled in the art will understand that the exemplary components, systems, and methods described in conjunction with the embodiments disclosed herein can be implemented in hardware, software, or a combination of both. Whether implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application. When implemented in hardware, it can be, for example, electronic circuits, application-specific integrated circuits (ASICs), appropriate firmware, plug-ins, function cards, etc. When implemented in software, the elements of this application are programs or code segments used to perform the required tasks. The programs or code segments can be stored in a machine-readable medium or transmitted over a transmission medium or communication link via data signals carried on a carrier wave.
[0088] It should be clarified that this application is not limited to the specific configurations and processes described above and shown in the figures. For the sake of brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method process of this application is not limited to the specific steps described and shown. Those skilled in the art can make various changes, modifications, and additions, or change the order of steps, after understanding the spirit of this application.
[0089] In this application, features described and / or illustrated for one embodiment may be used in the same or similar manner in one or more other embodiments, and / or combined with or in place of features of other embodiments.
[0090] The above description is merely a preferred embodiment of this application and is not intended to limit this application. Various modifications and variations can be made to the embodiments of this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the protection scope of this application.
Claims
1. A method for generating information flow metadata, characterized in that it comprises: extracting feature data from original information flow data based on a key feature extraction algorithm, and extracting load data from the original information flow data through a protocol fixed-length separation algorithm; obtaining a current target information flow template selected by a user from preset multi-mode information flow templates; wherein the feature data includes packet size, packet interval, packet duration, and flow length, and extracting the load data from the original information flow data through the protocol fixed-length separation algorithm includes: cutting the packet header of the packet capture file corresponding to the original information flow data, reading the data length of the load data from the packet header of the packet capture file, and cutting the Ethernet header, IP header, and TCP header of the data content part of the packet capture file according to the data length to obtain the load data; and, based on the target information flow template, randomly recombining the feature data and the load data corresponding to the original information flow data to generate corresponding information flow metadata, wherein the feature data and the load data are extracted from the original information flow data in advance, and the load data is data obtained after cutting off the protocol fields of the original information flow data; wherein randomly recombining the feature data and the load data corresponding to the original information flow data based on the target information flow template to generate the corresponding information flow metadata includes: randomly sampling the feature database and the load database to obtain sampled feature data corresponding to the feature database and sampled load data corresponding to the load database; configuring the network parameters of the target link through a network configurator; starting the link based on the network parameters; and generating the information flow metadata based on the sampled feature data and the sampled load data.
2. The method for generating information flow metadata according to claim 1, characterized in that, before obtaining the current target information flow template selected by the user from the preset multi-mode information flow templates, the method further includes: storing the feature data in the feature database and storing the load data in the load database.
3. The method for generating information flow metadata according to claim 1, characterized in that, after randomly recombining the feature data and load data corresponding to the original information flow data based on the target information flow template to generate the corresponding information flow metadata, the method further includes: replaying the information flow metadata for verification.
4. The method for generating information flow metadata according to claim 2, characterized in that storing the load data in the load database includes: converting the load data from a bitstream to a hexadecimal value and storing it in the load database.
5. A device for generating information flow metadata, characterized in that it comprises: a target information flow template selection module, used to extract feature data from original information flow data based on a key feature extraction algorithm, to extract load data from the original information flow data through a protocol fixed-length separation algorithm, and to obtain a current target information flow template selected by a user from preset multi-mode information flow templates; wherein the feature data includes packet size, packet interval, packet duration, and flow length, and extracting the load data from the original information flow data through the protocol fixed-length separation algorithm includes: cutting the packet header of the packet capture file corresponding to the original information flow data, reading the data length of the load data from the packet header of the packet capture file, and cutting the Ethernet header, IP header, and TCP header of the data content part of the packet capture file according to the data length to obtain the load data; and an information flow metadata generation module, used to randomly recombine the feature data and the load data corresponding to the original information flow data based on the target information flow template to generate corresponding information flow metadata, wherein the feature data and the load data are extracted from the original information flow data in advance, and the load data is data obtained after cutting off the protocol fields of the original information flow data; wherein randomly recombining the feature data and the load data corresponding to the original information flow data based on the target information flow template to generate the corresponding information flow metadata includes: randomly sampling the feature database and the load database to obtain sampled feature data corresponding to the feature database and sampled load data corresponding to the load database; configuring the network parameters of the target link through a network configurator; starting the link based on the network parameters; and generating the information flow metadata based on the sampled feature data and the sampled load data.
6. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the information flow metadata generation method as described in any one of claims 1 to 4.
7. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the information flow metadata generation method as described in any one of claims 1 to 4.