Data generation device and data generation method

The data generation device and method address the lack of time information in generated network data: a dataset is divided, a distribution function of time-series occurrence timing is generated from the divided data, and the resulting time information is combined with network data to produce pseudo-network data that can be used in machine learning for anomaly detection and fault prediction.

WO2026133434A1 · PCT designated stage · Publication Date: 2026-06-25 · NT T INC

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
NT T INC
Filing Date
2024-12-17
Publication Date
2026-06-25


Abstract

A data generation device (10) according to the present disclosure comprises: a division unit (11) that uses a predetermined rule as a basis to divide a data set transmitted and received over a network into a plurality of pieces of divided data; a generation unit (12) that uses the divided data as a basis to generate a distribution function indicating the generation timing of data transmitted and received over the network; and a synthesis unit (13) for converting the distribution function into time information, synthesizing the time information with network data relating to data that is transmitted and received over the network and does not have time information, and generating pseudo-network data.

Description

Data generation device and data generation method

[0001] The present disclosure relates to a data generation device and a data generation method.

[0002] It has been considered to perform machine learning using network traffic to detect anomalies or predict failures in a network and take countermeasures. In machine learning using traffic, it is necessary to provide traffic data in the network as an input. In order to widely disclose the learning results, the dataset used for learning also needs to be disclosed. Conventionally, only limited datasets could be obtained. Especially in machine learning using traffic time information, a large amount of traffic data having intended features is required, but it has been costly to obtain such data.

[0003] Non-Patent Document 1 describes a technique for generating network data that simulates a network environment using a generative model called a GAN (Generative Adversarial Network). A GAN has two components: a Generator and a Discriminator. The Generator generates pseudo data simulating the input real data. The Discriminator judges whether the data it is given is real or was generated by the Generator. The Generator is trained so that it can generate pseudo data that deceives the Discriminator, and the Discriminator is trained so that it can correctly judge the authenticity of the data. As a result, data similar to the prepared dataset can be generated.

[0004] Non-Patent Document 1: Yucheng Yin, Zinan Lin, Minhao Jin, Giulia Fanti, and Vyas Sekar. 2022. Practical GAN-based synthetic IP header trace generation using NetShare. In Proceedings of the ACM SIGCOMM 2022 Conference. 458-472.

[0005] The technology described in Non-Patent Document 1 does not take into account time information, such as the times at which uplink and downlink packets are communicated. If machine learning is performed using data generated without considering time information, it is difficult to use the results for anomaly detection or fault prediction in the network. Therefore, there is a need for a technology that generates network data that includes time information and simulates network traffic.

[0006] In light of the problems described above, the purpose of this disclosure is to provide a data generation device and a data generation method that can generate simulated network data containing time information that simulates network traffic.

[0007] A data generation device according to one embodiment is a data generation device that generates pseudo-network data relating to data transmitted and received on a network, which has time information and simulates network traffic, and comprises: a division unit that divides a dataset transmitted and received on the network into a plurality of divided data based on predetermined rules; a generation unit that generates a distribution function indicating the time-series occurrence timing of the data transmitted and received on the network based on the divided data; and a synthesis unit that converts the distribution function into time information and synthesizes the time information with network data relating to the data transmitted and received on the network, which does not have time information, to generate the pseudo-network data.

[0008] A data generation method according to one embodiment is a data generation method executed by a data generation device that generates pseudo-network data relating to data transmitted and received on a network, which has time information and simulates network traffic, comprising: dividing a dataset transmitted and received on the network into a plurality of divided data based on predetermined rules; generating a distribution function indicating the time-series occurrence timing of the data transmitted and received on the network based on the divided data; converting the distribution function into time information; and generating the pseudo-network data by combining the time information with network data relating to the data transmitted and received on the network, which does not have time information.

[0009] According to this disclosure, it is possible to generate simulated network data that simulates network traffic and includes time information.

[0010] Figure 1 shows an example of the configuration of a data generation device according to one embodiment of the present disclosure.
Figure 2A shows an example of data set division by the division unit shown in Figure 1.
Figure 2B shows another example of data set division by the division unit shown in Figure 1.
Figure 3 shows an example of a distribution function.
Figure 4A shows an example of the configuration of the generation unit shown in Figure 1.
Figure 4B shows another example of the configuration of the generation unit shown in Figure 1.
Figure 5A shows an example of actual packet arrival timing in a network.
Figure 5B shows an example of packet arrival timing based on the distribution function generated by the generation unit shown in Figure 1.
Figure 6 shows an example of network data.
Figure 7A is for explaining the generation of pseudo-network data by the synthesis unit shown in Figure 1.
Figure 7B is for explaining the generation of pseudo-network data by the synthesis unit shown in Figure 1.
Figure 8 is a flowchart showing an example of the operation of the data generation device shown in Figure 1.
Figure 9 shows an example of the configuration of a computer that functions as a data generation device according to the present disclosure.

[0011] Embodiments of this disclosure will be described below with reference to the drawings.

[0012] Figure 1 shows an example of the configuration of a data generation device 10 according to one embodiment of the present disclosure. The data generation device 10 according to this embodiment is a device that generates pseudo-network data relating to data transmitted and received over a network, which has time information and simulates network traffic.

[0013] As shown in Figure 1, the data generation device 10 according to this embodiment comprises a division unit 11, a generation unit 12, and a synthesis unit 13.

[0014] The division unit 11 receives a dataset that is transmitted and received over the network. The dataset may be, for example, a set of packets or a series of flows sent from a source to a destination, but is not limited to these. The division unit 11 divides the input dataset into a plurality of pieces of divided data based on predetermined rules.

[0015] For example, as shown in Figure 2A, the division unit 11 divides the input dataset by 5-tuple (source IP address (Src IP), destination IP address (Dst IP), source port number (Src Port), destination port number (Dst Port), and protocol (Protocol)). Alternatively, the division unit 11 may divide the input dataset by the 5-tuple of each flow. Alternatively, the division unit 11 may divide the input dataset by source IP address. Alternatively, the division unit 11 may divide the input dataset by header (IP (Internet Protocol) header and TCP (Transmission Control Protocol) header), as shown in Figure 2B, for example.

[0016] The divided data may, for example, consist only of a 5-tuple. Alternatively, the divided data may include, for example, a 5-tuple and a Payload. Alternatively, the divided data may include a 5-tuple, a Payload, and header information (IP header and TCP / UDP (User Datagram Protocol) header) including communication flags and packet length. Alternatively, the divided data may consist only of a 7-tuple (source MAC address (Src MAC), destination MAC address (Dst MAC), Src IP, Dst IP, Src Port, Dst Port, and Protocol). Alternatively, the divided data may include, for example, a 7-tuple and a Payload. Alternatively, the divided data may include a 7-tuple, a Payload, and header information including communication flags and packet length.

[0017] The division unit 11 may also divide the input dataset without relying on the TCP / IP header. For example, the division unit 11 may divide the input dataset by application.
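
The division rules described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the `Packet` record layout, the field names, and the function names are hypothetical and do not come from the disclosure, which does not specify an implementation.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    # Hypothetical record layout; the field names are assumptions for this sketch.
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    timestamp: float  # arrival time in seconds


def divide_by_five_tuple(packets):
    """Group packets that share the same 5-tuple into one piece of divided data."""
    groups = defaultdict(list)
    for p in packets:
        key = (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.protocol)
        groups[key].append(p)
    return dict(groups)


def divide_by_source_ip(packets):
    """Alternative rule: group packets by source IP address only."""
    groups = defaultdict(list)
    for p in packets:
        groups[p.src_ip].append(p)
    return dict(groups)


packets = [
    Packet("10.0.0.1", "10.0.0.9", 40000, 443, "TCP", 0.00),
    Packet("10.0.0.2", "10.0.0.9", 40001, 53, "UDP", 0.05),
    Packet("10.0.0.1", "10.0.0.9", 40000, 443, "TCP", 0.10),
]
by_tuple = divide_by_five_tuple(packets)
by_src = divide_by_source_ip(packets)
```

Here the first and third packets share a 5-tuple, so `by_tuple` holds two groups. Swapping the key function is all that is needed to express a different predetermined rule.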

[0018] Referring again to Figure 1, the generation unit 12 generates, based on the divided data, a distribution function that indicates the time-series occurrence timing of data transmitted and received over the network. As shown in Figure 3, the distribution function is a function in which the horizontal axis represents time and the vertical axis represents intensity (frequency of data occurrence).

[0019] To generate a distribution function, the generation unit 12 is first trained. Specifically, it divides the dataset actually sent and received on the network according to the same rules as the division unit 11, and approximates the distribution function from the arrival timing of the divided data. Then, a model for generating the distribution function is constructed using machine learning with the divided data and the distribution function of that data. The generation unit 12 generates the distribution function using the constructed model. That is, the generation unit 12 generates a distribution function based on the divided data using a model constructed by machine learning with the dataset sent and received on the network and the transmission and reception timing of that dataset.
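
As a rough illustration of approximating a distribution function from arrival timing, the following Python sketch bins arrival times and converts the counts to an intensity. The histogram approach, the function name `empirical_intensity`, and its parameters are assumptions made for this example. The disclosure does not state how the approximation is performed.

```python
import math


def empirical_intensity(arrival_times, duration, n_bins):
    """Count arrivals per time bin and convert the counts to a rate (arrivals per second)."""
    bin_width = duration / n_bins
    counts = [0] * n_bins
    for t in arrival_times:
        idx = math.floor(t / bin_width)
        if 0 <= idx < n_bins:  # ignore arrivals outside the observation window
            counts[idx] += 1
    return [c / bin_width for c in counts]


# Arrival times (seconds) of one piece of divided data; the values are made up.
arrivals = [0.05, 0.10, 0.30, 0.80, 0.85, 0.90]
intensity = empirical_intensity(arrivals, duration=1.0, n_bins=4)
print(intensity)  # [8.0, 4.0, 0.0, 12.0]
```

Under this reading, each pair of divided data and its approximated intensity curve would serve as one training example for the model that the generation unit 12 uses.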

[0020] Next, we will explain the details of the generation of the distribution function by the generation unit 12. As mentioned above, the generation unit 12 generates the distribution function using a model constructed by machine learning. First, we will explain the generation of the distribution function using a machine learning model called VAE (Variational Auto Encoder).

[0021] Figure 4A shows an example of the configuration of the generation unit 12 that generates a distribution function using VAE. As shown in Figure 4A, the generation unit 12 includes an encoder 121 and a decoder 122.

[0022] Encoder 121 is a machine learning model that transforms input information into a latent space z for the purpose of dimensionality reduction and feature extraction. In the example shown in Figure 4A, encoder 121 takes 5-tuples of divided data as input and transforms the input information into a latent space z that follows a normal distribution.

[0023] The information from the 5-tuples is plotted on the latent space z, and the mean μ and standard deviation σ of the normal distribution are extracted based on the plotted information. The decoder 122 generates a distribution function based on the extracted mean μ and standard deviation σ.
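
The last step can be pictured with a small Python sketch in which a normal-density-shaped curve stands in for the output of the decoder 122. This is one possible reading only: the function `gaussian_intensity`, the `scale` parameter, and the idea that μ and σ directly parameterize the curve over time are assumptions of this sketch, and a trained decoder would be a learned model rather than a closed-form formula.

```python
import math


def gaussian_intensity(mu, sigma, times, scale=1.0):
    """Evaluate a normal-density-shaped intensity curve at the given time points.

    Stand-in for a trained decoder: mu sets where the peak is, sigma how wide it is.
    """
    coeff = scale / (sigma * math.sqrt(2.0 * math.pi))
    return [coeff * math.exp(-0.5 * ((t - mu) / sigma) ** 2) for t in times]


times = [i * 0.1 for i in range(11)]  # 0.0, 0.1, ..., 1.0 seconds
curve = gaussian_intensity(mu=0.5, sigma=0.1, times=times)
peak_index = max(range(len(curve)), key=curve.__getitem__)
```

With these example values the curve peaks at `times[5]`, which is the time point equal to μ.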

[0024] Next, we will explain the generation of distribution functions using GANs.

[0025] Figure 4B shows an example of the configuration of the generation unit 12 that generates a distribution function using a GAN. As shown in Figure 4B, the generation unit 12 comprises a generator 123 and a detector 124, which correspond to the Generator and the Discriminator of the GAN, respectively.

[0026] The generator 123 receives a condition (a 5-tuple in the example shown in Figure 4B) and a normal distribution as input, and generates a distribution function based on the input 5-tuple.

[0027] The detector 124 receives the distribution function generated by the generator 123 and the actual distribution function of the data corresponding to the input 5-tuple, and determines whether the distribution function generated by the generator 123 is true or false.

[0028] The generator 123 is constructed by updating its learning parameters so that the detector 124 identifies the distribution function it generates as "true". The detector 124 is constructed by updating its learning parameters so that it can accurately distinguish between the distribution function generated by the generator 123 and the actual distribution function. By inputting the divided data into the generator 123 constructed in this way, a distribution function can be generated.
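
The disclosure does not state the loss function used for this adversarial training. For orientation only, the standard conditional GAN objective is shown below. It is an assumption of this note, not a statement of the method actually used. Here G is the generator 123, D is the detector 124, c is the condition (the 5-tuple), f is the actual distribution function for that condition, and z is noise drawn from a normal distribution.

```latex
\min_{G}\,\max_{D}\;
\mathbb{E}_{f \sim p_{\mathrm{data}}(f \mid c)}\bigl[\log D(f \mid c)\bigr]
+ \mathbb{E}_{z \sim \mathcal{N}(0, I)}\bigl[\log\bigl(1 - D(G(z \mid c) \mid c)\bigr)\bigr]
```

The detector maximizes this value by assigning high scores to actual distribution functions and low scores to generated ones, while the generator minimizes it by producing distribution functions that the detector scores as "true".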

[0029] Figure 5A shows an example of packet arrival timing in a real network, for 10 sessions from Session 1 to Session 10. As shown in Figure 5A, in each session packets either arrive uniformly or arrive in a pattern with peaks at regular intervals.

[0030] Figure 5B shows the packet arrival timing based on a distribution function generated from network data of the same network as in Figure 5A. In Figure 5B, as in Figure 5A, the packet arrival timing is shown for 10 sessions from Session 1 to Session 10.

[0031] In this embodiment, since a distribution function fitted to a normal distribution was used, the packet generation time interval is wider compared to Figure 5A, but it was confirmed that results similar to the actual packet arrival timing were obtained.

[0032] Referring again to Figure 1, the synthesis unit 13 receives the distribution function generated by the generation unit 12 and the network data as input. The network data is information about data transmitted and received over the network, but does not contain time information. The synthesis unit 13 converts the distribution function into time information and synthesizes the time information with the network data to generate pseudo-network data.

[0033] Figure 6 shows an example of network data. As shown in Figure 6, network data includes, for example, 5-tuple information such as the source IP address and destination IP address of the data. Network data does not include time information such as the time the data was sent or received.

[0034] The synthesis unit 13 converts the distribution function into time information. The synthesis unit 13 generates time information corresponding to the peaks of the distribution function, for example, as shown in Figure 7A. The synthesis unit 13 generates time information for the length of the network data sequence. Then, as shown in Figure 7B, the synthesis unit 13 adds (synthesizes) the generated time information to the network data to generate pseudo-network data that has time information and simulates network traffic.
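
The conversion and synthesis can be sketched in Python as follows. Note the substitution: instead of detecting peaks, this sketch places timestamps at evenly spaced quantiles of the cumulative intensity, which concentrates them where the distribution function is high. That choice, the function names, the `"time"` key, and the dictionary record layout are all assumptions of this example and are not taken from the disclosure.

```python
from bisect import bisect_left
from itertools import accumulate


def intensity_to_timestamps(intensity, bin_width, count):
    """Place `count` timestamps at evenly spaced quantiles of a non-negative intensity curve."""
    cumulative = list(accumulate(intensity))
    total = cumulative[-1]
    if total <= 0:
        raise ValueError("intensity must contain at least one positive value")
    timestamps = []
    for k in range(count):
        target = (k + 0.5) * total / count        # cumulative mass to reach
        idx = bisect_left(cumulative, target)     # first bin reaching that mass
        prev = cumulative[idx - 1] if idx > 0 else 0.0
        frac = (target - prev) / intensity[idx]   # position inside the bin
        timestamps.append((idx + frac) * bin_width)
    return timestamps


def synthesize(network_data, timestamps):
    """Attach one timestamp to each record of network data that lacks time information."""
    if len(network_data) != len(timestamps):
        raise ValueError("need exactly one timestamp per record")
    return [{**record, "time": t} for record, t in zip(network_data, timestamps)]


# Six records without time information (hypothetical values).
network_data = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9",
     "src_port": 40000, "dst_port": 443, "protocol": "TCP"}
    for _ in range(6)
]
intensity = [8.0, 4.0, 0.0, 12.0]  # arrivals per second in four 0.25 s bins
timestamps = intensity_to_timestamps(intensity, bin_width=0.25, count=len(network_data))
pseudo_network_data = synthesize(network_data, timestamps)
```

One timestamp is generated per record, matching the statement that time information is generated for the length of the network data sequence. No timestamp falls in the third bin, where the intensity is zero.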

[0035] Next, the operation of the data generation device 10 according to this embodiment will be described. Figure 8 is a flowchart showing an example of the operation of the data generation device 10 according to this embodiment, and is a diagram for explaining the data generation method executed by the data generation device 10 according to this embodiment.

[0036] The division unit 11 divides the dataset transmitted and received over the network into a plurality of pieces of divided data based on predetermined rules (step S11).

[0037] The generation unit 12 generates, based on the divided data, a distribution function that indicates the time-series occurrence timing of data transmitted and received over the network (step S12).

[0038] The synthesis unit 13 converts the distribution function into time information and synthesizes the time information with network data relating to data transmitted and received over the network, which does not have time information, to generate pseudo-network data (step S13).

[0039] As described above, the data generation device 10 according to this embodiment comprises a division unit 11, a generation unit 12, and a synthesis unit 13. The division unit 11 divides a dataset transmitted and received over a network into a plurality of pieces of divided data based on predetermined rules. The generation unit 12 generates a distribution function that indicates the time-series occurrence timing of the data transmitted and received over the network based on the divided data. The synthesis unit 13 converts the distribution function into time information and synthesizes the time information with network data relating to the data transmitted and received over the network, which does not have time information, to generate pseudo-network data.

[0040] By dividing a dataset transmitted and received over a network into divided data, a distribution function for the data transmitted and received over the network can be generated. This distribution function is converted into time information, which can then be combined with network data that lacks time information to create pseudo-network data that includes time information.

[0041] The data generation device 10 described above can be realized by the computer 20 shown in Figure 9. A program for causing the computer 20 to function as the data generation device 10 may be provided. This program may be stored on a storage medium or provided via a network. Figure 9 is a block diagram illustrating the schematic configuration of the computer 20 functioning as the data generation device 10. The computer 20 may be a general-purpose computer, a dedicated computer, a workstation, a PC (Personal Computer), an electronic notepad, etc. Program instructions may be program code, code segments, etc., for executing the necessary tasks.

[0042] As shown in Figure 9, the computer 20 includes a processor 21, a ROM (Read Only Memory) 22, a RAM (Random Access Memory) 23, a storage 24, an input unit 25, a display unit 26, and a communication interface (I / F) 27. The components are communicably connected to one another via a bus 29. Specifically, the processor 21 is a CPU (Central Processing Unit), MPU (Micro Processing Unit), GPU (Graphics Processing Unit), DSP (Digital Signal Processor), SoC (System on a Chip), etc., and may be composed of a plurality of processors of the same type or different types.

[0043] The processor 21 is a control unit that controls each component and executes various arithmetic processes. That is, the processor 21 reads a program from the ROM 22 or the storage 24 and executes the program using the RAM 23 as a work area. The processor 21 performs control of each of the above components and various arithmetic processes according to a program stored in the ROM 22 or the storage 24. In the present embodiment, a program for operating the computer 20 as the data generation device 10 according to the present disclosure is stored in the ROM 22 or the storage 24. By the program being read and executed by the processor 21, the division unit 11, the generation unit 12, and the synthesis unit 13 included in the data generation device 10 are realized.

[0044] The program may be provided in a form stored in a non-transitory storage medium such as a CD-ROM (Compact Disk Read Only Memory), a DVD-ROM (Digital Versatile Disk Read Only Memory), or a USB (Universal Serial Bus) memory. Also, the program may be in a form downloaded from an external device via a network.

[0045] The ROM 22 stores various programs and various data. The RAM 23 temporarily stores programs or data as a work area. The storage 24 is composed of an HDD (Hard Disk Drive) or an SSD (Solid State Drive), and stores various programs including an operating system and various data.

[0046] The input unit 25 includes a keyboard and a pointing device such as a mouse, and is used to perform various inputs.

[0047] The display unit 26 is, for example, a liquid crystal display, and displays various information. The display unit 26 may adopt a touch panel method and function as the input unit 25.

[0048] The communication interface 27 is an interface for communicating with other devices, and is, for example, an interface for a LAN.

[0049] Regarding the above embodiments, the following additional remarks are further disclosed.

[0050] [Appendix 1] A data generation device that generates pseudo-network data relating to data transmitted and received on a network, which has time information and simulates network traffic, comprising a control unit, wherein the control unit divides a dataset transmitted and received on the network into a plurality of divided data based on a predetermined rule, generates a distribution function indicating the time-series occurrence timing of the data transmitted and received on the network based on the divided data, converts the distribution function into time information, and synthesizes the time information with network data relating to the data transmitted and received on the network, which does not have time information, to generate the pseudo-network data.

[0051] [Appendix 2] The data generation device according to Appendix 1, wherein the control unit generates the distribution function based on the divided data using a model constructed by machine learning of the dataset transmitted and received on the network and the transmission and reception timing of the dataset.

[0052] [Appendix 3] A data generation method executed by a data generation device that generates pseudo-network data relating to data transmitted and received on a network, which has time information and simulates network traffic, comprising: dividing a dataset transmitted and received on the network into a plurality of divided data based on a predetermined rule; generating a distribution function that indicates the time-series occurrence timing of data transmitted and received on the network based on the divided data; converting the distribution function into time information; and generating the pseudo-network data by combining the time information with network data relating to data transmitted and received on the network, which does not have time information.

[0053] [Appendix 4] A non-transitory storage medium storing a program executable by a computer, the program causing the computer to operate as the data generation device described in Appendix 1 or 2.

[0054] Although the embodiments described above are representative examples, it will be apparent to those skilled in the art that many modifications and substitutions are possible within the spirit and scope of this disclosure. Therefore, the present invention should not be construed as being limited by the embodiments described above, and various modifications or changes are possible without departing from the claims. For example, it is possible to combine multiple component blocks shown in the configuration diagram of the embodiments into one, or to divide one component block.

[0055]
10 Data generation device
11 Division unit
12 Generation unit
13 Synthesis unit
20 Computer
21 Processor
22 ROM
23 RAM
24 Storage
25 Input unit
26 Display unit
27 Communication I / F
29 Bus

Claims

1. A data generation device that generates pseudo-network data relating to data transmitted and received on a network, which has time information and simulates network traffic, comprising: a division unit that divides a dataset transmitted and received on the network into a plurality of divided data based on predetermined rules; a generation unit that generates a distribution function indicating the time-series occurrence timing of data transmitted and received on the network based on the divided data; and a synthesis unit that converts the distribution function into time information and synthesizes the time information with network data relating to data transmitted and received on the network, which does not have time information, to generate the pseudo-network data.

2. A data generation device according to claim 1, wherein the generation unit generates the distribution function based on the divided data using a model constructed by machine learning of the data set transmitted and received over the network and the transmission and reception timing of the data set.

3. A data generation method executed by a data generation device that generates pseudo-network data relating to data transmitted and received on a network, which has time information and simulates network traffic, comprising: dividing a dataset transmitted and received on the network into a plurality of divided data based on predetermined rules; generating a distribution function indicating the time-series occurrence timing of data transmitted and received on the network based on the divided data; converting the distribution function into time information; and generating the pseudo-network data by combining the time information with network data relating to data transmitted and received on the network, which does not have time information.