A graph federated learning privacy enhancement method for industrial terminal network flow detection
By using graph federated learning and differential privacy technology, a knowledge graph is constructed and entity recognition is performed, which solves the problems of privacy data leakage and low detection accuracy in federated learning, and realizes efficient and secure detection of network traffic of industrial terminals.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NANJING UNIV OF POSTS & TELECOMM
- Filing Date
- 2023-04-24
- Publication Date
- 2026-06-19
AI Technical Summary
Existing federated learning poses a risk of privacy data leakage in network traffic detection, and the traffic classification results are difficult to directly combine with network anomaly detection, requiring manual intervention, resulting in low detection accuracy and efficiency.
We employ a graph federated learning approach, which combines knowledge graph construction with federated learning. We use the Word2Vec word vector model and the Bi_LSTM-CRF model for entity recognition, construct triples and visualize them. We also use differential privacy technology to enhance privacy and add Gaussian noise to protect client data.
It reduces the risk of privacy breaches in traffic data, improves the accuracy and automation of network traffic detection, reduces human intervention, and enhances the causal correlation and accuracy of detection results.
Figure CN116701618B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of federated learning and industrial internet technology, and in particular to a graph federated learning privacy enhancement method for industrial terminal network traffic detection.
Background Technology
[0002] Federated learning, as an emerging machine learning approach with high privacy and low communication costs, is gaining increasing attention. Traffic classification, as the first step in network anomaly detection or network-based intrusion detection systems, plays a crucial role in cybersecurity. However, current federated learning can still leak private data through shared gradients, and the results of traffic classification are difficult to integrate directly with network anomaly detection, so some degree of human intervention is required.
Summary of the Invention
[0003] This invention provides a graph-based federated learning privacy enhancement method for network traffic detection of industrial terminals. Unlike traditional anomaly detection methods based on hard-coded traffic classification, this method combines federated learning and knowledge graphs to integrate traffic classification and network traffic anomaly detection. This reduces the risk of traffic data privacy leakage and enables the federated learning-based detection results to reflect a deeper causal relationship between traffic data and malicious behavior. This further improves the accuracy of judging whether a terminal has malicious tendencies and reduces human error and manpower costs caused by manual intervention.
[0004] This invention provides a graph federated learning privacy enhancement method for industrial terminal network traffic detection, comprising:
[0005] Configuring a federated learning environment, wherein the federated learning environment includes a dataset, a federated learning task, and a federated learning model;
[0006] Constructing a knowledge graph based on the classification learning results of the federated learning model;
[0007] Assessing the privacy risks of the federated learning process based on the federated learning environment;
[0008] Adding privacy enhancement methods to strengthen the federated learning client based on the privacy risks.
[0009] Furthermore, the step of configuring the federated learning environment includes:
[0010] Define a dataset, and use the sorted labels of traffic software classes as data tags;
[0011] The system performs traffic segmentation, traffic cleaning, image generation, and IDX conversion on the raw .pcap format traffic data sent by the terminal and software application, using the terminal and software application as federated learning clients. Specifically, traffic segmentation and cleaning involve splitting the .pcap file into discrete traffic units, then organizing, trimming, and deleting them to transform all files into uniform data bytes. Image generation and IDX conversion treat the processed data bytes as pixels, perform black-and-white image conversion, and package them into .IDX format.
[0012] The traffic data is converted into a black and white image of a set number of pixels and packaged into .IDX format. The black and white images are then classified by terminal.
[0013] A federated learning framework based on the federated stochastic gradient descent algorithm is adopted: the central server sends the global model $W_t$ to the selected clients, each selected client trains the model on its local traffic dataset and computes the gradient of the loss function, and the gradients are sent to the central server for aggregation to configure the federated learning model. The client-side gradient calculation formula is as follows:
[0014] $$\nabla W_{t,i} = \frac{\partial\, l\big(F(x_{t,i}, W_t),\, y_{t,i}\big)}{\partial W_t}$$
[0015] where $x_{t,i}$ and $y_{t,i}$ represent the data and labels that the i-th client learns in the t-th iteration; $F(\cdot)$ represents the model output, i.e., the output value of the neural network with weights $W_t$ for the input $x_{t,i}$, which is the predicted data corresponding to the label; and $l(\cdot)$ is the loss function that measures the difference between the estimated label and the true label, whose derivative gives the gradient.
[0016] Furthermore, the step of constructing a knowledge graph based on the classification learning results of the federated learning model includes:
[0017] Based on traffic data, determine the relationships between information entities and individual information entities to construct triples;
[0018] The Word2Vec word vector model is used to transform entity names from semantic space to vector space, and the angle between vectors is calculated to classify entity categories.
[0019] The entities in the triples are transformed into nodes, and the relations are transformed into edges. The traffic data, industrial equipment, running software and malicious tendency detection results are visualized in the form of points and edges. The triples are stored as knowledge using the graph database Neo4j.
[0020] The Bi_LSTM-CRF model for entity recognition extracts the entities contained in the user's input question and assigns them to entity nodes in the knowledge storage. By retrieving the association relationships represented by the edges associated with the nodes, the information associated with the entities in the graph is obtained and returned to the user.
[0021] Furthermore, the step of determining the relationship between information entities and individual information entities based on traffic data to construct triples includes:
[0022] Identify information entities in traffic data, wherein the information entities include the software terminal to which the traffic belongs, the device controlled by the software, and the traffic data itself;
[0023] The relationship between the traffic data detected on the network and the various information entities is determined based on the classification results of federated learning;
[0024] The malicious relationship is determined based on the spatiotemporal characteristics, background characteristics, and handshake characteristics of the detected traffic data;
[0025] Construct the triple set $G_t = (\varepsilon, R, \tau)$ from the identified entities and relationships, where $\varepsilon$ represents the set of entities, $R$ represents the set of relationships between entities, and $\tau$ represents the set of triples consisting of entities and relationships.
[0026] Furthermore, the step of assessing the privacy risks of the federated learning process based on the federated learning environment includes:
[0027] Assuming the attacked object is a local image of device t, a gradient leakage attack is used to simulate the attack.
[0028] By simulating an attack, the client calculates the peak signal-to-noise ratio (PSNR) between the reconstructed image and the original image as an indicator of their similarity. The calculation formula is as follows:
[0029] $$\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}\right)$$
[0030] $$\mathrm{MSE} = \frac{1}{B}\sum_{i=1}^{B}\left(x_i - x'_i\right)^2$$
[0031] where MSE is the mean square error between the two images, $x_i$ and $x'_i$ represent the pixels of the original image and the reconstructed image, respectively, $B$ is the image size in pixels, and $\mathrm{MAX}_I$ is the maximum pixel value in the reconstructed image.
[0032] Furthermore, assuming the attacked and reconstructed image is a local image of device t, the steps of simulating the attack using gradient leakage include:
[0033] Generate a random virtual traffic image $X'_t$ and a category label $Y'_t$ for the traffic data, where the amplitude of each pixel $x'_t$ in the image follows a random distribution:
[0034] $$X'_t \leftarrow N(0,1)$$
[0035] where $N(0,1)$ represents a normal distribution with an expected value of 0 and a variance of 1;
[0036] The generated virtual image $X'_t$ is input into the locally trained neural network model $\varphi_t$ of device t, and the virtual gradient $\nabla W'_t$ is obtained from $\varphi_t(X'_t)$. Before the local gradient $\nabla W_t$ sent by the attacked client reaches the server, the client initiates a simulated DLG attack: it calculates the distance between the generated virtual gradient $\nabla W'_t$ and the original gradient $\nabla W_t$, and continuously adjusts the virtual random data $X'_t$ and the virtual label $Y'_t$ to minimize this distance:
[0037] $$X'^{*}_t,\; Y'^{*}_t = \arg\min_{X'_t,\,Y'_t} \left\| \nabla W'_t - \nabla W_t \right\|^2$$
[0038] where $X'^{*}_t$ and $Y'^{*}_t$ are the optimal reconstruction values of $X'_t$ and $Y'_t$, i.e., the values that minimize the distance between the virtual gradient and the original gradient after iterative calculation; they are the final reconstruction result.
[0039] Furthermore, the step of adding privacy enhancement methods to enhance the federated learning client based on the privacy risks includes:
[0040] Gaussian noise $x$ with normally distributed characteristics is superimposed on the client gradient $\nabla W_t$ that is at risk of leakage; its probability density function satisfies:
[0041] $$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
[0042] $$\sigma = \frac{\Delta f \sqrt{2\ln(1.25/\delta)}}{\varepsilon}$$
[0043] $$\Delta f = \max_{D_k, D_i} \left\| \varphi_t(D_k) - \varphi_t(D_i) \right\|$$
[0044] where $\mu$ is the mean of the Gaussian noise and $\sigma^2$ is the noise variance, which represents the noise level; $\varepsilon$ is the privacy budget, i.e., on adjacent datasets the deviation between the outputs of the local model $\varphi_t$ after noise is added is bounded by $e^{\varepsilon}$; $\delta$ represents the relaxation term of differential privacy, i.e., the probability of not satisfying strict differential privacy; $\Delta f$ is the global model sensitivity; and $D_k$ and $D_i$ are adjacent datasets, i.e., two datasets that differ in only one data record.
[0045] Based on the federated learning model, the client selects the privacy budget $\varepsilon$ and the global sensitivity $\Delta f$ according to local data characteristics to determine the noise, and adds the noise to the gradient data sent to the central server:
[0046] $$\nabla \widetilde{W}_t = \nabla W_t + N(0, \sigma^2)$$
[0047] where $\nabla \widetilde{W}_t$ is the gradient after noise is added, and $N(0, \sigma^2)$ is Gaussian noise that meets the distribution requirements.
[0048] This invention also provides a graph federated learning privacy enhancement device for industrial terminal network traffic detection, comprising:
[0049] A configuration module is used to configure the federated learning environment; wherein, the federated learning environment includes a dataset, a federated learning task, and a federated learning model;
[0050] A construction module is used to construct a knowledge graph based on the classification learning results of the federated learning model;
[0051] An evaluation module is used to assess the privacy risks of the federated learning process based on the federated learning environment.
[0052] An enhancement module is used to add privacy enhancement methods based on the aforementioned privacy risks to enhance the federated learning client.
[0053] The present invention also provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the above-described method.
[0054] The present invention also provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the above-described method.
[0055] The beneficial effects of this invention are as follows:
[0056] This invention constructs a knowledge graph from traffic data of industrial internet terminals and performs real-time anomaly detection and analysis of network traffic based on federated learning technology. So that the detection results based on graph federated learning can provide management systems with more decision-making and early-warning information, and in view of the risk of leaking private traffic data, this invention further proposes a privacy enhancement method to improve the adopted graph federated learning technology. It details terminal data acquisition and processing, as well as the configuration and optimization of the federated learning model, for industrial anomalous traffic detection tasks. Leveraging the inference capabilities of knowledge graphs, it establishes a causal correlation between traffic classification results and the presence of malicious intent in terminals, improving the accuracy of anomalous behavior detection. Furthermore, it adds a Gaussian-mechanism differential privacy technique to the graph federated learning algorithm to enhance the privacy of data interaction during federated learning, ensuring both model training accuracy and data interaction privacy in the federated learning process for the industrial internet.
Attached Figure Description
[0057] Figure 1 is a schematic diagram of the method flow according to an embodiment of the present invention.
[0058] Figure 2 is a schematic diagram of the graph federated learning process framework in this invention.
[0059] Figure 3 is a schematic diagram of a federated learning framework that poses privacy risks in this invention.
[0060] Figure 4 is a schematic diagram of the entities and relationships in the knowledge graph construction of this invention.
[0061] Figure 5 is a schematic diagram of the device structure according to an embodiment of the present invention.
[0062] Figure 6 is a schematic diagram of the internal structure of a computer device according to an embodiment of the present invention.
[0063] The realization of the objective, functional features and advantages of the present invention will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.
Detailed Implementation
[0064] It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
[0065] As shown in Figures 1-2, this invention provides a graph federated learning privacy enhancement method for industrial terminal network traffic detection, comprising:
[0066] S1. Configure the federated learning environment; wherein, the federated learning environment includes a dataset, a federated learning task, and a federated learning model;
[0067] Step S1 specifically includes:
[0068] S11. Set up a dataset, and use the sorting number of traffic software class as data label in the dataset;
[0069] The dataset used in this invention is USTC-TFC2016 endpoint traffic data. This dataset contains 10 malware families and 10 benign traffic types. The malware families are: Cridex (Dridex), Geodo (Emotet), Htbot, Miuref, Neris, Nsis-a, Shifu, Tinba, Virut, and Zeus. The benign traffic types include: BitTorrent, FaceTime, FTP, Gmail, MySQL, Outlook, Skype, and SMB. This dataset sorts the traffic by the software class and uses the resulting numbers as data labels.
[0070] S12. Perform traffic segmentation, traffic cleaning, image generation, and IDX conversion on the raw .pcap format traffic data sent by the terminal and software application, and use the terminal and software application as federated learning clients; wherein, the traffic segmentation and traffic cleaning are performed by splitting the .pcap file into discrete traffic units, and then organizing, trimming, and deleting them to make all files into uniform data bytes; the image generation and IDX conversion are performed by treating the processed data bytes as a pixel, converting them into a black and white image, and packaging them into .IDX format;
[0071] The data preprocessing of raw .pcap format traffic data involves four steps: traffic segmentation, traffic cleaning, image generation, and IDX conversion. Specifically, the .pcap file is first split into discrete traffic units, which are then organized, trimmed, and deleted to make all files uniformly 784 bytes. The processed data bytes are treated as pixels, converted to black and white images, and packaged into .IDX format for machine learning algorithm computation. The granularity of terminal traffic segmentation adopts a session format and focuses on traffic data at all layers. This method of visualizing terminal traffic data can solve the problem of manually designed features in traffic classification tasks, increasing model accuracy and reducing learning costs.
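The image generation and IDX conversion steps can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names are hypothetical, zero-padding of short traffic units is an assumption (the text only says units are trimmed to a uniform 784 bytes), and the header layout follows the commonly used IDX convention for unsigned-byte arrays.

```python
import struct

IMAGE_SIDE = 28
UNIT_BYTES = IMAGE_SIDE * IMAGE_SIDE  # 784 bytes per traffic unit


def to_fixed_length(payload, size=UNIT_BYTES):
    """Trim a traffic unit to `size` bytes, or pad it with 0x00 (padding is an assumption)."""
    return payload[:size].ljust(size, b"\x00")


def to_image(payload):
    """Treat each byte as one grayscale pixel (0-255) and reshape into 28 rows of 28."""
    data = to_fixed_length(payload)
    return [list(data[r * IMAGE_SIDE:(r + 1) * IMAGE_SIDE]) for r in range(IMAGE_SIDE)]


def to_idx(payloads):
    """Pack traffic units into an IDX blob: magic bytes, three dimensions, raw pixels."""
    header = struct.pack(">BBBB", 0, 0, 0x08, 3)  # 0x08 = unsigned byte, 3 dimensions
    dims = struct.pack(">III", len(payloads), IMAGE_SIDE, IMAGE_SIDE)
    body = b"".join(to_fixed_length(p) for p in payloads)
    return header + dims + body


# Hypothetical session payloads: one shorter than 784 bytes, one longer.
session = bytes(range(256)) * 2
image = to_image(session)
idx_blob = to_idx([session, b"\xff" * 1000])
```

Each session becomes exactly one 28×28 image regardless of its original length, which is what lets the classifier consume traffic without hand-designed features.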
[0072] S13. Convert the traffic data into a black and white image with a set number of pixels and package it into .IDX format, and classify the black and white image by terminal;
[0073] By setting up a federated learning task, the objective of this invention is to identify and classify network traffic data to detect whether the network is being illegally occupied, whether software terminals are maliciously hijacking devices or illegally accessing private data through traffic data, thereby curbing abnormal network activities and stopping malicious terminal devices.
[0074] The traffic data emitted by terminals and software applications, after being processed through steps S11 and S12, becomes the dataset for the federated learning client, i.e., the edge server, based on the traffic data classification task. The task of federated learning is to distinguish the categories of traffic data collected from the network, i.e., the corresponding different data sources. After querying the knowledge graph and detecting traffic anomalies, the device terminals and software applications can be accurately located. The above steps convert the traffic data into 28*28 pixel black and white images and package them in .IDX format. Federated learning only needs to classify the images by terminal.
[0075] S14. Configure the federated learning model. As shown in Figure 3, a federated learning framework based on the Federated Stochastic Gradient Descent (FedSGD) algorithm is adopted: the central server sends the global model $W_t$ to the selected clients, each selected client trains the model on its local traffic dataset and computes the gradient of the loss function, and the gradients are sent to the central server for aggregation to configure the federated learning model. The client-side gradient calculation formula is as follows:
[0076] $$\nabla W_{t,i} = \frac{\partial\, l\big(F(x_{t,i}, W_t),\, y_{t,i}\big)}{\partial W_t}$$
[0077] where $x_{t,i}$ and $y_{t,i}$ represent the data and labels that the i-th client learns in the t-th iteration; $F(\cdot)$ represents the model output, i.e., the output value of the neural network with weights $W_t$ for the input $x_{t,i}$, which is the predicted data corresponding to the label; and $l(\cdot)$ is the loss function that measures the difference between the estimated label and the true label, whose derivative gives the gradient. The entire process iterates until the loss function meets the convergence condition, at which point learning stops.
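The FedSGD round described above can be illustrated with a toy example. The sketch below is an assumption-laden simplification: it swaps the neural network for a linear model $F(x; W) = W \cdot x$ with a squared loss so that the gradient has a closed form, and the client data, learning rate, and function names are all hypothetical.

```python
def predict(weights, x):
    """Linear stand-in for the model output F(x; W)."""
    return sum(w * xi for w, xi in zip(weights, x))


def client_gradient(weights, samples):
    """Mean gradient of the squared loss (F(x; W) - y)^2 over a client's local samples."""
    grad = [0.0] * len(weights)
    for x, y in samples:
        err = predict(weights, x) - y
        for j, xj in enumerate(x):
            grad[j] += 2.0 * err * xj / len(samples)
    return grad


def server_update(weights, client_grads, lr=0.1):
    """Average the client gradients and take one gradient-descent step on the global model."""
    avg = [sum(g[j] for g in client_grads) / len(client_grads) for j in range(len(weights))]
    return [w - lr * a for w, a in zip(weights, avg)]


# Two hypothetical clients; raw samples stay local and only gradients are shared.
clients = [
    [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0)],
    [([1.0, 1.0], 1.0)],
]
weights = [0.0, 0.0]
for _ in range(200):
    grads = [client_gradient(weights, data) for data in clients]
    weights = server_update(weights, grads)
```

Note that the server only ever sees `grads`. That is exactly the quantity the simulated attack in step S3 tries to invert, which is why the shared gradients need protection.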
[0078] S2. Construct a knowledge graph based on the classification learning results of the federated learning model. A knowledge graph understands knowledge from a relational perspective, utilizing accumulated human knowledge to empower machines with the ability to understand, reason, and make decisions. This invention establishes a knowledge graph by defining relationships between various terminals, relationships between terminals and traffic data, and attributes such as whether traffic data exhibits malicious intent. After the federated learning model classifies traffic data, it can further understand the nature of the traffic, determine whether terminals are maliciously using network resources, and maintain network security.
[0079] As shown in Figure 4, step S2 specifically includes:
[0080] S21. Information Extraction. Based on traffic data, determine the relationships between information entities and individual information entities to construct triples;
[0081] Step S21 specifically includes:
[0082] S211. Determine the information entities in the traffic data, wherein the information entities include the software terminal to which the traffic belongs, the device controlled by the software, and the traffic data itself.
[0083] S212. Based on the classification results of federated learning, determine the relationship between the traffic data detected on the network and various information entities, that is, to determine which type of software and the industrial IoT facilities controlled by the software, such as robotic arms, smart cars and drones, the detected traffic data belongs to.
[0084] S213. Determine the malicious relationship based on the spatiotemporal characteristics, background characteristics, and handshake characteristics of the detected traffic data;
[0085] S214. Construct the triple set $G_t = (\varepsilon, R, \tau)$ from the determined entities and relations, where $\varepsilon$ represents the set of entities; $R$ represents the set of relationships between entities, such as mutual control feedback between terminals, or whether the traffic status has malicious tendencies; and $\tau$ represents the set of triples formed by entities and relationships. For example, if entities $e_i, e_j \in \varepsilon$ and there is a relation $r$ pointing from $e_i$ to $e_j$, then the triple $g = (e_i, r, e_j)$ can be constructed.
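As a small illustration of step S214, the following sketch derives the entity set, relation set, and triple set from a list of (head, relation, tail) triples. The entity and relation names are hypothetical examples chosen to mirror the traffic, software, and device entities described above.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


def build_graph(triples):
    """Return (entities, relations, triple set) for a list of triples."""
    entities = {t.head for t in triples} | {t.tail for t in triples}
    relations = {t.relation for t in triples}
    return entities, relations, set(triples)


# Hypothetical triples linking a flow to its software, device, and detection result.
triples = [
    Triple("flow_0001", "BELONGS_TO", "Cridex"),
    Triple("Cridex", "CONTROLS", "robotic_arm_01"),
    Triple("flow_0001", "HAS_TENDENCY", "malicious"),
]
entities, relations, tau = build_graph(triples)
```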
[0086] S22. Knowledge Fusion. Due to the uncertainty and incompleteness of various information on the internet, the triples established in the above steps can easily lead to conflicts in the descriptions of the same entity. Knowledge fusion connects different semantic expressions of the same entity from different data sources. Using the Word2Vec word vector model, entity names are transformed from the semantic space to the vector space, and the angle between the vectors is calculated to classify the entity. For example, the software objects Cridex and Dridex in traffic data, although expressed differently, both refer to the same entity. The purpose of knowledge fusion is to connect the two to prevent information conflicts between the constructed triples.
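The angle-based comparison used for knowledge fusion can be sketched with cosine similarity. The vectors below are hand-written placeholders standing in for Word2Vec embeddings, and the 0.9 threshold and the `fuse` helper are assumptions for illustration only; in practice the vectors would come from a trained word-vector model.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fuse(vectors, threshold=0.9):
    """Map each entity name to the first earlier name whose vector is close enough."""
    canonical = {}
    for name, vec in vectors.items():
        representatives = [n for n, c in canonical.items() if n == c]
        match = next((r for r in representatives if cosine(vec, vectors[r]) >= threshold), name)
        canonical[name] = match
    return canonical


# Placeholder embeddings; real values would come from a Word2Vec model.
vectors = {
    "Cridex": [0.90, 0.10, 0.20],
    "Dridex": [0.88, 0.12, 0.21],
    "Zeus": [0.10, 0.90, 0.30],
}
canonical = fuse(vectors)
```

With these placeholder vectors, Dridex is merged into Cridex while Zeus remains a separate entity, so triples mentioning either alias end up attached to one node.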
[0087] S23. Knowledge Storage. Entities in the triples are transformed into nodes, and relations into edges. Traffic data, industrial equipment, operating software, and malicious intent detection results are visualized in a point-edge format. The optimized triples are then stored in the graph database Neo4j to improve the convenience of subsequent knowledge graph queries.
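One possible way to map triples onto nodes and edges is to generate a Cypher `MERGE` statement per triple. The sketch below only builds the query text and parameters; it does not connect to a database. The `Entity` label, the `name` property, and the helper name are assumptions. Because Cypher does not accept a relationship type as a query parameter, the type is validated before being placed in the query string.

```python
import re


def triple_to_cypher(head, relation, tail):
    """Build an idempotent Cypher statement that stores one triple as two nodes and an edge."""
    if not re.fullmatch(r"[A-Z_][A-Z0-9_]*", relation):
        raise ValueError(f"unsafe relationship type: {relation!r}")
    query = (
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        f"MERGE (h)-[:{relation}]->(t)"
    )
    return query, {"head": head, "tail": tail}


query, params = triple_to_cypher("flow_0001", "BELONGS_TO", "Cridex")
```

Using `MERGE` rather than `CREATE` means that re-importing the same triple does not duplicate nodes or edges, which suits a graph that is updated as new classification results arrive.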
[0088] S24. Question Answering System Establishment. Based on the Bi_LSTM-CRF entity recognition model, the system extracts the entities contained in the user's input question and assigns them to entity nodes in the knowledge storage. By retrieving the association relationships represented by the edges associated with the nodes, the system obtains the information associated with the entities in the graph and returns it to the user.
[0089] S3. Assess the privacy risks of the federated learning process based on the federated learning environment. As a distributed machine learning method with high privacy, federated learning can ensure that the client's local data does not leave the local protection domain. However, various attack methods targeting federated learning are emerging. The mainstream attack methods include member inference attacks, label inference attacks, and reconstruction of original data attacks. This invention proposes to use simulated attacks to determine whether the above-mentioned federated learning method / architecture has the risk of privacy leakage.
[0090] Step S3 specifically includes:
[0091] S31. Simulated attack. Assuming the image to be attacked and reconstructed is a local image of device t, a gradient leakage (DLG) attack is simulated.
[0092] Step S31 specifically includes:
[0093] Generate a random virtual traffic image $X'_t$ and a category label $Y'_t$ for the traffic data, where the amplitude of each pixel $x'_t$ in the image follows a random distribution:
[0094] $$X'_t \leftarrow N(0,1)$$
[0095] where $N(0,1)$ represents a normal distribution with an expected value of 0 and a variance of 1;
[0096] The generated virtual image $X'_t$ is input into the locally trained neural network model $\varphi_t$ of device t, and the virtual gradient $\nabla W'_t$ is obtained from $\varphi_t(X'_t)$. Before the local gradient $\nabla W_t$ sent by the attacked client reaches the server, the client initiates a simulated DLG attack: it calculates the distance between the generated virtual gradient $\nabla W'_t$ and the original gradient $\nabla W_t$, and continuously adjusts the virtual random data $X'_t$ and the virtual label $Y'_t$ to minimize this distance:
[0097] $$X'^{*}_t,\; Y'^{*}_t = \arg\min_{X'_t,\,Y'_t} \left\| \nabla W'_t - \nabla W_t \right\|^2$$
[0098] where $X'^{*}_t$ and $Y'^{*}_t$ are the optimal reconstruction values of $X'_t$ and $Y'_t$, i.e., the values that minimize the distance between the virtual gradient and the original gradient after iterative calculation; they are the final reconstruction result.
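The gradient-matching loop of the simulated attack can be sketched on a toy problem. This is not the attack as implemented in the invention: the neural network is replaced by a linear model with squared loss, the optimizer is plain gradient descent with finite-difference gradients and backtracking (a swapped-in simplification), and all names and values are hypothetical. It only shows the core idea of adjusting dummy data and a dummy label until their gradient approaches the intercepted one.

```python
import random


def gradient(weights, x, y):
    """Gradient of the squared loss (w.x - y)^2 with respect to the weights."""
    err = sum(w * xi for w, xi in zip(weights, x)) - y
    return [2.0 * err * xi for xi in x]


def distance(weights, dummy, target_grad):
    """Squared L2 distance between the dummy gradient and the intercepted gradient."""
    *x, y = dummy
    g = gradient(weights, x, y)
    return sum((a - b) ** 2 for a, b in zip(g, target_grad))


def simulate_dlg(weights, target_grad, steps=300, seed=0, h=1e-6):
    """Adjust dummy data X' and label Y' so that their gradient matches target_grad."""
    rng = random.Random(seed)
    dummy = [rng.gauss(0.0, 1.0) for _ in range(len(weights) + 1)]  # X', Y' ~ N(0, 1)
    history = [distance(weights, dummy, target_grad)]
    for _ in range(steps):
        base = history[-1]
        # Forward finite-difference estimate of d(distance)/d(dummy).
        grad = []
        for k in range(len(dummy)):
            bumped = dummy[:]
            bumped[k] += h
            grad.append((distance(weights, bumped, target_grad) - base) / h)
        # Backtracking: halve the step until the distance actually decreases.
        step = 1.0
        while step > 1e-12:
            trial = [d - step * g for d, g in zip(dummy, grad)]
            d_trial = distance(weights, trial, target_grad)
            if d_trial < base:
                dummy, base = trial, d_trial
                break
            step *= 0.5
        history.append(base)
    return dummy[:-1], dummy[-1], history


# Hypothetical local model and private sample of the attacked client.
weights = [0.5, -0.3, 0.8]
true_x, true_y = [0.2, 0.7, 0.1], 1.0
target = gradient(weights, true_x, true_y)
x_rec, y_rec, history = simulate_dlg(weights, target)
```

For a model this small, many dummy inputs produce the same gradient, so the recovered `x_rec` need not equal `true_x`. On image classifiers the matching constraint is far tighter, which is why the reconstructed image can closely resemble the original.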
[0099] S32. Assess privacy risks. Using the simulated attack described above, the client calculates the Peak Signal-to-Noise Ratio (PSNR) between the reconstructed image and the original image as an indicator of their similarity. The calculation formula is as follows:
[0100] $$\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}\right)$$
[0101] $$\mathrm{MSE} = \frac{1}{B}\sum_{i=1}^{B}\left(x_i - x'_i\right)^2$$
[0102] where MSE is the mean square error between the two images, $x_i$ and $x'_i$ represent the pixels of the original image and the reconstructed image, respectively, $B$ is the image size in pixels, and $\mathrm{MAX}_I$ is the maximum pixel value in the reconstructed image. The higher the PSNR, the closer the reconstructed image is to the original image. Generally, above 40 dB the difference between the two is almost indistinguishable to the naked eye, while between 30 and 40 dB the reconstructed image may contain some noise contamination. Therefore, this invention uses 30 dB as the limit: if the PSNR exceeds this limit, the federated learning process is considered to pose a significant privacy risk that requires defensive measures.
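The risk check in S32 can be sketched as follows, assuming the standard definition PSNR = 10·log10(MAX² / MSE) over flattened pixel lists. The function names are hypothetical. Following the text, MAX defaults to the largest pixel value in the reconstructed image, and the 30 dB limit is the one stated above.

```python
import math


def psnr(original, reconstructed, max_value=None):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if not original or len(original) != len(reconstructed):
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images: reconstruction is perfect
    peak = max(reconstructed) if max_value is None else max_value
    if peak <= 0:
        raise ValueError("maximum pixel value must be positive")
    return 10.0 * math.log10(peak ** 2 / mse)


def is_privacy_risk(psnr_db, limit_db=30.0):
    """Flag a significant leakage risk when the reconstruction is too faithful."""
    return psnr_db > limit_db


# Hypothetical 4-pixel images, just to exercise the formula.
original = [100, 100, 100, 100]
reconstructed = [110, 90, 100, 100]
score = psnr(original, reconstructed)
```

Here the MSE is 50 and the peak is 110, so the score is 10·log10(242), which is below the 30 dB limit and would not be flagged.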
[0103] S4. Add privacy enhancement methods to strengthen the federated learning client based on the aforementioned privacy risks. This invention employs a differential privacy defense method based on Gaussian mechanisms to enhance the privacy of the federated learning client, i.e., the edge server, to protect the traffic data collected by the client from being eavesdropped on and leaked during the learning process.
[0104] Step S4 specifically includes:
[0105] S41. Gaussian noise $x$ with normally distributed characteristics is superimposed on the client gradient $\nabla W_t$ that is at risk of leakage; its probability density function satisfies:
[0106] $$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad \sigma = \frac{\Delta f \sqrt{2\ln(1.25/\delta)}}{\varepsilon}$$
[0107] where $\mu$ is the mean of the Gaussian noise and $\sigma^2$ is the noise variance, which represents the noise level; the specific values of the distribution parameters are determined in this invention. $\varepsilon$ represents the privacy budget, i.e., on adjacent datasets the deviation between the outputs of the local model $\varphi_t$ after noise is added is bounded by $e^{\varepsilon}$; $\delta$ represents the relaxation term of differential privacy, i.e., the probability of not satisfying strict differential privacy; and $\Delta f$ is the global model sensitivity, defined as:
[0108] $$\Delta f = \max_{D_k, D_i} \left\| \varphi_t(D_k) - \varphi_t(D_i) \right\|$$
[0109] where $D_k$ and $D_i$ are adjacent datasets, i.e., two datasets that differ in only one data record.
[0110] S42. Based on the federated learning model, the client selects two parameters, the privacy budget $\varepsilon$ and the global sensitivity $\Delta f$, according to local data characteristics to determine the noise, ensuring that the added noise stays within the client's overall budget so that it neither causes the defense to fail nor sacrifices too much model accuracy. On this basis, noise is added to the gradient data sent to the central server:
[0111] $$\nabla \widetilde{W}_t = \nabla W_t + N(0, \sigma^2)$$
[0112] where $\nabla \widetilde{W}_t$ is the gradient after noise is added, and $N(0, \sigma^2)$ is Gaussian noise that meets the above distribution requirements.
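The noise-addition step can be sketched as below. The noise scale uses the classic Gaussian-mechanism calibration σ = Δf·sqrt(2·ln(1.25/δ))/ε, which is an assumption here (it is the textbook choice and is stated for 0 < ε < 1). The function names and parameter values are hypothetical.

```python
import math
import random


def gaussian_sigma(epsilon, delta, sensitivity):
    """Noise scale for the classic Gaussian mechanism (assumed calibration, 0 < epsilon < 1)."""
    if not (0.0 < epsilon < 1.0) or not (0.0 < delta < 1.0) or sensitivity <= 0.0:
        raise ValueError("expected 0 < epsilon < 1, 0 < delta < 1 and sensitivity > 0")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def privatize(gradient, epsilon, delta, sensitivity, rng=None):
    """Add zero-mean Gaussian noise to every gradient component before upload."""
    rng = rng or random.Random()
    sigma = gaussian_sigma(epsilon, delta, sensitivity)
    return [g + rng.gauss(0.0, sigma) for g in gradient]


# Hypothetical client gradient and privacy parameters.
local_gradient = [0.1, -0.2, 0.3]
sigma = gaussian_sigma(epsilon=0.5, delta=1e-5, sensitivity=1.0)
noisy_gradient = privatize(local_gradient, 0.5, 1e-5, 1.0, random.Random(42))
```

A smaller ε or a larger Δf yields a larger σ, meaning more privacy but a noisier gradient. This is the accuracy trade-off that the choice of the two parameters in S42 is meant to balance.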
[0113] This invention constructs a knowledge graph from traffic data of industrial internet terminals and performs real-time anomaly detection and analysis of network traffic based on federated learning technology. So that the detection results based on graph federated learning can provide management systems with more decision-making and early-warning information, and in view of the risk of leaking private traffic data, this invention further proposes a privacy enhancement method to improve the adopted graph federated learning technology. It details terminal data acquisition and processing, as well as the configuration and optimization of the federated learning model, for industrial anomalous traffic detection tasks. Leveraging the inference capabilities of knowledge graphs, it establishes a causal correlation between traffic classification results and the presence of malicious intent in terminals, improving the accuracy of anomalous behavior detection. Furthermore, it adds a Gaussian-mechanism differential privacy technique to the graph federated learning algorithm to enhance the privacy of data interaction during federated learning, ensuring both model training accuracy and data interaction privacy in the federated learning process for the industrial internet.
[0114] As shown in Figure 5, the present invention also provides a graph federated learning privacy enhancement device for industrial terminal network traffic detection, comprising:
[0115] Configuration module 1 is used to configure the federated learning environment; wherein, the federated learning environment includes a dataset, a federated learning task, and a federated learning model;
[0116] Construction module 2 is used to construct a knowledge graph based on the classification learning results of the federated learning model;
[0117] Evaluation module 3 is used to evaluate the privacy risks of the federated learning process based on the federated learning environment.
[0118] Enhancement module 4 is used to add privacy enhancement methods based on the privacy risks to enhance the federated learning client.
[0119] In one embodiment, configuration module 1 includes:
[0120] The sorting label unit is used to set up a dataset, and the dataset sorts the traffic software classes as data labels.
[0121] The preprocessing unit is used to perform traffic segmentation, traffic cleaning, image generation, and IDX conversion on the raw .pcap format traffic data sent by the terminal and software application, and to use the terminal and software application as federated learning clients. Specifically, traffic segmentation and cleaning involve splitting the .pcap file into discrete traffic units, and then organizing, trimming, and deleting them to make all files into uniform data bytes. Image generation and IDX conversion treat the processed data bytes as a pixel, perform black and white image conversion, and package it into .IDX format.
[0122] A classification unit is used to convert the traffic data into a black and white image of a set number of pixels and package it into .IDX format, and classify the black and white image by terminal.
[0123] The learning model configuration unit is used to employ a federated learning framework based on the federated stochastic gradient descent algorithm: the central server sends the global model $W_t$ to the selected clients, each selected client trains the model on its local traffic dataset and computes the gradient $\nabla W_{t,i}$ of the loss function, and the gradients are sent to the central server for aggregation to configure the federated learning model. The client-side gradient calculation formula is as follows:
[0124] $$\nabla W_{t,i} = \frac{\partial\, l\left(F(x_{t,i}, W_t),\, y_{t,i}\right)}{\partial W_t}$$
[0125] Where $x_{t,i}$ and $y_{t,i}$ represent the data and labels that the $i$-th client learns in the $t$-th iteration; $F(\cdot)$ represents the model output, i.e., the output value of the neural network with model weights $W_t$ for the input $x_{t,i}$, which is the predicted label for the data; and $l(\cdot)$ is the loss function used to calculate the difference between the estimated label and the true label. The gradient of the loss function is obtained by taking its derivative.
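The client gradient computation and server aggregation described above can be illustrated with a small sketch. The assumptions are loud ones: a linear model stands in for the neural network, the loss is a squared error, the server aggregates by plain averaging with a fixed learning rate, and the data are synthetic. The names `client_gradient` and `server_step` are hypothetical.

```python
import numpy as np

def client_gradient(W, x, y):
    """Gradient of l = 0.5 * mean ||F(x, W) - y||^2 for the assumed model F(x, W) = x @ W."""
    residual = x @ W - y
    return x.T @ residual / x.shape[0]

def server_step(W, gradients, lr=0.1):
    """Assumed aggregation rule: average the client gradients, then take one descent step."""
    return W - lr * np.mean(gradients, axis=0)

def global_loss(W, clients):
    return sum(0.5 * np.mean(np.sum((x @ W - y) ** 2, axis=1)) for x, y in clients)

rng = np.random.default_rng(0)
true_W = rng.normal(size=(4, 2))
clients = []
for _ in range(3):  # three synthetic clients, each with its own local data
    x = rng.normal(size=(32, 4))
    clients.append((x, x @ true_W))

W = np.zeros((4, 2))  # global model held by the central server
initial_loss = global_loss(W, clients)
for t in range(200):
    grads = [client_gradient(W, x, y) for x, y in clients]  # computed locally
    W = server_step(W, grads)                               # aggregated centrally
final_loss = global_loss(W, clients)
```

Only gradients leave the clients in this loop, which is exactly the shared data that the later privacy risk evaluation targets.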
[0126] In one embodiment, construction module 2 includes:
[0127] The triple construction unit is used to determine the information entities and the relationships between them based on the traffic data, in order to construct triples;
[0128] The transformation unit is used to transform entity names from the semantic space to the vector space using the Word2Vec word vector model, and calculate the angle between vectors to classify entity categories.
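Classification by the angle between vectors can be sketched as below. The three-dimensional vectors are hand-written placeholders, not outputs of a trained Word2Vec model, and the idea of comparing an entity vector against one prototype vector per category is an assumption made for illustration.

```python
import numpy as np

def angle_between(u, v):
    """Angle in radians between two vectors, computed from the cosine similarity."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def classify_entity(vector, prototypes):
    """Assign the category whose prototype vector forms the smallest angle."""
    return min(prototypes, key=lambda name: angle_between(vector, prototypes[name]))

# Placeholder category vectors (hypothetical values).
prototypes = {
    "software": np.array([1.0, 0.1, 0.0]),
    "device": np.array([0.0, 1.0, 0.2]),
    "traffic": np.array([0.1, 0.0, 1.0]),
}
entity_vector = np.array([0.9, 0.2, 0.1])
category = classify_entity(entity_vector, prototypes)
```

Using the angle rather than the Euclidean distance makes the comparison independent of vector length, so only the direction in the embedding space matters.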
[0129] The conversion unit is used to convert entities in triples into nodes and relations into edges, to visualize traffic data, industrial equipment, running software, and malicious intent detection results in the form of nodes and edges, and to store the triples as knowledge in the graph database Neo4j.
[0130] The return unit is used to extract the entities contained in the user's input question using the Bi_LSTM-CRF entity recognition model and to map them to entity nodes in the knowledge store. By retrieving the relationships represented by the edges connected to those nodes, the information associated with the entities in the graph is obtained and returned to the user.
[0131] In one embodiment, the triple construction unit includes:
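The storage and retrieval flow above can be sketched in memory. In this illustration a plain Python class stands in for Neo4j and naive substring matching stands in for the Bi_LSTM-CRF model, so it shows only the data flow, not the actual recognition or database technology. The triples, relation names, and the class `TripleGraph` are hypothetical.

```python
from collections import defaultdict

# Hypothetical triples of the form (head entity, relation, tail entity).
triples = [
    ("flow_001", "belongs_to", "SCADA client"),
    ("SCADA client", "controls", "PLC-7"),
    ("flow_001", "detected_as", "benign"),
]

class TripleGraph:
    """In-memory stand-in for the graph database: entities are nodes, relations are edges."""

    def __init__(self, triples):
        self.nodes = set()
        self.edges = defaultdict(list)
        for head, relation, tail in triples:
            self.nodes.update((head, tail))
            self.edges[head].append((relation, tail))

    def neighbours(self, entity):
        """Return the (relation, tail) pairs on edges leaving the given node."""
        return list(self.edges.get(entity, []))

def extract_entities(question, graph):
    """Placeholder entity recognition: keep node names that occur in the question."""
    q = question.lower()
    return [node for node in sorted(graph.nodes) if node.lower() in q]

def answer(question, graph):
    """Map recognized entities to nodes and return the information on their edges."""
    return {entity: graph.neighbours(entity) for entity in extract_entities(question, graph)}

graph = TripleGraph(triples)
result = answer("Which software does flow_001 belong to?", graph)
```

Because the answer is read directly from the edges of the matched node, the result carries the relation name along with the related entity.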
[0132] The information entity determination subunit is used to determine information entities in traffic data, wherein the information entity includes the software terminal to which the traffic belongs, the device controlled by the software, and the traffic data itself;
[0133] The information entity relationship subunit is used to determine the relationship between traffic data detected on the network and various information entities based on the classification results of federated learning;
[0134] The malicious relationship determination subunit is used to determine the malicious relationship of the detected traffic data based on its spatiotemporal characteristics, background characteristics, and handshake characteristics.
[0135] The triple set subunit is used to construct the triple set $G_t = (\varepsilon, R, \tau)$ from the defined entities and relations, where $\varepsilon$ represents the set of entities, $R$ represents the set of relationships between entities, and $\tau$ represents the set of triples consisting of entities and relationships.
[0136] In one embodiment, evaluation module 3 includes:
[0137] The attack unit is used to simulate a gradient leakage attack, assuming that the image to be reconstructed is a local image of device $t$.
[0138] The computational unit is used to calculate the peak signal-to-noise ratio (PSNR) between the image reconstructed by the simulated attack and the original image, as an indicator of their similarity. The calculation formula is as follows:
[0139] $$PSNR = 10 \cdot \log_{10}\left(\frac{MAX_I^2}{MSE}\right)$$
[0140] $$MSE = \frac{1}{B}\sum_{i=1}^{B}\left(x_i - \hat{x}_i\right)^2$$
[0141] Where MSE is the mean square error between the two images, $x_i$ and $\hat{x}_i$ represent the pixels of the original image and the reconstructed image, respectively, $B$ is the number of pixels, and $MAX_I$ is the maximum pixel value in the reconstructed image.
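As an illustration of the similarity indicator, the following sketch computes the PSNR of two small arrays. It follows the standard PSNR definition and takes the maximum pixel value from the reconstructed image, as stated above. The function name `psnr` and the four-pixel example values are assumptions.

```python
import numpy as np

def psnr(original, reconstructed):
    """Peak signal-to-noise ratio in dB between an original and a reconstructed image."""
    original = np.asarray(original, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    mse = np.mean((original - reconstructed) ** 2)  # mean over all B pixels
    if mse == 0:
        return float("inf")  # identical images: the attack recovered every pixel
    max_i = reconstructed.max()  # maximum pixel value in the reconstructed image
    return float(10.0 * np.log10(max_i ** 2 / mse))

# Made-up four-pixel example: only the last pixel differs, by 10 grey levels.
original = [0, 100, 200, 255]
reconstructed = [0, 100, 200, 245]
score = psnr(original, reconstructed)
```

A higher PSNR means the reconstruction is closer to the original, which in this setting indicates a higher privacy risk for the client.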
[0142] In one embodiment, the attack unit includes:
[0143] The generation subunit is used to generate a random virtual traffic image $X'_t$ and a category label $Y'_t$ of the traffic data, where the amplitude of each pixel $x'_t$ in the image follows a random distribution:
[0144] $$X'_t \leftarrow N(0, 1)$$
[0145] Where N represents a normal distribution with an expected value of 0 and a variance of 1;
[0146] The input subunit is used to input the generated virtual image $X'_t$ into the locally trained neural network model $\varphi_t$ of device $t$ and obtain the virtual gradient $\nabla W'_t$ from $\varphi_t(X'_t)$. Before the local gradient data $\nabla W_t$ sent by the attacked client reaches the server, the client initiates a simulated DLG attack: it calculates the distance between the generated virtual gradient $\nabla W'_t$ and the original gradient $\nabla W_t$, and continuously adjusts the virtually generated random data $X'_t$ and virtual label $Y'_t$ to minimize this distance:
[0147] $$X_t^{\prime *},\, Y_t^{\prime *} = \arg\min_{X'_t,\, Y'_t} \left\| \nabla W'_t - \nabla W_t \right\|^2$$
[0148] Where $X_t^{\prime *}$ and $Y_t^{\prime *}$ are the optimal reconstruction values of $X'_t$ and $Y'_t$, i.e., the values that minimize the distance between the virtual gradient and the original gradient after iterative calculation; they are the final reconstruction result.
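The gradient-matching idea behind the simulated attack can be sketched under strong simplifying assumptions. Here a single linear unit with squared loss stands in for the locally trained neural network, the virtual label is also drawn from N(0, 1), and the distance is minimized by finite-difference gradient descent with a backtracking step size rather than by the optimizer of an actual DLG implementation. All function names and data values are hypothetical.

```python
import numpy as np

def model_gradient(w, b, x, y):
    """Gradient of l = 0.5 * (w.x + b - y)^2 with respect to (w, b) for one sample."""
    residual = w @ x + b - y
    return np.append(residual * x, residual)

def distance(params, w, b, target_grad):
    """Squared distance between the virtual gradient and the original gradient."""
    x_virtual, y_virtual = params[:-1], params[-1]
    return float(np.sum((model_gradient(w, b, x_virtual, y_virtual) - target_grad) ** 2))

def numerical_grad(f, params, eps=1e-6):
    """Central finite-difference estimate of the gradient of f at params."""
    grad = np.zeros_like(params)
    for k in range(params.size):
        step = np.zeros_like(params)
        step[k] = eps
        grad[k] = (f(params + step) - f(params - step)) / (2 * eps)
    return grad

def simulate_dlg(w, b, target_grad, iterations=500, seed=0):
    rng = np.random.default_rng(seed)
    params = rng.normal(0.0, 1.0, size=w.size + 1)  # virtual data and label from N(0, 1)
    f = lambda p: distance(p, w, b, target_grad)
    history = [f(params)]
    lr = 0.1
    for _ in range(iterations):
        candidate = params - lr * numerical_grad(f, params)
        if f(candidate) < history[-1]:
            params, lr = candidate, lr * 1.2  # accept the step and grow the step size
        else:
            lr *= 0.5                         # reject the step and shrink the step size
        history.append(f(params))
    return params[:-1], params[-1], history

rng = np.random.default_rng(1)
w, b = rng.normal(size=8), 0.3
x_private, y_private = rng.uniform(0.0, 1.0, size=8), 1.0
original_grad = model_gradient(w, b, x_private, y_private)  # what the client would send
x_rec, y_rec, history = simulate_dlg(w, b, original_grad)
```

The attacker in this sketch never sees `x_private`; it uses only the shared gradient and the model parameters, which is why the reconstruction quality can serve as a measure of leakage.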
[0149] In one embodiment, enhancement module 4 includes:
[0150] The superposition unit is used to superimpose Gaussian noise $x$ with a normal distribution on the gradient $\nabla W_t$ of a client at risk of leakage; the probability density function of the noise satisfies:
[0151] $$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
[0152]
[0153] $$\Delta f = \max \left\| \varphi_t(D_k) - \varphi_t(D_i) \right\|$$
[0154] Where $\mu$ is the mean of the Gaussian noise; $\sigma^2$ is the noise variance, representing the noise level; $\varepsilon$ is the privacy budget, i.e., for adjacent datasets, the output error value of the local model $\varphi_t$ after adding noise does not exceed $e^{\varepsilon}$; $\delta$ represents the relaxation term of differential privacy, i.e., the probability of not satisfying strict differential privacy; $\Delta f$ is the global model sensitivity; and $D_k$ and $D_i$ are adjacent datasets, i.e., two datasets that differ by only one bit.
[0155] The adding unit is used to select, according to the federated learning model and based on local data characteristics, a privacy budget $\varepsilon$ and a global sensitivity $\Delta f$, and to add noise to the gradient data sent to the central server:
[0156] $$\widetilde{\nabla W}_t = \nabla W_t + N(0, \sigma^2)$$
[0157] Where $\widetilde{\nabla W}_t$ is the gradient after adding noise interference, and $N(0, \sigma^2)$ is Gaussian noise that satisfies the distribution requirements.
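The noise addition step can be sketched as follows. The text above does not preserve how the noise level is derived from the privacy budget, the relaxation term, and the sensitivity, so this sketch assumes the classical Gaussian-mechanism calibration $\sigma = \Delta f \sqrt{2\ln(1.25/\delta)}/\varepsilon$, which is commonly stated for $0 < \varepsilon < 1$. That formula, the function names, and the numeric values are assumptions for illustration only.

```python
import math
import numpy as np

def gaussian_sigma(epsilon, delta, sensitivity):
    """Assumed calibration: classical Gaussian mechanism, sigma grows as epsilon shrinks."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def add_gaussian_noise(grad, epsilon, delta, sensitivity, rng):
    """Return the gradient plus zero-mean Gaussian noise N(0, sigma^2), and sigma itself."""
    sigma = gaussian_sigma(epsilon, delta, sensitivity)
    return grad + rng.normal(0.0, sigma, size=grad.shape), sigma

rng = np.random.default_rng(0)
gradient = np.array([0.5, -1.2, 0.3, 0.9])  # made-up local gradient
noisy_gradient, sigma = add_gaussian_noise(
    gradient, epsilon=0.5, delta=1e-5, sensitivity=1.0, rng=rng
)
```

A smaller privacy budget yields a larger noise level, so the choice of budget trades reconstruction resistance against the accuracy of the aggregated model.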
[0158] Each of the above modules, units, and sub-units is used to perform the respective steps in the graph federated learning privacy enhancement method for industrial terminal network traffic detection. The specific implementation methods are as described in the above method embodiments and will not be repeated here.
[0159] As shown in Figure 6, the present invention also provides a computer device, which may be a server, and whose internal structure may be as shown in Figure 6. The computer device includes a processor, memory, network interface, and database connected via a system bus. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and database. The internal memory provides the environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The database stores all data required for the process of the graph federated learning privacy enhancement method for industrial terminal network traffic detection. The network interface communicates with external terminals via a network connection. The computer program is executed by the processor to implement the graph federated learning privacy enhancement method for industrial terminal network traffic detection.
[0160] Those skilled in the art will understand that the structure shown in Figure 6 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer equipment to which the present application is applied.
[0161] An embodiment of this application also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements any of the above-described graph federated learning privacy enhancement methods for industrial terminal network traffic detection.
[0162] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media provided in this application and in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
[0163] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, apparatus, article, or method. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, apparatus, article, or method that includes that element.
[0164] The above description is merely a preferred embodiment of the present invention and does not limit the patent scope of the present invention. Any equivalent structural or procedural transformations made based on the content of the present invention's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of the present invention.
Claims
1. A graph federated learning privacy enhancement method for industrial terminal network traffic detection, characterized in that it includes:
Configuring a federated learning environment, which includes a dataset, a federated learning task, and a federated learning model; specifically: setting up a dataset in which the software classes of the traffic are sorted and used as data labels; performing traffic segmentation, traffic cleaning, image generation, and IDX conversion on the raw .pcap format traffic data sent by terminals and software applications, and using the terminals and software applications as federated learning clients; wherein traffic segmentation and traffic cleaning involve splitting the .pcap file into discrete traffic units and then organizing, trimming, and deleting them so that all files have a uniform number of data bytes, and image generation and IDX conversion treat each processed data byte as a pixel, perform black-and-white image conversion, and package the result into .IDX format; converting the traffic data into black-and-white images of a set number of pixels packaged in .IDX format, and classifying the black-and-white images by terminal; employing a federated learning framework based on the federated stochastic gradient descent algorithm, in which the central server sends the global model $W_t$ to the selected clients, each selected client trains the model on its local traffic dataset and computes the gradient $\nabla W_{t,i}$ of the loss function, and the gradients are sent to the central server for aggregation to configure the federated learning model; the client-side gradient calculation formula is as follows: $$\nabla W_{t,i} = \frac{\partial\, l\left(F(x_{t,i}, W_t),\, y_{t,i}\right)}{\partial W_t}$$ where $x_{t,i}$ and $y_{t,i}$ represent the data and labels that the $i$-th client learns in the $t$-th iteration, $F(\cdot)$ represents the model output, i.e., the output value of the neural network with model weights $W_t$ for the input $x_{t,i}$, which is the predicted label for the data, and $l(\cdot)$ is the loss function used to calculate the difference between the estimated label and the true label, the gradient of the loss function being obtained by taking its derivative;
Constructing a knowledge graph based on the classification learning results of the federated learning model; specifically: determining information entities and the relationships between them based on the traffic data to construct triples, which includes: determining the information entities in the traffic data, where the information entities include the software terminal to which the traffic belongs, the device controlled by the software, and the traffic data itself; determining the relationships between the traffic data detected on the network and each information entity based on the classification results of federated learning; determining the malicious relationship of the detected traffic data based on its spatiotemporal characteristics, background characteristics, and handshake characteristics; and constructing the triple set $G_t = (\varepsilon, R, \tau)$ from the determined entities and relationships, where $\varepsilon$ represents the set of entities, $R$ represents the set of relationships between entities, and $\tau$ represents the set of triples consisting of entities and relationships; using the Word2Vec word vector model to transform entity names from the semantic space to the vector space, and calculating the angle between vectors to classify entity categories; converting entities in the triples into nodes and relations into edges, visualizing traffic data, industrial equipment, running software, and malicious intent detection results in the form of nodes and edges, and storing the triples as knowledge in the graph database Neo4j; extracting the entities contained in the user's input question using the Bi_LSTM-CRF entity recognition model and mapping them to entity nodes in the knowledge store, and, by retrieving the relationships represented by the edges connected to those nodes, obtaining the information associated with the entities in the graph and returning it to the user;
Assessing the privacy risks of the federated learning process based on the federated learning environment; specifically: assuming that the image to be reconstructed is a local image of device $t$, and simulating a gradient leakage attack, which includes: generating a random virtual traffic image $X'_t$ and a category label $Y'_t$ of the traffic data, where the amplitude of each pixel $x'_t$ in the image follows a random distribution: $$X'_t \leftarrow N(0, 1)$$ where $N$ represents a normal distribution with an expected value of 0 and a variance of 1; inputting the generated virtual image $X'_t$ into the locally trained neural network model $\varphi_t$ of device $t$ to obtain the virtual gradient $\nabla W'_t$; before the local gradient data $\nabla W_t$ sent by the attacked client reaches the server, initiating a simulated DLG attack at the client, calculating the distance between the generated virtual gradient $\nabla W'_t$ and the original gradient $\nabla W_t$, and continuously adjusting the virtually generated random data $X'_t$ and virtual label $Y'_t$ to minimize this distance: $$X_t^{\prime *},\, Y_t^{\prime *} = \arg\min_{X'_t,\, Y'_t} \left\| \nabla W'_t - \nabla W_t \right\|^2$$ where $X_t^{\prime *}$ and $Y_t^{\prime *}$ are the optimal reconstruction values of $X'_t$ and $Y'_t$ that minimize the distance between the virtual gradient and the original gradient after iterative calculation, and are the final reconstruction result; calculating the peak signal-to-noise ratio (PSNR) between the image reconstructed by the simulated attack and the original image as an indicator of their similarity, with the calculation formula: $$PSNR = 10 \cdot \log_{10}\left(\frac{MAX_I^2}{MSE}\right), \qquad MSE = \frac{1}{B}\sum_{i=1}^{B}\left(x_i - \hat{x}_i\right)^2$$ where MSE is the mean square error between the two images, $x_i$ and $\hat{x}_i$ are the pixels of the original image and the reconstructed image, respectively, $B$ is the number of pixels, and $MAX_I$ is the maximum pixel value in the reconstructed image;
Adding privacy enhancement methods based on the privacy risks to enhance the federated learning client; specifically: superimposing Gaussian noise $x$ with a normal distribution on the gradient $\nabla W_t$ of a client at risk of leakage, the probability density function of the noise satisfying: $$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad \Delta f = \max \left\| \varphi_t(D_k) - \varphi_t(D_i) \right\|$$ where $\mu$ is the mean of the Gaussian noise, $\sigma^2$ is the noise variance, representing the noise level, $\varepsilon$ is the privacy budget, i.e., for adjacent datasets, the output error value of the local model $\varphi_t$ after adding noise does not exceed $e^{\varepsilon}$, $\delta$ represents the relaxation term of differential privacy, i.e., the probability that strict differential privacy is not satisfied, $\Delta f$ is the global model sensitivity, and $D_k$ and $D_i$ are adjacent datasets, i.e., two datasets that differ by only one bit; according to the federated learning model, selecting at the client a privacy budget $\varepsilon$ and a global sensitivity $\Delta f$ based on local data characteristics, and adding noise to the gradient data sent to the central server: $$\widetilde{\nabla W}_t = \nabla W_t + N(0, \sigma^2)$$ where $\widetilde{\nabla W}_t$ is the gradient after adding noise interference, and $N(0, \sigma^2)$ is Gaussian noise that satisfies the distribution requirements.
2. A graph federated learning privacy enhancement device for industrial terminal network traffic detection, characterized in that the device is based on the graph federated learning privacy enhancement method for industrial terminal network traffic detection according to claim 1 and includes: a configuration module, used to configure the federated learning environment, wherein the federated learning environment includes a dataset, a federated learning task, and a federated learning model; a construction module, used to construct a knowledge graph based on the classification learning results of the federated learning model; an evaluation module, used to assess the privacy risks of the federated learning process based on the federated learning environment; and an enhancement module, used to add privacy enhancement methods based on the privacy risks to enhance the federated learning client.
3. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the method of claim 1.
4. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method of claim 1.
Citation Information
Patent Citations
- Risk management and control method, system and device (CN110910041A)
- Federated learning privacy evaluation method under cross-domain heterogeneous scene (CN115952507A)