Encrypted traffic analysis method and device based on source socket interaction graph
By employing a source socket interaction graph-based approach and utilizing RNN and graph attention networks for encrypted traffic analysis, the problem of insufficient recognition performance in existing technologies is solved. This approach achieves highly accurate traffic recognition and fine-grained analysis, making it suitable for online real-time recognition.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NAT UNIV OF DEFENSE TECH
- Filing Date
- 2024-01-16
- Publication Date
- 2026-06-12
AI Technical Summary
Existing technologies are insufficient in identifying encrypted traffic, especially as the identification time window shortens, making it difficult to effectively distinguish traffic from different applications. Furthermore, the node connections in graph neural networks lack practical significance.
A source socket interaction graph-based approach is adopted to cluster network flows and establish temporally sequential connections. RNNs are used to extract long sequence features of network flow packets, and graph neural networks are constructed for training. Global features are generated through graph attention networks for traffic identification. Supervised and unsupervised same-source flow detection algorithms are designed, and fully connected layers are combined for classification.
It improves the accuracy of encrypted traffic analysis, can distinguish traffic from different applications, supports fine-grained analysis, provides the rationality and interpretability of flow interaction graphs, and is suitable for online real-time identification scenarios.
Figure CN117896147B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of encrypted traffic analysis technology, specifically relating to an encrypted traffic analysis method and device based on source socket interaction graphs.
Background Technology
[0002] Currently, most traffic analysis tasks utilize Graph Neural Networks (GNNs), with source application identification being a fundamental function. Numerous research achievements exist, such as FG-Net, Flowprint, and FS-Net. In the FG-Net model architecture, packet length sequences are used as the initial features for identification. Intra-flow features are extracted using neural networks, while inter-flow features are extracted using graph neural networks. Finally, graph classification is performed based on graph pooling operations. During packet parsing, FG-Net extracts packet length sequences and arrival times, and simultaneously clusters network flows. During graph construction, network flows within the same cluster are connected sequentially, while different clusters are interconnected through the first and last flows of each cluster.
[0003] The FG-Net model has the following drawbacks. First, node connections lack practical meaning, and the authors cannot explain the intrinsic meaning of the directed edges in the graph. Second, the authors perform traffic identification one generated traffic interaction graph at a time, but each interaction graph contains multiple network flows within the same time window; the authors therefore assume that all traffic in a time window originates from the same application, which is unreasonable in practical applications. Third, as the identification time window shortens, the number of nodes in the flow interaction graph decreases, leading to a decline in the learning ability of the GNN. In scenarios requiring rapid identification (online real-time identification), where only a few nodes exist in the flow interaction graph, the model degenerates into a feature extraction network consisting only of a CNN, significantly reducing identification performance.
Summary of the Invention
[0004] This invention provides a method and device for encrypted traffic analysis based on source socket interaction graphs, in order to solve the problem of insufficient recognition performance in existing technologies.
[0005] According to a first aspect of the present invention, one or more embodiments of this application provide a method for encrypted traffic analysis based on a source socket interaction graph, comprising the following steps:
[0006] Step 1: Cluster the network flows and extract the long sequence of network flow packets from each network flow;
[0007] Step 2: Establish connections representing the temporal order of network flows within each cluster, and treat the captured network flows as nodes;
[0008] Step 3: Use RNN to extract long sequence features of network stream packets, so that the similarity of long sequence features of network stream packets of different sockets is low, and the similarity of long sequence features of network stream packets of the same socket is high;
[0009] Step 4: Construct a graph neural network. Based on the similarity relationship of network stream packet long sequence features, establish interconnections between network streams across bursts to indicate whether they originate from the same socket code.
[0010] Step 5: Train the constructed graph neural network using a two-layer graph attention network, average pool the output node representations to generate global features that can represent graph information, and finally use a fully connected layer for traffic identification or classification.
[0011] Step two specifically includes the following steps:
[0012] Network flow nodeization: The captured network flow is treated as a node, with each node representing a network flow. For each network flow node, its relevant information is recorded.
[0013] Node relationship establishment: Establish the connections between nodes according to the chronological order in which the network flows are generated by different socket code in the source program;
[0014] Same-source network stream node fusion: If multiple network stream nodes have the same source socket within the same time slice, they are merged into one node;
[0015] Flow interaction graph generation: Construct a flow interaction graph based on the connections between nodes.
[0016] In step three, when using RNN to extract node features, supervised or unsupervised flow extraction is selected according to the labeling of the long sequence data of the network flow packets. When the labeling quality of this data is high or the labels are sufficient, supervised flow extraction is selected to extract same-source features. In unsupervised scenarios, unsupervised flow extraction is selected to extract same-source features.
[0017] Data annotation includes the following steps:
[0018] Instrument the device's Libc.so library so that the instrumented code is invoked every time the application calls the socket program;
[0019] Make the instrumented code output the process ID, thread ID, and stack space information;
[0020] Dynamically debug the software and obtain process information corresponding to traffic;
[0021] Match process information with traffic and automatically label the source socket name of the traffic.
[0022] Specifically, when supervised flow extraction is used to extract same-source flow features, the long sequence of network flow packets is used as the original feature for identification and transformed into flow embedding features, so that the similarity between traffic from the same source socket is higher than the similarity between traffic from different source sockets.
[0023] When it is impossible to obtain source socket data annotations by instrumenting the application device, a random walk method is used to extract co-occurring streams to determine whether the network streams are from the same source socket.
[0024] The extraction of co-occurring flows includes the following steps: when judging the similarity between different flows based on their random walk sequences, edit distance is used to find other nodes whose neighbor structures are similar to that of a specific node, and these nodes are classified into one category. Same-source socket network flows are thereby discovered in an unsupervised manner, in place of application instrumentation.
[0025] Step five includes the following steps:
[0026] First, node features are extracted. Then, the attention value between two nodes is calculated. Next, softmax is used to calculate the attention coefficient. Then, the features of neighboring nodes are aggregated. In the last layer of the graph attention network, pooling is performed on all nodes in the flow interaction graph. Finally, the prediction result is generated by maximum likelihood estimation.
[0027] Specifically, when the node classification task needs to be completed, the step of performing pooling operations on all nodes in the last layer of the graph attention network is removed.
[0028] According to another aspect of the present invention, an electronic device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the encrypted traffic analysis method based on the source socket interaction graph as described in any of the above.
[0029] The beneficial effects of this invention are that it provides a method and device for encrypted traffic analysis based on source socket interaction graphs, designs a novel modeling approach for graph domain features, assigns practical meaning to each edge and node, and associates them with the source program flow graph. This provides rationality and interpretability for the construction of the flow interaction graph. Two same-source flow discovery algorithms are designed, capable of discovering same-source socket flows under both supervised and unsupervised conditions. Simultaneously, the same-source flow discovery process can distinguish traffic from different applications, avoiding interference from different application traffic in the flow interaction graph. An RNN+GNN network model is designed, capable of traffic identification with high accuracy. Furthermore, this model supports node classification for more fine-grained analysis tasks.
Attached Figure Description
[0030] Figure 1 is a schematic diagram of the GNN traffic analysis process described in the background technology of the encrypted traffic analysis method and device based on a source socket interaction graph according to an embodiment of the present invention.
[0031] Figure 2 is a schematic diagram of the FG-Net task flow described in the background technology of the encrypted traffic analysis method and device based on a source socket interaction graph according to an embodiment of the present invention.
[0032] Figure 3 is a schematic diagram of the graph construction method of FG-Net described in the background technology of the encrypted traffic analysis method and device based on a source socket interaction graph according to an embodiment of the present invention.
[0033] Figure 4 is a schematic diagram of the overall architecture of the encrypted traffic analysis method and device based on a source socket interaction graph according to an embodiment of the present invention.
[0034] Figure 5 is a schematic diagram of the flow interaction graph of the encrypted traffic analysis method and device based on a source socket interaction graph according to an embodiment of the present invention.
[0035] Figure 6 is a schematic diagram of the label acquisition method of the encrypted traffic analysis method and device based on a source socket interaction graph according to an embodiment of the present invention.
[0036] Figure 7 is a schematic diagram of the unsupervised flow embedding method of the encrypted traffic analysis method and device based on a source socket interaction graph according to an embodiment of the present invention.
[0037] Figure 8 is a schematic diagram of the GAT and GCN training process of the encrypted traffic analysis method and device based on a source socket interaction graph according to an embodiment of the present invention.
[0038] Figure 9 is a schematic diagram of the GAT network architecture of the encrypted traffic analysis method and device based on a source socket interaction graph according to an embodiment of the present invention.
Detailed Implementation
[0039] To make the objectives, technical solutions, and advantages of this disclosure clearer, the following detailed description is provided in conjunction with specific embodiments and the accompanying drawings.
[0040] It should be noted that, unless otherwise defined, the technical or scientific terms used in one or more embodiments of this application should have the ordinary meaning understood by one of ordinary skill in the art to which this disclosure pertains. The terms "first," "second," and similar terms used in one or more embodiments of this application do not indicate any order, quantity, or importance, but are merely used to distinguish different components. Terms such as "comprising" or "including" mean that the element or object preceding the word encompasses the elements or objects listed following the word and their equivalents, without excluding other elements or objects. Terms such as "connected" or "linked" are not limited to physical or mechanical connections, but can include electrical connections, whether direct or indirect. Terms such as "upper," "lower," "left," and "right" are used only to indicate relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may also change accordingly.
[0041] As shown in Figures 1-9, one or more embodiments of this application disclose an encrypted traffic analysis method and device based on a source socket interaction graph, comprising the following steps:
[0042] Step 1: Cluster the network flows and extract the long sequence of network flow packets from each network flow;
[0043] Step 2: Establish connections representing the temporal order of network flows within each cluster, and treat the captured network flows as nodes;
[0044] Step 3: Use RNN to extract long sequence features of network stream packets, so that the similarity of long sequence features of network stream packets of different sockets is low, and the similarity of long sequence features of network stream packets of the same socket is high;
[0045] Step 4: Construct a graph neural network. Based on the similarity relationship of network stream packet long sequence features, establish interconnections between network streams across bursts to indicate whether "different network streams" or "different nodes" originate from the same socket code.
[0046] Step 5: Train the constructed graph neural network using a two-layer graph attention network, average pool the output node representations to generate global features that can represent graph information, and finally use a fully connected layer for traffic identification or classification.
[0047] Referring to Figure 4, the method used in this embodiment is an improvement on the overall task flow of FG-Net. A newly designed flow interaction graph replaces the original graph construction method, giving the edges and nodes practical meaning (they depict the control flow graph of the source application). At the same time, an RNN is used instead of a CNN for node feature extraction, which improves the recognition accuracy.
[0048] This flow interaction graph can characterize network traffic in a latent space, namely the control flow graph of the source program. As shown in Figure 5, within a given time slice, the captured network flows are generally generated by different socket code within the source program. When these network flows are treated as nodes, their connections reflect the logical relationships between parts of the socket code in the source program's control flow graph. Merging the same-origin network flows across different time slices to generate a flow interaction graph yields an approximate representation of the source program's control flow graph.
[0049] Specifically, the following steps are included:
[0050] Network flow nodeization: Captured network flows are treated as nodes, with each node representing a network flow. For each network flow node, its relevant information is recorded, such as source IP address, destination IP address, source port, and destination port.
[0051] Node relationship establishment: Connections are established between nodes according to the chronological order in which the network flows are generated by different socket code in the source program. That is, network flow nodes are connected sequentially in chronological order.
[0052] Same-origin network flow node merging: If multiple network flow nodes have the same source socket within the same time slice, they are merged into one node. The merged node retains the control flow relationships from its original program.
[0053] Flow interaction graph generation: A flow interaction graph is constructed based on the connections between nodes. The flow interaction graph reflects the logical relationships of some socket code in the source program's control flow graph and can serve as an approximate representation of the source program.
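The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the `Flow` record, its `socket_id` field (standing in for the source socket label, however it was obtained), the `slice_len` parameter, and the choice to identify a merged node by its `socket_id` across time slices are all hypothetical.

```python
from collections import namedtuple

# Hypothetical flow record. 'socket_id' stands in for the source-socket label
# (from instrumentation or from unsupervised grouping); it is an assumption.
Flow = namedtuple("Flow", "start_time src_ip src_port dst_ip dst_port socket_id")

def build_flow_interaction_graph(flows, slice_len=5.0):
    """Return (nodes, edges) of a flow interaction graph.

    nodes: set of socket_id values (same-source flows are merged into one node).
    edges: set of directed (earlier, later) pairs observed inside a time slice.
    """
    nodes, edges = set(), set()
    slices = {}
    # Nodeization: every captured flow is a candidate node, bucketed by time slice.
    for f in sorted(flows, key=lambda f: f.start_time):
        slices.setdefault(int(f.start_time // slice_len), []).append(f)
    for _, members in sorted(slices.items()):
        # Same-source merging: keep one node per source socket, in first-seen order.
        order = []
        for f in members:
            if f.socket_id not in order:
                order.append(f.socket_id)
        nodes.update(order)
        # Relationship establishment: connect nodes sequentially in time order.
        edges.update(zip(order, order[1:]))
    return nodes, edges

flows = [
    Flow(0.1, "10.0.0.2", 40001, "1.1.1.1", 443, "sock_a"),
    Flow(0.4, "10.0.0.2", 40002, "1.1.1.2", 443, "sock_b"),
    Flow(0.9, "10.0.0.2", 40003, "1.1.1.1", 443, "sock_a"),
    Flow(6.2, "10.0.0.2", 40004, "1.1.1.2", 443, "sock_b"),
    Flow(6.5, "10.0.0.2", 40005, "1.1.1.3", 443, "sock_c"),
]
nodes, edges = build_flow_interaction_graph(flows)
```

In this toy input the two `sock_a` flows in the first slice collapse into one node, and the edges accumulated over both slices form the chain `sock_a` to `sock_b` to `sock_c`.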
[0054] When performing source stream extraction, this method selects between supervised and unsupervised embedding algorithms based on the data annotation. When the data annotation quality is high and sufficient, supervised stream extraction can effectively extract features. In unsupervised scenarios, unsupervised stream extraction can provide approximate stream features. Supervised stream extraction algorithms require source socket labels for each network stream in the dataset for training. This embodiment provides an automated method for obtaining source socket labels and a corresponding training algorithm.
[0055] As shown in Figure 6, under supervised conditions, the labels of traffic source sockets are obtainable. The method for automated socket labeling is as follows:
[0056] Instrument the device's Libc.so library so that the instrumented code is invoked every time the application calls the socket program. Ensure the instrumented code outputs process ID, thread ID, and stack space information. Perform dynamic software debugging to obtain process information corresponding to traffic. Match process information with traffic and automatically label the source socket name of the traffic.
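The final matching step could look roughly like the sketch below. The text does not specify how process information is matched to traffic, so everything here is an assumption made for illustration: the record fields (`local_port`, `time`, `pid`, `tid`, `stack`), the use of the local port plus a time tolerance `max_skew` as the matching key, and the `(pid, stack)` label format.

```python
def label_flows(flows, hook_records, max_skew=1.0):
    """Attach a source-socket label to each captured flow.

    flows: list of dicts with 'src_port' and 'start_time' (hypothetical fields).
    hook_records: list of dicts emitted by the instrumented code, with
        'local_port', 'time', 'pid', 'tid', 'stack' (hypothetical fields).
    Returns a list of (flow, label) pairs; label is None when nothing matches.
    """
    labelled = []
    for flow in flows:
        # Candidate records: same local port, close enough in time.
        candidates = [
            r for r in hook_records
            if r["local_port"] == flow["src_port"]
            and abs(r["time"] - flow["start_time"]) <= max_skew
        ]
        if candidates:
            best = min(candidates, key=lambda r: abs(r["time"] - flow["start_time"]))
            label = (best["pid"], best["stack"])
        else:
            label = None
        labelled.append((flow, label))
    return labelled

hook_records = [
    {"time": 10.00, "pid": 4321, "tid": 4330, "local_port": 40001, "stack": "login_request"},
    {"time": 10.20, "pid": 4321, "tid": 4331, "local_port": 40002, "stack": "image_fetch"},
]
flows = [
    {"src_port": 40001, "start_time": 10.05},
    {"src_port": 40002, "start_time": 10.21},
    {"src_port": 40009, "start_time": 11.00},
]
labels = [label for _, label in label_flows(flows, hook_records)]
```

Flows with no matching record keep a `None` label, so they can be routed to the unsupervised path instead of being mislabeled.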
[0057] For supervised flow extraction:
[0058] This method transforms the original long sequence of stream packets into stream embedding features, so that the similarity between traffic from the same socket is higher than the similarity between traffic from different sockets. This lays the foundation for subsequent cross-burst node connections. The method uses a recurrent neural network for training, and the specific process is as follows:
[0059] If the original stream features (packet length sequence) are v, and the generated embedding sequence is u, then the function for stream embedding of the original stream using an RNN can be described as:
[0060] f(v)=RNN(v)=u
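As a rough illustration of f(v)=RNN(v)=u, the sketch below runs a plain Elman-style recurrent cell over a packet length sequence and returns the final hidden state as the embedding. The cell type, the hidden size, the random untrained weights, the `scale` normalization, and the use of the last hidden state as u are all assumptions; the text only states that an RNN maps v to u.

```python
import math
import random

def make_rnn(hidden_size, seed=0):
    """Create untrained weights for a single Elman-style recurrent cell."""
    rng = random.Random(seed)
    w_in = [rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
           for _ in range(hidden_size)]
    bias = [0.0] * hidden_size
    return w_in, w_h, bias

def rnn_embed(v, params, scale=1500.0):
    """f(v) = RNN(v) = u, with u taken as the final hidden state (assumption)."""
    w_in, w_h, bias = params
    h = [0.0] * len(bias)
    for length in v:
        x = length / scale  # normalize the packet length to roughly [0, 1]
        h = [
            math.tanh(w_in[i] * x
                      + sum(w_h[i][j] * h[j] for j in range(len(h)))
                      + bias[i])
            for i in range(len(h))
        ]
    return h

params = make_rnn(hidden_size=4)
u = rnn_embed([517, 1460, 1460, 120, 64], params)
```

A trained model would learn these weights from the objective described in the following paragraphs; here they are fixed by the seed so the example is reproducible.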
[0061] Stream embedding can be used to calculate the similarity between two streams, and the similarity function can be expressed as:
[0062]
[0063] The similarity between each pair of flows within the same category should be higher. The similarity between all flows belonging to the same category can be represented as:
[0064]
[0065] The objective function is modeled as the negative logarithm of the overall similarity:
[0066]
[0067] Where #(v1,v2) represents the number of times the traffic pair (v1,v2) appears in a certain source socket class, and I represents the class.
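The formulas for the similarity function and the objective are not reproduced in this text, so the sketch below shows only one plausible form that matches the description. It assumes the similarity is the sigmoid of the dot product of two embeddings, and that the objective sums, over every class I, the pair count #(v1,v2) times the negative log similarity. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def similarity(u1, u2):
    """Assumed similarity: sigmoid of the dot product of two flow embeddings."""
    dot = sum(a * b for a, b in zip(u1, u2))
    return 1.0 / (1.0 + math.exp(-dot))

def same_source_loss(embeddings, classes):
    """Negative log of the overall within-class similarity (assumed form).

    embeddings: dict mapping flow id -> embedding vector u.
    classes: dict mapping source-socket class I -> list of flow ids.
    """
    loss = 0.0
    for members in classes.values():
        # #(v1, v2): how often the pair occurs inside this class.
        pair_counts = Counter(
            tuple(sorted(p)) for p in combinations(members, 2) if p[0] != p[1]
        )
        for (v1, v2), count in pair_counts.items():
            loss -= count * math.log(similarity(embeddings[v1], embeddings[v2]))
    return loss

embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [-1.0, 0.0]}
aligned = same_source_loss(embeddings, {"sock_1": ["a", "b"]})
opposed = same_source_loss(embeddings, {"sock_1": ["a", "c"]})
```

Under this assumed form, a class whose members have aligned embeddings (`aligned`) incurs a smaller loss than one whose members point in opposite directions (`opposed`), which is the behavior the text asks of the training objective.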
[0068] For unsupervised flow extraction:
[0069] When it is technically impossible to instrument application devices to obtain source socket labels, unsupervised algorithms can be used to discover co-origin flows. This primarily uses a random walk algorithm to extract co-occurring flows and determine whether they share the same source socket. The main basis for this is that the random walk sequences of traffic from co-origin sockets are similar. The process architecture is shown in Figure 7.
[0070] The co-occurrence flow extraction process is as follows: Input a flow interaction graph G(V,ε), a random walk path length T, and an iteration layer r. The final result is a set containing the random walk sequences corresponding to all nodes in the graph. Each node's random walk sequence is essentially a representation of its neighbor structure. We believe that nodes with similar neighbor graph structures in the internal connection graph of each burst are more likely to belong to the same source socket class. Therefore, we use a random walk generation algorithm to extract the neighbor structure of each node.
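The walk generation step can be sketched as follows. The adjacency-list input, the uniform choice of the next neighbor, and the reading of r as the number of walks started from each node are assumptions; the text only names the inputs G, T, and r and the output set of walk sequences.

```python
import random

def random_walks(adjacency, walk_length, rounds, seed=0):
    """Return {node: [walk, ...]} with `rounds` walks of at most `walk_length` nodes.

    adjacency: dict mapping each node to a list of its successor nodes.
    """
    rng = random.Random(seed)
    walks = {node: [] for node in adjacency}
    for _ in range(rounds):
        for start in adjacency:
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adjacency[walk[-1]]
                if not neighbours:
                    break  # dead end: the walk stops early
                walk.append(rng.choice(neighbours))
            walks[start].append(walk)
    return walks

graph = {"f1": ["f2"], "f2": ["f3"], "f3": ["f1", "f4"], "f4": []}
walks = random_walks(graph, walk_length=4, rounds=2)
```

Each node ends up with `rounds` walks, and each walk is a short description of what that node's neighborhood looks like when followed along the directed edges.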
[0071] To determine the similarity between different flows based on random walk sequences, edit distance is used as a similarity measure. Edit distance, also known as Levenshtein distance, is a metric for measuring the degree of difference between two strings. It is defined as the minimum number of operations required to transform one string into another by inserting, deleting, or replacing characters.
[0072] Specifically, given two strings A and B, as an example, the following three operations can be used to convert A to B:
[0073] Insert: Insert a character into A.
[0074] Delete: Delete a character from A.
[0075] Replace: Replace one character in A with another character.
[0076] Edit distance can be calculated using dynamic programming. It identifies other nodes with similar neighbor structures to a specific node and groups them into the same category. Ultimately, this unsupervised method replaces application instrumentation to discover network flows from the same socket.
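The dynamic programming computation and the grouping step can be sketched as below. The Levenshtein routine is the standard algorithm; the greedy grouping against the first member of each group, the `threshold` parameter, and the idea of comparing one walk "signature" per node are assumptions added for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences via dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # replace x with y (free if equal)
            ))
        prev = curr
    return prev[-1]

def group_by_walk_similarity(signatures, threshold):
    """Greedily group nodes whose walk signatures are within `threshold` edits."""
    groups = []
    for node, signature in signatures.items():
        for group in groups:
            if edit_distance(signature, signatures[group[0]]) <= threshold:
                group.append(node)
                break
        else:
            groups.append([node])
    return groups

# Hypothetical walk signatures: one symbol per visited neighbor.
signatures = {
    "f1": ["A", "B", "C", "D"],
    "f2": ["A", "B", "C", "E"],
    "f3": ["X", "Y", "Z", "W"],
}
groups = group_by_walk_similarity(signatures, threshold=1)
```

Here `f1` and `f2` differ by a single replacement and fall into one group, while `f3` shares nothing with them and forms its own group.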
[0077] In unsupervised stream embedding, the original stream will be classified by the co-occurrence stream, and its objective function modeling remains unchanged:
[0078]
[0079] Referring to Figure 8, GAT (Graph Attention Network) is a commonly used graph neural network model for processing graph-structured data, built on an attention mechanism. It captures the relationships between nodes by learning attention weights and then aggregates node features as a sum weighted by them. For each node in the graph, GAT computes the attention weights from learnable parameters and applies them to the features of that node's neighbors. This allows GAT to attend to the neighbors of each node to different degrees, thus better capturing the connections between nodes. The GAT model has good expressive power on graph-structured data and is widely used in tasks such as node classification and graph classification.
[0080] The core formula of GAT: in the l-th GAT layer, the updated representation $H_i^{(l+1)}$ of node i can be calculated from the layer-l representations by the following formulas:
[0081] $$H_i^{(l+1)} = \sigma\Big(\sum_{j \in N(i)} \alpha_{ij}^{(l)} W^{(l)} H_j^{(l)}\Big)$$
[0082] $$\alpha_{ij}^{(l)} = \mathrm{softmax}_j\Big(\mathrm{LeakyReLU}\big(\alpha^{(l)\top}\big(W^{(l)} H_i^{(l)} \,\|\, W^{(l)} H_j^{(l)}\big)\big)\Big)$$
[0083] Where $N(i)$ represents the set of neighboring nodes of node i, $H_j^{(l)}$ represents the representation of neighbor node j at layer l, $W^{(l)}$ is the weight matrix of layer l, $\alpha^{(l)}$ (without subscripts) is the learned attention weight vector, $\sigma$ represents the activation function (usually ReLU), LeakyReLU is the leaky rectified linear unit, and $\|$ represents the feature concatenation operation. The attention weights $\alpha_{ij}^{(l)}$ weight the features of neighboring nodes to obtain the new representation $H_i^{(l+1)}$ of node i.
[0084] The GAT network architecture used in this embodiment is shown in Figure 9. For each stream embedding $h_i^{(0)}$, we perform a series of operations at each layer of the network. First, we extract node features:
[0085] $$z_i^{(l)} = W^{(l)} h_i^{(l)} + b^{(l)}$$
[0086] Next, we calculate the attention value between the two nodes:
[0087] $$e_{ij}^{(l)} = \mathrm{ReLU}\big(\alpha^{(l)}\big(z_i^{(l)} \,\|\, z_j^{(l)}\big)\big)$$
[0088] Next, softmax is used to calculate the attention coefficient:
[0089]
[0090] Then, aggregate the features of neighboring nodes:
[0091]
[0092] In the last layer of the network, pooling is performed on all nodes in the graph:
[0093]
[0094] Finally, the prediction results are generated through maximum likelihood estimation:
[0095]
[0096] To complete the node classification task, you only need to remove the pooling operation for all nodes in the last layer of the network.
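The layer operations described above, including the switch between graph classification and node classification, can be sketched in plain Python. Several formulas are not reproduced in this text, so the forms used here are assumptions that follow the standard GAT formulation: softmax over each node's neighbors, ReLU as the aggregation activation, mean pooling (matching the average pooling named in Step 5), a single attention head, self-loops in the neighbor lists, and a softmax output layer whose training by cross-entropy corresponds to maximum likelihood estimation. All function and parameter names are hypothetical.

```python
import math

def matvec(w, x, b):
    """Compute W x + b for a list-of-rows matrix W."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def softmax(x):
    top = max(x)
    exps = [math.exp(v - top) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def gat_layer(h, neighbours, w, b, a):
    """z = W h + b; e_ij = ReLU(a . (z_i || z_j)); softmax over j; aggregate."""
    z = {i: matvec(w, hi, b) for i, hi in h.items()}
    out = {}
    for i in h:
        nbrs = neighbours[i]
        scores = [max(0.0, sum(ak * zk for ak, zk in zip(a, z[i] + z[j])))
                  for j in nbrs]
        alpha = softmax(scores)  # attention coefficients of node i
        agg = [sum(al * z[j][k] for al, j in zip(alpha, nbrs))
               for k in range(len(b))]
        out[i] = [max(0.0, x) for x in agg]  # activation assumed to be ReLU
    return out

def mean_pool(h):
    """Average all node representations into one graph-level vector."""
    dim = len(next(iter(h.values())))
    return [sum(v[k] for v in h.values()) / len(h) for k in range(dim)]

def classify(h0, neighbours, layers, w_out, b_out, pool=True):
    """Graph-level class probabilities, or per-node ones when pool=False."""
    h = h0
    for w, b, a in layers:
        h = gat_layer(h, neighbours, w, b, a)
    if pool:
        return softmax(matvec(w_out, mean_pool(h), b_out))
    return {i: softmax(matvec(w_out, hi, b_out)) for i, hi in h.items()}

h0 = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbours = {0: [0, 1], 1: [1, 0, 2], 2: [2, 1]}  # self-loops included
identity, zeros = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
layers = [(identity, zeros, [0.5] * 4)] * 2        # two attention layers
graph_probs = classify(h0, neighbours, layers, identity, zeros)
node_probs = classify(h0, neighbours, layers, identity, zeros, pool=False)
```

With `pool=True` the sketch returns one probability vector for the whole flow interaction graph; with `pool=False` the pooling step is skipped and each node receives its own prediction, which corresponds to the node classification variant.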
[0097] Comparison Results: This embodiment demonstrates superior performance compared to existing traffic identification models such as FS-Net and FG-Net, based on existing methods for identifying encrypted traffic. Comparison data is shown in the table below:
[0098]
| Identification Time Window | Method of This Embodiment | FG-Net | FS-Net |
| --- | --- | --- | --- |
| 60 s | 0.962996 | 0.931548 | 0.931548 |
| 30 s | 0.927232 | 0.919643 | 0.931548 |
| 10 s | 0.916535 | 0.878157 | 0.931548 |
| 5 s | 0.926967 | 0.786830 | 0.931548 |
[0099] In practical applications, the method removes FG-Net's assumption that all traffic within one identification window originates from the same application, making it suitable for a wider range of scenarios.
[0100] In another embodiment, an electronic device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements the encrypted traffic analysis method based on the source socket interaction graph as described in any of the above embodiments.
[0101] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0102] This invention is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce a means for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0103] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0104] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0105] Although preferred embodiments of the invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including both the preferred embodiments and all changes and modifications falling within the scope of the invention.
[0106] Obviously, those skilled in the art can make various modifications and variations to this invention without departing from its spirit and scope. Therefore, if these modifications and variations fall within the scope of the claims of this invention and their equivalents, this invention also intends to include these modifications and variations.
Claims
1. A method for encrypted traffic analysis based on source socket interaction graphs, characterized in that it includes the following steps: Step 1: Cluster the network flows and extract the long sequence of network flow packets from each network flow; Step 2: Establish connections representing the temporal order of network flows within each cluster, and treat the captured network flows as nodes; Step 3: Use RNN to extract long sequence features of network stream packets, so that the similarity of long sequence features of network stream packets of different sockets is low, and the similarity of long sequence features of network stream packets of the same socket is high; when using RNN to extract node features, supervised or unsupervised flow extraction is selected based on the labeling of the long sequence data of the network flow packets; when the labeling quality of the long sequence data of the network flow packets is high or the labeling is sufficient, supervised flow extraction is selected for the extraction of same-source features; in unsupervised scenarios, unsupervised flow extraction is selected for the extraction of same-source features; Step 4: Construct a graph neural network; based on the similarity relationship of network stream packet long sequence features, establish interconnections between network streams across bursts to indicate whether they originate from the same socket code; Step 5: Train the constructed graph neural network using a two-layer graph attention network, average pool the output node representations to generate global features that can represent graph information, and finally use a fully connected layer for traffic identification or classification.
2. The encrypted traffic analysis method based on source socket interaction graph as described in claim 1, characterized in that Step two specifically includes the following steps: Network flow nodeization: The captured network flow is treated as a node, with each node representing a network flow. For each network flow node, its relevant information is recorded. Node relationship establishment: Establish the connections between nodes according to the chronological order in which the network flows are generated by different socket code in the source program; Same-source network stream node fusion: If multiple network stream nodes have the same source socket within the same time slice, they are merged into one node; Flow interaction graph generation: Construct a flow interaction graph based on the connections between nodes.
3. The encrypted traffic analysis method based on source socket interaction graph as described in claim 2, characterized in that the annotation of network stream packet long sequence data includes the following steps: Instrument the device's Libc.so library so that the instrumented code is invoked every time the application calls the socket program; Make the instrumented code output the process ID, thread ID, and stack space information; Dynamically debug the software and obtain process information corresponding to traffic; Match process information with traffic and automatically label the source socket name of the traffic.
4. The encrypted traffic analysis method based on source socket interaction graph as described in claim 3, characterized in that, When using supervised flow methods to extract features from the same source flow, the long sequence of network flow packets is used as the original feature for identification and transformed into flow embedding features, so that the similarity of the same source socket traffic is higher than that of the non-same source socket traffic.
5. The encrypted traffic analysis method based on source socket interaction graph as described in claim 4, characterized in that, When it is impossible to obtain source socket data annotations by instrumenting the application device, a random walk method is used to extract co-occurring streams to determine whether the network streams are from the same source socket.
6. The encrypted traffic analysis method based on source socket interaction graph as described in claim 5, characterized in that, The extraction of co-occurring flows includes the following steps: when judging the similarity between the long sequence features of different network flow packets based on the random walk sequence, the edit distance is used to find other nodes with similar neighbor structures to a specific node, and they are classified into one category. Then, the discovery of co-originating socket network flows is achieved by replacing instrumentation with an unsupervised approach.
7. The encrypted traffic analysis method based on source socket interaction graph as described in claim 2, characterized in that, Step five includes the following steps: First, node features are extracted. Then, the attention value between two nodes is calculated. Next, softmax is used to calculate the attention coefficient. Then, the features of neighboring nodes are aggregated. In the last layer of the graph attention network, pooling is performed on all nodes in the flow interaction graph. Finally, the prediction result is generated by maximum likelihood estimation.
8. The encrypted traffic analysis method based on source socket interaction graph as described in claim 7, characterized in that, When a node classification task needs to be completed, the step of performing pooling operations on all nodes in the last layer of the graph attention network is removed.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the program, it implements the encrypted traffic analysis method based on the source socket interaction graph as described in any one of claims 1 to 8.