A method for detecting IoT botnets based on multimodal fusion

CN116614248B · Active · Publication Date: 2026-06-30 · ZHEJIANG UNIV OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ZHEJIANG UNIV OF TECH
Filing Date
2023-03-31
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

In existing technologies, deep learning-based botnet detection methods are mainly limited to a single modality and fail to fully consider traffic content and its contextual relationships, resulting in weak model generalization ability and difficulty in effectively detecting unknown botnets.

Method used

A multimodal fusion approach is adopted, combining convolutional neural networks (CNN), long short-term memory networks (LSTM), and graph convolutional neural networks (GCN). By fusing spatiotemporal features and analyzing traffic behavior, a botnet detection model is constructed. By utilizing the spatial and temporal features of traffic data, the detection accuracy and generalization ability are improved.

Benefits of technology

It improves the accuracy of botnet detection and the generalization ability of the model, effectively identifying unknown botnets and enhancing the detection capabilities of IoT network security.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a method for detecting IoT botnets based on multimodal fusion, comprising: capturing traffic data and preprocessing it; extracting valid data from the traffic data; inputting the valid data into a CNN model to obtain an output matrix output11; inputting the valid data into an LSTM model to obtain an output matrix output12; obtaining an output matrix output1 based on output11 and output12; extracting feature parameters from the traffic data to form a traffic communication behavior sequence; dividing the traffic communication behavior sequence based on dynamic bias to obtain several sub-sequences; constructing a communication behavior graph corresponding to each sub-sequence based on the feature parameters; inputting the communication behavior graph into a GCN model to obtain an output matrix output2; fusing output1 and output2 for classification to obtain the network detection result. This invention has high detection capability.

Description

Technical Field

[0001] This invention belongs to the fields of network security and machine learning technology, and specifically relates to an IoT botnet detection method based on multimodal fusion.

Background Technology

[0002] The Internet of Things (IoT) is hailed as the third information technology revolution after computers and the internet, and is considered the foundation for smart cities, smart homes, smart transportation, and smart healthcare. The goal of the IoT is to connect previously unconnected devices to the internet, creating intelligent devices capable of collecting, storing, and sharing data. The sheer number of IoT devices and their inherent vulnerabilities have made the IoT a prime target for attackers.

[0003] A botnet consists of a large number of networked devices controlled by attackers. These controlled devices are called bots. A bot master can infect devices without the device owner's knowledge, turning them into bots. Then, through a pre-set command and control (C&C) channel, the bot master distributes commands and manipulates the bots to perform malicious activities such as DDoS attacks, cryptocurrency mining, malware distribution, spam, phishing attacks, and theft of personal information. Therefore, research on the detection and identification of botnets is currently a hot topic and key issue in cybersecurity.

[0004] Network traffic-based detection methods are currently a commonly used approach, detecting botnets by analyzing the characteristics of network traffic. This method primarily relies on machine learning, building a botnet identification model based on the relationship between sample features and labels in the collected data. With the rapid development of artificial intelligence, deep learning has become a hot research area and is increasingly being applied to network traffic detection methods. However, current research on botnets using deep learning algorithms is mostly limited to single-modal detection, using only a single model from convolutional neural networks or recurrent neural networks, without comprehensively considering the characteristics of the traffic content and contextual relationships. Furthermore, the limited coverage of training datasets results in weak model generalization ability, leading to poor detection capabilities when faced with unfamiliar botnet programs.

[0005] To address these issues, there is an urgent need to jointly consider traffic content and its contextual relationships, to improve the generalization ability of the model, and to increase its coverage of different botnets.

Summary of the Invention

[0006] The purpose of this invention is to provide an IoT botnet detection method based on multimodal fusion, which has high detection capability.

[0007] To achieve the above objectives, the technical solution adopted by the present invention is as follows:

[0008] A method for detecting IoT botnets based on multimodal fusion, the method comprising:

[0009] Step 1: Capture traffic data and preprocess it;

[0010] Step 2: Botnet detection based on spatiotemporal feature fusion;

[0011] Step 2.1: Extract valid data from the traffic data;

[0012] Step 2.2: Input the valid data into the CNN model to obtain the output matrix output11;

[0013] Step 2.3: Input the valid data into the LSTM model to obtain the output matrix output12;

[0014] Step 2.4: Obtain the output matrix output1 based on the output matrix output11 and the output matrix output12;

[0015] Step 3: Botnet detection based on traffic behavior;

[0016] Step 3.1: Extract feature parameters from traffic data to form a traffic communication behavior sequence;

[0017] Step 3.2: Divide the traffic communication behavior sequence based on dynamic bias to obtain several sub-sequences;

[0018] Step 3.3: Construct the communication behavior graph corresponding to each subsequence based on the feature parameters;

[0019] Step 3.4: Input the communication behavior graph into the GCN model to obtain the output matrix output2;

[0020] Step 4: Combine the output matrix output1 and output matrix output2 for classification to obtain the network detection results.

[0021] Several alternative methods are provided below, but they are not intended as additional limitations on the overall solution above. They are merely further additions or optimizations. Provided there are no technical or logical contradictions, each alternative method can be combined individually with respect to the overall solution above, or multiple alternative methods can be combined with each other.

[0022] Preferably, the capture and preprocessing of traffic data includes:

[0023] Use the Tcpdump tool to capture traffic data and save it in pcap file format;

[0024] Traffic data cleaning, filtering out useless TCP connection packets;

[0025] The cleaned traffic data is segmented by session, using the five-tuple of traffic data as the segmentation standard, resulting in a pcap format file for each session.

[0026] Preferably, the extracted valid data from the traffic data includes:

[0027] Extract the pcap format file for each session from the preprocessed traffic data;

[0028] Extract 600 bytes of content sequentially, and remove the first 24 bytes to obtain 576 bytes of valid data.

[0029] Preferably, the step of obtaining output matrix output1 based on output matrix output11 and output matrix output12 includes:

[0030] The output matrices output11 and output12 are concatenated to form output13, and output13 is then input into the fully connected layer to obtain output1.

[0031] Preferably, the process of dividing the traffic communication behavior sequence based on dynamic bias to obtain several sub-sequences includes:

[0032] a) Set the initial value and increment step of the dynamic bias;

[0033] b) The traffic communication behavior sequence is divided into several sub-sequences based on the initial value of the dynamic bias;

[0034] c) Add the increment step to the dynamic bias, and determine whether the accumulated dynamic bias exceeds the bias threshold. If it does, end the data partitioning and collect the subsequences obtained from every partition as the final set of subsequences. Otherwise, proceed to the next step.

[0035] d) Divide the traffic communication behavior sequence into several sub-sequences based on the accumulated dynamic bias, and return to step c) to continue execution.

[0036] Preferably, the step of constructing a communication behavior graph corresponding to each subsequence based on feature parameters includes:

[0037] If a subsequence contains K traffic data points, and each traffic data point has M-dimensional node features, then the node feature matrix is defined as $X \in \mathbb{R}^{K \times M}$, the adjacency matrix is defined as $A \in \mathbb{R}^{K \times K}$, and the communication behavior graph is mathematically represented as (X, A).

[0038] Preferably, the fusion of the output matrix output1 and the output matrix output2 for classification to obtain the network detection result includes:

[0039] The output matrix output1 and the output matrix output2 are concatenated to obtain the output matrix output3;

[0040] Input the output matrix output3 into the fully connected layer to obtain the output matrix output. Then, input the output matrix output into the classifier to obtain the network detection results.

[0041] The present invention provides an IoT botnet detection method based on multimodal fusion, which has the following advantages compared with the prior art:

[0042] 1. By fully utilizing the spatial and temporal characteristics of traffic and comprehensively considering the relationship between traffic content and its context, the detection accuracy can be effectively improved.

[0043] 2. It incorporates botnet communication behavior patterns to improve the model's generalization ability, enabling it to maintain high detection accuracy when facing unknown botnets.

Attached Figure Description

[0044] Figure 1 is a flowchart of the overall process of the IoT botnet detection method based on multimodal fusion according to the present invention;

[0045] Figure 2 is a network structure diagram of the botnet detection method based on spatiotemporal feature fusion according to the present invention;

[0046] Figure 3 is a schematic diagram of existing window technology;

[0047] Figure 4 is a schematic diagram of node transformation in the present invention;

[0048] Figure 5 is a schematic diagram of the multimodal fusion classification process of the present invention.

Detailed Implementation

[0049] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0050] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to limit the invention.

[0051] As shown in Figure 1, this embodiment provides a method for detecting IoT botnets based on multimodal fusion, including the following steps:

[0052] Step 1: Capture traffic data and preprocess it.

[0053] Step 1.1: Use the Tcpdump tool to capture traffic data and save it to the disk in pcap file format to obtain the raw traffic file.

[0054] Step 1.2: Traffic data cleaning, removing useless TCP connection data packets. In this embodiment, useless TCP connection data packets include retransmitted data, duplicate ACK data, out-of-order data, etc.

[0055] Step 1.3: Segment the cleaned traffic by session to obtain a pcap file for each session, for use in subsequent steps. A session is conventionally defined as the data transmission between a client and a server over one connection: it begins when the connection is established, ends when the connection is closed, and includes all network traffic exchanged in between. The traffic data captured by the Tcpdump tool is saved as packets, and each packet carries the network traffic five-tuple {srcIP, srcPort, Protocol, desIP, desPort}. Packets of the same session share the same five-tuple; that is, for the five-tuple {srcIP_x, srcPort_x, Protocol_x, desIP_x, desPort_x} of the x-th packet and the five-tuple {srcIP_y, srcPort_y, Protocol_y, desIP_y, desPort_y} of the y-th packet, srcIP_x = desIP_y, srcPort_x = desPort_y, and Protocol_x = Protocol_y. Therefore, this embodiment uses the five-tuple as the segmentation standard and uses the SplitCap tool for segmentation.
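The five-tuple rule above can be illustrated with a short sketch. It is not the segmentation tool used in the embodiment; the function names `session_key` and `group_by_session` and the tuple layout are assumptions made for this example. It treats a packet and its reverse-direction reply as one session by ordering the two endpoints before building the key.

```python
from collections import defaultdict

def session_key(src_ip, src_port, protocol, dst_ip, dst_port):
    """Return a direction-independent key for a packet's five-tuple."""
    endpoint_a = (src_ip, src_port)
    endpoint_b = (dst_ip, dst_port)
    # Sort the endpoints so that A->B and B->A map to the same key.
    low, high = sorted([endpoint_a, endpoint_b])
    return (low, high, protocol)

def group_by_session(packets):
    """Group (src_ip, src_port, protocol, dst_ip, dst_port) tuples by session."""
    sessions = defaultdict(list)
    for pkt in packets:
        sessions[session_key(*pkt)].append(pkt)
    return dict(sessions)

# Hypothetical packets, for illustration only.
packets = [
    ("10.0.0.2", 40000, "TCP", "10.0.0.9", 80),
    ("10.0.0.9", 80, "TCP", "10.0.0.2", 40000),  # reply in the same session
    ("10.0.0.3", 40001, "TCP", "10.0.0.9", 80),  # a different client
]
sessions = group_by_session(packets)
```

In this example the first two packets fall into one session and the third forms a second session.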

[0056] Step 2: Botnet detection based on spatiotemporal feature fusion: Extract the payload part of the traffic data, encode it, and input it into the parallel CNN model and LSTM model for detection. The output is a 1*10 matrix.

[0057] Step 2.1: Extract valid data from the traffic data.

[0058] From each session's pcap file, 600 bytes of content are extracted. Session files larger than 600 bytes are truncated at the 600-byte mark; session files smaller than 600 bytes are padded with zeros up to 600 bytes. The first 24 bytes are TCP connection header content, including the IP address, port number, protocol type, and flags. The IP address and port number can interfere with traffic analysis, while the protocol type and flags are mostly identical across traffic, so they offer little differentiation, reduce the effectiveness of the data, and increase the model's learning cost. Therefore, this embodiment removes the first 24 bytes and keeps the remaining 576 bytes of valid data as the traffic payload information.
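The truncate, pad, and strip rule can be written in a few lines. This is a minimal sketch under the byte counts stated above; the function name `extract_valid_data` and the constant names are assumptions for illustration.

```python
SESSION_BYTES = 600  # bytes kept per session file (from the description)
HEADER_BYTES = 24    # leading bytes discarded (from the description)

def extract_valid_data(raw: bytes) -> bytes:
    """Truncate or zero-pad to 600 bytes, then drop the first 24 bytes."""
    fixed = raw[:SESSION_BYTES].ljust(SESSION_BYTES, b"\x00")
    return fixed[HEADER_BYTES:]

# A short session is zero-padded; a long session is truncated.
short_session = extract_valid_data(b"\x01" * 100)
long_session = extract_valid_data(bytes(range(256)) * 4)
```

Both results are exactly 576 bytes long, which is the size the later reshaping steps rely on.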

[0059] Step 2.2: As shown in Figure 2, input the valid data into the CNN model to obtain the output matrix output11.

[0060] For the CNN model, the data is transformed into a 24*24 two-dimensional matrix, and the data content is mapped from [0-255] to [0-1]. The data is then input into the CNN model for classification, resulting in a 1*10 output matrix output11.

[0061] The CNN model in this embodiment includes a convolutional layer 1 (kernel=3, size=3), a pooling layer 1 (size=2), a convolutional layer 2 (kernel=3, size=3), a pooling layer 2 (size=4), a fully connected layer (neurons=27), and a fully connected layer (neurons=10) connected in sequence. In other embodiments, the network structure of the CNN model can be replaced as needed.

[0062] Step 2.3: Input the valid data into the LSTM model to obtain the output matrix output12.

[0063] For the LSTM model, the data is transformed into a 16*36 two-dimensional matrix, and the data content is mapped from [0-255] to [0-1]. The data is then input into the LSTM model for classification, resulting in a 1*10 output matrix output12.
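Both models receive the same 576 bytes, only arranged differently (24*24 for the CNN, 16*36 for the LSTM). The sketch below shows the scaling and reshaping in plain Python. The helper name `to_matrix` is hypothetical, and row-major filling is an assumption, since the text does not state the fill order.

```python
def to_matrix(data: bytes, rows: int, cols: int):
    """Scale bytes from [0, 255] to [0, 1] and reshape row-major to rows x cols."""
    if len(data) != rows * cols:
        raise ValueError("data length must equal rows * cols")
    scaled = [b / 255.0 for b in data]
    return [scaled[r * cols:(r + 1) * cols] for r in range(rows)]

# Hypothetical 576-byte payload, for illustration only.
valid_data = bytes(i % 256 for i in range(576))
cnn_input = to_matrix(valid_data, 24, 24)   # 24 * 24 = 576
lstm_input = to_matrix(valid_data, 16, 36)  # 16 * 36 = 576
```

Under this assumption each LSTM time step would see one 36-value row, while the CNN treats the payload as a single-channel 24*24 image.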

[0064] The LSTM model in this embodiment includes a bidirectional LSTM network (neurons=256, timestep=32), a fully connected layer (neurons=512), a fully connected layer (neurons=128), and a fully connected layer (neurons=10) connected sequentially. In other embodiments, the network structure of the LSTM model can be replaced as needed.

[0065] Step 2.4: Obtain the output matrix output1 based on the output matrix output11 and the output matrix output12.

[0066] The output matrices output11 and output12 are concatenated to form output13, where output13 = [output11, output12], with a dimension of 20. output13 is then input into a fully connected layer, i.e., output layer 1 in Figure 2 (neurons=10), which yields the output matrix output1 with a dimension of 10.

[0067] Step 3: Botnet detection based on traffic behavior: Collect feature indicators from traffic data, construct a traffic behavior graph, input it into the GCN model for detection, and output a 1*10 matrix.

[0068] Step 3.1: Extract the characteristic parameters of the traffic data to form a traffic communication behavior sequence.

[0069] The Zeek tool was used to extract feature parameters from pcap format files, including relevant statistical metrics such as: ts: timestamp, id.orig_h: source IP address, id.orig_p: source port number, id.resp_h: destination IP address, id.resp_p: destination port number, proto: network protocol, duration: duration, conn_state: connection state, orig_ip_pkts: number of source packets, orig_ip_bytes: number of source bytes, resp_ip_pkts: number of response packets, and resp_ip_bytes: number of response bytes. These feature parameters from multiple traffic data points constitute a sequence of traffic communication behavior.

[0070] Step 3.2: Divide the traffic communication behavior sequence based on dynamic bias to obtain several sub-sequences.

[0071] Since the GCN model can only handle static graphs and the number of traffic data nodes is enormous, computers cannot process all the data at once. Therefore, the traffic sequence must be divided into several subsequences according to specific rules, and the resulting set of subsequences is used for calculation.

[0072] Windowing is a common method in streaming data processing. It segments the original sequence by time window and analyzes the data within each window. Windowing can divide data efficiently and reduce the time complexity of the algorithm. As shown in Figure 3, in the data partitioning, L is the original sequence length, W is the window size, S is the stride, and P is the bias.

[0073] When the original sequence L is divided using a bias of P, a window size of W, and a stride of S, the number of subsequences N obtained after the division is as follows:

[0074]

[0075] As can be seen, the number of subsequences after windowing is significantly reduced, and information about the relationships between data is lost. Suppose two data points A and B are related in the original sequence. After windowing, A and B belong to two different subsequences. The relationship between A and B is severed, and the loss of data information will lead to poor learning results.

[0076] To compensate for the impact of reduced data volume and missing information on model training, this embodiment proposes a sequence partitioning method based on dynamic bias. Specifically, this involves dynamically changing the bias value to increase the number of different subsequences. Furthermore, due to the dynamic change in bias, the probability of data A and data B appearing within the same window is significantly increased, preventing the connection between data A and data B from being severed.

[0077] Based on dynamic bias, the data partitioning process in this embodiment is as follows:

[0078] a) Set the initial value and increment step of the dynamic bias;

[0079] b) The traffic communication behavior sequence is divided into several sub-sequences based on the initial value of the dynamic bias;

[0080] c) Add the increment step to the dynamic bias, and determine whether the accumulated dynamic bias exceeds the bias threshold. If it does, end the data partitioning and collect the subsequences obtained from every partition as the final set of subsequences. Otherwise, proceed to the next step.

[0081] d) Divide the traffic communication behavior sequence into several sub-sequences based on the accumulated dynamic bias, and return to step c) to continue execution.

[0082] After data partitioning using this method, the number of entries N′ in the subsequence set is:

[0083]

[0084] In the formula, O is the growth step size of the dynamic bias, S is the step size of the sliding window, N is the number of subsequences obtained after partitioning based on the initial value of the dynamic bias, and [O,S] is the least common multiple of O and S. To make N′ as large as possible and to obtain the richest data, we can take O and S as two coprime numbers, then it is easy to obtain N′=ON.
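Steps a) to d) can be sketched as a loop over increasing bias values. The following is an illustration under stated assumptions, not the patented implementation: the function names are hypothetical, windows are assumed to have a fixed size W and to lie entirely inside the sequence, and under that convention a single pass with bias P yields floor((L - P - W) / S) + 1 windows.

```python
def window_partition(seq, window, stride, bias):
    """Slide a fixed-size window over seq, starting at offset `bias`."""
    last_start = len(seq) - window
    return [tuple(seq[s:s + window]) for s in range(bias, last_start + 1, stride)]

def dynamic_bias_partition(seq, window, stride, init_bias, step, threshold):
    """Repeat the partition with a growing bias until it exceeds threshold."""
    bias = init_bias                                              # step a)
    subsequences = window_partition(seq, window, stride, bias)    # step b)
    while True:
        bias += step                                              # step c)
        if bias > threshold:
            return subsequences
        subsequences += window_partition(seq, window, stride, bias)  # step d)

# Hypothetical sequence of 20 records, window 4, stride 4.
seq = list(range(20))
static = window_partition(seq, window=4, stride=4, bias=0)
dynamic = dynamic_bias_partition(seq, window=4, stride=4,
                                 init_bias=0, step=1, threshold=3)
```

With these example values the single pass gives 5 windows and the dynamic-bias pass gives 17. Records 3 and 4 never share a window in the single pass, but they do appear together in the window (1, 2, 3, 4) produced with bias 1, which is the effect described for data A and data B.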

[0085] Step 3.3: Construct a communication behavior graph corresponding to each subsequence based on the feature parameters.

[0086] A communication behavior graph is a graph in the data-structure sense, composed of nodes and edges (connections between nodes). In graph convolutional neural networks, the nodes are generally represented by a node feature matrix and the graph structure by an adjacency matrix. The network topology is a natural graph structure, with hosts as nodes and network communication behaviors as edges. However, malicious traffic detection classifies edges, not nodes. Therefore, it is necessary to construct a graph structure with network communication behaviors as nodes, as shown in the node transformation diagram in Figure 4.

[0087] In Figure 4, the left side of the central arrow represents the node feature matrix, where the numbers represent edges, i.e., network communication behaviors. The right side of the arrow represents the adjacency matrix, where the numbers represent network communication behaviors. Two nodes in the adjacency matrix are connected when the corresponding two edges on the left side share a common node.

[0088] Assuming there are K traffic data points, and each traffic data point has M-dimensional node features, the node feature matrix is defined as $X \in \mathbb{R}^{K \times M}$ (the i-th row of the node feature matrix represents the i-th traffic data point), and the adjacency matrix is defined as $A \in \mathbb{R}^{K \times K}$. The communication behavior graph is mathematically represented as (X, A). If two traffic flows have the same IP address, there is an edge between the corresponding two nodes (e.g., node i and node j), and $A_{ij} = 1$; otherwise there is no edge, and $A_{ij} = 0$.

[0089] In this context, having the same IP address for two traffic streams means that any one of their IP addresses is the same. Suppose there are two traffic streams, x and y. The source and destination addresses of traffic x are {srcIP_x, desIP_x}, and the source and destination addresses of traffic y are {srcIP_y, desIP_y}. An edge exists if any of the following four conditions are met: srcIP_x = srcIP_y, srcIP_x = desIP_y, desIP_x = srcIP_y, or desIP_x = desIP_y.
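The four-condition edge rule is equivalent to asking whether the two flows' endpoint sets intersect. A minimal sketch follows; the dictionary keys `src` and `dst` and the function names are assumptions for illustration, and leaving the diagonal at 0 is also an assumption, since the text does not mention self-loops at this stage.

```python
def shares_ip(flow_x, flow_y):
    """True if any endpoint IP of flow_x equals any endpoint IP of flow_y."""
    ips_x = {flow_x["src"], flow_x["dst"]}
    ips_y = {flow_y["src"], flow_y["dst"]}
    return bool(ips_x & ips_y)

def build_adjacency(flows):
    """Return the K x K 0/1 adjacency matrix A for a subsequence of K flows."""
    k = len(flows)
    adj = [[0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            if shares_ip(flows[i], flows[j]):
                adj[i][j] = adj[j][i] = 1  # the relation is symmetric
    return adj

# Hypothetical flows, for illustration only.
flows = [
    {"src": "10.0.0.1", "dst": "10.0.0.2"},
    {"src": "10.0.0.2", "dst": "10.0.0.3"},  # shares 10.0.0.2 with flow 0
    {"src": "10.0.0.4", "dst": "10.0.0.5"},  # shares nothing
]
A = build_adjacency(flows)
```

The pairwise loop is quadratic in K, which is one practical reason the sequence is first split into bounded subsequences.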

[0090] In this embodiment, the node features are selected from the statistical indicators duration, conn_state, orig_ip_pkts, orig_ip_bytes, resp_ip_pkts, and resp_ip_bytes. In other embodiments, the selection of node features can be adjusted according to requirements.
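Assembling the node feature matrix X from the six selected indicators might look like the sketch below. The field names follow the list given in this description. The integer encoding of `conn_state` by list position is an assumption, because the text does not say how this categorical value is turned into a number, and the function names are hypothetical.

```python
NODE_FEATURES = ["duration", "conn_state", "orig_ip_pkts",
                 "orig_ip_bytes", "resp_ip_pkts", "resp_ip_bytes"]

# Assumed encoding: position of the state in this list (not given in the text).
CONN_STATES = ["S0", "S1", "SF", "REJ", "S2", "S3", "RSTO", "RSTR",
               "RSTOS0", "RSTRH", "SH", "SHR", "OTH"]

def node_feature_row(record):
    """Turn one flow record into an M-dimensional feature row (M = 6)."""
    row = []
    for name in NODE_FEATURES:
        value = record[name]
        if name == "conn_state":
            value = CONN_STATES.index(value)
        row.append(float(value))
    return row

def node_feature_matrix(records):
    """Stack the rows of K records into the K x M matrix X."""
    return [node_feature_row(r) for r in records]

# Hypothetical records, for illustration only.
records = [
    {"duration": 1.5, "conn_state": "SF", "orig_ip_pkts": 10,
     "orig_ip_bytes": 800, "resp_ip_pkts": 8, "resp_ip_bytes": 1200},
    {"duration": 0.0, "conn_state": "S0", "orig_ip_pkts": 1,
     "orig_ip_bytes": 60, "resp_ip_pkts": 0, "resp_ip_bytes": 0},
]
X = node_feature_matrix(records)
```

In practice the raw counts would likely need scaling before training, but the text does not specify a normalization, so none is applied here.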

[0091] Step 3.4: Input the communication behavior graph into the GCN model to obtain the output matrix output2.

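[0092] A layer of the GCN model can be defined as $Z = f(\hat{A} X W)$, where $\hat{A} = \tilde{D}^{-1/2}(A + I_N)\tilde{D}^{-1/2}$ is the renormalized Laplacian matrix of the matrix A, $\tilde{D}$ is the degree matrix of $A + I_N$, $I_N$ is the N-dimensional identity matrix, and W is the weight coefficient matrix. f(·) is an activation function, such as ReLU(·) = max(0, ·).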

[0093] This embodiment uses a two-layer GCN model. Based on the input communication behavior graph (X, A), the output is defined as $Z = \mathrm{softmax}\big(\hat{A}\,\mathrm{ReLU}(\hat{A} X W^{(0)})\,W^{(1)}\big)$, where $W^{(0)} \in \mathbb{R}^{M \times H}$ is the weight coefficient matrix from the input layer to the hidden layer and H is the number of hidden-layer features; $W^{(1)} \in \mathbb{R}^{H \times F}$ is the weight coefficient matrix from the hidden layer to the output layer and F is the dimension of the output matrix, i.e., the number of categories to be classified. In this embodiment, M=6, H=64, and F=10, so the output matrix output2 has a dimension of 10.
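A two-layer GCN forward pass with the stated sizes (M=6, H=64, F=10) can be sketched in plain Python. This is an illustration, not the trained model: the weights are random, the helper names are hypothetical, and the renormalization D^-1/2 (A + I) D^-1/2 and the final row-wise softmax follow the commonly used GCN formulation, which is an assumption about what the embodiment does.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def renormalize(adj):
    """Return D^-1/2 (A + I) D^-1/2 for a 0/1 adjacency matrix."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1 if i == j else 0) for j in range(n)]
               for i in range(n)]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_tilde]
    return [[inv_sqrt_deg[i] * a_tilde[i][j] * inv_sqrt_deg[j]
             for j in range(n)] for i in range(n)]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def gcn_forward(x, adj, w0, w1):
    """Two-layer GCN: softmax(A_hat * ReLU(A_hat * X * W0) * W1)."""
    a_hat = renormalize(adj)
    hidden = relu(matmul(matmul(a_hat, x), w0))
    return softmax_rows(matmul(matmul(a_hat, hidden), w1))

def random_weights(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
K, M, H, F = 3, 6, 64, 10                  # K is a toy node count
x = [[rng.random() for _ in range(M)] for _ in range(K)]
adj = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]    # toy graph: nodes 0 and 1 linked
scores = gcn_forward(x, adj, random_weights(M, H, rng), random_weights(H, F, rng))
```

The result has one row of F=10 scores per node. How the per-node rows are reduced to the single 10-dimensional output2 is not described in this text, so the sketch stops at the per-node output.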

[0094] Step 4: Combine the output matrix output1 and output matrix output2 for classification to obtain the network detection results.

[0095] As shown in Figure 5, output matrix output1 and output matrix output2 are concatenated to obtain output matrix output3, which has a dimension of 20. output3 is then input into a fully connected layer, i.e., output layer 2 in Figure 5 (neurons=10), which produces the output matrix output, which has a dimension of 2. This output matrix is then input into a classifier to obtain the corresponding classification result, i.e., the network detection result. In this embodiment, the classification result is either botnet or non-botnet, used for detecting botnets in the Internet of Things (IoT). In other embodiments, the model output dimension and type can be adjusted according to requirements.
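The fusion step amounts to concatenation, one fully connected layer, and a classifier. The sketch below uses hand-picked weights so the arithmetic is easy to follow. The function names, the two-neuron output layer (one neuron per class), and the argmax classifier are assumptions for illustration, since the text does not name the classifier.

```python
def dense(vec, weights, bias):
    """Fully connected layer; `weights` holds one row per output neuron."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def fuse_and_classify(output1, output2, weights, bias, labels):
    """Concatenate both branch outputs, apply one dense layer, pick the top class."""
    output3 = output1 + output2             # list concatenation, dimension 20
    output = dense(output3, weights, bias)  # fully connected layer
    best = max(range(len(output)), key=output.__getitem__)  # assumed argmax
    return labels[best], output

# Hypothetical branch outputs and weights, for illustration only.
output1 = [0.1] * 10                  # from the CNN + LSTM branch
output2 = [0.1] * 10                  # from the GCN branch
weights = [[1.0] * 20, [0.0] * 20]    # two output neurons, one per class
bias = [0.0, 0.0]
label, output = fuse_and_classify(output1, output2, weights, bias,
                                  ["botnet", "non-botnet"])
```

Because the two branches are fused by concatenation, the dense layer can weight traffic-content evidence and communication-behavior evidence independently.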

[0096] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0097] The embodiments described above are merely illustrative of several implementations of the present invention, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention, and these modifications and improvements all fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the appended claims.

Claims

1. A method for detecting IoT botnets based on multimodal fusion, characterized in that the IoT botnet detection method based on multimodal fusion includes: Step 1: Capture traffic data and preprocess it; Step 2: Botnet detection based on spatiotemporal feature fusion; Step 2.1: Extract valid data from the traffic data; Step 2.2: Input the valid data into the CNN model to obtain the output matrix output11; Step 2.3: Input the valid data into the LSTM model to obtain the output matrix output12; Step 2.4: Obtain the output matrix output1 based on the output matrix output11 and the output matrix output12; Step 3: Botnet detection based on traffic behavior; Step 3.1: Extract feature parameters from traffic data to form a traffic communication behavior sequence; Step 3.2: Divide the traffic communication behavior sequence based on dynamic bias to obtain several sub-sequences, including: a) Set the initial value and increment step of the dynamic bias; b) Divide the traffic communication behavior sequence into several sub-sequences based on the initial value of the dynamic bias; c) Add the increment step to the dynamic bias, and determine whether the accumulated dynamic bias exceeds the bias threshold; if it does, end the data partitioning and collect the subsequences obtained from every partition as the final set of subsequences; otherwise, proceed to the next step; d) Divide the traffic communication behavior sequence into several sub-sequences based on the accumulated dynamic bias, and return to step c) to continue execution; Step 3.3: Based on the feature parameters, construct the communication behavior graph corresponding to each subsequence, wherein the communication behavior graph uses traffic communication behaviors as nodes; if two traffic communication behaviors have the same IP address, there is an edge between the two nodes; two traffic communication behaviors having the same IP address means that the source address or destination address of one is the same IP address as the source address or destination address of the other; Step 3.4: Input the communication behavior graph into the GCN model to obtain the output matrix output2; Step 4: Combine the output matrix output1 and output matrix output2 for classification to obtain the network detection results.

2. The IoT botnet detection method based on multimodal fusion as described in claim 1, characterized in that, The captured traffic data and preprocessing include: Use the Tcpdump tool to capture traffic data and save it in pcap file format; Traffic data cleaning, filtering out useless TCP connection packets; The cleaned traffic data is segmented by session, using the five-tuple of traffic data as the segmentation standard, resulting in a pcap format file for each session.

3. The IoT botnet detection method based on multimodal fusion as described in claim 1, characterized in that, The extracted valid data from the traffic data includes: Extract the pcap format file for each session from the preprocessed traffic data; Extract 600 bytes of content sequentially, and remove the first 24 bytes to obtain 576 bytes of valid data.

4. The IoT botnet detection method based on multimodal fusion as described in claim 1, characterized in that, The step of obtaining output matrix output1 based on output matrix output11 and output matrix output12 includes: The output matrices output11 and output12 are concatenated to form output13, and output13 is then input into the fully connected layer to obtain output1.

5. The IoT botnet detection method based on multimodal fusion as described in claim 1, characterized in that, The step of constructing a communication behavior graph corresponding to each sub-sequence based on feature parameters includes: If the subsequence contains K traffic data points, and each traffic data point has M-dimensional node features, then the node feature matrix is defined as $X \in \mathbb{R}^{K \times M}$, the adjacency matrix is defined as $A \in \mathbb{R}^{K \times K}$, and the communication behavior graph is mathematically represented as (X, A).

6. The IoT botnet detection method based on multimodal fusion as described in claim 1, characterized in that, The fusion of the output matrices output1 and output2 is used for classification to obtain network detection results, including: The output matrix output1 and the output matrix output2 are concatenated to obtain the output matrix output3; Input the output matrix output3 into the fully connected layer to obtain the output matrix output. Then, input the output matrix output into the classifier to obtain the network detection results.