Botnet detection method based on multi-modal stacked autoencoder

By combining static and dynamic analysis with a multimodal stacked autoencoder, the complex relationship between the static and dynamic features of botnets is learned, which solves the problem of insufficient detection accuracy in existing technologies and achieves higher detection accuracy and improved model performance.

CN117640190B · Active · Publication Date: 2026-06-26 · HOHAI UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HOHAI UNIV
Filing Date
2023-11-28
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing botnet detection methods are mainly based on single static or dynamic features, which cannot effectively fuse the complex relationship between the two features, resulting in insufficient detection accuracy. Furthermore, existing feature fusion methods are simple and fail to fully leverage the advantages of hybrid analysis.

Method used

A multimodal stacked autoencoder is employed, combining static and dynamic analysis. By using a stacked multimodal autoencoder, the complex relationship between two different modal features is learned, deep and complex features are extracted, and the detection capability is improved by fine-tuning the model.

Benefits of technology

It improves the accuracy of bot program detection by using the self-learning capability of a multimodal autoencoder to fuse static and dynamic features automatically. This exploits the advantages of hybrid analysis, reduces the need for labeled datasets, and improves the performance of the network model through iterative training.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a botnet detection method based on a multi-modal stacked autoencoder. The method comprises the following steps: obtaining an executable file of an application; performing dynamic analysis and static analysis on a dataset containing benign programs and bot programs respectively, and extracting flow-based dynamic features and printable string information graph-based static features; pre-training two stacked autoencoders to encode flow-based features and graph-based features respectively, and extract deep features; fusing the dynamic features and the static features based on a multi-modal autoencoder; fine-tuning the multi-modal stacked autoencoder model; taking the encoder of the trained multi-modal stacked autoencoder model as a feature extractor, taking the output of a shared hidden layer as the input of a softmax layer, and performing bot program detection. The application can automatically fuse static features and dynamic features through an improved multi-modal stacked autoencoder, can learn the complex relationship between two different modal features, can fully play the advantages of a hybrid analysis method, and can improve the precision of detecting botnet programs.

Description

Technical Field

[0001] This invention belongs to the fields of network security and machine learning, and specifically relates to a botnet detection method based on a multimodal stacked autoencoder.

Background Technology

[0002] In the field of botnet detection, the two main approaches to data acquisition and feature extraction are static analysis and dynamic analysis. Static methods extract static features by analyzing the binary code of botnet instances without executing the malware. Dynamic methods require executing a given botnet instance, typically in a sandbox environment, and extracting dynamic features that represent its behavior. Most existing botnet detection methods rely solely on either static or dynamic features. Static analysis is simple and fast, but susceptible to obfuscation techniques such as encryption. Conversely, dynamic analysis reflects the behavior of a program at runtime, is relatively difficult to obfuscate, and generalizes better to unknown attacks and attack variants; however, data collection is time-consuming. Since static analysis excels at revealing the structure of malware, while dynamic analysis can more easily detect obfuscated malware, fusing these two types of features appropriately can improve the accuracy of botnet detection. Previous methods that considered fusing multiple features performed only feature-level or single-modal fusion, simply concatenating the features without learning the complex relationships between the two types, and thus failed to fully leverage the advantages of hybrid analysis.

Summary of the Invention

[0003] To address the aforementioned problems, this invention proposes a botnet detection method based on a multimodal stacked autoencoder. It combines multimodal features extracted through static and dynamic analysis, and learns the complex relationships between two different modal features using a stacked multimodal autoencoder, fully leveraging the advantages of hybrid analysis methods and demonstrating high detection capability for botnets.

[0004] To achieve the above-mentioned technical objectives and effects, the present invention is implemented through the following technical solution:

[0005] A botnet detection method based on a multimodal stacked autoencoder includes the following steps:

[0006] (1) Obtain the executable file of the application and save it in ELF format;

[0007] (2) Perform dynamic and static analysis on the dataset containing benign programs and bot programs respectively, and extract flow-based dynamic features and static features based on the printable string information (PSI) graph;

[0008] (3) Pre-train two stacked autoencoders (SAEs) to encode dynamic features and static features respectively, and extract deep complex features;

[0009] (4) Fuse the dynamic and static features using a multimodal autoencoder (MAE);

[0010] (5) Fine-tune the multimodal stacked autoencoder (MSAE);

[0011] (6) Use the encoder of the trained MSAE model as a feature extractor, and use the output of the model's shared hidden layer as the input of a softmax layer for classification, thereby detecting bot programs.

[0012] Several alternative methods are provided below, but they are not intended as additional limitations on the overall solution above. They are merely further additions or optimizations. Provided there are no technical or logical contradictions, each alternative method can be combined individually with respect to the overall solution above, or multiple alternative methods can be combined with each other.

[0013] Preferably, the dynamic analysis of the dataset containing benign programs and bot programs in step (2) specifically includes the following steps:

[0014] 1) Perform network behavior analysis on ELF files using the Cuckoo sandbox and record network traffic in pcap format;

[0015] 2) Based on the 5-tuple {source IP address, source port number, destination IP address, destination port number, protocol}, perform flow segmentation on the network traffic recorded in the pcap file, aggregating packets with the same 5-tuple into flow data $f = \{p_1, p_2, \ldots, p_i\}$, where each $p_i$ denotes a data packet sharing that 5-tuple;

[0016] 3) Aggregate the flow data again by taking the union of the flow data collected during the same program run to form the flow record of the corresponding ELF file:

[0017]

[0018] 4) Extract statistical features from the flow records, including the average, maximum, and minimum total number of data packets in a flow; the average, maximum, and minimum communication duration of a flow; and the average, maximum, and minimum number of bytes contained in the data packets of a flow, for a total of nine feature dimensions. This yields the flow-based dynamic feature set, which contains one feature vector for each of the n ELF samples, the i-th element being the flow-based features extracted from the i-th ELF sample.

[0019] Preferably, the static analysis of the dataset containing benign programs and bot programs in step (2) specifically includes the following steps:

[0020] 1) Use the packer detection tool DiE (Detect It Easy) to check whether the ELF file is packed, and then use UPX and IDA Pro to unpack and disassemble the binary program;

[0021] 2) Construct a function call graph (FCG) and a printable string information (PSI) graph based on the caller-callee relationship in the assembly code;

[0022] 3) Use the graph embedding technique graph2vec to convert the PSI graph into numerical vector data, yielding the static feature set, whose i-th element is the PSI-graph-based feature vector extracted from the i-th ELF sample.

[0023] Preferably, the function call graph is defined as a directed graph $G = (V, E)$ consisting of a vertex set $V = \{v_1, v_2, \ldots, v_m\}$ and an edge set $E = \{e_{12}, e_{13}, \ldots, e_{ij}\}$, where $m$ is the number of vertices and $e_{ij}$ indicates that function $v_i$ calls function $v_j$. In the FCG, vertices correspond to functions contained in the program's assembly code, and edges represent caller-callee relationships between two functions.

[0024] Preferably, the process of constructing the function call graph is summarized as follows:

[0025] a) Extract a set of identified functions from the assembly code;

[0026] b) Then determine the entry point function;

[0027] c) Construct the FCG using a breadth-first search algorithm: if functions $v_i$ and $v_j$ are identified as having a caller-callee relationship, add vertices $v_i$ and $v_j$ to the vertex set $V$ and add edge $e_{ij}$ to the edge set $E$.

[0028] Preferably, the PSI graph is constructed from functions and relationships selected from the FCG that closely resemble the operation steps of a bot program, with the aim of minimizing computational complexity. Specifically:

[0029] a) Extract all printable string information (PSI) in the binary file using the IDA Pro plugin, and select the PSIs that are at least three characters long;

[0030] b) Select the PSIs containing important semantic information (that can reveal the attacker's intent) to form the set $P = \{psi_1, psi_2, \ldots, psi_k\}$;

[0031] c) For each vertex $v_i$ in the function call graph, if the function represented by $v_i$ contains at least one important PSI $psi_i$, add vertex $v_i$ to the vertex set $V'$ of the PSI graph and continue with step d); otherwise, skip step d);

[0032] d) Traverse every edge $e_{ij}$ representing a call made by function $v_i$; if function $v_j$ also contains at least one important PSI $psi_i$, add vertex $v_j$ to the vertex set $V'$ of the PSI graph and add edge $e_{ij}$ to the edge set $E'$ of the PSI graph;

[0033] e) Repeat steps c) and d) until all vertices in the function call graph have been traversed, and finally output the PSI graph $G' = (V', E')$.

[0034] Preferably, in step (3), two SAEs are pre-trained, specifically as follows:

[0035] One SAE is pre-trained using the dynamic features. Its encoder consists of two fully connected layers and a ReLU activation function, and its decoder is symmetric to the encoder.

[0036] The other SAE is pre-trained using the static features. Its encoder consists of two convolutional layers, one fully connected layer, and a ReLU activation function, and its decoder is likewise symmetric to the encoder.

[0037] Two pre-trained SAEs are used to encode dynamic and static data respectively to obtain the latent representations of these two modalities.

[0038] Preferably, step (4) involves fusing dynamic and static features based on a multimodal autoencoder, specifically as follows:

[0039] The final hidden layer outputs of the encoders of two pre-trained SAEs are concatenated and used as the input of a multimodal autoencoder.

[0040] The hidden layer of the multimodal autoencoder fuses the latent representations of the two modalities to generate a shared latent representation;

[0041] Finally, a stacked multimodal autoencoder (MSAE) containing all pre-trained layers and the shared hidden layers is constructed.

[0042] Preferably, the fine-tuning process in step (5) is as follows:

[0043] The goal of an autoencoder is to minimize the reconstruction error between the input and output, enabling the shared hidden layer to learn a shared latent representation of the bimodal data. The loss function is defined as:

[0044]

[0045] where the two input terms are the dynamic feature vector and the static feature vector, respectively, and the two remaining terms are the corresponding reconstructed vectors output by the MSAE;

[0046] The parameters of the pre-trained layers are fixed, and the gradient descent algorithm is used for training. Only the weights and parameters of the shared hidden layers are updated.

[0047] Preferably, in step (6), the encoder of the MSAE model is used as a feature extractor, and a softmax layer is used for classification. Specifically:

[0048] Unfold the stacked autoencoder and add a softmax output layer on top of the shared hidden layer to output the predicted label corresponding to the i-th input.

[0049]

[0050] where $W$ represents the weights of the softmax layer, $b$ represents the bias of the softmax layer, $T$ is the number of label categories, and $z^{(i)}$ is the i-th output of the shared hidden layer.

[0051] Preferably, in order to improve the detection accuracy of botnets, the improved MSAE also incorporates the classification error into the loss function during the fine-tuning stage, minimizing the classification error based on the cross-entropy loss function:

[0052]

[0053] where $y^{(i)}$ is the true label of the i-th input sample and the other term is the corresponding predicted label;

[0054] Ultimately, the objective minimized by the MSAE is the weighted sum of the reconstruction error $L_r$ and the classification error $L_c$:

[0055] $L = \alpha L_r + \beta L_c + \lambda R$ (5)

[0056] Where R is the regularization term, which is achieved by applying L2 regularization to the weights of each layer in the network; α, β and λ are weighting factors.

[0057] Preferably, the weighting factors α and β are adaptively calculated using the softmax function:

[0058]

[0059] Compared with the prior art, the present invention has the following advantages:

[0060] 1. This invention combines static and dynamic analysis methods, extracting flow-based features and PSI graph-based features to detect bot programs. By leveraging the complementary advantages of the two analysis methods, it achieves higher accuracy than detection based on a single type of feature.

[0061] 2. This invention utilizes the powerful self-learning capability of a multimodal autoencoder to automatically fuse static and dynamic features through iterative training of the network model. Compared to the simple method of fusing features by directly splicing the features of these two modalities, this invention can extract the complex relationships between the two modal features and fully leverage the advantages of hybrid analysis.

[0062] 3. This invention trains the MSAE model using pre-training and fine-tuning, which does not require a large labeled dataset, and adds a penalty for classification error to the loss function in the fine-tuning stage, further enhancing the performance of the network model.

Attached Figure Description

[0063] Figure 1 is a flowchart illustrating the training and testing process according to one embodiment of the present invention;

[0064] Figure 2 is a schematic diagram of a bot program detection scheme according to an embodiment of the present invention;

[0065] Figure 3 is a schematic diagram of the pre-trained network and fine-tuned network structures according to an embodiment of the present invention;

[0066] Figure 4 is a schematic diagram of the MSAE network structure according to an embodiment of the present invention.

Detailed Implementation

[0067] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to embodiments. It should be understood that the specific embodiments described herein are only for explaining the invention and are not intended to limit the invention. The described embodiments are only some embodiments of the invention, not all embodiments. Based on the embodiments of this invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this invention.

[0068] The application principle of the present invention will be described in detail below with reference to the accompanying drawings.

[0069] In this embodiment of the invention, multimodal features extracted by static and dynamic analysis are combined. A stacked multimodal autoencoder model is constructed from pre-trained stacked autoencoders and a fine-tuned multimodal autoencoder in order to fuse static and dynamic features automatically. Accurate detection of botnets is then achieved based on the fused features. This embodiment provides a botnet detection method based on a multimodal stacked autoencoder which, as shown in Figure 1, includes the following steps:

[0070] (1) Obtain the executable file of the application and save it as ELF format.

[0071] (2) Perform dynamic and static analysis on the dataset containing benign programs and bot programs respectively, and extract flow-based dynamic features and PSI graph-based static features.

[0072] (2.1) The specific steps of dynamic feature extraction are as follows:

[0073] (2.1.1) Perform network behavior analysis on ELF files using the Cuckoo sandbox and record network traffic in pcap format.

[0074] (2.1.2) Based on the 5-tuple {source IP address, source port number, destination IP address, destination port number, protocol}, perform flow segmentation on the network traffic recorded in the pcap file, and aggregate packets with the same 5-tuple into flow data $f = \{p_1, p_2, \ldots, p_i\}$, where each $p_i$ denotes a data packet sharing that 5-tuple.

[0075] (2.1.3) The flow data is aggregated again by taking the union of the flow data collected during the same program run to form the flow record of the corresponding ELF file:

[0076]

[0077] (2.1.4) Statistical features are extracted from the flow records, including the average, maximum, and minimum total number of data packets in a flow; the average, maximum, and minimum communication duration of a flow; and the average, maximum, and minimum number of bytes contained in the data packets of a flow, for a total of nine feature elements. This yields the flow-based dynamic feature set, which contains one feature vector for each of the n ELF samples, the i-th element being the flow-based features extracted from the i-th ELF sample.
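The flow segmentation and nine-dimensional statistics of steps (2.1.2) to (2.1.4) can be illustrated with a minimal Python sketch. It is not the patented implementation: the packet dictionary keys (`src_ip`, `src_port`, `dst_ip`, `dst_port`, `proto`, `ts`, `size`) and the function name `flow_features` are hypothetical, packets are assumed to have already been parsed from the pcap file, and the byte statistic is assumed to be the total number of bytes per flow.

```python
from collections import defaultdict
from statistics import mean


def flow_features(packets):
    """Return the 9 flow statistics for one ELF sample (illustrative only)."""
    # Step (2.1.2): group packets that share the same 5-tuple into flows.
    flows = defaultdict(list)
    for pkt in packets:
        key = (pkt["src_ip"], pkt["src_port"],
               pkt["dst_ip"], pkt["dst_port"], pkt["proto"])
        flows[key].append(pkt)
    if not flows:
        return [0.0] * 9

    # Step (2.1.3): the dict of flows acts as the flow record of the sample.
    counts, durations, byte_totals = [], [], []
    for pkts in flows.values():
        times = [p["ts"] for p in pkts]
        counts.append(len(pkts))                      # packets in the flow
        durations.append(max(times) - min(times))     # communication duration
        byte_totals.append(sum(p["size"] for p in pkts))  # bytes (assumed total)

    # Step (2.1.4): average, maximum, minimum of each quantity -> 9 values.
    features = []
    for series in (counts, durations, byte_totals):
        features += [float(mean(series)), float(max(series)), float(min(series))]
    return features
```

Because the 5-tuple is directional in this sketch, request and response packets of one connection fall into two separate flows; a bidirectional key would be an equally defensible reading of the text.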

[0078] (2.2) The specific steps for static feature extraction are as follows:

[0079] (2.2.1) Use the packer detection tool DiE (Detect It Easy) to check whether the ELF file is packed, and then use UPX and IDA Pro to unpack and disassemble the binary program.

[0080] (2.2.2) Construct a function call graph (FCG) and a printable string information (PSI) graph based on the function caller-callee relationship in the assembly code.

[0081] A function call graph is defined as a directed graph $G = (V, E)$ consisting of a vertex set $V = \{v_1, v_2, \ldots, v_m\}$ and an edge set $E = \{e_{12}, e_{13}, \ldots, e_{ij}\}$, where $m$ is the number of vertices and $e_{ij}$ indicates that function $v_i$ calls function $v_j$. In an FCG, vertices correspond to unique functions contained in the program's assembly code, and edges represent caller-callee relationships between two functions. This embodiment uses an existing function call graph construction method based on a breadth-first search algorithm, which uses a FIFO function queue to construct the FCG, specifically:

[0082] a) Extract the boundaries of a set of identified functions from the assembly code and save the functions into a function set named "FunSet";

[0083] b) Then extract all entry point functions, store them in "EntryFunSet", and add all entry point functions to the vertex set V;

[0084] c) Initialize the function queue using the entry point function and set its queuing flag "enQFlag" to true to prevent the same vertex from being queued repeatedly;

[0085] d) While the queue is not empty, remove the function $v_i$ at the head of the queue and treat it as the caller, then traverse the instruction sequence of $v_i$ to extract its callee set;

[0086] e) Once the callee set is obtained, iterate over it: check whether the vertex for callee $v_j$ already exists in the graph and, if not, add $v_j$ to the vertex set $V$; then check whether the edge $e_{ij}$ from caller $v_i$ to callee $v_j$ already exists in the graph and, if not, add $e_{ij}$ to the edge set $E$;

[0087] f) Check if the callee is already in the queue. If not, set its queuing flag "enQFlag" to true and append it to the end of the queue.

[0088] g) Repeat steps d), e), and f) until the queue is empty.
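Steps a) to g) can be sketched in Python as follows. This is only an illustration under stated assumptions: the input `callees` (a mapping from each function name to the functions it calls) stands in for the result of traversing each function's instruction sequence, and the names `build_fcg`, `callees`, and `entry_points` are hypothetical. The `enqueued` set plays the role of the "enQFlag" queuing flag.

```python
from collections import deque


def build_fcg(callees, entry_points):
    """Breadth-first construction of a function call graph (illustrative)."""
    vertices = set(entry_points)      # step b): entry points join V
    edges = set()
    queue = deque(entry_points)       # step c): FIFO queue of functions
    enqueued = set(entry_points)      # stands in for the enQFlag markers

    while queue:                      # step g): loop until the queue is empty
        caller = queue.popleft()      # step d): take the head as the caller
        for callee in callees.get(caller, []):
            vertices.add(callee)              # step e): add v_j if missing
            edges.add((caller, callee))       # step e): add e_ij if missing
            if callee not in enqueued:        # step f): enqueue each callee once
                enqueued.add(callee)
                queue.append(callee)
    return vertices, edges
```

Using sets for `vertices` and `edges` makes the "check whether it already exists" tests of step e) implicit, and functions that are unreachable from any entry point never enter the graph.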

[0089] Function call graphs (FCGs) are designed to represent all possible executions of a program. FCGs are therefore typically complex, with a large number of nodes and edges, and require longer computation time and more memory. Moreover, while the FCG represents all call relationships of a program, some of them may never occur during actual execution. To minimize computational complexity, this embodiment selects from the FCG the functions and relationships that closely resemble the operation steps of a bot program in order to construct a PSI graph, specifically:

[0090] a) Extract all printable string information (PSI) in the binary file using the IDA Pro plugin. In order to balance detection accuracy and computational complexity, this embodiment selects PSIs with a length of at least three characters.

[0091] b) Then select the PSIs containing important semantic information (that can reveal the attacker's intent) to form the set $P = \{psi_1, psi_2, \ldots, psi_k\}$;

[0092] c) For each vertex $v_i$ in the function call graph, if the function represented by $v_i$ contains at least one important PSI $psi_i$, add vertex $v_i$ to the vertex set $V'$ of the PSI graph and continue with step d); otherwise, skip step d).

[0093] d) Traverse every edge $e_{ij}$ representing a call made by function $v_i$; if function $v_j$ also contains at least one important PSI $psi_i$, add vertex $v_j$ to the vertex set $V'$ of the PSI graph and add edge $e_{ij}$ to the edge set $E'$ of the PSI graph;

[0094] e) Repeat steps c) and d) until all vertices in the function call graph have been traversed, and finally output the PSI graph G' = (V', E').
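The selection of the PSI graph from the FCG in steps a) to e) can be sketched as below. The sketch assumes the strings of each function have already been extracted; `strings_of` (function name to set of strings), `important` (the set P), and `build_psi_graph` are hypothetical names, and how the important strings are chosen is outside the scope of the example.

```python
def build_psi_graph(vertices, edges, strings_of, important, min_len=3):
    """Select the PSI subgraph G' = (V', E') of an FCG (illustrative)."""
    # Step a): keep only strings with at least `min_len` characters.
    wanted = {s for s in important if len(s) >= min_len}

    def has_psi(func):
        # True if the function contains at least one important string.
        return bool(strings_of.get(func, set()) & wanted)

    psi_vertices, psi_edges = set(), set()
    for v_i in vertices:                      # step e): visit every vertex
        if not has_psi(v_i):                  # step c): skip functions without PSI
            continue
        psi_vertices.add(v_i)
        for src, dst in edges:                # step d): calls made by v_i
            if src == v_i and has_psi(dst):
                psi_vertices.add(dst)
                psi_edges.add((src, dst))
    return psi_vertices, psi_edges
```

An edge therefore survives only when both its caller and its callee carry at least one important string, which is what keeps the PSI graph much smaller than the full FCG.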

[0095] (2.2.3) Next, the graph embedding technique graph2vec is used to convert the PSI graph into numerical vector data, yielding the static feature set, whose i-th element is the PSI-graph-based feature vector extracted from the i-th ELF sample. The result of this step is a set of one-hot vectors of arbitrary length representing the set of graphs. In this embodiment, each PSI graph is represented as a numerical vector of length 1024.

[0096] (3) Pre-train two stacked autoencoders (SAEs) to encode dynamic features and static features respectively, and extract deep complex features.

[0097] The feature dataset extracted in the above steps is divided into a training set and a test set. The training set is then divided again to obtain a pre-training dataset and a fine-tuning dataset.

[0098] Using unsupervised learning, an SAE is pre-trained with the dynamic features from the pre-training dataset as input. Its structure is shown in Figure 3(a). For ease of explanation, this SAE is referred to as SAE1 in this embodiment. SAE1 consists of an encoder and a decoder. The encoder consists of two fully connected layers, with the ReLU activation function used between layers. The decoder is symmetric to the encoder. The sizes of the input and output layers of SAE1 correspond to the dimension of the dynamic features and are set to 9. The two hidden layers of the encoder have 8 and 4 neurons, respectively, so the output size of the encoder's final hidden layer is 4.
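The layer sizes of SAE1 (9, 8, 4, 8, 9) can be made concrete with a small forward-pass sketch in plain Python. It only illustrates the shapes and the symmetric encoder/decoder layout; training is omitted, the random initialization and the linear output layer are assumptions, and a real implementation would more likely use a deep learning framework. All function names are hypothetical.

```python
import random


def init_layer(n_in, n_out, rng):
    """Return (weights, bias) of a fully connected layer with random weights."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer):
    """Apply a fully connected layer to the vector x."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def build_sae1(seed=0):
    """Encoder 9 -> 8 -> 4 and mirrored decoder 4 -> 8 -> 9."""
    rng = random.Random(seed)
    sizes = [9, 8, 4, 8, 9]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def encode(layers, x):
    """Return the 4-dimensional latent representation of a 9-dim input."""
    hidden = relu(dense(x, layers[0]))
    return relu(dense(hidden, layers[1]))


def reconstruct(layers, x):
    """Run encoder and decoder; the output layer is linear (assumption)."""
    hidden = relu(dense(encode(layers, x), layers[2]))
    return dense(hidden, layers[3])


def reconstruction_error(x, x_hat):
    """Mean squared error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
```

Pre-training would adjust the weights to minimize `reconstruction_error` over the pre-training dataset, after which only `encode` is needed to produce the 4-dimensional latent features.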

[0099] Using unsupervised learning, another SAE, referred to here as SAE2, is pre-trained with the static features from the pre-training dataset as input. Its structure is shown in Figure 3(b). SAE2 consists of an encoder and a decoder that are symmetric to each other. The encoder's structure is as follows:

[0100] ① Convolutional layer C1, with a kernel size of 3×3, 16 channels, and an output of 8×8×16;

[0101] ② In pooling layer P1, perform a 2×2 max pooling operation, and the output is 4×4×16;

[0102] ③ Convolutional layer C2, with a kernel size of 3×3, 32 channels, and an output of 4×4×32;

[0103] ④ In pooling layer P2, perform a 2×2 max pooling operation, and the output is 2×2×32;

[0104] ⑤ The fully connected layer FC1 consists of 128 neurons, uses the ReLU activation function, and outputs a 128-dimensional vector;

[0105] ⑥ The fully connected layer FC2 consists of 10 neurons and uses the ReLU activation function, so the output size of the encoder's final hidden layer is 10.

[0106] The two pre-trained SAEs are then used to encode the dynamic and static features in the fine-tuning dataset, respectively, to obtain the latent representations of the two modalities.

[0107] (4) Fuse the dynamic and static features using a multimodal autoencoder.

[0108] To fuse static and dynamic features, this embodiment concatenates the final hidden layer outputs of the two pre-trained SAE encoders and uses them as input to a multimodal autoencoder. In essence, the multimodal autoencoder uses the hidden layer of a further autoencoder to fuse the latent representations of the two modalities into a shared latent representation, as shown in Figure 3(c). Finally, a stacked multimodal autoencoder containing all pre-trained layers and the shared hidden layers is constructed, as shown in Figure 4.

[0109] (5) Fine-tune the multimodal stacked autoencoder (MSAE).

[0110] The goal of an autoencoder is to minimize the reconstruction error between the input and output, enabling the shared hidden layer to learn a shared latent representation of the bimodal data. The loss function is defined as:

[0111]

[0112] where the two input terms are the dynamic feature vector and the static feature vector, respectively, and the two remaining terms are the corresponding reconstructed vectors output by the MSAE;
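As the formula for the reconstruction loss is not reproduced in this text, the following is only one plausible form, assuming a mean squared reconstruction error summed over both modalities. The symbols $x_d^{(i)}$ and $x_s^{(i)}$ (dynamic and static input vectors of the i-th sample) and $\hat{x}_d^{(i)}$, $\hat{x}_s^{(i)}$ (their reconstructions) are labels chosen here for illustration, not notation confirmed by the source:

```latex
L_r = \frac{1}{n} \sum_{i=1}^{n} \left(
        \left\lVert x_d^{(i)} - \hat{x}_d^{(i)} \right\rVert_2^2
      + \left\lVert x_s^{(i)} - \hat{x}_s^{(i)} \right\rVert_2^2
      \right)
```

Under this assumed form, $L_r$ is zero only when both modalities are reconstructed exactly, so the shared hidden layer has to retain information from the dynamic and the static features alike.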

[0113] The MSAE is fine-tuned using a semi-supervised learning approach: the labeled fine-tuning dataset is taken as the model input, the parameters of the pre-trained layers are fixed, and the Adam optimizer, a gradient-descent-based algorithm, is used to update only the parameters of the shared hidden layers.

[0114] (6) The encoder of the trained MSAE model is used as a feature extractor, and the output of the model's shared hidden layer is used as the input of a softmax layer for classification, thereby detecting bot programs.

[0115] The structure of the MSAE-based model for detecting botnets is shown in Figure 4. Specifically, the stacked autoencoder is unrolled, and a softmax output layer is added on top of the shared hidden layer to output the predicted label corresponding to the i-th input.

[0116]

[0117] where $W$ represents the weights of the softmax layer, $b$ represents the bias of the softmax layer, $T$ is the number of label categories, and $z^{(i)}$ is the i-th output of the shared hidden layer.
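The softmax output layer described above can be sketched in plain Python. The sketch assumes the standard softmax classifier, in which the logits are computed as $W z^{(i)} + b$ and the predicted label is the category with the highest probability; the function names `softmax` and `predict` and the list-based representation of $W$, $b$, and $z^{(i)}$ are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(z, W, b):
    """Classify one shared-hidden-layer output z.

    W has one row per label category (T rows) and b has T entries.
    Returns the predicted label index and the class probabilities.
    """
    logits = [sum(w * v for w, v in zip(row, z)) + bias
              for row, bias in zip(W, b)]
    probs = softmax(logits)
    return probs.index(max(probs)), probs
```

For benign-versus-bot detection, T would be 2, so `W` has two rows whose length equals the size of the shared hidden layer.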

[0118] To improve the detection accuracy of botnets, this embodiment proposes an improved MSAE, which incorporates the classification error into the loss function during the fine-tuning stage, minimizing the classification error based on the cross-entropy loss function:

[0119]

[0120] where $y^{(i)}$ is the true label of the i-th input sample and the other term is the corresponding predicted label;

[0121] Ultimately, the objective minimized by the MSAE is the weighted sum of the reconstruction error $L_r$ and the classification error $L_c$:

[0122] $L = \alpha L_r + \beta L_c + \lambda R$ (5)

[0123] where $R$ is the regularization term, used to prevent overfitting of the model and obtained by applying L2 regularization to the weights of each layer in the network; $l$ refers to the layers of the network and $W_l$ to the weights of the corresponding layer; $\alpha$ and $\beta$ are the weighting factors for the reconstruction loss and the classification loss, respectively; and $\lambda$ is the regularization coefficient.

[0124] Furthermore, the weighting factors α and β are adaptively calculated using the softmax function:

[0125]
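The combined objective $L = \alpha L_r + \beta L_c + \lambda R$ can be illustrated with the sketch below. Because the formula for the adaptive weights is not reproduced in this text, the sketch assumes that $\alpha$ and $\beta$ are obtained by applying the softmax function to the two loss values themselves; that choice, the default value of `lam`, and all function names are assumptions made for illustration.

```python
import math


def cross_entropy(true_labels, probabilities, eps=1e-12):
    """Mean cross-entropy L_c; probabilities[i] is the class distribution of sample i."""
    total = sum(math.log(p[y] + eps) for y, p in zip(true_labels, probabilities))
    return -total / len(true_labels)


def l2_penalty(weight_matrices):
    """Regularization term R: sum of squared weights over all layers."""
    return sum(w * w for matrix in weight_matrices for row in matrix for w in row)


def adaptive_weights(l_r, l_c):
    """Softmax over the two losses (assumed form of the adaptive weighting)."""
    peak = max(l_r, l_c)
    e_r, e_c = math.exp(l_r - peak), math.exp(l_c - peak)
    return e_r / (e_r + e_c), e_c / (e_r + e_c)


def total_loss(l_r, l_c, weight_matrices, lam=1e-4):
    """L = alpha * L_r + beta * L_c + lambda * R."""
    alpha, beta = adaptive_weights(l_r, l_c)
    return alpha * l_r + beta * l_c + lam * l2_penalty(weight_matrices)
```

Under this assumed weighting, the larger of the two losses receives the larger factor, so training effort shifts toward whichever objective is currently worse, and $\alpha + \beta = 1$ always holds.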

[0126] As shown in Figure 2, this embodiment combines static and dynamic analysis methods, extracting static features based on the PSI graph and dynamic features based on the flow, respectively. The static and dynamic features are then fused automatically by the MSAE: the fused features are learned through iterative network training, and botnets are detected based on these fused features. Compared to existing technologies, this invention can extract the complex relationships between bimodal features, fully leveraging the advantages of hybrid analysis.

[0127] The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the above embodiments. The embodiments and descriptions in the specification are merely illustrative of the principles of the invention. Various changes and modifications can be made to the invention without departing from its spirit and scope, and all such changes and modifications fall within the scope of the present invention as claimed. The scope of protection of this invention is defined by the appended claims and their equivalents.

Claims

1. A botnet detection method based on a multimodal stacked autoencoder, characterized in that it includes the following steps:
(1) Obtain the executable file of the application and save it in ELF format;
(2) Perform dynamic and static analysis on the dataset containing benign programs and bot programs respectively, and extract flow-based dynamic features and static features based on printable string information graphs;
In step (2), the static analysis of the dataset containing benign programs and bot programs is specifically as follows:
3-1) Use the packer detection tool DiE to check whether the ELF file is packed, and then use UPX and IDA Pro to unpack and disassemble the binary program;
3-2) Based on the caller-callee relationship in the assembly code, construct a function call graph and a printable string information graph;
3-3) Use the graph embedding technique graph2vec to convert the printable string information graph into numerical vector data to obtain a static feature set, whose i-th element is the feature vector based on the printable string information graph extracted from the i-th ELF sample;
A function call graph is defined as a directed graph $G = (V, E)$ consisting of a vertex set $V = \{v_1, v_2, \ldots, v_m\}$ and an edge set $E = \{e_{12}, e_{13}, \ldots, e_{ij}\}$, where $m$ is the number of vertices and $e_{ij}$ indicates that function $v_i$ calls function $v_j$. In a function call graph, vertices correspond to functions contained in the program's assembly code, and edges represent the caller-callee relationship between two functions;
(3) Pre-train two stacked autoencoders to encode dynamic features and static features respectively, and extract deep complex features;
(4) Fuse the dynamic and static features using a multimodal autoencoder;
(5) Fine-tune the multimodal stacked autoencoder;
(6) Use the encoder of the fully trained multimodal stacked autoencoder model as a feature extractor, and use the output of the model's shared hidden layer as the input of a softmax layer for classification, thereby detecting bot programs.

2. The botnet detection method based on a multimodal stacked autoencoder according to claim 1, characterized in that the dynamic analysis of the dataset containing benign programs and bot programs in step (2) is performed as follows:
2-1) Perform network behavior analysis on the ELF files using the Cuckoo sandbox, and record the network traffic in pcap format;
2-2) Based on the 5-tuple {source IP address, source port number, destination IP address, destination port number, protocol}, segment the network traffic recorded in the pcap file into flows, aggregating the packets that share the same 5-tuple into one flow;
2-3) Aggregate the flows again by taking the union of the flows collected during the same program run, forming the flow record of the corresponding ELF file, as given by formula (1);
2-4) Extract statistical features from the flow records, comprising the average, maximum, and minimum of the total number of packets in a flow; the average, maximum, and minimum of the communication duration of a flow; and the average, maximum, and minimum of the number of bytes contained in the packets of a flow, for a total of nine feature dimensions, thereby obtaining a flow-based dynamic feature set, where n represents the number of ELF samples and each element denotes the flow-based features extracted from the i-th ELF sample.
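
Steps 2-2) to 2-4) can be sketched as follows, assuming the packets have already been parsed out of the pcap file into dictionaries. The field names (`src_ip`, `ts`, `size`, and so on), the example addresses, and the choice to count bytes as a per-flow total are assumptions made for this illustration, not details taken from the claim.

```python
from collections import defaultdict
from statistics import mean


def segment_flows(packets):
    """Group packets that share the same 5-tuple into one flow."""
    flows = defaultdict(list)
    for pkt in packets:
        key = (pkt["src_ip"], pkt["src_port"],
               pkt["dst_ip"], pkt["dst_port"], pkt["proto"])
        flows[key].append(pkt)
    return dict(flows)


def flow_features(flows):
    """Mean, max and min of packet count, duration and byte count per flow."""
    counts = [len(pkts) for pkts in flows.values()]
    durations = [max(p["ts"] for p in pkts) - min(p["ts"] for p in pkts)
                 for pkts in flows.values()]
    sizes = [sum(p["size"] for p in pkts) for pkts in flows.values()]
    features = []
    for series in (counts, durations, sizes):
        features += [mean(series), max(series), min(series)]
    return features


# Hypothetical packets recorded during one run of a single ELF sample.
packets = [
    {"src_ip": "192.0.2.10", "src_port": 4000, "dst_ip": "198.51.100.7",
     "dst_port": 80, "proto": "TCP", "ts": 0.0, "size": 60},
    {"src_ip": "192.0.2.10", "src_port": 4000, "dst_ip": "198.51.100.7",
     "dst_port": 80, "proto": "TCP", "ts": 1.5, "size": 100},
    {"src_ip": "192.0.2.10", "src_port": 4001, "dst_ip": "198.51.100.7",
     "dst_port": 23, "proto": "TCP", "ts": 2.0, "size": 40},
]
flows = segment_flows(packets)      # the flow record of this sample
features = flow_features(flows)     # nine-dimensional feature vector
print(features)
```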

3. The botnet detection method based on a multimodal stacked autoencoder according to claim 1, characterized in that the function call graph is constructed as follows:
a) Extract the set of identified functions from the assembly code;
b) Determine the entry-point function;
c) Construct the function call graph using a breadth-first search algorithm: if a caller-callee relationship exists between two identified functions, add the corresponding vertices to the vertex set and the corresponding edge to the edge set.
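
A minimal sketch of steps a) to c), assuming the disassembler output has already been reduced to a mapping from each function to the functions it calls. The function names in `calls` are invented for the example, and restricting the graph to functions reachable from the entry point is an assumption about how the breadth-first search is applied.

```python
from collections import deque


def build_call_graph(calls, entry):
    """Breadth-first traversal from the entry point.

    calls: dict mapping a caller name to a list of callee names.
    Returns (vertices, edges), where edges are (caller, callee) pairs.
    """
    vertices, edges = {entry}, set()
    queue = deque([entry])
    while queue:
        caller = queue.popleft()
        for callee in calls.get(caller, []):
            edges.add((caller, callee))
            if callee not in vertices:   # visit each function only once
                vertices.add(callee)
                queue.append(callee)
    return vertices, edges


# Hypothetical caller-callee relationships recovered from disassembly.
calls = {
    "main": ["connect_c2", "scan"],
    "scan": ["connect_c2"],
    "unused": ["scan"],   # not reachable from the entry point
}
vertices, edges = build_call_graph(calls, "main")
print(sorted(vertices), sorted(edges))
```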

4. The botnet detection method based on a multimodal stacked autoencoder according to claim 1, characterized in that the printable string information (PSI) graph is constructed by selecting, from the function call graph, the functions and relationships that most closely reflect the operational steps of a bot program, specifically:
5-1) Extract all printable string information in the binary file using an IDA Pro plugin, and select the printable strings with a length of at least three characters;
5-2) Select the printable strings that carry important semantic information to form a set of important strings;
5-3) For a vertex in the function call graph, if the function it represents contains at least one important printable string, add the vertex to the vertex set of the PSI graph; if not, proceed to step 5-4; otherwise, skip step 5-4;
5-4) Traverse all edges that represent call relationships of that function; if the function at the other end of an edge also contains at least one important printable string and [...], add that vertex to the vertex set of the PSI graph and add the edge to the edge set of the PSI graph;
5-5) Repeat steps 5-3) and 5-4) until all vertices in the function call graph have been traversed, and finally output the PSI graph.
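
The sketch below shows one possible reading of steps 5-3) to 5-5): a vertex is kept when its function holds an important string, and an edge is kept when both its caller and its callee do. This reading, the helper name `build_psi_graph`, and the example strings are assumptions; only the string condition of step 5-4) is modelled.

```python
def build_psi_graph(vertices, edges, strings_of, important, min_len=3):
    """Select the sub-graph of functions that contain important strings.

    strings_of: dict mapping a function name to its printable strings.
    important: set of strings judged to carry semantic information.
    """
    def has_psi(fn):
        return any(len(s) >= min_len and s in important
                   for s in strings_of.get(fn, []))

    psi_vertices, psi_edges = set(), set()
    for v in vertices:
        if not has_psi(v):
            continue
        psi_vertices.add(v)
        for caller, callee in edges:
            if caller == v and has_psi(callee):
                psi_vertices.add(callee)
                psi_edges.add((caller, callee))
    return psi_vertices, psi_edges


# Hypothetical function call graph and string table.
fcg_vertices = {"main", "scan", "connect_c2", "helper"}
fcg_edges = {("main", "scan"), ("main", "helper"), ("scan", "connect_c2")}
strings_of = {
    "main": ["/bin/busybox"],
    "scan": ["telnet"],
    "connect_c2": ["GET /"],
    "helper": ["ok"],
}
important = {"/bin/busybox", "telnet", "GET /"}
psi_vertices, psi_edges = build_psi_graph(
    fcg_vertices, fcg_edges, strings_of, important)
print(sorted(psi_vertices), sorted(psi_edges))
```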

5. The botnet detection method based on a multimodal stacked autoencoder according to claim 1, characterized in that the two stacked autoencoders in step (3) are pre-trained as follows: one stacked autoencoder is pre-trained on the dynamic features; its encoder consists of two fully connected layers with a ReLU activation function, and its decoder is symmetric to the encoder; the other stacked autoencoder is pre-trained on the static features; its encoder consists of two convolutional layers and one fully connected layer with a ReLU activation function, and its decoder is likewise symmetric to the encoder; the two pre-trained stacked autoencoders are then used to encode the dynamic data and the static data, respectively, to obtain the latent representations of the two modalities.
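
To make the encoder and decoder symmetry concrete, here is a small forward-pass sketch of the dynamic-feature branch only, written without a deep learning framework. The layer sizes 9, 6, 3, the random initialization, and applying ReLU after every encoder layer are assumptions; training and the convolutional static branch are omitted.

```python
import random


def relu(vec):
    return [max(0.0, x) for x in vec]


def dense(vec, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def make_autoencoder(sizes, rng):
    """sizes [9, 6, 3] gives encoder 9->6->3 and mirrored decoder 3->6->9."""
    enc = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
    rev = sizes[::-1]
    dec = [init_layer(a, b, rng) for a, b in zip(rev, rev[1:])]
    return enc, dec


def encode(x, enc):
    for weights, bias in enc:
        x = relu(dense(x, weights, bias))
    return x


def decode(h, dec):
    for i, (weights, bias) in enumerate(dec):
        h = dense(h, weights, bias)
        if i < len(dec) - 1:      # keep the final reconstruction linear
            h = relu(h)
    return h


rng = random.Random(0)
enc, dec = make_autoencoder([9, 6, 3], rng)
x = [0.1] * 9                     # one nine-dimensional flow feature vector
code = encode(x, enc)             # latent representation
recon = decode(code, dec)         # reconstruction of the input
print(len(code), len(recon))
```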

6. The botnet detection method based on a multimodal stacked autoencoder according to claim 1, characterized in that the fusion of dynamic and static features based on a multimodal autoencoder in step (4) is performed as follows: the final hidden-layer codes of the two pre-trained stacked autoencoders are concatenated and used as the input of a multimodal autoencoder; the hidden layer of the multimodal autoencoder fuses the latent representations of the two modalities to generate a shared latent representation; finally, a multimodal stacked autoencoder comprising all pre-trained layers and the shared hidden layer is constructed.
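
The fusion step amounts to concatenating the two latent codes and passing the result through one shared layer. The sketch below illustrates this with hand-picked numbers; the function name `fuse`, the weights, and the use of ReLU on the shared layer are assumptions, since the claim does not state the activation.

```python
def fuse(dyn_code, stat_code, weights, bias):
    """Concatenate both latent codes and apply one shared hidden layer."""
    joint = dyn_code + stat_code          # list concatenation
    return [max(0.0, sum(w * x for w, x in zip(row, joint)) + b)
            for row, b in zip(weights, bias)]


dyn_code = [1.0, 2.0]                     # code from the dynamic branch
stat_code = [0.5]                         # code from the static branch
weights = [[1.0, 0.0, 0.0],
           [0.0, 1.0, -4.0]]
bias = [0.0, 0.0]
shared = fuse(dyn_code, stat_code, weights, bias)
print(shared)
```

Each shared unit sees inputs from both modalities, which is what lets the layer model cross-modal relationships rather than treating the two codes separately.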

7. The botnet detection method based on a multimodal stacked autoencoder according to claim 1, characterized in that the fine-tuning in step (5) proceeds as follows: the objective of the autoencoder is to minimize the reconstruction error between its input and output, so that the shared hidden layer learns a shared latent representation of the bimodal data; the loss function is defined by formula (2), in which the inputs are the dynamic feature vector and the static feature vector, and the outputs are the corresponding reconstructed vectors produced by the multimodal stacked autoencoder (MSAE); the parameters of the pre-trained layers are fixed, the gradient descent algorithm is used for training, and only the weights and parameters of the shared hidden layer are updated.
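
The two ideas in this claim, a reconstruction loss over both modalities and a gradient step that leaves the pre-trained layers untouched, can be sketched as follows. The squared-error form of the loss is an assumption, as are the parameter names and gradient values, which are made up to show the update rule rather than computed by backpropagation.

```python
def reconstruction_loss(x_dyn, x_stat, rec_dyn, rec_stat):
    """Sum of squared errors over both modalities (assumed form)."""
    def sq_err(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return sq_err(x_dyn, rec_dyn) + sq_err(x_stat, rec_stat)


def sgd_step(params, grads, trainable, lr=0.1):
    """Gradient descent that updates only the parameters named in trainable."""
    return {
        name: [w - lr * g for w, g in zip(values, grads[name])]
        if name in trainable else list(values)
        for name, values in params.items()
    }


loss = reconstruction_loss([1.0, 2.0], [0.0], [1.0, 1.0], [0.5])

# Hypothetical flattened parameters and gradients.
params = {"pretrained_encoder": [0.5, -0.2], "shared_hidden": [1.0, 2.0]}
grads = {"pretrained_encoder": [1.0, 1.0], "shared_hidden": [0.5, -1.0]}
updated = sgd_step(params, grads, trainable={"shared_hidden"})
print(loss, updated)
```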

8. The botnet detection method based on a multimodal stacked autoencoder according to claim 1, characterized in that step (6) is performed as follows: extend the stacked autoencoder by adding a softmax output layer on top of the shared hidden layer, which outputs the predicted label corresponding to the i-th input according to formula (3), whose terms are the weights of the softmax layer, the bias of the softmax layer, the number of target label classes, and the output of the shared hidden layer.
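
A softmax output layer on top of the shared representation can be sketched as below. The two-class setup (index 0 for benign, index 1 for bot), the weights, and the function name `predict` are assumptions chosen for the illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(hidden, weights, bias):
    """Map the shared hidden output to class probabilities and a label."""
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    return probs.index(max(probs)), probs


hidden = [1.0, 0.0]                       # output of the shared hidden layer
weights = [[2.0, 0.0],                    # one weight row per class
           [0.0, 2.0]]
bias = [0.0, 0.0]
label, probs = predict(hidden, weights, bias)
print(label, probs)
```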

9. The botnet detection method based on a multimodal stacked autoencoder according to claim 8, characterized in that, to improve the detection accuracy for bot programs, the classification error is also added to the loss function in the fine-tuning stage; the classification error is minimized using the cross-entropy loss function of formula (4), whose terms are the true label of the i-th input sample and the corresponding predicted label; the final objective minimized by the multimodal stacked autoencoder is the weighted sum of the reconstruction error and the classification error given by formula (5), which also contains a regularization term implemented as L2 regularization of the weights of each layer in the network, with the remaining terms being weighting factors; the softmax function is used to adaptively calculate the weighting factors of the reconstruction error and the classification error, as given by formula (6).
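
The combined objective can be sketched as follows. The exact weighting rule is not recoverable from the text, so taking a softmax over the two current loss values is only one plausible assumption; the mean form of the cross-entropy, the coefficient `lam`, and all function names are likewise hypothetical.

```python
import math


def cross_entropy(true_labels, probs):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(p[y])
                for y, p in zip(true_labels, probs)) / len(true_labels)


def adaptive_weights(l_rec, l_cls):
    """Softmax over the two loss values (assumed weighting rule)."""
    top = max(l_rec, l_cls)
    e_rec, e_cls = math.exp(l_rec - top), math.exp(l_cls - top)
    return e_rec / (e_rec + e_cls), e_cls / (e_rec + e_cls)


def total_loss(l_rec, l_cls, weights, lam=1e-3):
    """Weighted reconstruction and classification error plus an L2 term."""
    alpha, beta = adaptive_weights(l_rec, l_cls)
    l2 = sum(w * w for w in weights)
    return alpha * l_rec + beta * l_cls + lam * l2


l_cls = cross_entropy([0], [[0.5, 0.5]])  # one sample, uniform prediction
alpha, beta = adaptive_weights(1.0, 1.0)  # equal losses give equal weights
print(l_cls, alpha, beta, total_loss(1.0, 1.0, [0.0]))
```

Under this assumed rule the larger of the two errors receives the larger weight, so the optimizer concentrates on whichever objective currently lags.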