Meta-learning-based method for mining implicit association of malicious code gene information
By using a meta-learning approach, combined with graph neural networks and large language models, key subgraphs of function call graphs and control flow graphs are generated. This solves the problem of low accuracy and efficiency in identifying implicit associations in malware analysis, and enables precise mining of malware genetic information and accurate analysis of behavioral patterns.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- NANJING UNIV OF POSTS & TELECOMM
- Filing Date
- 2025-09-08
- Publication Date
- 2026-06-18
AI Technical Summary
In existing technologies, the accuracy and efficiency of identifying implicit associations in malware gene information in malware analysis are low, making it difficult to effectively uncover the behavioral patterns and implicit associations of malware families.
A meta-learning-based approach is adopted, which generates function call graphs and control flow graphs using disassemblers. Combined with graph neural networks and large language models, feature extraction and fine-tuning are performed to identify key subgraphs and mine implicit associations of malicious code gene information.
It significantly improves the accuracy and efficiency of malware analysis, enabling precise extraction of malware genetic features and in-depth mining of its implicit correlation features, providing innovative support for malware detection and security protection.
Description
A Meta-Learning-Based Method for Mining Implicit Associations of Malicious Code Gene Information
Technical Field
[0001] This invention relates to a method for mining implicit associations in the genetic information of malicious code based on meta-learning, belonging to the field of information mining technology.
Background Technology
[0002] With the rapid development of internet technology, cybersecurity issues have become increasingly prominent, especially malware attacks, which have become a major threat to personal privacy and corporate data security. To effectively combat malware, researchers are constantly exploring new detection technologies and methods. In recent years, Graph Neural Networks (GNNs) have been widely used in the field of malware analysis due to their powerful graph data processing capabilities. GNNs can extract valuable information from complex graph structures, helping to identify and understand the behavioral patterns of malware.
[0003] Meta-learning is a learning strategy that enables machine learning algorithms to quickly adapt to new tasks, achieving good generalization results even with limited sample sizes. In the field of malware analysis, meta-learning is used to improve the detection efficiency of models when faced with unknown malware. Combined with graph neural networks (GNNs), meta-learning can help models learn more quickly the key subgraphs in the malware's control flow graph and function call graph, that is, the parts that have a decisive influence on the malware's behavior.
[0004] With the development of deep learning technology, large-scale pre-trained models have achieved remarkable results in fields such as natural language processing. These large language models typically have a huge number of parameters and are able to capture complex relationships between data. However, directly applying these models to specific tasks often leads to overfitting or insufficient generalization ability, resulting in low accuracy and efficiency in identifying malicious code.
[0005] In summary, existing problems in current malware analysis restrict the effective discovery of malware family behavior patterns and implicit associations.
Summary of the Invention
[0006] The purpose of this invention is to provide a method for mining implicit associations of malicious code gene information based on meta-learning, which solves the problem that the accuracy and efficiency of identifying implicit associations of malicious code gene information in the existing technology need to be improved.
[0007] The technical solution of this invention is:
[0008] A method for mining implicit associations in the genetic information of malicious code based on meta-learning includes the following steps:
[0009] S1. Use disassembler tools to identify and mark functions in malicious code, generate function call graph (FCG), and use the generated FCG to extract features and preprocess to obtain attribute function call graph;
[0010] S2. Construct a gene feature extraction model based on sequence-to-sequence mechanism, and use meta-learning to train the feature extraction model based on sequence-to-sequence mechanism to obtain the trained feature extraction model based on sequence-to-sequence mechanism. The attribute function call graph generated in step S1 is used to obtain the key subgraph of the function call graph through the trained feature extraction model based on sequence-to-sequence mechanism.
[0011] S3. Obtain assembly code by parsing the binary file of malicious code using a disassembler, construct the call relationship between functions, generate a control flow graph (CFG), and extract semantic and structural features from the CFG to obtain an attribute control flow graph.
[0012] S4. Construct a graph-based gene feature extraction model, optimize the graph-based gene feature extraction model using meta-learning, and obtain the optimized graph-based gene feature extraction model. Use the optimized graph-based gene feature extraction model on the attribute control flow graph generated in step S3 to obtain the key subgraph of the control flow graph.
[0013] S5. Using the key subgraphs of the function call graph obtained in step S2 and the key subgraphs of the control flow graph obtained in step S4, a template-based prompt word automatic generation technique is used to obtain a fine-tuning corpus. The large language model is then fine-tuned to obtain a fine-tuned large language model. The fine-tuned large language model is then used to mine the implicit associations of malicious code gene information.
[0014] Furthermore, in step S2, the feature extraction model based on the sequence-to-sequence mechanism includes a graph attention network (GAT), a first leaky linear rectified function layer (i.e., the first LeakyRELU layer), a first fully connected layer, a second leaky linear rectified function layer (i.e., the second LeakyRELU layer), a sequence-to-sequence model, and a second fully connected layer.
[0015] Graph Attention Network (GAT): Classifies attribute function call graphs to obtain attention weights;
[0016] The first LeakyRELU layer: calculates the set of node embeddings containing neighbor features for N nodes from the input attention weights;
[0017] The first fully connected layer: From the current attribute node features of the N nodes in the input attribute function call graph, obtains a set of current attribute node feature embeddings of the N nodes with the same dimension as the node embedding set.
[0018] The second LeakyRELU layer: The node feature set of N nodes is obtained by concatenating the node embedding set containing neighbor features and the node feature embedding set containing the current attribute features.
[0019] Sequence-to-sequence model: Extract global features from the final node feature set of the input N nodes, obtain nodes with attention weights, output them to the key subgraph construction module, and generate a global graph vector, which is then output to the second fully connected layer;
[0020] Key subgraph construction module: Ranks nodes with attention weights by importance and generates a sequence of important subgraphs as key subgraphs of the function call graph;
[0021] The second fully connected layer calculates the classification probability from the input global graph vector to obtain the classification result.
[0022] Furthermore, the sequence-to-sequence model includes a normalized exponential function layer (softmax layer), a weighted averaging module, a concatenation layer, and a Long Short-Term Memory (LSTM) network.
[0023] Softmax layer: From the final node feature $u_i$ in the input final node feature set and the query vector $q_t$ at time step $t = 0, 1, 2, \ldots, T$, calculates the attention weight of each node, $\alpha'_{i,t} = \mathrm{softmax}(u_i \cdot q_t)$;
[0024] Weighted averaging module: Generates the aggregated full-graph information $r_t$ from the node features at the current time step;
[0025] Concatenation layer: Concatenates the query vector $q_t$ and the aggregated full-graph information $r_t$ at time step $t$ into a feature vector, which is fed into the Long Short-Term Memory (LSTM) network; after time step $T$, the global graph vector is obtained and output to the second fully connected layer;
[0026] Long Short-Term Memory (LSTM) network: From the input concatenated feature vector at time step $t$, obtains the query vector $q_{t+1}$ at time step $t+1$ and outputs it to the softmax layer.
[0027] Furthermore, in step S2, meta-learning is used to train the feature extraction model based on the sequence-to-sequence mechanism, specifically as follows:
[0028] S21. Randomly sample N categories from the attribute function call graph. Each category contains K support set samples and several query set samples.
[0029] S22. On the support set, the base learner starts from the initial global meta-parameters θ provided by the meta-learner and optimizes the task-specific parameters θ′: $\theta' = \theta - \alpha \nabla_{\theta} \mathcal{L}_{\text{support}}(f_{\theta})$, where α is the learning rate of the base learner, $\nabla_{\theta} \mathcal{L}_{\text{support}}(f_{\theta})$ is the gradient of the support-set loss with respect to the initialization parameter θ, $\mathcal{L}_{\text{support}}$ is the loss function on the support set, and $f_{\theta}$ is the base learner, which adopts the feature extraction model based on the sequence-to-sequence mechanism;
[0030] S23. On the query set, the meta-learner evaluates the performance of the model updated by the base learner and optimizes the global meta-parameter θ using the query-set loss: $\theta \leftarrow \theta - \beta \nabla_{\theta} \mathcal{L}_{\text{query}}(f_{\theta'})$, where β is the learning rate of the meta-learner, $\nabla_{\theta} \mathcal{L}_{\text{query}}(f_{\theta'})$ is the gradient of the query-set loss with respect to the global meta-parameter θ, and $\mathcal{L}_{\text{query}}$ is the loss function on the query set. The global meta-parameter θ is updated in this way, and training is completed after a set number of training iterations.
[0031] Furthermore, in step S4, the graph-based gene feature extraction model includes a graph attention network model classifier, a control flow graph interpreter (CFG interpreter), and a key subgraph generation module.
[0032] Graph Attention Network Model Classifier: The node embeddings of the input attribute control flow graph (CFG) are generated by the node embedding generation component, and the category of each attribute control flow graph is predicted by the classification component as the class label. The node embeddings and class labels are output to the CFG interpreter.
[0033] CFG interpreter: includes an initial learning module and an interpretation module.
[0034] Initial learning module: The node scoring component calculates the node score on the node embedding, multiplies the node embedding by the node score to obtain the weighted embedding, and uses the weighted embedding to generate the classification probability through the classification component, thereby obtaining the classification result of each node;
[0035] The explanation module: Based on the node scores and the actual number of nodes in the attribute control flow graph, the attribute control flow graph is pruned step by step. After removing the set number of nodes with the lowest scores step by step, multiple subgraphs are obtained from the remaining nodes each time. The most important subgraph is taken as the key subgraph of the control flow graph.
[0036] Furthermore, in the interpretation module, the attribute control flow graph is pruned multiple times based on node scores and the actual number of nodes in the attribute control flow graph. After removing a predetermined number of nodes with the lowest scores, multiple subgraphs are obtained from the remaining nodes each time. The most important subgraph is then used as the key subgraph of the control flow graph. Specifically,
[0037] S41. Obtain the actual number of nodes Nreal of the attribute control flow graph obtained in step S3. The ordered node set Vordered, used to store nodes sorted by importance, is initialized to an empty set. The subgraph set subgraphs, used to store the adjacency matrix of the subgraphs of the remaining nodes after each iteration, is initialized to the attribute control flow graph obtained in step S3. The indices of all nodes in the attribute control flow graph are stored in the variable all_node_indices to track the nodes that have not been pruned.
[0038] S42. Based on the set step size, calculate the number of nodes Nstep that need to be pruned in each iteration:
[0039] S43. Based on the node scores obtained from the initial learning module, append the Nstep nodes with the lowest scores to the ordered node set Vordered in order. Remove the indices corresponding to the Nstep nodes with the lowest scores from the variable all_node_indices, and update the subgraph set subgraphs: remove all incoming and outgoing edges of the Nstep nodes with the lowest scores, update the subgraph adjacency matrix of the remaining nodes, and save it as the current subgraph; repeat the above steps until all the node indices in the variable all_node_indices are removed.
[0040] S44. Reverse the order of nodes in the ordered node set Vordered, so that the first node is the most important node for the classification task and the last node is the least important node; at the same time, reverse the order of subgraphs in the subgraph set subgraphs, and use the first subgraph, i.e. the most important subgraph, as the key subgraph of the control flow graph.
[0041] Further, step S5 specifically involves,
[0042] S51. Prompt words are generated using the template-based automatic prompt word generation technique. The attribute function call graph obtained in step S1 and the attribute control flow graph obtained in step S3 are used as input data. The key subgraph of the function call graph obtained in step S2 and the key subgraph of the control flow graph obtained in step S4 are used as target labels to construct the fine-tuning corpus.
[0043] S52. Based on the generated fine-tuning corpus, use the DoRA fine-tuning method to fine-tune the large language model.
[0044] Further, in step S51, generating prompt words specifically involves designing a prompt word template t = fill(P, D), where P is a predefined pattern; N prompt words $\{t_1, t_2, \ldots, t_i, \ldots, t_N\}$ are generated from the input data D through the designed prompt word template mapping.
[0045] Further, step S52 specifically involves the following: during fine-tuning, for the fine-tuning corpus $\{(t_i, y_i)\}$, where $t_i$ is the i-th generated prompt word and $y_i$ is the i-th target label, an optimization objective function over θ is optimized, where θ represents the model parameters of the large language model; θ is continuously updated through backpropagation until the set number of fine-tuning steps is reached.
[0046] The beneficial effects of this invention are:
[0047] First, this method for mining implicit associations in malicious code gene information based on meta-learning can accurately mine implicit associations in malicious code gene information, significantly improving the accuracy and efficiency of malicious code analysis, and providing innovative technical support and application directions for the fields of malicious code detection and security protection.
[0048] Second, this invention can accurately extract the genetic features of malicious code. Through a sequence-to-sequence mechanism-based genetic feature extraction model and a graph-based genetic feature extraction model, it can perform multi-level analysis of attribute function call graphs and attribute control flow graphs, respectively. The features extracted from attribute function call graphs and attribute control flow graphs are integrated into the genetic information of malicious code, enabling a more accurate representation of the genetic features of malicious code and improving the effectiveness of classification, detection, and behavior analysis.
[0049] Third, this meta-learning-based method for mining implicit associations in malware gene information, through template-based automatic prompt word generation and fine-tuning, can deeply and accurately uncover implicit association features between malware. Utilizing a fine-tuned large language model, it can effectively construct precise associations of malware behavior patterns, providing a reliable basis for malware detection and family tracing in complex scenarios, and significantly improving the accuracy and comprehensiveness of implicit association mining.
Attached Figure Description
[0050] Figure 1 is a flowchart illustrating the method for mining implicit associations of malicious code gene information based on meta-learning according to an embodiment of the present invention.
[0051] Figure 2 is a schematic diagram illustrating the method for mining implicit associations of malicious code gene information based on meta-learning according to an embodiment of the present invention;
[0052] Figure 3 is an illustrative diagram illustrating the gene feature extraction model based on the sequence-to-sequence mechanism in the embodiment;
[0053] Figure 4 is an illustrative diagram illustrating the graph-based gene feature extraction model in this embodiment;
[0054] Figure 5 is a schematic diagram illustrating the sequence-to-sequence model in the embodiment.
Detailed Implementation
[0055] The preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
[0056] This embodiment provides a method for mining implicit associations in malicious code gene information based on meta-learning, as shown in Figures 1 and 2, including the following steps:
[0057] S1. Use a disassembler to identify and mark functions in malicious code, generate a function call graph (FCG), and use the generated FCG to extract features and preprocess them to obtain an attribute function call graph.
[0058] In step S1, the malicious code is disassembled using the disassembler IDA. Plugins or scripts provided by IDA are used to generate a function call graph (FCG). The assembly instructions of the FCG are converted into feature vectors using the Word2Vec model. This process includes instruction filtering and CBOW model training. Finally, the function feature vectors are obtained by averaging the instruction vectors of each function, forming an attributed function call graph.
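The averaging step at the end of S1 can be illustrated with a minimal sketch. It assumes instruction embeddings are already available; in the embodiment they would come from a trained CBOW Word2Vec model, while here `toy_embeddings` is a hand-written, hypothetical lookup table, and the function names and data layout are illustrative only.

```python
from typing import Dict, List

Vector = List[float]

def average_vectors(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def function_features(functions: Dict[str, List[str]],
                      embeddings: Dict[str, Vector]) -> Dict[str, Vector]:
    """Map each function to the mean of its instruction vectors.

    Instructions without an embedding are skipped, which plays the role of
    the instruction filtering mentioned in the text (an assumption).
    """
    features: Dict[str, Vector] = {}
    for name, instructions in functions.items():
        known = [embeddings[ins] for ins in instructions if ins in embeddings]
        if known:
            features[name] = average_vectors(known)
    return features

# Hypothetical stand-in for vectors learned by a CBOW model.
toy_embeddings = {"push": [1.0, 0.0], "mov": [0.0, 1.0], "call": [1.0, 1.0]}
# Hypothetical functions, each given as a list of opcodes.
toy_functions = {
    "sub_401000": ["push", "mov", "call"],
    "sub_401020": ["mov", "nop"],  # "nop" has no embedding and is filtered out
}
attributed = function_features(toy_functions, toy_embeddings)
```

Each entry of `attributed` would then be attached to the matching FCG node as its attribute vector.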
[0059] S2. Construct a gene feature extraction model based on sequence-to-sequence mechanism, and use meta-learning to train the feature extraction model based on sequence-to-sequence mechanism to obtain the trained feature extraction model based on sequence-to-sequence mechanism. The attribute function call graph generated in step S1 is used to obtain the key subgraph of the function call graph through the trained feature extraction model based on sequence-to-sequence mechanism.
[0060] In step S2, the feature extraction model based on the sequence-to-sequence mechanism includes a graph attention network (GAT), a first leaky linear rectified function layer (i.e., the first LeakyRELU layer), a first fully connected layer, a second leaky linear rectified function layer (i.e., the second LeakyRELU layer), a sequence-to-sequence model, and a second fully connected layer, as shown in Figure 3.
[0061] Graph Attention Network (GAT): Classifies attribute function call graphs to obtain attention weights;
[0062] The first LeakyRELU layer: Calculates the set of node embeddings $\{v_1, v_2, \ldots, v_N\}$ containing neighbor features of the N nodes from the input attention weights;
[0063] The first fully connected layer: From the current attribute node features of the N nodes in the input attribute function call graph, obtains the current attribute node feature embedding set $\{w_1, w_2, \ldots, w_N\}$ of the N nodes, with dimensions consistent with the node embedding set $\{v_1, v_2, \ldots, v_N\}$;
[0064] The second LeakyRELU layer: Concatenates the input node embedding set $\{v_1, v_2, \ldots, v_N\}$ containing neighbor features and the current attribute node feature embedding set $\{w_1, w_2, \ldots, w_N\}$ of the N nodes to obtain the final node feature set $\{u_1, u_2, \ldots, u_N\}$ of the N nodes;
[0065] Sequence-to-sequence model: Extract global features from the final node feature set of the input N nodes, obtain nodes with attention weights, output them to the key subgraph construction module, and generate a global graph vector, which is then output to the second fully connected layer;
[0066] As shown in Figure 5, the sequence-to-sequence model includes a normalized exponential function layer (softmax layer), a weighted averaging module, a concatenation layer, and a long short-term memory network (LSTM).
[0067] Softmax layer: From the final node feature $u_i$ in the input final node feature set $\{u_1, u_2, \ldots, u_N\}$ and the query vector $q_t$ at time step $t$, calculates the attention weight of each node, $\alpha'_{i,t} = \mathrm{softmax}(u_i \cdot q_t)$;
[0068] Weighted averaging module: Generates the aggregated full-graph information $r_t$ from the node features at the current time step;
[0069] Concatenation layer: Concatenates the query vector $q_t$ and the aggregated full-graph information $r_t$ at time step $t$ into a feature vector, which provides richer information for the next iteration; after time step $T$, the global graph vector is obtained;
[0070] Long Short-Term Memory (LSTM) network: From the input concatenated feature vector at time step $t$, obtains the query vector $q_{t+1}$ at time step $t+1$ and outputs it to the softmax layer.
[0071] Key subgraph construction module: Ranks nodes with attention weights by importance and generates a sequence of important subgraphs as key subgraphs of the function call graph;
[0072] The second fully connected layer calculates the classification probability from the input global graph vector to obtain the classification result.
[0073] The classification results obtained from the second fully connected layer help verify the effectiveness of the extracted key subgraphs of the function call graph, because the classification results depend directly on them. If the extracted key subgraphs capture features with significant influence, those features improve classification performance. The accuracy of the classification results can therefore serve as an indicator of the effectiveness of the key subgraphs.
[0074] In step S2, the second LeakyRELU layer adds the neighbor features of each node to the current attribute node features to obtain the final node features. Here $W_1$, the weight of the first fully connected layer, is applied to the feature of the current attribute node, while the attention weight and $W^k$, the weight matrix of the k-th of the K attention heads in the graph attention network model, are applied to the input features $x_j$ of the neighboring nodes. The final node feature set $\{u_1, u_2, \ldots, u_N\}$ built from these features is fed into the sequence-to-sequence model. The nodes are sorted by the generated attention weights $\alpha'_{i,t}$, and the nodes with the highest attention weights are identified as the most critical functions. A high weight means that a node's features have a greater impact on the classification result, which helps explain the malware classification result. The sequence-to-sequence model iterates over time steps $t = 0, 1, 2, \ldots, T$ to generate a global graph vector, which is fed into the second fully connected layer to calculate the classification probability and obtain the classification result. The cross-entropy loss function then measures the difference between the predicted result and the actual label. Optimized through the meta-learning framework, the feature extraction model based on the sequence-to-sequence mechanism can adapt quickly to different tasks and generalize better in feature aggregation and updating. It identifies the key nodes that have the greatest impact on the classification result of the attribute function call graph and thus yields the key subgraphs of the function call graph, improving adaptability, efficiency, and accuracy across tasks.
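The attention readout described for the sequence-to-sequence model can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented model: the aggregation $r_t = \sum_i \alpha'_{i,t} u_i$ is one reading of "weighted averaging", the LSTM is replaced by a trivial stand-in (the next query is simply $r_t$), the initial query is all zeros, and all function names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]

def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def attention_step(node_feats: List[Vector], query: Vector) -> Tuple[List[float], Vector]:
    """One time step: alpha'_{i,t} = softmax(u_i . q_t), then a weighted sum r_t."""
    alpha = softmax([dot(u, query) for u in node_feats])
    dim = len(node_feats[0])
    r = [sum(a * u[d] for a, u in zip(alpha, node_feats)) for d in range(dim)]
    return alpha, r

def readout(node_feats: List[Vector], steps: int) -> Tuple[Vector, List[float], List[int]]:
    """Iterate the readout and rank nodes by their final attention weight."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    query = [0.0] * len(node_feats[0])  # assumed initial query q_0
    alpha: List[float] = []
    concat: Vector = []
    for _ in range(steps):
        alpha, r = attention_step(node_feats, query)
        concat = query + r  # concatenation [q_t ; r_t]
        query = r           # stand-in for the LSTM update producing q_{t+1}
    ranking = sorted(range(len(node_feats)), key=lambda i: alpha[i], reverse=True)
    return concat, alpha, ranking

# Hypothetical final node features u_1..u_3 for a three-function graph.
toy_feats = [[2.0, 0.0], [0.0, 1.0], [0.1, 0.1]]
graph_vector, weights, node_ranking = readout(toy_feats, steps=3)
```

In this sketch `graph_vector` plays the role of the global graph vector passed to the second fully connected layer, and `node_ranking` is the importance order from which a key subgraph would be built.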
[0075] In step S2, meta-learning is used to train the feature extraction model based on the sequence-to-sequence mechanism. Specifically,
[0076] S21. Randomly sample N categories from the attribute function call graph. Each category contains K support set samples and several query set samples.
[0077] S22. On the support set, the base learner starts from the initial global meta-parameters θ provided by the meta-learner and optimizes the task-specific parameters θ′: $\theta' = \theta - \alpha \nabla_{\theta} \mathcal{L}_{\text{support}}(f_{\theta})$, where α is the learning rate of the base learner, $\nabla_{\theta} \mathcal{L}_{\text{support}}(f_{\theta})$ is the gradient of the support-set loss with respect to the initialization parameter θ, $\mathcal{L}_{\text{support}}$ is the loss function on the support set, and $f_{\theta}$ is the base learner, which adopts the feature extraction model based on the sequence-to-sequence mechanism;
[0078] S23. On the query set, the meta-learner evaluates the performance of the model updated by the base learner and optimizes the global meta-parameter θ using the query-set loss: $\theta \leftarrow \theta - \beta \nabla_{\theta} \mathcal{L}_{\text{query}}(f_{\theta'})$, where β is the learning rate of the meta-learner, $\nabla_{\theta} \mathcal{L}_{\text{query}}(f_{\theta'})$ is the gradient of the query-set loss with respect to the global meta-parameter θ, and $\mathcal{L}_{\text{query}}$ is the loss function on the query set. The global meta-parameter θ is updated in this way, and training is completed after a set number of training iterations.
[0079] Steps S21-S23 can improve training efficiency, enable rapid adaptation to new tasks, and enhance the model's adaptability and accuracy to new tasks.
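The inner and outer updates of steps S22 and S23 can be illustrated on a deliberately tiny problem. The sketch below assumes a scalar model $f_\theta(x) = \theta x$ with a mean squared error loss in place of the sequence-to-sequence feature extraction model, and it averages the meta-gradient over the sampled tasks; the toy tasks, learning rates, and function names are all hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]            # (x, y)
Task = Tuple[List[Sample], List[Sample]]  # (support set, query set)

def mse(theta: float, data: List[Sample]) -> float:
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)

def mse_grad(theta: float, data: List[Sample]) -> float:
    """Derivative of the mean squared error with respect to theta."""
    return sum(2.0 * x * (theta * x - y) for x, y in data) / len(data)

def inner_update(theta: float, support: List[Sample], alpha: float) -> float:
    """S22: theta' = theta - alpha * grad of the support loss at theta."""
    return theta - alpha * mse_grad(theta, support)

def meta_step(theta: float, tasks: List[Task], alpha: float, beta: float) -> float:
    """S23: theta <- theta - beta * grad_theta of the query loss at theta'."""
    meta_grad = 0.0
    for support, query in tasks:
        theta_prime = inner_update(theta, support, alpha)
        # Chain rule through the inner step: d theta' / d theta.
        d_prime = 1.0 - alpha * sum(2.0 * x * x for x, _ in support) / len(support)
        meta_grad += mse_grad(theta_prime, query) * d_prime
    return theta - beta * meta_grad / len(tasks)

def make_task(slope: float) -> Task:
    """Toy task y = slope * x with two support samples and one query sample."""
    support = [(1.0, slope), (2.0, 2.0 * slope)]
    query = [(3.0, 3.0 * slope)]
    return support, query

tasks = [make_task(1.0), make_task(3.0)]
theta = 0.0
for _ in range(100):
    theta = meta_step(theta, tasks, alpha=0.05, beta=0.05)

# Adapting the meta-learned initialization to one task with a single inner step.
adapted = inner_update(theta, tasks[1][0], alpha=0.05)
```

With these toy tasks the meta-parameter settles between the two slopes, and one inner step from that point already lowers the support loss of either task, which is the fast-adaptation behavior the text attributes to the framework.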
[0080] S3. Obtain assembly code by parsing the binary file of the malicious code using a disassembler, construct the call relationship between functions, generate a control flow graph (CFG), and extract semantic and structural features from the CFG to obtain an attribute control flow graph.
[0081] In step S3, the sample assembly file is read and parsed into instructions. Each instruction includes information such as address, opcode, and operands. Regular expressions and specific syntax rules are then used to effectively parse the instruction content in the assembly file. Basic blocks are constructed based on the instruction flow in the assembly file, and these basic blocks are connected to form a complete control flow graph. A pre-trained language model, such as BERT, can generate the semantic embedding of each node to obtain the attribute control flow graph (ACFG).
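The parsing and basic-block construction in step S3 might look roughly like the sketch below. It assumes a simplified listing format of `address: opcode operands`, a small hypothetical set of jump mnemonics, and direct jump targets only; real disassembler output is richer, and the semantic embedding of each node (for example with BERT) is not shown.

```python
import re
from typing import Dict, List, Optional, Set, Tuple

Instruction = Tuple[int, str, str]  # (address, opcode, operands)

LINE_RE = re.compile(r"^(0x[0-9a-fA-F]+):\s+(\w+)\s*(.*)$")
COND_JUMPS = {"je", "jne", "jz", "jnz", "jg", "jl", "jge", "jle"}  # assumed subset

def parse(lines: List[str]) -> List[Instruction]:
    """Parse each line into address, opcode, and operands."""
    insns: List[Instruction] = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            insns.append((int(m.group(1), 16), m.group(2).lower(), m.group(3).strip()))
    return insns

def jump_target(opcode: str, operands: str) -> Optional[int]:
    """Return the direct jump target, or None for non-jumps and indirect jumps."""
    if (opcode == "jmp" or opcode in COND_JUMPS) and operands.startswith("0x"):
        return int(operands, 16)
    return None

def build_cfg(insns: List[Instruction]) -> Tuple[Dict[int, List[Instruction]], Set[Tuple[int, int]]]:
    """Split instructions into basic blocks and connect them with edges."""
    addrs = [a for a, _, _ in insns]
    leaders = {addrs[0]}
    for idx, (_, op, operands) in enumerate(insns):
        target = jump_target(op, operands)
        if target is not None and target in addrs:
            leaders.add(target)
        if (op in COND_JUMPS or op in ("jmp", "ret")) and idx + 1 < len(insns):
            leaders.add(addrs[idx + 1])
    blocks: Dict[int, List[Instruction]] = {}
    current = addrs[0]
    for ins in insns:
        if ins[0] in leaders:
            current = ins[0]
            blocks[current] = []
        blocks[current].append(ins)
    edges: Set[Tuple[int, int]] = set()
    starts = sorted(blocks)
    for i, start in enumerate(starts):
        _, op, operands = blocks[start][-1]
        target = jump_target(op, operands)
        if target in blocks:
            edges.add((start, target))          # taken branch
        if op not in ("jmp", "ret") and i + 1 < len(starts):
            edges.add((start, starts[i + 1]))   # fall-through
    return blocks, edges

listing = [
    "0x1000: mov eax, 1",
    "0x1004: cmp eax, 0",
    "0x1008: je 0x1014",
    "0x100c: add eax, 2",
    "0x1010: jmp 0x1018",
    "0x1014: sub eax, 1",
    "0x1018: ret",
]
blocks, edges = build_cfg(parse(listing))
```

Each key of `blocks` is the start address of one basic block, so the pair `(blocks, edges)` is a plain control flow graph to which node attributes could be attached afterwards.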
[0082] S4. Construct a graph-based gene feature extraction model, optimize the graph-based gene feature extraction model using meta-learning, and obtain the optimized graph-based gene feature extraction model. Use the optimized graph-based gene feature extraction model on the attribute control flow graph generated in step S3 to obtain the key subgraph of the control flow graph.
[0083] In step S4, the graph-based gene feature extraction model includes a graph attention network model classifier, a control flow graph interpreter (CFG interpreter), and a key subgraph generation module, as shown in Figure 4:
[0084] Graph Attention Network Model Classifier: The node embeddings of the input attribute control flow graph (CFG) are generated by the node embedding generation component, and the category of each attribute control flow graph is predicted by the classification component as the class label. The node embeddings and class labels are output to the CFG interpreter.
[0085] CFG interpreter: includes an initial learning module and an interpretation module.
[0086] Initial learning module: The node scoring component calculates the node score on the node embedding, multiplies the node embedding by the node score to obtain the weighted embedding, and uses the weighted embedding to generate the classification probability through the classification component, thereby obtaining the classification result of each node;
[0087] The explanation module: Based on node scores and the actual number of nodes in the attribute control flow graph, the attribute control flow graph is pruned multiple times. After removing a predetermined number of nodes with the lowest scores, multiple subgraphs are obtained from the remaining nodes at each stage. The most important subgraph is selected as the key subgraph of the control flow graph. Specifically:
[0088] S41. Obtain the actual number of nodes Nreal of the attribute control flow graph obtained in step S3. The ordered node set Vordered, used to store nodes sorted by importance, is initialized to an empty set. The subgraph set subgraphs, used to store the adjacency matrix of the subgraphs of the remaining nodes after each iteration, is initialized to the attribute control flow graph obtained in step S3. The indices of all nodes in the attribute control flow graph are stored in the variable all_node_indices to track the nodes that have not been pruned.
[0089] S42. Based on the set step size, calculate the number of nodes Nstep that need to be pruned in each iteration:
[0090] S43. Based on the node scores obtained from the initial learning module, append the Nstep nodes with the lowest scores to the ordered node set Vordered in order. Remove the indices corresponding to the Nstep nodes with the lowest scores from the variable all_node_indices, and update the subgraph set subgraphs: remove all incoming and outgoing edges of the Nstep nodes with the lowest scores, update the subgraph adjacency matrix of the remaining nodes, and save it as the current subgraph; repeat the above steps until all the node indices in the variable all_node_indices are removed.
[0091] S44. Reverse the order of nodes in the ordered node set Vordered, so that the first node is the most important node for the classification task and the last node is the least important node; at the same time, reverse the order of subgraphs in the subgraph set subgraphs, and use the first subgraph, i.e. the most important subgraph, as the key subgraph of the control flow graph.
[0092] In steps S43 and S44, removing the least important nodes in each iteration generates a new subgraph, so the attribute control flow graph shrinks step by step. Each iteration appends a progressively smaller subgraph to the subgraph set subgraphs. Because the lowest-scoring nodes are removed each time, subgraphs that appear later in the set are more important. After the set is reversed, the first subgraph is the most important one.
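Steps S41 to S44 can be sketched as below. Several details are assumptions rather than statements from the text: the formula for Nstep is taken to be `ceil(step_size * Nreal)` with a minimum of 1, removed nodes are represented by zeroing their rows and columns in a fixed-size adjacency matrix, and the empty graph left after the final iteration is not stored.

```python
import math
from typing import List, Tuple

Matrix = List[List[int]]

def copy_matrix(m: Matrix) -> Matrix:
    return [row[:] for row in m]

def prune_by_score(adjacency: Matrix, scores: List[float],
                   step_size: float) -> Tuple[List[int], List[Matrix]]:
    """Iteratively remove the lowest-scoring nodes and record each subgraph."""
    n_real = len(scores)                                # S41: actual node count
    n_step = max(1, math.ceil(step_size * n_real))      # S42: assumed formula
    v_ordered: List[int] = []
    subgraphs: List[Matrix] = [copy_matrix(adjacency)]  # starts with the full graph
    all_node_indices = list(range(n_real))
    current = copy_matrix(adjacency)
    while all_node_indices:                             # S43
        lowest = sorted(all_node_indices, key=lambda i: scores[i])[:n_step]
        v_ordered.extend(lowest)
        for i in lowest:
            all_node_indices.remove(i)
            for j in range(n_real):
                current[i][j] = 0                       # drop outgoing edges
                current[j][i] = 0                       # drop incoming edges
        if all_node_indices:                            # assumption: skip empty graph
            subgraphs.append(copy_matrix(current))
    v_ordered.reverse()                                 # S44: most important first
    subgraphs.reverse()
    return v_ordered, subgraphs

# Hypothetical 4-node chain 0 -> 1 -> 2 -> 3 with made-up node scores.
toy_adjacency = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
toy_scores = [0.1, 0.9, 0.4, 0.7]
ordered_nodes, subgraph_list = prune_by_score(toy_adjacency, toy_scores, step_size=0.25)
key_subgraph = subgraph_list[0]
```

With one node removed per iteration, the first subgraph after reversal contains only the top-scoring node; how large a subgraph to keep in practice is a choice this sketch does not make.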
[0093] In step S4, the node representation is adjusted by the node scores through the initial learning module. This allows the model to prioritize nodes that have a significant impact on classification, thereby reducing interference from irrelevant or unimportant nodes and ultimately making the classification more accurate and robust. The process of optimizing the graph-based gene feature extraction model using meta-learning is the same as in steps S21-S23. Combining meta-learning framework optimization can improve the adaptability to different tasks and increase training efficiency.
[0094] S5. Using the key subgraphs of the function call graph obtained in step S2 and the key subgraphs of the control flow graph obtained in step S4, a template-based prompt word automatic generation technique is used to obtain a fine-tuning corpus. The large language model is then fine-tuned to obtain a fine-tuned large language model. The fine-tuned large language model is then used to mine the implicit associations of malicious code gene information.
[0095] S51. The prompt words are generated using template-based automatic prompt word generation technology. The attribute function call graph obtained in step S1 and the attribute control flow graph obtained in step S3 are used as input data D. The key subgraphs of the function call graph obtained in step S2 and the key subgraphs of the control flow graph obtained in step S4 are used as target labels to construct a fine-tuned corpus.
[0096] In step S51, prompt words are generated. Specifically, a prompt word template t = fill(P, D) is designed, where P is a predefined pattern. Through the designed template, N prompt words $\{t_1, t_2, \ldots, t_i, \ldots, t_N\}$ are generated from the input data D.
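As an illustration of the template t = fill(P, D), the following Python sketch fills a predefined pattern with one graph record at a time and pairs each resulting prompt with its target label. The pattern wording, the field names `graph_type`, `nodes`, and `edges`, and the helper `build_corpus` are hypothetical. The text does not specify the actual pattern or the serialization of the graphs.

```python
def fill(pattern, data):
    """Fill the predefined pattern P with one input record from D."""
    return pattern.format(**data)

# Hypothetical pattern P; the real wording is not given in the text
PATTERN = (
    "Given the attribute {graph_type} with nodes {nodes} and edges {edges}, "
    "list the nodes of its key subgraph."
)

def build_corpus(pattern, inputs, labels):
    """Pair each generated prompt t_i with its target label y_i."""
    return [(fill(pattern, d), y) for d, y in zip(inputs, labels)]

# Toy input data D and target labels (key subgraph node lists)
inputs = [
    {"graph_type": "function call graph", "nodes": [0, 1, 2], "edges": [(0, 1), (1, 2)]},
    {"graph_type": "control flow graph", "nodes": [0, 1], "edges": [(0, 1)]},
]
labels = [[1, 2], [0, 1]]
corpus = build_corpus(PATTERN, inputs, labels)
```

Each element of `corpus` is one pair $(t_i, y_i)$ of the kind consumed by the fine-tuning step.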
[0097] S52. Based on the generated fine-tuning corpus, the large language model is fine-tuned using the DoRA fine-tuning method. Specifically, fine-tuning uses the corpus $\{(t_i, y_i)\}$, where $t_i$ is the i-th generated prompt word and $y_i$ is the i-th target label, and the optimization objective function is defined over this corpus. Here, θ represents the model parameters of the large language model; θ is continuously updated through backpropagation until the set number of fine-tuning steps is reached.
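For readers unfamiliar with DoRA (weight-decomposed low-rank adaptation), the sketch below shows how the method, as generally described, recomposes an adapted weight matrix from a trainable magnitude vector and a low-rank update to the direction. It covers only this weight recomposition with NumPy on toy shapes. The variable names and dimensions are assumptions, and the sketch does not show how the invention configures DoRA for its large language model.

```python
import numpy as np

def dora_weight(W0, m, B, A):
    """Return m * (W0 + B @ A) / ||W0 + B @ A||_c, using column-wise norms."""
    V = W0 + B @ A                                       # direction with low-rank update
    col_norm = np.linalg.norm(V, axis=0, keepdims=True)  # one norm per column
    return m * V / col_norm

rng = np.random.default_rng(0)
d, k, r = 6, 4, 2                                # toy output dim, input dim, rank
W0 = rng.normal(size=(d, k))                     # frozen pretrained weight
m = np.linalg.norm(W0, axis=0, keepdims=True)    # trainable magnitude, initialised to ||W0||_c
B = np.zeros((d, r))                             # trainable, zero at initialisation
A = rng.normal(size=(r, k))                      # trainable

W_init = dora_weight(W0, m, B, A)                # equals W0 before any training step
```

Because B starts at zero, the adapted weight equals the pretrained weight at initialisation, so fine-tuning starts from the behaviour of the original model.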
[0098] In step S5, the large language model is fine-tuned on automatically generated prompts so that it can accurately uncover implicit associations in malicious code. This improves its performance in malicious code detection and behavior analysis and strengthens its ability to identify and process new types of malicious code. The DoRA fine-tuning method enhances the model's representational capability on domain-specific tasks, which improves its understanding of implicit associations and complex semantics, makes it adapt better to those tasks, and improves both its responsiveness to task requirements and the effectiveness of implicit association discovery. Given a user-supplied graph representation of malicious code, the fine-tuned large language model can then output the corresponding key subgraphs.
[0099] This meta-learning-based method for mining implicit associations in malicious code genetic information can accurately mine implicit associations in malicious code genetic information, significantly improving the accuracy and efficiency of malicious code analysis, and providing innovative technical support and application directions for the fields of malicious code detection and security protection.
[0100] This meta-learning-based method for mining latent associations in malware gene information first generates a function call graph (FCG) and a control flow graph (CFG) using a disassembler, then obtains an attribute function call graph and an attribute control flow graph. A graph attention network is then used to classify the attribute function call graph, extracting malware gene features and obtaining key subgraphs of the function call graph. Key nodes in the attribute control flow graph are identified using a sequence-to-sequence mechanism within the meta-learning framework. The node scoring component and embedding generation component are optimized, and the control flow graph is gradually pruned to generate key subgraphs. Finally, by automatically generating prompt words and fine-tuning a large language model, accurate mining of latent associations in malware and identification of behavioral patterns are achieved. This method has the following advantages. Comprehensive analysis: by combining the semantic and structural information of the function call graph (FCG) and the control flow graph (CFG), it mines malicious code features at multiple levels. Fast adaptation: with the key node identification and subgraph generation mechanism of the meta-learning framework, it adapts quickly to different types of malicious code tasks without complex adjustments for each task. Efficient classification: it uses graph neural networks to classify malicious code, with high computational efficiency and prediction accuracy from feature extraction through behavioral pattern analysis. Strong scalability: by adjusting model parameters or fine-tuning pre-trained large models, it adapts to new malicious code detection needs and supports the mining of implicit associations in large-scale malicious code samples.
[0101] This meta-learning-based method for mining implicit associations in malware gene information uses a graph attention network (GAT) to obtain graph structure data representations of malware. It further employs a meta-learning paradigm to jointly acquire feature representations of the malware's attribute control flow graph and attribute function call graph, training a malware family classifier (achieving malware attribution analysis based on graph features). A graph interpreter is used to mine key nodes in the attribute control flow graph and attribute function call graph by evaluating node importance. Using template-based and large language model text generation capabilities, a batch of implicit association mining corpora are constructed based on the key subgraph mining results. Finally, the DoRA fine-tuning method significantly improves the performance of the large language model on the implicit association mining task, achieving an accuracy of 93.53%.
[0102] This meta-learning-based method for mining implicit associations in malware gene information extracts features from the graph structure of malware using a graph attention network (GAT). It then utilizes a meta-learning framework combined with control flow graphs and function call graphs for key subgraph mining and family classification. By combining graph neural networks, meta-learning, and large language model fine-tuning techniques, it achieves accurate mining of implicit associations in malware, exhibiting efficient feature extraction capabilities and high mining accuracy.
[0103] This invention can accurately extract the genetic features of malicious code. Through a sequence-to-sequence genetic feature extraction model and a graph-based genetic feature extraction model, it can perform multi-level analysis of attribute function call graphs and attribute control flow graphs, respectively. The features extracted from these graphs are integrated into the genetic information of the malicious code, resulting in a more accurate representation of its genetic characteristics and improved classification, detection, and behavioral analysis. Furthermore, by using disassemblers and graph neural networks (GNNs), the structural relationships between malicious code segments are comprehensively analyzed, and behavioral patterns are deeply mined from semantic and topological characteristics, significantly improving the accuracy and comprehensiveness of feature extraction.
[0104] This meta-learning-based method for mining implicit associations in malware gene information, through template-based automatic prompt word generation and meta-learning-based fine-tuning, can deeply and accurately uncover implicit association features between malware. Utilizing a fine-tuned large language model, it can effectively construct precise associations of malware behavior patterns, providing a reliable basis for malware detection and family tracing in complex scenarios, and significantly improving the accuracy and comprehensiveness of implicit association mining.
[0105] In this meta-learning-based method for mining latent associations in malware genetic information, malware genetic information refers to a set of characteristic attributes extracted from malware for identification and classification. Latent associations manifest as structural patterns in key subgraphs within function call graphs or control flow graphs. These key subgraphs are the explicit carriers of latent associations and support the understanding and detection of malware in the joint analysis of function call graphs and control flow graphs. This invention offers advantages such as high-precision latent association mining of malware genetic information, rapid model adaptation, powerful graph structure analysis capabilities, and good scalability, significantly improving the efficiency and accuracy of malware analysis and classification. It enables latent association mining of malware genetic information, revelation of dynamic behavior patterns, and in-depth mining of malware family relationships. This invention achieves efficient and accurate malware classification and analysis, providing intelligent threat detection, resource optimization management, and latent association mining capabilities for the cybersecurity field.
[0106] This meta-learning-based method for mining implicit associations of malware gene information uses a gene feature extraction model based on a sequence-to-sequence mechanism to quickly identify the key nodes in the attribute function call graph that affect classification results. Using subgraph pruning based on node scoring and embedding generation, it dynamically optimizes the importance ranking, thereby significantly improving the efficiency of mining implicit associations of malware families in multi-task scenarios.
[0107] This meta-learning-based method for mining implicit associations of malicious code gene information combines the semantic features and graph structure features of attribute control flow graphs with the classification task of graph attention networks. It effectively improves the model's adaptability and maintains high classification performance even in complex and ever-changing code environments. It also classifies multiple malware families accurately and robustly.
[0108] The above description is merely an embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principle of the present invention should be included within the scope of the claims of the present invention.
Claims
1. A method for mining implicit associations in malicious code gene information based on meta-learning, characterized in that it includes the following steps:
S1. Use disassembler tools to identify and mark functions in malicious code, generate a function call graph (FCG), and extract features from the generated FCG and preprocess them to obtain an attribute function call graph;
S2. Construct a gene feature extraction model based on a sequence-to-sequence mechanism, and use meta-learning to train it to obtain the trained feature extraction model based on the sequence-to-sequence mechanism; apply the trained model to the attribute function call graph generated in step S1 to obtain the key subgraph of the function call graph;
S3. Obtain assembly code by parsing the binary file of malicious code using a disassembler, construct the call relationship between functions, generate a control flow graph (CFG), and extract semantic and structural features from the CFG to obtain an attribute control flow graph;
S4. Construct a graph-based gene feature extraction model, optimize it using meta-learning to obtain the optimized graph-based gene feature extraction model, and apply the optimized model to the attribute control flow graph generated in step S3 to obtain the key subgraph of the control flow graph;
S5. Using the key subgraphs of the function call graph obtained in step S2 and the key subgraphs of the control flow graph obtained in step S4, use a template-based automatic prompt word generation technique to obtain a fine-tuning corpus; fine-tune the large language model on this corpus, and use the fine-tuned large language model to mine the implicit associations of malicious code gene information.
2. The method for mining implicit associations of malicious code gene information based on meta-learning as described in claim 1, characterized in that: in step S2, the feature extraction model based on the sequence-to-sequence mechanism includes a graph attention network (GAT), a first leaky rectified linear unit layer (the first LeakyReLU layer), a first fully connected layer, a second leaky rectified linear unit layer (the second LeakyReLU layer), a sequence-to-sequence model, and a second fully connected layer;
Graph attention network (GAT): classifies attribute function call graphs to obtain attention weights;
The first LeakyReLU layer: calculates, from the input attention weights, the set of node embeddings containing neighbor features for the N nodes;
The first fully connected layer: from the current attribute node features of the N nodes in the input attribute function call graph, obtains a set of current-attribute node feature embeddings for the N nodes with the same dimension as the node embedding set;
The second LeakyReLU layer: concatenates the node embedding set containing neighbor features with the node feature embedding set containing the current attribute features to obtain the final node feature set of the N nodes;
Sequence-to-sequence model: extracts global features from the input final node feature set of the N nodes, obtains nodes with attention weights and outputs them to the key subgraph construction module, and generates a global graph vector, which is output to the second fully connected layer;
Key subgraph construction module: ranks the nodes with attention weights by importance and generates a sequence of important subgraphs as key subgraphs of the control flow graph;
The second fully connected layer: calculates the classification probability from the input global graph vector to obtain the classification result.
3. The method for mining implicit associations of malicious code gene information based on meta-learning as described in claim 2, characterized in that: the sequence-to-sequence model includes a normalized exponential function layer (softmax layer), a weighted averaging module, a concatenation layer, and a Long Short-Term Memory (LSTM) network;
Softmax layer: from the final node features $u_i$ in the input final node feature set and the query vector $q_t$ at time step t = 0, 1, 2, ..., T, calculates the attention weight of each node, $\alpha'_{i,t} = \mathrm{softmax}(u_i \cdot q_t)$;
Weighted averaging module: generates the aggregated full-graph information $r_t$ from the node features at the current time step;
Concatenation layer: concatenates the query vector $q_t$ at time step t with the aggregated full-graph information $r_t$ to obtain the concatenated feature vector at time step t, which is output to the LSTM network; after time step T, the global graph vector is obtained and output to the second fully connected layer;
LSTM network: from the input concatenated feature vector at time step t, obtains the query vector $q_{t+1}$ at time step t+1 and outputs it to the softmax layer.
4. The method for mining implicit associations of malicious code gene information based on meta-learning as described in claim 1, characterized in that: in step S2, meta-learning is used to train the feature extraction model based on the sequence-to-sequence mechanism, specifically:
S21. Randomly sample N categories from the attribute function call graphs, where each category contains K support set samples and several query set samples;
S22. On the support set, the base learner starts from the initial global meta-parameters θ provided by the meta-learner and optimizes the task-specific parameters θ′: $\theta' = \theta - \alpha \nabla_{\theta} \mathcal{L}_{\text{support}}(f_{\theta})$, where α is the learning rate of the base learner, $\nabla_{\theta} \mathcal{L}_{\text{support}}(f_{\theta})$ is the gradient of the support set loss with respect to the initialization parameter θ, $\mathcal{L}_{\text{support}}$ is the loss function on the support set, and $f_{\theta}$ is the base learner, which employs the feature extraction model based on the sequence-to-sequence mechanism;
S23. On the query set, the meta-learner evaluates the performance of the model updated by the base learner and optimizes the global meta-parameter θ using the query set loss: $\theta \leftarrow \theta - \beta \nabla_{\theta} \mathcal{L}_{\text{query}}(f_{\theta'})$, where β is the learning rate of the meta-learner, $\nabla_{\theta} \mathcal{L}_{\text{query}}(f_{\theta'})$ is the gradient of the query set loss with respect to the global meta-parameter θ, and $\mathcal{L}_{\text{query}}$ is the loss function on the query set; the global meta-parameter θ is updated, and training is completed after a set number of training iterations.
5. The method for mining implicit associations of malicious code gene information based on meta-learning as described in any one of claims 1-4, characterized in that: in step S4, the graph-based gene feature extraction model includes a graph attention network model classifier, a control flow graph interpreter (CFG interpreter), and a key subgraph generation module;
Graph attention network model classifier: generates the node embeddings of the input attribute control flow graph through the node embedding generation component, predicts the category of each attribute control flow graph through the classification component as the class label, and outputs the node embeddings and class labels to the CFG interpreter;
CFG interpreter: includes an initial learning module and an explanation module;
Initial learning module: the node scoring component calculates node scores from the node embeddings, the node embeddings are multiplied by the node scores to obtain weighted embeddings, and the classification component generates the classification probability from the weighted embeddings, thereby obtaining the classification result of each node;
Explanation module: based on the node scores and the actual number of nodes in the attribute control flow graph, prunes the attribute control flow graph step by step; each time a set number of lowest-scoring nodes is removed, a subgraph is obtained from the remaining nodes, yielding multiple subgraphs; the most important subgraph is taken as the key subgraph of the control flow graph.
6. The method for mining implicit associations of malicious code gene information based on meta-learning as described in claim 5, characterized in that: in the explanation module, the attribute control flow graph is pruned step by step based on the node scores and the actual number of nodes in the graph; each time a set number of lowest-scoring nodes is removed, a subgraph is obtained from the remaining nodes, yielding multiple subgraphs; the most important subgraph is used as the key subgraph of the control flow graph, specifically:
S41. Obtain the actual number of nodes $N_{real}$ of the attribute control flow graph obtained in step S3. The ordered node set Vordered, used to store nodes sorted by importance, is initialized to an empty set. The subgraph set subgraphs, used to store the adjacency matrix of the subgraph of the remaining nodes after each iteration, is initialized to the attribute control flow graph obtained in step S3. The indices of all nodes in the attribute control flow graph are stored in the variable all_node_indices to track the nodes that have not been pruned;
S42. Based on the set step size, calculate the number of nodes Nstep to be pruned in each iteration;
S43. Based on the node scores obtained from the initial learning module, append the Nstep nodes with the lowest scores to the ordered node set Vordered in order. Remove the indices of these Nstep nodes from the variable all_node_indices, and update the subgraph set subgraphs: remove all incoming and outgoing edges of these Nstep nodes, update the subgraph adjacency matrix of the remaining nodes, and save it as the current subgraph; repeat the above steps until all node indices in the variable all_node_indices are removed;
S44. Reverse the order of nodes in the ordered node set Vordered, so that the first node is the most important node for the classification task and the last node is the least important node; at the same time, reverse the order of subgraphs in the subgraph set subgraphs, and use the first subgraph, i.e. the most important subgraph, as the key subgraph of the control flow graph.
7. The method for mining implicit associations of malicious code gene information based on meta-learning as described in any one of claims 1-4, characterized in that: Step S5, specifically, S51. The prompt words are generated by the template-based prompt word automatic generation technology. The attribute function call graph obtained in step S1 and the attribute control flow graph obtained in step S3 are used as input data D. The key subgraph of the function call graph obtained in step S2 and the key subgraph of the control flow graph obtained in step S4 are used as target labels to construct fine-tuning corpus. S52. Based on the generated fine-tuning corpus, use the DoRA fine-tuning method to fine-tune the large language model.
8. The method for mining implicit associations of malicious code gene information based on meta-learning as described in claim 7, characterized in that: in step S51, prompt words are generated. Specifically, a prompt word template t = fill(P, D) is designed, where P is a predefined pattern. Through the designed template, N prompt words $\{t_1, t_2, \ldots, t_i, \ldots, t_N\}$ are generated from the input data D.
9. The method for mining implicit associations of malicious code gene information based on meta-learning as described in claim 7, characterized in that: in step S52, specifically, fine-tuning uses the corpus $\{(t_i, y_i)\}$, where $t_i$ is the i-th generated prompt word and $y_i$ is the i-th target label, and the optimization objective function is defined over this corpus. Here, θ represents the model parameters of the large language model; θ is continuously updated through backpropagation until the set number of fine-tuning steps is reached.