Malicious code image recognition method and system based on conditional generative adversarial network
By using a conditional generative adversarial network-based approach, the generator produces shallow, medium, and deep features. Combined with sparse plasticity Conformer modules and sparse gating mechanisms, it solves the problems of strong task dependence, insufficient feature representation, and inadequate data augmentation security in malicious code image recognition, thus achieving efficient and accurate malicious code recognition.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING AEROSPACE WANYUAN TECH CO LTD
- Filing Date
- 2026-03-25
- Publication Date
- 2026-06-23
Figure CN122265798A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of network and information security technology, specifically relating to a method and system for malicious code image recognition based on conditional generative adversarial networks, as well as an electronic device and a computer-readable storage medium.
Background Technology
[0002] In malicious code image recognition tasks, sample scarcity, rapid family updates, and large distribution differences are common core challenges faced by researchers. To address these issues, drawing on the few-shot learning approach from image classification, scholars both domestically and internationally have gradually developed a technical approach centered on meta-learning, metric learning, and data augmentation.
[0003] Meta-learning-based few-shot learning methods focus on "learning how to learn" and typically employ a two-layer structure consisting of a meta-learner and a base learner. The meta-learner summarizes cross-task transfer patterns through multi-task contextual training, providing initialization parameters or update rules for the base learner. This enables the base learner to quickly adapt to new malware families with a limited number of labeled samples, leveraging historical knowledge for identification.
[0004] Metric learning-based methods identify samples by mapping them to a unified embedding space. In malicious code scenarios, researchers can use static features such as PE / ELF file header fields and Opcode sequences, or dynamic features such as API call sequences and control flow graphs. After embedding and mapping, classification is achieved by calculating the similarity between the sample and the class prototype.
[0005] Data augmentation-based methods enhance the ability to distinguish small sample classes by expanding the sample size. These methods mainly fall into two categories: one is to perform simple perturbations such as random replacement or reordering of features like byte sequences and opcode fragments; the other is to introduce generative models such as generative adversarial networks and variational autoencoders to generate new samples at the feature layer to expand the coverage of training data and alleviate overfitting.
[0006] Despite the initial progress made by the above methods, significant shortcomings remain in practical applications. Meta-learning-based methods rely on large-scale sets of base tasks, but malicious code data is limited by security and compliance constraints, so a comprehensive task set is difficult to construct, which limits model adaptation efficiency and recognition accuracy. For metric learning-based methods, the distribution of malicious code features varies greatly and is inconsistent across platforms, which affects the stability of the embedding space, and existing metrics cannot fully characterize the global features of multi-stage attack chains. For data augmentation-based methods, simple perturbation has limited effect; generative models still need improvement in sample stability, adversarial capability, and distribution consistency; and the security and compliance of the generated samples also restrict the adoption of these methods.
[0007] In summary, existing methods suffer from strong task dependence, insufficient feature representation, and inadequate security and stability of data augmentation. There is an urgent need for an efficient and robust malicious code image recognition technology that overcomes these shortcomings, in order to improve the accuracy and applicability of malicious code image recognition under small sample conditions.
Summary of the Invention
[0008] To address the aforementioned technical problems in the prior art, namely how to achieve malicious code image recognition that balances sufficient feature representation with data augmentation security and stability, this application provides a malicious code image recognition method and system based on conditional generative adversarial networks.
[0009] In a first aspect of this application, a method for malicious code image recognition based on conditional generative adversarial networks is provided, comprising:
[0010] Constructing a conditional generative adversarial network (GAN), wherein the GAN includes a generator, a discriminator, and a classifier, and the construction of the GAN includes:
[0011] The generator receives an input random vector and preset condition information, generates shallow features, mid-level features and deep features based on the random vector and the condition information, and obtains fused feature samples based on the shallow features, the mid-level features and the deep features;
[0012] The discriminator receives the fused feature sample and the real sample, and determines whether the fused feature sample is the real sample based on the fused feature sample and the real sample. The real sample is a real malicious code image sample.
[0013] The classifier receives the fused feature sample, the real sample, and preset category information, and uses a self-attention algorithm to classify the fused feature sample;
[0014] Calculate the discriminator loss function, the classifier loss function, and the generator loss function;
[0015] The generator, discriminator, and classifier undergo adversarial training based on the generator loss function, the discriminator loss function, and the classifier loss function, until the generator loss function, the discriminator loss function, and the classifier loss function satisfy preset conditions; and
[0016] The constructed conditional generative adversarial network is used to classify the input malicious code image samples.
[0017] The shallow features are low-level texture and edge features near the input end, the mid-level features are features that reflect local structure and pattern relationships, and the deep features are features that contain global semantic and category discrimination information.
[0018] Among them, the shallow features $L \in \mathbb{R}^{C_L \times H_L \times W_L}$ are computed from $z$ and $y$ through the fully connected structure $\mathrm{FC}(\cdot)$ and $n$ convolutional layers $\mathrm{Conv}(\cdot)$, where $\mathrm{Conv}(\cdot)$ denotes convolution; $n$ is the number of convolutional layers included in the shallow feature extraction stage and can be 1, 2, or 3; $C_L$ is the number of channels of the shallow features; $H_L$ and $W_L$ are the height and width of the shallow features in the spatial dimension, respectively; $z$ is a random vector; $y$ is preset condition information; and $\mathrm{FC}(\cdot)$ denotes a fully connected structure;
[0019] The mid-level features $M \in \mathbb{R}^{C_M \times H_M \times W_M}$ are obtained by upsampling, where $\mathrm{Up}(\cdot)$ denotes upsampling; $C_M$ is the number of channels of the mid-level features; and $H_M$ and $W_M$ are the height and width of the mid-level features in the spatial dimension, respectively;
[0020] The deep features $D$ are computed from the mid-level features $M$ by multi-head self-attention with a learnable coefficient, where $\mathrm{MSA}(\cdot)$ denotes multi-head self-attention computation; $\mathrm{Concat}(\cdot)$ denotes concatenation; $H_a$ is the number of attention heads in multi-head self-attention; $d$ is the feature embedding dimension of the token; $W_Q$, $W_K$, and $W_V$ are the query mapping matrix, key mapping matrix, and value mapping matrix, respectively; $W_E$ is the feature embedding mapping matrix, used to map the flattened vector of each local sub-region of the mid-level features into the feature embedding space of the token; $b_E$ is the preset embedding bias vector, used to translate and adjust the features in the feature embedding space during the mapping; $c_p$ is the side length of each local sub-region (patch) when the mid-level features are divided in the spatial dimension; $\Omega_s$ is the set of all spatial locations covered by the $s$-th patch; and $M_s$ is the $s$-th sub-region of the mid-level features, obtained by dividing the mid-level features $M$ into $P \times P$ non-overlapping regions in the spatial dimension.
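The division of the mid-level features into P×P non-overlapping sub-regions and their mapping into the token embedding space can be illustrated with a minimal NumPy sketch. It is not taken from the application: the function name `patch_tokens`, the array shapes, and the random values standing in for the embedding mapping matrix and bias vector are all assumptions chosen for illustration.

```python
import numpy as np

def patch_tokens(M, P, W_e, b_e):
    """Split M of shape (C, H, W) into P x P non-overlapping patches and embed them."""
    C, H, W = M.shape
    assert H % P == 0 and W % P == 0, "spatial size must be divisible by P"
    ph, pw = H // P, W // P  # side lengths of one patch
    # (C, P, ph, P, pw) -> (P, P, C, ph, pw) -> (P*P, C*ph*pw)
    patches = (
        M.reshape(C, P, ph, P, pw)
        .transpose(1, 3, 0, 2, 4)
        .reshape(P * P, C * ph * pw)
    )
    # Linear projection of each flattened patch into the d-dimensional token space
    return patches @ W_e + b_e

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 8, 8))               # hypothetical mid-level features
d = 16                                        # hypothetical token embedding dimension
W_e = 0.1 * rng.normal(size=(4 * 4 * 4, d))   # stand-in for the embedding mapping matrix
b_e = np.zeros(d)                             # stand-in for the embedding bias vector
tokens = patch_tokens(M, P=2, W_e=W_e, b_e=b_e)
print(tokens.shape)  # (4, 16)
```

Each of the P×P rows of `tokens` corresponds to one sub-region; a multi-head self-attention layer would then operate on these rows.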
[0021] Optionally, obtaining the fused feature sample based on the shallow features, the mid-layer features, and the deep features includes:
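[0022] The shallow features $L$, the mid-level features $M$, and the deep features $D$ are each mapped to the same channel dimension through 1×1 convolution and spatially aligned to obtain $\tilde{L}$, $\tilde{M}$, and $\tilde{D}$:
[0023] $\tilde{L} = \mathrm{Align}(\mathrm{Conv}_{1\times 1}(L))$, $\tilde{M} = \mathrm{Align}(\mathrm{Conv}_{1\times 1}(M))$, $\tilde{D} = \mathrm{Align}(\mathrm{Conv}_{1\times 1}(D))$,
[0024] Wherein, the same channel dimension is a preset feature embedding dimension; $\mathrm{Align}(\cdot)$ is the spatial alignment function; $C_L \times H_L \times W_L$, $C_M \times H_M \times W_M$, and $C_D \times H_D \times W_D$ are the sizes of $L$, $M$, and $D$, respectively, in which $H$ and $W$ with a subscript are the spatial dimensions and $C$ with a subscript is the channel dimension; $\tilde{L}$, $\tilde{M}$, and $\tilde{D}$ each have size $C \times H \times W$, where $H \times W$ is the spatial dimension of $\tilde{L}$, $\tilde{M}$, and $\tilde{D}$ after alignment and $C$ is the feature embedding dimension;
[0025] Convert $\tilde{L}$, $\tilde{M}$, and $\tilde{D}$ each into a token sequence using $\mathrm{Flatten}(\cdot)$;
[0026] Calculate the query $Q$, key $K$, and value $V$ from the token sequences using the learnable matrices $W_Q$, $W_K$, and $W_V$, where $[\,;\,]$ denotes concatenation of token sequences;
[0027] Calculate the scaled dot-product attention score matrix $S = QK^{\top} / \sqrt{d_k}$, where $d_k$ is the feature dimension of the key;
[0028] Calculate the attention weights $A = \mathrm{softmax}(S)$;
[0029] Calculate the fused token features $Z = AV$;
[0030] Calculate the fused feature sample $F = \mathrm{Unflatten}(Z)$, where $\mathrm{Unflatten}(\cdot)$ is the inverse transform function corresponding to $\mathrm{Flatten}(\cdot)$ and $d_v$ is the feature dimension of the value.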
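The fusion steps above (token conversion, query/key/value projection, scaled dot-product scores, attention weights, and unflattening) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the application's implementation: the inputs are assumed to be already channel-mapped and spatially aligned to a common (C, H, W) size, all three token sequences are assumed to be concatenated before projection, and the averaging of the three per-scale token groups before unflattening is a hypothetical choice made only so that the output is a single feature map.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def fuse(L_al, M_al, D_al, W_q, W_k, W_v):
    """Fuse three aligned (C, H, W) feature maps with scaled dot-product attention."""
    C, H, W = L_al.shape
    flatten = lambda x: x.reshape(C, H * W).T                 # (H*W, C) token sequence
    X = np.concatenate([flatten(L_al), flatten(M_al), flatten(D_al)], axis=0)
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    S = Q @ K.T / np.sqrt(K.shape[-1])                        # scaled dot-product scores
    A = softmax(S)                                            # attention weights
    Z = A @ V                                                 # (3*H*W, d_v)
    # Assumption: average the three per-scale token groups, then invert the flattening
    Z = Z.reshape(3, H * W, -1).mean(axis=0)                  # (H*W, d_v)
    return Z.T.reshape(-1, H, W), A

rng = np.random.default_rng(0)
C, H, W, d_k, d_v = 3, 2, 2, 4, 5                             # hypothetical sizes
feats = [rng.normal(size=(C, H, W)) for _ in range(3)]
W_q, W_k = rng.normal(size=(C, d_k)), rng.normal(size=(C, d_k))
W_v = rng.normal(size=(C, d_v))
F, A = fuse(*feats, W_q, W_k, W_v)
print(F.shape, A.shape)  # (5, 2, 2) (12, 12)
```

Because every row of `A` sums to 1, each output token is a convex combination of value vectors drawn from all three feature levels, which is what lets the weights be computed dynamically per sample.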
[0031] Optionally, the generator loss function is $\mathcal{L}_G = \mathcal{L}_{adv}^{G} + \lambda_{cls}\,\mathcal{L}_{cls}^{G} + \lambda_{sp}\,\mathcal{L}_{sp}$, where $\tilde{x} = G(z, y)$ is the fused feature sample generated by the generator under the random vector $z$ and the condition information $y$; $\mathcal{L}_{adv}^{G}$ is the adversarial generation loss, in which $\mathbb{E}$ denotes the expected value over the random vector $z$ and the condition information $y$, and $D$ is the output of the discriminator, taking the value 0 or 1; $\mathcal{L}_{cls}^{G}$ is the classification generation loss, in which $C$ denotes the output class of the classifier and $\mathrm{CE}$ denotes the cross-entropy loss function $\mathrm{CE} = -\sum_{k=1}^{J} y_k \log p_k$; $J$ is the total number of categories in the classification task and $k$ denotes the $k$-th category; $y_k$ is the $k$-th component of the true label vector; $p_k$ is the predicted probability of the classifier for the $k$-th class; $\lambda_{cls}$ is the weight coefficient of the classification generation loss; $\mathcal{L}_{sp}$ is the sparse regularization term over the gate variables, in which $\mathcal{M}$ is the set of introduced gating variables; $\lambda_{sp}$ is the weight coefficient of the sparse regularization term; $\mathbb{1}(\cdot)$ denotes an indicator function; $i$ is the index along the height direction of the fused feature sample and $j$ is the index along the width direction of the fused feature sample; and the feature response value of the $c$-th channel at spatial location $(i, j)$ is used, where $c$ is the channel index; and
[0032] Calculate the average activation of each gating variable, $\bar{m}_g = \frac{1}{T}\sum_{t=1}^{T} m_g^{(t)}$, where $\bar{m}_g$ is the average activation value of the $g$-th gating variable and $T$ is the number of time steps used to count the activation state of the $g$-th gating variable during training.
[0033] If $\bar{m}_g < \tau$, delete the channel or path corresponding to the gating variable $m_g$, where $\tau$ is the preset threshold.
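The gate-pruning rule (average each gating variable's activation over the counted time steps, then remove channels or paths whose average falls below the preset threshold) can be sketched as follows. The function name `prune_gates`, the array layout, and the sample values are assumptions for illustration only.

```python
import numpy as np

def prune_gates(gate_history, tau):
    """gate_history: (T, G) activations of G gating variables over T time steps.

    Returns the per-gate average activation and a boolean mask of gates to keep.
    """
    mean_activation = gate_history.mean(axis=0)
    keep = mean_activation >= tau          # gates below the threshold are pruned
    return mean_activation, keep

# Hypothetical history: T = 2 time steps, G = 3 gating variables
history = np.array([[1.0, 0.0, 0.2],
                    [1.0, 0.0, 0.4]])
mean_activation, keep = prune_gates(history, tau=0.25)
print(keep)  # [ True False  True]
```

In a real training run the history would have one row per counted time step, so T grows with the number of batches processed.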
[0034] Optionally, the discriminator loss function is:
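[0035] $\mathcal{L}_D = \mathcal{L}_{adv}^{D} + \lambda_{ssl}\,\mathcal{L}_{ssl} + \lambda_{cc}\,\mathcal{L}_{cc}$, where $\mathcal{L}_{adv}^{D}$ is the adversarial discrimination loss, $\mathcal{L}_{ssl}$ is the self-supervised representation learning loss, $\mathcal{L}_{cc}$ is the causal consistency constraint loss, and $\lambda_{ssl}$ and $\lambda_{cc}$ are preset weighting coefficients;
[0036] In the adversarial discrimination loss $\mathcal{L}_{adv}^{D}$, $D(x)$ is the probability, output by the discriminator, that the sample is real; $\mathbb{E}_{x \sim p_{data}}$ denotes the mathematical expectation over real samples $x$ that conform to the true data distribution $p_{data}$; and $\mathbb{E}_{z \sim p_z}$ denotes the expected value over random vectors $z$ that conform to the latent noise distribution $p_z$;
[0037] In the self-supervised representation learning loss $\mathcal{L}_{ssl}$, $x$ is the real sample, $x'$ is the sample generated after cropping, flipping, or adding noise to $x$, and $f(\cdot)$ denotes the intermediate features of the discriminator;
[0038] In the causal consistency constraint loss $\mathcal{L}_{cc}$, $x''$ is the sample generated after adding texture perturbation or noise to $x$.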
[0039] Optionally, the classifier includes a feature embedding module, an SP-Conformer combination module, and a classification head module. The SP-Conformer combination module includes U SP-Conformer modules and a convolutional neural network module. The classification of the fused feature samples using a self-attention algorithm includes:
[0040] The feature embedding module calculates the embedded feature vectors $E = [e_1, e_2, \ldots, e_r] \in \mathbb{R}^{r \times d}$, where $r$ is the number of tokens, $d$ is the feature embedding dimension of the token, and $e_i$ is the embedded feature vector of the $i$-th token obtained after tokenizing the fused feature sample;
[0041] The SP-Conformer combination module constructs the query $Q = E W^{Q}$, key $K = E W^{K}$, and value $V = E W^{V}$, where $W^{Q}$ is the learnable query matrix, $W^{K}$ is the learnable key matrix, $W^{V}$ is the learnable value matrix, and $d_h$ is the embedding dimension of single-head attention; and,
[0042] Computes the sparse plastic attention, where ⊙ is the Hadamard element-wise multiplication operator; $G$ is the sparse selection mask matrix, whose elements satisfy $G_{ij} \in [0,1]$; $\mathbb{1}(\cdot)$ is the indicator function; $s(\cdot)$ is the scoring function, which uses learnable coefficient matrices; $d_s$ is the mapped feature dimension; and $\theta$ is the adaptive threshold;
[0043] The SP-Conformer combination module calculates the feature map of each channel $c$ of the convolutional neural network, where $W_c$ is the weight of the convolution kernel corresponding to channel $c$, a learnable parameter obtained through backpropagation training; $g_c \in [0,1]$ is the gate variable for channel $c$; $X$ is the output feature of the SP-Conformer module; and $\tau_c$ is the preset threshold. If $g_c < \tau_c$, the SP-Conformer combination module skips the calculation of channel $c$;
[0044] The classification head module takes the feature map as input and outputs the classification result.
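The two sparsity mechanisms in the classifier (an attention mask applied by Hadamard product, and channel gates that let low-valued channels be skipped) can be illustrated with the NumPy sketch below. Several details are assumptions, since the text does not fix them here: the mask is built by comparing raw attention scores with the row mean as a stand-in adaptive threshold, the mask is applied after the softmax, and a pointwise (1×1) convolution stands in for the convolutional neural network module. The function names are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def sparse_masked_attention(E, W_q, W_k, W_v):
    """E: (r, d) token embeddings. Returns attention output and binary mask G."""
    Q, K, V = E @ W_q, E @ W_k, E @ W_v
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    theta = S.mean(axis=-1, keepdims=True)   # assumed adaptive threshold: row mean
    G = (S >= theta).astype(float)           # indicator function -> sparse mask
    return (softmax(S) * G) @ V, G           # Hadamard product with the mask

def gated_pointwise_conv(X, kernels, gates, tau):
    """X: (C_in, H, W); kernels: (C_out, C_in); gates: (C_out,) in [0, 1]."""
    out = np.zeros((kernels.shape[0],) + X.shape[1:])
    for c, (w_c, g_c) in enumerate(zip(kernels, gates)):
        if g_c < tau:
            continue                         # skip the computation for channel c
        out[c] = g_c * np.tensordot(w_c, X, axes=(0, 0))
    return out

rng = np.random.default_rng(0)
E = rng.normal(size=(6, 8))                  # hypothetical: r = 6 tokens, d = 8
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
attn_out, G = sparse_masked_attention(E, W_q, W_k, W_v)

X = rng.normal(size=(3, 5, 5))
kernels = rng.normal(size=(2, 3))
Y = gated_pointwise_conv(X, kernels, gates=np.array([0.9, 0.05]), tau=0.1)
print(attn_out.shape, Y.shape)  # (6, 4) (2, 5, 5)
```

Skipping gated-off channels, rather than multiplying them by a small gate value, is what yields the computational saving.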
[0045] Optionally, the classifier loss function is $\mathcal{L}_C = \mathcal{L}_{CE} + \lambda_1\,\mathcal{L}_{gate} + \lambda_2\,\mathcal{L}_{mask}$, where $\mathcal{L}_{CE}$ is the cross-entropy loss; $P$ is the total number of categories in the classification task; $y_p$ is the $p$-th component of the true category label; $\hat{y}_q$ is the predicted score of the classifier for the $q$-th class; $\mathcal{L}_{gate}$ is the channel gating sparsity loss; $\mathcal{L}_{mask}$ is the attention mask sparsity loss; $\lambda_1$ and $\lambda_2$ are preset weighting coefficients; $u$ is the index of the modules that introduce a sparse attention mechanism; $g^{(u)}$ denotes the channel gating variables of the $u$-th module that introduces a sparse attention mechanism; and $G^{(u)}$ is the attention mask matrix corresponding to the $u$-th module that introduces a sparse attention mechanism.
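A classifier loss of this shape, cross-entropy plus weighted sparsity penalties on the channel gates and attention masks, might be computed as in the following sketch. The exact penalty forms are not recoverable from the text, so the mean absolute value (an L1-style penalty) used for both sparsity terms is an assumption, as are the function name and the default weights.

```python
import numpy as np

def classifier_loss(probs, y_onehot, gates, masks, lam_gate=0.01, lam_mask=0.01):
    """Cross-entropy plus assumed L1-style sparsity penalties.

    probs, y_onehot: (num_classes,) predicted probabilities and one-hot label.
    gates, masks: lists with one gate vector / mask matrix per sparse module.
    """
    ce = -np.sum(y_onehot * np.log(probs + 1e-12))    # small epsilon avoids log(0)
    gate_term = sum(np.abs(g).mean() for g in gates)  # channel gating sparsity
    mask_term = sum(np.abs(G).mean() for G in masks)  # attention mask sparsity
    return ce + lam_gate * gate_term + lam_mask * mask_term

loss = classifier_loss(
    probs=np.array([0.5, 0.5]),
    y_onehot=np.array([1.0, 0.0]),
    gates=[np.array([1.0, 0.0])],
    masks=[np.ones((2, 2))],
)
print(round(float(loss), 4))  # 0.7081
```

Raising either weight pushes more gates or mask entries toward zero at the cost of a higher cross-entropy term, so the weights trade accuracy against sparsity.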
[0046] Optionally, the malicious code image recognition method based on conditional generative adversarial networks further includes:
[0047] Randomly crop or flip the original real samples to generate augmented samples;
[0048] The augmented sample is merged with the original real sample to generate the real sample.
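The augmentation step (randomly crop or flip each original real sample, then merge the augmented samples with the originals) can be sketched as below. The 50/50 choice between flip and crop, the crop margin, and the zero-padding back to the original size are assumptions made so that the merged set has a uniform shape; the function name `augment` is hypothetical.

```python
import numpy as np

def augment(images, rng, crop=2):
    """images: (N, H, W) grayscale samples. Returns originals plus one augmented copy each."""
    augmented = []
    for img in images:
        if rng.random() < 0.5:
            aug = img[:, ::-1]                           # horizontal flip
        else:
            H, W = img.shape
            top = int(rng.integers(0, crop + 1))
            left = int(rng.integers(0, crop + 1))
            cropped = img[top:H - crop + top, left:W - crop + left]
            # Assumption: zero-pad the crop back to (H, W) at its original position
            aug = np.pad(cropped, ((top, crop - top), (left, crop - left)))
        augmented.append(aug)
    return np.concatenate([images, np.stack(augmented)], axis=0)

rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(3, 8, 8)).astype(np.uint8)  # hypothetical samples
merged = augment(images, rng)
print(merged.shape)  # (6, 8, 8)
```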
[0049] In a second aspect of this application, a malicious code image recognition system based on conditional generative adversarial networks is provided, the system comprising:
[0050] Conditional Generative Adversarial Network (GAN), comprising a generator, a discriminator, and a classifier:
[0051] The generator is configured to receive an input random vector and preset condition information, generate shallow features, mid-level features and deep features based on the random vector and the condition information, obtain fused feature samples based on the shallow features, the mid-level features and the deep features, and calculate the generator loss function.
[0052] The discriminator is used to receive the fused feature sample and the real sample, determine whether the fused feature sample is the real sample based on the fused feature sample and the real sample, and calculate the discriminator loss function, wherein the real sample is a real malicious code image sample;
[0053] The classifier is used to receive the fused feature samples, the real samples, and preset category information, classify the fused feature samples using a self-attention algorithm, and calculate the classifier loss function.
[0054] The generator, the discriminator, and the classifier are further used to perform adversarial training based on the generator loss function, the discriminator loss function, and the classifier loss function until the generator loss function, the discriminator loss function, and the classifier loss function meet preset conditions;
[0055] The conditional generative adversarial network is also used to classify input malicious code image samples.
[0056] In a third aspect of this application, an electronic device is provided, comprising:
[0057] At least one processor; and,
[0058] A memory communicatively connected to at least one of the processors; wherein,
[0059] The memory stores instructions that can be executed by the processor to implement the aforementioned malicious code image recognition method based on conditional generative adversarial networks.
[0060] In a fourth aspect of this application, a computer-readable storage medium is provided, the computer-readable storage medium storing computer instructions for execution by the computer to implement the above-described malicious code image recognition method based on conditional generative adversarial networks.
[0061] In a fifth aspect of this application, a computer program product containing instructions is provided, which, when executed by a computer device, cause the computer device to perform the aforementioned malicious code image recognition method based on a conditional generative adversarial network.
[0062] The malicious code image recognition method and system based on conditional generative adversarial networks provided in this application have the following beneficial effects:
[0063] 1. Achieving efficient expansion of training data and improving the quality and security of data augmentation: This application uses a lightweight conditional generative adversarial network to generate virtual samples. Compared with existing simple feature perturbation methods, it can generate more diverse samples that conform to the feature distribution of malicious code images, effectively alleviating the problem of small sample scarcity. At the same time, it uses a cross-scale feature activation mechanism to perform weighted fusion of features at different levels, combined with a sparse gating mechanism to suppress redundant features and invalid connections, which significantly improves the quality and generation efficiency of the generated samples. It avoids the defects of existing generation models in terms of stability and distribution consistency. The generated virtual samples do not need to rely on the direct reuse of real malicious code data, making it easier to meet security compliance requirements.
[0064] 2. Enhanced feature modeling capabilities, improving recognition accuracy and generalization: This application employs a sparse plastic Conformer (SP-Conformer) module in the feature extraction and classification stages. Through a dynamic sparse attention mechanism, it efficiently models the global dependencies of malicious code. Combined with a channel gating mechanism, it accurately captures key local details, addressing the inability of existing metric learning methods to comprehensively characterize multi-stage attack chain features and the poor stability of their embedding space. The method and system provided in this application can adapt to the rapid updates of malicious code families and the large differences in their feature distributions. This allows the model to fully learn effective features even under small sample conditions, significantly improving recognition accuracy and generalization to new variants.
[0065] 3. Lightweight and highly adaptable, suitable for real-world applications: This application introduces sparsity design in both the generative network and classification modules. By simplifying the model structure through sparse gating and dynamic sparse attention mechanisms, it reduces computational resource consumption and removes the dependence of existing meta-learning methods on large-scale base task sets. This enables efficient deployment in resource-constrained real-world security scenarios. At the same time, the model's adaptive design allows it to quickly adapt to different platforms and different types of malicious code features, exhibiting strong robustness and making it widely applicable in the fields of malicious code detection and network security protection.
Attached Figure Description
[0066] Figure 1 is a flowchart illustrating one implementation of the malicious code image recognition method based on conditional generative adversarial networks according to this application.
[0067] Figure 2 is a structural block diagram of one embodiment of the malicious code image recognition system based on conditional generative adversarial networks of this application.
Detailed Implementation
[0068] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description, in conjunction with the accompanying drawings and examples, further clarifies this application. It should be understood that the specific examples described herein are merely illustrative and not intended to limit the scope of this application. Furthermore, the technical features involved in the various embodiments of this application described below can be combined with each other as long as they do not conflict with each other.
[0069] The present application will now be described in detail with reference to the accompanying drawings. A first aspect of this application provides a method for malicious code image recognition based on conditional generative adversarial networks. Figure 1 is a flowchart of one implementation of the malicious code image recognition method based on conditional generative adversarial networks of this application. As shown in Figure 1, the malicious code image recognition method based on conditional generative adversarial networks in the first embodiment of this application includes:
[0070] Step S101: Construct a conditional generative adversarial network, which includes a generator, a discriminator, and a classifier;
[0071] Step S102: Use the constructed conditional generative adversarial network to classify the input malicious code image samples.
[0072] Specifically, step S101 includes:
[0073] Step S1011: The generator receives an input random vector and preset condition information, generates shallow features, mid-level features and deep features based on the random vector and the condition information, and obtains a fused feature sample based on the shallow features, the mid-level features and the deep features.
[0074] In step S1012, the discriminator receives the fused feature sample and the real sample, and determines whether the fused feature sample is the real sample based on the fused feature sample and the real sample. The real sample is a real malicious code image sample.
[0075] Step S1013: The classifier receives the fused feature sample, the real sample, and preset category information, and uses a self-attention algorithm to classify the fused feature sample;
[0076] Step S1014: Calculate the discriminator loss function, the classifier loss function, and the generator loss function;
[0077] In step S1015, the generator, the discriminator, and the classifier undergo adversarial training based on the generator loss function, the discriminator loss function, and the classifier loss function until the generator loss function, the discriminator loss function, and the classifier loss function satisfy preset conditions.
[0078] Traditional conditional generative adversarial networks (GANs) consist of a generator and a discriminator. The generator produces virtual samples that approximate real samples, while the discriminator is responsible for distinguishing between real and virtual samples. The generator's goal is to generate realistic virtual samples that make it difficult for the discriminator to distinguish between real and virtual samples. The discriminator's goal is to improve the accuracy of distinguishing between real and virtual samples. The generator and discriminator are trained adversarially with a loss function as the link to achieve a dynamic balance between their goals, ultimately achieving the goal of the generator outputting highly realistic (indistinguishable from real) virtual samples.
[0079] This application presents a malicious code image recognition method based on Conditional Generative Adversarial Networks (CGANs). Building upon traditional CGANs, a classifier is introduced. The proposed CGAN includes a generator, a discriminator, and a classifier. The classifier classifies received samples, including virtual samples generated by the generator and real samples. Similar to the adversarial mechanism between the generator and discriminator, the classifier, generator, and discriminator also undergo adversarial training using a loss function as a link to achieve optimization, resulting in a dynamically balanced CGAN. Ultimately, this achieves the goals of the generator outputting highly realistic (indistinguishable from real) virtual samples, the discriminator accurately identifying the samples, and the classifier improving classification accuracy.
[0080] Specifically, in step S1011, the generator receives an input random vector and condition information, the condition information including malicious code category information. It generates shallow features, mid-level features, and deep features based on the random vector and the condition information, and obtains a fused feature sample based on the shallow features, mid-level features, and deep features. Traditional generators directly generate virtual samples based on random vectors and condition information. In this application, the generator first generates shallow features, mid-level features, and deep features based on the random vector and the condition information. The shallow features are low-level texture and edge features near the input, the mid-level features are features reflecting local structure and pattern relationships, and the deep features are features containing global semantic and category discrimination information. Then, based on the shallow features, mid-level features, and deep features, a fused feature sample, i.e., a virtual sample, is obtained. Specifically, the shallow features $L \in \mathbb{R}^{C_L \times H_L \times W_L}$ are computed from $z$ and $y$ through the fully connected layer $\mathrm{FC}(\cdot)$ and $n$ convolutional layers $\mathrm{Conv}(\cdot)$, where $\mathrm{Conv}(\cdot)$ denotes convolution; $n$ is the number of convolutional layers included in the shallow feature extraction stage and can be 1, 2, or 3; $C_L$ is the number of channels of the shallow features; $H_L$ and $W_L$ are the height and width of the shallow features in the spatial dimension, respectively; $z$ is a random vector; $y$ is preset condition information; and $\mathrm{FC}(\cdot)$ denotes a fully connected layer. The mid-level features $M \in \mathbb{R}^{C_M \times H_M \times W_M}$ are obtained by upsampling, where $\mathrm{Up}(\cdot)$ denotes upsampling; $C_M$ is the number of channels of the mid-level features; and $H_M$ and $W_M$ are the height and width of the mid-level features in the spatial dimension, respectively. The deep features $D$ are computed from the mid-level features $M$ by multi-head self-attention with a learnable coefficient, where $\mathrm{MSA}(\cdot)$ denotes multi-head self-attention computation, in which computation is performed in parallel using multiple independent attention heads; $\mathrm{Concat}(\cdot)$ denotes concatenation; $H_a$ is the number of attention heads in multi-head self-attention; $d$ is the feature embedding dimension of the token; $W_Q$, $W_K$, and $W_V$ are the query mapping matrix, key mapping matrix, and value mapping matrix, respectively; $W_E$ is the feature embedding mapping matrix, used to map the flattened vector of each local sub-region of the mid-level features into the feature embedding space of the token, thereby achieving a linear projection from the original spatial feature representation to the Transformer token representation; $b_E$ is a preset embedding bias vector, used to translate and adjust features in the feature embedding space during the mapping process, thereby enhancing feature representation and improving the learning stability of the model; $c_p$ is the side length of each local sub-region (patch) when the mid-level features are divided in the spatial dimension; the tensor dimension transformation functions involved are commonly used ones; $\Omega_s$ is the set of all spatial locations covered by the $s$-th patch; and $M_s$ is the $s$-th sub-region of the mid-level features, obtained by dividing the mid-level features $M$ into $P \times P$ non-overlapping regions in the spatial dimension.
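[0081] Specifically, in step S1011, the shallow features $L$, the mid-level features $M$, and the deep features $D$ are each mapped to the same channel dimension through 1×1 convolution and spatially aligned to obtain $\tilde{L} = \mathrm{Align}(\mathrm{Conv}_{1\times 1}(L))$, $\tilde{M} = \mathrm{Align}(\mathrm{Conv}_{1\times 1}(M))$, and $\tilde{D} = \mathrm{Align}(\mathrm{Conv}_{1\times 1}(D))$. Wherein, the same channel dimension is a preset feature embedding dimension; $\mathrm{Align}(\cdot)$ is a commonly used spatial alignment function; $C_L \times H_L \times W_L$, $C_M \times H_M \times W_M$, and $C_D \times H_D \times W_D$ are the sizes of $L$, $M$, and $D$, respectively, in which $H$ and $W$ with a subscript are the spatial dimensions and $C$ with a subscript is the channel dimension; $\tilde{L}$, $\tilde{M}$, and $\tilde{D}$ each have size $C \times H \times W$, where $H \times W$ is the spatial dimension of $\tilde{L}$, $\tilde{M}$, and $\tilde{D}$ after alignment and $C$ is the feature embedding dimension. $\tilde{L}$, $\tilde{M}$, and $\tilde{D}$ are each converted into a token sequence using $\mathrm{Flatten}(\cdot)$. The query $Q$, key $K$, and value $V$ are calculated from the token sequences using the learnable matrices $W_Q$, $W_K$, and $W_V$, where $[\,;\,]$ denotes concatenation of token sequences. The scaled dot-product attention score matrix is calculated as $S = QK^{\top} / \sqrt{d_k}$, where $d_k$ is the feature dimension of the key; the attention weights are calculated as $A = \mathrm{softmax}(S)$; the fused token features are calculated as $Z = AV$; and the fused feature sample is calculated as $F = \mathrm{Unflatten}(Z)$, where $\mathrm{Unflatten}(\cdot)$ is a commonly used inverse transform function corresponding to $\mathrm{Flatten}(\cdot)$ and $d_v$ is the feature dimension of the value.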
[0082] By first dividing the features into layers, then dynamically calculating the weights of the layered features through an attention mechanism, and then merging the layered features according to the weights, virtual samples can be generated. This enables the adaptive fusion of shallow details and deep semantics of the samples, thereby improving the quality of the virtual samples generated by the generator.
[0083] Specifically, steps S1012 and S1013 can refer to existing discriminators and classifiers for discrimination and classification. In one possible implementation, the classifier includes a feature embedding module, an SP-Conformer combination module, and a classification head module. The SP-Conformer combination module includes U SP-Conformer modules and a convolutional neural network module. Step S1013 may include: the feature embedding module calculating an embedded feature vector. Where r is the number of tokens, d is the feature embedding dimension of the token. The embedded feature vector of the i-th token obtained after tokenizing the fused feature samples; the SP-Conformer combination module constructs the query. ,key ,value ,in, For learnable query matrix, For learnable key matrix, For a learnable value matrix, For single-head attention, the embedding dimension is defined; and for sparse plastic attention, the embedding dimension is calculated. Where ⊙ is the Hadamard element-wise multiplication operator, G is the sparse selection mask matrix, and the elements of G are G ij [0,1], , For indicator functions, For the scoring function, , , The learnable coefficient matrix, The mapped feature dimensions, For adaptive threshold, The SP-Conformer compositing module calculates the feature map of each channel c of the convolutional neural network. ,in, The weights of the convolution kernel corresponding to channel c. These are the learnable parameters obtained through backpropagation training. Let be the gate variable for channel c. [0,1], where X is the output feature of the SP-Conformer module. , If the preset threshold is used, If the SP-Conformer combination module abandons the calculation of channel c, the classification head module takes the feature map as input and outputs the classification result. 
The classifier receives the fused feature samples generated by the generator, the real samples, and the preset category information, where the category information is the category of malicious code, such as ACBackdoor. In the classifier, each SP-Conformer module sequentially includes a sparse plastic self-attention sublayer and a sparse channel convolutional sublayer, where the self-attention sublayer is used to model the global dependencies between input features, and the convolutional sublayer is used to model local structural features. Tokenization originally refers to the string-processing technique of segmenting text into discrete units (tokens); here it is applied to the fused feature samples.
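As a rough illustration of the masked attention described above, the following sketch computes single-head scaled dot-product attention and multiplies the attention matrix element-wise by a binary selection mask G. The exact scoring function and adaptive threshold of the application are not reproduced here; this sketch assumes, purely for illustration, that the score is the attention weight itself and that the threshold is the row mean. All function names are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def sparse_masked_attention(q, k, v):
    """Single-head attention with a binary selection mask G (sketch)."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    attn = [softmax(row) for row in scores]
    mask = []
    for row in attn:
        tau = sum(row) / len(row)  # assumed adaptive threshold: row mean
        mask.append([1 if a >= tau else 0 for a in row])
    # Hadamard product of the attention matrix and the mask, then apply to V.
    sparse_attn = [[a * g for a, g in zip(a_row, g_row)]
                   for a_row, g_row in zip(attn, mask)]
    return mask, matmul(sparse_attn, v)

# Three tokens with identity projections (Q = K = V = tokens) for brevity.
tokens = [[2.0, 0.0], [0.0, 2.0], [2.0, 0.0]]
mask, out = sparse_masked_attention(tokens, tokens, tokens)
```

Tokens 0 and 2 are identical, so each attends to the other and to itself, while the connection to the dissimilar token 1 falls below the threshold and is masked out.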
[0084] Specifically, in step S1014, the generator loss function can be $L_G = L_{adv} + \lambda_{cls} L_{cls} + \lambda_{sp} L_{sp}$, where $G(z, y)$ denotes the fused feature samples generated by the generator under the random vector z and conditional information y. $L_{adv}$ is the adversarial generation loss, computed as the mathematical expectation, over the random vector z and the conditional information y, of a function of the discriminator output $D(\cdot)$, where D is 0 or 1. $L_{cls}$ is the classification generation loss, $L_{cls} = -\sum_{k=1}^{J} y_k \log \hat{y}_k$, where C represents the output class of the classifier, CE represents the cross-entropy loss function, J represents the total number of categories in the classification task, k represents the k-th category, $y_k$ represents the k-th component of the true label vector, and $\hat{y}_k$ represents the predicted probability of the classifier for the k-th class. $\lambda_{cls}$ is the weight coefficient of the classification generation loss. $L_{sp}$ is the sparse regularization term defined over the gate variables, M is the set of introduced gating variables, and $\lambda_{sp}$ is the weight coefficient of the sparse regularization term. The sparse regularization term uses an indicator function evaluated on the feature response value of the c-th channel at spatial location (i, j), where i is the index along the height direction of the fused feature sample, j is the index along the width direction of the fused feature sample, and c is the channel index. Simultaneously, the average activation value of each gating variable is calculated as $\bar{m}_g = \frac{1}{T} \sum_{t=1}^{T} m_g^{(t)}$, where $\bar{m}_g$ represents the average activation value of the g-th gating variable, $m_g^{(t)}$ is its value at time step t, and T represents the number of time steps used to count the activation state of the g-th gating variable during training. For example, if the training consists of 50 epochs and each epoch contains 100 batches, then T = 5000. If $\bar{m}_g$ is below the preset threshold, the channel or path corresponding to the gating variable is deleted; otherwise, the gating variable is retained.
Based on the average activation of the gate variable during training, channels or subnetworks with consistently low activation probabilities are structurally pruned. The gate variable is approximately binary; when it is 0, the corresponding convolutional channel, attention head, or subnetwork path is closed in the forward computation, and the corresponding convolution or attention operation is skipped, thus reducing redundant computation. When the gate variable is 1, the complete computation process of the corresponding convolutional channel, attention head, or subnetwork path is retained. Furthermore, to further develop a stable sparse structure, a sparsity regularization term or activation penalty term for the gate variable can be introduced into the training loss function.
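The pruning rule based on average activation can be sketched as follows, reusing the T = 5000 example from paragraph [0084]. The gate names, the recorded histories, and the threshold value of 0.05 are illustrative assumptions, not values from the application.

```python
def average_activation(gate_history):
    """Mean gate value over the T recorded training steps."""
    return sum(gate_history) / len(gate_history)

def select_pruned_gates(histories, threshold):
    """Return the names of gates whose average activation is below threshold."""
    return sorted(name for name, hist in histories.items()
                  if average_activation(hist) < threshold)

T = 50 * 100  # 50 epochs x 100 batches per epoch

# Hypothetical per-step gate states (1 = open, 0 = closed).
histories = {
    "channel_0": [1] * 4500 + [0] * 500,   # open on 90% of steps
    "channel_1": [1] * 100 + [0] * 4900,   # open on 2% of steps
    "head_0": [1] * 2500 + [0] * 2500,     # open on 50% of steps
}

pruned = select_pruned_gates(histories, threshold=0.05)
```

Only `channel_1` falls below the assumed threshold, so it is the only structure selected for removal; the other two gates are retained.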
[0085] Specifically, in step S1014, the discriminator loss function can be $L_D = L_{adv}^{D} + \lambda_{ssl} L_{ssl} + \lambda_{cc} L_{cc}$, where $L_{adv}^{D}$ is the adversarial discrimination loss, $L_{ssl}$ is the self-supervised representation learning loss, $L_{cc}$ is the causal consistency constraint loss, and $\lambda_{ssl}$ and $\lambda_{cc}$ are preset weighting coefficients. In the adversarial discrimination loss, $D(\cdot)$ is the probability, output by the discriminator, that a sample is real; $\mathbb{E}_{x \sim p_{data}}$ takes the mathematical expectation over real samples x that follow the true data distribution $p_{data}$; and $\mathbb{E}_{z \sim p_z}$ takes the mathematical expectation over random vectors z that follow the latent noise distribution $p_z$. In the self-supervised representation learning loss, x is the real sample, $x'$ is the sample generated after cropping, flipping, or adding noise to x, and $f(\cdot)$ represents the intermediate features of the discriminator. In the causal consistency constraint loss, $\tilde{x}$ is the sample generated by adding texture perturbation or noise to x. The loss function of the discriminator in this application introduces self-supervised representation learning and causal consistency constraints as additional loss terms. The self-supervised representation learning constraint, through additional training on the samples, enables the discriminator to learn more stable and structurally stronger feature representations, improving its ability to identify subtle artifacts in virtual samples. The causal consistency constraint, by applying consistent perturbations to real and virtual samples and comparing the resulting response changes in the discriminator, enables the discriminator to perceive the causal structure of the samples. In adversarial training, this constraint can also be applied to the generator, ensuring that the generated virtual samples are consistent with real samples in terms of semantic change trends. At the same time, the causal consistency constraint maintains discriminative stability under different perturbation conditions, avoiding the mode collapse problem commonly encountered in adversarial training.
[0086] Specifically, in step S1014, the classifier loss function can be $L_C = L_{CE} + \lambda_{g} L_{gate} + \lambda_{a} L_{mask}$, where $L_{CE} = -\sum_{p=1}^{P} y_p \log \frac{\exp(s_p)}{\sum_{q=1}^{P} \exp(s_q)}$ is the cross-entropy loss, P represents the total number of categories in the classification task, $y_p$ is the p-th component of the true category label, and $s_q$ is the predicted score of the classifier for the q-th class. $L_{gate}$ is the channel gating sparsity loss, $L_{mask}$ is the attention mask sparsity loss, and $\lambda_{g}$ and $\lambda_{a}$ are preset weighting coefficients. Both sparsity losses are accumulated over the modules that introduce sparse attention mechanisms, where $l$ is the index of such a module, $g^{(l)}$ denotes the channel gating variables of the $l$-th such module, and $G^{(l)}$ is the attention mask matrix corresponding to the $l$-th such module.
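A minimal sketch of a classifier loss with this three-part structure is given below. The cross-entropy term follows the standard softmax definition; the exact form of the two sparsity losses is not reproduced in the text, so the mean absolute value (an L1-style penalty) used here is an assumption, as are the function names and the default weights.

```python
import math

def cross_entropy(scores, one_hot):
    """Softmax cross-entropy for one sample; scores are unnormalised."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return -sum(y * (s - log_z) for y, s in zip(one_hot, scores))

def mean_abs(values):
    """Assumed sparsity penalty: mean absolute value (L1-style)."""
    return sum(abs(v) for v in values) / len(values)

def classifier_loss(scores, one_hot, gates_per_module, masks_per_module,
                    lambda_gate=0.01, lambda_mask=0.01):
    """Cross-entropy plus weighted gate and mask sparsity terms (sketch)."""
    ce = cross_entropy(scores, one_hot)
    # Sum the penalties over every module that uses sparse attention.
    gate_loss = sum(mean_abs(g) for g in gates_per_module)
    mask_loss = sum(mean_abs([m for row in mask for m in row])
                    for mask in masks_per_module)
    return ce + lambda_gate * gate_loss + lambda_mask * mask_loss

scores = [2.0, 0.0, 0.0]               # class scores for one sample
label = [1, 0, 0]                      # one-hot true category
gates = [[1.0, 0.0, 1.0, 0.0]]         # one module, half its channels open
masks = [[[1, 0], [0, 1]]]             # one module, half its mask entries kept
loss = classifier_loss(scores, label, gates, masks)
dense_loss = classifier_loss(scores, label, [[1.0] * 4], [[[1, 1], [1, 1]]])
```

With identical predictions, the fully open (dense) configuration yields a larger loss than the half-sparse one, which is the pressure toward sparsity that the two extra terms are meant to provide.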
[0087] The above process penalizes an excessive number of gating variables that remain active for extended periods (i.e., the number of gating variables that hold a value of 1 for longer than a preset threshold exceeds a preset limit), retaining only the channels and substructures that contribute significantly to the task. Furthermore, based on the average activation of gating variables during training, channels or sub-network paths with persistently low activation probabilities (activation probabilities below a preset threshold for longer than a preset time limit) can be structurally pruned, thereby further reducing network size and computational overhead without significantly decreasing recognition performance. Through the collaborative design of cross-scale, multi-layer feature fusion and sparse gating, not only can efficient information flow and fusion between different levels be achieved, but the computational path can also be adaptively adjusted, effectively improving the diversity and realism of generated samples. This is particularly suitable for expanding training data and enhancing the model's generalization ability in small-sample scenarios. A sparse selection matrix is introduced into the attention computation of the classifier. This matrix assigns a learnable gating variable to each element of the attention matrix, controlling whether that element participates in the computation (elements are excluded when set to zero or shrunk to near zero). This suppresses redundant connections in the attention matrix while preserving the key attention distributions. In one possible implementation, a Top-k selection strategy can be used: the k positions with the highest attention scores in each row or column of the attention matrix are selected and the remaining elements are set to zero, thus forcibly retaining the connections that contribute most.
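The Top-k selection strategy can be sketched as follows for the row-wise case. The function name and the toy attention matrix are illustrative; whether the retained entries are renormalized afterwards is not stated in the text, so this sketch leaves them unchanged.

```python
def top_k_mask_rows(attention, k):
    """Keep the k largest entries in each row; set the rest to zero."""
    sparse = []
    for row in attention:
        # Indices of the k highest scores (ties resolved by lower index).
        keep = set(sorted(range(len(row)), key=lambda j: -row[j])[:k])
        sparse.append([a if j in keep else 0.0 for j, a in enumerate(row)])
    return sparse

attention = [
    [0.50, 0.10, 0.30, 0.10],
    [0.05, 0.60, 0.05, 0.30],
]
sparse = top_k_mask_rows(attention, k=2)
```

Each row keeps exactly two non-zero connections, so the cost of the subsequent multiplication with the value matrix scales with k rather than with the full number of tokens.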
By using gating variables or the Top-k selection strategy to achieve global attention sparsity and suppress redundant connections, computational overhead is effectively reduced and the model's ability to model key dependencies is improved. Sparse channel selection is introduced into the convolutional layers of the classifier. Gating variables control the activation of convolutional channels. When the gating variable of a channel is below a preset threshold, the convolution operation corresponding to that channel is skipped in the forward computation, the output is set to zero, or it does not participate in subsequent feature aggregation. Convolution operations are only performed on a small number of key channels with gating variables above the threshold, achieving selective activation of the convolution kernel. The sparse attention and channel gating mechanisms described above can effectively reduce redundant computations and significantly improve computational efficiency. They can also dynamically activate sub-network structures of different sizes according to the input complexity, thereby enhancing the adaptability and generalization ability of the conditional generative adversarial network in this application under small sample scenarios. Under small sample conditions, the sparse attention and channel gating mechanisms can also effectively alleviate the overfitting problem, further improving the adaptability and robustness of the conditional generative adversarial network in this application to complex samples.
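The channel skipping behaviour described above can be sketched with a per-position 1×1 convolution. Scaling the output of a computed channel by its gate value is an assumption made for illustration, as are the function name, the threshold of 0.5, and the toy weights.

```python
def gated_channel_outputs(inputs, kernels, gates, threshold=0.5):
    """Per-channel 1x1 convolution at one spatial position with gating.

    inputs:  feature values at one position, one per input channel.
    kernels: kernels[c] is the weight vector of output channel c.
    gates:   gates[c] in [0, 1]; channels below threshold are skipped.
    """
    outputs, computed = [], 0
    for weights, gate in zip(kernels, gates):
        if gate < threshold:
            outputs.append(0.0)  # convolution skipped, output set to zero
            continue
        computed += 1
        # Assumed form: the gate scales the channel's convolution output.
        outputs.append(gate * sum(w * x for w, x in zip(weights, inputs)))
    return outputs, computed

inputs = [1.0, 2.0]
kernels = [[1.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
gates = [1.0, 0.1, 0.75]
outputs, computed = gated_channel_outputs(inputs, kernels, gates)
```

Only two of the three channels are actually computed here; the low-gate channel contributes a zero output without any multiply-accumulate work.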
[0088] Specifically, in step S1015, the AdamW optimization algorithm can be used to minimize each loss function. The generator, discriminator, and classifier undergo adversarial training based on the discriminator loss function, the classifier loss function, and the generator loss function until all three loss functions satisfy preset conditions. The adversarial training can use conventional adversarial training methods, which are therefore not elaborated upon here.
[0089] Specifically, after completing adversarial training between the generator and discriminator, and adversarial training between the generator and classifier, the conditional generative adversarial network of this application is constructed. Therefore, in step S102, the constructed conditional generative adversarial network can be used to classify the input malicious code image samples, so that corresponding defense measures can be implemented for different malicious code image types. Specifically, the classifier of the aforementioned network classifies the input malicious code image samples. The input samples undergo feature extraction and classification based on category information using the classifier's sparse plasticity Conformer modules, which output predicted category labels. Comparing these labels with the true category labels yields the classification result, from which the classification accuracy can be further calculated.
[0090] In one possible implementation, the real samples (denoted as the original real samples) can be expanded before they are input to the discriminator and classifier of the conditional generative adversarial network, in order to increase sample diversity and improve the model's generalization performance. Specifically, the original real samples can be randomly cropped or randomly flipped to generate expanded samples, and the expanded samples can then be added to the original real samples to obtain the real samples that are input to the conditional generative adversarial network.
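A minimal sketch of this expansion step is shown below, with images represented as lists of pixel rows. The crop size, the random choice between cropping and flipping, the fixed seed, and all function names are illustrative assumptions. In practice a cropped sample would also need to be resized back to the network's input size, which is omitted here.

```python
import random

def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

def random_flip(image, rng):
    """Flip horizontally or vertically, chosen at random."""
    if rng.random() < 0.5:
        return [row[::-1] for row in image]   # horizontal flip
    return [row[:] for row in image[::-1]]    # vertical flip

def expand_samples(originals, crop_h, crop_w, seed=0):
    """Return the originals plus one augmented copy of each."""
    rng = random.Random(seed)
    augmented = []
    for image in originals:
        if rng.random() < 0.5:
            augmented.append(random_crop(image, crop_h, crop_w, rng))
        else:
            augmented.append(random_flip(image, rng))
    return originals + augmented

# Two hypothetical 4x4 grayscale malware images.
originals = [
    [[r * 4 + c for c in range(4)] for r in range(4)],
    [[255 - (r * 4 + c) for c in range(4)] for r in range(4)],
]
real_samples = expand_samples(originals, crop_h=3, crop_w=3)
```

The returned list keeps every original sample and appends the augmented ones, matching the merge described in the paragraph.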
[0091] The malicious code image recognition method based on conditional generative adversarial networks provided in this application employs a lightweight conditional generative adversarial network to generate virtual samples. Combined with cross-scale feature activation and sparse gating mechanisms, it efficiently expands training data, improves generation quality and efficiency, and alleviates the problem of sample scarcity. Furthermore, generating virtual samples does not require reuse of real data, which makes compliance requirements easier to meet. The sparse plasticity Conformer (SP-Conformer) module enhances feature modeling capabilities through dynamic sparse attention and channel gating mechanisms, addressing the problems of incomplete feature characterization and unstable embedding space in existing methods. This improves recognition accuracy and generalization ability for new variants under small-sample conditions. The sparse design of the generator network and classification module simplifies the structure, reduces resource consumption, adapts to resource-constrained scenarios, and is suitable for different platforms and malicious code features. The method exhibits strong robustness and can be widely applied to malicious code detection and network security protection.
[0092] A second aspect of this application provides a malicious code image recognition system based on conditional generative adversarial networks. Figure 2 illustrates a structural block diagram of one embodiment of the malicious code image recognition system based on conditional generative adversarial networks of this application. As shown in Figure 2, the malicious code image recognition system based on conditional generative adversarial networks according to the second aspect of this application includes:
[0093] A conditional generative adversarial network, comprising a generator, a discriminator, and a classifier:
[0094] The generator is configured to receive an input random vector and preset condition information, generate shallow features, mid-level features and deep features based on the random vector and the condition information, obtain fused feature samples based on the shallow features, the mid-level features and the deep features, and calculate the generator loss function.
[0095] The discriminator is used to receive the fused feature sample and the real sample, determine whether the fused feature sample is the real sample based on the fused feature sample and the real sample, and calculate the discriminator loss function, wherein the real sample is a real malicious code image sample;
[0096] The classifier is used to receive the fused feature samples, the real samples, and preset category information, classify the fused feature samples using a self-attention algorithm, and calculate the classifier loss function.
[0097] The generator, the discriminator, and the classifier are further used to perform adversarial training based on the generator loss function, the discriminator loss function, and the classifier loss function until the generator loss function, the discriminator loss function, and the classifier loss function meet preset conditions;
[0098] The conditional generative adversarial network is also used to classify input malicious code image samples.
[0099] In one possible implementation, the malicious code image recognition system based on conditional generative adversarial networks further includes a preprocessing module, which is specifically used to: randomly crop or randomly flip the original real sample to generate an augmented sample; and merge the augmented sample with the original real sample to generate the real sample.
[0100] Those skilled in the art will understand that, for the sake of convenience and brevity, the specific working process and related descriptions of the system described above can be referred to the corresponding processes in the foregoing method embodiments, and therefore will not be repeated here.
[0101] The malicious code image recognition system based on conditional generative adversarial networks provided in this application uses a lightweight conditional generative adversarial network, combined with cross-scale feature activation and sparse gating, which can efficiently expand data, improve generation quality and efficiency, and alleviate sample scarcity. Moreover, the generation of virtual samples does not require the reuse of real data, which makes compliance requirements easier to meet. The system adopts a sparse plasticity Conformer (SP-Conformer) module to enhance feature modeling through dynamic sparse attention and channel gating, which addresses the incomplete feature characterization and unstable embedding space of existing methods and improves recognition accuracy and generalization ability under small-sample conditions. The system uses a sparse design to simplify the structure and reduce resource consumption. The model has strong adaptability and high robustness, and is well suited to resource-constrained scenarios.
[0102] It should be noted that the malicious code image recognition system based on conditional generative adversarial networks provided in the above embodiments is only an example of the division of the above functional modules. In practical applications, the above functions can be assigned to different functional modules as needed, that is, the modules or steps in the embodiments of this application can be further decomposed or combined. For example, the modules in the above embodiments can be merged into one module, or further divided into multiple sub-modules to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of this application are only for distinguishing the various modules or steps and are not considered as an improper limitation of this application.
[0103] In a third aspect of this application, an electronic device is also provided, the electronic device comprising: at least one processor; and a memory communicatively connected to at least one of the processors; wherein the memory stores instructions executable by the processor, the instructions being executed by the processor to implement the above-described malicious code image recognition method based on conditional generative adversarial networks.
[0104] In a fourth aspect of this application, a computer-readable storage medium is also provided, the computer-readable storage medium storing computer instructions for execution by the computer to implement the above-described malicious code image recognition method based on conditional generative adversarial networks.
[0105] In a fifth aspect of this application, a computer program product containing instructions is also provided, which, when executed by a computer device, cause the computer device to perform the aforementioned malicious code image recognition method based on conditional generative adversarial networks.
[0106] Those skilled in the art will recognize that the modules and method steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. The programs corresponding to the software modules and method steps can be placed in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disks, removable disks, CD-ROMs, or any other form of storage medium known in the art. To clearly illustrate the interchangeability of electronic hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in electronic hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0107] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0108] The terms “first”, “second”, etc., are used to distinguish similar objects, not to describe or indicate a specific order or sequence.
[0109] The term "comprising" or any other similar term is intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus / device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent in such process, method, article, or apparatus / device.
[0110] The technical solution of the present invention has been described above with reference to the preferred embodiments shown in the accompanying drawings. However, it will be readily understood by those skilled in the art that the scope of protection of the present invention is obviously not limited to these specific embodiments. Without departing from the principles of the present invention, those skilled in the art can make equivalent changes or substitutions to the relevant technical features, and the technical solutions after these changes or substitutions will all fall within the scope of protection of the present invention.
Claims
1. A method for malicious code image recognition based on conditional generative adversarial networks, characterized in that the method comprises: constructing a conditional generative adversarial network, wherein the conditional generative adversarial network includes a generator, a discriminator, and a classifier, and the construction of the conditional generative adversarial network includes: the generator receiving an input random vector and preset condition information, generating shallow features, mid-level features, and deep features based on the random vector and the condition information, and obtaining fused feature samples based on the shallow features, the mid-level features, and the deep features; the discriminator receiving the fused feature sample and the real sample, and determining whether the fused feature sample is the real sample based on the fused feature sample and the real sample, wherein the real sample is a real malicious code image sample; the classifier receiving the fused feature sample, the real sample, and preset category information, and using a self-attention algorithm to classify the fused feature sample; calculating the discriminator loss function, the classifier loss function, and the generator loss function; and the generator, the discriminator, and the classifier undergoing adversarial training based on the generator loss function, the discriminator loss function, and the classifier loss function, until the generator loss function, the discriminator loss function, and the classifier loss function satisfy preset conditions; and using the constructed conditional generative adversarial network to classify the input malicious code image samples.
2. The method as described in claim 1, characterized in that: the shallow features L are computed from the random vector z and the preset condition information y using a fully connected mapping and convolution calculation, where the number of convolutional layers included in the shallow feature extraction stage can be 1, 2, or 3, and the shallow features are characterized by their number of channels and by their height and width in the spatial dimension; the mid-level features M are computed using upsampling, and are characterized by their number of channels and by their height and width in the spatial dimension; and the deep features D are computed from the mid-level features using multi-head self-attention computation, where Concat indicates concatenation of the outputs of the attention heads, a learnable coefficient is applied in the computation, the number of attention heads in the multi-head self-attention is preset, d is the feature embedding dimension of a token, and each attention head has its own query mapping matrix, key mapping matrix, and value mapping matrix. The tokens are obtained as follows: the mid-level features M are divided in the spatial dimension into P×P non-overlapping local sub-regions, where cp is the side length of each local sub-region and the s-th sub-region covers a fixed set of spatial locations; a tensor dimension transformation function expands the s-th sub-region of the mid-level features into a vector; a feature embedding mapping matrix maps the expanded vector of each local sub-region to the feature embedding space of the token; and a preset embedding bias vector translates and adjusts the features in the feature embedding space during the mapping.
3. The method as described in claim 2, characterized in that the step of obtaining fused feature samples based on the shallow features, the mid-level features, and the deep features includes: mapping the shallow feature L, the mid-level feature M, and the deep feature D to the same channel dimension through 1×1 convolution and spatially aligning them with a spatial alignment function to obtain $L'$, $M'$, and $D'$, wherein the same channel dimension is a preset feature embedding dimension, the sizes of L, M, and D are each given by their spatial dimensions and their channel dimension, and $L'$, $M'$, and $D'$ share the same spatial dimensions after alignment, with C being the feature embedding dimension; converting $L'$, $M'$, and $D'$ each into a token sequence; calculating the query Q, the key K, and the value V from the token sequences, where [;] indicates splicing (concatenation) and $W_Q$, $W_K$, and $W_V$ are learnable matrices; calculating the scaled dot-product attention score matrix $S = QK^{\top} / \sqrt{d_k}$, where $d_k$ is the feature dimension of the key; calculating the attention weights $A = \mathrm{softmax}(S)$; and calculating the fused feature samples from the attention weights A and the value V, where Unflatten() is the inverse transform function corresponding to Flatten() and $d_v$ is the feature dimension of the value.
4. The method as described in claim 3, characterized in that the generator loss function is $L_G = L_{adv} + \lambda_{cls} L_{cls} + \lambda_{sp} L_{sp}$, where $G(z, y)$ denotes the fused feature samples generated by the generator under the random vector z and conditional information y; $L_{adv}$ is the adversarial generation loss, computed as the mathematical expectation, over the random vector z and the conditional information y, of a function of the discriminator output $D(\cdot)$, where D is 0 or 1; $L_{cls}$ is the classification generation loss, $L_{cls} = -\sum_{k=1}^{J} y_k \log \hat{y}_k$, where C represents the output class of the classifier, CE represents the cross-entropy loss function, J represents the total number of categories in the classification task, k represents the k-th category, $y_k$ represents the k-th component of the true label vector, and $\hat{y}_k$ represents the predicted probability of the classifier for the k-th class; $\lambda_{cls}$ is the weight coefficient of the classification generation loss; $L_{sp}$ is the sparse regularization term defined over the gate variables, M is the set of introduced gating variables, and $\lambda_{sp}$ is the weight coefficient of the sparse regularization term; the sparse regularization term uses an indicator function evaluated on the feature response value of the c-th channel at spatial location (i, j), where i is the index along the height direction of the fused feature sample, j is the index along the width direction of the fused feature sample, and c is the channel index; and the average activation value of each gating variable is calculated as $\bar{m}_g = \frac{1}{T} \sum_{t=1}^{T} m_g^{(t)}$, where $\bar{m}_g$ represents the average activation value of the g-th gating variable, $m_g^{(t)}$ is its value at time step t, and T represents the number of time steps used to count the activation state of the g-th gating variable during training; if $\bar{m}_g$ is below the preset threshold, the channel or path corresponding to the gating variable is deleted.
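5. The method as described in claim 4, characterized in that the discriminator loss function is $L_D = L_{adv}^{D} + \lambda_{ssl} L_{ssl} + \lambda_{cc} L_{cc}$, where $L_{adv}^{D}$ is the adversarial discrimination loss, $L_{ssl}$ is the self-supervised representation learning loss, $L_{cc}$ is the causal consistency constraint loss, and $\lambda_{ssl}$ and $\lambda_{cc}$ are preset weighting coefficients; in the adversarial discrimination loss, $D(\cdot)$ is the probability, output by the discriminator, that a sample is real, $\mathbb{E}_{x \sim p_{data}}$ takes the mathematical expectation over real samples x that follow the true data distribution $p_{data}$, and $\mathbb{E}_{z \sim p_z}$ takes the mathematical expectation over random vectors z that follow the latent noise distribution $p_z$; in the self-supervised representation learning loss, x is the real sample, $x'$ is the sample generated after cropping, flipping, or adding noise to x, and $f(\cdot)$ represents the intermediate features of the discriminator; and in the causal consistency constraint loss, $\tilde{x}$ is the sample generated after adding texture perturbation or noise to x.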
6. The method as described in claim 5, characterized in that the classifier includes a feature embedding module, an SP-Conformer combination module, and a classification head module; the SP-Conformer combination module includes U SP-Conformer modules and a convolutional neural network module; and the classification of the fused feature samples using a self-attention algorithm includes: the feature embedding module calculating the embedded feature vectors $E = [e_1, e_2, \dots, e_r] \in \mathbb{R}^{r \times d}$, where r is the number of tokens, d is the feature embedding dimension of a token, and $e_i$ is the embedded feature vector of the i-th token obtained after tokenizing the fused feature sample; the SP-Conformer combination module constructing the query $Q = E W_Q$, the key $K = E W_K$, and the value $V = E W_V$, where $W_Q$ is the learnable query matrix, $W_K$ is the learnable key matrix, $W_V$ is the learnable value matrix, and $d_k$ is the embedding dimension for single-head attention; the SP-Conformer combination module calculating the sparse plastic attention, in which ⊙ is the Hadamard (element-wise) multiplication operator, G is the sparse selection mask matrix applied to the attention matrix, the elements $G_{ij}$ of G take values in [0,1] and are obtained by applying an indicator function to the comparison of a scoring function with an adaptive threshold, and the scoring function uses learnable coefficient matrices that project the features to a mapped feature dimension; the SP-Conformer combination module calculating the feature map of each channel c of the convolutional neural network from the convolution kernel weights corresponding to channel c, which are learnable parameters obtained through backpropagation training, the gate variable of channel c, which takes values in [0,1], and X, the output feature of the SP-Conformer module, wherein if the gate variable of channel c is below the preset threshold, the SP-Conformer combination module skips the calculation of channel c; and the classification head module taking the feature maps as input and outputting the classification result.
7. The method as described in claim 6, characterized in that the classifier loss function is $L_C = L_{CE} + \lambda_{g} L_{gate} + \lambda_{a} L_{mask}$, where $L_{CE} = -\sum_{p=1}^{P} y_p \log \frac{\exp(s_p)}{\sum_{q=1}^{P} \exp(s_q)}$ is the cross-entropy loss, P represents the total number of categories in the classification task, $y_p$ is the p-th component of the true category label, and $s_q$ is the predicted score of the classifier for the q-th class; $L_{gate}$ is the channel gating sparsity loss, $L_{mask}$ is the attention mask sparsity loss, and $\lambda_{g}$ and $\lambda_{a}$ are preset weighting coefficients; and $l$ is the index of the modules that introduce sparse attention mechanisms, $g^{(l)}$ denotes the channel gating variables of the $l$-th such module, and $G^{(l)}$ is the attention mask matrix corresponding to the $l$-th such module.
8. The method according to any one of claims 1-7, characterized in that the method further comprises: randomly cropping or flipping the original real samples to generate augmented samples; and merging the augmented samples with the original real samples to generate the real samples.
9. A malicious code image recognition system based on conditional generative adversarial networks, characterized in that the system comprises a conditional generative adversarial network comprising a generator, a discriminator, and a classifier, wherein: the generator is configured to receive an input random vector and preset condition information, generate shallow features, mid-level features, and deep features based on the random vector and the condition information, obtain fused feature samples based on the shallow features, the mid-level features, and the deep features, and calculate the generator loss function; the discriminator is configured to receive the fused feature sample and the real sample, determine whether the fused feature sample is the real sample based on the fused feature sample and the real sample, and calculate the discriminator loss function, wherein the real sample is a real malicious code image sample; the classifier is configured to receive the fused feature samples, the real samples, and preset category information, classify the fused feature samples using a self-attention algorithm, and calculate the classifier loss function; the generator, the discriminator, and the classifier are further configured to perform adversarial training based on the generator loss function, the discriminator loss function, and the classifier loss function until the generator loss function, the discriminator loss function, and the classifier loss function meet preset conditions; and the conditional generative adversarial network is further configured to classify input malicious code image samples.
10. An electronic device, characterized in that the electronic device comprises: at least one processor; and a memory communicatively connected to at least one of the processors; wherein the memory stores instructions that can be executed by the processor to implement the malicious code image recognition method based on conditional generative adversarial networks as described in any one of claims 1-8.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions that are executed by a computer to implement the malicious code image recognition method based on conditional generative adversarial networks as described in any one of claims 1-8.