A backdoor attack cleaning method and system for a computer vision neural network model
By dividing the visual neural network model into a feature extractor and a classifier, generating feature representations and constructing a dataset, and fine-tuning the classifier using the cross-entropy loss function and regularization strategy, the problem of backdoor attack removal without data dependency is solved, and backdoor attacks are effectively removed without affecting the original performance of the model.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHEJIANG UNIV
- Filing Date
- 2023-11-20
- Publication Date
- 2026-06-23
AI Technical Summary
Existing methods for removing backdoor attacks on computer vision neural network models rely on auxiliary data and cannot operate effectively in situations without data support. This is especially true in model sharing platforms or federated learning environments where defenders cannot obtain samples, making it difficult to effectively remove backdoor attacks.
The visual neural network model is divided into a feature extractor and a classifier. Multiple feature representations are generated and a dataset is constructed. The classifier is fine-tuned to remove backdoor attacks. The classification confidence is maximized by using the cross-entropy loss function and regularization strategy. Feature representations are generated and backdoors are removed without data dependency.
Without relying on any auxiliary data, it significantly reduces the success rate of backdoor attacks while basically maintaining the normal function of the model, and is applicable to backdoor attacks of various trigger types.
Figure CN117611968B_ABST (abstract drawing)
Abstract
Description
Technical Field
[0001] This invention relates to the field of neural network model security protection technology, and in particular to a method and system for removing backdoor attacks from computer vision neural network models.
Background Technology
[0002] Backdoor attacks targeting deep neural network models are one of the major security threats facing artificial intelligence. A neural network model injected with a backdoor behaves normally for normal input samples, i.e., it outputs the correct predicted category. However, if the input sample contains a specific trigger, the injected neural network model will exhibit aberrant behavior pre-programmed by the attacker, such as classifying the sample into a specified target category.
[0003] Although various backdoor removal methods exist, they all rely on the same assumption: that the defender can access a set of labeled, verified samples without triggers, or that the defender can access trigger-carrying samples as they arrive online. These assumptions may not hold in some real-world scenarios. For example, if the defender is the maintainer of a model-sharing platform, they may not be able to access any auxiliary samples when checking the models on the platform for backdoors; if the defender is the server in a horizontal federated learning system, they may not be able to access any local samples belonging to the federated learning participants.
Summary of the Invention
[0004] To address the current limitations of existing backdoor removal methods for computer vision neural network models, which rely on auxiliary data and cannot operate without data dependencies, this invention provides a method and system for removing backdoors from computer vision neural network models, enabling backdoor removal without data dependencies.
[0005] This invention provides the following technical solution:
[0006] In a first aspect, this invention proposes a method for removing backdoor attacks on computer vision neural network models, applied in the field of image recognition, including:
[0007] The visual neural network model to be processed is divided into a feature extractor part and a classifier part;
[0008] For each predicted category of the visual neural network model, multiple feature representations are generated using the feature extractor part of the visual neural network model;
[0009] The dataset is constructed using the generated feature representations, with each feature representation treated as a sample and its label being the predicted category at the time of generation.
[0010] The classifier part of the visual neural network model is fine-tuned using the constructed dataset to eliminate backdoor attacks on the visual neural network model.
[0011] Furthermore, the last few fully connected layers of the visual neural network model are used as the classifier part, and the rest are used as the feature extractor part.
[0012] Furthermore, the generation strategy of using the feature extractor part to generate multiple feature representations is to maximize the classification confidence of each predicted category in the classifier part.
[0013] Furthermore, the generation strategy is expressed as:
[0014] $$IR_c = \arg\min_{IR_c} \; CE\big(M_{cls}(IR_c),\, c\big) + \lambda_{l1}\,\lVert IR_c \rVert, \quad c \in \{1, \dots, N_c\}$$
[0015] $$\lVert IR_c \rVert = \sum_{i=1}^{N_d} \lvert IR_c^i \rvert$$
[0016] Where $CE$ is the cross-entropy loss function, $N_c$ is the number of predicted categories of the visual neural network model, $\lambda_{l1}$ is the parameter controlling L1 regularization, $M_{cls}(\cdot)$ is the classifier part of the visual neural network model, $c$ is the label of the predicted category, $IR_c$ is the feature representation corresponding to predicted category $c$, $IR_c^i$ is the value of the i-th dimension of the feature representation corresponding to predicted category $c$, $N_d$ is the dimension of the feature representation, and $\lVert \cdot \rVert$ is the L1 norm.
[0017] Furthermore, the number of feature representations generated for each predicted category of the visual neural network model is the same.
[0018] Furthermore, when fine-tuning the classifier part of the visual neural network model, the training objective is:
[0019] $$M'_{cls} = \arg\min_{M'_{cls}} \sum_{(X,y) \in D_{IR}} CE\big(M'_{cls}(X),\, y\big) + \lambda_{l2}\,\lVert M'_{cls} \rVert_2$$
[0020] Where $X, y$ are the generated feature representation samples and their labels, $D_{IR}$ is the dataset constructed from the generated feature representations, $CE$ is the cross-entropy loss function, $M'_{cls}$ is the fine-tuned classifier part, $\lambda_{l2}$ is the parameter controlling L2 regularization, and $\lVert \cdot \rVert_2$ is the L2 norm.
[0021] Furthermore, the backdoor attack includes one or more forms such as using pixel block triggers, image filter triggers, image watermark triggers, using specific natural features as triggers, or using a mixture of specific normal features as triggers.
[0022] In a second aspect, this invention proposes a backdoor attack removal system for computer vision neural network models, applied in the field of image recognition, comprising:
[0023] The model segmentation module is used to divide the visual neural network model to be processed into a feature extractor part and a classifier part;
[0024] The feature representation generation module is used to generate multiple feature representations for each predicted category of the visual neural network model using the feature extractor part of the visual neural network model.
[0025] The dataset building module is used to build a dataset using the generated feature representations, treating each feature representation as a sample, with the sample label being the predicted category at the time of generation;
[0026] The backdoor removal module is used to fine-tune the classifier part of a visual neural network model using a constructed dataset to remove backdoor attacks from the visual neural network model.
[0027] Furthermore, the generation strategy of the feature representation generation module is to maximize the classification confidence of each predicted category in the classifier part.
[0028] Furthermore, in the model segmentation module, the last few fully connected layers of the visual neural network model are used as the classifier part, and the remaining part is used as the feature extractor part.
[0029] Compared with existing technologies, the advantages of this invention are: it can reverse-generate the model's feature representation for each category from the model parameters alone, without relying on any auxiliary data, and fine-tune several layers of the model on these feature representations and their corresponding labels, thereby removing backdoors already implanted in the model while basically maintaining the normal function of the model. This invention automatically generates feature representations to construct a fine-tuning dataset, filling the current gap in data-free backdoor removal technology for neural network models.
Description of the Drawings
[0030] Figure 1 is a schematic diagram of the architecture of the backdoor removal system for computer vision neural network models.
[0031] Figure 2 is a flowchart of the backdoor removal method for computer vision neural network models.
[0032] Figure 3 shows examples of the five trigger types covered by the present invention, including pixel block triggers, image filter triggers, and image watermark triggers.
[0033] Figure 4 shows the removal effect of the generation-based method under different types of backdoor attacks: in (a) the model is attacked using a pixel block trigger, and in (b) the model is attacked using an image watermark trigger.
Detailed Implementation
[0034] The present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be noted that the embodiments described below are intended to facilitate the understanding of the present invention and do not limit it in any way.
[0035] The architecture of the data-independent backdoor removal system for computer vision neural network models is shown in Figure 1. It mainly includes a model segmentation module, a feature representation generation module, a dataset construction module, and a backdoor removal module.
[0036] As shown in Figure 1, the model segmentation module selects a specific separator layer. Using this separator layer as the dividing line, the neural network model to be processed is split into a feature extractor part and a classifier part (the classifier part includes the separator layer itself). The system fine-tunes the classifier part to disrupt the backdoor implanted in the neural network, thereby removing the backdoor attack. In this embodiment, the segmentation strategy is to select the last few fully connected layers of the neural network model as the classifier part. For example, for ResNet-18 or GoogLeNet, the last layer is selected; for VGG-16, the last three layers are selected. In this field, the last few fully connected layers of a neural network model are generally defined as the classifier part of the model, and this invention follows that general definition.
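The segmentation rule in [0036] can be illustrated with a minimal sketch. It assumes a model is described as an ordered list of (name, kind) pairs; that description format, the function name `split_model`, and the toy VGG-16-like layer list are illustrative assumptions and are not part of the patent.

```python
from typing import List, Tuple

# Hypothetical layer description: (layer name, layer kind).
Layer = Tuple[str, str]


def split_model(layers: List[Layer], num_classifier_layers: int) -> Tuple[List[Layer], List[Layer]]:
    """Split an ordered layer list into (feature_extractor, classifier).

    The classifier is the trailing `num_classifier_layers` layers, so the
    separator layer (the first layer of that suffix) belongs to the classifier.
    """
    if not 0 < num_classifier_layers < len(layers):
        raise ValueError("classifier must be a non-empty proper suffix of the model")
    extractor = layers[:-num_classifier_layers]
    classifier = layers[-num_classifier_layers:]
    # The embodiment only uses fully connected layers as the classifier part.
    if any(kind != "fc" for _, kind in classifier):
        raise ValueError("classifier part must consist of fully connected layers only")
    return extractor, classifier


# Toy VGG-16-like description: 13 convolutional layers, then 3 fully connected layers.
vgg16_like = [(f"conv{i}", "conv") for i in range(1, 14)] + [(f"fc{i}", "fc") for i in range(1, 4)]
extractor, classifier = split_model(vgg16_like, num_classifier_layers=3)
print(len(extractor), [name for name, _ in classifier])
```

For a ResNet-18 or GoogLeNet style model, the same call with `num_classifier_layers=1` would keep only the final fully connected layer as the classifier part.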
[0037] The feature representation generation module generates several feature representations for each predicted category from the feature extractor section described above, and temporarily stores them in computer memory. During the generation process, the L1 norm is used to regularize the generated results, with a coefficient set to 0.01. For example, on the GTSRB dataset, there are 43 categories. If two feature representations are generated for each category, then 43 × 2 = 86 feature representations are generated.
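As a rough illustration of the generation step in [0037], the sketch below optimizes a feature vector directly against a small linear softmax classifier, minimizing cross-entropy for the target category plus an L1 penalty with coefficient 0.01. The linear classifier, the weight values, the step count, the learning rate, and all function names are assumptions chosen for the example; a real classifier part would be the model's own fully connected layers.

```python
import math
import random
from typing import List

# Hypothetical toy classifier part: 3 categories, 4-dimensional feature representations.
WEIGHTS = [[1.0, -0.5, 0.0, 0.2], [-0.3, 1.0, 0.4, 0.0], [0.0, -0.2, 1.0, -0.6]]
BIAS = [0.0, 0.0, 0.0]


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(weights, bias, ir) -> List[float]:
    """Class probabilities of the linear classifier for feature representation `ir`."""
    return softmax([sum(w * x for w, x in zip(row, ir)) + b for row, b in zip(weights, bias)])


def generate_ir(weights, bias, target, lambda_l1=0.01, steps=300, lr=0.1, seed=0) -> List[float]:
    """Gradient descent on CE(classifier(ir), target) + lambda_l1 * ||ir||_1."""
    rng = random.Random(seed)
    ir = [rng.gauss(0.0, 0.1) for _ in range(len(weights[0]))]
    for _ in range(steps):
        probs = classify(weights, bias, ir)
        # Gradient of cross-entropy with respect to the logits: probs - one_hot(target).
        dlogits = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
        for i in range(len(ir)):
            grad_ce = sum(dlogits[k] * weights[k][i] for k in range(len(weights)))
            sign = (ir[i] > 0) - (ir[i] < 0)  # subgradient of |ir[i]|
            ir[i] -= lr * (grad_ce + lambda_l1 * sign)
    return ir


for c in range(len(WEIGHTS)):
    probs = classify(WEIGHTS, BIAS, generate_ir(WEIGHTS, BIAS, target=c))
    print(c, probs.index(max(probs)))
```

Different seeds give different starting points, which is one simple way to obtain several distinct feature representations for the same category.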
[0038] The dataset construction module is used to construct a dataset by taking the feature representations generated by the feature representation generation module as samples. The label of each feature representation sample is the category corresponding to its generation. For example, in the GTSRB dataset, there are 43 categories. If feature representations are generated twice for each category, the constructed dataset will have 43 × 2 = 86 feature representations and 43 classes, with each class containing 2 feature representations.
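The bookkeeping in [0038] can be checked with a short sketch. The generator below is only a stand-in that returns random vectors; `generate_ir_stub`, `FEATURE_DIM`, and `build_dataset` are hypothetical names, and in the actual method each vector would come from the feature representation generation module.

```python
import random
from collections import Counter
from typing import List, Tuple

NUM_CLASSES = 43     # GTSRB has 43 categories
REPS_PER_CLASS = 2   # feature representations generated per category
FEATURE_DIM = 8      # hypothetical feature dimension, for illustration only


def generate_ir_stub(label: int, rng: random.Random) -> List[float]:
    """Placeholder for the real generation step: returns a random vector."""
    return [rng.random() for _ in range(FEATURE_DIM)]


def build_dataset(num_classes: int, reps: int, seed: int = 0) -> List[Tuple[List[float], int]]:
    """Each generated representation is one sample; its label is the category it was generated for."""
    rng = random.Random(seed)
    return [(generate_ir_stub(c, rng), c) for c in range(num_classes) for _ in range(reps)]


dataset = build_dataset(NUM_CLASSES, REPS_PER_CLASS)
per_class = Counter(label for _, label in dataset)
print(len(dataset), len(per_class), set(per_class.values()))
```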
[0039] The backdoor removal module is used to fine-tune the classifier part of the model to be processed on the constructed feature representation dataset. For example, on the GTSRB dataset, if the model is GoogLeNet, its last layer is fine-tuned. The fine-tuning dataset uses the constructed feature representation dataset, and the parameters of the last layer are regularized using the L2 norm with a coefficient of 0.005.
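The fine-tuning step in [0039] can be sketched on a toy linear classifier. The code below runs stochastic gradient descent on cross-entropy plus an L2 penalty with coefficient 0.005 for 10 epochs. It penalizes the squared L2 norm (the usual weight-decay form); whether the patent squares the norm is not stated, so that choice, along with the toy weights, toy dataset, learning rate, and function names, is an assumption. The toy only shows the training mechanics and does not demonstrate backdoor removal.

```python
import math
from typing import List, Tuple

Sample = Tuple[List[float], int]


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, bias, x) -> List[float]:
    return softmax([sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)])


def mean_ce(weights, bias, dataset: List[Sample]) -> float:
    """Mean cross-entropy of the classifier over the dataset."""
    return -sum(math.log(predict(weights, bias, x)[y]) for x, y in dataset) / len(dataset)


def fine_tune(weights, bias, dataset: List[Sample], lambda_l2=0.005, epochs=10, lr=0.1):
    """SGD on CE + lambda_l2 * ||W||_2^2; returns new parameters, inputs are not modified."""
    weights = [row[:] for row in weights]
    bias = bias[:]
    for _ in range(epochs):
        for x, y in dataset:
            probs = predict(weights, bias, x)
            for k, row in enumerate(weights):
                d = probs[k] - (1.0 if k == y else 0.0)  # dCE/dlogit_k
                for i in range(len(row)):
                    row[i] -= lr * (d * x[i] + 2.0 * lambda_l2 * row[i])
                bias[k] -= lr * d
    return weights, bias


# Hypothetical 2-category classifier and a tiny feature representation dataset.
W0 = [[0.5, 0.0], [0.0, 0.5]]
B0 = [0.0, 0.0]
DATASET = [([2.0, 0.0], 0), ([1.8, 0.1], 0), ([0.0, 2.0], 1), ([0.1, 1.9], 1)]

W1, B1 = fine_tune(W0, B0, DATASET)
print(mean_ce(W0, B0, DATASET) > mean_ce(W1, B1, DATASET))
```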
[0040] The system embodiments described above are merely illustrative. The modules may or may not be physically separate; that is, they may be located in one place or distributed across multiple network units. Embodiments of the system of the present invention can be applied to any device with data processing capabilities, such as a computer or other similar device or apparatus. The system embodiments can be implemented through software, hardware, or a combination of both. Taking software implementation as an example, as a logical device, it is formed by the processor of any data processing device loading the corresponding computer program instructions from non-volatile memory into memory for execution.
[0041] As shown in Figure 2, the backdoor removal method for computer vision neural network models includes the following steps:
[0042] (1) Divide the neural network model to be processed into a feature extractor part and a classifier part.
[0043] (2) For each predicted category of the neural network model, generate multiple feature representations of that category in the feature extractor section described above. The generation strategy is to maximize the classification confidence of that category in the output layer, as shown in the formula:
[0044] $$IR_c = \arg\min_{IR_c} \; CE\big(M_{cls}(IR_c),\, c\big) + \lambda_{l1}\,\lVert IR_c \rVert, \quad c \in \{1, \dots, N_c\}$$
[0045] $$\lVert IR_c \rVert = \sum_{i=1}^{N_d} \lvert IR_c^i \rvert$$
[0046] Where $CE$ is the cross-entropy loss function, $N_c$ is the number of predicted categories of the neural network model, and $\lambda_{l1}$ is the parameter controlling L1 regularization, which this embodiment sets to 0.01; $M_{cls}(\cdot)$ is the classifier part of the neural network model, $c$ is the label of the source category, $IR_c$ is the feature representation corresponding to source category $c$, $IR_c^i$ is the value of the i-th dimension of the feature representation corresponding to source category $c$, $N_d$ is the dimension of the feature representation of the neural network model, and $\lVert \cdot \rVert$ is the L1 norm.
[0047] (3) Construct a dataset using the feature representations generated above. Each feature representation is treated as a sample, and its label is the category for which it was generated. In this step, the feature representation of each predicted category $c$ is generated multiple times according to step (2). Taking two generations as an example, the results are denoted $\{(IR'_c, c), (IR''_c, c)\}$. After this is done for all predicted categories, a dataset $D_{IR}$ consisting of feature representations is obtained, in which the label of each sample is its corresponding predicted category:
[0048] $$D_{IR} = \{(IR'_1, 1), (IR''_1, 1), \dots, (IR'_{N_c}, N_c), (IR''_{N_c}, N_c)\}$$
[0049] (4) On the dataset constructed above, fine-tune the classifier part of the model to be processed to remove the backdoor of the neural network.
[0050] In this step, during fine-tuning, the feature representations are fed to the classifier part as input samples, and the training objective is:
[0051] $$M'_{cls} = \arg\min_{M'_{cls}} \sum_{(X,y) \in D_{IR}} CE\big(M'_{cls}(X),\, y\big) + \lambda_{l2}\,\lVert M'_{cls} \rVert_2$$
[0052] Where $X, y$ are the generated feature representation samples and their labels, $M'_{cls}$ is the fine-tuned classifier part, and $\lambda_{l2}$ is the parameter controlling L2 regularization, which this embodiment sets to 0.005; $\lVert \cdot \rVert_2$ is the L2 norm.
[0053] To verify the effectiveness of this invention, the method was tested on the classic object recognition dataset CIFAR-10 and the classic road sign recognition dataset GTSRB, using two classic model structures: GoogLeNet and VGG-16. For backdoor attacks, three trigger types were used: patch trigger, filter trigger, and blending trigger. This invention is not limited to these three trigger types; it is also applicable to backdoor attacks using other trigger types, such as attacks that use specific natural features as triggers or attacks that use a mixture of specific normal features as triggers. Examples of the five trigger types described above are shown in Figure 3.
[0054] This experiment uses ACC and ASR as the evaluation metrics. ACC (accuracy) is the classification accuracy of the model on its original task: the proportion of trigger-free test samples that the model classifies correctly. ASR (attack success rate) is the success rate of the backdoor attack: the number of trigger-carrying samples that the model classifies into the backdoor's specified category, divided by the number of trigger-carrying samples in the batch.
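The two metrics in [0054] follow directly from their definitions. The sketch below computes them from lists of predictions; the function names and the toy predictions are illustrative assumptions, not experimental data from the patent.

```python
from typing import List


def accuracy(preds: List[int], labels: List[int]) -> float:
    """ACC: fraction of trigger-free test samples classified correctly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def attack_success_rate(preds_on_triggered: List[int], target_class: int) -> float:
    """ASR: fraction of trigger-carrying samples classified into the backdoor's target category."""
    return sum(p == target_class for p in preds_on_triggered) / len(preds_on_triggered)


# Toy predictions, for illustration only.
clean_preds, clean_labels = [0, 1, 2, 1], [0, 1, 2, 2]
triggered_preds = [0, 0, 0, 1]  # backdoor target category assumed to be 0

print(accuracy(clean_preds, clean_labels))       # 0.75
print(attack_success_rate(triggered_preds, 0))   # 0.75
```

A successful removal would show ASR falling sharply on the triggered samples while ACC on the clean samples stays close to its original value.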
[0055] Under the above experimental settings, backdoors corresponding to the three trigger types were implanted into the neural network model. The backdoor removal effect of the method of the present invention was then tested on the backdoored model, with the number of fine-tuning epochs set to 10. The experimental results are shown in Table 1 below.
[0056] Table 1. Backdoor removal effects for three trigger types.
[0057]
[0058] As shown in Table 1, this invention significantly eliminates backdoors in the model across multiple datasets, model structures, and backdoor trigger settings. For example, on the GTSRB dataset with the GoogLeNet model architecture, for backdoor attacks of the image filter trigger type, this invention reduces the ASR (attack success rate) of the backdoor attack from 99.9% to 1.2%. This invention also incurs some performance loss on the model's original task: in the same setting, it reduces the ACC (accuracy) of the original task from 90.1% to 87.5%. Compared with the reduction in the backdoor attack success rate, this performance loss on the original task is a very small cost.
[0059] The reason this invention has a strong removal effect on backdoors of various trigger types is that it reverse-generates the feature representation of each predicted category, rather than reverse-generating samples in the input sample space. Regardless of the type of trigger used in the backdoor attack (image filter, pixel block, or image watermark), the trigger pattern is extracted into several dimensions of the feature vector or feature map after processing by the feature extractor, and is then processed further in the model's classifier. Therefore, whatever type of trigger the backdoor attack uses, this invention disrupts its related logic in the model's classifier, rendering the backdoor attack ineffective, which manifests as a significant decrease in ASR (attack success rate).
[0060] To further illustrate the beneficial effects of this invention, taking the GTSRB dataset and GoogLeNet model from the above experiments as examples, Figure 4 shows how the model's ACC and ASR change during the final step of the method, namely fine-tuning. The left image (a) shows the removal effect on the model with a backdoor implanted using a pixel block trigger, and the right image (b) shows the removal effect on the model with a backdoor implanted using an image watermark trigger. As fine-tuning proceeds, the ASR of the backdoor attack drops rapidly to close to 0.0%, while the ACC of the original task degrades slowly.
[0061] The embodiments described above provide a detailed explanation of the technical solutions and beneficial effects of the present invention. It should be understood that the above descriptions are merely specific embodiments of the present invention and are not intended to limit the present invention. Any modifications, additions, and equivalent substitutions made within the scope of the principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A method for removing backdoor attacks from computer vision neural network models, applied in the field of image recognition, characterized in that it comprises: dividing the visual neural network model to be processed into a feature extractor part and a classifier part; for each predicted category of the visual neural network model, generating multiple feature representations using the feature extractor part of the visual neural network model, wherein the strategy for generating these multiple feature representations is to maximize the classification confidence of each predicted category in the classifier part; constructing a dataset using the generated feature representations, with each feature representation treated as a sample and its label being the predicted category at the time of generation; and fine-tuning the classifier part of the visual neural network model using the constructed dataset to remove backdoor attacks from the visual neural network model; wherein, when fine-tuning the classifier part of the visual neural network model, the training objective is: $M'_{cls} = \arg\min_{M'_{cls}} \sum_{(X,y) \in D_{IR}} CE(M'_{cls}(X), y) + \lambda_{l2} \lVert M'_{cls} \rVert_2$; where $X, y$ are the generated feature representation samples and their labels, $D_{IR}$ is the dataset constructed from the generated feature representations, $CE$ is the cross-entropy loss function, $M'_{cls}$ is the fine-tuned classifier part, $\lambda_{l2}$ is the parameter controlling L2 regularization, and $\lVert \cdot \rVert_2$ is the L2 norm.
2. The method for removing backdoor attacks on computer vision neural network models according to claim 1, characterized in that the last few fully connected layers of the visual neural network model are used as the classifier part, and the rest are used as the feature extractor part.
3. The method for removing backdoor attacks on computer vision neural network models according to claim 1, characterized in that the generation strategy is expressed as: $IR_c = \arg\min_{IR_c} CE(M_{cls}(IR_c), c) + \lambda_{l1} \lVert IR_c \rVert$, $c \in \{1, \dots, N_c\}$; $\lVert IR_c \rVert = \sum_{i=1}^{N_d} \lvert IR_c^i \rvert$; where $N_c$ is the number of predicted categories of the visual neural network model, $\lambda_{l1}$ is the parameter controlling L1 regularization, $M_{cls}(\cdot)$ is the classifier part of the visual neural network model, $c$ is the label of the predicted category, $IR_c$ is the feature representation corresponding to predicted category $c$, $IR_c^i$ is the value of the i-th dimension of the feature representation corresponding to predicted category $c$, $N_d$ is the dimension of the feature representation, and $\lVert \cdot \rVert$ is the L1 norm.
4. The method for removing backdoor attacks on computer vision neural network models according to claim 1, characterized in that the number of feature representations generated for each predicted category of the visual neural network model is the same.
5. The method for removing backdoor attacks on computer vision neural network models according to claim 1, characterized in that the backdoor attacks include one or more forms such as using pixel block triggers, image filter triggers, image watermark triggers, using specific natural features as triggers, and using a mixture of specific normal features as triggers.
6. A system for removing backdoor attacks from computer vision neural network models, applied in the field of image recognition, characterized in that it comprises: a model segmentation module, used to divide the visual neural network model to be processed into a feature extractor part and a classifier part; a feature representation generation module, used to generate multiple feature representations for each predicted category of the visual neural network model using the feature extractor part of the visual neural network model, wherein the generation strategy of the feature representation generation module is to maximize the classification confidence of each predicted category in the classifier part; a dataset building module, used to build a dataset using the generated feature representations, treating each feature representation as a sample, with the sample label being the predicted category at the time of generation; and a backdoor removal module, used to fine-tune the classifier part of the visual neural network model using the constructed dataset to remove backdoor attacks from the visual neural network model; wherein, when fine-tuning the classifier part of the visual neural network model, the training objective is: $M'_{cls} = \arg\min_{M'_{cls}} \sum_{(X,y) \in D_{IR}} CE(M'_{cls}(X), y) + \lambda_{l2} \lVert M'_{cls} \rVert_2$; where $X, y$ are the generated feature representation samples and their labels, $D_{IR}$ is the dataset constructed from the generated feature representations, $CE$ is the cross-entropy loss function, $M'_{cls}$ is the fine-tuned classifier part, $\lambda_{l2}$ is the parameter controlling L2 regularization, and $\lVert \cdot \rVert_2$ is the L2 norm.
7. The computer vision neural network model backdoor attack removal system according to claim 6, characterized in that, in the model segmentation module, the last few fully connected layers of the visual neural network model are used as the classifier part, and the remaining parts are used as the feature extractor part.