Remote sensing few-shot object detection method based on conditional prompts and causal learning
Using conditional prompting and causal learning, the method generates conditional vectors and performs knowledge distillation under the backdoor criterion. This addresses the insufficient adaptability of visual language models to few-shot target detection in remote sensing images and improves the model's detection performance and stability on new categories.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA UNIV OF MINING & TECH
- Filing Date
- 2024-04-18
- Publication Date
- 2026-06-12
AI Technical Summary
Existing few-shot target detection methods for remote sensing images suffer from insufficient adaptability of visual language models, empirical errors introduced by knowledge distillation, and a contradiction between stability and adaptability, all of which limit their application in remote sensing.
We introduce conditional prompts and causal learning. Conditional vectors are generated through the lightweight network Meta-Net and the CLIP text encoder and combined with the backdoor criterion of causal learning for knowledge distillation, which eliminates confounding factors and improves the model's generalization to new categories.
The method improves the performance and adaptability of few-shot target detection in remote sensing images, addresses the adaptability problem of visual language models in remote sensing, and maintains the model's stability and generalization on new categories.
Figure CN118334519B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of few-shot remote sensing target detection, and particularly relates to a remote sensing few-shot target detection method based on conditional prompts and causal learning.
Background Technology
[0002] With the excellent performance of visual language models across various fields, prompt learning, as an effective learning paradigm, is being widely applied in remote sensing image processing, especially in few-shot target detection tasks. Few-shot target detection is an important application in remote sensing image processing; it aims to identify and locate targets of categories for which only a limited number of labeled samples are available.
[0003] To address the challenge of few-shot target detection in remote sensing images, researchers have proposed two main approaches: meta-learning-based and transfer learning-based. In meta-learning-based methods, the model is trained by sampling few-shot tasks from a large number of base classes, thereby quickly learning features for new categories. Transfer learning-based methods, on the other hand, focus on transferring knowledge from base classes to new categories to accelerate the model's learning process on these new categories.
[0004] However, due to the scarcity of remote sensing image data and the high cost of annotation, models often face the challenges of data sparsity and annotation difficulties. Therefore, knowledge distillation has become an effective solution, improving the model's generalization ability in few-shot object detection tasks by learning rich semantic information from large-scale pre-trained models (such as CLIP). Although knowledge distillation methods have brought new possibilities to remote sensing few-shot object detection, some difficulties still need to be overcome:
[0005] (1) Limited application of visual language models in remote sensing: Although visual language models perform well in various fields, their application in remote sensing target detection with few samples is relatively limited. The lack of effective methods to adapt these pre-trained models to remote sensing target detection with few samples limits the scope of their application.
[0006] (2) Empirical error of knowledge distillation: During knowledge distillation, the student model learns the knowledge that the teacher model obtained from the open set, but the teacher model's empirical error also degrades the student model's target detection performance, which limits the effectiveness of knowledge distillation in remote sensing few-shot target detection.
[0007] (3) The contradiction between stability and adaptability: In remote sensing few-shot target detection, the model's performance stability conflicts to some degree with its adaptability to new categories. The better the performance on the base classes, the more stable the network, but adaptability to new categories may suffer, resulting in poor performance in practical applications.
Summary of the Invention
[0008] Objective: The purpose of this invention is to overcome the shortcomings of existing technologies and propose a remote sensing few-shot target detection method based on conditional prompts and causal learning. Unlike existing remote sensing few-shot target detection methods, this invention introduces conditional prompts: the manually set prompts in CLIP are replaced with learnable prompts, and a lightweight neural network generates a conditional vector for each image. These conditional vectors are added to the learnable prompt vectors, improving generalization to new classes. Furthermore, a new knowledge distillation paradigm for few-shot target detection is designed from the perspective of causal theory. Based on a few-shot causal model, the confounding factors that reduce student model performance are analyzed, and the backdoor criterion is used to remove them without introducing additional influencing factors.
[0009] Technical Solution: To achieve the objectives of this invention, this invention proposes a remote sensing target detection method based on conditional prompts and causal learning, which includes the following steps:
[0010] Step 1: Obtain the remote sensing dataset DIOR, divide it into base classes and new classes according to the proportion, and construct the remote sensing dataset;
[0011] Step 2: Construct the main branch network. Extract the base class features from Step 1 through the ResNet50 backbone network and the feature pyramid. The RPN generates candidate regions from the base class features, and RoI Align maps the candidate regions back onto the base class feature map to extract the candidate region features.
[0012] Step 3: Construct an auxiliary branch network. The auxiliary branch consists of the lightweight network Meta-Net and the CLIP text encoder. Input the candidate region features generated in Step 2 into the auxiliary branch, and learn the conditional prompt vector v through Meta-Net and the CLIP text encoder. Once initialized, the weights of the text encoder do not participate in the training process.
[0013] Step 4: Input the conditional prompt vector v into the main branch network, classify the candidate regions through the classification head, complete the training on the base class data, and freeze some parameters of the network.
[0014] Step 5: Based on the backdoor criterion in causal learning, perform knowledge distillation on the new classes, adjust some parameters of the network, and then run detection on the remote sensing image to obtain the location and classification of the targets in it.
[0015] Furthermore, the method for step 1 is as follows:
[0016] Step 1-1: Obtain the remote sensing dataset DIOR. The dataset contains 11,725 training images and 11,738 test images across 20 categories. The annotation for each image includes the coordinates of the four vertices of each target to be detected and its category. Select five categories from the training set (Baseball field, Basketball court, Bridge, Chimney, and Ship) as the new classes $C_{novel}$; the remaining 15 categories serve as the base classes $C_{base}$. The new classes and the base classes share no category, that is:
[0017] $C_{base} \cap C_{novel} = \varnothing$ (1)
[0018] Step 1-2: Construct the base class dataset $D_{base} = \{(x, y),\ y \in C_{base}\}$ from the base class categories, where x represents the input remote sensing image and y represents the category of the remote sensing data. $D_{base}$ contains all the images and annotation files of the 15 base categories;
[0019] Step 1-3: Construct the new class dataset $D_{novel} = \{(x, y),\ y \in C_{novel}\}$ from the new class categories; each class in $D_{novel}$ contains only K samples, where K can take the values 2, 3, 5, or 10. The training dataset is $D_{few} = \{(x, y),\ y \in C_{base} \cup C_{novel}\}$, in which each category contains only K samples.
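The split described in Steps 1-1 to 1-3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the record format `(image_id, category)`, the helper name `build_few_shot_split`, and the fixed random seed are hypothetical, and each record is treated as one labeled sample.

```python
import random
from collections import defaultdict

# Novel classes named in Step 1-1; every other category is a base class.
NOVEL_CLASSES = {"Baseball field", "Basketball court", "Bridge", "Chimney", "Ship"}


def build_few_shot_split(samples, k, seed=0):
    """Return (d_base, d_novel, d_few) from (image_id, category) records.

    d_base keeps every base-class sample, d_novel keeps K samples per novel
    class, and d_few keeps K samples per class over base and novel classes.
    """
    if k not in (2, 3, 5, 10):
        raise ValueError("K is 2, 3, 5, or 10 in the described method")
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only

    by_class = defaultdict(list)
    for image_id, category in samples:
        by_class[category].append((image_id, category))

    d_base, d_novel, d_few = [], [], []
    for category, records in sorted(by_class.items()):
        shots = rng.sample(records, min(k, len(records)))
        d_few.extend(shots)
        if category in NOVEL_CLASSES:
            d_novel.extend(shots)
        else:
            d_base.extend(records)
    return d_base, d_novel, d_few


# Toy records standing in for DIOR annotations (not real data).
toy = [(f"img{i:03d}", c) for c in ("Ship", "Bridge", "Airplane") for i in range(12)]
d_base, d_novel, d_few = build_few_shot_split(toy, k=3)
print(len(d_base), len(d_novel), len(d_few))  # 12 6 9
```

With K = 3 and three toy classes, the base set keeps all 12 "Airplane" records, the novel set keeps 3 records for each of the two novel classes, and the few-shot set keeps 3 records for every class.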
[0020] Furthermore, the process of generating candidate regions and extracting candidate region features is as follows:
[0021] (2.1) Input the base class remote sensing data $D_{base}$ into the backbone network ResNet-50 to obtain the remote sensing image features $f_i$;
[0022] (2.2) Traverse the feature map $f_i$ with a 3×3 convolution kernel with padding 2 and stride 1, generating nine anchor boxes at each pixel. The anchor boxes have aspect ratios of 1:1, 1:2, and 2:1 and side lengths of 128, 256, and 512;
[0023] (2.3) Use the softmax function to divide the anchor boxes into positive and negative samples. Randomly select 2000 positive-sample anchor boxes, apply non-maximum suppression to them, sort the results, and take the top 300 as the final proposal boxes $b_i$, where $b_i$ denotes the i-th proposal box;
[0024] (2.4) Use bilinear interpolation to map $b_i$ onto $f_i$, generating a 7×7 feature map $f_i'$. Train the RPN with the loss $L_{rpn}$, which consists of a smooth L1 loss and a cross-entropy loss.
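The anchor generation in (2.2) and the proposal selection in (2.3) can be sketched in plain Python. The text does not state how the ratios map to width and height or which IoU threshold the suppression uses, so the ratio convention (height over width, area preserved) and `iou_thresh=0.7` below are assumptions, and the function names are hypothetical.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Nine anchors (x1, y1, x2, y2) centred on (cx, cy).

    Each anchor keeps area scale**2; ratio is height / width (an assumption,
    the text only lists 1:1, 1:2 and 2:1).
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_proposals(boxes, scores, iou_thresh=0.7, top_k=300):
    """Greedy non-maximum suppression, keeping at most top_k boxes by score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
        if len(keep) == top_k:
            break
    return [boxes[i] for i in keep]


anchors = make_anchors(400, 400)
print(len(anchors))  # 9

boxes = [(0, 0, 100, 100), (5, 5, 105, 105), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
proposals = select_proposals(boxes, scores)
print(proposals)  # [(0, 0, 100, 100), (200, 200, 300, 300)]
```

The second toy box overlaps the first with an IoU of about 0.82, so it is suppressed; the third does not overlap and is kept.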
[0025] Furthermore, in step 3, the process of generating the conditional prompt vector v is as follows:
[0026] (3.1) Input $f_i$ into the lightweight network Meta-Net to generate $m_{img}$, which is composed of 9 tokens. Meta-Net has a ReLU-Linear-ReLU structure;
[0027] (3.2) Convert the base class category text into a token using Byte Pair Encoding (BPE), randomly initialize 8 tokens, concatenate them with the token converted from the category text, and then add the result to $m_{img}$;
[0028] (3.3) Using the Word2Vec word embedding model, convert the summed 9 tokens into a prompt vector and input it into the CLIP text encoder g(·) to generate a 512×15 conditional prompt vector v. Input $f_i'$ into two fully connected layers to generate a visual feature vector $f_i''$ with dimensions 1024×512. Multiply v and $f_i''$ and apply the softmax function to obtain the auxiliary prediction probability $p_{aux}$. With $p_y$ denoting the label of the current sample, train the auxiliary branch with the cross-entropy loss $L_{aux}$:
[0029]
[0030] where N represents the total number of samples, $p_{aux}^{i}$ represents the auxiliary prediction probability of the i-th sample, and $p_{y}^{i}$ represents the label of the i-th sample.
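The conditional prompt construction in (3.1) and (3.2) and the auxiliary loss can be illustrated with a small pure-Python sketch. Several things here are assumptions rather than details from the text: the toy dimensions, the random weights, the omission of the CLIP text encoder and the fully connected layers, and the form of the loss, taken to be the usual mean cross-entropy $L_{aux} = -\frac{1}{N}\sum_{i} p_y^i \log p_{aux}^i$ because the formula in [0029] is not reproduced in this record.

```python
import math
import random


def relu(xs):
    return [max(0.0, x) for x in xs]


def linear(xs, weight, bias):
    """Dense layer: weight holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, xs)) + b for row, b in zip(weight, bias)]


def meta_net(feature, weight, bias):
    """ReLU-Linear-ReLU, the structure the text gives for Meta-Net."""
    return relu(linear(relu(feature), weight, bias))


def conditional_prompt(context_tokens, class_token, m_img):
    """Concatenate 8 context tokens with the class token, then add m_img."""
    tokens = context_tokens + [class_token]  # 9 tokens in total
    return [[t + m for t, m in zip(tok, m_tok)] for tok, m_tok in zip(tokens, m_img)]


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def aux_loss(logits_per_sample, labels):
    """Mean cross-entropy over N samples (assumed form of the missing formula)."""
    total = 0.0
    for logits, label in zip(logits_per_sample, labels):
        total -= math.log(softmax(logits)[label])
    return total / len(labels)


rng = random.Random(0)  # toy values only
dim, n_tokens, in_dim = 4, 9, 6
feature = [rng.uniform(-1, 1) for _ in range(in_dim)]
weight = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(n_tokens * dim)]
bias = [0.0] * (n_tokens * dim)

flat = meta_net(feature, weight, bias)
m_img = [flat[i * dim:(i + 1) * dim] for i in range(n_tokens)]  # 9 tokens
context = [[rng.gauss(0, 0.02) for _ in range(dim)] for _ in range(8)]
class_token = [1.0] * dim

prompt = conditional_prompt(context, class_token, m_img)
print(len(prompt), len(prompt[0]))  # 9 4
loss = aux_loss([[2.0, 0.0, 0.0]], [0])
print(loss)
```

In the described method the summed tokens would then pass through the CLIP text encoder to give v; the sketch stops before that step because it depends on the pretrained model.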
[0031] Furthermore, in step 4, the specific methods for classifying candidate regions and freezing some parameters of the network are as follows:
[0032] (4.1) The candidate regions are classified using a cosine similarity-based classifier. The classification score s and probability p of the candidate regions are calculated using the following formulas:
[0033]
[0034]
[0035] where $F(x)_j$ represents the feature of the j-th candidate region in the input remote sensing image x, $w_i$ represents the weight vector of the i-th category, ||·|| represents the L2 norm, $s_{i,j}$ represents the similarity score between the j-th candidate region of the input remote sensing image x and the weight vector of category $y_i$, $p_{i,j}$ is the probability score of the j-th proposal region for category $y_i$, α represents the scaling factor, and $C = C_{base} + C_{novel}$ is the total number of categories;
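The cosine-similarity classifier in (4.1) can be sketched as below. The two formulas are not reproduced in this record, so the sketch assumes the common form suggested by the definitions above: $s_{i,j} = \alpha\, F(x)_j^{\top} w_i / (\lVert F(x)_j \rVert\, \lVert w_i \rVert)$, followed by a softmax over the C categories. The default `alpha=20.0` and the function names are hypothetical.

```python
import math


def cosine_scores(region_feature, class_weights, alpha=20.0):
    """Scaled cosine similarity between one region feature and each class weight."""
    f_norm = math.sqrt(sum(x * x for x in region_feature))
    scores = []
    for w in class_weights:
        w_norm = math.sqrt(sum(x * x for x in w))
        dot = sum(a * b for a, b in zip(region_feature, w))
        scores.append(alpha * dot / (f_norm * w_norm))
    return scores


def class_probabilities(scores):
    """Softmax over the per-class scores of one candidate region."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


feature = [3.0, 4.0]
weights = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]]
s = cosine_scores(feature, weights, alpha=10.0)
p = class_probabilities(s)
print([round(x, 1) for x in s])  # [6.0, 8.0, 10.0]
print(p.index(max(p)))  # 2
```

Because the scores depend only on the angle between the feature and each weight vector, the third class, whose weight is parallel to the feature, receives the highest probability regardless of vector length.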
[0036] (4.2) The feature extraction network consists of ResNet50 and a feature pyramid, and its loss function $L_{rcnn}$ consists of a smooth L1 loss and a cross-entropy loss. Its parameters $w_{rcnn}$ and the auxiliary branch parameters $w_{aux}$ are frozen; the parameters of the other modules are unfrozen and take part in gradient backpropagation to minimize the loss, which is expressed as follows:
[0037]
[0038] where $w_{rpn}$ represents the parameters of the RPN, $W_{ft} = \{w_{rpn}, w_{rcnn}\}$ represents the parameters of the RPN and the RCNN, $f(D_{few}; \theta^{*})$ indicates that the network is initialized with the weights trained on the base classes so that it can be fine-tuned on $D_{few}$ in the next step, and argmin(·) denotes the parameters that minimize the loss.
[0039] Furthermore, the knowledge distillation process in step 5 is as follows:
[0040] (5.1) After training on the base classes, the CLIP text encoder contains semantic knowledge P from the open dataset, classification knowledge $K_c$ of the remote sensing domain, and knowledge $F_{bg}$ used to distinguish context. When fine-tuning on the new classes, a remote sensing image X is input into the network model. Based on causal relationships and causal reasoning in computer vision, the class Y can be inferred from X along three paths. The first path infers the class Y directly from the remote sensing image. The second path obtains the remote sensing classification knowledge $K_c$ from the open-domain semantic knowledge P and infers the class Y from it. The third path obtains the context-distinguishing knowledge $F_{bg}$ from the open-domain semantic knowledge P and infers the class Y from it. $F_{bg}$ is divided into two layers, foreground knowledge $F'_{bg}$ and background knowledge $F''_{bg}$. In this causal model, P and $F_{bg}$ are both confounding factors of X, and the remote sensing domain knowledge $K_c$ is fixed. P cannot be observed, but $F_{bg}$ can be calibrated. The causal effect of X on Y is expressed by the following formula:
[0041]
[0042] where do(·) represents the do operator, Y(X) represents the category Y predicted after the remote sensing image X is input into the detection network, P(Y(X)|do(X)) represents the causal effect of X on Y, $F_m$ represents the hierarchical knowledge $F'_{bg}$ and $F''_{bg}$ of $F_{bg}$, and $P(F_m)$ represents the probability of each level of knowledge;
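The formula in [0041] is not reproduced in this record. Under the definitions in [0042], the standard backdoor adjustment over the two strata of $F_{bg}$ would take the form below. This is a reconstruction from those definitions, offered as a reading aid rather than as the patent's own typesetting.

```latex
P\bigl(Y(X) \mid do(X)\bigr)
  = \sum_{F_m \in \{F'_{bg},\, F''_{bg}\}}
    P\bigl(Y(X) \mid X, F_m\bigr)\, P(F_m)
```

Summing over the strata weighted by $P(F_m)$, rather than by $P(F_m \mid X)$, is what blocks the backdoor path through $F_{bg}$ in this kind of adjustment.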
[0043] (5.2) Knowledge distillation between the main and auxiliary branches: after a new class image is input, Step 2 is executed again to obtain 300 positive-sample anchor boxes. For positive samples, foreground-background knowledge distillation (FBD+) is performed, and the foreground-background knowledge of the positive samples is $P_{FBD^{+}}$. For negative samples, background knowledge distillation (FBD−) is performed, and the background knowledge of the negative samples is $P_{FBD^{-}}$. Foreground classification knowledge distillation (FCD) is performed on all positive and negative samples of the remote sensing images, and the classification knowledge of all samples is $P_{FCD}$. Using the KL divergence loss, the formula is as follows:
[0044]
[0045] where KL(·) represents the KL divergence loss, $KL(P_{FBD^{+}})$ represents the foreground-background knowledge distillation between the auxiliary branch and the main branch, $KL(P_{FBD^{-}})$ represents the background-category knowledge distillation between the auxiliary branch and the main branch, $KL(P_{FCD})$ represents the foreground classification knowledge distillation for all remote sensing images, and α and β are hyperparameters with values of 4 and 0.5, respectively. The specific knowledge distillation formulas are shown below:
[0046]
[0047]
[0048]
[0049] In these formulas, the paired terms represent, respectively, the foreground-background features of the auxiliary branch and the main branch, the background-category features of the auxiliary branch and the main branch, and the foreground classification features of the auxiliary branch and the main branch for the remote sensing images; N represents the number of samples. Because $w_{rcnn}$ and $w_{aux}$ are already frozen, only $w_{rpn}$ is adjusted when the total loss L is optimized by gradient descent:
[0050] $L = L_{KL} + L_{rcnn} + L_{rpn} + L_{aux}$ (11).
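The KL-divergence distillation in (5.2) and the total loss in formula (11) can be sketched as follows. The individual distillation formulas are not reproduced in this record, so the sketch makes two assumptions: the auxiliary branch acts as the teacher and the main branch as the student, and each term is a mean KL divergence between their softmax outputs. The placement of α and β is not recoverable from the text, so the sketch leaves them out; all function names are hypothetical.

```python
import math


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) for two discrete distributions."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )


def distillation_loss(aux_logits, main_logits):
    """Mean KL between auxiliary-branch and main-branch predictions."""
    terms = [
        kl_divergence(softmax(a), softmax(m))
        for a, m in zip(aux_logits, main_logits)
    ]
    return sum(terms) / len(terms)


def total_loss(l_kl, l_rcnn, l_rpn, l_aux):
    """Formula (11): unweighted sum of the four loss terms."""
    return l_kl + l_rcnn + l_rpn + l_aux


aux = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
main_same = [row[:] for row in aux]
main_diff = [[0.0, 0.0, 0.0], [3.0, 0.2, 0.1]]

same = distillation_loss(aux, main_same)
diff = distillation_loss(aux, main_diff)
print(same, diff > same)
```

When the two branches agree the term vanishes, and it grows as the main branch's predictions drift from those of the auxiliary branch, which is the signal the distillation step relies on.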
[0051] Beneficial effects: Compared with the prior art, the technical solution of the present invention has the following beneficial technical effects:
[0052] First, visual language models are introduced into few-shot target detection and then applied to the field of few-shot target detection in remote sensing images to solve the problem that visual language models cannot be effectively adapted in the field of few-shot remote sensing, thus introducing additional usable information for few-shot target detection in remote sensing.
[0053] Second, the causes of empirical error in the knowledge distillation paradigm are analyzed from a causal perspective, and the backdoor criterion is used to eliminate the confounding factors without adding any additional variables.
[0054] Third, knowledge distillation lets the model acquire additional semantic information from large-scale pre-trained models, enhancing adaptability while maintaining stability. Even when new class data is scarce, the student model can still learn a substantial amount of knowledge, which improves its detection performance on new classes.
Description of the Drawings
[0055] Figure 1 is a flowchart of the method of the present invention;
[0056] Figure 2 is a network structure diagram of the present invention;
[0057] Figure 3 is a structure diagram of the causal theory.
Detailed Implementation
[0058] The above description is only a preferred embodiment of the present invention. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the principle of the present invention, and these improvements and modifications should also be considered within the scope of protection of the present invention.
[0059] Referring to Figure 1, this invention proposes a remote sensing target detection method based on conditional prompts and causal learning. In this embodiment, Steps 1 to 5 and their sub-steps are carried out as set out in paragraphs [0009] to [0050] above.
[0101] The dataset of this invention will be further described below:
[0102] The experiments of this invention use the DIOR remote sensing dataset.
[0103] DIOR is a large-scale, publicly available object detection dataset for remote sensing images, released by Northwestern Polytechnical University in 2019. It contains 23,463 images collected from Google Earth and covers 20 object classes: airplanes, airports, baseball fields, basketball courts, bridges, chimneys, dams, highway service areas, highway toll booths, ports, golf courses, athletic fields, overpasses, ships, stadiums, storage tanks, tennis courts, train stations, vehicles, and windmills. The dataset is divided into a training set, a validation set, and a test set containing 5,862, 5,863, and 11,738 images, respectively. The spatial resolution ranges from 0.5 to 30 meters, and all images are 800×800 pixels. Fifteen classes were randomly selected from the DIOR dataset as base classes, and five classes were selected as new classes, with 2, 3, 5, or 10 samples drawn for each new class.
[0104] Furthermore, it should be understood that although this specification describes embodiments, not every embodiment contains only one independent technical solution. This narrative style is merely for clarity. Those skilled in the art should consider the specification as a whole, and the technical solutions in each embodiment can also be appropriately combined to form other embodiments that can be understood by those skilled in the art.
Claims
1. A remote sensing target detection method based on conditional prompts and causal learning, characterized in that the method includes the following steps:
Step 1: Obtain the remote sensing dataset DIOR, divide it into base classes and new classes according to the proportion, and construct the remote sensing dataset;
Step 2: Construct the main branch network. Extract the base class features from Step 1 through the ResNet50 backbone network and the feature pyramid. The RPN generates candidate regions from the base class features, and RoI Align maps the candidate regions back onto the base class feature map to extract the candidate region features;
Step 3: Construct an auxiliary branch network. The auxiliary branch consists of the lightweight network Meta-Net and the CLIP text encoder. The candidate region features generated in Step 2 are input into the auxiliary branch, and the conditional prompt vector v is learned through Meta-Net and the CLIP text encoder. Once initialized, the weights of the text encoder do not participate in the training process;
Step 4: Input the conditional prompt vector v into the main branch network, classify the candidate regions using the classification head, complete the training on the base class data, and freeze some parameters of the network;
Step 5: Based on the backdoor criterion in causal learning, perform knowledge distillation on the new classes, adjust some parameters of the network, and then run detection on the remote sensing image to obtain the location and classification of the targets in it.
2. The remote sensing target detection method based on conditional prompts and causal learning according to claim 1, characterized in that the method for Step 1 is as follows:
Step 1-1: Obtain the DIOR remote sensing dataset. The dataset contains 11,725 training images and 11,738 test images across 20 categories. The annotation for each image includes the coordinates of the four vertices of each target to be detected and its category. Select five categories from the training set (Baseball field, Basketball court, Bridge, Chimney, and Ship) as the new classes $C_{novel}$; the remaining 15 categories serve as the base classes $C_{base}$. The new classes and the base classes share no category, that is: $C_{base} \cap C_{novel} = \varnothing$ (1);
Step 1-2: Construct the base class dataset $D_{base} = \{(x, y),\ y \in C_{base}\}$ from the base class categories, where x represents the input remote sensing image and y represents the category of the remote sensing data. $D_{base}$ contains all the images and annotation files of the 15 base categories;
Step 1-3: Construct the new class dataset $D_{novel} = \{(x, y),\ y \in C_{novel}\}$ from the new class categories; each class in $D_{novel}$ contains only K samples, where K can take the values 2, 3, 5, or 10. The training dataset is $D_{few} = \{(x, y),\ y \in C_{base} \cup C_{novel}\}$, in which each category contains only K samples.
3. The remote sensing target detection method based on conditional prompting and causal learning according to claim 2, characterized in that, in Step 2, candidate regions are generated and candidate region features are extracted as follows:
(2.1) The base-class remote sensing data are input into the backbone network ResNet-50 to obtain the remote sensing image features.
(2.2) A 3×3 convolution kernel with a padding of 2 and a stride of 1 traverses the feature map, and nine anchor boxes are generated at each pixel. The aspect ratios of the anchor boxes are 1:1, 1:2, and 2:1, and their side lengths are 128, 256, and 512.
(2.3) The softmax function divides the anchor boxes into positive and negative samples. 2000 positive-sample anchor boxes are randomly selected and non-maximum suppression is applied to them. The results are sorted and the top 300 are kept as the final proposal boxes, each indexed by i as the i-th proposal box.
(2.4) Bilinear interpolation maps each proposal box onto the feature map, generating a 7×7 feature map. The RPN is trained with a loss consisting of smooth L1 loss and cross-entropy loss.
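The nine anchor shapes of step (2.2) and the suppression-then-top-300 selection of step (2.3) can be sketched as below. Several details are assumptions not stated in the claim: each anchor keeps the area of its nominal size (the usual Faster R-CNN convention), boxes are `(x1, y1, x2, y2)` tuples, and the IoU threshold of 0.7 is only a placeholder default.

```python
import math


def anchor_shapes(sizes=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return (width, height) for every size/ratio pair; ratio = width / height.

    Assumption: each anchor keeps the area size * size of its nominal size.
    """
    shapes = []
    for size in sizes:
        for ratio in ratios:
            shapes.append((size * math.sqrt(ratio), size / math.sqrt(ratio)))
    return shapes


def anchors_at(cx, cy, shapes):
    """Centre every anchor shape on the point (cx, cy)."""
    return [(cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2) for w, h in shapes]


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms_top_k(boxes, scores, iou_thresh=0.7, top_k=300):
    """Greedy non-maximum suppression; returns indices of at most top_k kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
            if len(keep) == top_k:
                break
    return keep
```

Because the greedy loop visits boxes in descending score order, the kept indices are already sorted, so truncating at `top_k` matches the "sort, then take the top 300" wording of step (2.3).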
4. The remote sensing target detection method based on conditional prompting and causal learning according to claim 3, characterized in that, in Step 3, the conditional prompt vector is generated as follows:
(3.1) The candidate region features are input into the lightweight network Meta-Net to generate a sequence of 9 tokens. Meta-Net has a ReLU-Linear-ReLU structure.
(3.2) The base-class category text is converted into a token using Byte Pair Encoding. 8 tokens are randomly initialized and concatenated with the token converted from the category text, and the resulting 9 tokens are added to the 9 tokens generated by Meta-Net.
(3.3) The Word2Vec word embedding model converts the summed 9 tokens into a prompt vector, which is input into the CLIP text encoder to generate a conditional prompt vector of dimension 512×15. The candidate region features are input into two fully connected layers to generate a visual feature vector of dimension 1024×512. The visual feature vector and the conditional prompt vector are multiplied and passed through the softmax function to obtain the auxiliary prediction probability p_aux. With p_y denoting the label of the current sample, the auxiliary branch is trained with the cross-entropy loss of formula (2), where N is the total number of samples and the loss is computed from the auxiliary prediction probability and the label of the i-th sample.
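The auxiliary prediction of step (3.3) amounts to a matrix product between visual features and per-class prompt vectors, followed by softmax and cross-entropy. The sketch below shows that computation on plain Python lists. It assumes integer class labels (equivalent to one-hot labels) and stores the prompt vectors one row per class, i.e. transposed relative to the 512×15 layout in the claim; the function names are illustrative, and Meta-Net and the CLIP text encoder themselves are not modelled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def auxiliary_probs(visual, prompts):
    """Compute p_aux for every sample.

    visual:  N rows of D-dimensional visual features.
    prompts: C rows of D-dimensional conditional prompt vectors (one per class).
    """
    return [
        softmax([sum(a * b for a, b in zip(feat, prompt)) for prompt in prompts])
        for feat in visual
    ]


def cross_entropy(probs, labels, eps=1e-12):
    """Mean cross-entropy over N samples, with labels given as class indices."""
    return -sum(math.log(p[y] + eps) for p, y in zip(probs, labels)) / len(labels)
```

Only the Meta-Net and fully connected layers would receive gradients from this loss, since claim 1 states that the text encoder weights do not participate in training.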
5. The remote sensing target detection method based on conditional prompting and causal learning according to claim 4, characterized in that, in Step 4, the candidate regions are classified and some of the network parameters are frozen as follows:
(4.1) The candidate regions are classified using a classifier based on cosine similarity. The classification score s and the probability p of each candidate region are calculated by formulas (3) and (4). These formulas are defined over the feature of the j-th candidate region of the input remote sensing image x and the weight vector of the i-th category, each normalized by its L2 norm. The score s is the similarity between the j-th candidate region of x and the weight vector of category y_i, and the probability p is the probability score of the j-th proposal region for category y_i. The formulas also involve a scaling factor and the number of all categories.
(4.2) The feature extraction network consists of ResNet50 and a feature pyramid, and its loss function L_rcnn consists of smooth L1 loss and cross-entropy loss. The parameters of the feature extraction network and the parameters of the auxiliary branch are frozen. The parameters of the other modules are unfrozen and participate in gradient backpropagation so as to minimize the loss, as expressed by formula (5). In formula (5), the symbols denote the parameters of the RPN and the parameters of the RPN and RCNN; the network is initialized with the weights trained on the base classes so that it can be adjusted in the next step; and the minimization operator selects the parameters that give the minimum loss.
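The cosine-similarity classifier of step (4.1) can be illustrated as below. Formulas (3) and (4) are not reproduced in this text, so the sketch follows the common form suggested by the symbol definitions: the score is a scaling factor times the cosine between a region feature and a class weight vector, and the probability is a softmax over all categories. The default `alpha=20.0` and the function names are assumptions.

```python
import math


def cosine_scores(feature, class_weights, alpha=20.0):
    """Scaled cosine similarity between one region feature and every class weight.

    alpha is the scaling factor; its default here is a placeholder, not a
    value taken from the claim.
    """
    f_norm = math.sqrt(sum(v * v for v in feature)) or 1.0
    scores = []
    for weight in class_weights:
        w_norm = math.sqrt(sum(v * v for v in weight)) or 1.0
        dot = sum(a * b for a, b in zip(feature, weight))
        scores.append(alpha * dot / (f_norm * w_norm))
    return scores


def class_probabilities(scores):
    """Softmax over the scores of all categories."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Normalizing both vectors makes the score independent of feature magnitude, which is the usual motivation for cosine classifiers when new classes have only a few samples.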
6. The remote sensing target detection method based on conditional prompting and causal learning according to claim 5, characterized in that the knowledge distillation in Step 5 proceeds as follows:
(5.1) After training on the base classes, the CLIP text encoder contains semantic knowledge P from the open dataset, classification knowledge K_c of the remote sensing domain, and knowledge F_bg used to distinguish context. When the network is adjusted to the new classes and a remote sensing image X is input, X can infer the class Y along three paths, based on causal relationships and causal reasoning in computer vision. The first path infers the class Y directly from the remote sensing image. The second path obtains the remote sensing classification knowledge K_c from the open-domain semantic knowledge P and infers the class Y. The third path obtains the context-distinguishing knowledge F_bg from the open-domain semantic knowledge P and infers the class Y. F_bg is divided into two layers, foreground knowledge and background knowledge. In this causal model, P and F_bg are both confounding factors of X, and the remote sensing domain knowledge K_c is fixed. P cannot be observed, but F_bg can be adjusted for. The causal effect of X on Y is expressed by formula (6), in which do() denotes the do operator, Y(X) denotes the class Y predicted after the remote sensing image X is input into the detection network, and the remaining symbols denote the causal effect of X on Y, the layers of F_bg knowledge, and the probability of each layer of knowledge.
(5.2) Knowledge distillation between the main and auxiliary branches: after a new-class image is input, Step 2 is executed again to obtain 300 positive-sample anchor boxes. For positive samples, knowledge distillation is performed from the foreground and background (FBD+), using the foreground-background knowledge of the positive samples. For negative samples, knowledge distillation is performed from the background (FBD-), using the background knowledge of the negative samples. For all positive and negative samples, foreground classification knowledge distillation (FCD) is performed, using the classification knowledge of all samples. The distillation uses the KL divergence loss and is given by formula (7), where KL() denotes the KL divergence loss. The three terms of formula (7) represent, respectively, the foreground-background knowledge distillation between the auxiliary branch and the main branch, the background-category knowledge distillation between the auxiliary branch and the main branch, and the foreground classification knowledge distillation for all remote sensing images. The two hyperparameters of formula (7) take the values 4 and 0.5 respectively. The specific distillation formula for each kind of knowledge is given by formulas (8), (9), and (10), whose symbols denote, for the auxiliary branch and the main branch respectively, the background features, the background-category features, and the foreground classification features of the remote sensing images, with N denoting the number of samples. Because the parameters of the feature extraction network and of the auxiliary branch are already frozen, only the remaining unfrozen parameters are adjusted when the total loss L is optimized by gradient descent, as expressed by formula (11).
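The two ingredients of claim 6, backdoor adjustment over the layers of F_bg and KL-divergence distillation from the frozen auxiliary branch to the main branch, can be sketched numerically as follows. Formulas (6) to (11) are not reproduced in this text, so everything here is an assumed form: the standard backdoor adjustment P(Y | do(X)) = Σ_f P(Y | X, f) P(f), the direction KL(auxiliary || main), and a generic weighted sum whose weights are supplied by the caller rather than tied to specific terms.

```python
import math


def backdoor_adjust(cond_probs, strata_probs):
    """Standard backdoor adjustment: P(Y | do(X)) = sum_f P(Y | X, f) * P(f).

    cond_probs:   dict mapping each layer f to the class distribution P(Y | X, f).
    strata_probs: dict mapping each layer f to its prior probability P(f).
    """
    n_classes = len(next(iter(cond_probs.values())))
    return [
        sum(strata_probs[f] * cond_probs[f][c] for f in strata_probs)
        for c in range(n_classes)
    ]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def distillation_term(aux_rows, main_rows):
    """Mean KL(auxiliary || main) over N samples (the direction is an assumption)."""
    pairs = list(zip(aux_rows, main_rows))
    return sum(kl_divergence(a, m) for a, m in pairs) / len(pairs)


def weighted_total(terms, weights):
    """Weighted sum of named loss terms; a term without a weight counts once."""
    return sum(weights.get(name, 1.0) * value for name, value in terms.items())
```

With two layers (foreground and background), `backdoor_adjust` averages the per-layer predictions by the layer priors, which is how stratifying on F_bg removes its confounding effect even though P itself stays unobserved.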