Remote sensing image scene classification method based on graph convolution template features and dual-teacher knowledge distillation

This remote sensing image scene classification method, which utilizes graph convolutional template features and dual-teacher knowledge distillation, addresses the issues of large discrepancies between model and real features and resource-constrained deployment in remote sensing image scene classification. It constructs a lightweight network, improves classification accuracy, and achieves effective deployment on resource-constrained devices.

CN118887452B · Active · Publication Date: 2026-06-16 · ENJOYOR COMPANY LIMITED

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: ENJOYOR COMPANY LIMITED
Filing Date: 2024-07-09
Publication Date: 2026-06-16

AI Technical Summary

Technical Problem

Existing remote sensing image scene classification methods rely on prior knowledge to select features, resulting in large differences between the constructed model and real image features. They are also difficult to deploy effectively on devices with limited computing resources, and they perform poorly when aggregating features at multiple scales, which reduces classification accuracy.

Method used

A lightweight network is constructed using graph convolutional template features and dual-teacher knowledge distillation. The feature saliency is enhanced by a graph convolutional template enhancement module, and a student model is trained using a dual-teacher knowledge distillation model. A feedback supplementary decoder is designed by combining modality contribution allocation and exchange correlation fusion modules to improve classification accuracy.

Benefits of technology

The method constructs a lightweight network that maintains high accuracy in remote sensing image scene classification, can be deployed effectively on resource-constrained devices, and surpasses the performance of existing methods.

Generated by Eureka AI based on patent content.


Abstract

The application discloses a remote sensing image scene classification method based on graph convolution template features and dual-teacher knowledge distillation, applied in the technical field of deep learning. The method includes: S1, acquiring remote sensing image data and preprocessing it; S2, dividing the preprocessed remote sensing image data into a training set and a test set; S3, establishing a dual-teacher knowledge distillation model that includes two teacher models and one student model, and inputting the training set and test set from S2 into the dual-teacher knowledge distillation model to train and test the model; S4, inputting the remote sensing image to be detected into the trained dual-teacher knowledge distillation model, which outputs the corresponding remote sensing image scene classification prediction map. The application can describe the various classification targets in remote sensing image scenes more accurately, thereby effectively improving the accuracy of classification prediction.

Description

Technical Field

[0001] This invention relates to the field of deep learning technology, and more specifically to a remote sensing image scene classification method based on graph convolution template features and dual-teacher knowledge distillation.

Background Technology

[0002] The increasing use of satellites equipped with high-resolution image acquisition devices has made it easier to obtain high-quality remote sensing images. These images provide richer details, offering significant opportunities for the development of scene classification in the field of remote sensing. The goal of remote sensing image scene classification is to correctly label high-resolution remote sensing images using predefined semantic categories based on semantic differences, understand and analyze the rich information contained in the images, and facilitate tasks such as urban planning, environmental protection, and land cover mapping.

[0003] Compared to close-up images, which have a small shooting range and uniform distribution of features of the same category, remote sensing images are characterized by high resolution, diverse information, similarity of some target features, and complex scenes. Therefore, scene classification on remote sensing images is very challenging. Traditional remote sensing image scene classification methods rely heavily on selecting target features based on prior knowledge, resulting in significant differences between the constructed models and the features of real images.

[0004] The rise of deep learning has provided a novel approach to remote sensing image scene classification. Numerous studies have shown that, compared to traditional methods, deep learning-based scene classification can significantly improve classification accuracy, fully meeting various real-world needs. Because remote sensing images often contain complex scenes, certain features can easily be masked by similar or adjacent features. Simultaneously, long-range contextual information is crucial for pixel-level prediction; therefore, a remote sensing image scene classification method that enhances the saliency of feature categories and mines long-range contextual information of targets is needed to address current challenges. It is important to note that features at different scales contain different information. Directly aggregating multi-scale features without additional processing can lead to negative side effects on prediction due to poorly performing features. Therefore, a more suitable decoder is required to generate a complete prediction map. Finally, considering practical applications, deep learning-based remote sensing image scene classification methods urgently need to achieve a good balance between model size and computational complexity, overcoming the challenge of deployment on devices with limited computing and storage resources.

[0005] Therefore, proposing a remote sensing image scene classification method based on graph convolution template features and dual-teacher knowledge distillation to address the difficulties in existing technologies is a problem that urgently needs to be solved by those skilled in the art.

Summary of the Invention

[0006] In view of this, the present invention provides a remote sensing image scene classification method based on graph convolution template features and dual-teacher knowledge distillation, which solves the problem of deployment and application under limited resources while ensuring high accuracy.

[0007] To achieve the above objectives, the present invention provides the following technical solution:

[0008] A remote sensing image scene classification method based on graph convolution template features and dual-teacher knowledge distillation includes the following steps:

[0009] S1. Acquire remote sensing image data and preprocess the remote sensing image data;

[0010] S2. Divide the preprocessed remote sensing image data into a training set and a test set;

[0011] S3. Establish a dual-teacher knowledge distillation model, which includes two teacher models and one student model. Input the training set and test set from S2 into the dual-teacher knowledge distillation model to achieve model training and testing.

[0012] S4. Input the remote sensing image to be detected into the trained dual-teacher knowledge distillation model, and the model outputs the corresponding remote sensing image scene classification prediction map.

[0013] Optionally, in the method described above, the remote sensing image data acquired in S1 may include the Vaihingen dataset and the Potsdam dataset.

[0014] Optionally, in the above method, the preprocessing of remote sensing data in S1 includes randomly flipping and cropping each remote sensing image and its corresponding depth image to achieve data augmentation.

[0015] Optionally, in the above method, the dual-teacher knowledge distillation model in S3 includes two teacher models and one student model. The outputs of the first teacher clustering template algorithm and the second teacher clustering template algorithm are multiplied element-wise, and the result, together with the output of the first student clustering template algorithm, is fed into the clustering knowledge distillation, where the pairwise similarity loss function is used for knowledge distillation. The outputs of the first teacher coordinate graph convolutional network and the second teacher coordinate graph convolutional network are multiplied element-wise, and the result, together with the output of the first student coordinate graph convolutional network, is fed into the context information mining knowledge distillation, where the Dice loss function is used for knowledge distillation. The final outputs of the two teacher models are multiplied element-wise, and the result, together with the output of the student model, is fed into the prediction knowledge distillation, where the Kullback-Leibler divergence loss function is used for knowledge distillation.
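
The prediction knowledge distillation step can be illustrated with a minimal sketch in plain Python. It assumes that the two teacher outputs are per-class probability distributions, and it renormalizes their element-wise product so that it is again a distribution; the renormalization, the function names, and the example scores are assumptions for illustration and are not stated in the patent.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_teachers(p1, p2):
    """Element-wise product of two teacher distributions.
    Renormalizing to sum to 1 is an assumption made for this sketch."""
    prod = [a * b for a, b in zip(p1, p2)]
    total = sum(prod)
    return [v / total for v in prod]

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) of two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

# Hypothetical class scores for a single pixel with three classes.
teacher1 = softmax([2.0, 0.5, 0.1])
teacher2 = softmax([1.5, 1.0, 0.2])
student = softmax([1.0, 0.8, 0.3])

target = fuse_teachers(teacher1, teacher2)
loss_kld = kl_divergence(target, student)
```

The product sharpens the classes on which both teachers agree, so the student is pulled toward predictions that both teachers support.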

[0016] Optionally, in the above method, the training of the dual-teacher knowledge distillation model in S3 is divided into two stages: training the teacher model and training the student model. The training of the teacher model is as follows:

[0017] Stage 1: Input the training set into the teacher model and use the cross-entropy loss function to calculate the loss between the teacher model's predicted image and the corresponding ground truth semantic segmentation image. Training is repeated until the teacher model converges and the loss function value reaches its minimum, yielding a set of trained teacher models.
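
As a rough illustration of the cross-entropy loss used here, the sketch below averages the per-pixel negative log-probability of the ground truth class. The nested-list layout, the function name, and the example values are assumptions for illustration; a real implementation would operate on tensors.

```python
import math

def pixel_cross_entropy(probs, labels, eps=1e-12):
    """Mean cross-entropy over all pixels.
    probs:  H x W x C nested lists of per-class probabilities.
    labels: H x W nested lists of integer ground truth class indices."""
    total, count = 0.0, 0
    for prob_row, label_row in zip(probs, labels):
        for class_probs, label in zip(prob_row, label_row):
            total += -math.log(class_probs[label] + eps)
            count += 1
    return total / count

# Hypothetical 2 x 2 prediction with three classes.
probs = [[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
         [[0.3, 0.3, 0.4], [0.25, 0.25, 0.5]]]
labels = [[0, 1], [2, 2]]
loss_ce = pixel_cross_entropy(probs, labels)
```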

[0018] Optionally, in the above method, the student model is trained as follows:

[0019] Stage 2: Input the training set into the student model and use the cross-entropy loss function to calculate the loss L_pre between the student model's predicted image and the corresponding ground truth semantic segmentation image. The knowledge of the trained teacher models is then transferred to the student model through the dual-teacher knowledge distillation model: clustering knowledge distillation yields the loss L_sp, context information mining knowledge distillation yields the loss L_dice, and prediction knowledge distillation yields the loss L_kld. The sum of the four loss values, L_pre + L_sp + L_dice + L_kld, is used as the loss of the student model. Training is repeated until the student model converges and the loss function value reaches its minimum, yielding a trained student model.
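
The student objective can be sketched as follows. The patent names a Dice loss and a pairwise similarity loss but does not give their exact forms, so the soft Dice formula with a smoothing term and the row-normalized similarity matrix used below are assumptions chosen for illustration; only the unweighted sum of the four terms comes from the text.

```python
def dice_loss(a, b, smooth=1.0):
    """Soft Dice loss between two flattened maps with values in [0, 1]."""
    inter = sum(x * y for x, y in zip(a, b))
    return 1.0 - (2.0 * inter + smooth) / (sum(a) + sum(b) + smooth)

def similarity_matrix(feats):
    """Pairwise dot products of feature vectors, each row L2-normalized."""
    gram = [[sum(x * y for x, y in zip(u, v)) for v in feats] for u in feats]
    out = []
    for row in gram:
        norm = sum(g * g for g in row) ** 0.5 or 1.0  # guard against all-zero rows
        out.append([g / norm for g in row])
    return out

def pairwise_similarity_loss(teacher_feats, student_feats):
    """Mean squared difference between teacher and student similarity matrices.
    The two sides may use different feature widths, but must have the same
    number of feature vectors."""
    gt = similarity_matrix(teacher_feats)
    gs = similarity_matrix(student_feats)
    n = len(gt)
    return sum((a - b) ** 2
               for rt, rs in zip(gt, gs) for a, b in zip(rt, rs)) / (n * n)

def student_loss(l_pre, l_sp, l_dice, l_kld):
    """Unweighted sum of the four loss terms, as described in the text."""
    return l_pre + l_sp + l_dice + l_kld
```

Because the similarity loss compares only sample-to-sample relations, it does not require the teacher and student features to have the same channel count, which suits a ResNet-34 teacher paired with a MobileNet-v2 student.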

[0020] Optionally, in the above method, inputting the remote sensing image to be detected into the trained dual-teacher knowledge distillation model in S4 specifically involves inputting the remote sensing image to be detected into the finally trained student model, and the student model outputting the corresponding remote sensing image scene classification prediction map.

[0021] As can be seen from the above technical solution, compared with the prior art, the present invention discloses a remote sensing image scene classification method based on graph convolution template features and dual-teacher knowledge distillation, which has the following beneficial effects:

[0022] 1) This invention introduces knowledge distillation into the task of remote sensing image scene classification and constructs two kinds of neural network model: a teacher model and a student model. Remote sensing images of both modalities are input into the teacher model for training, yielding a trained teacher scene classification model. Then, with the teacher model parameters fixed, the same images are input into the student network, and the student model is trained under the supervision of knowledge from the teacher model. To increase robustness, two teachers with different parameters are trained. A proposed confidence weight module assigns weights to the knowledge of the two teachers, and three types of knowledge distillation are performed: clustering knowledge, contextual information mining knowledge, and prediction knowledge.

[0023] 2) This invention yields a lightweight network for remote sensing image scene classification, with only 16.02M parameters, while surpassing the performance of state-of-the-art (SOTA) methods. After feature extraction via a neural network, this invention employs a graph convolutional template enhancement module. This module utilizes a clustering template algorithm to generate template features, further distinguishing various features. Simultaneously, it directly leverages the designed coordinate graph convolutional inference network to mine long-range contextual information. Therefore, it can accurately describe various classification targets in remote sensing image scenes, effectively improving the accuracy of classification prediction.

[0024] 3) In order to reduce the differences between multimodal images and achieve full feature fusion, we designed a modality contribution allocation module and an exchange correlation fusion module. During the fusion process, the modality contribution allocation module is used to assign weights to different modalities, and the exchange correlation fusion module is used to exchange the correlation matrices of the features of the two modalities to achieve full modality fusion.

[0025] 4) Based on this, in order to enhance the representation of the prediction, we designed a feedback supplement decoder, including a low-order feedback supplement decoder and a high-order feedback supplement decoder: the output of the first-layer low-order feedback supplement decoder is used as input and fed back to the second-layer high-order feedback supplement decoder to supplement and correct the output prediction.

Attached Figure Description

[0026] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.

[0027] Figure 1 is a flowchart of the remote sensing image scene classification method based on graph convolution template features and dual-teacher knowledge distillation provided by the present invention;

[0028] Figure 2 is a block diagram of the teacher (student) model of the method of the present invention;

[0029] Figure 3 is a block diagram of the graph convolution template enhancement module;

[0030] Figure 4 is a block diagram of the coordinate graph convolutional inference network;

[0031] Figure 5 is a block diagram of the modality contribution allocation module;

[0032] Figure 6 is a block diagram of the exchange correlation fusion module;

[0033] Figure 7 is a block diagram of the low-order feedback supplementation module;

[0034] Figure 8 is a block diagram of the high-order feedback supplementation module;

[0035] Figure 9 is a flowchart of the dual-teacher knowledge distillation model;

[0036] Figure 10 is a comparison of predicted images for the Vaihingen dataset, where 10a is the original remote sensing image from the Vaihingen dataset, 10b is the predicted image obtained by the teacher model, and 10c is the predicted image obtained by the student model;

[0037] Figure 11 is a comparison of predicted images for the Potsdam dataset, where 11a is the original remote sensing image from the Potsdam dataset, 11b is the predicted image obtained by the teacher model, and 11c is the predicted image obtained by the student model.

Detailed Implementation

[0038] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0039] As shown in Figure 1, this invention discloses a remote sensing image scene classification method based on graph convolution template features and dual-teacher knowledge distillation, comprising the following steps:

[0040] S1. Acquire remote sensing image data and preprocess the remote sensing image data;

[0041] S2. Divide the preprocessed remote sensing image data into a training set and a test set;

[0042] S3. Establish a dual-teacher knowledge distillation model, which includes two teacher models and one student model. Input the training set and test set from S2 into the dual-teacher knowledge distillation model to achieve model training and testing.

[0043] S4. Input the remote sensing image to be detected into the trained dual-teacher knowledge distillation model, and the model outputs the corresponding remote sensing image scene classification prediction map.

[0044] Furthermore, the remote sensing image data acquired in S1 includes the Vaihingen dataset and the Potsdam dataset.

[0045] Specifically, the Vaihingen dataset includes 16 height-labeled images of 2500×2500 pixels, each with a color map composed of near-infrared, green, and red bands and a corresponding depth map. The dataset is divided into two parts: 11 images for training and 5 images for validation. The Potsdam dataset includes 24 height-labeled images of 6000×6000 pixels, each with a color map and a depth map. Seven of these images are selected as the validation set, and the remaining images are used to train the model. In this invention, all remote sensing images are cropped to 256×256 pixels.

[0046] Furthermore, S1 performs preprocessing on remote sensing data, including randomly flipping and cropping each remote sensing image and its corresponding depth image to achieve data augmentation.
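
The key point of this augmentation is that the image and its depth map must receive the same random flip and the same crop window so that they stay pixel-aligned. A minimal sketch follows; the horizontal-only flip, the flip probability of 0.5, and the function name are assumptions for illustration and are not specified in the patent.

```python
import random

def paired_flip_and_crop(image, depth, crop_size, rng):
    """Apply one shared random horizontal flip and one shared random crop
    to an image and its depth map (both H x W nested lists of equal size)."""
    height, width = len(image), len(image[0])
    if rng.random() < 0.5:  # flip both inputs or neither
        image = [row[::-1] for row in image]
        depth = [row[::-1] for row in depth]
    top = rng.randint(0, height - crop_size)   # randint is inclusive at both ends
    left = rng.randint(0, width - crop_size)

    def crop(grid):
        return [row[left:left + crop_size] for row in grid[top:top + crop_size]]

    return crop(image), crop(depth)
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible when a seed is fixed.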

[0047] The teacher (student) model is constructed as follows: the teacher and student models differ only in their backbone networks; everything else is identical. The network in this invention mainly consists of three parts: feature extraction, feature fusion, and feature aggregation.

[0048] The model's feature extraction operations mainly include a color image input layer, a depth image input layer, eight convolutional blocks, and eight graph convolutional template enhancement modules. Feature fusion mainly includes four modality contribution allocation modules and four exchange-correlation fusion modules. Feature aggregation mainly includes four low-order feedback supplementation modules and four high-order feedback supplementation modules.

[0049] As shown in Figure 2, the color image passes through the first, second, third, and fourth convolutional blocks sequentially, and the depth image passes through the fifth, sixth, seventh, and eighth convolutional blocks sequentially. The color and depth features at each corresponding level (vertically aligned in Figure 2) are input into the graph convolutional template enhancement modules to enhance feature saliency and extract contextual information. The output of the graph convolutional template enhancement module is input into the modality contribution allocation module, which assigns weights to the different modalities in the feature fusion stage. Its output and the output of the graph convolutional template enhancement module are simultaneously input into the exchange correlation fusion module for feature fusion. The output of the exchange correlation fusion module is sent to the low-order feedback supplementation module of the first-layer decoder. Finally, in addition to the low-order feedback supplementation modules feeding the high-order feedback supplementation modules of the second-layer decoder, the output of the first low-order feedback supplementation module is used as a feedback feature to sequentially supplement and correct the information of the high-order feedback supplementation modules, ultimately yielding the three outputs of the model.

[0050] First, the model's input consists of a color image and a depth image, each of width W = 256 and height H = 256. The color image has R, G, and B channels, while the depth map has a single channel, which is initially copied three times along the channel dimension to obtain a three-channel depth map. For the teacher network, a ResNet-34 backbone is used for feature extraction, and for the student network, a MobileNet-v2 backbone is used. Except for the backbone, the teacher and student networks have the same structure. Because the features extracted by the first layer of the backbone network typically contain a large amount of interference, they are not used in this invention, and the first layer is not shown in the figures.
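
Copying the single-channel depth map three times lets the depth stream reuse a backbone that expects three input channels. A minimal sketch, assuming a channels-first nested-list layout and a hypothetical function name:

```python
def replicate_depth(depth, copies=3):
    """Turn an H x W depth map into a C x H x W list by copying it `copies` times.
    Each channel is an independent copy, so editing one does not affect the others."""
    return [[row[:] for row in depth] for _ in range(copies)]

depth_map = [[0.1, 0.2], [0.3, 0.4]]       # hypothetical 2 x 2 depth values
depth_3ch = replicate_depth(depth_map)      # shape 3 x 2 x 2
```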

[0051] Specifically:

[0052] In the teacher network, the first and sixth convolutional blocks each consist of a single convolutional block, with inputs of a 3-channel color image and a 3-channel depth image, respectively. Each convolutional block is constructed by sequentially connecting a convolutional layer (Conv), a normalization layer (BatchNorm), and an activation layer (Act, with the ReLU activation function). The kernel size of the convolutional layer is 7×7, the number of kernels is 64, the stride is 2, and the padding is 3. After normalization, the features pass through the activation layer to produce the output feature maps. The feature map sets output by the first and sixth convolutional blocks are denoted as [...]; the feature map width is [...] and the height is [...].

[0053] Both the second and seventh convolutional blocks consist of a max pooling operation and three residual blocks. Each residual block is formed by sequentially connecting convolutional layers, normalization layers, and activation layers. The max pooling operation has a 3×3 kernel size, a stride of 2, and edge padding of 1. The convolutional layers have kernel sizes of 3×3 and 3×3, with 64 kernels in each layer, a stride of 2, and edge padding of 1. After normalization, the features pass through the activation layer to produce the output feature maps. The feature map sets output by the second and seventh convolutional blocks are denoted as [...]; the feature map width is [...] and the height is [...].

[0054] The third convolutional block has the same structure as the eighth convolutional block, consisting of four residual blocks connected sequentially. Each residual block is composed of convolutional layers, normalization layers, and activation layers connected sequentially. The convolutional layers have kernel sizes of 3×3 and 3×3, with 128 kernels, a stride of 2, and edge padding of 1. After normalization, the features pass through the activation layer to produce the output feature maps. The feature map sets output by the third and eighth convolutional blocks are denoted as [...]; the feature map width is [...] and the height is [...].

[0055] The fourth convolutional block has the same structure as the ninth convolutional block, consisting of six residual blocks connected sequentially. Each residual block is composed of convolutional layers, normalization layers, and activation layers connected sequentially. The convolutional layers have kernel sizes of 3×3 and 3×3, with 256 kernels, a stride of 2, and edge padding of 1. After normalization, the features pass through the activation layer to produce the output feature maps. The feature map sets output by the fourth and ninth convolutional blocks are denoted as [...]; the feature map width is [...] and the height is [...].

[0056] The fifth convolutional block has the same structure as the tenth convolutional block, consisting of three residual blocks connected sequentially. Each residual block is composed of convolutional layers, normalization layers, and activation layers connected sequentially. The convolutional layers have kernel sizes of 3×3 and 3×3, with 512 kernels, a stride of 2, and edge padding of 1. After normalization, the features pass through the activation layer to produce the output feature maps. The feature map sets output by the fifth and tenth convolutional blocks are denoted as [...]; the feature map width is [...] and the height is [...].
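
The spatial size after a convolution or pooling layer follows from the standard formula floor((size + 2·padding − kernel) / stride) + 1. The sketch below applies it only to the two layers whose parameters are fully stated above (the 7×7 stem convolution and the 3×3 max pooling) for the 256-pixel input; it does not attempt to fill in the feature map sizes that are missing from the text, and the function name is a hypothetical helper.

```python
def conv_output_size(size, kernel, stride, padding):
    """Output width or height of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

# 7x7 convolution, stride 2, padding 3, applied to a 256-pixel side.
stem = conv_output_size(256, kernel=7, stride=2, padding=3)    # 128

# 3x3 max pooling, stride 2, padding 1, applied to the stem output.
pooled = conv_output_size(stem, kernel=3, stride=2, padding=1)  # 64
```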

[0057] In the student network, the first and sixth convolutional blocks each consist of four inverted residual blocks connected sequentially, each with three input channels. Each inverted residual block is composed of convolutional layers, normalization layers, and activation layers connected sequentially. The convolutional layers have kernel sizes of 1×1, 3×3, and 1×1, with 24 kernels, strides of 1, 2, and 1, and edge padding of 0, 1, and 0, respectively. After normalization, the features pass through the activation layer to produce the output feature maps. The feature map sets output by the first and sixth convolutional blocks are denoted as [...]; the feature map width is [...] and the height is [...].

[0058] Both the second and seventh convolutional blocks are constructed from three inverted residual blocks connected sequentially. Each inverted residual block is composed of convolutional layers, normalization layers, and activation layers connected sequentially. The convolutional layers have kernel sizes of 1×1, 3×3, and 1×1, with 32 kernels, strides of 1, 2, and 1, and edge padding of 0, 1, and 0, respectively. After normalization, the features pass through the activation layer to produce the output feature maps. The feature map sets output by the second and seventh convolutional blocks are denoted as [...]; the feature map width is [...] and the height is [...].

[0059] The third convolutional block has the same structure as the eighth convolutional block, consisting of four inverted residual blocks connected sequentially. Each inverted residual block is composed of convolutional layers, normalization layers, and activation layers connected sequentially. The convolutional layers have kernel sizes of 1×1, 3×3, and 1×1, with 64 kernels, strides of 1, 2, and 1, and edge padding of 0, 1, and 0, respectively. After normalization, the features pass through the activation layer to produce the output feature maps. The feature map sets output by the third and eighth convolutional blocks are denoted as [...]; the feature map width is [...] and the height is [...].

[0060] The fourth convolutional block has the same structure as the ninth convolutional block, consisting of three inverted residual blocks connected sequentially. Each inverted residual block is composed of convolutional layers, normalization layers, and activation layers connected sequentially. The convolutional layers have kernel sizes of 1×1, 3×3, and 1×1, with 96 kernels, strides of 1, 2, and 1, and edge padding of 0, 1, and 0, respectively. After normalization, the features pass through the activation layer to produce the output feature maps. The feature map sets output by the fourth and ninth convolutional blocks are denoted as [...]; the feature map width is [...] and the height is [...].

[0061] The fifth convolutional block has the same structure as the tenth convolutional block, consisting of three inverted residual blocks connected sequentially. Each inverted residual block is composed of convolutional layers, normalization layers, and activation layers connected sequentially. The convolutional layers have kernel sizes of 1×1, 3×3, and 1×1, with 160 kernels, strides of 1, 2, and 1, and edge padding of 0, 1, and 0, respectively. After normalization, the features pass through the activation layer to produce the output feature maps. The feature map sets output by the fifth and tenth convolutional blocks are denoted as [...]; the feature map width is [...] and the height is [...].

[0062] The remaining components are identical for the teacher and student models.

[0063] In the graph convolution template enhancement module, as shown in Figure 3, a clustering template algorithm is first used to generate template features, and a coordinate graph convolutional inference network is then used to mine long-range contextual information. The outputs of the second and seventh convolutional blocks, the third and eighth convolutional blocks, the fourth and ninth convolutional blocks, and the fifth and tenth convolutional blocks are input into the first, second, third, and fourth graph convolutional template enhancement modules, respectively. In Figure 3, Input 1 represents the features extracted from the depth map or color map by the backbone network, and Input 2 represents a learned 2D matrix with dimensions k and 1, where k is the number of templates in a template feature. In the clustering template algorithm, Input 1 is normalized to obtain Output 4. Output 4 is sent to channel maximum 1 and channel average 1 to obtain the maximum and average values of the features along the channel dimension. The two are added pixel by pixel to obtain Output 5, which is subtracted from Input 2 pixel by pixel to obtain Output 6. Output 4 is input to the eleventh convolutional block, and the resulting value is multiplied pixel by pixel by activation function 1 (Sigmoid) to obtain Output 7. Output 7 is summed along the channel dimension and then sent to normalization 2 and fully connected 1 to obtain Output 8. Output 8 is the template feature obtained by the clustering algorithm. It is multiplied with Output 4 pixel by pixel, and the result is added pixel by pixel to Output 4 and Output 8. The resulting feature is fed into the coordinate graph convolutional inference network to obtain Output 9. Output 9 is added pixel by pixel to both Output 4 and Output 8, and the results are sent to the twelfth and thirteenth convolutional blocks, respectively. The two outputs are then concatenated and sent to the fourteenth convolutional block to obtain the final Output 10. The outputs of the first and fifth graph convolutional template enhancement modules are simultaneously input into the first modality contribution allocation module and the first exchange correlation fusion module, and so on, until the outputs of the fourth and eighth graph convolutional template enhancement modules are simultaneously input into the fourth modality contribution allocation module and the fourth exchange correlation fusion module.
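
The step that produces Output 5 (channel maximum plus channel average, added pixel by pixel) can be sketched as below. The sketch assumes that "according to the channel dimensions" means reducing over the channel axis of a C × H × W feature map; the function name and nested-list layout are illustrative assumptions.

```python
def channel_max_plus_mean(features):
    """features: C x H x W nested lists.
    Returns an H x W map holding, for each pixel, the maximum over channels
    plus the mean over channels."""
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    out = []
    for i in range(height):
        row = []
        for j in range(width):
            values = [features[c][i][j] for c in range(channels)]
            row.append(max(values) + sum(values) / channels)
        out.append(row)
    return out

# Hypothetical feature map with 2 channels, 1 row, 2 columns.
features = [[[1.0, 2.0]], [[3.0, 0.0]]]
summary = channel_max_plus_mean(features)
```

The maximum highlights the strongest response at each location while the mean captures the overall activation level, so their sum gives a compact single-channel saliency summary.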

[0064] In the coordinate graph convolutional inference network, as shown in Figure 4, Input 3 represents the features of the depth or color stream inside the graph convolutional template enhancement module. Input 3 passes through the fifteenth, sixteenth, and seventeenth convolutional blocks, yielding Output 11, Output 12, and Output 13, respectively. Output 11 undergoes average pooling and max pooling, and the two results are added pixel by pixel and fed into activation function 2, followed by a dot product with Output 12. The result then passes through activation function 3 and is multiplied pixel by pixel with Output 13, and the product is fed into the eighteenth and nineteenth convolutional blocks sequentially. Finally, a pixel-level residual connection with Input 3 is applied to obtain Output 14.

[0065] In the modality contribution allocation module, as shown in Figure 5, Input 4 and Input 5 represent the features of the depth stream and color stream, respectively, after passing through the graph convolutional template enhancement module. Input 4 passes sequentially through the twentieth convolutional block, average pooling 2, fully connected 2, and fully connected 3 to obtain Output 15. Input 5 undergoes the same operation: it passes sequentially through the twenty-first convolutional block, average pooling 3, fully connected 4, and fully connected 5 to obtain Output 16. Output 15 and Output 16 are fed together into the cosine similarity calculation to obtain Output 17. Output 17 is used in the fusion module, where it provides the ratio for fusing features from the different modalities.
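
The final step of this module is a cosine similarity between the two per-modality descriptor vectors. A minimal sketch, assuming both descriptors are flat vectors of equal length; the small `eps` guard against zero-length vectors is an addition for numerical safety and is not part of the patent text.

```python
import math

def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)

# Hypothetical descriptors of the depth and color streams.
depth_descriptor = [1.0, 2.0]
color_descriptor = [2.0, 4.0]
contribution = cosine_similarity(depth_descriptor, color_descriptor)
```

Cosine similarity depends only on the direction of the two descriptors, not on their magnitude, so the resulting ratio is insensitive to differences in feature scale between the modalities.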

[0066] For the exchange-correlation fusion module, as shown in Figure 6, inputs 6 and 7 are the features of the depth stream and color stream, respectively, after passing through the graph convolutional template enhancement module; input 8 is the output of the modality contribution allocation module, and input 9 is 1 minus that output. Input 6 is sent to channel maximum 2 and channel average 2, and the two results are added pixel by pixel to obtain output 18. Output 18 is input to the 22nd, 23rd, and 24th convolutional blocks to obtain outputs 20, 21, and 23, respectively. Input 7 undergoes the same operation: it is sent to channel maximum 3 and channel average 3, the two results are added pixel by pixel to obtain output 19, and output 19 is input to the 25th, 26th, and 27th convolutional blocks to obtain outputs 22, 24, and 25, respectively. Outputs 20 and 21 are multiplied by dot product, passed through activation function 5, multiplied by dot product with output 22, and summed pixel by pixel. The result is added to the pixel-level product of inputs 6 and 8 and fed into the 28th convolutional block to obtain output 26. Similarly, outputs 24 and 25 are multiplied by dot product, passed through activation function 6, multiplied by dot product with output 23, and summed pixel by pixel. The result is added to the pixel-level product of inputs 7 and 9 and fed into the 29th convolutional block to obtain output 27. Outputs 26 and 27 are then summed pixel by pixel to obtain output 28.
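The first step of each branch in paragraph [0066] (channel maximum plus channel average, added pixel by pixel) and the modality weighting of an input can be sketched as follows. This is an illustration under stated assumptions, not the patented module: the names `channel_max_plus_mean` and `scale` are hypothetical, features are plain nested lists indexed as [channel][row][column], and "channel maximum" and "channel average" are assumed to reduce over the channel axis. The convolutional blocks, activation functions, and dot-product steps are omitted.

```python
def channel_max_plus_mean(feature):
    """Collapse a [C][H][W] feature into an [H][W] map.

    Each output pixel is the maximum over channels plus the mean over channels.
    """
    channels = len(feature)
    height = len(feature[0])
    width = len(feature[0][0])
    out = []
    for i in range(height):
        row = []
        for j in range(width):
            values = [feature[c][i][j] for c in range(channels)]
            row.append(max(values) + sum(values) / channels)
        out.append(row)
    return out


def scale(feature_map, weight):
    """Multiply every pixel of an [H][W] map by a scalar modality weight."""
    return [[value * weight for value in row] for row in feature_map]
```

For a two-channel, one-row feature `[[[1.0, 2.0]], [[3.0, 0.0]]]`, the first pixel becomes 3.0 + 2.0 and the second 2.0 + 1.0.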

[0067] During the decoding stage, as shown in Figure 2, the outputs of the fourth and third exchange-correlation fusion modules are fed into the third low-order feedback supplementation module, whose output is then fed into both the second and third high-order feedback supplementation modules, and so on, until the outputs of the second and first exchange-correlation fusion modules are sent to the first low-order feedback supplementation module, whose output is sent to the first high-order feedback supplementation module. In addition, the first, second, and third high-order feedback supplementation modules also receive the output of the first low-order feedback supplementation module as feedback, and their outputs 1, 2, and 3 serve as the final prediction results.

[0068] For the low-order feedback supplementation module, as shown in Figure 7, input 8 is concatenated with input 9 through adaptive upsampling and the 28th convolutional block to obtain output 29. Output 29 is fed into average pooling 4, the 29th convolutional block, and the 30th convolutional block to obtain output 31. Output 29 is also fed into the 31st and 32nd convolutional blocks to obtain output 30. The two outputs are added element-wise and passed through activation function 7, and the result is multiplied pixel-wise with inputs 8 and 9, respectively. The two products are added pixel-wise and fed into the 33rd convolutional block to obtain output 32. Output 32 undergoes erosion and dilation operations, and the two results are subtracted pixel-wise before being fed into the 34th convolutional block to obtain output 33. Output 33 and output 32 are then element-wise multiplied and added to obtain output 34.
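The erosion, dilation, and subtraction step in paragraph [0068] resembles a morphological gradient, which highlights boundaries. The sketch below illustrates that idea only. The 3x3 window, the clipping of the window at image borders, the subtraction order (dilation minus erosion), and the function names are all assumptions, since the text specifies none of them.

```python
def _window_reduce(image, reducer):
    """Apply reducer (min or max) over each pixel's 3x3 neighbourhood, clipped at the borders."""
    height, width = len(image), len(image[0])
    out = []
    for i in range(height):
        row = []
        for j in range(width):
            neighbours = [
                image[y][x]
                for y in range(max(0, i - 1), min(height, i + 2))
                for x in range(max(0, j - 1), min(width, j + 2))
            ]
            row.append(reducer(neighbours))
        out.append(row)
    return out


def erode(image):
    """Grayscale erosion: each pixel becomes the minimum of its neighbourhood."""
    return _window_reduce(image, min)


def dilate(image):
    """Grayscale dilation: each pixel becomes the maximum of its neighbourhood."""
    return _window_reduce(image, max)


def morphological_gradient(image):
    """Dilation minus erosion, pixel by pixel (assumed subtraction order)."""
    eroded, dilated = erode(image), dilate(image)
    return [[d - e for d, e in zip(d_row, e_row)] for d_row, e_row in zip(dilated, eroded)]
```

On a flat region the result is zero, while on a step such as `[[0, 0, 1, 1]]` only the pixels next to the transition are non-zero, which is why a map of this kind can serve as a boundary cue.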

[0069] For the high-order feedback supplementation module, as shown in Figure 8, the first half of the operation is identical to that of the low-order feedback supplementation module. Input 12 represents the output of the first low-order feedback supplementation module. Input 10 is concatenated with input 11 through adaptive upsampling and the 35th convolutional block to obtain output 35. Output 35 is fed into average pooling 5 and the 36th and 37th convolutional blocks to obtain output 37, and is also fed into the 38th and 39th convolutional blocks to obtain output 36. The two outputs are added element-wise and passed through activation function 8, and the result is multiplied pixel by pixel with input 10 and input 11, respectively. The two products are added pixel by pixel and fed into the 40th convolutional block to obtain output 38. Input 12 passes through the 41st convolutional block and is added element-wise to output 38 to obtain output 39.

[0070] For the dual-teacher knowledge distillation model, as shown in Figure 9, two teacher models and one student model are constructed. The depth map and color map are fed into the first and second teacher models, yielding final outputs 40 and 41, respectively. These are then fed into the weight allocation module to obtain output 49, which is used to assign weights to the two teacher models. Output 43 (from the first teacher clustering template algorithm) and output 44 (from the second teacher clustering template algorithm) are multiplied element-wise by output 49. The results, along with output 45 (from the first student clustering template algorithm), are fed into the clustering knowledge distillation process, which uses the pairwise similarity loss function (SP loss). Output 46 (from the first teacher coordinate graph convolutional network) and output 47 (from the second teacher coordinate graph convolutional network) are likewise multiplied element-wise by output 49 and fed, along with output 48 (from the first student convolutional network), into the context information mining knowledge distillation process, which uses the Dice loss function. The final outputs 40 and 41 of the two teacher models are also multiplied element-wise by output 49 and fed, along with output 42 of the student model, into the prediction knowledge distillation process, which uses the Kullback-Leibler divergence loss function (KLD loss).
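The three distillation losses named in paragraph [0070] can be sketched in their commonly used forms. These are generic formulations offered as assumptions, not the exact losses of the invention: the function names are hypothetical, inputs are plain Python lists, the similarity loss follows a widely used row-normalised pairwise-similarity formulation, and the teacher weighting by output 49 is left out because the text does not say how the two weighted teacher outputs are combined.

```python
import math


def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) for two discrete probability distributions."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0.0
    )


def dice_loss(pred, target, eps=1e-6):
    """1 minus the Dice coefficient of two flattened soft masks with values in [0, 1]."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


def _row_normalised_gram(features):
    """Pairwise similarity matrix F F^T with each row scaled to unit L2 norm."""
    gram = [[sum(a * b for a, b in zip(fi, fj)) for fj in features] for fi in features]
    normalised = []
    for row in gram:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        normalised.append([v / norm for v in row])
    return normalised


def similarity_preserving_loss(teacher_feats, student_feats):
    """Mean squared difference between teacher and student pairwise similarity matrices."""
    g_t = _row_normalised_gram(teacher_feats)
    g_s = _row_normalised_gram(student_feats)
    b = len(g_t)
    return sum((g_t[i][j] - g_s[i][j]) ** 2 for i in range(b) for j in range(b)) / (b * b)
```

Because the similarity matrices are row-normalised, the similarity loss in this sketch compares the pattern of pairwise similarities rather than raw feature magnitudes, so teacher and student features may differ in scale or dimensionality.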

[0071] The training of the dual-teacher knowledge distillation model in S3 is divided into two stages: training the teacher model and training the student model. The training of the teacher model is as follows:

[0072] Phase 1: Input the training set into the teacher model and use the cross-entropy loss function to calculate the loss between the teacher model's predicted image and the corresponding ground-truth semantic segmentation image. Training is repeated 200 times, until the teacher model converges and the loss function value is minimized, yielding the trained teacher model.

[0073] Furthermore, the student model is trained as follows:

[0074] Phase 2: Input the training set into the student model and use the cross-entropy loss function to calculate the loss L_pre between the student model's predicted image and the corresponding ground-truth semantic segmentation image. The knowledge of the trained teacher models is transferred to the student model through the dual-teacher knowledge distillation model: clustering knowledge distillation yields loss L_sp, context information mining knowledge distillation yields loss L_dice, and prediction knowledge distillation yields loss L_kld. The sum of the four loss values is used as the loss of the student model. Training is repeated 200 times, until the student model converges and the loss function value is minimized, yielding the trained student model.
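The student objective in paragraph [0074] is the sum of four terms: the cross-entropy prediction loss and the three distillation losses. A minimal sketch follows, assuming per-pixel class probabilities are already available as plain lists. The function names are hypothetical, and the unweighted sum reflects the text, which mentions no weighting coefficients.

```python
import math


def pixel_cross_entropy(probabilities, labels, eps=1e-12):
    """Mean cross-entropy over pixels.

    probabilities: list of per-pixel class-probability lists.
    labels: list of ground-truth class indices, one per pixel.
    """
    if len(probabilities) != len(labels):
        raise ValueError("one label is required per pixel")
    total = 0.0
    for probs, label in zip(probabilities, labels):
        total -= math.log(probs[label] + eps)
    return total / len(labels)


def student_total_loss(l_pre, l_sp, l_dice, l_kld):
    """Unweighted sum of the prediction loss and the three distillation losses."""
    return l_pre + l_sp + l_dice + l_kld
```

In an actual training loop the four terms would be computed on the same batch and the summed value would be back-propagated through the student model only, since the teacher models are already trained in the first phase.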

[0075] In S4, the remote sensing image to be detected is input into the trained dual-teacher knowledge distillation model. Specifically, the remote sensing image to be detected is input into the finally trained student model, and the student model outputs the corresponding remote sensing image scene classification prediction map.

[0076] To further verify the feasibility and effectiveness of the method of the present invention, experiments were conducted.

[0077] All experiments were conducted in a PyTorch environment using an NVIDIA TITAN V graphics card and 16 GB of RAM. Two objective metrics commonly used for evaluating scene classification methods were adopted to assess the prediction model: mean class accuracy (mAcc) and mean intersection over union (MIoU) between the predicted and label images.
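Both metrics can be computed from a confusion matrix, as in the following sketch. It uses the standard definitions (per-class recall averaged over classes, and per-class intersection over union averaged over classes). The function names and the choice to skip classes that never occur are assumptions, because the text does not describe the evaluation code.

```python
def confusion_matrix(labels, predictions, num_classes):
    """matrix[t][p] counts pixels with true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(labels, predictions):
        matrix[t][p] += 1
    return matrix


def mean_class_accuracy(matrix):
    """Average per-class recall, skipping classes absent from the labels."""
    accuracies = []
    for c, row in enumerate(matrix):
        support = sum(row)
        if support > 0:
            accuracies.append(row[c] / support)
    return sum(accuracies) / len(accuracies)


def mean_iou(matrix):
    """Average per-class intersection over union, skipping classes with an empty union."""
    n = len(matrix)
    ious = []
    for c in range(n):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp
        fp = sum(matrix[r][c] for r in range(n)) - tp
        union = tp + fn + fp
        if union > 0:
            ious.append(tp / union)
    return sum(ious) / len(ious)
```

For example, with labels `[0, 0, 1, 1]` and predictions `[0, 1, 1, 1]`, the per-class accuracies are 1/2 and 1, and the per-class IoU values are 1/2 and 2/3.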

[0078] The method of this invention was used to predict each remote sensing image in the test sets of two remote sensing image databases, Vaihingen and Potsdam, to obtain the predicted image corresponding to each remote sensing image. The mean class accuracy (mAcc) and the mean intersection over union (MIoU) between the predicted and label images, which reflect the scene classification performance of the method, are listed in Table 1 for the Vaihingen and Potsdam test sets. The data in Table 1 show that the predictions obtained by the method of this invention are good, indicating that using the method to obtain the predicted scene classification image corresponding to a remote sensing image is feasible and effective.

[0079] Table 1

[0080]

[0081]

[0082] Figure 10 shows the predicted image comparison for the Vaihingen dataset. Figure 10a shows an original remote sensing image from the Vaihingen dataset; Figure 10b shows the predicted scene classification image obtained by the teacher model when the method of the present invention is applied to the original remote sensing image in Figure 10a; Figure 10c shows the predicted scene classification image obtained by the student model, with the participation of knowledge distillation, for the original remote sensing image in Figure 10a. Figure 11 shows the predicted image comparison for the Potsdam dataset. Figure 11a shows an original remote sensing image from the Potsdam dataset; Figure 11b shows the predicted scene classification image obtained by the teacher model when the method of the present invention is applied to the original remote sensing image in Figure 11a; Figure 11c shows the predicted scene classification image obtained by the student model, with the participation of knowledge distillation, for the original remote sensing image in Figure 11a. Comparing Figures 10a, 10b, and 10c, and comparing Figures 11a, 11b, and 11c, shows that the predicted scene classification images obtained by the method of the present invention are highly accurate.

[0083] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from the other embodiments. For parts that are similar or identical between embodiments, the embodiments may be referred to one another. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the method section.

[0084] The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A remote sensing image scene classification method based on graph convolution template features and dual-teacher knowledge distillation, characterized in that it includes the following steps: S1. Acquire remote sensing image data and preprocess the remote sensing image data; S2. Divide the preprocessed remote sensing image data into a training set and a test set; S3. Establish a dual-teacher knowledge distillation model, which includes two teacher models and one student model, and input the training set and test set from S2 into the dual-teacher knowledge distillation model to achieve model training and testing; S4. Input the remote sensing image to be detected into the trained dual-teacher knowledge distillation model, and the model outputs the corresponding remote sensing image scene classification prediction map. The dual-teacher knowledge distillation model in S3 consists of two teacher models and one student model. The outputs of the first teacher clustering template algorithm and the second teacher clustering template algorithm are multiplied element-wise, and the results, along with the output of the first student clustering template algorithm, are fed into the clustering knowledge distillation process, using a pairwise similarity loss function. The outputs of the first teacher coordinate graph convolutional network and the second teacher coordinate graph convolutional network are multiplied element-wise and fed together with the output of the first student convolutional network into the context information mining knowledge distillation process, using a Dice loss function. The final outputs of the two teacher models are multiplied element-wise and fed together with the output of the student model into the prediction knowledge distillation process, using a Kullback-Leibler divergence loss function. The training of the dual-teacher knowledge distillation model in S3 is divided into two stages: training the teacher model and training the student model.
The teacher model is trained as follows: Phase 1: Input the training set into the teacher model and use the cross-entropy loss function to calculate the loss between the teacher model's predicted image and the corresponding ground-truth semantic segmentation image; train multiple times until the teacher model converges and the loss function value is minimized, to obtain a set of trained teacher models. The student model is trained as follows: Phase 2: Input the training set into the student model and use the cross-entropy loss function to calculate the loss between the student model's predicted image and the corresponding ground-truth semantic segmentation image. The knowledge of the trained teacher models is transferred to the student model through the dual-teacher knowledge distillation model: clustering knowledge distillation yields a loss, context information mining knowledge distillation yields a loss, and prediction knowledge distillation yields a loss. The sum of the four loss function values is used as the loss of the student model. The model is trained multiple times until the student model converges and the loss function value is minimized, thus obtaining the trained student model.

2. The remote sensing image scene classification method based on graph convolution template features and dual-teacher knowledge distillation according to claim 1, characterized in that the remote sensing image data acquired in S1 includes the Vaihingen dataset and the Potsdam dataset.

3. The remote sensing image scene classification method based on graph convolution template features and dual-teacher knowledge distillation according to claim 1, characterized in that, in S1, the preprocessing of the remote sensing image data includes randomly flipping and cropping each remote sensing image and its corresponding depth image to achieve data augmentation.

4. The remote sensing image scene classification method based on graph convolution template features and dual-teacher knowledge distillation according to claim 1, characterized in that, in S4, the remote sensing image to be detected is input into the trained dual-teacher knowledge distillation model; specifically, the remote sensing image to be detected is input into the finally trained student model, and the student model outputs the corresponding remote sensing image scene classification prediction map.