A zero-shot remote sensing image scene classification method and system based on frequency decoupling

By combining frequency decoupling with semantic guidance, the method reduces the interference of irrelevant background information in remote sensing images and improves the accuracy and reliability of zero-shot remote sensing image scene classification. This addresses the background noise interference that affects traditional methods and makes scene classification more efficient.

CN122289786A · Pending · Publication Date: 2026-06-26 · INNER MONGOLIA UNIV OF TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: INNER MONGOLIA UNIV OF TECH
Filing Date: 2026-04-03
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Traditional remote sensing scene classification methods rely on a large number of labeled samples, making it difficult to cover new scene categories. Furthermore, irrelevant background information in remote sensing images interferes with model feature learning and semantic alignment, leading to classification errors and limiting the application of zero-shot learning techniques in remote sensing image scene classification.

Method used

Frequency decoupling technology is employed to extract low-frequency components of remote sensing images as suppression signals through frequency domain analysis. An attention mechanism is introduced to weaken background interference. Visual features are extracted using an improved ResNet101 and ViT encoder, and semantic guidance is performed in conjunction with attribute features to enhance the attention to semantically relevant regions.

Benefits of technology

The method suppresses background noise interference, improves the alignment accuracy between visual features and semantic information, enhances zero-shot scene classification performance, and improves the model's ability to infer unseen categories reliably.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a zero-shot remote sensing image scene classification method and system based on frequency decoupling, belonging to the field of image classification technology. Prompt words guide a large language model to generate remote sensing attributes and attribute probabilities for each scene category, from which a category × attribute probability matrix is constructed, and a W2V model extracts attribute features. An improved ResNet101 extracts visual features from the remote sensing image, which are then fed into a ViT encoder for further feature extraction. To suppress the large background areas and redundant information reflected by low-frequency components, a frequency decoupling module is introduced within the encoder to reconstruct visual features from a frequency-domain perspective, highlighting discriminative visual features that are highly correlated with semantic attributes. In the decoding stage, the decoder uses a cross-attention mechanism in which attribute features semantically guide the visual features, so that the model focuses more on attribute-related image regions, thereby improving its classification ability in zero-shot scenes.

Description

Technical Field

[0001] This invention relates to the field of image classification technology, and more specifically to a zero-shot remote sensing image scene classification method and system based on frequency decoupling.

Background Technology

[0002] Remote sensing scene classification, as an important means of automatically extracting and classifying surface information from massive amounts of remote sensing images, plays a vital role in many fields such as land cover mapping, land use monitoring, resource and environmental assessment, and urban planning and management. Its classification accuracy and efficiency directly affect the quality of various related application tasks.

[0003] With the rapid development of remote sensing technology, the data scale of remote sensing images has exploded, and the types of surface scenes have become increasingly diverse. Traditional remote sensing scene classification methods rely on a large number of labeled samples for model training, which is not only costly and time-consuming, but also difficult to cover the constantly emerging new scene categories, resulting in a severe limitation on the generalization ability of the model, and failing to meet the needs of open category scene classification in practical applications.

[0004] The emergence of zero-shot learning technology provides an effective solution to the above problems. Its core idea is to use the semantic knowledge (such as attributes and feature descriptions) shared between known and unseen categories to achieve effective classification of unseen categories when there are no training samples for unseen categories. This significantly alleviates the challenges brought about by insufficient annotation and category openness in remote sensing scenes, and is therefore widely used in remote sensing image scene classification tasks.

[0005] However, remote sensing images have characteristics that significantly differ from natural images. They typically contain a large amount of background content unrelated to semantic information. This background content severely distracts the model from focusing on semantically relevant regions, making it difficult for the model to concentrate on truly discriminative image information. In zero-shot remote sensing image scene classification, when using shared semantic information to infer unknown categories, this irrelevant background information strongly interferes with the model's feature learning and semantic alignment processes. This leads to inaccurate alignment between visual features and semantic information, resulting in semantic inference bias and ultimately classification errors. This significantly weakens the model's reliable inference ability for unknown categories, reduces the performance of zero-shot remote sensing image scene classification, and limits the further promotion and application of zero-shot learning techniques in the field of remote sensing scene classification.

[0006] In summary, how to provide a zero-shot remote sensing image scene classification method and system that can effectively suppress irrelevant background interference and enhance semantically relevant feature learning is a problem that urgently needs to be solved by those skilled in the art.

Summary of the Invention

[0007] In view of this, the present invention provides a zero-shot remote sensing image scene classification method and system based on frequency decoupling. It utilizes frequency domain analysis to extract low-frequency components in the image to represent background information, and introduces low-frequency background information as a suppression signal into the attention mechanism, thereby reducing the interference of the background on the model's semantic judgment and highlighting the key information in the image that is semantically related. This effectively improves the model's attention to semantically related regions and significantly enhances the ability of zero-shot scene classification.

[0008] To achieve the above objectives, the present invention adopts the following technical solution. In one aspect, the present invention provides a zero-shot remote sensing image scene classification method based on frequency decoupling, comprising: generating remote sensing attributes and attribute probabilities for each scene category using a large language model, constructing a category × attribute probability matrix, and extracting attribute features using a W2V model; extracting visual features from the remote sensing images to be classified using an improved ResNet101 model and a ViT encoder; using the attribute features to semantically guide the visual features to obtain semantically guided visual features; and obtaining the final classification result based on the semantically guided visual features and the category × attribute probability matrix.

[0009] Preferably, the improved ResNet101 model is a ResNet101 model with the fully connected layers removed.

[0010] Preferably, using the improved ResNet101 model and the ViT encoder to extract visual features from the remote sensing images to be classified specifically includes: extracting initial visual features of the remote sensing images to be classified using a pre-trained improved ResNet101 model and a ViT encoder; introducing a frequency decoupling module based on a frequency attention mechanism inside the ViT encoder to perform a discrete wavelet transform on the initial visual features, decomposing them into low-frequency features and three sets of high-frequency features; reducing the dimensionality of the low-frequency features and flattening them into one-dimensional vectors; obtaining a low-frequency attention map from the one-dimensional vectors; upsampling the low-frequency attention map using bilinear interpolation to obtain a low-frequency attention map that includes background information; and removing the background features from the original visual features based on the low-frequency attention map to obtain the visual features.

[0011] Preferably, the visual features are obtained by removing background features from the original visual features based on the low-frequency attention map, specifically including: The original visual features are fed into the ReLU layer and Dropout layer to obtain the embedded features; Visual self-attention is extracted from embedded features using the self-attention layer of the ViT encoder; Visual features are obtained based on visual self-attention and low-frequency attention maps.

[0012] Preferably, using attribute features to semantically guide visual features to obtain semantically guided visual features specifically includes: inputting the attribute features and the visual features into the multi-head attention layer and multi-layer perceptron layer of the ViT decoder to calculate the semantically guided visual features.

[0013] Here A ∈ R^{N×D} is the attribute feature, the visual feature lies in R^{(HW)×(HW)}, the three projection matrices are learnable, and the remaining term is the scaling factor.

[0014] Preferably, the method further includes: optimizing the model using the mean squared error between the probability matrix and the classification results as the loss function, and updating the model parameters by minimizing the loss function.

[0015] Here M is the total number of samples, Y_i is the probability value of the i-th sample, and P_i is the classification result for the i-th sample.

[0016] In another aspect, the present invention provides a zero-shot remote sensing image scene classification system based on frequency decoupling, comprising: an attribute feature extraction module, used to generate remote sensing attributes and attribute probabilities for each scene category using a large language model, to construct a category × attribute probability matrix, and to extract attribute features using the W2V model; a visual feature extraction module, used to extract visual features of the remote sensing images to be classified using an improved ResNet101 model and a ViT encoder; a semantic guidance module, used to semantically guide the visual features using the attribute features to obtain semantically guided visual features; and a classification module, used to obtain the final classification result by combining the semantically guided visual features with the category × attribute probability matrix.

[0017] As can be seen from the above technical solution, compared with the prior art, this invention discloses a zero-shot remote sensing image scene classification method and system based on frequency decoupling. A frequency decoupling module introduced inside the ViT encoder uses frequency domain analysis to extract the low-frequency components of the image as a representation of background information. This low-frequency background information is then fed into the attention mechanism as a suppression signal, so that low-frequency redundant information in the visual features is identified and removed. This reduces the interference of background content unrelated to category semantics with the model's judgment and makes the alignment between visual features and semantic attributes more accurate, which addresses the semantic inference shift caused by background noise in traditional zero-shot learning methods. In the decoding stage, the cross-attention mechanism is combined with attribute features to semantically guide the visual features, further enhancing the model's ability to locate semantically relevant regions and improving the discriminability of the feature representation.

Description of the Drawings

[0018] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.

[0019] Figure 1 is the overall framework diagram of the zero-shot remote sensing image scene classification method based on frequency decoupling provided by the present invention.

[0020] Figure 2 is a schematic diagram of the specific structure of the frequency decoupling module.

[0021] Figure 3 is a schematic diagram of the overall structure of the zero-shot remote sensing image scene classification system based on frequency decoupling provided by the present invention.

Detailed Implementation

[0022] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0023] This invention discloses a zero-shot remote sensing image scene classification method based on frequency decoupling. As shown in Figure 1, it includes: Step 1: Generate remote sensing attributes and attribute probabilities for each scene category using a large language model, construct a category × attribute probability matrix, and extract attribute features using a W2V model.

[0024] The process of generating attributes with a Large Language Model (LLM) is as follows. Input the following prompt into the Large Language Model: "As a remote sensing interpretation expert, infer the possible visual attributes of a given remote sensing scene category in remote sensing imagery, and provide the probability of occurrence (0-1) for each attribute. Output a unified category × attribute probability matrix. Attributes must be directly observable features of the remote sensing imagery (at least 20, with all categories sharing the same attribute set)". Then organize the output to obtain the category × attribute probability matrix Y ∈ R^{H×W×C}.
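As a rough illustration of the "organize the output" step, the sketch below turns an LLM reply into a category × attribute matrix. The JSON reply format, the function name `build_probability_matrix`, and the example values are all assumptions made for illustration; the patent does not specify how the reply is structured or parsed. The example uses three attributes for brevity, whereas the prompt asks for at least 20.

```python
import json

import numpy as np


def build_probability_matrix(llm_json: str):
    """Parse a {category: {attribute: probability}} JSON reply into a matrix.

    The JSON layout is a hypothetical reply format, not one given in the patent.
    """
    reply = json.loads(llm_json)
    categories = sorted(reply)
    # The prompt requires every category to share the same attribute set.
    attributes = sorted(reply[categories[0]])
    for cat in categories:
        if sorted(reply[cat]) != attributes:
            raise ValueError(f"category {cat!r} uses a different attribute set")
    matrix = np.array(
        [[reply[cat][attr] for attr in attributes] for cat in categories],
        dtype=float,
    )
    # Probabilities are requested in the range 0-1; clip anything outside it.
    return categories, attributes, np.clip(matrix, 0.0, 1.0)


# Hand-written example reply with two categories and three attributes.
example_reply = json.dumps({
    "airport": {"runway": 0.95, "water": 0.10, "trees": 0.20},
    "harbor": {"runway": 0.02, "water": 0.98, "trees": 0.15},
})
categories, attributes, Y = build_probability_matrix(example_reply)
```

Each row of `Y` then corresponds to one scene category and each column to one shared attribute.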

[0025] Step 2: Extract visual features from the remote sensing images to be classified using an improved ResNet101 model and a ViT encoder; Step 3: Use attribute features to semantically guide visual features to obtain semantically guided visual features; Step 4: Obtain the final classification result based on the semantically guided visual features and the category × attribute probability matrix.

[0026] Specifically, the improved ResNet101 model is a ResNet101 model with the fully connected layers removed.

[0027] Furthermore, using the improved ResNet101 model and ViT encoder to extract visual features from the remote sensing images to be classified specifically includes: the remote sensing image X to be classified is input into a pre-trained ResNet101 model with the fully connected layers removed, and the pre-trained improved ResNet101 model and ViT encoder extract the initial visual features X_out ∈ R^{H×W×C}, where B is the batch size, C is the number of channels, and H and W are the height and width.

[0028] To reduce background interference and enhance attribute-related features, a frequency decoupling module (FDM) based on a frequency attention mechanism is introduced within the Vision Transformer (ViT) encoder. Its structure is shown in Figure 2. This module first performs a discrete wavelet transform on the initial visual features, decomposing them into the low-frequency features F_ll and three sets of high-frequency features F_lh, F_hl, and F_hh, as shown in formula (1).

[0029] F_ll, F_lh, F_hl, F_hh = DWT(X_out)    (1)
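The decomposition into one low-frequency and three high-frequency sub-bands can be sketched with a single-level 2D Haar transform. The choice of the Haar wavelet is an assumption (the text does not name the wavelet), `haar_dwt2` is a hypothetical helper name, and the LH/HL naming convention varies between libraries.

```python
import numpy as np


def haar_dwt2(x: np.ndarray):
    """Single-level 2D Haar DWT of an (H, W, C) array with even H and W.

    Assumption: Haar wavelet; the patent only says "discrete wavelet transform".
    """
    a = x[0::2, 0::2]  # top-left sample of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    f_ll = (a + b + c + d) / 2.0  # average of the block: low-frequency content
    f_lh = (a + b - c - d) / 2.0  # difference between top and bottom rows
    f_hl = (a - b + c - d) / 2.0  # difference between left and right columns
    f_hh = (a - b - c + d) / 2.0  # diagonal difference
    return f_ll, f_lh, f_hl, f_hh


rng = np.random.default_rng(0)
x_out = rng.standard_normal((8, 8, 4))  # toy stand-in for the initial features
f_ll, f_lh, f_hl, f_hh = haar_dwt2(x_out)
```

Each sub-band has half the spatial resolution of the input, and a perfectly flat region contributes only to `f_ll`, which is why the low-frequency band is a plausible proxy for large uniform backgrounds.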

[0030] Because the low-frequency features F_ll contain a large amount of background information from the image, the background may dominate the visual features and reduce the salience of attribute-related regions. Therefore, after the low-frequency features are extracted, the number of channels C of F_ll is first reduced to one to obtain the dimensionality-reduced low-frequency features, as shown in formula (2).

[0031]

[0032] Next, the dimensionality-reduced low-frequency features are flattened into a one-dimensional vector. Finally, the pairwise similarity between this vector and its transpose is calculated to yield the low-frequency attention map. The process is shown in formulas (3)-(4).

[0033]

[0034] The low-frequency attention map is upsampled to (HW)×(HW) using bilinear interpolation to obtain a low-frequency attention map in R^{(HW)×(HW)} that includes background information, as given by formula (5).
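The channel reduction, flattening, pairwise similarity, and upsampling steps can be sketched as follows. Several details are assumptions because the corresponding formulas are not recoverable here: channels are reduced by averaging, the pairwise similarity is taken as the outer product of the flattened vector with itself, and the bilinear interpolation uses an align-corners sampling grid. The helper names are hypothetical.

```python
import numpy as np


def bilinear_resize(m: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Bilinear interpolation of a 2D array (align-corners sampling grid)."""
    in_h, in_w = m.shape
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]  # vertical interpolation weights
    wx = (xs - x0)[None, :]  # horizontal interpolation weights
    top = m[y0][:, x0] * (1 - wx) + m[y0][:, x1] * wx
    bottom = m[y1][:, x0] * (1 - wx) + m[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


def low_frequency_attention(f_ll: np.ndarray, height: int, width: int) -> np.ndarray:
    """Build an (HW, HW) low-frequency attention map from an (h, w, C) sub-band."""
    reduced = f_ll.mean(axis=-1)   # assumption: channel mean reduces C to one
    v = reduced.reshape(-1, 1)     # flatten to a column vector of length h*w
    attn = v @ v.T                 # assumption: outer product as pairwise similarity
    n = height * width
    return bilinear_resize(attn, n, n)


rng = np.random.default_rng(0)
f_ll = rng.standard_normal((2, 2, 3))  # toy low-frequency sub-band
a_low = low_frequency_attention(f_ll, height=4, width=4)
```

The upsampling is what makes the map the same size as a token-to-token attention matrix over H×W positions, so the two can later be combined element-wise.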

[0035] The background features in the original visual features are removed based on the low-frequency attention map to obtain the visual features.

[0036] Furthermore, removing background features from the original visual features based on the low-frequency attention map to obtain the visual features specifically includes: when calculating the original ViT visual attention, the initial visual features are first fed into a ReLU layer and a Dropout layer to obtain the embedded features I ∈ R^{H×W×C}, as shown in formula (6).

[0037]

[0038] Visual self-attention is extracted from the embedded features I using the self-attention layer of the ViT encoder. The calculation is shown in formulas (7)-(8).

[0039] In these formulas, the three weight matrices are learnable, and the three projected matrices are the key matrix, query matrix, and value matrix, respectively.

[0040] The low-frequency attention map calculated using formula (5) is subtracted from the visual self-attention to yield the visual self-attention features with the low-frequency attention removed, which highlights the features related to the attributes. The calculation is shown in formula (9).

[0041]
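A minimal sketch of the suppression step, under stated assumptions: the subtraction is applied to the (HW)×(HW) attention weights, since that is the only quantity whose shape matches the low-frequency map; dropout is treated as the identity (inference mode); and the value projection is left out because how it enters the original formulas is not recoverable. All names and shapes are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def suppressed_self_attention(x_out, w_q, w_k, low_freq_attn):
    """Self-attention weights minus the low-frequency (background) attention map."""
    h, w, c = x_out.shape
    tokens = x_out.reshape(h * w, c)      # one token per spatial position
    embedded = np.maximum(tokens, 0.0)    # ReLU; dropout is identity at inference
    q = embedded @ w_q                    # query projection
    k = embedded @ w_k                    # key projection
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (HW, HW), rows sum to 1
    return attn - low_freq_attn           # suppress background-dominated entries


rng = np.random.default_rng(0)
x_out = rng.standard_normal((4, 4, 8))
w_q = rng.standard_normal((8, 8))
w_k = rng.standard_normal((8, 8))
low = np.full((16, 16), 0.01)             # toy low-frequency attention map
visual = suppressed_self_attention(x_out, w_q, w_k, low)
```

Entries where the low-frequency map is large are pushed down, so token pairs dominated by background receive less weight in the resulting map.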

[0042] Furthermore, using attribute features to semantically guide visual features to obtain semantically guided visual features specifically includes: the visual semantic decoder consists of a ViT decoder with multi-head self-attention and a multilayer perceptron. The attribute features A ∈ R^{N×D} and the visual features in R^{(HW)×(HW)} are input into the multi-head attention layer and multi-layer perceptron layer of the ViT decoder to calculate the semantically guided visual features.

[0043] Here A is the attribute feature, the three projection matrices are learnable, and the remaining term is the scaling factor.
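The semantic guidance step can be illustrated with single-head scaled dot-product cross-attention in which the attribute features act as queries and the visual features supply keys and values. This is an assumed form: the original formulas are not recoverable, the multi-head split and the multilayer perceptron are omitted, and the projection shapes are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def semantic_guidance(attr, visual, w_q, w_k, w_v):
    """Cross-attention: attributes (N, D) attend over visual features (HW, HW)."""
    q = attr @ w_q      # (N, d) queries from attribute features
    k = visual @ w_k    # (HW, d) keys from visual features
    v = visual @ w_v    # (HW, d) values from visual features
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N, HW)
    return weights @ v, weights


rng = np.random.default_rng(0)
n_attr, d_attr, hw, d = 5, 6, 16, 8
attr = rng.standard_normal((n_attr, d_attr))
visual = rng.standard_normal((hw, hw))
w_q = rng.standard_normal((d_attr, d))
w_k = rng.standard_normal((hw, d))
w_v = rng.standard_normal((hw, d))
guided, weights = semantic_guidance(attr, visual, w_q, w_k, w_v)
```

Each row of `weights` shows how strongly one attribute attends to each spatial position, which is the sense in which the attributes steer the model toward attribute-related regions.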

[0044] In another embodiment, a learnable matrix W maps the semantically guided visual features to the attribute dimension of the attribute features to obtain the final classification result P ∈ R^N. The calculation is shown in formula (12).

[0045]
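One way to read this step is sketched below, with the assumptions spelled out: W is taken to be a vector of length d so that the (N, d) guided features map to one score per attribute, and the category is chosen as the row of the category × attribute matrix closest to P in squared error. The decision rule is an illustrative choice, not one stated in the text.

```python
import numpy as np


def classify(guided: np.ndarray, w: np.ndarray, Y: np.ndarray):
    """Map guided features to attribute scores P and pick the closest category row."""
    p = guided @ w                           # (N,) one predicted score per attribute
    dists = ((Y - p) ** 2).mean(axis=1)      # (K,) squared error against each category
    return p, int(np.argmin(dists))


# Toy category × attribute matrix with K = 3 categories and N = 4 attributes.
Y = np.array([
    [0.9, 0.1, 0.2, 0.8],
    [0.1, 0.9, 0.7, 0.3],
    [0.5, 0.5, 0.1, 0.1],
])
guided = np.eye(4)     # toy guided features, chosen so that P equals w
w = Y[1].copy()        # toy weights that reproduce the second category's row
P, predicted = classify(guided, w, Y)
```

Because unseen categories only need a row in `Y`, a rule of this kind can score categories that contributed no training images.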

[0046] Furthermore, the method of the present invention also includes: The model is optimized using the probability matrix and the mean squared error of the classification results as the loss function, and the model parameters are updated by minimizing the loss function.

[0047] Here M is the total number of samples, Y_i is the probability value of the i-th sample, and P_i is the classification result for the i-th sample.
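A minimal sketch of such a mean squared error loss, assuming Y_i and P_i are attribute vectors, that the squared error is summed over attributes, and that the result is averaged over the M samples. The exact normalization in the original formula is not recoverable, so this is one plausible form.

```python
import numpy as np


def mse_loss(Y: np.ndarray, P: np.ndarray) -> float:
    """Mean over M samples of the squared error between Y_i and P_i."""
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    per_sample = np.sum((Y - P) ** 2, axis=1)  # squared error of each sample
    return float(np.mean(per_sample))          # average over the M samples


Y_true = np.array([[1.0, 0.0], [0.0, 1.0]])  # toy target probability rows
P_pred = np.array([[1.0, 0.0], [0.0, 0.0]])  # toy predictions
loss = mse_loss(Y_true, P_pred)
```

In training, this scalar would be minimized with respect to the model parameters by a gradient-based optimizer.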

[0048] In another aspect, the present invention provides a zero-shot remote sensing image scene classification system based on frequency decoupling. As shown in Figure 3, it includes: an attribute feature extraction module, used to generate remote sensing attributes and attribute probabilities for each scene category using a large language model, to construct a category × attribute probability matrix, and to extract attribute features using the W2V model; a visual feature extraction module, used to extract visual features of the remote sensing images to be classified using an improved ResNet101 model and a ViT encoder; a semantic guidance module, used to semantically guide the visual features using the attribute features to obtain semantically guided visual features; and a classification module, used to obtain the final classification result based on the semantically guided visual features and the category × attribute probability matrix.

[0049] The embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from the others; for parts that are the same or similar, the embodiments may be referred to one another. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.

[0050] The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A zero-shot remote sensing image scene classification method based on frequency decoupling, characterized in that it comprises: generating remote sensing attributes and attribute probabilities for each scene category using a large language model, constructing a category × attribute probability matrix, and extracting attribute features using a W2V model; extracting visual features from the remote sensing images to be classified using an improved ResNet101 model and a ViT encoder; using the attribute features to semantically guide the visual features to obtain semantically guided visual features; and obtaining the final classification result based on the semantically guided visual features and the category × attribute probability matrix.

2. The zero-shot remote sensing image scene classification method based on frequency decoupling according to claim 1, characterized in that the improved ResNet101 model is a ResNet101 model with the fully connected layers removed.

3. The zero-shot remote sensing image scene classification method based on frequency decoupling according to claim 1, characterized in that using the improved ResNet101 model and the ViT encoder to extract visual features from the remote sensing images to be classified specifically includes: extracting initial visual features of the remote sensing images to be classified using a pre-trained improved ResNet101 model and a ViT encoder; introducing a frequency decoupling module based on a frequency attention mechanism inside the ViT encoder to perform a discrete wavelet transform on the initial visual features, decomposing them into low-frequency features and three sets of high-frequency features; reducing the dimensionality of the low-frequency features and flattening them into one-dimensional vectors; obtaining a low-frequency attention map from the one-dimensional vectors; upsampling the low-frequency attention map using bilinear interpolation to obtain a low-frequency attention map that includes background information; and removing the background features from the original visual features based on the low-frequency attention map to obtain the visual features.

4. The zero-shot remote sensing image scene classification method based on frequency decoupling according to claim 3, characterized in that removing background features from the original visual features based on the low-frequency attention map to obtain the visual features specifically includes: feeding the original visual features into the ReLU layer and Dropout layer to obtain the embedded features; extracting visual self-attention from the embedded features using the self-attention layer of the ViT encoder; and obtaining the visual features based on the visual self-attention and the low-frequency attention map.

5. The zero-shot remote sensing image scene classification method based on frequency decoupling according to claim 1, characterized in that using attribute features to semantically guide visual features to obtain semantically guided visual features specifically includes: inputting the attribute features and the visual features into the multi-head attention layer and multi-layer perceptron layer of the ViT decoder to calculate the semantically guided visual features, where A ∈ R^{N×D} is the attribute feature, the visual feature lies in R^{(HW)×(HW)}, the three projection matrices are learnable, and the remaining term is the scaling factor.

6. The zero-shot remote sensing image scene classification method based on frequency decoupling according to claim 1, characterized in that the method further includes: optimizing the model using the mean squared error between the probability matrix and the classification results as the loss function, and updating the model parameters by minimizing the loss function, where M is the total number of samples, Y_i is the probability value of the i-th sample, and P_i is the classification result for the i-th sample.

7. A zero-shot remote sensing image scene classification system based on frequency decoupling, characterized in that it comprises: an attribute feature extraction module, used to generate remote sensing attributes and attribute probabilities for each scene category using a large language model, to construct a category × attribute probability matrix, and to extract attribute features using the W2V model; a visual feature extraction module, used to extract visual features of the remote sensing images to be classified using an improved ResNet101 model and a ViT encoder; a semantic guidance module, used to semantically guide the visual features using the attribute features to obtain semantically guided visual features; and a classification module, used to obtain the final classification result based on the semantically guided visual features and the category × attribute probability matrix.