Dual-stream spatial-spectral fusion method and device based on intermediate loss

By introducing an intermediate loss mechanism into the spatial-spectral fusion network, and applying corresponding losses to the spatial and spectral feature extraction subnetworks respectively, the problem of difficulty in balancing spatial detail and spectral fidelity in existing technologies is solved, and high-quality spatial-spectral fusion results are achieved.

CN122289017A · Pending · Publication Date: 2026-06-26 · AEROSPACE INFORMATION RES INST CAS

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
AEROSPACE INFORMATION RES INST CAS
Filing Date
2026-03-24
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing deep learning-based self-supervised spatial-spectral fusion methods lack direct guidance in the feature extraction stage, causing spatial and spectral features to shift during deep propagation, making it difficult to simultaneously guarantee the spatial detail clarity and spectral fidelity of the fusion result.

Method used

A dual-stream spatial-spectral fusion method based on intermediate loss is adopted. The spatial-spectral fusion network contains a spatial feature extraction subnetwork and a spectral feature extraction subnetwork, to whose outputs a spatial similarity loss and a spectral consistency loss are applied respectively. Combined with a global consistency loss, these losses guide the consistency of spatial structure and spectral characteristics in the feature extraction stage.

Benefits of technology

It stably generates high-quality fused images with clearer spatial details and more accurate spectral information, avoiding feature conflicts during final fusion and improving the quality and stability of the fused images.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention provides a dual-stream spatial-spectral fusion method and apparatus based on intermediate loss, comprising: inputting a panchromatic image into a spatial feature extraction subnetwork of a spatial-spectral fusion network to obtain intermediate spatial features output by the spatial feature extraction subnetwork; inputting an upsampled multispectral image into a spectral feature extraction subnetwork of the spatial-spectral fusion network to obtain intermediate spectral features output by the spectral feature extraction subnetwork; fusing the intermediate spatial features and the intermediate spectral features and inputting the fused features into a decoder of the spatial-spectral fusion network to obtain a spatial-spectral fused image output by the decoder; wherein the loss function during the training phase of the spatial-spectral fusion network includes a spatial similarity loss between the intermediate spatial features and the panchromatic image, and a spectral consistency loss between the intermediate spectral features and the upsampled multispectral image. By introducing an intermediate loss mechanism during the training phase, this invention solves the technical problem that spatial detail and spectral fidelity are difficult to achieve simultaneously.

Description

Technical Field

[0001] This invention relates to the field of satellite remote sensing technology, and in particular to a dual-stream spatial-spectral fusion method and apparatus based on intermediate loss.

Background Technology

[0002] Spatial-spectral fusion of remote sensing images aims to combine the high spatial resolution of panchromatic images with the spectral information of multispectral images. Existing deep learning-based self-supervised spatial-spectral fusion methods typically construct a loss function at the final output of the network, using the original panchromatic and multispectral images as supervisory information to constrain the fusion result. However, this approach lacks direct guidance for the feature learning process of the intermediate layers of the network, easily introducing biases during the feature extraction stage.

[0003] To address the aforementioned issues, existing self-supervised methods construct a deep neural network, taking a panchromatic image and an upsampled multispectral image as input, and calculating the loss function only at the network's final output layer. The network is trained by minimizing this loss. This end-to-end training approach relies on the network's ability to automatically learn complex nonlinear mappings from input to output, aiming to generate a fused image that is consistent with the input source in both spatial structure and spectral information.

[0004] However, the aforementioned self-supervised methods that only apply constraints to the output layer are prone to causing spatial and spectral features to shift during deep propagation, leading to instability in the network training process and making it difficult to simultaneously guarantee the spatial detail clarity and spectral fidelity of the final fusion result. Therefore, how to balance spatial detail and spectral fidelity during spatial-spectral fusion has become an urgent problem to be solved in this field.

Summary of the Invention

[0005] This invention provides a dual-stream spatial-spectral fusion method and apparatus based on intermediate loss, which solves the technical problem of how to balance spatial detail and spectral fidelity during spatial-spectral fusion.

[0006] This invention provides a dual-stream spatial-spectral fusion method based on intermediate loss, comprising: The panchromatic image is input into the spatial feature extraction subnetwork of the spatial-spectral fusion network to obtain the intermediate spatial features output by the spatial feature extraction subnetwork. The upsampled multispectral image is input into the spectral feature extraction subnetwork in the spatial-spectral fusion network to obtain the intermediate spectral features output by the spectral feature extraction subnetwork; The intermediate spatial features and the intermediate spectral features are fused and then input into the decoder in the spatial-spectral fusion network to obtain the spatial-spectral fusion image output by the decoder; The loss function during the training phase of the spatial-spectral fusion network includes the spatial similarity loss between the intermediate spatial features and the panchromatic image, and the spectral consistency loss between the intermediate spectral features and the upsampled multispectral image.

[0007] According to the present invention, a dual-stream spatial-spectral fusion method based on intermediate loss is provided, wherein the spatial feature extraction sub-network includes a first encoder and multiple cascaded spatial feature extraction modules; The first encoder is used to encode the panchromatic image; The first cascaded spatial feature extraction module is used to extract spatial features from the output of the first encoder; The second and subsequent cascaded spatial feature extraction modules are used to extract spatial features from the output of the previous spatial feature extraction module.

[0008] According to the present invention, a dual-stream spatial-spectral fusion method based on intermediate loss is provided, wherein the spatial feature extraction module includes a cascaded first deep feature extraction module and a spatial attention module; the first deep feature extraction module is used to extract deep spatial features from the input; the spatial attention module is used to enhance the deep spatial features.

[0009] According to the present invention, a dual-stream spatial-spectral fusion method based on intermediate loss is provided, wherein the spectral feature extraction sub-network includes a second encoder and multiple cascaded spectral feature extraction modules; The second encoder is used to encode the upsampled multispectral image; The first cascaded spectral feature extraction module is used to extract spectral features from the output of the second encoder; The second and subsequent cascaded spectral feature extraction modules are used to extract spectral features from the fusion result of the output of the corresponding spatial feature extraction module and the output of the previous spectral feature extraction module.

[0010] According to the present invention, a dual-stream spatial-spectral fusion method based on intermediate loss is provided, wherein the spectral feature extraction module includes a cascaded second deep feature extraction module and a spectral attention module; the second deep feature extraction module is used to extract deep spectral features from the input; the spectral attention module is used to enhance the deep spectral features.

[0011] According to the dual-stream spatial-spectral fusion method based on intermediate loss provided by the present invention, after obtaining the intermediate spatial features and the intermediate spectral features, the method further includes: The spatial similarity loss is determined based on the intermediate spatial features and the panchromatic image; The spectral consistency loss is determined based on the intermediate spectral features and the upsampled multispectral image; After fusing the intermediate spatial features and the intermediate spectral features and inputting them into the decoder in the spatial-spectral fusion network to obtain the spatial-spectral fusion image output by the decoder, the method further includes: The global consistency loss of the spatial-spectral fusion network is determined based on the panchromatic image, the upsampled multispectral image, and the spatial-spectral fusion image. The total loss of the spatial-spectral fusion network is determined based on the spatial similarity loss, the spectral consistency loss, and the global consistency loss. The parameters of the spatial spectrum fusion network are updated based on the total loss.
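The combination of the three losses in [0011] can be sketched as below. This is a minimal illustration only: the text says the total loss is determined from the three terms but does not give the combination rule, so the weighted sum and the weight names `w_spatial`, `w_spectral` and `w_global` are assumptions.

```python
def total_loss(l_spatial, l_spectral, l_global,
               w_spatial=1.0, w_spectral=1.0, w_global=1.0):
    """Combine the two intermediate losses with the global consistency loss.

    Assumption: a weighted sum; the weights are hypothetical and not taken
    from the patent text.
    """
    return w_spatial * l_spatial + w_spectral * l_spectral + w_global * l_global
```

In a training loop, the value returned here would be the quantity that is minimized when the network parameters are updated.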

[0012] According to the present invention, a dual-stream spatial-spectral fusion method based on intermediate loss is provided, wherein determining the spatial similarity loss based on the intermediate spatial features and the panchromatic image includes: determining the structural similarity between the intermediate spatial features and the panchromatic image; and subtracting the structural similarity from 1 to obtain the spatial similarity loss.
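A minimal sketch of the "1 minus structural similarity" loss follows. It assumes a simplified SSIM computed from global image statistics (a single window covering the whole image) rather than the usual sliding-window form, and it assumes the intermediate spatial features have already been reduced to a single channel with the same size as the panchromatic image. The function names and the constants `0.01` and `0.03` (the commonly used SSIM defaults) are not taken from the patent text.

```python
import numpy as np

def ssim_global(x, y, data_range=1.0):
    """SSIM from whole-image statistics (simplification: no sliding window)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

def spatial_similarity_loss(mid_pan, pan):
    """Loss = 1 - SSIM; mid_pan and pan are (H, W) arrays (assumption)."""
    return 1.0 - ssim_global(mid_pan, pan)
```

The loss is zero when the two inputs are identical and grows as their structure diverges.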

[0013] According to the present invention, a dual-stream spatial-spectral fusion method based on intermediate loss is provided, wherein determining the spectral consistency loss based on the intermediate spectral features and the upsampled multispectral image includes: The spectral angle mapping between the intermediate spectral features and the upsampled multispectral image is determined to obtain the spectral consistency loss.
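The spectral angle between two images can be sketched as below. The sketch assumes both inputs are arrays of shape (C, H, W) with the same number of bands C, which implies that the intermediate spectral features have been mapped to C channels; that mapping and the averaging over pixels are assumptions, and the function name is hypothetical.

```python
import numpy as np

def spectral_angle_loss(mid_ms, ms_up, eps=1e-8):
    """Mean per-pixel spectral angle in radians between two (C, H, W) arrays."""
    dot = (mid_ms * ms_up).sum(axis=0)
    norms = np.linalg.norm(mid_ms, axis=0) * np.linalg.norm(ms_up, axis=0)
    # eps avoids division by zero; clipping keeps arccos within its domain.
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    return float(np.arccos(cos).mean())
```

Because the angle depends only on the direction of each pixel's spectral vector, scaling one input by a positive constant leaves the loss essentially unchanged.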

[0014] According to the present invention, a dual-stream spatial-spectral fusion method based on intermediate loss is provided, wherein determining the global consistency loss of the spatial-spectral fusion network based on the panchromatic image, the upsampled multispectral image, and the spatial-spectral fusion image includes: converting the upsampled multispectral image into a grayscale image; determining the first residual between the upsampled multispectral image and the grayscale image, and the second residual between the panchromatic image and the spatial-spectral fusion image; and determining the KL divergence between the first residual and the second residual to obtain the global consistency loss.
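The three steps in [0014] can be sketched as follows. Several details are not specified in the text and are assumptions here: the grayscale image is taken as the band average, the second residual is taken as fused minus panchromatic so that both residuals share a sign convention, and each residual is turned into a probability distribution with a softmax before the KL divergence is computed. The function names are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a flat array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def global_consistency_loss(pan, ms_up, fused, eps=1e-12):
    """pan: (H, W); ms_up and fused: (C, H, W)."""
    gray = ms_up.mean(axis=0)      # assumption: grayscale = band average
    r1 = ms_up - gray[None]        # first residual: upsampled MS vs. grayscale
    r2 = fused - pan[None]         # second residual (sign is an assumption)
    p = softmax(r1.ravel())        # assumption: softmax gives distributions
    q = softmax(r2.ravel())
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))
```

Under these assumptions the loss is zero when the fused image relates to the panchromatic image exactly as the upsampled multispectral image relates to its grayscale version.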

[0015] The present invention also provides a dual-stream spatial-spectral fusion device based on intermediate loss, comprising: The first input module is used to input the panchromatic image into the spatial feature extraction subnetwork in the spatial-spectral fusion network to obtain the intermediate spatial features output by the spatial feature extraction subnetwork. The second input module is used to input the upsampled multispectral image into the spectral feature extraction subnetwork in the spatial-spectral fusion network to obtain the intermediate spectral features output by the spectral feature extraction subnetwork. The third input module is used to fuse the intermediate spatial features and the intermediate spectral features and then input them into the decoder in the spatial-spectral fusion network to obtain the spatial-spectral fusion image output by the decoder. The loss function during the training phase of the spatial-spectral fusion network includes the spatial similarity loss between the intermediate spatial features and the panchromatic image, and the spectral consistency loss between the intermediate spectral features and the upsampled multispectral image.

[0016] The present invention provides a dual-stream spatial-spectral fusion method and apparatus based on intermediate loss. An intermediate loss mechanism is introduced during the training phase: spatial similarity loss is applied to the output of the spatial feature extraction sub-network, and spectral consistency loss is applied to the output of the spectral feature extraction sub-network. This targeted intermediate constraint provides clear and independent guidance for the extraction processes of spatial and spectral features. It forces the network to simultaneously optimize the accuracy of spatial structure and the consistency of spectral characteristics during the feature extraction stage. This solves the technical problem in existing technologies where the lack of intermediate guidance makes it difficult to simultaneously achieve both spatial detail and spectral fidelity. By ensuring the quality of both aspects separately during the process, it avoids the predicament of discovering conflicts only during final fusion, thereby enabling the stable generation of high-quality fused images with clearer spatial details and higher spectral fidelity.

Attached Figure Description

[0017] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0018] Figure 1 is a flowchart of the dual-stream spatial-spectral fusion method based on intermediate loss provided by the present invention.

[0019] Figure 2 is a schematic diagram of the principle of the dual-stream spatial-spectral fusion method based on intermediate loss provided by the present invention.

[0020] Figure 3 is a schematic diagram of the structure of the MDREM provided by the present invention.

[0021] Figure 4 is a schematic diagram of the spatial attention module provided by the present invention.

[0022] Figure 5 is a schematic diagram of the spectral attention module provided by the present invention.

[0023] Figure 6 is a schematic diagram of the structure of the dual-stream spatial-spectral fusion device based on intermediate loss provided by the present invention.

[0024] Figure 7 is a schematic diagram of the structure of the electronic device provided by the present invention.

Detailed Implementation

[0025] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0026] Spatial-spectral fusion of remote sensing images aims to complementarily integrate the high spatial resolution information contained in panchromatic (PAN) images with the spectral information contained in multispectral (MS) images, thereby generating high-resolution multispectral (HRMS) images that possess both spatial detail and spectral fidelity. After years of development, spatial-spectral fusion methods can be divided into two categories: traditional methods and deep learning methods.

[0027] Traditional techniques are further divided into three main categories: component substitution (CS), multiresolution analysis (MRA), and variational optimization (VO). The CS method separates the MS image into spatial and spectral components, then replaces the spatial component with spatial details extracted from the PAN image to generate the HRMS image. The MRA method performs multi-scale decomposition on the MS and PAN images, fuses them at each scale, and finally reconstructs the HRMS image. The VO method typically transforms the spatial-spectral fusion problem into an optimization model based on variational principles. Based on the observation model and prior assumptions, the VO method constructs an energy functional and obtains the HRMS image through optimization. This type of method has good interpretability and can theoretically clarify the mechanism of spatial-spectral fusion.

[0028] However, most of the methods mentioned above rest on linear assumptions and therefore have certain limitations. Deep learning, owing to its nonlinearity and deep feature learning capabilities, has been widely applied to spatial-spectral fusion in recent years. Currently, the main structures used in deep learning-based spatial-spectral fusion networks include residual connections, dense connections, generative adversarial networks (GANs), Transformers, diffusion models, and Mamba. Most of these networks rely on supervised training using simulated degraded datasets generated according to the Wald protocol. To avoid this dependence on simulated data, PanGAN and some reference-free spatial-spectral fusion methods have been proposed, which use the original MS and PAN images directly as the supervisory signal for the network.

[0029] Furthermore, some remote sensing image fusion methods based on semi-supervised learning employ low-scale supervised learning and original-scale unsupervised learning. During supervised learning, low-resolution multispectral and panchromatic images are input into the fusion network to obtain a low-resolution fused image, and the low-scale loss between the reference image and the fusion result is calculated. Since there is no high-resolution reference image in the high-resolution region, this method establishes spectral and spatial degradation networks to constrain the spatial and spectral aspects of the multispectral image, and uses these networks to obtain the original-scale loss. This method employs supervised learning in the low-resolution region and unsupervised learning in the high-resolution region, training the network through semi-supervised learning to ensure consistent performance between the low-resolution and high-resolution images.

[0030] Existing spatial-spectral fusion methods often rely on approximate linear models, which struggle to characterize the nonlinear spatial-spectral coupling between panchromatic (PAN) and multispectral (MS) images under complex terrain conditions. This leads to issues such as reduced spatial detail or decreased spectral consistency. Supervised deep learning methods typically rely on downscaled simulated data constructed using the Wald protocol for training. However, the downscaling process fails to accurately reflect actual imaging conditions, limiting the model's generalization ability on real-scale remote sensing data. While self-supervised spatial-spectral fusion methods have emerged in recent years, eliminating the dependence on real high-resolution multispectral reference images, their loss functions often only apply to the fusion result layer, lacking targeted guidance for the intermediate feature learning process. This can easily cause spatial and spectral features to shift during deep propagation, affecting fusion stability and quality.

[0031] The dual-stream spatial-spectral fusion method and apparatus based on intermediate loss provided by the present invention are described below with reference to Figures 1 to 7.

[0032] Figure 1 is a flowchart of the dual-stream spatial-spectral fusion method based on intermediate loss provided by the present invention. As shown in Figure 1, the method includes, but is not limited to, steps S1, S2 and S3.

[0033] Step S1: Input the panchromatic image into the spatial feature extraction subnetwork of the spatial-spectral fusion network to obtain the intermediate spatial features output by the spatial feature extraction subnetwork.

[0034] A panchromatic image is a single-band grayscale image with high spatial resolution, capable of clearly displaying details such as the geometry, edges, and textures of ground features.

[0035] As shown in Figure 2, the spatial-spectral fusion network of the present invention adopts a dual-stream parallel architecture consisting mainly of two parts: a spatial stream and a spectral stream. The spatial feature extraction subnetwork is the spatial stream shown in Figure 2 and is used to extract structural details and edge textures from the panchromatic (PAN) image. The intermediate spatial features (MidPAN) are not the final fusion result but a high-dimensional feature map generated partway through the network, which encodes the key spatial structure and texture details of the original panchromatic image.

[0036] Step S1 can separate and refine the spatial information of the panchromatic image.

[0037] Step S2: Input the upsampled multispectral image into the spectral feature extraction subnetwork in the spatial-spectral fusion network to obtain the intermediate spectral features output by the spectral feature extraction subnetwork.

[0038] Upsampled multispectral images are multispectral images that have been spatially magnified and contain multiple spectral bands (such as red, green, blue, and near-infrared). They retain rich spectral reflectance information of ground objects, but their spatial details are relatively blurred.

[0039] Upsampling enlarges the original low-spatial-resolution multispectral image to match the spatial resolution of the panchromatic image. Upsampling can be achieved using various image interpolation algorithms, such as nearest-neighbor interpolation, bilinear interpolation, or the bicubic interpolation shown in Figure 2.
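Two of the interpolation options named in [0039] can be sketched in NumPy as follows. This is an illustration under stated assumptions: the multispectral image is a (C, H, W) array, the scale factor is an integer, and the bilinear variant uses pixel-centre alignment with edge clamping. The function names are hypothetical, and bicubic interpolation is not shown.

```python
import numpy as np

def upsample_nearest(ms, scale):
    """(C, H, W) -> (C, H*scale, W*scale) by repeating each pixel."""
    return ms.repeat(scale, axis=1).repeat(scale, axis=2)

def upsample_bilinear(ms, scale):
    """Bilinear interpolation with pixel-centre alignment and edge clamping."""
    _, h, w = ms.shape
    # Source coordinates of each output pixel centre, clamped to the image.
    ys = np.clip((np.arange(h * scale) + 0.5) / scale - 0.5, 0, h - 1)
    xs = np.clip((np.arange(w * scale) + 0.5) / scale - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]
    wx = (xs - x0)[None, None, :]
    top = ms[:, y0][:, :, x0] * (1 - wx) + ms[:, y0][:, :, x1] * wx
    bottom = ms[:, y1][:, :, x0] * (1 - wx) + ms[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bottom * wy
```

With a scale factor equal to the PAN-to-MS resolution ratio, either function returns an array whose height and width match the panchromatic image.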

[0040] The spectral feature extraction subnetwork is the spectral stream shown in Figure 2 and is used to extract multispectral information from the upsampled multispectral image. The intermediate spectral features (MidMS) are a high-dimensional feature representation that encodes the original spectral information.

[0041] Step S2 can effectively extract and preserve the spectral characteristics of ground features, avoiding spectral distortion during the fusion process.

[0042] Step S3: After fusing the intermediate spatial features and intermediate spectral features, input them into the decoder in the spatial-spectral fusion network to obtain the spatial-spectral fusion image output by the decoder.

[0043] The loss function in the training phase of the spatial-spectral fusion network includes spatial similarity loss between intermediate spatial features and panchromatic images, and spectral consistency loss between intermediate spectral features and upsampled multispectral images.

[0044] The decoder integrates the intermediate spatial features (MidPAN) from the spatial stream and the intermediate spectral features (MidMS) from the spectral stream. It first performs feature alignment and fusion, then maps the number of channels to the C output bands through a convolutional layer, and applies Sigmoid activation to reconstruct an HRMS image (i.e., the spatial-spectral fusion image) that possesses both high spatial resolution and multispectral information. This reconstruction process can be described as follows:

$$\mathrm{HRMS} = \sigma\big(\mathrm{Conv}(\mathrm{Fuse}(\mathrm{MidPAN}, \mathrm{MidMS}))\big)$$

where $\mathrm{Fuse}(\cdot)$ denotes the feature alignment and fusion, $\mathrm{Conv}(\cdot)$ represents convolution, and $\sigma(\cdot)$ represents Sigmoid activation.
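The decoder step can be sketched as below. The text does not state how the two feature maps are fused or what kernel size the output convolution uses, so channel concatenation and a 1×1 convolution are assumptions made for illustration; `decode`, `weight` and `bias` are hypothetical names.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode(mid_pan, mid_ms, weight, bias):
    """mid_pan, mid_ms: (K, H, W); weight: (C, 2K); bias: (C,) -> (C, H, W)."""
    # Assumption: fusion is concatenation along the channel axis.
    fused = np.concatenate([mid_pan, mid_ms], axis=0)
    # Assumption: a 1x1 convolution maps 2K feature channels to C bands.
    out = np.einsum("ck,khw->chw", weight, fused) + bias[:, None, None]
    # Sigmoid keeps every output band within [0, 1].
    return sigmoid(out)
```

The Sigmoid at the end implies that the reconstructed bands are expressed on a normalized intensity scale.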

[0045] The overall process of steps S1-S3 can be described as follows:

$$\hat{H} = f(P, \tilde{M}; \Omega)$$

where $P$ is the panchromatic image; $\tilde{M}$ is the upsampled multispectral image; $\Omega$ represents the learnable parameters of the network, such as the weights and bias terms of the convolution kernels; $f(\cdot)$ represents the spatial-spectral fusion network; and $\hat{H}$ is the spatial-spectral fusion image.

[0046] This invention uses a large number of images to optimize network parameters during the training phase, with the training objective being to minimize the value of the loss function.

[0047] Spatial similarity loss is used to constrain the spatial feature extraction sub-network, ensuring that its output intermediate spatial features are structurally as similar as possible to the original panchromatic image. Spectral consistency loss is used to constrain the spectral feature extraction sub-network, ensuring that its output intermediate spectral features are as consistent as possible with the original upsampled multispectral image in spectral properties.

[0048] The loss function during the training phase of the spatial-spectral fusion network includes spatial similarity loss between intermediate spatial features and the panchromatic image, and spectral consistency loss between intermediate spectral features and the upsampled multispectral image. This indicates that the present invention provides guidance not only at the final output but also during the feature extraction stage within the network. This intermediate loss mechanism ensures that the spatial stream focuses on extracting accurate spatial structure, while the spectral stream focuses on preserving the original spectral information as much as possible, which counteracts feature shift and confusion during deep network propagation.

[0049] As described above, this invention introduces an intermediate loss mechanism during the training phase, specifically applying spatial similarity loss to the output of the spatial feature extraction sub-network and spectral consistency loss to the output of the spectral feature extraction sub-network. This targeted intermediate constraint provides clear and independent guidance for the extraction processes of spatial and spectral features, forcing the network to simultaneously optimize the accuracy of spatial structure and the consistency of spectral characteristics during the feature extraction stage. This solves the technical problem in existing technologies where spatial detail and spectral fidelity are difficult to balance due to the lack of intermediate guidance. By ensuring the quality of both during the process, it avoids the predicament of discovering conflicts between the two only during final fusion, thereby enabling the stable generation of high-quality fused images with clearer spatial details and higher spectral fidelity.

[0050] In one embodiment, the spatial feature extraction subnetwork of the present invention may include a first encoder and a plurality of cascaded spatial feature extraction modules; The first encoder is used to encode the panchromatic image; The first cascaded spatial feature extraction module is used to extract spatial features from the output of the first encoder; The second and subsequent cascaded spatial feature extraction modules are used to extract spatial features from the output of the previous spatial feature extraction module.

[0051] In Figure 2, Encoder is the encoder and SPAEM is the spatial feature extraction module. The input of the first SPAEM is the output of the first encoder, the input of the second and subsequent SPAEMs is the output of the previous SPAEM, and the output of the last SPAEM is the intermediate spatial feature (MidPAN).

[0052] Because the spatial stream and spectral stream inputs have different numbers of channels (PAN has a single channel, and the upsampled MS has C channels), this invention sets up independent encoders in the spatial stream and the spectral stream respectively.

[0053] Each encoder consists of a single convolutional layer and PReLU activation, projecting the input into a high-dimensional feature space. In practical applications, the number of encoder output channels can be set to 32. When the input is PAN or upsampled MS, the encoder outputs PAN features or upsampled MS features respectively. The encoding process can be described as follows:

$$F^{0} = \mathrm{Enc}(X)$$

where $X$ indicates the input (i.e., PAN or upsampled MS); $\mathrm{Enc}(\cdot)$ represents the encoder function; and $F^{0}$ indicates the output of the encoder. The output of the first encoder is denoted $F^{0}_{\mathrm{PAN}}$.
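A single-convolution encoder with PReLU can be sketched in NumPy as below. The 32 output channels come from the text; the 3×3 kernel size, zero padding, and the PReLU slope of 0.25 are assumptions, and the function names are hypothetical.

```python
import numpy as np

def conv2d_same(x, weight, bias):
    """x: (Cin, H, W); weight: (Cout, Cin, k, k); zero padding keeps H and W."""
    cout, _, k, _ = weight.shape
    p = k // 2
    h, w = x.shape[1:]
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((cout, h, w))
    # Accumulate one kernel tap at a time (cross-correlation, as in deep learning).
    for i in range(k):
        for j in range(k):
            out += np.einsum("oc,chw->ohw", weight[:, :, i, j],
                             xp[:, i:i + h, j:j + w])
    return out + bias[:, None, None]

def prelu(x, a=0.25):
    """PReLU with a fixed slope a for negative inputs (learnable in practice)."""
    return np.where(x >= 0, x, a * x)

def encode(x, weight, bias):
    """Single convolution followed by PReLU."""
    return prelu(conv2d_same(x, weight, bias))
```

For a PAN input of shape (1, H, W) and a weight tensor of shape (32, 1, 3, 3), `encode` returns a (32, H, W) feature map; the second encoder would use a weight of shape (32, C, 3, 3) for the C-band upsampled MS input.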

[0054] This invention effectively maps the original panchromatic image to a more information-rich feature space through an encoder. Multiple cascaded spatial feature extraction modules progressively refine spatial features from low-level edge information to high-level structural information. This design enables the network to capture spatial details in the panchromatic image more comprehensively and precisely, providing a superior feature foundation for generating high-quality fused images with clear edges and rich textures.

[0055] In one embodiment, the spatial feature extraction module of the present invention may include a cascaded first depth feature extraction module and a spatial attention module; The first deep feature extraction module is used to extract deep spatial features from the input; The spatial attention module is used to enhance deep spatial features.

[0056] The first deep feature extraction module of this invention can be a multi-depth residual feature extraction module (MDREM), and each SPAEM consists of an MDREM and a spatial attention module cascaded together. The MDREM extracts deep spatial features from the input, and the spatial attention module further highlights key spatial structural regions. The n-th SPAEM is computed as follows:

$$F^{n}_{\mathrm{PAN}} = \mathrm{SAM}\big(\mathrm{MDREM}(F^{n-1}_{\mathrm{PAN}})\big), \quad n = 1, \dots, N$$

where $F^{n-1}_{\mathrm{PAN}}$ is the output of the previous SPAEM (for $n = 1$, the output of the first encoder, $F^{0}_{\mathrm{PAN}}$); $\mathrm{SAM}(\cdot)$ indicates the spatial attention module; $F^{n}_{\mathrm{PAN}}$ is the output of the n-th SPAEM; and $N$ is the number of SPAEMs. $F^{N}_{\mathrm{PAN}}$ is the intermediate spatial feature (MidPAN).
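The cascade in [0056] amounts to repeated function composition, which the following sketch makes explicit. The MDREM and spatial attention module are passed in as plain callables, so the sketch shows only the data flow and makes no claim about their internals; `run_spatial_stream` is a hypothetical name.

```python
def run_spatial_stream(encoded_pan, spaems):
    """Apply N cascaded SPAEMs to the first encoder's output.

    spaems: list of (mdrem, attention) callables, one pair per SPAEM.
    Returns the output of the last SPAEM, i.e. MidPAN.
    """
    feat = encoded_pan
    for mdrem, attention in spaems:
        # Each SPAEM: deep feature extraction, then spatial attention.
        feat = attention(mdrem(feat))
    return feat
```

For example, with three identical stand-in modules `(lambda x: x + 1, lambda x: x * 2)` and an input of 0, the stream evaluates to 2, then 6, then 14.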

[0057] As shown in Figure 3, MDREM can be constructed by cascading multiple Deep Residual Layers (DRLs) to extract hierarchical deep features. MDREM refines features layer by layer by stacking multiple residual blocks, aiming to maintain high spatial detail and spectral fidelity during the fusion process. The stacking of multiple residual blocks helps the network learn multi-level representations, while the residual structure effectively alleviates the gradient vanishing problem and enhances feature representation by learning the residual between the input and the expected output, thus simultaneously characterizing fine-grained details and high-level semantic patterns. The overall feature extraction process of MDREM can be represented as:

$$F_{\mathrm{out}} = \mathrm{MDREM}(F_{\mathrm{in}})$$

where $F_{\mathrm{in}}$ represents the input to MDREM, $F_{\mathrm{out}}$ represents the deep features extracted by MDREM, and $\mathrm{MDREM}(\cdot)$ denotes a feature extraction process consisting of multiple deep residual layers. Each deep residual layer consists of two 3×3 convolutional layers, two batch normalization (BN) layers, and two PReLU activation functions. The output of the n-th deep residual layer can be expressed as:

$$F_{n} = F_{n-1} + \mathcal{R}(F_{n-1}; W_{n}, b_{n})$$

where $W_{n}$ represents the weight parameters of the n-th layer, $F_{n-1}$ is the output of the previous deep residual layer, $b_{n}$ is the bias parameter, and $\mathcal{R}(\cdot)$ denotes the convolution, batch normalization, and PReLU operations of the layer.
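A deep residual layer with the stated ingredients (two 3×3 convolutions, two batch normalizations, two PReLU activations) can be sketched as below. The ordering of the operations and the position of the skip connection are assumptions, since the text lists the components without fixing their order. Batch normalization is simplified to per-channel standardization of a single feature map without learnable scale and shift, and all function names are hypothetical.

```python
import numpy as np

def conv3x3(x, weight, bias):
    """x: (C, H, W); weight: (Cout, C, 3, 3); zero padding keeps H and W."""
    h, w = x.shape[1:]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((weight.shape[0], h, w))
    for i in range(3):
        for j in range(3):
            out += np.einsum("oc,chw->ohw", weight[:, :, i, j],
                             xp[:, i:i + h, j:j + w])
    return out + bias[:, None, None]

def batch_norm(x, eps=1e-5):
    """Per-channel standardization (simplified stand-in for batch norm)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def prelu(x, a=0.25):
    return np.where(x >= 0, x, a * x)

def deep_residual_layer(x, w1, b1, w2, b2):
    """Assumed ordering: (conv -> BN -> PReLU) twice, then add the input."""
    y = prelu(batch_norm(conv3x3(x, w1, b1)))
    y = prelu(batch_norm(conv3x3(y, w2, b2)))
    return x + y

def mdrem(x, layers):
    """layers: list of (w1, b1, w2, b2) tuples, one per deep residual layer."""
    for params in layers:
        x = deep_residual_layer(x, *params)
    return x
```

Because each layer adds its learned branch to the input, a layer whose branch outputs zero simply passes its input through, which is the property that eases gradient flow in deep stacks.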

[0058] The Spatial Attention Module (SAM) adaptively enhances spatial information in features by highlighting key spatial structural regions and suppressing redundant responses, thereby improving spatial representation capabilities. As shown in Figure 4, the output F of MDREM is subjected to Global Max Pooling (GMP) and Global Average Pooling (GAP) to characterize different spatial response distributions. Then, the two are concatenated along the channel dimension and their complementary spatial information is fused via a 3×3 convolution. Finally, a spatial attention map S is generated using a Sigmoid activation function. The final spatially enhanced feature is represented as:

$$S = \sigma\big(\mathrm{Conv}_{3\times3}([\mathrm{GMP}(F); \mathrm{GAP}(F)])\big)$$

$$F_{\mathrm{SAM}} = S \otimes F$$

where $[\cdot\,;\cdot]$ indicates the concatenation operation, $\mathrm{Conv}_{3\times3}(\cdot)$ represents convolution, $\sigma(\cdot)$ represents the Sigmoid activation function, $\otimes$ represents element-wise multiplication, and $F_{\mathrm{SAM}}$ is the output of the spatial attention module. The spatial attention module can effectively enhance the response to significant spatial details such as edges and textures.
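The spatial attention computation can be sketched in NumPy as follows. One interpretive assumption is made: to obtain a spatial map S with one value per pixel, the max and average pooling are taken across the channel axis, as is common in spatial attention designs. The text calls them global pooling without naming the axis. `spatial_attention` and its parameters are hypothetical names.

```python
import numpy as np

def spatial_attention(feat, weight, bias=0.0):
    """feat: (K, H, W); weight: (2, 3, 3) kernel producing a one-channel map."""
    # Assumption: pooling across channels gives two (H, W) descriptor maps.
    pooled = np.stack([feat.max(axis=0), feat.mean(axis=0)])
    padded = np.pad(pooled, ((0, 0), (1, 1), (1, 1)))
    h, w = feat.shape[1:]
    logits = np.full((h, w), float(bias))
    # 3x3 convolution over the two concatenated maps, zero padded.
    for i in range(3):
        for j in range(3):
            tap = weight[:, i, j, None, None] * padded[:, i:i + h, j:j + w]
            logits += tap.sum(axis=0)
    s = 1.0 / (1.0 + np.exp(-logits))   # attention map S, values in [0, 1]
    return s[None] * feat               # element-wise reweighting of F
```

Since every entry of S lies between 0 and 1, the module can only attenuate features; regions with values of S near 1 are preserved while the rest are suppressed.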

[0059] This invention constructs a spatial feature extraction module that possesses both depth and focus by combining deep feature extraction with an attention mechanism (spatial attention module). MDREM ensures the network has sufficient capacity to learn complex, multi-layered spatial features, while the spatial attention module enables the network to intelligently allocate computational resources, prioritizing the processing of the most information-rich spatial regions. This combination ensures that the extracted spatial features are not only rich in layers but also prominent in focus and have a high signal-to-noise ratio, further enhancing the feature extraction capability of the spatial feature extraction module and resulting in clearer and more accurate spatial details in the final fused image.

[0060] In one embodiment, the spectral feature extraction subnetwork of the present invention may include a second encoder and a plurality of cascaded spectral feature extraction modules; The second encoder is used to encode the upsampled multispectral image; The first cascaded spectral feature extraction module is used to extract spectral features from the output of the second encoder; The second and subsequent cascaded spectral feature extraction modules are used to extract spectral features from the fusion result of the output of the corresponding spatial feature extraction module and the output of the previous spectral feature extraction module.

[0061] The principle of the encoder has been introduced above; the second encoder produces the encoded representation of the upsampled multispectral image. The number of spectral feature extraction modules is the same as the number of spatial feature extraction modules. SPEEM in Figure 2 is the spectral feature extraction module. The input of the first SPEEM is the output of the second encoder. The input of the second and each subsequent SPEEM is the fusion result of the output of the corresponding spatial feature extraction module and the output of the previous SPEEM. The output of the last SPEEM is the intermediate spectral features (MidMS).

[0062] This invention upgrades the originally parallel two-stream network into an interactive network architecture by progressively introducing spatial features into the spectral feature extraction sub-network. This design breaks the complete isolation between spatial and spectral information during extraction, allowing spectral feature extraction to be guided by spatial context. This enables more intelligent processing of spectral information in spatially complex regions, effectively reducing spectral distortion and spatial blur. This cross-stream information fusion mechanism helps extract spectral features highly corresponding to spatial structure, providing crucial support for generating fused images with higher spectral fidelity.
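By way of illustration only, the cross-stream data flow might be sketched as follows. The sketch assumes that the nth spectral module (n ≥ 2) fuses the output of the (n-1)th spatial module with the output of the (n-1)th spectral module; the text says only "corresponding" spatial module, so this pairing, the fusion operator `fuse`, and the names `run_dual_stream` and `mid_pan` are hypothetical.

```python
import numpy as np

def run_dual_stream(pan_encoded, ms_encoded, spatial_modules, spectral_modules, fuse):
    """Run both feature streams and return (intermediate spatial, intermediate spectral).

    spatial_modules / spectral_modules: equal-length lists of callables.
    fuse: callable combining a spatial feature and a spectral feature.
    """
    if not spatial_modules or len(spatial_modules) != len(spectral_modules):
        raise ValueError("both streams need the same, non-zero number of modules")

    # Spatial stream: a plain cascade, keeping every stage output for reuse.
    spatial_outputs = []
    x = pan_encoded
    for module in spatial_modules:
        x = module(x)
        spatial_outputs.append(x)
    mid_pan = x

    # Spectral stream: the first module sees only the encoder output;
    # later modules see spatial features fused with the previous spectral output.
    y = spectral_modules[0](ms_encoded)
    for n in range(1, len(spectral_modules)):
        y = spectral_modules[n](fuse(spatial_outputs[n - 1], y))  # assumed pairing
    mid_ms = y
    return mid_pan, mid_ms
```

With toy modules such as `lambda v: v + 1` for the spatial stream, `lambda v: 2 * v` for the spectral stream, and addition as `fuse`, the function can be traced by hand to check that spatial features enter the spectral stream from the second module onward.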

[0063] In one embodiment, the spectral feature extraction module of the present invention may include a cascaded second deep feature extraction module and a spectral attention module; The second deep feature extraction module is used to extract deep spectral features from the input; The spectral attention module is used to enhance deep spectral features.

[0064] The second deep feature extraction module of the present invention can also be a multiple residual deep feature extraction module (MDREM), in which each SPEEM is composed of an MDREM and a spectral attention module cascaded together. The MDREM is used to extract deep spectral characterization from the input, and the spectral attention module is used to recalibrate the channel response to maintain spectral consistency.

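[0065] The calculation form when n = 1 (the first SPEEM) is as follows: $F^{spe}_{n} = \mathrm{SpeAM}(\mathrm{MDREM}(F^{spe}_{n-1}))$, where $F^{spe}_{n-1}$ is the output of the previous SPEEM, which for n = 1 is the output $F^{spe}_{0}$ of the second encoder; $\mathrm{SpeAM}(\cdot)$ represents the spectral attention module; and $F^{spe}_{n}$ is the output of the nth SPEEM.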

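[0066] The calculation form when n > 1 (the nth SPEEM) is as follows: $F^{spe}_{n} = \mathrm{SpeAM}(\mathrm{MDREM}(\mathrm{Fuse}(\tilde{F}^{spa}_{n}, F^{spe}_{n-1})))$, $n = 2, \dots, N$, where N is the number of SPEEMs, $\tilde{F}^{spa}_{n}$ is the output of the spatial feature extraction module corresponding to the nth SPEEM, and $F^{spe}_{n-1}$ is the output of the previous SPEEM. Then $\mathrm{MidMS} = F^{spe}_{N}$, where MidMS denotes the intermediate spectral features. The spectral attention module is used to recalibrate feature channels, thereby highlighting informative bands, suppressing redundant channels, and enhancing the expression of cross-band correlation. As shown in Figure 5, the output F of MDREM is first subjected to global average pooling (GAP), then to two stages of 1×1 convolution and a Sigmoid activation function to generate channel weights Q. The final spectrally enhanced feature is represented as follows: $Q = \sigma(\mathrm{Conv}_{1\times1}(\mathrm{Conv}_{1\times1}(\mathrm{GAP}(F))))$ and $F_{SpeA} = Q \otimes F$, where $F_{SpeA}$ is the output of the spectral attention module.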
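By way of illustration only, the spectral attention steps (global average pooling, two 1×1 convolutions, Sigmoid, channel reweighting) might be sketched in NumPy as follows. On a pooled 1×1 map, a 1×1 convolution reduces to a matrix-vector product. The ReLU between the two convolutions and the reduced hidden width are assumptions, as the text does not state them; the name `spectral_attention` is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spectral_attention(f, w1, b1, w2, b2):
    """f: (C, H, W) features from MDREM.

    w1: (C_mid, C), b1: (C_mid,)  first 1x1 convolution
    w2: (C, C_mid), b2: (C,)      second 1x1 convolution
    """
    z = f.mean(axis=(1, 2))                 # global average pooling, one value per channel
    hidden = np.maximum(w1 @ z + b1, 0.0)   # first 1x1 conv, followed by an assumed ReLU
    q = sigmoid(w2 @ hidden + b2)           # second 1x1 conv + Sigmoid: channel weights Q
    return f * q[:, None, None], q          # rescale each channel by its weight
```

Each channel receives one weight shared by all pixels, which is the complement of the spatial attention module: the balance between bands is recalibrated while the spatial layout within each channel is preserved.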

[0067] This invention constructs a highly efficient spectral feature extraction module by combining deep feature extraction and a spectral attention module. By explicitly assigning different weights to different feature channels, the network can more effectively model the correlation between spectral channels, thereby more accurately preserving and refining the original multispectral information. This is crucial for preventing spectral distortion during deep network propagation and fusion, ensuring that the color and spectral characteristics of the final fused image remain faithful to the original multispectral image.

[0068] In one embodiment, after obtaining the intermediate spatial features and intermediate spectral features, the dual-stream spatial-spectral fusion method based on intermediate loss of the present invention may further include: determining the spatial similarity loss based on the intermediate spatial features and the panchromatic image; and determining the spectral consistency loss based on the intermediate spectral features and the upsampled multispectral image. After step S3, the method may further include: determining the global consistency loss of the spatial-spectral fusion network based on the panchromatic image, the upsampled multispectral image, and the spatial-spectral fused image; determining the total loss of the spatial-spectral fusion network based on the spatial similarity loss, the spectral consistency loss, and the global consistency loss; and updating the parameters of the spatial-spectral fusion network based on the total loss.

[0069] This invention constructs a composite loss function consisting of intermediate losses and a global constraint loss during the training phase to constrain the feature extraction process of the spatial stream and the spectral stream, as well as the final fusion result. The loss function may include the spatial similarity loss ($L_{spa}$), the spectral consistency loss ($L_{spe}$), and the global consistency loss ($L_{glo}$).

[0070] The spatial similarity loss $L_{spa}$ is used to constrain the structural correspondence between the spatial stream output features and the panchromatic image. By comparing the structural information of the intermediate spatial features with that of the original panchromatic image, the quality of spatial detail extraction is ensured.

[0071] The spectral consistency loss $L_{spe}$ is used to constrain the consistency of the spectral stream output features with the upsampled multispectral image in terms of spectral angle. The preservation of spectral characteristics is ensured by comparing the intermediate spectral features with the spectral information of the upsampled multispectral image.

[0072] The global consistency loss $L_{glo}$ is used to constrain the consistency between the final fusion result and the input observation data in terms of residual information distribution. From a global perspective, it ensures that the final spatial-spectral fused image remains consistent in information with the two input source images. It is a supplementary constraint used to guarantee the overall fusion effect.

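[0073] During the training phase, the spatial similarity loss, spectral consistency loss, and global consistency loss are weighted and combined to construct the network's total loss function: $L_{total} = \lambda_1 L_{spa} + \lambda_2 L_{spe} + \lambda_3 L_{glo}$, where $L_{spa}$, $L_{spe}$, and $L_{glo}$ are the spatial similarity loss, spectral consistency loss, and global consistency loss, respectively, and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are loss weight coefficients used to adjust the relative contribution of each loss term during network training.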

[0074] Then, the parameters of the spatial-spectral fusion network are updated based on the total loss. The gradient of the total loss with respect to all trainable parameters of the network is calculated, and the parameter values are then adjusted along the negative gradient direction with a given learning rate, so as to gradually reduce the total loss. This process is repeated iteratively over a large number of images until the network converges.
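By way of illustration only, the weighted combination of the three losses and the gradient-based parameter update might be sketched as follows. The three loss terms are replaced by toy quadratic functions of a two-element parameter vector, the weights (1.0, 1.0, 0.1) and learning rate are arbitrary placeholders, and the gradient is obtained by central differences instead of backpropagation; all names are hypothetical.

```python
import numpy as np

def total_loss(l_spa, l_spe, l_glo, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the spatial, spectral, and global loss terms."""
    w_spa, w_spe, w_glo = weights
    return w_spa * l_spa + w_spe * l_spe + w_glo * l_glo

def toy_objective(theta):
    """Stand-in for the network: three toy loss terms of a 2-element parameter vector."""
    l_spa = (theta[0] - 1.0) ** 2
    l_spe = (theta[1] + 2.0) ** 2
    l_glo = (theta[0] + theta[1]) ** 2
    return total_loss(l_spa, l_spe, l_glo, weights=(1.0, 1.0, 0.1))

def numerical_grad(fn, theta, eps=1e-6):
    """Central-difference gradient; a real network would use backpropagation."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (fn(theta + step) - fn(theta - step)) / (2.0 * eps)
    return grad

def train(fn, theta, lr=0.1, steps=100):
    """Repeatedly move the parameters along the negative gradient direction."""
    history = [fn(theta)]
    for _ in range(steps):
        theta = theta - lr * numerical_grad(fn, theta)
        history.append(fn(theta))
    return theta, history

theta_final, loss_history = train(toy_objective, np.array([5.0, 5.0]))
```

Raising one weight relative to the others pulls the converged parameters toward the minimum of the corresponding term, which is the sense in which the coefficients adjust the relative contribution of each loss.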

[0075] This invention employs a "GF series training / validation + WV-2 generalization test" approach for data selection. The training and validation sets are derived from the Gaofen (GF) series of multi-sensor data (GF-1, GF-1B, GF-1C, GF-1D, GF-2, GF-6) to improve the model's adaptability to different imaging parameters and noise characteristics. To evaluate cross-sensor transfer performance, WorldView-2 (WV-2) is selected as the independent test set.

[0076] This invention provides a training strategy driven by a composite loss function. Compared to training methods that rely solely on a single loss, this invention constructs a multi-layered, multi-faceted constraint system by combining intermediate losses (spatial similarity loss and spectral consistency loss) and global losses. The intermediate losses ensure the correctness of the feature extraction process, guaranteeing quality from a process perspective; the global losses constrain the final result, guaranteeing quality from a result perspective. This constraint approach, which emphasizes both process and result, makes network training more stable and comprehensive, avoiding the instability or local optima problems that may result from relying solely on final result constraints. Therefore, it can more reliably train a fusion network that performs excellently in spatial, spectral, and global characteristics.

[0077] In one embodiment, the present invention determines spatial similarity loss based on intermediate spatial features and a panchromatic image, and may further include: Determine the structural similarity between intermediate spatial features and panchromatic images; Subtracting the structural similarity from 1 yields the spatial similarity loss.

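[0078] The Structural Similarity Index Measure (SSIM) can be used as a spatial loss function to measure the structural similarity between the intermediate spatial features and the original panchromatic image: $L_{spa} = 1 - \mathrm{SSIM}(F_{spa}, \mathrm{PAN})$, where $F_{spa}$ denotes the intermediate spatial features, $\mathrm{PAN}$ denotes the panchromatic image, and $\mathrm{SSIM}(\cdot)$ is the structural similarity evaluation index.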
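By way of illustration only, a 1 − SSIM loss might be sketched in NumPy as follows. SSIM is normally computed over local windows and averaged; the sketch uses a single window covering the whole image as a simplification. It also assumes the intermediate spatial features have already been brought to the same shape as the panchromatic image, a step the text does not describe. The constants follow the commonly used values K1 = 0.01 and K2 = 0.03; the function names are hypothetical.

```python
import numpy as np

def ssim_global(x, y, data_range=1.0):
    """SSIM computed from whole-image statistics (single-window simplification)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    luminance = (2.0 * mx * my + c1) / (mx ** 2 + my ** 2 + c1)
    structure = (2.0 * cov + c2) / (vx + vy + c2)
    return luminance * structure

def spatial_similarity_loss(mid_spatial, pan, data_range=1.0):
    """Spatial similarity loss: 1 minus the structural similarity."""
    return 1.0 - ssim_global(mid_spatial, pan, data_range)
```

The loss is zero when the two inputs are identical and grows as their means, variances, or covariance diverge.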

[0079] This invention specifies the calculation method for the spatial similarity loss. Using SSIM as the core metric offers advantages over pixel-wise difference losses such as mean squared error (MSE). SSIM aligns better with how the human visual system perceives image structure: it considers not only differences in pixel values but also the local structure, texture, and edge information of the image. Therefore, minimizing the 1 - SSIM loss drives the spatial feature extraction subnetwork to learn and preserve the true spatial structure in the panchromatic image, rather than simply fitting pixel values. This contributes to generating fused images that are visually more natural, with sharper edges and richer texture.

[0080] In one embodiment, the present invention determines the spectral consistency loss based on intermediate spectral features and upsampled multispectral images, which may further include: The spectral angle mapping between intermediate spectral features and upsampled multispectral images is determined to obtain the spectral consistency loss.

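[0081] A spectral loss function can be constructed based on the Spectral Angle Mapper (SAM) to measure the spectral similarity between the intermediate spectral features and the upsampled multispectral image: $L_{spe} = \mathrm{SAM}(\mathrm{MidMS}, \mathrm{MS}\uparrow)$, where $\mathrm{MidMS}$ denotes the intermediate spectral features, $\mathrm{MS}\uparrow$ denotes the upsampled multispectral image, and $\mathrm{SAM}(\cdot)$ represents the spectral angle mapping function.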
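By way of illustration only, a spectral angle loss might be sketched in NumPy as follows. The sketch treats the vector of band values at each pixel as a spectral vector, computes the angle between corresponding vectors, and averages over pixels. It assumes the intermediate spectral features have the same number of channels and spatial size as the upsampled multispectral image, which the text does not state; the function name is hypothetical.

```python
import numpy as np

def spectral_angle_loss(mid_spectral, ms_up, eps=1e-8):
    """Mean spectral angle (radians) between two (C, H, W) arrays."""
    a = np.asarray(mid_spectral, dtype=np.float64)
    b = np.asarray(ms_up, dtype=np.float64)
    dot = (a * b).sum(axis=0)                                   # per-pixel inner product
    norm = np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0)
    cos = np.clip(dot / np.maximum(norm, eps), -1.0, 1.0)       # guard against rounding
    return float(np.arccos(cos).mean())
```

Multiplying either input by a positive constant leaves the loss unchanged, because only the direction of each spectral vector enters the computation.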

[0082] This invention specifies the calculation method for the spectral consistency loss. Compared with losses measured by the L1 or L2 norm, SAM is insensitive to overall changes in pixel brightness caused by illumination variations; it depends only on the direction of the spectral vector, that is, the shape of the spectral curve. SAM therefore measures the similarity of spectral characteristics more directly. Minimizing the SAM loss drives the spectral feature extraction subnetwork to learn and maintain the spectral shape of the original multispectral image, thereby better preserving the true material and color information of ground objects and generating a fused image with higher spectral fidelity.

[0083] In one embodiment, the present invention determines the global consistency loss of the spatial-spectral fusion network based on the panchromatic image, the upsampled multispectral image, and the spatial-spectral fusion image, which may further include: Convert the upsampled multispectral image into a grayscale image; The first residual between the upsampled multispectral image and the grayscale image, and the second residual between the panchromatic image and the spatial-spectral fusion image are determined respectively; Determine the KL divergence between the first and second residuals to obtain the global consistency loss.

[0084] A global consistency loss can be constructed based on the Kullback-Leibler (KL) divergence. First, the upsampled MS image (MS↑) is converted to a grayscale image and copied along the spectral dimension to match the number of multispectral channels; the PAN image is likewise copied along the spectral dimension to match the channel dimension. Then, the upsampled multispectral residual distribution and the fusion result residual distribution are constructed separately, and the residual maps are softmax normalized to obtain channel-level probability distributions. Finally, the KL divergence between the two is calculated: $R_1 = \mathrm{MS}\uparrow - \mathrm{Stack}(\mathrm{Gray}(\mathrm{MS}\uparrow))$, $R_2 = \mathrm{HRMS} - \mathrm{Stack}(\mathrm{PAN})$, $L_{glo} = \mathrm{KL}(\phi(R_1) \,\|\, \phi(R_2))$, where $\mathrm{Gray}(\cdot)$ indicates grayscale processing; $\mathrm{Stack}(\cdot)$ indicates stacking along the spectral dimension; $\phi(\cdot)$ represents the normalized mapping function; $\mathrm{KL}(\cdot\|\cdot)$ indicates the calculation of the KL divergence; $\mathrm{HRMS}$ is the spatial-spectral fused image; $R_1$ is the first residual; and $R_2$ is the second residual.
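By way of illustration only, the residual-distribution comparison described above might be sketched in NumPy as follows. Several details are assumptions not fixed by the text: the grayscale conversion is taken as the band average, the second residual is taken as fused image minus stacked PAN, the softmax is applied along the channel axis at each pixel, and the divergence is computed in the direction KL(first ‖ second) and averaged over pixels. The function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=0):
    z = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def global_consistency_loss(pan, ms_up, fused, eps=1e-12):
    """pan: (H, W); ms_up and fused: (C, H, W). Returns mean per-pixel KL divergence."""
    c = ms_up.shape[0]
    gray = ms_up.mean(axis=0)                          # grayscale as band average (assumed)
    r1 = ms_up - np.repeat(gray[None], c, axis=0)      # first residual: MS minus stacked gray
    r2 = fused - np.repeat(pan[None], c, axis=0)       # second residual: fused minus stacked PAN
    p = softmax(r1, axis=0)                            # channel-level distribution per pixel
    q = softmax(r2, axis=0)
    kl = (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=0)
    return float(kl.mean())
```

Under these assumptions the loss vanishes when the fused image equals the panchromatic image plus the multispectral residual in every band, which corresponds to injecting the spectral differences of the multispectral image onto the panchromatic structure.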

[0085] This invention provides a method for calculating the global consistency loss. Instead of directly comparing the pixel values of the fused image and the input images, it imposes the constraint by comparing the distribution of residual information, a more abstract quantity. The method is based on the physical assumption that "the fusion process injects pure spectral information from the multispectral image into the panchromatic image." Minimizing the KL divergence encourages the spectral information in the fused image to originate from the multispectral image and to follow a consistent distribution pattern. This helps ensure the correctness of the fusion from an information-theoretic perspective, allowing the fused image as a whole to better inherit the characteristics of the source images and thereby improving the global quality and fidelity of the fusion result.

[0086] In summary, existing self-supervised spatial-spectral fusion methods apply loss constraints only at the output layer and give no directional guidance to the learning of intermediate network features, which leads to unstable training and makes it difficult to ensure spatial detail and spectral consistency at the same time. This invention addresses these problems with a dual-stream spatial-spectral fusion network based on intermediate loss.

[0087] The core idea of this invention is to move the observation constraints forward to the intermediate feature layer. The target high-resolution multispectral (HRMS) image is regarded as the result of combining the spatial structure features from the PAN image with the spectral features from the MS image. The PAN image and the upsampled multispectral (UPMS) image are used as the sources of observation constraints for HRMS.

[0088] At the network structure level, a dual-stream fusion architecture consisting of spatial and spectral streams is constructed to extract spatial structure information and spectral representation information, respectively, and feature fusion is completed to generate HRMS during the decoding stage. During training, not only is the final output constrained, but an intermediate loss mechanism is also introduced at the intermediate output nodes of the spatial and spectral feature streams, so that the network is explicitly guided during the feature extraction stage. At the loss design level, spatial intermediate loss and spectral intermediate loss are constructed: the spatial intermediate loss uses PAN as a reference to constrain intermediate spatial features from the perspective of spatial structure consistency (e.g., using SSIM metric); the spectral intermediate loss uses upsampled MS as a reference to constrain intermediate spectral features from the perspective of multi-band spectral vector direction consistency (e.g., using SAM metric). In terms of training paradigm, this invention adopts a self-supervised training method, which does not require real HRMS labels and only relies on PAN and upsampled MS to drive the network to converge stably, thereby improving the model's generalization ability in complex scenes and cross-sensor conditions while ensuring the quality of real-scale reconstruction.

[0089] This invention moves the constraints forward to the intermediate feature layer between the spatial and spectral flows, rather than applying them only to the final fusion result. This allows for constraints on spatial structure consistency and spectral consistency during the feature extraction stage, thereby reducing the offset between the network's internal representations and the target result, and improving the stability of the training process and the consistency of the fusion result. Compared to existing methods that only apply weak constraints to the output, this intermediate loss mechanism can more directly guide the extraction and optimization of spatial and spectral features.

[0090] This invention matches a dual-stream network structure with an intermediate loss mechanism, achieving information fusion through the hierarchical feature interaction of spatial and spectral streams. It also incorporates a composite loss form of "spatial intermediate loss + spectral intermediate loss + global consistency loss" to jointly constrain intermediate representations and final outputs, avoiding instability issues caused by relying solely on a single loss to constrain the final result. This results in better adaptability under different imaging conditions and sensor configurations.

[0091] The intermediate loss-based spatial-spectral fusion device provided by the present invention will be described below. The intermediate loss-based spatial-spectral fusion device described below can be referred to in correspondence with the intermediate loss-based dual-stream spatial-spectral fusion method described above.

[0092] The present invention also provides a dual-stream spatial-spectral fusion device based on intermediate loss. As shown in Figure 6, the device includes: a first input module, used to input the panchromatic image into the spatial feature extraction subnetwork of the spatial-spectral fusion network to obtain the intermediate spatial features output by the spatial feature extraction subnetwork; a second input module, used to input the upsampled multispectral image into the spectral feature extraction subnetwork of the spatial-spectral fusion network to obtain the intermediate spectral features output by the spectral feature extraction subnetwork; and a third input module, used to fuse the intermediate spatial features and intermediate spectral features and then input them into the decoder of the spatial-spectral fusion network to obtain the spatial-spectral fused image output by the decoder. The loss function in the training phase of the spatial-spectral fusion network includes the spatial similarity loss between the intermediate spatial features and the panchromatic image, and the spectral consistency loss between the intermediate spectral features and the upsampled multispectral image.

[0093] Figure 7 is a schematic diagram of the physical structure of an electronic device. This device may include a processor, a communications interface, memory, and a communication bus. The processor, communications interface, and memory communicate with each other via the communication bus. The processor can invoke logical instructions from the memory to execute the dual-stream spatial-spectral fusion method based on intermediate loss.

[0094] Furthermore, the logical instructions in the aforementioned memory can be implemented as software functional units and sold or used as independent products, and can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0095] In another aspect, the present invention also provides a computer program product including a computer program stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer is able to execute the dual-stream spatial-spectral fusion method based on intermediate loss provided by the above methods.

[0096] In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon which, when executed by a processor, implements the dual-stream spatial-spectral fusion method based on intermediate loss provided by the above methods.

[0097] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0098] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0099] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A dual-stream spatial-spectral fusion method based on intermediate loss, characterized in that it comprises: inputting a panchromatic image into the spatial feature extraction subnetwork of a spatial-spectral fusion network to obtain the intermediate spatial features output by the spatial feature extraction subnetwork; inputting an upsampled multispectral image into the spectral feature extraction subnetwork of the spatial-spectral fusion network to obtain the intermediate spectral features output by the spectral feature extraction subnetwork; and fusing the intermediate spatial features and the intermediate spectral features and inputting the result into the decoder of the spatial-spectral fusion network to obtain the spatial-spectral fused image output by the decoder; wherein the loss function during the training phase of the spatial-spectral fusion network includes the spatial similarity loss between the intermediate spatial features and the panchromatic image, and the spectral consistency loss between the intermediate spectral features and the upsampled multispectral image.

2. The dual-stream spatial-spectral fusion method based on intermediate loss according to claim 1, characterized in that, The spatial feature extraction subnetwork includes a first encoder and multiple cascaded spatial feature extraction modules; The first encoder is used to encode the panchromatic image; The first cascaded spatial feature extraction module is used to extract spatial features from the output of the first encoder; The second and subsequent cascaded spatial feature extraction modules are used to extract spatial features from the output of the previous spatial feature extraction module.

3. The dual-stream spatial-spectral fusion method based on intermediate loss according to claim 2, characterized in that, The spatial feature extraction module includes a cascaded first depth feature extraction module and a spatial attention module; The first deep feature extraction module is used to extract deep spatial features from the input; The spatial attention module is used to enhance the deep spatial features.

4. The dual-stream spatial-spectral fusion method based on intermediate loss according to claim 2, characterized in that, The spectral feature extraction subnetwork includes a second encoder and multiple cascaded spectral feature extraction modules; The second encoder is used to encode the upsampled multispectral image; The first cascaded spectral feature extraction module is used to extract spectral features from the output of the second encoder; The second and subsequent cascaded spectral feature extraction modules are used to extract spectral features from the fusion result of the output of the corresponding spatial feature extraction module and the output of the previous spectral feature extraction module.

5. The dual-stream spatial-spectral fusion method based on intermediate loss according to claim 4, characterized in that, The spectral feature extraction module includes a cascaded second deep feature extraction module and a spectral attention module; The second deep feature extraction module is used to extract deep spectral features from the input; The spectral attention module is used to enhance the deep spectral features.

6. The dual-stream spatial-spectral fusion method based on intermediate loss according to claim 1, characterized in that, after obtaining the intermediate spatial features and the intermediate spectral features, the method further includes: determining the spatial similarity loss based on the intermediate spatial features and the panchromatic image; and determining the spectral consistency loss based on the intermediate spectral features and the upsampled multispectral image; and, after fusing the intermediate spatial features and the intermediate spectral features and inputting them into the decoder of the spatial-spectral fusion network to obtain the spatial-spectral fused image output by the decoder, the method further includes: determining the global consistency loss of the spatial-spectral fusion network based on the panchromatic image, the upsampled multispectral image, and the spatial-spectral fused image; determining the total loss of the spatial-spectral fusion network based on the spatial similarity loss, the spectral consistency loss, and the global consistency loss; and updating the parameters of the spatial-spectral fusion network based on the total loss.

7. The dual-stream spatial-spectral fusion method based on intermediate loss according to claim 6, characterized in that, Determining the spatial similarity loss based on the intermediate spatial features and the panchromatic image includes: Determine the structural similarity between the intermediate spatial features and the panchromatic image; Subtracting the structural similarity from 1 yields the spatial similarity loss.

8. The dual-stream spatial-spectral fusion method based on intermediate loss according to claim 6, characterized in that, Determining the spectral consistency loss based on the intermediate spectral features and the upsampled multispectral image includes: The spectral angle mapping between the intermediate spectral features and the upsampled multispectral image is determined to obtain the spectral consistency loss.

9. The dual-stream spatial-spectral fusion method based on intermediate loss according to claim 6, characterized in that, The step of determining the global consistency loss of the spatial-spectral fusion network based on the panchromatic image, the upsampled multispectral image, and the spatial-spectral fusion image includes: Convert the upsampled multispectral image into a grayscale image; The first residual between the upsampled multispectral image and the grayscale image, and the second residual between the panchromatic image and the spatial-spectral fusion image are determined respectively; The KL divergence between the first residual and the second residual is determined to obtain the global consistency loss.

10. A dual-stream spatial-spectral fusion device based on intermediate loss, characterized in that it comprises: a first input module, used to input the panchromatic image into the spatial feature extraction subnetwork of the spatial-spectral fusion network to obtain the intermediate spatial features output by the spatial feature extraction subnetwork; a second input module, used to input the upsampled multispectral image into the spectral feature extraction subnetwork of the spatial-spectral fusion network to obtain the intermediate spectral features output by the spectral feature extraction subnetwork; and a third input module, used to fuse the intermediate spatial features and the intermediate spectral features and then input them into the decoder of the spatial-spectral fusion network to obtain the spatial-spectral fused image output by the decoder; wherein the loss function during the training phase of the spatial-spectral fusion network includes the spatial similarity loss between the intermediate spatial features and the panchromatic image, and the spectral consistency loss between the intermediate spectral features and the upsampled multispectral image.