Method for training variational autoencoder with multi-channel latent space, method for training remote sensing image model, and electronic device
By using a multi-channel latent space variational autoencoder, which combines red, green, and blue channels, depth map channel, and semantic channel, to encode and decode multimodal data of remote sensing images, depth maps, and ground cover maps, the problem of insufficient detail and accuracy in remote sensing image generation is solved, and cross-modal consistency and geographic authenticity of multimodal data are achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HANGZHOU HIGH-TECH ZONE (BINJIANG) INSTITUTE OF BLOCKCHAIN & DATA SECURITY
- Filing Date
- 2026-02-05
- Publication Date
- 2026-06-19
AI Technical Summary
The remote sensing images generated by existing technologies have low detail richness and accuracy, and cannot effectively capture geospatial correlation information, resulting in a lack of geographical authenticity and scene consistency in the generated images.
A multi-channel latent space variational autoencoder is adopted. By combining the red, green and blue three channels, the depth map channel and the semantic channel, it encodes and decodes multimodal data of remote sensing images, depth maps and ground cover maps to generate a unified latent space code. The model parameters are updated by the target training loss to optimize the encoding and decoding accuracy.
It improves the detail richness and accuracy of remote sensing images, ensures cross-modal consistency of multimodal output data, solves the limitations of feature expression in single-modal coding, and meets the deep needs of precise remote sensing application scenarios.
Figure CN122242586A_ABST
Description
Technical Field
[0001] This application belongs to the field of neural network model technology, and particularly relates to a training method for a multi-channel latent space variational autoencoder, a training method for a remote sensing image model, and an electronic device.
Background Technology
[0002] With the widespread application of remote sensing technology in fields such as land surveying, environmental monitoring, and disaster assessment, relying solely on the visual features of single-modal remote sensing images is no longer sufficient to meet the needs of precise scenarios. The industry has put forward an urgent need for multimodal remote sensing data processing technology that integrates multi-dimensional visual features, and multi-channel collaborative encoding and decoding has become the core direction for improving the application value of remote sensing data.
[0003] Variational autoencoders (VAEs), as classic generative models, have been applied to the encoding and reconstruction of remote sensing data. In the early stages, because remote sensing data processing focused on basic image reconstruction, restoration, and resolution enhancement, existing technologies concentrated on processing the visual features of single-modal data (such as RGB remote sensing images). For example, an encoder extracts single-dimensional visual features such as color and texture from an image and maps them to a single latent space for feature compression, and a decoder then reconstructs the target remote sensing image from this latent space. This approach is simple to implement and meets basic visual generation requirements, and has therefore long been the mainstream solution for remote sensing image generation.
[0004] However, the aforementioned single-modal generation schemes rely solely on the encoding and generation of visual features, failing to capture the implicit geospatial relationships behind remote sensing images. This results in generated remote sensing images that, while visually acceptable, lack geographic realism and scene consistency. Furthermore, the limited feature representation capabilities of a single latent space make it difficult to support complementary relationships across multiple dimensions, restricting the detail richness and accuracy of the generated images and failing to meet the deeper needs of precision remote sensing applications.
Summary of the Invention
[0005] This application provides a training method for a multi-channel latent space variational autoencoder, a training method for a remote sensing image model, and an electronic device, which can solve the problem of low detail richness and accuracy of remote sensing images generated by existing technologies.
[0006] In a first aspect, embodiments of this application provide a training method for a variational autoencoder with a multi-channel latent space, the method comprising:
[0007] Acquiring first training data; the first training data includes a first remote sensing image, a first depth map of the first remote sensing image, and a first land feature map of the first remote sensing image; the first depth map describes the topographic elevation undulation information of each pixel in the first remote sensing image; the first land feature map describes the land feature unit present at each pixel of the first remote sensing image;
Inputting the first training data into the encoder of the variational autoencoder to obtain a first latent space code; the variational autoencoder includes pre-trained red, green, and blue (RGB) channels, a depth map channel to be trained, and a semantic channel to be trained; the RGB channels are used to encode and decode the three-channel image features of the first remote sensing image; the depth map channel is used to encode and decode the topographic elevation undulation information in the first depth map; the semantic channel is used to encode and decode the land feature units in the first land feature map; the first latent space code is generated by the encoder after encoding the first training data based on the encoding capabilities of the RGB channels, the depth map channel, and the semantic channel;
Inputting the first latent space code into the decoder of the variational autoencoder to obtain output data; the output data includes a second remote sensing image, a second depth map, and a second land feature map;
Calculating the target training loss between the first training data and the output data;
Updating the depth map channel, the semantic channel, and the model parameters of the decoder based on the target training loss to obtain the updated variational autoencoder.
[0008] In a second aspect, embodiments of this application provide a training method for a remote sensing image model, the remote sensing image model including a denoising backbone network and the variational autoencoder described in the first aspect; the method includes:
Acquiring second training data; the second training data includes global text describing a third remote sensing image, regional text describing the corresponding areas of the third remote sensing image, and text describing the land feature units in the third remote sensing image;
Inputting the second training data into the denoising backbone network to generate a second latent space code of the third remote sensing image;
Inputting the third remote sensing image, a third depth map of the third remote sensing image, and a third land feature map of the third remote sensing image into the encoder of the variational autoencoder to obtain a third latent space code;
Updating the model parameters of the denoising backbone network based on the second latent space code and the third latent space code to obtain the trained remote sensing image model.
[0009] In a third aspect, embodiments of this application provide an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the method described in the first or second aspect above.
[0010] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method described in the first or second aspect above.
[0011] In a fifth aspect, embodiments of this application provide a computer program product that, when run on an electronic device, causes the electronic device to perform the method described in the first or second aspect above.
[0012] Compared with the prior art, the embodiments of this application have the following beneficial effects. First training data including a first remote sensing image, a first depth map, and a first land feature map is acquired and input into the encoder of a variational autoencoder; by combining the specialized encoding capabilities of the RGB channels, the depth map channel, and the semantic channel, the intrinsic relationships among visual features, terrain elevation information, and the semantics of land feature units are fully explored. This design directly compensates for the deficiency of single-modal schemes, which focus only on visual features and cannot capture geospatial relationships; it addresses at the source the lack of geographic realism and scene consistency in generated remote sensing images, and overcomes the feature expression limitations of single-modal encoding. The multimodal data is then jointly encoded in a unified latent space, which avoids the information fragmentation caused by independent encoding and, by remaining compatible with the pre-trained RGB channel parameters, reduces the training difficulty of the depth map channel and the semantic channel. At the same time, it addresses the latent space distribution drift that tends to arise when multi-channel architectures are extended directly in related technologies, removes the risk of fine-tuning failure or generation collapse, and provides stable technical support for multi-channel extension. The decoder then synchronously outputs a second remote sensing image, a second depth map, and a second land feature map, ensuring cross-modal consistency of the multimodal output data. This resolves the scene logic conflicts that tend to occur when traditional models generate multiple types of data separately, and further improves the plausibility of geospatial associations. Finally, the target training loss is calculated from the first training data and the output data, and the depth map channel, the semantic channel, and the decoder parameters are updated accordingly. This continuously optimizes the encoding and decoding accuracy of the multimodal data, improves the restoration accuracy of terrain elevation undulation features and land feature unit semantics, overcomes the bottleneck in detail richness and accuracy imposed by a single latent space, and meets the deeper needs of precision remote sensing application scenarios.
Attached Figure Description
[0013] To more clearly illustrate the technical solutions in the embodiments of this application, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0014] Figure 1 is a flowchart illustrating the implementation of a training method for a multi-channel latent space variational autoencoder according to an embodiment of this application;
Figure 2 is a schematic diagram illustrating one implementation of obtaining the first training data in a training method for a multi-channel latent space variational autoencoder provided in an embodiment of this application;
Figure 3 is a schematic diagram illustrating one implementation of calculating the target training loss in a training method for a multi-channel latent space variational autoencoder provided in an embodiment of this application;
Figure 4 is a schematic diagram illustrating one implementation of calculating the geomorphic structure fidelity loss in a training method for a multi-channel latent space variational autoencoder provided in an embodiment of this application;
Figure 5 is a schematic diagram of the variational autoencoder model structure in a training method for a multi-channel latent space variational autoencoder provided in an embodiment of this application;
Figure 6 is a flowchart illustrating the implementation of a training method for a remote sensing image model according to an embodiment of this application;
Figure 7 is a schematic diagram illustrating one implementation of obtaining the second training data in a training method for a remote sensing image model provided in an embodiment of this application;
Figure 8 is a schematic diagram illustrating one implementation of generating the second latent space code in a training method for a remote sensing image model provided in an embodiment of this application;
Figure 9 is a schematic diagram illustrating one implementation of generating the second latent space code in a training method for a remote sensing image model provided in another embodiment of this application;
Figure 10 is a schematic diagram of the model structure in the training stage of a training method for a remote sensing image model provided in an embodiment of this application;
Figure 11 is a schematic diagram of the model structure in the application stage of a training method for a remote sensing image model provided in an embodiment of this application;
Figure 12 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0015] In the following description, specific details such as particular system architectures and techniques are set forth for illustrative purposes and not for limitation, in order to provide a thorough understanding of the embodiments of this application. However, those skilled in the art will understand that this application may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods have been omitted so as not to obscure the description of this application with unnecessary detail.
[0016] It should be understood that, when used in this application specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or combinations thereof.
[0017] It should be noted that the information collection process (such as patient information collection process, physiological information collection process, etc.) / feature extraction process involved in this application is carried out with the user's knowledge and permission. That is, the information collection process / feature extraction process complies with the requirements of laws and regulations and does not constitute an act that harms the public interest.
[0018] Furthermore, in the description of this application and the appended claims, the terms "first," "second," "third," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.
[0019] One embodiment of this application provides a training method for a multi-channel latent space variational autoencoder. This method can be applied to electronic devices such as laptops, ultra-mobile personal computers (UMPCs), and netbooks. This application embodiment does not impose any restrictions on the specific type of electronic device.
[0020] Referring to Figure 1, Figure 1 is a flowchart illustrating the implementation of a training method for a multi-channel latent space variational autoencoder according to an embodiment of this application. The method includes the following steps:
S101. Obtain the first training data.
[0021] The first training data includes a first remote sensing image, a first depth map of the first remote sensing image, and a first land feature map of the first remote sensing image; the first depth map is used to describe the topographic elevation undulation information of each pixel in the first remote sensing image; the first land feature map is used to describe the land feature units present in each pixel of the first remote sensing image.
[0022] In one embodiment, the first remote sensing image is the original remote sensing data source for model training. It can be generated by sensors carried by remote sensing platforms such as satellites and drones, and is presented in the form of RGB three-channel images. It includes visual features such as the color and texture of the ground surface (such as the green texture of vegetation and the grayscale features of buildings), and serves as the carrier for the subsequent generation of depth maps and feature maps.
[0023] The first remote sensing image can be downloaded from a public remote sensing data platform (such as Landsat or Sentinel satellite data sharing platform) or directly acquired by a customized remote sensing mission (such as drone aerial photography). It is necessary to ensure that the image contains the complete target area and that the resolution meets the training requirements.
[0024] The aforementioned first depth map is a pixel-level elevation data map derived from the first remote sensing image, which is equivalent to a digital elevation model (DEM). Each pixel value directly corresponds to the absolute or relative elevation of the corresponding location on the ground. It is a concrete representation of the terrain elevation undulation information and provides data support for landform classification (such as the 10 landform classifications of the Geomorphon algorithm).
[0025] The first depth map can be generated by retrieving DEM data from a global DEM database that matches the geographic coordinates of the first remote sensing image, and then aligning the coordinates and unifying the resolution. Alternatively, it can be generated based on the first remote sensing image by inferring pixel-level elevation data using a neural network model (such as the Monodepth series or a remote sensing-specific depth estimation model).
[0026] For example, an electronic device can calculate the local undulation on a DEM using a sliding window, where the local undulation is the difference between the maximum and minimum elevations within the window. The window side length can be preset, for example to 33 pixels. To adapt to data with different spatial resolutions, a resolution-related scaling factor (scale) can be introduced to normalize the local undulation threshold; for example, scale = 1 at a resolution of 30 m. Based on the normalized threshold and a preset absolute elevation classification rule table, each pixel (or each region) is then classified into landform types such as ocean, plain, plateau, hills, low mountains, and high mountains.
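The sliding-window computation described above can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function names `local_undulation` and `classify_by_thresholds` are hypothetical, border pixels are handled by edge replication, the scale factor is assumed to multiply the thresholds, and any thresholds passed in are placeholders rather than values from the classification rule table.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def local_undulation(dem, window=33):
    """Return max minus min elevation in a square window centred on each pixel.

    `window` is expected to be odd so that the output keeps the DEM's shape.
    """
    pad = window // 2
    # Assumption: replicate border values instead of shrinking the output.
    padded = np.pad(dem, pad, mode="edge")
    windows = sliding_window_view(padded, (window, window))
    return windows.max(axis=(-2, -1)) - windows.min(axis=(-2, -1))


def classify_by_thresholds(undulation, thresholds, scale=1.0):
    """Bin undulation values into class indices 0..len(thresholds).

    Assumption: the resolution-related scale multiplies the thresholds.
    `thresholds` must be increasing; the caller supplies the actual values.
    """
    return np.digitize(undulation, np.asarray(thresholds) * scale)
```

The class indices returned by `classify_by_thresholds` would still need to be combined with the absolute elevation rules to obtain the final landform type.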
[0027] The classification rule table can be shown in Table 1 below: Table 1:
[0028] The first land feature map mentioned above is a semantic distribution map of tangible land features that actually exist on the land surface. It clearly defines the land feature at each location in the form of pixel-level annotation. It is a visual representation of land feature units and can be directly used for encoding training of the semantic channels of the model.
[0029] The first land cover map can be generated by extracting the data for the corresponding region from an existing land cover semantic database, merging categories (e.g., merging the original categories into the 7 land cover categories used for training), and calibrating coordinates. Alternatively, semantic segmentation can be performed on the first remote sensing image (e.g., using a U-Net or SegNet model), combined with manual correction and annotation, to obtain pixel-level land cover classification results.
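The category-merging step can be sketched with a lookup table, as below. The mapping `SOURCE_TO_TARGET` (10 source class ids merged into 7 target ids) and the function name `merge_categories` are purely hypothetical; the actual source categories and their grouping depend on the land cover database used.

```python
import numpy as np

# Hypothetical mapping: 10 source class ids merged into 7 target class ids.
SOURCE_TO_TARGET = {0: 0, 1: 1, 2: 1, 3: 2, 4: 3, 5: 3, 6: 4, 7: 5, 8: 6, 9: 6}


def merge_categories(label_map, mapping, unknown=255):
    """Remap a pixel-level class-id map through a lookup table.

    `label_map` holds non-negative integer ids no larger than the largest
    key in `mapping`; ids missing from `mapping` become `unknown`.
    """
    lut = np.full(max(mapping) + 1, unknown, dtype=np.uint8)
    for src, dst in mapping.items():
        lut[src] = dst
    # Integer-array indexing applies the table to every pixel at once.
    return lut[label_map]
```

For instance, `merge_categories(np.array([[0, 2], [5, 9]]), SOURCE_TO_TARGET)` yields the merged ids `[[0, 1], [3, 6]]` under this illustrative mapping.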
[0030] The aforementioned topographic elevation undulation information refers to the altitude differences and derived characteristics of different locations on the Earth's surface, including but not limited to absolute elevation (altitude), relative elevation difference (elevation difference between adjacent pixels), slope, aspect, and other information. It is a key basis for distinguishing landform types such as flat land, mountain peaks, and valleys.
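Two of the derived quantities named above, slope and aspect, can be computed from a DEM by finite differences. The sketch below is illustrative only: the function name `slope_aspect` is hypothetical, a square cell size in metres is assumed, and the aspect convention (compass bearing of the downslope direction, with rows increasing southward and columns increasing eastward) is an assumption rather than something specified in this application.

```python
import numpy as np


def slope_aspect(dem, cell_size=30.0):
    """Return (slope, aspect) in degrees for a 2-D elevation array.

    Assumptions: square cells of `cell_size` metres, rows increase
    southward, columns increase eastward.
    """
    # np.gradient returns derivatives along axis 0 (rows) and axis 1 (columns).
    dz_drow, dz_dcol = np.gradient(dem, cell_size)
    slope = np.degrees(np.arctan(np.hypot(dz_dcol, dz_drow)))
    # Downslope vector: east component -dz_dcol, north component +dz_drow.
    aspect = np.degrees(np.arctan2(-dz_dcol, dz_drow)) % 360.0
    return slope, aspect
```

On perfectly flat cells the aspect is undefined; this sketch returns 0 there, which a real pipeline would likely mask out.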
[0031] The aforementioned land feature units refer to natural or artificial objects / areas on the land surface that have clear attributes and boundaries. They are the basic semantic units that constitute a land feature map. Common types include water bodies (rivers, lakes), vegetation (forests, farmland), buildings (urban building complexes), bare land, roads, etc.
[0032] It should be noted that by simultaneously acquiring the first remote sensing image (visual), the first depth map (terrain elevation), and the first land cover map (land cover semantics), targeted training data can be provided for the RGB three channels, depth map channel, and semantic channel of the variational autoencoder, respectively. This ensures that the subsequent encoder can fully explore the intrinsic relationship between the three types of features, avoid information loss caused by single-modal data, and ultimately support the model to achieve accurate encoding and decoding of multimodal data.
[0033] In one embodiment, the first remote sensing image, the first depth map, and the first land cover map must meet the constraint of strict pixel-level alignment. Specifically, the spatial coordinate systems of the first remote sensing image, the first depth map, and the first land cover map are consistent, and corresponding pixels form a one-to-one relationship. For example, if a pixel in the first remote sensing image exhibits the visual texture features of a water body, the same pixel in the first depth map corresponds to a low elevation value representing a water body area, and the same pixel in the first land cover map is explicitly labeled as a "water body" land cover category, ensuring that the three types of data are accurately matched in spatial location and providing a consistent spatial reference for subsequent multi-channel collaborative coding.
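A simple guard for this alignment constraint is to verify that the three rasters share one pixel grid before combining them into a single multi-channel array. The sketch below is an assumption-laden illustration: the name `stack_modalities` is hypothetical, the land cover map is assumed to be a single channel of class ids, and equal array shapes are only a necessary condition for alignment (coordinate systems and georeferencing must be checked separately).

```python
import numpy as np


def stack_modalities(rgb, depth, land_cover):
    """Stack RGB (H, W, 3), depth (H, W), and class ids (H, W) into (H, W, 5)."""
    if rgb.shape[:2] != depth.shape or rgb.shape[:2] != land_cover.shape:
        raise ValueError(
            f"pixel grids differ: rgb {rgb.shape[:2]}, "
            f"depth {depth.shape}, land cover {land_cover.shape}"
        )
    # Channel order (assumed): R, G, B, depth, semantic.
    return np.concatenate(
        [
            rgb.astype(np.float32),
            depth[..., None].astype(np.float32),
            land_cover[..., None].astype(np.float32),
        ],
        axis=-1,
    )
```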
[0034] S102. Input the first training data into the encoder in the variational autoencoder to obtain the first latent space code.
[0035] The aforementioned variational autoencoder includes pre-trained red, green, and blue (RGB) channels, a depth map channel to be trained, and a semantic channel to be trained. The RGB channels are used to encode and decode the three-channel image features of the first remote sensing image. The depth map channel is used to encode and decode the terrain elevation undulation information in the first depth map. The semantic channel is used to encode and decode the land cover units in the first land cover map. The first latent space code is generated by the encoder after encoding the first training data based on the encoding capabilities of the RGB channels, the depth map channel, and the semantic channel.
[0036] In one embodiment, the original design of the variational autoencoder handles only single-modal RGB image processing and includes the pre-trained red, green, and blue (RGB) channels. The network structure of these RGB channels (such as convolutional blocks, activation layers, pooling layers, etc.) has been finalized through training on large-scale RGB remote sensing images, and the parameters (weights, biases) formed after training are completely preserved. In the subsequent extension, the internal structure and pre-trained parameters of the RGB channels can remain unchanged, and they can be integrated into the multi-channel architecture as a basic encoding module, reusing their mature visual feature extraction capabilities.
[0037] The aforementioned encoder is a component of a variational autoencoder, essentially a multimodal feature compression and fusion network. It consists of parallel encoding branches for RGB three channels, depth map channels, and semantic channels, as well as a feature fusion layer. The encoder can receive multi-source input data and extract specific features through each channel branch. Then, the fusion layer maps the multi-dimensional features to a low-dimensional, continuous latent space, generating a unified first latent space encoding. This provides a compact and information-rich feature representation for the subsequent reconstruction task of the decoder.
[0038] The aforementioned three-channel image features refer to the visual dimension features extracted from the first remote sensing image using the RGB three channels, including pixel color distribution (such as the blue features of water bodies and the green features of vegetation), texture details (such as the regular texture of buildings and the rough texture of mountains), brightness gradient, etc., which are information that characterizes the visual appearance of remote sensing images.
[0039] The aforementioned depth map channel is a newly added channel to be trained, adapted to the encoding of terrain elevation information. It remains compatible with the network block structure of the RGB channels (such as the same convolutional kernel size and feature map dimension), but its initial parameters have not been trained. The function of the depth map channel is to extract terrain elevation undulation information (such as elevation values, height differences between adjacent pixels, slope relationships, etc.) from the first depth map and convert it into numerical features that can be combined with visual features, achieving specialized encoding of terrain information.
[0040] The aforementioned semantic channel is also a newly added channel to be trained. Its structure is compatible with the RGB channels and the depth map channel, and its initial parameters have not been trained. The core function of the semantic channel is to focus on the land cover units (such as water bodies, vegetation, buildings, etc.) in the first land cover map, extract semantic features such as the category attributes and spatial distribution relationships of land cover, complete the specialized encoding of land cover semantic information, and make up for the lack of semantic expression in purely visual features.
[0041] The first latent space encoding, the final output of the encoder, is a unified multimodal latent space representation that integrates RGB three-channel visual features, depth map channel terrain features, and semantic channel semantic features. This first latent space encoding can be a concatenation of encoding results from multiple channels, or it can be a low-dimensional, structured feature vector generated after mining the intrinsic relationships between the three types of features through a feature fusion layer. Furthermore, while preserving information from each modality, it also achieves the synergistic association of multimodal information, providing a unified feature foundation for the decoder to simultaneously reconstruct multiple types of output data.
[0042] As an example, an encoder may contain convolutional layers (for feature extraction), batch normalization layers (for stable training), activation function layers (to introduce non-linearity), some pooling layers (to compress feature dimensions), and fusion layers (to output the first latent space encoding).
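The branch-then-fuse structure of the encoder can be illustrated with a toy NumPy sketch. Everything here is a simplifying assumption rather than the architecture of this application: each branch is reduced to a per-pixel linear projection with ReLU in place of convolutional blocks, no spatial downsampling is performed, the fusion layer is a linear map from the concatenated branch features to a mean and log-variance, and the latent code is sampled with the standard VAE reparameterization. The names `branch`, `encode`, and the `params` keys are hypothetical.

```python
import numpy as np


def branch(x, weight):
    """Per-pixel linear projection plus ReLU, a stand-in for a conv branch."""
    return np.maximum(x @ weight, 0.0)


def encode(rgb, depth, semantic, params, rng):
    """Fuse three branch outputs and sample a latent code.

    rgb: (H, W, 3); depth and semantic: (H, W, 1).
    """
    fused = np.concatenate(
        [
            branch(rgb, params["rgb"]),
            branch(depth, params["depth"]),
            branch(semantic, params["semantic"]),
        ],
        axis=-1,
    )
    mu = fused @ params["mu"]
    logvar = fused @ params["logvar"]
    # Reparameterization: z = mu + sigma * eps with eps ~ N(0, I).
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps, mu, logvar


rng = np.random.default_rng(0)
features, latent = 8, 4  # illustrative sizes only
params = {
    "rgb": rng.normal(size=(3, features)),
    "depth": rng.normal(size=(1, features)),
    "semantic": rng.normal(size=(1, features)),
    "mu": rng.normal(size=(3 * features, latent)),
    "logvar": np.zeros((3 * features, latent)),
}
z, mu, logvar = encode(
    rng.random((16, 16, 3)),
    rng.random((16, 16, 1)),
    rng.random((16, 16, 1)),
    params,
    rng,
)
```

Here `z` plays the role of the unified latent code: a single array that depends on all three modalities at every pixel.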
[0043] It should be noted that by reusing the mature visual feature extraction capabilities of the pre-trained RGB three-channel system, the inefficiency of training a visual encoding module from scratch can be avoided. Furthermore, by adding two specialized channels to specifically encode terrain elevation information and semantic information of ground features, the limitation of the original VAE in processing only visual features is overcome. Finally, through the encoder's fusion mechanism, the three types of heterogeneous features are mapped to a unified latent space to generate the first latent space encoding. This not only solves the information fragmentation problem of multimodal data but also reduces the training difficulty of the new channels by leveraging pre-trained parameters. Simultaneously, it avoids the risk of latent space distribution drift caused by directly expanding multiple channels, laying the foundation for the reconstruction of subsequent multimodal data.
[0044] S103. Input the first latent space code into the decoder in the variational autoencoder to obtain output data; the output data includes the second remote sensing image, the second depth map, and the second land feature map.
[0045] In one embodiment, the decoder is the output component of a variational autoencoder (VAE), structurally symmetrical with and functionally complementary to the encoder. Essentially, it is a multimodal feature decompression and specialized reconstruction network. It includes multiple parallel decoding branches (RGB three-channel decoding branch, depth map channel decoding branch, and semantic channel decoding branch) corresponding to the encoder, used to receive a unified first latent space encoding. Through decompression (such as deconvolution and upsampling) and feature reconstruction, it restores the low-dimensional fused features to the corresponding high-dimensional output data of the modality, ensuring that the structure and dimension of the multimodal output are consistent with the input data.
[0046] As an example, a decoder may include deconvolutional / transposed convolutional layers (decompression units), upsampling layers (to achieve multi-scale feature recovery), batch normalization layers (to stabilize gradient distribution during training), activation function layers (to introduce non-linear feature transformation), and multi-channel decoding branch layers (output data).
[0047] The aforementioned second remote sensing image is the output of the RGB three-channel branch in the decoder. It is a reconstruction product of the first remote sensing image, restoring the RGB three-channel visual features (color, texture, brightness, etc.) of the original first remote sensing image. It is completely consistent with the dimension and spatial resolution of the first remote sensing image and is used for subsequent calculation of the loss of pixel-level visual feature restoration accuracy.
[0048] The second depth map mentioned above is the output of the depth map channel branch of the decoder. In essence, it is a reconstructed digital elevation model. Each pixel value corresponds to the surface elevation information. It is a restoration of the terrain elevation undulation information in the first depth map and provides predictive data for calculating the fidelity loss of landform structure.
[0049] The second land cover map mentioned above is the output of the semantic channel branch of the decoder. It is a reconstructed pixel-level land cover semantic distribution map, which restores the land cover unit category (such as water body, vegetation, building, etc.) of each pixel in the first land cover map. It is used for subsequent semantic consistency loss calculation to ensure the accuracy of land cover semantic restoration.
[0050] It should be noted that the parallel decoding branches of the decoder restore the first latent space encoding, which incorporates multimodal information, into three types of output data: visual (second remote sensing image), terrain (second depth map), and semantic (second land feature map). A single encoding therefore yields restoration results for all three types. Furthermore, the pixel-level alignment between the multimodal output and the input data is maintained (pixels at the same coordinates still correspond one-to-one), ensuring cross-modal consistency. Generating the three types of data simultaneously provides complete prediction samples for the subsequent calculation of the target training loss, supporting targeted updates to the model parameters. Moreover, multi-branch decoding preserves the restoration accuracy of the features specific to each modality, avoids the information distortion that multimodal fusion could cause, and provides output-level assurance for addressing the lack of geographic realism and scene consistency.
[0051] S104. Calculate the target training loss for the first training data and the output data.
[0052] In one embodiment, the electronic device can calculate the similarity between the first remote sensing image and the second remote sensing image, the similarity between the first depth map and the second depth map, and the similarity between the first ground feature map and the second ground feature map, and calculate the target training loss based on each similarity.
[0053] S105. Update the depth map channel, semantic channel, and model parameters in the decoder according to the target training loss to obtain the updated variational autoencoder.
[0054] In one embodiment, the above-mentioned model parameter update method can be gradient descent, conjugate gradient, etc., and there is no limitation thereto. For example, the electronic device can adjust the model parameters (weights, biases) along the negative gradient direction according to a preset learning rate (or an adaptively adjusted step size) until the target training loss converges and the update stops.
[0055] The initial learning weights for the red, green, and blue channels can be the parameters formed after training as described above, while the initial learning weights for the depth map channel and the semantic channel can be random.
[0056] In another embodiment, the electronic device may also calculate the average weight of preset initial learning weights in the red, green, and blue channels. Then, the average weight is set as the initial learning weight for the depth map channel and the semantic channel, respectively.
[0057] In one embodiment, the preset initial learning weights in the red, green, and blue channels can be the parameters formed after training with large-scale RGB remote sensing images as described above.
[0058] Furthermore, by calculating the average weight of the preset initial learning weights in the red, green, and blue channels, the depth map channel and semantic channel to be trained can be assigned as their initial learning weights, ensuring that the initial learning weights of the depth map channel and semantic channel match the pre-trained channel parameters in terms of magnitude.
[0059] In this embodiment, by calculating the average weight of the preset initial learning weights in the red, green, and blue channels, and setting the average weight as the initial learning weights for the depth map channel and the semantic channel respectively, the depth map channel and the semantic channel can be adapted to the parameter distribution characteristics of the red, green, and blue channels from the early stage of training. This avoids training oscillations or latent space distribution drift caused by random initialization of the depth map channel and the semantic channel, reduces training difficulty and convergence cost, and can quickly establish a parameter coordination foundation among multiple channels. This ensures that the encoding capabilities of RGB visual features, terrain elevation features, and ground feature semantic features are optimized simultaneously, providing stable initial conditions for unified latent space fusion of multimodal data, and further improving the efficiency of model training and the consistency of cross-modal output.
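The average-weight initialization described above can be sketched as follows. This is a minimal illustration only, assuming that the "channel weights" are the per-input-channel slices of a convolution kernel of shape (out, in, kH, kW); the function name `extend_input_conv` and the random stand-in for the pre-trained RGB weights are hypothetical and not taken from the original.

```python
import numpy as np

def extend_input_conv(rgb_weight, extra_channels=2):
    """Extend a (out, 3, kH, kW) kernel to (out, 3 + extra_channels, kH, kW).

    Assumption: each new input channel (depth, semantic) starts from the
    mean of the three pre-trained RGB kernel slices.
    """
    mean_kernel = rgb_weight.mean(axis=1, keepdims=True)          # (out, 1, kH, kW)
    new_kernels = np.repeat(mean_kernel, extra_channels, axis=1)  # (out, extra, kH, kW)
    return np.concatenate([rgb_weight, new_kernels], axis=1)

rng = np.random.default_rng(0)
rgb_w = rng.normal(size=(8, 3, 3, 3))  # stand-in for pre-trained RGB weights
w5 = extend_input_conv(rgb_w)
print(w5.shape)  # (8, 5, 3, 3)
```

Because the new slices are an average of existing ones, they have the same order of magnitude as the pre-trained weights, which is the property the paragraph above relies on.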
[0060] It should be noted that updating the depth map channels can gradually enhance the encoding and decoding capabilities of terrain elevation undulation information, and updating the semantic channels can optimize the semantic representation accuracy of ground feature units. Furthermore, the decoder branches can improve the consistency of multimodal data reconstruction and avoid latent space distribution drift. Additionally, updating only the newly added depth map channels, semantic channels, and decoder parameters can preserve the RGB three-channel pre-training parameters, ensuring compatibility between the latent space distribution and the original pre-training distribution during multi-channel fusion, thus resolving the issue of fine-tuning failure caused by directly expanding multiple channels. Simultaneously, updating the model parameters using the multi-dimensional target training loss can guide the collaborative optimization of parameters across modules, ensuring the intrinsic correlation and matching of visual features, terrain features, and semantic features, strengthening cross-modal collaboration, and ultimately improving the geographic realism and scene consistency of the variational autoencoder output.
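One possible way to update only the newly added channels while preserving the RGB pre-training parameters is a masked gradient step. The sketch below assumes a plain gradient-descent update and a 0/1 mask over the kernel; the names `masked_sgd_step` and `trainable` are hypothetical.

```python
import numpy as np

def masked_sgd_step(weight, grad, trainable, lr=1e-3):
    """One gradient-descent step; entries where trainable == 0 stay frozen."""
    return weight - lr * grad * trainable

w = np.ones((4, 5, 3, 3))        # 5 input channels: R, G, B, depth, semantic
g = np.full_like(w, 2.0)         # stand-in gradient
mask = np.zeros_like(w)
mask[:, 3:] = 1.0                # only depth and semantic slices are trainable
w_new = masked_sgd_step(w, g, mask, lr=0.1)
```

In this toy case the RGB slices keep their value of 1.0 while the depth and semantic slices move to 0.8.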
[0061] In this embodiment, first training data, including a first remote sensing image, a first depth map, and a first feature map, is acquired and input into the encoder of a variational autoencoder. Combining the specialized encoding capabilities of the red-green-blue three-channel, depth map channel, and semantic channel, the intrinsic relationship between visual features, terrain elevation information, and the semantics of feature units is fully explored. This design directly compensates for the shortcomings of single-modal schemes that only focus on visual features and cannot capture geospatial relationships, solving the problem of lack of geographic realism and scene consistency in generated remote sensing images from the source, and breaking through the feature expression limitations of single-modal encoding. Subsequently, a unified latent space is used to complete the collaborative encoding of multimodal data, avoiding information fragmentation caused by independent encoding and reducing the training difficulty of the depth map channel and semantic channel by being compatible with pre-trained red-green-blue three-channel parameters. Simultaneously, it solves the problem of latent space distribution drift that is easily caused when directly extending multi-channel architectures in related technologies, eliminating the hidden dangers of model fine-tuning failure or generation collapse, and providing stable technical support for multi-channel expansion. Subsequently, the decoder synchronously outputs data including a second remote sensing image, a second depth map, and a second feature map, ensuring cross-modal consistency of the multimodal output data. This resolves the scenario logic conflict issue that easily occurs when traditional models generate multiple types of data separately, further enhancing the rationality of geospatial association. 
Finally, based on the first training data and the output data, the target training loss is calculated, and the depth map channel, semantic channel, and decoder parameters are updated accordingly. This continuously optimizes the encoding and decoding accuracy of the multimodal data, improves the restoration accuracy of terrain elevation undulation features and feature unit semantics, effectively breaks through the bottleneck of detail richness and accuracy limited by a single latent space, and fully meets the deep needs of precise remote sensing application scenarios.
[0062] In another embodiment, the electronic device may also obtain the first training data based on steps S201-S202 shown in Figure 2, detailed as follows: S201. Acquire the first remote sensing image and generate the first depth map of the first remote sensing image.
[0063] S202. The preset water body information in the first remote sensing image is fused with the first depth map to obtain the first ground feature map.
[0064] In one embodiment, the method of acquiring the first remote sensing image and the first depth map can refer to the example description of S101 above, and will not be explained further.
[0065] The aforementioned preset water body information is high-precision water body benchmark data acquired in advance. Its sources may include publicly available water body datasets, water body boundaries marked on thematic maps, or water body areas pre-determined from the spectral characteristics of remote sensing images. It specifies the reliable spatial location and extent of water bodies and is used to accurately calibrate the water body category in the ground feature map.
[0066] As an example, the electronic device can extract terrain features from the first depth map and preliminarily screen potential ground feature categories based on them (e.g., low-elevation flat areas are marked as suspected water bodies / plains, and mountainous areas are marked as mountains). These terrain features are derived features, extracted from the first depth map, that are strongly correlated with ground feature categories. They can include geomorphic structural features (e.g., flatlands, valleys, depressions) and elevation-related features (e.g., absolute elevation values, relative elevation differences, slope, and the extent of low-elevation flat areas). These terrain features directly determine the spatial distribution logic of ground features (e.g., depressions and low-elevation flat areas are prone to forming water bodies, while mountainous areas are mostly vegetated or bare land). Then, through spatial coordinate alignment, the preset water body information is overlaid on the preliminary classification results. Overlapping areas are directly labeled as water bodies, while non-overlapping areas retain the ground feature categories derived from the terrain features. Finally, for conflicting areas (e.g., terrain classified as mountains but preset as a reservoir), the ground feature categories are corrected based on the preset water body information, ultimately forming pixel-level ground feature classification results.
[0067] In this embodiment, a first depth map is first generated based on the first remote sensing image. Then, terrain features strongly correlated with landform structure and elevation are extracted from the first depth map and fused with the high-precision preset water body information of the first remote sensing image. This approach not only ensures the geospatial plausibility of ground feature classification by leveraging terrain features (such as the correspondence between low-elevation flat areas and water bodies or plains), but also corrects the water body misclassification that is prone to occur with a single data source by using the preset water body information (avoiding classification bias caused by vegetation occlusion and shadows). Finally, a high-quality first ground feature map is generated that is pixel-level aligned with the first remote sensing image and the first depth map. This provides reliable semantic training data for the semantic channel of the variational autoencoder, ensuring the accuracy of semantic encoding and decoding. At the same time, it strengthens the intrinsic correlation and consistency of the multimodal training data, laying the foundation for the subsequent optimization of the semantic consistency loss and for improving the geographic realism of the cross-modal output.
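The fusion procedure in this example can be sketched as follows. This is a simplified illustration with only three classes; the class ids, the elevation and slope thresholds, and the function name `fuse_land_cover` are all hypothetical choices made for the sketch.

```python
import numpy as np

WATER, PLAIN, MOUNTAIN = 0, 1, 2  # hypothetical class ids

def fuse_land_cover(depth, water_mask, low_elev=50.0, max_slope=2.0):
    """Preliminary terrain-based classes, then override with preset water."""
    gy, gx = np.gradient(depth.astype(float))
    slope = np.hypot(gx, gy)                      # simple slope proxy
    # Preliminary screening: low and flat -> plain, otherwise mountain.
    prelim = np.where((depth < low_elev) & (slope < max_slope), PLAIN, MOUNTAIN)
    # Preset water information wins in overlapping and conflicting areas.
    return np.where(water_mask, WATER, prelim)

depth = np.array([[10.0] * 4, [10.0] * 4, [500.0] * 4, [500.0] * 4])
water = np.zeros((4, 4), dtype=bool)
water[0, 0] = True   # water in a low, flat area
water[3, 3] = True   # conflict: terrain says mountain, preset says water
cover = fuse_land_cover(depth, water)
```

Here the pixel at (3, 3) would be classified as mountain from terrain alone, but the preset water information corrects it to water, mirroring the conflict rule described above.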
[0068] In another embodiment, the electronic device may also calculate the target training loss based on steps S301-S305 shown in Figure 3, detailed as follows: S301. Calculate the pixel reconstruction loss based on the first depth map and the second depth map.
[0069] In one embodiment, the aforementioned pixel reconstruction loss is a numerical deviation metric designed for the first depth map and the second depth map. It quantifies the element-wise numerical difference between the two in pixel space and serves as the loss term that constrains the encoding and decoding accuracy of the depth map channel, measuring the accuracy of the reconstructed depth map at the elevation value level.
[0070] The pixel reconstruction loss includes, but is not limited to, L1 loss (mean absolute error, MAE), L2 loss (mean squared error, MSE), and Huber loss (smooth L1 loss), and there are no restrictions on these.
[0071] As an example, an electronic device can perform pixel-level spatial alignment between a first depth map and a second depth map to ensure a one-to-one correspondence between pixels with the same coordinates. Then, the absolute difference in elevation values of corresponding pixels in the two depth maps can be calculated element-by-element, and the average of all absolute differences can be taken to obtain the final L1 loss value.
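The L1 computation in this example can be sketched as follows; the function name `depth_l1_loss` is hypothetical, and the two depth maps are assumed to be already aligned arrays of the same shape.

```python
import numpy as np

def depth_l1_loss(depth_true, depth_pred):
    """Mean absolute elevation difference over aligned pixels."""
    if depth_true.shape != depth_pred.shape:
        raise ValueError("depth maps must be pixel-aligned (same shape)")
    return float(np.mean(np.abs(depth_true - depth_pred)))

d1 = np.array([[1.0, 2.0], [3.0, 4.0]])   # first (ground-truth) depth map
d2 = np.array([[1.0, 3.0], [3.0, 2.0]])   # second (reconstructed) depth map
loss = depth_l1_loss(d1, d2)              # (0 + 1 + 0 + 2) / 4 = 0.75
```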
[0072] It should be noted that by effectively constraining the deviation between the overall elevation amplitude (such as the average elevation of the region) and local details (such as the small height difference between adjacent pixels), the phenomenon of excessive smoothing during the reconstruction process is avoided, thereby improving the accuracy and robustness of the reconstructed depth map at the numerical level.
[0073] It should be noted that by calculating the pixel reconstruction loss, the numerical restoration error of the depth map is converted into a backpropagable gradient signal, providing precise guidance for subsequent parameter updates of the depth map channels and decoder. The larger the loss value, the greater the numerical deviation between the reconstructed second depth map and the first depth map. During parameter updates, adjustments will be prioritized in the direction of reducing this deviation, ultimately ensuring that the depth map channels can accurately encode and decode terrain elevation undulation information, thereby improving the numerical fidelity of the reconstructed depth map.
[0074] S302. Calculate the geomorphic structure fidelity loss based on the first depth map and the second depth map.
[0075] In one embodiment, the aforementioned terrain structure fidelity loss is used to describe the degree of difference in terrain structure features output by the variational autoencoder during training.
[0076] The calculation methods for the above-mentioned geomorphic structure fidelity loss include, but are not limited to, cross-entropy loss, classification accuracy loss, structural similarity loss, etc., and there are no restrictions on these methods.
[0077] It should be noted that by performing pixel-level spatial alignment between the first and second depth maps, extracting terrain structure information from the depth maps, and calculating the structural difference value to obtain the terrain structure fidelity loss, this loss can be incorporated into the target training loss and used to output gradient feedback. This can compensate for the limitation of pixel reconstruction loss, which only constrains elevation values. Furthermore, it avoids the problem of numerical fit but distorted terrain structure, strengthens the terrain structure encoding capability of the depth map channels, ensures the geographic authenticity of the reconstructed depth map, and indirectly improves the cross-modal matching degree between the depth map and the feature map, providing structural constraints for the overall optimization of multi-channel VAEs.
[0078] S303. Calculate the semantic consistency loss based on the first land cover map and the second land cover map.
[0079] In one embodiment, the aforementioned semantic consistency loss is used as a loss function to measure the degree of matching between the first and second land cover maps in terms of land cover unit category distribution and boundary contours. It is used to constrain the encoding and decoding accuracy of the semantic channel of the variational autoencoder, ensuring that the land cover categories (such as water bodies, vegetation, buildings, etc.) of the reconstructed land cover map are consistent with the real situation, while enhancing the clarity of land cover boundaries and improving the synergy of multimodal data at the semantic level.
[0080] As an example, electronic devices may use cross-entropy loss, Dice loss, or a weighted fusion of the two loss methods for calculation, without limitation.
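A weighted fusion of cross-entropy and Dice loss could look like the sketch below. It assumes the semantic branch outputs per-class probabilities of shape (C, H, W) and that the first land cover map is an integer label map; the function name `semantic_loss` and the mixing weight `alpha` are hypothetical.

```python
import numpy as np

def semantic_loss(probs, labels, alpha=0.5, eps=1e-7):
    """alpha * cross-entropy + (1 - alpha) * Dice loss.

    probs:  (C, H, W) class probabilities from the semantic branch.
    labels: (H, W) integer class ids from the ground-truth map.
    """
    num_classes = probs.shape[0]
    onehot = np.eye(num_classes)[labels].transpose(2, 0, 1)   # (C, H, W)
    ce = -np.mean(np.sum(onehot * np.log(probs + eps), axis=0))
    inter = np.sum(probs * onehot, axis=(1, 2))
    denom = np.sum(probs, axis=(1, 2)) + np.sum(onehot, axis=(1, 2))
    dice = 1.0 - np.mean((2.0 * inter + eps) / (denom + eps))
    return alpha * ce + (1.0 - alpha) * dice

labels = np.array([[0, 1], [1, 0]])
perfect = np.eye(2)[labels].transpose(2, 0, 1)   # prediction equals truth
swapped = perfect[::-1]                          # every pixel misclassified
loss_good = semantic_loss(perfect, labels)
loss_bad = semantic_loss(swapped, labels)
```

The cross-entropy term penalizes per-pixel category errors, while the Dice term is sensitive to region overlap, which is one reason the two are often combined.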
[0081] It should be noted that by calculating the semantic consistency loss mentioned above and guiding the semantic channel to accurately learn the feature representation of ground features, it is ensured that the categories and boundaries of the reconstructed ground feature map are highly consistent with the real situation, thereby strengthening the collaborative matching degree of visual-terrain-semantic multimodal data and improving the geographic realism and scene rationality of the model output.
[0082] S304. Calculate the high-frequency detail constraint loss based on the first training data and the output data.
[0083] In one embodiment, the high-frequency detail constraint loss is a fine-grained structure loss designed for the first training data (the RGB first remote sensing image, the first depth map, and the first ground cover map) and the output data (the second remote sensing image, the second depth map, and the second ground cover map). It quantifies the differences between the two in high-frequency information such as texture, edges, and contours, and explicitly constrains the restoration accuracy of fine-grained structure and boundary information during model reconstruction, so as to avoid problems such as image blurring and loss of detail, and to improve the clarity and structural fidelity of the multimodal output data.
[0084] In one embodiment, the calculation method for the above-mentioned high-frequency detail constraint loss includes, but is not limited to, wavelet transform high-frequency subband loss, high-frequency component distribution dispersion loss, etc., and there is no limitation on this method.
[0085] As an example, the electronic device can perform a Fast Fourier Transform (FFT) on each channel of the normalized first training data and output data, filter out the low-frequency components (such as overall brightness and macroscopic terrain trends), and retain the high-frequency components that reflect texture and detail, generating a frequency-domain high-pass component map for each channel. Furthermore, it can apply the Sobel operator (in the x and y directions) to each channel of the first training data and output data to calculate pixel edge gradient values, generating spatial gradient maps that reflect edge contours. Then, for each of the red, green, and blue channels, the depth map channel to be trained, and the semantic channel, it can calculate the MSE loss between the ground-truth and predicted frequency-domain high-pass component maps and the MSE loss between the ground-truth and predicted spatial gradient maps, resulting in two MSE sub-losses (frequency-domain MSE and spatial MSE) for each channel. The two MSE sub-losses of each channel are summed to obtain the high-frequency detail sub-loss for that channel. Finally, a weight is assigned to each channel, and the per-channel sub-losses are weighted and summed to obtain the overall high-frequency detail constraint loss.
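The FFT and Sobel procedure of this example can be sketched as below. This is an illustrative reading only: the radial cutoff of the high-pass filter, the edge padding used for the Sobel operator, the use of gradient magnitude, and all function names are assumptions made for the sketch.

```python
import numpy as np

def highpass_spectrum(img, cutoff=0.1):
    """FFT of one channel with frequencies below `cutoff` set to zero."""
    h, w = img.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    keep = np.hypot(fy, fx) >= cutoff            # radial high-pass mask
    return np.fft.fft2(img, norm="ortho") * keep

def sobel_magnitude(img):
    """Gradient magnitude from 3x3 Sobel kernels in x and y (edge padding)."""
    p = np.pad(img, 1, mode="edge")
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]) \
       - (p[:-2, :-2] + 2 * p[1:-1, :-2] + p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]) \
       - (p[:-2, :-2] + 2 * p[:-2, 1:-1] + p[:-2, 2:])
    return np.hypot(gx, gy)

def channel_hf_loss(truth, pred, cutoff=0.1):
    """Frequency-domain MSE plus spatial-gradient MSE for one channel."""
    diff = highpass_spectrum(truth, cutoff) - highpass_spectrum(pred, cutoff)
    freq_mse = np.mean(np.abs(diff) ** 2)
    grad_mse = np.mean((sobel_magnitude(truth) - sobel_magnitude(pred)) ** 2)
    return freq_mse + grad_mse

def hf_detail_loss(truth, pred, weights):
    """Weighted sum of per-channel sub-losses; inputs are dicts of 2-D arrays."""
    return sum(weights[k] * channel_hf_loss(truth[k], pred[k]) for k in truth)

checker = np.indices((8, 8)).sum(axis=0) % 2.0   # strongly textured test channel
flat = np.zeros_like(checker)                    # over-smoothed reconstruction
loss_hf = hf_detail_loss({"depth": checker}, {"depth": flat}, {"depth": 1.0})
```

In this sketch a constant brightness offset contributes nothing to either term, since the zero-frequency component is filtered out and the Sobel kernels sum to zero, whereas an over-smoothed reconstruction of a textured channel is penalized.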
[0086] It should be noted that by quantifying the restoration deviation of high-frequency details in multimodal data from both the frequency domain (texture) and spatial domain (edges), the limitations of pixel reconstruction loss and terrain structure fidelity loss, which tend to focus on overall numerical / macroscopic structures, can be compensated for, thus solving the problems of overly smooth reconstruction results and blurred textures / edges. Furthermore, by constraining the restoration accuracy of RGB visual textures, depth terrain edges (such as ridgelines and depression boundaries), and semantic feature boundaries (such as water-land boundaries), the consistency of cross-modal details in multimodal data can be enhanced. Moreover, by calculating the high-frequency detail constraint loss using the aforementioned weighting method, fine-grained structural and boundary information can be explicitly constrained, reducing blurring and detail loss during reconstruction and improving the clarity and structural fidelity of the results.
[0087] S305. Calculate the target training loss based on the pixel reconstruction loss, the terrain structure fidelity loss, the semantic consistency loss, and the high-frequency detail constraint loss.
[0088] In one embodiment, the target training loss can be obtained by a weighted summation of the pixel reconstruction loss, the terrain structure fidelity loss, the semantic consistency loss, and the high-frequency detail constraint loss. The weights of the losses can be the same or different; this is not limited.
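Written out, the weighted summation takes the following form. The symbols are introduced here for illustration only and do not appear in the original text: the four loss terms are the pixel reconstruction, terrain structure fidelity, semantic consistency, and high-frequency detail constraint losses, and each λ is a non-negative weight.

```latex
\mathcal{L}_{\text{target}}
  = \lambda_{\text{pix}}\,\mathcal{L}_{\text{pix}}
  + \lambda_{\text{geo}}\,\mathcal{L}_{\text{geo}}
  + \lambda_{\text{sem}}\,\mathcal{L}_{\text{sem}}
  + \lambda_{\text{hf}}\,\mathcal{L}_{\text{hf}}
```

Setting all four weights equal recovers the "same weights" case mentioned above.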
[0089] In this embodiment, by separately calculating the pixel reconstruction loss (constraining the accuracy of elevation numerical level restoration) and topographic structure fidelity loss (quantifying the degree of difference in macroscopic terrain structure) of the first and second depth maps, the semantic consistency loss (ensuring the accuracy of feature category distribution and boundaries) of the first and second feature maps, and the high-frequency detail constraint loss (strengthening the consistency of fine-grained information such as texture and edges) of the first training data and the output data, and then fusing the four types of losses to obtain the target training loss, a multi-dimensional and comprehensive loss supervision system can be formed. This system not only makes up for the limitations of a single loss focusing only on local optimization targets (such as pixel reconstruction loss easily ignoring terrain structure and semantic consistency loss having difficulty constraining detailed textures), but also achieves synergistic optimization of numerical accuracy, macroscopic structure, semantic categories, and fine-grained details. Furthermore, it can simultaneously enhance the terrain encoding capability of the depth map channel, the feature representation capability of the semantic channel, and the multimodal restoration capability of the decoder, ultimately guiding the output second remote sensing image, second depth map, and second feature map of the model to achieve significant improvements in geographic realism, cross-modal consistency, and detail fidelity, meeting the application requirements of precise remote sensing image processing.
[0090] In another embodiment, the electronic device may also calculate the geomorphic structure fidelity loss based on steps S401-S406 shown in Figure 4, detailed as follows: S401. For any center pixel in the target depth map, calculate the elevation difference between the center pixel and each of multiple neighboring pixels.
[0091] Here, the neighboring pixels are the pixels adjacent to the center pixel, and the target depth map is each of the first depth map and the second depth map.
[0092] In one embodiment, the first depth map and the second depth map must complete the above steps and the following two steps independently, so that the elevation difference distribution generated by the two depth maps can be used to compare the consistency of local terrain structure (such as landform classification and neighborhood elevation relationship matching) as a prerequisite for calculating the fidelity loss of landform structure.
[0093] In one embodiment, the aforementioned center pixel is an arbitrarily selected pixel in the target depth map, serving as a reference point for local terrain analysis. In practice, it is necessary to traverse all pixels in the depth map (boundary pixels can be processed as needed, such as filling or removing them), allowing each pixel to be used as the center pixel in turn, in order to achieve comprehensive local terrain feature extraction across the entire map.
[0094] In this context, neighboring pixels are those that are directly adjacent to the center pixel in spatial location. A common selection method is an 8-neighborhood (i.e., pixels above, below, left, right, and four diagonal directions of the center pixel), but a 4-neighborhood (only pixels in the top, bottom, left, and right directions) can also be selected as needed. The range of neighboring pixels determines the scale of local terrain analysis, and an 8-neighborhood is more suitable for extracting micro-topographic features of remote sensing terrain (such as the neighborhood elevation relationships of ridges and depressions).
[0095] The elevation difference mentioned above is the difference between the elevation value of a single neighboring pixel and the elevation value of the center pixel. The calculation formula is: Elevation Difference = Neighboring Pixel Elevation Value - Center Pixel Elevation Value. The elevation difference can be positive or negative. A positive number indicates that the neighboring pixel's elevation is higher than the center pixel's, a negative number indicates that it is lower than the center pixel's, and zero indicates that both have the same elevation. The absolute value and sign of the elevation difference directly reflect the undulation of the local terrain (e.g., a low center and high surrounding areas indicate a depression, while a high center and low surrounding areas indicate a mountain peak).
[0096] It should be noted that for each depth map, each pixel is selected in turn as the center pixel (boundary pixels can be handled by zero-padding, mirror filling, or direct removal to avoid edge effects). A specified range of neighboring pixels (e.g., an 8-neighborhood) is matched to the current center pixel to define the computation set. Then, for each neighboring pixel, its elevation difference from the center pixel is computed and recorded. After the entire map has been traversed, each depth map yields an elevation difference set matrix with the same size as the original image (each center pixel corresponds to a set of neighborhood elevation differences). In this way, the elevation values of the depth map are transformed into structural features that characterize local terrain morphology, providing data support for the subsequent comparison of geomorphic structure differences between the ground-truth and reconstructed depth maps, and for calculating the geomorphic structure fidelity loss.
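The traversal described above can be vectorized. The sketch below assumes an 8-neighborhood, mirror filling at the boundary, and an image layout in which rows grow southward; the function name and the offset table are hypothetical.

```python
import numpy as np

# (row, col) offsets in the order E, SE, S, SW, W, NW, N, NE (rows grow southward).
OFFSETS = [(0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1)]

def neighbor_elevation_diffs(depth):
    """Return an (8, H, W) array: neighbor elevation minus center elevation."""
    depth = depth.astype(float)
    h, w = depth.shape
    p = np.pad(depth, 1, mode="reflect")   # mirror filling for boundary pixels
    diffs = [p[1 + dr:1 + dr + h, 1 + dc:1 + dc + w] - depth
             for dr, dc in OFFSETS]
    return np.stack(diffs)

dem = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])
diffs = neighbor_elevation_diffs(dem)
print(diffs[:, 1, 1])  # differences around the center pixel with elevation 5
```

For the center pixel of this 3x3 example the eight differences are 1, 4, 3, 2, -1, -4, -3, -2 in the order E to NE.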
[0097] S402. Generate neighborhood state coding for multiple neighboring pixels based on elevation difference.
[0098] In one embodiment, the aforementioned neighborhood state encoding transforms the continuous elevation differences between the center pixel and multiple neighboring pixels into structured features of a discrete state identifier sequence through a preset threshold. Each identifier corresponds to the relative elevation relationship (high, low, flat) of the neighboring pixels relative to the center pixel. This is used to transform the continuous numerical differences of local terrain into standardized undulation pattern identifiers, providing matching local feature inputs for subsequent geomorphic structure classification.
[0098] As an example, for each center pixel $p_c$ in the depth map, its eight neighboring pixels $p_i$ in eight directions (E, SE, S, SW, W, NW, N, NE) can be selected, and the elevation difference $\Delta h_i$ between each neighboring pixel and the center pixel can be calculated. A preset threshold, written here as $\tau$, is then set, and the state $s_i$ of each neighboring pixel is divided into the following three categories: if $\Delta h_i > \tau$, the state is 1 and can be marked "high", indicating that the neighboring pixel's elevation is higher than the center pixel's; if $\Delta h_i < -\tau$, the state is 2 and can be marked "low", indicating that the neighboring pixel's elevation is lower than the center pixel's; if $|\Delta h_i| \le \tau$, the state is 0 and can be marked "flat", indicating that the neighboring pixel is close to the center pixel in elevation.
[0100] Then, the electronic device can arrange the states si corresponding to the eight neighboring pixels in the direction E→SE→S→SW→W→NW→N→NE to form a discrete sequence of length 8 (such as 1,0,2,0,1,1,2,0). This sequence is the neighborhood state code of the current center pixel.
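The thresholding into three states can be sketched as follows; the function name and the threshold value used in the demonstration are hypothetical.

```python
import numpy as np

def neighborhood_states(diffs, threshold):
    """Map elevation differences to states: 0 = flat, 1 = high, 2 = low."""
    states = np.zeros(diffs.shape, dtype=np.int64)   # default: flat
    states[diffs > threshold] = 1                    # neighbor clearly higher
    states[diffs < -threshold] = 2                   # neighbor clearly lower
    return states

# Differences for one center pixel in the order E, SE, S, SW, W, NW, N, NE.
dh = np.array([3.0, 0.2, -4.0, 0.0, 1.5, 2.0, -1.1, -0.5])
seq = neighborhood_states(dh, threshold=1.0)
print(seq.tolist())  # [1, 0, 2, 0, 1, 1, 2, 0]
```

The chosen differences reproduce the example sequence 1,0,2,0,1,1,2,0 given above.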
[0101] It should be noted that, for each center pixel in each depth map, multiple neighboring pixels in various directions are matched and the elevation differences are calculated; then a preset threshold is applied to classify the elevation differences into discrete states. Finally, the states from the multiple directions are combined in order into the neighborhood state code. This eliminates the interference of absolute differences in elevation values, focuses on the relative undulations of local terrain, and transforms continuous numerical information into structured, standardized discrete features, providing feature identifiers that can be directly used for pattern matching in subsequent neighborhood-based geomorphological classification.
[0102] S403. Fuse the multiple neighborhood state codes to obtain the landform type code.
[0103] In one embodiment, the aforementioned landform type encoding is a unique numerical identifier obtained by fusing the neighborhood state encodings (discrete 0 / 1 / 2 states) of the central pixel in eight directions. Essentially, it is a numerical and standardized representation of the local terrain undulation pattern in eight directions. Each landform type encoding corresponds to a unique combination of relative elevation relationships in the neighborhood in eight directions.
[0104] The landform type encoding links upward to the neighborhood state encodings in the 8 directions (merging the scattered states into a unified value) and downward to the mapping relationship between codes and landform types for retrieval and matching. Ultimately, it transforms the abstract local terrain pattern into an interpretable macro-landform type (such as flat land, mountain peaks, depressions, etc.).
[0105] As an example, after obtaining the neighborhood state of each neighboring pixel, the landform type code can be calculated using a ternary formula. The ternary formula is as follows: $\mathrm{code} = \sum_{i=0}^{7} s_i \cdot 3^{i}$, where $i$ takes the values 0-7 corresponding to the 8 directions, $s_i$ is the state of the neighboring pixel in direction $i$, and code represents the landform type code.
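A base-3 positional encoding of the eight states can be sketched as follows. The digit order (direction 0 as the least significant digit) is an assumption made for this sketch, and the function name is hypothetical.

```python
def landform_code(states):
    """Combine eight ternary states (0/1/2) into one integer in [0, 6560].

    Assumption: direction i contributes states[i] * 3**i.
    """
    if len(states) != 8 or any(s not in (0, 1, 2) for s in states):
        raise ValueError("expected eight states, each 0, 1, or 2")
    return sum(s * 3 ** i for i, s in enumerate(states))

code = landform_code([1, 0, 2, 0, 1, 1, 2, 0])
print(code)  # 1 + 18 + 81 + 243 + 1458 = 1801
```

With eight ternary digits the code ranges from 0 (all flat) to 6560 (all low), giving 3^8 = 6561 distinct values.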
[0106] S404. Based on the preset mapping relationship between landforms and codes, determine the predicted landform type corresponding to the landform type code.
[0107] In one embodiment, the mapping relationship is a preset correspondence table between landform type codes and 10 landform types. Specifically, the mapping relationship can be set based on the statistical characteristics of the 8-directional states (the numbers of flat / high / low states, i.e., nf, nh, nl) and their spatial distribution patterns (such as opposing distribution and continuous arc distribution), supplemented by heuristic backoff rules, so as to cover all 3^8 = 6561 possible landform type codes (the number of 8-directional state combinations) and ensure that each local terrain state corresponds to one of the 10 landform types.
[0108] The predicted landform type mentioned above refers to the standardized landform category to which the local terrain corresponding to the central pixel belongs, obtained through landform type coding mapping. It is a classification result of the macroscopic morphology of the local terrain, used to uniformly describe the geomorphic structural characteristics of the terrain. The landform type can include 10 categories, such as flat, peak, ridge, shoulder, pit, valley, hollow, slope, spur, and footslope.
[0109] Next, the number of flat (nf), high (nh), and low (nl) states in the 8 directions corresponding to the landform type code can be counted, and the spatial distribution pattern of the state can be identified (e.g., "opposite distribution" means that 2 flat states are located in opposite directions, "continuous arc distribution" means that 5 identical states are continuously arranged on the 8-directional ring), and the landform type determination conditions can be matched (e.g., nf=8 corresponds to "flat land", nl=8 corresponds to "mountain peak"); heuristic backoff rules are executed: if the current landform type code does not meet the explicit determination conditions, it is classified according to the numerical advantage of nh and nl (e.g., nl>nh is determined to be "spur", nh>nl is determined to be "hollow"). Finally, the landform type code is mapped to the corresponding predicted landform type c∈{1,2,...,10} through the mapping relationship.
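The counting and backoff logic can be sketched as below. Only the rules stated explicitly in the text are taken from the source (nf=8, nl=8, and the nl/nh majority backoff); the rule for nh=8 and the tie case are assumptions added so that the function returns a label for every code, and the spatial-pattern rules are omitted entirely. The base-3 digit order is also assumed.

```python
def decode_states(code):
    """Recover the eight ternary states from a landform type code."""
    states = []
    for _ in range(8):
        code, digit = divmod(code, 3)
        states.append(digit)
    return states

def classify_landform(code):
    states = decode_states(code)
    nf, nh, nl = states.count(0), states.count(1), states.count(2)
    if nf == 8:
        return "flat"      # all neighbors level with the center (from the text)
    if nl == 8:
        return "peak"      # all neighbors lower (from the text)
    if nh == 8:
        return "pit"       # ASSUMPTION: all neighbors higher
    # Heuristic backoff by numerical advantage (from the text).
    if nl > nh:
        return "spur"
    if nh > nl:
        return "hollow"
    return "slope"         # ASSUMPTION: tie between high and low counts

print(classify_landform(0), classify_landform(6560))  # flat peak
```

A full mapping would cover all 10 landform types using the spatial distribution patterns as well; this sketch only illustrates how a code is decoded, counted, and passed through explicit rules before the backoff.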
[0110] The mapping relationship can be set according to the actual situation, and there are no restrictions on it.
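As a minimal sketch of the counting and backoff logic described above, the following Python example packs the 8 directional states into a base-3 landform type code and applies only the determination conditions named in this description (nf=8, nl=8, nl>nh, nh>nl). The digit assignment (0 = flat, 1 = high, 2 = low), the function names, and the "slope" fallback for the nh=nl tie are assumptions made for illustration; the spatial distribution patterns and the remaining entries of the correspondence table are not reproduced here.

```python
from collections import Counter

# Assumed ternary digits for the three neighborhood states.
FLAT, HIGH, LOW = 0, 1, 2


def landform_code(states):
    """Pack 8 ternary states into one base-3 integer in [0, 3**8 - 1]."""
    if len(states) != 8:
        raise ValueError("expected exactly 8 directional states")
    code = 0
    for s in states:
        code = code * 3 + s
    return code


def classify(states):
    """Map 8 directional states to a landform name (partial rule set)."""
    counts = Counter(states)
    nf, nh, nl = counts[FLAT], counts[HIGH], counts[LOW]
    # Explicit determination conditions quoted in the description.
    if nf == 8:
        return "flat"
    if nl == 8:
        return "peak"
    # Heuristic backoff: decide by the numerical advantage of nh and nl.
    if nl > nh:
        return "spur"
    if nh > nl:
        return "hollow"
    # Assumption: the description does not state the tie case.
    return "slope"
```

For instance, `classify([LOW] * 8)` falls under the explicit nl=8 condition, while a code with three high states and one low state is resolved by the backoff rule.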
[0111] It should be noted that a unique landform type code is first generated based on the neighborhood state code. Then, by combining the mapping relationship with statistical features, spatial distribution patterns, and backoff rules, the abstract state combination is transformed into 10 interpretable landform types. In this way, discrete neighborhood state sequences are transformed into geographically meaningful landform categories, linking local relative elevation relationships to the macroscopic landform structure. This provides standardized classification labels for subsequent comparison of landform structure differences between the first and second depth maps and for calculating the landform structure fidelity loss.
[0112] S405. Based on the predicted landform type and the actual landform type corresponding to the center pixel, determine the initial landform structure fidelity loss corresponding to the center pixel.
[0113] S406. Calculate the mean of multiple initial geomorphic structure fidelity losses to obtain the geomorphic structure fidelity loss.
[0114] In one embodiment, the aforementioned true landform type is the landform category obtained for each center pixel in the first depth map through a process completely consistent with the "predicted landform type" (neighborhood state encoding → pattern encoding → landform mapping), which serves as the benchmark for predicting the landform type and reflects the true terrain and landform attributes corresponding to the center pixel (e.g., if the true landform is "mountain peak", then this is the true landform type).
[0115] The initial landform structure fidelity loss is a local measure of the degree of difference between the predicted landform type and the actual landform type of a single center pixel, and is the basic unit constituting the overall landform structure fidelity loss.
[0116] As an example, the initial landform structure fidelity loss described above can be determined using a binary matching metric. For instance, if the predicted landform type of the current center pixel is the same as the actual landform type, the initial landform structure fidelity loss corresponding to that center pixel is set to 1 (representing "match"); if the predicted landform type of the center pixel is different from the actual landform type, the initial landform structure fidelity loss is set to 0 (representing "mismatch"). Finally, the average of all initial landform structure fidelity losses can be determined as the landform structure fidelity loss described above.
[0117] It should be noted that the accuracy of the landform classification of a single pixel is aggregated into the overall matching degree of the landform structure of the whole image. The resulting landform structure fidelity loss can quantify the difference in the distribution of macro-landform types between the first depth map and the second depth map, and provide a structural loss signal for updating the model parameters.
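The binary matching metric and its averaging can be sketched as follows. The function name and the flat list layout of the per-pixel landform types are assumptions made for illustration. Note that, exactly as described, a value of 1 denotes a match, so the mean is a match rate; whether a complement such as 1 minus this value is minimized during training is not stated here and is left open.

```python
def landform_structure_score(predicted, actual):
    """Mean of per-pixel binary match values (1 = match, 0 = mismatch)."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    # Initial (per center pixel) value, following the binary matching metric.
    per_pixel = [1.0 if p == a else 0.0 for p, a in zip(predicted, actual)]
    # Aggregate over all center pixels by taking the mean.
    return sum(per_pixel) / len(per_pixel)
```

With four center pixels of which three agree, the aggregated value is 0.75.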
[0118] In this embodiment, by first calculating the elevation difference between neighboring pixels for each center pixel in the first and second depth maps, the local terrain undulation relationship can be captured. Then, the continuous elevation differences are transformed into discrete neighborhood state codes, eliminating absolute numerical interference and focusing on relative terrain patterns. These codes are then fused to generate a unique landform type code, achieving a standardized numerical representation of the local terrain pattern. Subsequently, the predicted landform type is obtained through mapping relationships, transforming the abstract pattern into an interpretable macroscopic landform classification. Finally, based on the predicted landform type and the real landform type corresponding to the center pixel, the initial landform structure fidelity loss corresponding to the center pixel is determined, and the average of multiple initial landform structure fidelity losses is calculated to obtain the landform structure fidelity loss. This not only transforms the elevation values of the depth map into geographically significant landform structure features and quantifies the difference between the ground truth and the reconstructed depth map in the distribution of macro-landform types, thus compensating for the deficiency of relying solely on pixel value loss and easily overlooking the rationality of terrain structure, but also provides the model with loss supervision signals that are more in line with the logic of real terrain. This guides the depth map channel to accurately learn the structural correlation rules of the terrain, improves the landform structure fidelity and geographical realism of the reconstructed depth map, and lays the matching foundation at the terrain structure level for cross-modal collaborative optimization of multimodal data, meeting the application needs of precise remote sensing terrain analysis.
[0119] In another embodiment, reference is made to Figure 5, which is a schematic diagram of the variational autoencoder model structure in a training method for a multi-channel latent space variational autoencoder provided in an embodiment of this application. The input portion (A1, B1, C1 on the left) is the first training data. For example, A1 is a first remote sensing image (adapted to the red, green, and blue channels), B1 is a first depth map (adapted to the depth map channel to be trained), and C1 is a first ground feature map (adapted to the semantic channel to be trained). These three together constitute the multimodal input data for training. The Latent Encoder corresponds to the encoder in the variational autoencoder; it integrates the encoding capabilities of the pre-trained red, green, and blue channels, the depth map channel to be trained, and the semantic channel to be trained, and encodes the first training data composed of A1, B1, and C1 into a first latent space code. The Latent Decoder corresponds to the decoder in the variational autoencoder; it receives the first latent space code and produces the output data (the results on the right corresponding to A1, B1, and C1, for example, the second remote sensing image A2, the second depth map B2, and the second ground feature map C2). Then, the hybrid loss (target training loss) is calculated through the steps shown in Figure 3 above. The entire diagram illustrates the complete process: inputting the first training data → encoder generating the latent space code → decoder outputting data → calculating the target training loss → updating model parameters. When updating model parameters, the model parameters in the decoder, as well as the model parameters in the depth map channel and semantic channel of the encoder, can be updated.
[0120] In another embodiment, the latent space distribution and network structure of mainstream remote sensing image models (e.g., the pre-trained diffusion model of SD3.5) are fixed. If the trained variational autoencoder is directly replaced with the variational autoencoder in the remote sensing image model, a latent space distribution drift problem will occur, which will lead to model fine-tuning failure or generation collapse.
[0121] Specifically, variational autoencoders in remote sensing image models are limited in their ability to collaboratively encode multimodal remote sensing visual features. They also face the problem of latent space distribution drift when remote sensing image models are expanded to multiple channels. Furthermore, there is a lack of alignment strategies that keep multi-channel variational autoencoders "compatible" with pre-trained latent spaces, as well as of methods to inject layout constraints into the Transformer attention of SD3.5. Ultimately, these factors affect the stability and output quality of multimodal remote sensing data processing.
[0122] Based on this, in order to improve the stability and output quality of multimodal remote sensing data processing, one embodiment of this application provides a training method for a remote sensing image model, which can also be applied to an electronic device. This application embodiment does not impose any restrictions on the specific type of electronic device.
[0123] Please see Figure 6, which is a flowchart illustrating the implementation of a training method for a remote sensing image model according to an embodiment of this application. The method includes the following steps: S601. Obtain the second training data.
[0124] The second training data includes global text describing the third remote sensing image, regional text describing the corresponding area of the third remote sensing image, and text describing the land feature units in the third remote sensing image.
[0125] In one embodiment, the third remote sensing image is similar to the first remote sensing image, and will not be described further.
[0126] The global text is a macroscopic description of the overall extent of the third remote sensing image, focusing on the global geographic attributes of the image. For example, the remote sensing image covers a mountainous and canyon area in southwest China, with the overall terrain mainly consisting of medium and high mountains, and includes a perennial river running through the area. The core information is a summary of the global scene type, main topography, and distribution of large-scale land features in the third remote sensing image.
[0127] The aforementioned regional text refers to descriptive text for local sub-regions within the third remote sensing image. These sub-regions can be divided by spatial blocks or by geomorphic units. For example, the image could be divided into five spatial regions: upper left, upper right, lower left, lower right, and center. Each regional text can be a single sentence or a short description of the local area. For instance, the upper left region of the image might be a river valley plain with contiguous farmland, while the lower right region might be a steep hillside with over 85% vegetation coverage. This regional text can provide a detailed description of the terrain features and land cover distribution characteristics of the local area.
[0128] The feature unit text is a description of the attributes of a single feature unit in the third remote sensing image. For example, a body of water in the third remote sensing image may be a seasonal river 50-80 meters wide; buildings may be rural settlements concentrated on the edge of river valleys and plains. This text is used to clarify the type, attributes, spatial details, and other information of specific features.
[0129] The second training data can be generated through methods such as professional manual annotation, cross-dataset matching, and multimodal large models, and there are no restrictions on this.
[0130] It should be noted that by acquiring multi-granular text descriptions of global, regional, and ground features corresponding to the third remote sensing image, the modal dimensions of the training data can be enriched, enabling the remote sensing image model to learn cross-modal associations of remote sensing image-depth-ground features-text, achieving latent spatial alignment between text and multi-channel visual terrain / semantic features, strengthening the model's semantic understanding of remote sensing scenes, and improving the geographic semantic rationality of the output results.
[0131] S602. Input the second training data into the denoising backbone network to generate the second latent space code of the third remote sensing image.
[0132] In one embodiment, the aforementioned denoising backbone network is a generative feature encoding network adapted to multimodal (e.g., text, image) → latent space encoding transformation. Unlike the encoder in a variational autoencoder (VAE), it maps text-type training data (second training data) to latent space features consistent with the first latent space encoding dimension through an iterative process of adding noise and progressively denoising. Simultaneously, it improves the robustness and semantic fit of the encoding through denoising logic, making it the core network module connecting text semantics and the latent space of remote sensing images.
[0133] As an example, a denoising backbone network may include the following network blocks: Time Embedding Block: Injects denoising time-step information into the network, transforming discrete time steps into high-dimensional feature embeddings and controlling the denoising intensity at different iteration stages; Cross-Attention Block: Enables cross-modal fusion of multimodal features (text features, image features) and latent space noise features, capturing semantic information of global / regional / land cover unit text and mapping it to latent space features; ResNet / ConvNeXtBlock: Extracts spatial structure information of latent space features, adapting to the terrain / land cover spatial distribution features of remote sensing image latent space, avoiding loss of spatial correlation in text encoding; Transformer Block: Models long-distance dependencies of text semantics through a self-attention mechanism, ensuring semantic integrity of the encoding; Denoiser Block: The core denoising module, based on time steps and text features, gradually eliminates random noise in the latent space, outputting stable semantic feature encodings. In this embodiment, the number of each network block is not limited.
[0134] As an example, the denoising backbone network can first initialize the noise latent space, generating a random Gaussian noise tensor with the same dimension as the first latent space code (simulating the initial disordered state of the latent space). Then, a random time step is sampled, and the time step t is transformed into a high-dimensional feature embedding through the Time Embedding Block and input into the network. Additionally, the global text, regional text, and feature unit text from the second training data are transformed into text semantic features of a unified dimension using a text encoder (such as BERT). Next, fusion denoising is performed: the noise tensor and the text semantic features are fused through a Cross-Attention Block, allowing the noise tensor to carry text semantic information. The fused features are then processed by the ResNet / Transformer Block to extract spatial-semantic association features, and the noise in the tensor is gradually eliminated by the Denoiser Block based on the embedding information of time step t. Finally, the above steps are repeated over multiple time steps, ultimately outputting a converged and stable numerical feature tensor. This is the second latent space code.
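The control flow of this example (noise initialization, time-step embedding, text-conditioned denoising, iteration over time steps) can be illustrated with a deliberately simplified sketch. Everything below is an assumption made for illustration: the sinusoidal form of the time embedding is a common choice that this description does not specify, and `toy_denoise_step` is only a stand-in for the Cross-Attention, ResNet / Transformer, and Denoiser Blocks, which in practice are learned networks.

```python
import math
import random


def time_embedding(t, dim=8, max_period=10000.0):
    """Sinusoidal embedding of a discrete time step (assumed form)."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def toy_denoise_step(latent, text_feat, t_emb, rate=0.5):
    """Stand-in for the learned blocks: pull the latent toward the text feature.

    t_emb is accepted to mirror the data flow but is unused in this toy step.
    """
    return [z + rate * (c - z) for z, c in zip(latent, text_feat)]


def generate_latent(text_feat, num_steps=10, seed=0):
    """Start from Gaussian noise and iterate the toy denoising step."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in text_feat]
    for t in reversed(range(num_steps)):
        latent = toy_denoise_step(latent, text_feat, time_embedding(t))
    return latent
```

The sketch only shows how the noise tensor, the time-step embedding, and the text features meet in each iteration; it makes no claim about the actual network weights or noise schedule.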
[0135] It should be noted that inputting the second training data (text class) into the denoising backbone network for processing generates a second latent space code with the same dimension as the first latent space code. This leverages the iterative denoising logic of the denoising backbone network to improve the robustness of the text encoding and avoid mapping deviations between text semantics and latent space features. Furthermore, it enables latent space alignment between the text modality and the visual / terrain / semantic modalities, allowing the model to learn the cross-modal association between text descriptions and remote sensing images / depth maps / land cover maps.
[0136] S603. Input the third remote sensing image, the third depth map of the third remote sensing image, and the third land feature map of the third remote sensing image into the encoder in the variational autoencoder to obtain the third latent space code.
[0137] In one embodiment, the encoder in the variational autoencoder has been explained above, and the method for generating the third latent space code is similar to the method for generating the first latent space code, so it will not be explained again.
[0138] It should be noted that the third latent space coding is generated based on the encoder of the variational autoencoder that has been trained. The RGB three channels of the encoder have the ability to stably extract visual features of remote sensing images. The depth map channel and semantic channel have also been trained to extract terrain elevation and semantic features of land features. It can accurately integrate the visual, terrain and semantic features of the third remote sensing image, the third depth map and the third land feature map to generate a unified dimensional feature tensor that can truly reflect the core geographic attributes of the data set.
[0139] Meanwhile, the third latent space code serves as a ground truth reference code for text-driven remote sensing data generation tasks. By comparing its feature similarity with that of the second latent space code generated from the text by the denoising backbone network, the source of error in the text-to-second-latent-space-code link can subsequently be located. The model parameters of the denoising backbone network are then updated through backpropagation, optimizing the cross-modal mapping capability of the remote sensing image model and ensuring that the generated remote sensing-related data not only conforms to the semantics of the text description but also matches the geographical features of the real terrain and land features.
[0140] S604. Update the model parameters of the denoising backbone network according to the second and third latent space codes to obtain the trained remote sensing image model.
[0141] In one embodiment, the electronic device can calculate the feature difference loss between the second and third latent space codes, and then iteratively optimize the network weights using a backpropagation algorithm. For example, updates can be performed using gradient descent-type optimization algorithms, loss function-based backpropagation strategies, etc., which will not be described in detail here.
[0142] It should be noted that the goal of the denoising backbone network is to establish a precise mapping relationship between text semantics and remote sensing latent space features. However, the second latent space code generated by the denoising backbone network in its initial state may deviate significantly from the third latent space code (the ground truth features of the real remote sensing data). For example, the text describes "river valley plain + seasonal river", but the features corresponding to the generated second latent space code are closer to the latent space features of "mountainous area".
[0143] Based on this, the essence of updating the model parameters of the denoising backbone network is to reduce the feature difference between the second and third latent space codes through an iterative process of "loss calculation → gradient backpropagation → weight adjustment". This enables the denoising backbone network to learn the correspondence between text semantics and remote sensing latent space features, and ultimately achieves the goal of generating latent space codes that highly match the features of real remote sensing data when a specific text is input.
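The "loss calculation → gradient backpropagation → weight adjustment" loop can be sketched on a toy problem. The sketch assumes a mean squared error as the feature difference loss and plain gradient descent as the optimizer; neither choice is fixed by this description. The "backbone" here is reduced to a single learnable bias vector added to the text feature, which is purely hypothetical and serves only to make the gradient explicit.

```python
def mse(a, b):
    """Mean squared error between two equally sized feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def train_bias(text_feat, third_code, lr=0.1, steps=200):
    """Fit a bias so that text_feat + bias approaches the reference code."""
    n = len(text_feat)
    bias = [0.0] * n
    history = []
    for _ in range(steps):
        # Toy forward pass producing the "second latent space code".
        second_code = [x + b for x, b in zip(text_feat, bias)]
        # Loss calculation against the third (reference) latent space code.
        history.append(mse(second_code, third_code))
        # Analytic gradient of the MSE with respect to each bias entry.
        grad = [2.0 * (z - t) / n for z, t in zip(second_code, third_code)]
        # Weight adjustment by gradient descent.
        bias = [b - lr * g for b, g in zip(bias, grad)]
    return bias, history
```

The recorded loss decreases monotonically in this toy setting, mirroring the stated goal of reducing the feature difference between the second and third latent space codes.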
[0144] It's important to note that in practical applications, the remote sensing image model is not a single network, but rather a combination of a trained denoising backbone network and a pre-trained variational autoencoder (VAE) decoder. The denoising backbone network handles the cross-modal conversion from text to latent space encoding, while the pre-trained VAE decoder is responsible for decoding and generating multimodal remote sensing data (remote sensing images, depth maps, and feature maps) from latent space encoding. Together, they achieve an end-to-end generation task from text input to data output.
[0145] Specifically, during the model deployment phase, global text (e.g., a hilly area with 60% vegetation coverage), block text (e.g., the upper left area is a reservoir, the lower right area is terraced fields), and feature unit text (e.g., the reservoir area is approximately 5 km², and the terraced fields are distributed in strips) from the actual business scenario can be input into the trained denoising backbone network. This network then generates an actual latent space code that matches the features of the real remote sensing data. Finally, this actual latent space code is input into a pre-trained VAE decoder, which maps the latent space features to the visualized data, ultimately outputting the predicted remote sensing image, the predicted depth map, and the predicted feature map, thus achieving text-driven remote sensing multimodal data generation.
[0146] Based on the above description, the remote sensing image model in this embodiment can be built upon the existing SD3.5 diffusion model, by training the denoising backbone network using a pre-trained variational autoencoder. Furthermore, in practical application, a decoder of the variational autoencoder is also used to complete the latent space encoding decoding output.
[0147] In this embodiment, by acquiring second training data covering global text, regional text, and feature unit text, multi-granularity and highly fitting semantic support can be provided for cross-modal training. This allows the text description to accurately correspond to the global scene attributes, local regional features, and specific feature unit information of the third remote sensing image, providing an accurate semantic foundation for subsequent mapping of text semantics to latent space features. Subsequently, the second training data is input into a denoising backbone network to generate a second latent space code. Based on the diffusion model characteristics of the denoising backbone network, the generated code naturally possesses the distribution characteristics of the latent space of the pre-trained diffusion model, simultaneously achieving a preliminary transformation of text semantics into latent space features. Then, the third remote sensing image, third depth map, and third feature map are input into a pre-trained VAE encoder to generate a third latent space code. Leveraging the encoding capabilities of multi-channel VAE for multimodal remote sensing data, a third latent space code that truly reflects the visual, topographic, and semantic features of the remote sensing data is generated. This code serves as a ground truth reference code, and its latent variable distribution is a stable feature distribution after multi-channel VAE training, providing a reliable benchmark for latent space distribution. Finally, the parameters of the denoising backbone network are updated based on the difference between the second and third latent space encodings. This anchors the latent variable distribution of the multi-channel VAE to the latent space of the remote sensing image model to be trained, achieving stable connection and distribution alignment between the two latent spaces. 
Based on this, not only can training collapse or data distortion caused by latent variable distribution drift be effectively avoided, but the denoising backbone network also learns the accurate mapping relationship between text semantics and the latent space features of real remote sensing data. Consequently, the final generated remote sensing image model possesses both text-driven cross-modal generation capabilities and ensures that the generated remote sensing images, depth maps, and feature maps achieve the expected results in terms of geographic realism and semantic consistency.
[0148] In another embodiment, the electronic device may also obtain the second training data through, for example, steps S701-S702 shown in Figure 7. Details are as follows: S701. Process the third remote sensing image, the third depth map, and the third land cover map respectively to obtain the global information, regional detail information, land cover distribution information, and elevation undulation information of the third remote sensing image.
[0149] In one embodiment, global information is a set of macroscopic attributes extracted from the integrated third remote sensing image, third depth map, and third feature map. It typically covers one or more types of information, such as the overall color appearance, dominant color tone and visual texture of the third remote sensing image, the overall terrain type (e.g., plains, mountains, hills), average elevation and overall undulation trend, the proportion and spatial distribution pattern of the dominant land cover types (e.g., cultivated land, forest land, water bodies) of the overall area.
[0150] The aforementioned regional details are the set of local features for each sub-region after the third remote sensing image, third depth map, and third feature map are divided according to fixed spatial division rules. For example, corresponding to the five preset spatial blocks (upper left, upper right, lower left, lower right, and center), the information for each block includes local visual features, local topographic relief, and local dominant feature types.
[0151] The above-mentioned land feature distribution information is a set of refined semantic features extracted from the third land feature map. It may include one or more types of information such as land feature categories (e.g., farmland, reservoirs, buildings), spatial locations of various land features, combination relationships of adjacent land features, and binary masks corresponding to each type of land feature (i.e., the pixel-level distribution range of land features in the image).
[0152] It should be noted that the above information can be obtained without relying on generative inference of a large model. Instead, it can be obtained by performing color statistics and texture extraction on the third remote sensing image; calculating the mean elevation and standard deviation of undulation of blocks on the third depth map; and performing semantic category statistics and binary mask generation on the third land cover map to obtain global information, regional detail information, land cover distribution information and elevation undulation information.
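A minimal sketch of two of these procedural statistics follows: the per-block mean elevation and undulation standard deviation on a depth map, and the binary masks per category on a land cover map. The block geometry (four quadrants plus a centered window of half the image size), the nested-list data layout, and all function names are assumptions made for illustration; the description only names the five blocks.

```python
from statistics import mean, pstdev


def five_blocks(h, w):
    """Assumed geometry: four quadrants plus a centered half-size window."""
    return {
        "upper_left": (0, h // 2, 0, w // 2),
        "upper_right": (0, h // 2, w // 2, w),
        "lower_left": (h // 2, h, 0, w // 2),
        "lower_right": (h // 2, h, w // 2, w),
        "center": (h // 4, h - h // 4, w // 4, w - w // 4),
    }


def block_elevation_stats(depth):
    """Return {block: (mean elevation, population std of elevation)}."""
    h, w = len(depth), len(depth[0])
    stats = {}
    for name, (r0, r1, c0, c1) in five_blocks(h, w).items():
        values = [depth[r][c] for r in range(r0, r1) for c in range(c0, c1)]
        stats[name] = (mean(values), pstdev(values))
    return stats


def class_masks(label_map):
    """Return one binary mask (same size as the input) per category id."""
    classes = sorted({v for row in label_map for v in row})
    return {
        k: [[1 if v == k else 0 for v in row] for row in label_map]
        for k in classes
    }
```

Both functions run in a single pass over the pixels, which is consistent with the point that no generative inference is needed to obtain this information.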
[0153] S702. Extract global text from global information, regional text from regional detail information, and feature unit text from feature distribution information and elevation undulation information, respectively.
[0154] In one embodiment, the electronic device can predefine the text generation template, field combination logic, and length constraints (such as the number of valid tokens for the CLIP encoder not exceeding 75). Then, based on preset rules, key fields are extracted from the above information, filled into the text template, and global text, regional text, and feature unit text are generated. If necessary, an automatic simplification mechanism is triggered to ensure that the text meets the encoder input requirements.
[0155] For example, five core fields (such as overall terrain type, proportion of dominant features, dominant color, average elevation, and combination of core features) are extracted from the global information. These five fields are then combined into a complete summary using a preset template, and length constraint validation is performed. If the number of tokens exceeds the threshold, secondary fields (such as the proportion of non-core features) are automatically omitted to ensure that the text length is within the encoder's usable range, achieving accurate condensation of global information.
[0156] Furthermore, local feature fields (such as "top left block: mountainous terrain + coniferous forest cover") are extracted from the regional details for each of the five fixed spatial blocks, and an independent local description is generated for each block to strengthen the binding relationship between text and spatial location and meet the requirements of the joint attention mechanism for accurate regional semantics.
[0157] Furthermore, from the distribution information of ground features and the elevation undulation information, for each type of target or each independent entity region in the semantic mask of the third ground feature map, key attribute fields (such as ground feature category, spatial location, and elevation features of the corresponding block) are extracted to generate accurate descriptive text of ground feature attributes + elevation features. At the same time, the text is output in pairs with the binary mask of the corresponding ground feature to achieve fine control over specific ground feature categories or local terrain structures.
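The template filling and automatic simplification described for the global text can be sketched as below. The field representation as (phrase, is_core) pairs, the separator, and the function names are hypothetical. The token count is approximated by whitespace splitting, which is an assumption: a real CLIP tokenizer works on subword units, so an actual implementation would count tokens with the encoder's own tokenizer.

```python
MAX_TOKENS = 75  # length constraint quoted in the description


def count_tokens(text):
    """Crude whitespace proxy for the encoder's token count (assumption)."""
    return len(text.split())


def render(fields):
    """Fill the (hypothetical) template: phrases joined by semicolons."""
    return "; ".join(phrase for phrase, _ in fields) + "."


def build_global_text(fields, max_tokens=MAX_TOKENS):
    """Render fields, dropping secondary (non-core) ones while too long."""
    kept = list(fields)
    text = render(kept)
    while count_tokens(text) > max_tokens:
        # Find the last remaining secondary field, if any.
        idx = next(
            (i for i in range(len(kept) - 1, -1, -1) if not kept[i][1]),
            None,
        )
        if idx is None:
            break  # only core fields remain; leave the text as is
        kept.pop(idx)
        text = render(kept)
    return text
```

The same routine could be reused for regional and feature unit text by supplying the fields of each block or feature entity, although the description does not prescribe a shared implementation.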
[0158] Understandably, generating the aforementioned text using a large model would require multiple time-consuming steps, including data input, model inference, and result verification, with inference speed limited by hardware performance. However, the procedural generation method described above, which directly assembles fields from preprocessed information based on preset rules, eliminates the need for complex calculations and can batch-process text generation tasks from massive amounts of remote sensing data, significantly improving text generation efficiency.
[0159] In this embodiment, by procedurally processing the third remote sensing image, the third depth map, and the third feature map, global information covering image color appearance, overall terrain features, and dominant feature patterns is accurately extracted. This includes regional detail information on local visual, terrain, and feature attributes for multiple areas, as well as feature category, spatial location, binary mask distribution information, and elevation undulation information. Global text, regional text, and feature unit text are then extracted from this information according to preset rules. Furthermore, this method eliminates the need for large-scale model generation, avoiding the semantic distortion problems often associated with large models. It achieves accurate and efficient batch text generation while ensuring high consistency between the generated three types of text and the visual, terrain, and semantic features of the remote sensing multimodal data. This provides high-quality, highly relevant text training data for the denoising backbone network.
[0160] In another embodiment, the second training data also includes a semantic mask map of the third remote sensing image, the denoising backbone network includes an attention layer, and the electronic device can further generate the second latent space code through, for example, steps S801-S803 shown in Figure 8. Details are as follows: S801. Generate the image encoding of the semantic mask map, the global text encoding of the global text, the regional text encoding of the regional text, and the feature unit text encoding of the feature unit text, respectively.
[0161] In one embodiment, the aforementioned semantic mask map is a pixel-level or region-level annotation map for a remote sensing scene (including land feature distribution and terrain structure). It can be drawn by the user or provided externally. It is used to accurately identify the land feature categories and terrain units corresponding to remote sensing images and depth maps. It is the basic annotation data for subsequent extraction of land feature distribution information, generation of land feature unit text, and cross-modal feature encoding.
[0162] The semantic mask map can contain one or more types of semantic masks and labels. For example, a multi-category semantic mask is a pixel-level fine-grained annotation map, where each pixel corresponds to a unique category identifier (distinguished by number, color, or channel value). It can be used for pixel-level division of land cover types and terrain units, and its size is completely consistent with the corresponding remote sensing image and depth map, supporting subsequent pixel-level feature matching and encoding. A label-type semantic mask is a region-level coarse-grained annotation map. It does not require labeling each pixel individually; instead, it divides the remote sensing image into several continuous regions and assigns a category label to each region, resulting in higher annotation efficiency and suitability for macro-regional terrain units or large-scale land cover block division.
[0163] In one embodiment, the image encoding is a spatial semantic feature tensor obtained after feature extraction of a semantic mask image, representing structural information such as the distribution of land cover categories, spatial contour morphology, and adjacency relationships of different land cover categories in the semantic mask, rather than the visual texture information of the original image. Furthermore, the global text encoding is a high-dimensional semantic tensor obtained after semantic feature extraction of the global text, representing the global attributes of the entire remote sensing data (such as global terrain type, dominant land cover pattern, and overall visual features). Additionally, the regional text encoding is a high-dimensional semantic tensor obtained after semantic feature extraction of the regional text, representing the local terrain, land cover, and visual features of each sub-region, with each regional text encoding having a mapping relationship with the location information of the corresponding spatial block. Finally, the land cover unit text encoding is a high-dimensional semantic tensor obtained after semantic feature extraction of the land cover unit text, primarily representing information such as the category attributes, elevation features, and spatial distribution details of a specific land cover.
[0164] In the denoising backbone network, the second training data mentioned above can be encoded through convolutional layers and / or text encoders to obtain the image encoding and the respective text encodings, which will not be described in detail.
[0165] S802. Based on the attention layer, concatenate the image encoding, global text encoding, regional text encoding, and land feature unit text encoding to obtain fused features.
[0166] In one embodiment, the attention layer is a cross-modal attention layer adapted for multimodal feature fusion, used to model the semantic association between different modal codes and dynamically allocate feature attention weights.
[0167] For example, for the input image encoding, global text encoding, region text encoding, and feature unit text encoding, the attention layer can calculate the correlation between various encodings (e.g., the matching degree between the feature unit text encoding and the corresponding feature region in the image encoding, and the correlation degree between the region text encoding and the corresponding spatial block in the image encoding). Then, higher weights are assigned to features more relevant to the remote sensing cross-modal task, and lower weights are assigned to secondary features, thereby achieving the effect of strengthening key features and weakening redundant features.
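The correlation-then-weighting behavior described above can be sketched with standard scaled dot-product attention. This is only one common way to compute such weights, assumed here for illustration; the source does not state the exact formula, and the toy vectors and the function name `attention_weights` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Scaled dot-product attention weights.

    queries: one vector per image position (from the image encoding).
    keys: one vector per text encoding (global, regional, land feature unit).
    Returns one weight row per image position; each row sums to 1.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    weights = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights.append(softmax(scores))
    return weights


# Toy data: two image positions, three text encodings.
image_queries = [[1.0, 0.0], [0.0, 1.0]]
text_keys = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
weights = attention_weights(image_queries, text_keys)
```

In this toy case the first image position gives its largest weight to the first text encoding and the second position to the second, which is the sense in which more relevant features are strengthened and secondary ones weakened.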
[0168] The aforementioned fusion feature is a high-dimensional feature tensor obtained by weighting and fusing image encoding, global text encoding, region text encoding, and land cover unit text encoding through an attention layer. Typically, the dimension of the fusion feature tensor matches the input dimension of the subsequent denoising backbone network or VAE encoder, and can be directly used as input features for downstream tasks.
[0169] In one embodiment, the methods for concatenating the various encodings include, but are not limited to, multi-head cross-modal attention fusion concatenation and attention residual connection concatenation.
[0170] Concatenating the image encoding, global text encoding, regional text encoding, and land feature unit text encoding based on the attention layer to obtain fused features addresses the heterogeneity of features from different modalities (image spatial features and text semantic features), so that the generated fused features not only retain the spatial structure information of the semantic mask map but also integrate the accurate semantic descriptions of the three types of text.
[0171] S803. Denoise the fused features to generate the second latent space code.
[0172] In one embodiment, the denoising method can refer to the example described above with reference to Figure 6 and will not be explained further.
[0173] In this embodiment, a semantic mask map is introduced into the second training data. This is combined with global text, regional text, and feature unit text to generate corresponding image codes and three types of text codes. Then, the attention layer of the denoising backbone network models cross-modal association and concatenates the four types of codes to obtain fused features. This not only integrates the spatial structural features of the semantic mask map with the semantic features of the text codes, compensating for the information limitations of single-modal features, but also strengthens the association of key features and weakens redundant information through the attention mechanism. Subsequently, a second latent space code is generated after denoising processing. This second latent space code possesses both spatial structural consistency and semantic accuracy, better aligning with the feature distribution of the third latent space code, effectively improving the accuracy of cross-modal mapping.
[0174] In another embodiment, the electronic device may also obtain the fused features according to steps S901-S903 shown in Figure 9, as follows: S901. In the attention layer, construct the connection relationships between the global text encoding, the regional text encoding, and the land feature unit text encoding, respectively, and the image encoding to obtain the connection matrix.
[0175] The aforementioned connection matrix is used to constrain the influence range of different texts on the corresponding regions in the third remote sensing image.
[0176] In one embodiment, the connection matrix is a two-dimensional matrix constructed within the attention layer to quantify the correlation strength between global text encoding, regional text encoding, feature unit text encoding, and image encoding. Each element value in the matrix represents the influence weight of the corresponding text encoding on a certain location or semantic region in the image encoding (generally, the larger the value, the stronger the influence; the smaller the value, the weaker the influence). Through the numerical distribution of the matrix elements, the influence range of different texts on corresponding regions in the third remote sensing image can be precisely constrained (for example, ensuring that the text "farmland in the central block" only affects the central region of the image encoding, avoiding invalid cross-regional influence), achieving precise binding between text semantics and image space.
[0177] The connection matrix can be either sparse or dense; there is no limitation on this. The methods for constructing sparse and dense matrices are not described in detail here.
[0178] It should be noted that in the attention layer, spatial association matching is performed between the three types of text encoding—global, regional, and feature unit—and image encoding, respectively. A connection matrix is constructed by calculating association weights to clarify the image regions and their intensity of influence for each type of text, thus avoiding interference from text semantics with irrelevant image regions. Specifically, through the constraints of the connection matrix, the semantic information of each text segment is precisely applied to the preset spatial region, achieving a strong binding between text semantics and image spatial layout. This improves the controllability of spatial layout during the generation of remote sensing images, depth maps, and feature maps, ensuring a high degree of consistency between the spatial structure of the generated result and the regional features described in the text.
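As a non-limiting sketch of step S901, the following code builds a binary connection matrix with one row per image position and one column per text segment. The binary 0/1 weights, the rectangular region boxes, the use of a semantic mask to locate land feature units, and the function name `build_connection_matrix` are all assumptions for illustration; the source allows arbitrary influence weights and both sparse and dense matrices.

```python
def build_connection_matrix(mask, region_boxes, unit_categories):
    """Build a position-by-text connection matrix (assumed layout).

    mask: pixel-level category grid, used to locate land feature units.
    region_boxes: (row0, col0, row1, col1) per regional text, exclusive ends.
    unit_categories: one mask category per land feature unit text.
    Columns are ordered [global text, regional texts..., unit texts...].
    """
    height, width = len(mask), len(mask[0])
    matrix = []
    for r in range(height):
        for c in range(width):
            row = [1.0]  # global text influences every position
            for r0, c0, r1, c1 in region_boxes:
                inside = r0 <= r < r1 and c0 <= c < c1
                row.append(1.0 if inside else 0.0)
            for category in unit_categories:
                row.append(1.0 if mask[r][c] == category else 0.0)
            matrix.append(row)
    return matrix


# 2x2 grid: one regional text covers the top row, one unit text covers category 2.
mask = [[1, 1], [0, 2]]
connection = build_connection_matrix(mask, [(0, 0, 1, 2)], [2])
```

Here the regional text column is nonzero only for the two top-row positions, so that text cannot influence the bottom row, mirroring the "farmland in the central block" example above.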
[0179] S902. Based on the connection matrix, modulate the preset attention distribution to obtain the modulated target attention distribution matrix.
[0180] The aforementioned target attention distribution matrix is used to describe the attention association strength between global text encoding, regional text encoding, and feature unit text encoding and image encoding, respectively.
[0181] In one embodiment, the preset attention distribution is an initial attention weight distribution matrix calculated by the attention layer based solely on the original semantic similarity between text encoding and image encoding before the introduction of connection matrix constraints. Its dimension is identical to the connection matrix, and each element in the matrix represents the initial association strength between a certain text encoding feature and a certain image encoding region. However, this distribution only reflects the semantic matching degree and does not consider the layout constraints of specific spatial regions corresponding to text segments, thus failing to directly achieve spatial layout controllability.
[0182] As an example, an electronic device can perform numerical stabilization on the connection matrix to obtain an attention bias matrix. Then, based on the attention bias matrix, a preset attention distribution is modulated to obtain the modulated target attention distribution matrix.
[0183] Since the connection matrix contains a large number of zero values, directly using it for modulation can easily lead to vanishing gradients and abnormal attention weight distribution. Therefore, numerical stabilization is required first. Numerical stabilization can be achieved by adding a minimal constant to all elements, or by performing Softmax normalization on the matrix elements, etc., and there are no specific limitations on the methods used.
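The two stabilization options named above can be sketched as follows. The constant `1e-6`, the row-wise direction of the Softmax, and the function names are assumptions for this example; the source leaves these choices open.

```python
import math


def add_epsilon(matrix, eps=1e-6):
    """Option 1: add a small constant to every element so no entry is zero."""
    return [[value + eps for value in row] for row in matrix]


def row_softmax(matrix):
    """Option 2: normalize each row with Softmax (row-wise is an assumption)."""
    result = []
    for row in matrix:
        peak = max(row)
        exps = [math.exp(value - peak) for value in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


# Connection matrix with zero entries (rows: image positions, columns: texts).
connection = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
bias_eps = add_epsilon(connection)
bias_soft = row_softmax(connection)
```

Both variants yield strictly positive entries while keeping connected positions larger than unconnected ones, so the spatial constraint is preserved in the resulting attention bias matrix.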
[0184] The attention bias matrix mentioned above is a general term for sparse connection matrices after numerical stabilization. Essentially, it is a bias term matrix with spatial constraints, which retains the core characteristics of the connection matrix.
[0185] In one embodiment, the method of modulating the preset attention distribution based on the attention bias matrix includes, but is not limited to, multiplicative modulation, additive modulation, etc.
[0186] It should be noted that by transforming the spatial layout constraints of the connection matrix into the distribution rules of attention weights to obtain the target attention distribution matrix, the unconstrained semantic associations of the initial attention distribution can be avoided. This, in turn, enables stronger layout controllability, providing an attention-level foundation for the accurate generation of subsequent fused features.
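A minimal sketch of step S902 under stated assumptions: multiplicative modulation is read here as an element-wise product of the preset attention weights with the attention bias matrix followed by row renormalization, and additive modulation as adding the logarithm of the bias to the pre-Softmax scores. Both readings, the toy values, and the function names are illustrative only; the source does not fix the formulas.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def modulate_multiplicative(preset, bias):
    """Scale preset attention weights by the bias, then renormalize each row."""
    target = []
    for p_row, b_row in zip(preset, bias):
        product = [p * b for p, b in zip(p_row, b_row)]
        total = sum(product)
        target.append([v / total for v in product])
    return target


def modulate_additive(scores, bias):
    """Add log(bias) to the raw scores before the Softmax."""
    return [
        softmax([s + math.log(b) for s, b in zip(s_row, b_row)])
        for s_row, b_row in zip(scores, bias)
    ]


# Raw similarity scores (rows: image positions, columns: text encodings).
scores = [[0.2, 1.5, 0.3], [1.0, 0.1, 2.0]]
# Stabilized bias: 1e-6 marks a text that should not reach that position.
bias = [[1.0, 1e-6, 1.0], [1.0, 1.0, 1e-6]]

preset = [softmax(row) for row in scores]
target_mul = modulate_multiplicative(preset, bias)
target_add = modulate_additive(scores, bias)
```

With these particular definitions the two variants give the same target attention distribution, since a Softmax over scores plus log-bias equals the renormalized product of the Softmax weights and the bias. In the first row, the second text had the highest semantic score but is suppressed to a negligible weight by the spatial constraint.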
[0187] S903. Based on the target attention distribution matrix, image encoding, global text encoding, regional text encoding, and land feature unit text encoding are combined to obtain fused features.
[0188] In one embodiment, the electronic device can assign dynamic attention weights to the four types of codes based on the attention association strength of the global text code, regional text code, feature unit text code, and image code quantified in the target attention distribution matrix (multiplying the weight value at the corresponding position in the matrix with each code feature). Then, the global text code is assigned a uniform and stable weight across the entire effective area of the image. The weighted four types of codes are then dimensionally concatenated to form a unified fusion feature. Furthermore, through this concatenation method, not only can the spatial structural features of the image code and the semantic features of the three types of text codes be integrated, but the precise control of attention weights also strengthens the association between text semantics and the corresponding spatial region of the image, avoiding irrelevant text from ineffectively interfering with the image region, thus achieving precise spatial layout of the remote sensing image.
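The weighting and dimensional concatenation described above might be sketched as follows. The per-position layout, the implicit weight of 1 on the image features, the toy values, and the function name `fuse` are assumptions for illustration and are not stated in the source.

```python
def fuse(image_feats, text_feats, target_attention):
    """Weight each text encoding per image position, then concatenate.

    image_feats: one feature vector per image position.
    text_feats: one vector per text encoding (global, regional, unit).
    target_attention: one weight row per image position, one weight per text.
    Image features are kept unscaled here (an assumption).
    """
    fused = []
    for image_vec, weights in zip(image_feats, target_attention):
        row = list(image_vec)
        for weight, text_vec in zip(weights, text_feats):
            row.extend(weight * x for x in text_vec)
        fused.append(row)
    return fused


image_feats = [[1.0, 2.0], [3.0, 4.0]]
text_feats = [[1.0, 1.0], [2.0, 0.0], [0.0, 4.0]]  # global, regional, unit
# The global text receives the same weight at every position.
target_attention = [[0.5, 0.5, 0.0], [0.5, 0.0, 0.5]]
fused = fuse(image_feats, text_feats, target_attention)
```

Each fused row has the image dimension plus three times the text dimension, and a text with zero weight at a position contributes only zeros there, so it cannot interfere with that image region.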
[0189] In this embodiment, connection matrices are constructed for three types of text encoding (global, regional, and feature unit) and image encoding in the attention layer. These connection matrices constrain the influence range of regional and feature unit text on specific image regions, avoiding invalid cross-regional interference. Then, a preset attention distribution is modulated using the connection matrices, injecting spatial constraints into the attention weight allocation process to obtain a target attention distribution matrix that combines semantic matching and spatial orientation. Finally, the four types of encoding are weighted and concatenated based on the target attention distribution matrix to generate fused features. This not only integrates the spatial structural features of image encoding with the semantic features of text encoding but also achieves precise binding between text semantics and image spatial regions, improving the effectiveness and relevance of the fused features. This makes the subsequently generated second latent space code more closely match the feature distribution of real remote sensing data, thereby enhancing the layout controllability, geographic realism, and semantic consistency of remote sensing images, depth maps, and feature maps generated by the remote sensing image model.
[0190] In another embodiment, reference is made to Figure 10, which is a schematic diagram of the model structure during the training phase of a remote sensing image model training method provided in one embodiment of this application. The Latent Encoder in the figure is the encoder of the aforementioned variational autoencoder, and the N×DIT Block + denoising module is the core execution component of the aforementioned denoising backbone network. During the training phase, the third remote sensing image A3, the third depth map B3, and the third land cover map C3 can be acquired first, together with the second training data: global text, regional text, and land cover unit text (corresponding to the three text prompts in the figure). In another embodiment, a semantic mask map of the third remote sensing image, corresponding to the "downsampled" image in the figure, can also be acquired and used in model training at the same time.
[0191] During training, the electronic device can input the second training data, the semantic mask map, the diffusion time step (timestep), and random noise into the core component N×DIT Block of the denoising backbone network to generate the second latent space code Z2. Additionally, the third remote sensing image A3, the third depth map B3, and the third land cover map C3 are input into the latent encoder of the variational autoencoder to generate the third latent space code Z1. Finally, Z1 is input into the denoising backbone network to update its model parameters in conjunction with Z2, resulting in the trained remote sensing image model.
[0192] Here, the attention map modulation diagram in Figure 10 is a visual grid matrix representing the working logic of the attention layer within the DIT Block. It consists of three layers: a semantic mask map, local text (regional text and feature unit text), and global text. Yellow squares represent cross-attention between text and image, while orange squares represent self-attention within the image itself. In the semantic mask map layer, orange self-attention is dominant, ensuring the continuity of the image's spatial structure (if no semantic mask exists during training or application, this part can be filled with 0 or other values so that it still participates in processing). The yellow cross-attention in the local text layer focuses on specific areas, reflecting the constraint that local text only affects the corresponding image region. The yellow cross-attention in the global text layer covers most areas, corresponding to the rule that global text affects the entire image. This diagram visually presents the constraints on the influence range of text on image regions, achieving precise binding between text semantics and image spatial regions and providing attention-level layout control for the subsequent generation of latent space encoding that fits the text instructions.
[0193] In another embodiment, reference is made to Figure 11, which is a schematic diagram of the model structure in the application stage of a remote sensing image model training method provided in an embodiment of this application. The three text prompts in the figure correspond to the global text, regional text, and land cover unit text input by the user in the actual scene. The downsampled image is a semantic mask map drawn by the user, and its input is optional. The timestep and Noise are the standard inputs of the denoising backbone network in the application stage. After receiving the above inputs, the N×DIT Block can complete the cross-modal conversion from text semantics to the actual latent space encoding and generate, through denoising, a stable actual latent space encoding that fits the text semantics. The Latent Decoder corresponds to the decoder in the pre-trained variational autoencoder, which receives the actual latent space encoding output by the DIT Block and decodes it. A4, B4, and C4 are the results output by the decoder, corresponding to the predicted remote sensing image (A4), the predicted depth map (B4), and the predicted land cover map (C4), respectively; that is, the multimodal remote sensing data generated from text in actual use.
[0194] Figure 12 is a schematic diagram of the structure of an electronic device provided in one embodiment of this application. As shown in Figure 12, the electronic device 1200 of this embodiment includes: a processor 1210, a memory 1220, and a computer program 1230 stored in the memory 1220 and executable on the processor 1210, such as a program for training a multi-channel latent space variational autoencoder and / or a remote sensing image model. When the processor 1210 executes the computer program 1230, it implements the steps of each embodiment of the above-described multi-channel latent space variational autoencoder training method and / or remote sensing image model training method, for example, S101 to S105 shown in Figure 1, or S601-S604 shown in Figure 6.
[0195] For example, the computer program 1230 can be divided into one or more modules, one or more of which are stored in the memory 1220 and executed by the processor 1210 to implement the training method for the multi-channel latent space variational autoencoder and / or the training method for the remote sensing image model provided in the embodiments of this application. One or more modules can be a series of computer program instruction segments capable of performing specific functions, which describe the execution process of the computer program 1230 in the electronic device 1200. For example, the computer program 1230 can implement the training method for the multi-channel latent space variational autoencoder and / or the training method for the remote sensing image model provided in the embodiments of this application.
[0196] The electronic device 1200 may include, but is not limited to, the processor 1210 and the memory 1220. Those skilled in the art will understand that Figure 12 is merely an example of the electronic device 1200 and does not constitute a limitation on the electronic device 1200, which may include more or fewer components than shown, combine certain components, or use different components. For example, the electronic device may also include input / output devices, network access devices, buses, etc.
[0197] The processor 1210 may be a central processing unit, or it may be another general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor or any conventional processor, etc.
[0198] The memory 1220 can be an internal storage unit of the electronic device 1200, such as a hard disk or memory of the electronic device 1200. The memory 1220 can also be an external storage device of the electronic device 1200, such as a plug-in hard disk, smart memory card, flash memory card, etc. equipped on the electronic device 1200. Furthermore, the memory 1220 can include both internal storage units and external storage devices of the electronic device 1200.
[0199] This application provides a computer-readable storage medium, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the training method of the variational autoencoder of the multi-channel latent space and / or the training method of the remote sensing image model as described in the above embodiments.
[0200] This application provides a computer program product that, when run on an electronic device, causes the electronic device to execute the training method of the multi-channel latent space variational autoencoder and / or the training method of the remote sensing image model in the above embodiments.
[0201] The above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application, and should all be included within the protection scope of this application.
Claims
1. A training method for a variational autoencoder with a multi-channel latent space, characterized in that, The method includes: Acquire first training data; the first training data includes a first remote sensing image, a first depth map of the first remote sensing image, and a first land feature map of the first remote sensing image; the first depth map is used to describe the terrain elevation undulation information of each pixel in the first remote sensing image; the first land feature map is used to describe the land feature units present in each pixel of the first remote sensing image. The first training data is input into the encoder of the variational autoencoder to obtain the first latent space code. The variational autoencoder includes a pre-trained red-green-blue three-channel, a depth map channel to be trained, and a semantic channel. The red-green-blue three-channel is used to encode and decode the three-channel image features of the first remote sensing image. The depth map channel is used to encode and decode the terrain elevation relief information in the first depth map. The semantic channel is used to encode and decode the land cover units in the first land cover map. The first latent space code is generated by the encoder after encoding the first training data based on the encoding capabilities of the red-green-blue three-channel, the depth map channel, and the semantic channel. The first latent space code is input into the decoder in the variational autoencoder to obtain output data; the output data includes a second remote sensing image, a second depth map, and a second ground feature map. Calculate the target training loss for the first training data and the output data; The depth map channel, the semantic channel, and the model parameters in the decoder are updated based on the target training loss to obtain the updated variational autoencoder.
2. The method according to claim 1, characterized in that, The process of obtaining the first training data further includes: Acquire a first remote sensing image and generate the first depth map of the first remote sensing image; The preset water body information in the first remote sensing image is fused with the first depth map to obtain the first ground feature map.
3. The method according to claim 1, characterized in that, The method further includes: Calculate the average weight of the preset initial learning weights in the red, green and blue channels; The average weights are set as the initial learning weights for the depth map channel and the semantic channel, respectively.
4. The method according to any one of claims 1-3, characterized in that, The calculation of the target training loss for the first training data and the output data includes: Calculate pixel reconstruction loss based on the first depth map and the second depth map; The terrain structure fidelity loss is calculated based on the first depth map and the second depth map; the terrain structure fidelity loss is used to describe the degree of difference in the terrain structure features output by the variational autoencoder during training. Calculate semantic consistency loss based on the first and second feature maps; Calculate the high-frequency detail constraint loss based on the first training data and the output data; The target training loss is calculated based on the pixel reconstruction loss, the terrain structure fidelity loss, the semantic consistency loss, and the high-frequency detail constraint loss.
5. The method according to claim 4, characterized in that, The calculation of terrain structure fidelity loss based on the first depth map and the second depth map includes: For any central pixel in the target depth map, calculate the elevation differences between the central pixel and multiple neighboring pixels; the neighboring pixels are the pixels adjacent to the central pixel; the target depth map is the first depth map and the second depth map. Based on the elevation differences, neighborhood state codes are generated for the multiple neighboring pixels; The landform type code is obtained by integrating the multiple neighborhood state codes; Based on the preset mapping relationship between landforms and codes, the predicted landform type corresponding to the landform type code is determined; Based on the predicted landform type and the actual landform type corresponding to the central pixel, determine the initial terrain structure fidelity loss corresponding to the central pixel; The average value of the initial terrain structure fidelity loss is calculated to obtain the terrain structure fidelity loss.
6. A training method for a remote sensing image model, characterized in that, The remote sensing image model includes a denoising backbone network and a variational autoencoder as described in any one of claims 1-5; the method includes: Acquire second training data; the second training data includes global text describing the third remote sensing image, regional text describing the corresponding area of the third remote sensing image, and text describing the land feature units in the third remote sensing image. The second training data is input into the denoising backbone network to generate the second latent space code of the third remote sensing image; The third remote sensing image, the third depth map of the third remote sensing image, and the third land feature map of the third remote sensing image are input into the encoder in the variational autoencoder to obtain the third latent space code; The model parameters of the denoising backbone network are updated based on the second latent space coding and the third latent space coding to obtain the trained remote sensing image model.
7. The method according to claim 6, characterized in that, The acquisition of the second training data includes: The third remote sensing image, the third depth map, and the third land cover map are processed respectively to obtain the global information, regional detail information, land cover distribution information, and elevation relief information of the third remote sensing image; The global text is extracted from the global information, the regional text is extracted from the regional detail information, and the feature unit text is extracted from the feature distribution information and the elevation relief information.
8. The method according to claim 6 or 7, characterized in that, The second training data also includes a semantic mask map of the third remote sensing image. The denoising backbone network includes an attention layer. The step of inputting the second training data into the denoising backbone network to generate a second latent space code for the third remote sensing image includes: The image encoding of the semantic mask image, the global text encoding of the global text, the regional text encoding of the regional text, and the feature unit text encoding of the feature unit text are generated respectively. Based on the attention layer, the image encoding, the global text encoding, the region text encoding, and the land feature unit text encoding are concatenated to obtain the fused feature; The fused features are denoised to generate the second latent space code.
9. The method according to claim 8, characterized in that, The process of concatenating the image encoding, the global text encoding, the region text encoding, and the land feature unit text encoding based on the attention layer to obtain fused features includes: In the attention layer, the connection relationships between the global text encoding, the regional text encoding, the land feature unit text encoding, and the image encoding are constructed respectively to obtain a connection matrix; the connection matrix is used to constrain the influence range of different texts on the corresponding regions in the third remote sensing image. Based on the connection matrix, a preset attention distribution is modulated to obtain a modulated target attention distribution matrix; the target attention distribution matrix is used to describe the attention association strength between the global text encoding, the regional text encoding, the feature unit text encoding, and the image encoding, respectively. The fused feature is obtained by concatenating the image encoding, the global text encoding, the regional text encoding, and the ground feature unit text encoding based on the target attention distribution matrix.
10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the computer program, it implements the method as described in any one of claims 1 to 9.