A terrain generation method, device, storage medium and program product

By combining DEM and RGB images, and based on terrain structure sketches and text control information, this method solves the problems of low efficiency and insufficient accuracy in existing terrain generation technologies, achieving efficient and accurate terrain generation and improving the realism and consistency of the terrain.

CN122244362A · Pending · Publication Date: 2026-06-19 · BEIJING FORESTRY UNIVERSITY

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: BEIJING FORESTRY UNIVERSITY
Filing Date: 2026-05-15
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing 3D terrain generation methods are inefficient, and error propagation along the serial pipeline limits generation accuracy and realism, making it difficult to guarantee structural and visual consistency between the DEM and the RGB image.

Method used

A terrain structure sketch and text control information are acquired, latent elevation and texture features are extracted separately from the sketch, and a generative network jointly generates the target elevation and texture results. Spatial structure constraints come from the shared sketch, and visual style constraints come from the text control information, which avoids the error propagation of serial processing.

Benefits of technology

It improves the efficiency and quality of terrain generation, ensures spatial and visual style consistency between DEM and RGB images, and generates target terrain with controllable visual style and high precision.



Abstract

This application provides a terrain generation method, apparatus, storage medium, and program product. The method includes: acquiring a terrain structure sketch and text control information; determining a first latent feature and a second latent feature corresponding to the terrain structure sketch, where the first latent feature characterizes the elevation information of the terrain structure sketch, and the second latent feature characterizes the texture information of the terrain structure sketch; processing the first and second latent features based on the text control information to obtain target elevation features and target texture features matching the text control information; decoding the target elevation features and target texture features respectively to obtain target elevation results and target texture results; and generating target terrain based on the target elevation results and target texture results. By collaboratively controlling the joint generation of the target elevation results and target texture results using the terrain structure sketch and text control information, the efficiency, realism, and overall consistency of terrain generation are improved.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to a terrain generation method, device, storage medium, and program product.

Background Technology

[0002] 3D virtual terrain has wide applications in fields such as games, film and television, virtual reality, scientific research, education and military. Its realism directly affects the immersive experience and is also an important foundation for constructing complex scenes, scientific simulations and spatial displays.

[0003] As application scenarios continue to expand, higher demands are being placed on the accuracy and realism of terrain generation. Currently, 3D terrain models are typically produced by jointly rendering a Digital Elevation Model (DEM) and an RGB (red, green, blue) image. The DEM depicts the continuous variation of elevation, while the RGB image presents color and texture; the two describe the same geographic entity in its geometric and visual dimensions, respectively. Take a mountain as an example: the DEM records the altitude (1000 meters at the summit, 200 meters at the foot), and the RGB image shows snow on the summit and grass at the foot. Only when the two are strictly aligned can the natural law that "snow only accumulates at high altitudes" be expressed accurately. However, existing methods mostly employ a sequential generation strategy: first, the DEM is acquired to determine the terrain's geometric skeleton; then, the corresponding RGB image is generated from the DEM; finally, the RGB image is mapped as a texture onto the 3D mesh surface constructed from the DEM. This serial approach is not only inefficient overall, but also makes each step highly dependent on the accuracy of the preceding one. Any error introduced into the DEM during acquisition or interpolation propagates down the processing chain, misaligning the RGB image with the terrain geometry and ultimately reducing the realism and consistency of the terrain model.

Summary of the Invention

[0004] This application provides a terrain generation method, device, storage medium, and program product to improve the efficiency and quality of terrain generation.

[0005] In a first aspect, embodiments of this application provide a terrain generation method, including: acquiring a terrain structure sketch and text control information, wherein the text control information characterizes the style of the target terrain; determining a first latent feature and a second latent feature corresponding to the terrain structure sketch, wherein the first latent feature characterizes the elevation information of the terrain structure sketch and the second latent feature characterizes the texture information of the terrain structure sketch; processing the first latent feature and the second latent feature based on the text control information to obtain target elevation features and target texture features that match the text control information; decoding the target elevation features and the target texture features respectively to obtain a target elevation result and a target texture result; and generating, based on the target elevation result and the target texture result, a target terrain corresponding to the terrain structure sketch and the text control information.
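The data flow of the five steps can be sketched as follows. This is only an illustrative outline, not the application's implementation: the `TerrainModel` container, its field names, and `generate_terrain` are hypothetical, and the actual encoders, generative network, and decoders are supplied by the caller.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class TerrainModel:
    """Hypothetical container for the components named in the method."""
    encode_elevation: Callable[[Any], Any]    # sketch -> first latent feature
    encode_texture: Callable[[Any], Any]      # sketch -> second latent feature
    generate: Callable[[Any, Any, str], Tuple[Any, Any]]  # latents + text -> target features
    decode_elevation: Callable[[Any], Any]    # target elevation features -> DEM
    decode_texture: Callable[[Any], Any]      # target texture features -> RGB image
    build_terrain: Callable[[Any, Any], Any]  # DEM + RGB image -> target terrain


def generate_terrain(model: TerrainModel, sketch: Any, text: str) -> Any:
    # Both latent features are derived from the same sketch.
    z_elev = model.encode_elevation(sketch)
    z_tex = model.encode_texture(sketch)
    # The two latents are processed together under the text condition.
    f_elev, f_tex = model.generate(z_elev, z_tex, text)
    # Each target feature is decoded by its own decoder.
    dem = model.decode_elevation(f_elev)
    rgb = model.decode_texture(f_tex)
    # The target terrain is assembled from the two results.
    return model.build_terrain(dem, rgb)
```

Passing the components in as callables keeps the outline independent of any particular network architecture; neither decoded result is computed from the other.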

[0006] In a second aspect, embodiments of this application provide a terrain generation apparatus, comprising: an acquisition module, configured to acquire a terrain structure sketch and text control information, wherein the text control information characterizes the style of the target terrain; a determination module, configured to determine a first latent feature and a second latent feature corresponding to the terrain structure sketch, wherein the first latent feature characterizes the elevation information of the terrain structure sketch and the second latent feature characterizes the texture information of the terrain structure sketch; a processing module, configured to process the first latent feature and the second latent feature based on the text control information to obtain target elevation features and target texture features that match the text control information; a decoding module, configured to decode the target elevation features and the target texture features respectively to obtain a target elevation result and a target texture result; and a generation module, configured to generate, based on the target elevation result and the target texture result, a target terrain corresponding to the terrain structure sketch and the text control information.

[0007] In a third aspect, embodiments of this application provide an electronic device, including: a memory, a processor, and a communication interface; wherein the memory stores executable code, and when the executable code is executed by the processor, the processor performs the method described in the first aspect.

[0008] In a fourth aspect, embodiments of this application provide a non-transitory machine-readable storage medium storing executable code which, when executed by a processor of an electronic device, enables the processor to at least implement the method described in the first aspect.

[0009] In a fifth aspect, embodiments of this application provide a computer program product including a computer program which, when executed by a processor, implements the method described in the first aspect.

[0010] In the terrain generation scheme provided in the embodiments of this application, the generation of the target elevation result and the target texture result is collaboratively controlled by the terrain structure sketch and the text control information, and the target terrain is constructed from the jointly generated target elevation result and target texture result. Specifically, a terrain structure sketch and text control information are acquired, and a first latent feature characterizing the elevation information of the sketch and a second latent feature characterizing its texture information are generated from the sketch. The first and second latent features are then processed based on the text control information to obtain target elevation features and target texture features that match the text control information. Subsequently, the target elevation features and target texture features are decoded respectively to obtain the target elevation result and the target texture result, from which a target terrain corresponding to the terrain structure sketch and the text control information is generated. The target terrain is thus generated under the joint control of sketch and text. Furthermore, because the target elevation result and the target texture result are generated jointly and neither is derived from the other, the scheme overcomes the shortcomings of related technologies that generate RGB images from a DEM, namely low generation efficiency and the low accuracy and poor realism caused by errors propagating along the generation chain. In the generated target terrain, the geometric structure of the elevation and texture results is constrained by the terrain structure sketch, and the visual style of both is constrained by the text control information. This improves the efficiency and quality of terrain generation and ensures structural consistency between elevation and texture.

Attached Figure Description

[0011] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0012] Figure 1 is a flowchart of a terrain generation method provided in an embodiment of this application;
Figure 2 is a flowchart of a terrain generation method provided in an embodiment of this application;
Figure 3 is a schematic diagram of sliding-window overlay sampling provided in an embodiment of this application;
Figure 4 is a flowchart of a terrain generation method provided in an embodiment of this application;
Figure 5 is a schematic diagram of overlapping local control information regions provided in an embodiment of this application;
Figure 6 is a schematic diagram of the principle of a terrain generation method provided in an embodiment of this application;
Figure 7 is a flowchart of a terrain generation model training method provided in an embodiment of this application;
Figure 8 is a schematic diagram of the principle of a terrain generation discriminator provided in an embodiment of this application;
Figure 9 is a schematic diagram of the structure of a terrain generation device provided in an embodiment of this application;
Figure 10 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0013] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application. In addition, the order of the steps in the following method embodiments is only an example and not a strict limitation.

[0014] It should be noted that, in the cases involving user information in the embodiments of this application, the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, stored data, displayed data, etc.) involved in the embodiments of this application are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use, and processing of related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions, and corresponding operation entry points are provided for users to choose to authorize or refuse. In addition, the various models involved in this application (including but not limited to problem analysis models or other models) comply with relevant laws and standards.

[0015] The following explains the terms or concepts involved in the embodiments of this application: Digital Elevation Model (DEM): A type of terrain surface elevation data stored in the form of a raster matrix. Each raster cell records the ground elevation at that location, and is used to quantitatively describe the three-dimensional undulations and spatial distribution characteristics of the terrain.

[0016] Red-Green-Blue Image (RGB Image): In this application, it specifically refers to a surface texture image that is consistent with the spatial extent of the digital elevation model. This image is synthesized using the red, green, and blue color channels and is used to characterize the color, texture, and surface cover type (such as vegetation, water bodies, bare soil, snow, etc.) of ground features, reflecting the visual appearance information of the ground surface.

[0017] Existing terrain generation methods can be divided into two categories: one is the traditional method represented by procedural modeling, physical erosion simulation, sample reuse and sketch control, and the other is the deep learning method represented by generative adversarial networks and diffusion models.

[0018] Traditional terrain generation methods, such as procedural modeling, physical erosion simulation, sample reuse, and sketch control, primarily aim to generate a DEM, which is the geometric skeleton of the terrain. These methods typically output the DEM directly, and RGB images are not part of the generation process. When RGB images are needed, pseudo-color can only be generated through elevation-based rule mapping (such as assigning white to high altitudes), or textures can be extracted from external data sources and manually aligned, making it difficult to guarantee the efficiency and quality of terrain generation.
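The elevation-based rule mapping mentioned above can be illustrated with a minimal sketch. The function name, the snow-line threshold, and the two colors are assumptions chosen for illustration; the elevations reuse the mountain example from the background section.

```python
def pseudo_color(dem, snow_line):
    """Assign a fixed color to each DEM cell from its elevation alone."""
    white = (255, 255, 255)  # illustrative color for cells at or above the snow line
    green = (34, 139, 34)    # illustrative color for cells below it
    return [[white if h >= snow_line else green for h in row] for row in dem]


# 2x2 DEM in meters: one summit cell at 1000 m, the rest lower.
dem = [[1000, 200],
       [600, 100]]
colors = pseudo_color(dem, snow_line=800)
```

Because the color depends only on elevation, such a rule cannot express land cover that varies independently of height, which is the limitation this paragraph describes.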

[0019] Deep learning methods, represented by generative adversarial networks (GANs) and diffusion models, primarily achieve one-way translation or single-modal conditional generation, such as generating RGB images from sketches or DEMs. When users need to generate terrain based on DEMs and RGB images, they can only run two independent models sequentially: first generate the DEM and then translate it into an RGB image, or vice versa. This sequential approach makes it difficult to precisely align the outputs of the two models, and errors from the previous step are propagated to the next, reducing the realism and consistency of the terrain model.

[0020] Both traditional and deep learning methods follow a sequential generation strategy when generating 3D terrain based on DEM and RGB images: either geometry first, then texture, or texture first, then geometry. This sequential approach is inefficient, and it's difficult to guarantee structural consistency between the independently generated and superimposed images. Therefore, how to improve generation efficiency while ensuring the quality of the generated DEM and RGB images and their structural consistency has become a pressing technical problem.

[0021] This application provides a terrain generation method that addresses the aforementioned problems by jointly generating a DEM and RGB image of the target terrain based on the same terrain structure sketch. Since both share spatial structure information from the same sketch, strict geometric alignment between the DEM and RGB images is ensured, thus achieving spatial consistency constraints. Furthermore, the joint generation of DEM and RGB images avoids error propagation in traditional serial processing, effectively improving generation efficiency and quality. In addition, consistent structural constraints between the DEM and RGB images are achieved through the terrain structure sketch, and consistent style constraints are achieved through text control information. This ensures accurate structural alignment while giving the generated terrain a controllable and harmonious visual style, ultimately improving the overall quality of terrain generation.

[0022] Figure 1 is a flowchart of a terrain generation method provided in an embodiment of this application. As shown in Figure 1, this embodiment provides a terrain generation method. The method can be executed by a terminal device such as a PC, laptop, or smartphone, or by a server. The server can be a physical server with an independent host, a virtual server, a cloud server, or a server cluster; no limitation is made here. As shown in Figure 1, the method includes the following steps:

101. Obtain a terrain structure sketch and text control information. The text control information is used to characterize the style of the target terrain.

[0023] In this embodiment, a terrain structure sketch (hereinafter referred to as a sketch) refers to input information used to represent the overall or local structure of the target terrain. The sketch can be a raster image, vector lines, hand-drawn strokes, edge maps, contour maps, or other structured input forms that can be processed by an encoder. The sketch primarily carries the geometric constraints of the terrain, including but not limited to terrain features such as ridgelines, valley lines, river courses, mountain peak locations, and basin outlines.

[0024] In practical applications, sketches can be obtained in the following ways: First, user hand-drawn input: Users can directly draw terrain outlines, ridgelines, valley lines, etc. on the graphical interface using interactive devices such as touch screens, graphics tablets, and mice. The terrain generation system collects the handwriting coordinates in real time and generates corresponding binary images or vector paths. Second, template or sample import: Users can select existing terrain structure templates from a preset sketch library or import external image files (such as hand-drawn scans or CAD line drawings). Third, semantic segmentation map conversion: Users can first draw or generate a semantic segmentation map (e.g., labeling "mountainous area," "river area," and "flat area"), and then extract boundary lines (the intersections of different colored areas) from this map to obtain structural lines such as ridgelines, valley lines, and shorelines, which can be used as sketch input.
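The third route, extracting boundary lines from a semantic segmentation map, can be sketched as follows. This is a minimal illustration assuming the map is a 2-D grid of region labels and that a boundary cell is any cell with a differently labeled 4-neighbor; the function name and the neighborhood choice are assumptions, not details given in the application.

```python
def extract_boundaries(labels):
    """Return a binary grid: 1 where a cell touches a differently labeled 4-neighbor."""
    rows, cols = len(labels), len(labels[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if inside and labels[nr][nc] != labels[r][c]:
                    out[r][c] = 1
                    break
    return out


# "m" = mountainous area, "r" = river area, "f" = flat area
labels = [["m", "m", "r"],
          ["m", "m", "r"],
          ["f", "f", "r"]]
boundary = extract_boundaries(labels)
```

The resulting binary grid marks cells on both sides of each region border; thinning it to single-pixel lines, if needed, would be a separate step.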

[0025] In some embodiments, the sketch includes at least ridgelines and valley lines. Ridgelines are used to constrain the connection between the local highest points of the terrain, and valley lines are used to constrain the catchment paths of water systems. These two basic lines allow users to quickly express the macroscopic skeleton of the terrain, and subsequent models will generate continuous, natural elevation fields and corresponding texture images based on these lines.

[0026] To further improve the accuracy of terrain generation, text control information can be acquired in addition to the terrain structure sketch; it assists terrain generation by characterizing the style of the target terrain to be generated. In some embodiments, the text control information can be implemented as natural language text, specifically a natural language string containing ecological type and landform type information, arranged according to a preset template or freely. Alternatively, the text control information can be a field composed of keywords. In this case, so that the terrain generation model clearly understands the viewing angle of the RGB image to be generated, the text control information can include the key field "A satellite view of" as a prefix, which fixes the view as a remote sensing overhead view. For example, the user-selected ecological type and landform type can be concatenated in the format "A satellite view of {ecological type} and {landform type}" to obtain the text control information used for terrain generation. Adding the prefix and using a unified template to combine the ecological and landform types constrains the generated RGB image to a remote sensing overhead view, eliminates view ambiguity, simplifies user input, and enhances the realism and controllability of terrain generation.

[0027] The text control information is entered or selected by the user. The text encoder maps the text control information into a dense text embedding vector, extracts semantic features related to ecological type, landform type, and viewpoint, and generates corresponding text encoding features for use in subsequent terrain generation steps.

[0028] To improve the accuracy and efficiency of the text encoder, a data tag library containing various landform and ecological types can be pre-built for users to filter. The data tags selected by the user generate text control information through a unified template.
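Building the text control information from a tag library and the unified template can be sketched as below. The prefix and template come from this description; the function name and the contents of the two tag sets are illustrative assumptions (the entries are example types mentioned in this description, not a defined library).

```python
PREFIX = "A satellite view of"

# Illustrative tag library; a real library would be built in advance.
ECOLOGICAL_TYPES = {"arid desert", "wetland", "alpine zone", "shrubland", "tundra"}
LANDFORM_TYPES = {"mountains", "hills", "plains", "plateau", "basin"}


def build_text_control(ecological_type: str, landform_type: str) -> str:
    """Fill the unified template with two tags selected from the library."""
    if ecological_type not in ECOLOGICAL_TYPES:
        raise ValueError(f"unknown ecological type: {ecological_type!r}")
    if landform_type not in LANDFORM_TYPES:
        raise ValueError(f"unknown landform type: {landform_type!r}")
    return f"{PREFIX} {ecological_type} and {landform_type}"


prompt = build_text_control("tundra", "plateau")
```

Restricting input to library tags means every prompt the text encoder sees follows the same template, which is the motivation given for the tag library.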

[0029] 102. Determine the first and second latent features corresponding to the terrain structure sketch. The first latent feature is used to characterize the elevation information of the terrain structure sketch, and the second latent feature is used to characterize the texture information of the terrain structure sketch.

[0030] After obtaining the sketch, the first latent feature and the second latent feature corresponding to the sketch can be determined. The first latent feature is used to characterize the elevation information of the terrain structure sketch, and the second latent feature is used to characterize the texture information of the terrain structure sketch.

[0031] In practical applications, the first latent feature and the second latent feature can be obtained by the elevation encoder and the texture encoder, respectively. Optionally, determining the first latent feature and the second latent feature corresponding to the terrain structure sketch includes: simultaneously inputting the terrain structure sketch into the elevation encoder and the texture encoder to obtain the first latent feature output by the elevation encoder and the second latent feature output by the texture encoder.

[0032] Specifically, the elevation encoder maps the input sketch to elevation feature representations in the latent space. This encoder can employ a convolutional neural network or a visual Transformer architecture, and its output is a small-sized elevation latent feature map that preserves the elevation structure information in the sketch, such as the relative height relationships of ridges and valleys. The texture encoder maps the input sketch to texture feature representations in the latent space. This encoder can have a similar structure to the elevation encoder but with independent parameters, and its output is a texture latent feature map that encodes the implicit texture distribution information in the sketch, such as visual patterns of vegetation, water bodies, and bare soil that should be present in different areas.

[0033] In this embodiment, the sketch is input into the elevation encoder and the texture encoder simultaneously. To facilitate subsequent joint processing, the latent features output by the two encoders have identical spatial sizes. For example, assuming the sketch size is 2048×2048×3, the elevation encoder outputs a first latent feature of size 256×256×1, which contains the elevation structure information from the sketch, and the texture encoder outputs a second latent feature of size 256×256×3, which contains the implicit texture distribution information from the sketch. Because the two latent features have the same spatial size (both 256×256) and are derived from the same sketch, they are naturally aligned in spatial structure, laying the foundation for subsequent joint generation and consistency constraints.
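The size relationship in this example can be checked with a few lines of arithmetic. The helper name is hypothetical, and the downsampling factor of 8 is inferred from 2048 / 256 rather than stated in the application.

```python
def latent_shape(sketch_hw, channels, downsample=8):
    """Shape (H, W, C) of a latent feature map for a given sketch size."""
    h, w = sketch_hw
    if h % downsample or w % downsample:
        raise ValueError("sketch size must be divisible by the downsampling factor")
    return (h // downsample, w // downsample, channels)


sketch_hw = (2048, 2048)
first_latent = latent_shape(sketch_hw, channels=1)   # elevation encoder output
second_latent = latent_shape(sketch_hw, channels=3)  # texture encoder output
spatially_aligned = first_latent[:2] == second_latent[:2]
```

Only the spatial dimensions need to match for the later joint processing; the channel counts of the two latent features differ.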

[0034] In some embodiments, the first and second latent features corresponding to the terrain structure sketch can also be implemented using a single encoder's joint output. Specifically, the terrain structure sketch can be set as the input of a shared encoder (e.g., a convolutional neural network or a visual Transformer). This shared encoder branches into two different output heads in the middle or at the end of the network: one output head generates the first latent feature representing elevation information, and the other output head generates the second latent feature representing texture information. The number of encoding layers corresponding to the two output heads can be different. For example, the first few layers of the shared encoder can be set to extract common features from the sketch, and then the extracted shared feature maps can be input into the elevation prediction branch and the texture prediction branch, respectively. The two branches pass through different numbers of convolutional layers or fully connected layers, and finally output the first and second latent features, respectively. This application does not limit this approach, as long as it achieves effective decoupling of elevation and texture information in the terrain structure sketch and maps them separately into the latent feature space.

[0035] 103. Based on the text control information, process the first latent feature and the second latent feature to obtain the target elevation feature and target texture feature that match the text control information.

[0036] The text control information identifies the style of the target terrain to be generated, and this style can generally be understood as a unified description of the overall appearance of the DEM and the RGB image. That is, the style of the target terrain is jointly determined by the landform type and the ecological type. The landform type mainly reflects the overall morphological characteristics of the DEM, such as mountains, hills, plains, plateaus, and basins. These landforms differ markedly in morphology: plains are low and open, hills undulate gently, mountains have significant elevation differences, plateaus have a relatively high overall elevation and a relatively flat surface, and basins have a relatively low overall elevation and more convergent local undulations. The ecological type mainly reflects the surface cover characteristics of the RGB image, such as arid deserts, wetlands, alpine zones, shrublands, and tundra. Different ecological types also differ significantly in color distribution, texture organization, and spatial coverage.

[0037] Specifically, the user-specified ecological type and landform type can be obtained from the acquired text control information. To generate the target terrain accurately, the first and second latent features can be processed based on the text control information to obtain target elevation features and target texture features that match it. That is, the target elevation features match the landform type, and the target texture features match the ecological type.

[0038] In some instances, target elevation features and target texture features can be determined based on a pre-trained generative network. In this case, if the text control information includes the prefix "A satellite view of", the aforementioned text control information can be first input into a text encoder (such as the text encoder of the CLIP model) to obtain the encoded text control information.

[0039] Then, the first latent feature, the second latent feature, and the encoded text control information can be input into the generative network so that it outputs target elevation features and target texture features that match the text control information.

[0040] In some embodiments, the generative network can employ various deep generative models, including but not limited to Generative Adversarial Networks (GANs), diffusion models, or Transformer-based generative models. Diffusion models generate data through progressive denoising, resulting in high-quality generation. To further improve generation efficiency, a one-step diffusion network (such as a Pix2Pix-Turbo or SD-Turbo-based generative model) can be used. This network can directly output target elevation and texture features in a single forward propagation, avoiding the multi-step iterative sampling of traditional diffusion models and thus increasing the speed of terrain generation.

[0041] In other instances, processing the first latent feature and the second latent feature based on the text control information to obtain target elevation features and target texture features that match the text control information may include: fusing the first latent feature and the second latent feature to obtain fused latent features; and inputting the fused latent features and the text control information into the denoising network of a single-step diffusion model to obtain target elevation features and target texture features that match the text control information.

[0042] The latent features can be fused by concatenating the first and second latent features along the channel dimension, which yields fused latent features whose channel count is the sum of the two inputs' channel counts while the spatial size stays the same. Correspondingly, the input channels of the original single-step diffusion model can be expanded so that the fused latent features can be fed into its denoising network. This denoising network typically employs a U-Net architecture. At the same time, the text control information is input into the denoising network as conditional information. Note that the text control information must first be converted into text-encoded features by a text encoder before being input into the denoising network, so that it can be effectively understood and used to constrain terrain generation. Subsequently, based on the fused latent features and the encoded text control information, the denoising network directly predicts, in a single forward pass, target latent features that satisfy both the structural and the style constraints; these are then split along the channel dimension into target elevation features and target texture features.
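The channel-wise fusion and the later split can be sketched without any deep learning library. This illustration assumes a channel-first layout in which a feature is a list of 2-D channel grids; the function names are hypothetical, and the denoising network that would sit between the two calls is omitted.

```python
def spatial_size(feature):
    """(height, width) of a channel-first feature: a list of 2-D channel grids."""
    return len(feature[0]), len(feature[0][0])


def concat_channels(first, second):
    """Concatenate two features along the channel dimension."""
    if spatial_size(first) != spatial_size(second):
        raise ValueError("latent features must share the same spatial size")
    return first + second


def split_channels(fused, n_first):
    """Split a fused feature back into its first n_first channels and the rest."""
    return fused[:n_first], fused[n_first:]


# Toy 2x2 latents: 1 elevation channel and 3 texture channels.
elevation = [[[0.1, 0.2], [0.3, 0.4]]]
texture = [[[1, 1], [1, 1]], [[2, 2], [2, 2]], [[3, 3], [3, 3]]]

fused = concat_channels(elevation, texture)
elev_out, tex_out = split_channels(fused, n_first=len(elevation))
```

With a 1-channel elevation latent and a 3-channel texture latent, the fused feature has 4 channels, so the denoising network's input layer would need to accept 4 channels.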

[0043] The fused latent features incorporate the elevation structure information of the sketch carried by the first latent feature and the implicit texture distribution information of the sketch carried by the second latent feature. They constrain the topographic geometry (e.g., elevation undulation, ridge and valley orientation) and the spatial layout of land cover types (e.g., regional distribution of vegetation, water bodies, and bare soil) during the generation of the target latent features. The text-encoded features, on the other hand, contain the style features of the target terrain, constraining the macroscopic geometric style of the terrain (e.g., mountains, plains, basins) and the visual texture style of the surface (e.g., forests, snowfields, deserts). Through joint inference over the fused latent features and the text-encoded features within the single-step diffusion model, the elevation structure information, texture distribution information, and text style information constrain one another, ensuring the generation quality and the structural and style consistency of the target elevation and texture features.

[0044] 104. Decode the target elevation features and target texture features respectively to obtain the target elevation results and target texture results.

[0045] After the target elevation features and target texture features that match the text control information are obtained, they are decoded separately to obtain the target elevation result and the target texture result. Specifically, the target elevation features can be input into a pre-trained elevation decoder to obtain the target elevation result, and the target texture features can be input into a pre-trained texture decoder to obtain the target texture result. The target elevation result is a digital elevation model (DEM) corresponding to the target terrain, that is, a two-dimensional raster in which each pixel records the ground elevation at that location; it quantitatively describes the geometric relief and spatial distribution characteristics of the terrain. The target texture result is an RGB image corresponding to the target terrain, that is, a three-channel color image covering exactly the same spatial range as the DEM; it characterizes the color, texture, and cover type of the surface and reflects the visual appearance of the terrain.

[0046] In practice, the process of an encoder encoding a sketch involves multiple downsampling steps. During these downsampling processes, the spatial resolution and high-frequency detail information of the feature map gradually decrease. This results in the output latent features lacking fine structural information from the sketch (such as sharp ridge lines and local texture edges). Therefore, to preserve this detail, in some embodiments, the encoder and decoder can be skip-connected. This allows the decoder to directly access high-resolution feature maps from the encoder's intermediate layers that have not yet been over-downsampled during the decoding process, thereby effectively recovering the geometric and texture details in the generated target.

[0047] Optionally, decoding the target elevation feature and the target texture feature respectively to obtain the target elevation result and the target texture result may include: obtaining multi-scale elevation features output by the elevation encoder during the process of encoding the terrain structure sketch by the elevation encoder; obtaining multi-scale texture features output by the texture encoder during the process of encoding the terrain structure sketch by the texture encoder; inputting the multi-scale elevation features and the target elevation feature into the elevation decoder to obtain the target elevation result output by the elevation decoder, wherein the elevation decoder and the elevation encoder are connected in a skip connection; inputting the multi-scale texture features and the target texture feature into the texture decoder to obtain the target texture result output by the texture decoder, wherein the texture decoder and the texture encoder are connected in a skip connection.

[0048] The skip connection approach can directly inject detailed information from the sketch into the decoder, imposing stronger spatial constraints on the generation of target elevation and target texture results, and enhancing their spatial consistency.
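The effect of skip connections on fine detail can be illustrated with a toy, non-learned analogue. The sketch below is an assumption-laden illustration rather than the encoder and decoder of this application: average pooling stands in for learned downsampling, and the "multi-scale features" are simply the detail lost at each scale.

```python
import numpy as np

def downsample(x):
    """2x average pooling on an (H, W) array with even H and W."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    """2x nearest-neighbour upsampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def encode(sketch, levels=2):
    """Return the coarse latent and the per-scale detail kept for the skips."""
    skips, x = [], sketch
    for _ in range(levels):
        coarse = downsample(x)
        skips.append(x - upsample(coarse))  # detail that downsampling discards
        x = coarse
    return x, skips

def decode(latent, skips, use_skips=True):
    """Upsample step by step, optionally re-injecting the encoder detail."""
    x = latent
    for skip in reversed(skips):
        x = upsample(x)
        if use_skips:
            x = x + skip
    return x

sketch = np.eye(16)                   # a sharp diagonal "ridge line"
latent, skips = encode(sketch)
with_skips = decode(latent, skips, use_skips=True)
without_skips = decode(latent, skips, use_skips=False)
error_with = np.abs(with_skips - sketch).max()
error_without = np.abs(without_skips - sketch).max()
```

With the skips enabled the ridge line is recovered, while the decoder without skips can only return a blurred, blocky version of it.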

[0049] In some embodiments, to quickly view the effect of the joint generation of the DEM and RGB image, the target elevation result (DEM) and target texture result (RGB image) can be previewed. The DEM can first be displayed as a grayscale image, allowing users to intuitively understand the elevation distribution. The RGB image can be displayed as a two-dimensional color image, making it easy for users to examine the surface color, texture, and cover type. Furthermore, to perceive the combined effect of terrain and texture more intuitively, a 2.5D rendered view can be generated by jointly using the DEM and the RGB image: first, a 2.5D terrain model (such as a hillshade map) is generated from the DEM data through perspective projection and simulated lighting; then, the RGB image is overlaid as a texture onto the surface of this 2.5D model, allowing users to simultaneously perceive the steepness of the terrain, the direction of ridges and valleys, and the surface cover color. Since 2.5D rendering does not require constructing a complete 3D mesh, the computational overhead is low, making it suitable for real-time feedback during interactive adjustments. After viewing the DEM grayscale image, the RGB color image, and the 2.5D preview combining the two in the display interface, a user who finds that the elevation structure or texture distribution does not meet expectations can immediately modify the terrain structure sketch or the text control information and regenerate the target terrain with the aforementioned terrain generation method.
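A minimal sketch of the simulated-lighting part of such a preview is given below. It shades each DEM cell by the dot product of its surface normal with a light direction and multiplies the RGB image by that shade. The light direction, the cell size, and the function name `hillshade` are assumptions made for illustration, and the perspective projection step is not covered.

```python
import numpy as np

def hillshade(dem, cell_size=1.0, light=(-1.0, -1.0, 1.0)):
    """Lambertian shading of a DEM.

    `light` is a vector (column, row, up) pointing toward the light source;
    the default lights the terrain from the upper-left corner of the array.
    """
    light = np.asarray(light, dtype=float)
    light = light / np.linalg.norm(light)
    dz_drow, dz_dcol = np.gradient(dem.astype(float), cell_size)
    # The unnormalised normal of the surface z = f(col, row) is
    # (-dz/dcol, -dz/drow, 1).
    normal = np.stack([-dz_dcol, -dz_drow, np.ones_like(dz_dcol)], axis=-1)
    normal /= np.linalg.norm(normal, axis=-1, keepdims=True)
    return np.clip(normal @ light, 0.0, 1.0)

# Synthetic inputs: a Gaussian hill as the DEM and a uniform green texture.
yy, xx = np.mgrid[0:32, 0:32]
dem = 10.0 * np.exp(-((xx - 16) ** 2 + (yy - 16) ** 2) / 50.0)
rgb = np.zeros((32, 32, 3))
rgb[..., 1] = 0.8

shade = hillshade(dem)
preview = rgb * shade[..., None]   # drape the texture over the shaded relief
```

Only a per-pixel gradient and one multiplication are needed, which is consistent with the low overhead attributed to the 2.5D preview.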

[0050] 105. Based on the target elevation and target texture results, generate the target terrain corresponding to the terrain structure sketch and text control information.

[0051] After obtaining the target elevation and target texture results, a target terrain corresponding to the terrain structure sketch and text control information can be generated based on these results. Specifically, the target elevation results can be used as the terrain geometry skeleton, and the target texture results can be used as the surface visual map. These can be fused together using a 3D rendering engine or graphics pipeline to generate the target terrain, and then a 3D view of the target terrain can be displayed in the display interface.
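One common way to hand a DEM and an RGB image to a rendering pipeline is as a regular-grid triangle mesh with per-vertex colors. The sketch below assumes this representation; the function name `dem_to_mesh` and the `cell_size` parameter are hypothetical, and the application does not prescribe a particular mesh layout or engine.

```python
import numpy as np

def dem_to_mesh(dem, rgb, cell_size=1.0):
    """Build vertices, per-vertex colors and triangles from a DEM and RGB image."""
    h, w = dem.shape
    assert rgb.shape == (h, w, 3), "RGB image must cover the same grid as the DEM"
    rows, cols = np.mgrid[0:h, 0:w]
    # One vertex per pixel: (x, y) from the grid position, z from the elevation.
    vertices = np.stack([cols * cell_size, rows * cell_size, dem], axis=-1)
    vertices = vertices.reshape(-1, 3)
    colors = rgb.reshape(-1, 3)
    # Two triangles per grid cell, indexed into the flattened vertex array.
    idx = np.arange(h * w).reshape(h, w)
    a = idx[:-1, :-1].ravel()   # top-left corner of each cell
    b = idx[:-1, 1:].ravel()    # top-right
    c = idx[1:, :-1].ravel()    # bottom-left
    d = idx[1:, 1:].ravel()     # bottom-right
    faces = np.concatenate([np.stack([a, b, c], axis=1),
                            np.stack([b, d, c], axis=1)])
    return vertices, colors, faces

dem = np.arange(12, dtype=float).reshape(3, 4)
rgb = np.full((3, 4, 3), 0.5)
vertices, colors, faces = dem_to_mesh(dem, rgb)
```

Because the DEM and the RGB image share the same grid, every vertex receives its color from the pixel at the same position, so no separate texture registration step is needed in this sketch.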

[0052] In this embodiment, a first latent feature and a second latent feature are obtained based on the same terrain structure sketch, so the sketch imposes spatial structural constraints on the target elevation and texture results. Text control information is used to process the first and second latent features, imposing visual style constraints on the target elevation and texture results. Through the collaborative control of the sketch and the text control information, the target elevation and texture results are jointly generated, overcoming the shortcomings of related technologies, such as low generation efficiency due to DEM-based terrain rendering and low generation accuracy and poor realism caused by error propagation along the terrain generation chain. This application improves terrain generation efficiency while effectively ensuring the generation quality and spatial consistency of the target elevation and texture results. The generated target terrain has both a geometric structure constrained by the sketch and a visual style constrained by the text control information, significantly improving the realism and overall consistency of the target terrain.

Figure 2 is a flowchart of a terrain generation method provided in an embodiment of this application. As shown in Figure 2, this embodiment proposes a method for generating target terrain in combination with a feature segmentation strategy. Specifically, the method includes the following steps:

201. Obtain a terrain structure sketch and text control information. The text control information is used to characterize the style of the target terrain.

[0053] 202. Determine the first and second latent features corresponding to the terrain structure sketch. The first latent feature is used to characterize the elevation information of the terrain structure sketch, and the second latent feature is used to characterize the texture information of the terrain structure sketch.

[0054] 203. In response to the terrain structure sketch having a size larger than a preset size, segment the first latent feature and the second latent feature to obtain multiple first sub-feature blocks corresponding to the first latent feature and multiple second sub-feature blocks corresponding to the second latent feature. The first sub-feature blocks and the second sub-feature blocks are in one-to-one correspondence and have the same size, and there is an overlapping area between any two adjacent first sub-feature blocks.

[0055] This application provides a terrain generation method that is particularly suitable for large-scale scenes. In practical applications, when the input sketch is larger than the fixed input resolution (usually 512×512) supported by the pre-trained generative model, feeding the complete large-scale sketch directly into the terrain generation model raises two main problems. First, memory usage grows with the square of the input side length and can easily exceed hardware limits. Second, the model's receptive field covers a smaller proportion of the whole sketch, making it difficult to capture the global terrain structure and leading to instability in the overall layout and local details of the generated results.

[0056] To address the aforementioned issues, the latent features corresponding to the input sketch are segmented before the model performs forward inference. Specifically, the large-scale sketch is first mapped to a smaller latent feature map, which retains the global structural information of the sketch. Then, the latent feature map is segmented into multiple sub-feature blocks with overlapping regions according to a preset window size and sliding step size. Each sub-feature block is independently input into the denoising network to reduce the memory overhead and computational burden of the denoising network, thereby achieving stable generation of large-scale terrain with limited resources.

[0057] For example, suppose that a sketch larger than 512×512 is treated as a large-scale sketch, and take an input sketch of size 1024×1024. First, the corresponding first and second latent features are obtained from the large-scale sketch. The first latent feature has a size of 256×256×1 and contains the global elevation structure information of the sketch. The second latent feature has a size of 256×256×3 and contains the implicit global texture distribution information of the sketch. As shown in Figure 3, assuming the preset window size is 64×64 and the sliding step size is 8, 25 windows with starting positions 0, 8, 16, ..., 192 are generated along the height direction of the first latent feature. The same applies to the width direction, giving a total of 25×25=625 windows. Each window cuts a first sub-feature block of size 64×64×1 from the first latent feature and, at the same time, a second sub-feature block of size 64×64×3 from the second latent feature. The first and second sub-feature blocks under the same window have the same spatial position and correspond to the same local area of the sketch. Adjacent windows overlap in both the row and column directions, with an overlap width of 64-8=56.
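The window arithmetic in this example can be checked with a few lines of code. The helper name `window_starts` is an assumption introduced here; the numbers are those of the example (latent size 256, window 64, step 8).

```python
def window_starts(size, window, stride):
    """Start offsets of all sliding windows that fit entirely inside `size`."""
    return list(range(0, size - window + 1, stride))

latent_size, window, stride = 256, 64, 8
starts = window_starts(latent_size, window, stride)

windows_per_axis = len(starts)          # 25 start positions: 0, 8, ..., 192
total_windows = windows_per_axis ** 2   # 625 windows over height and width
overlap = window - stride               # 56 latent pixels shared by neighbours
```

Note that this helper only covers the whole latent when `size - window` is a multiple of `stride`, as it is here; other combinations would need an extra window aligned to the far edge.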

[0058] 204. Input multiple first sub-feature blocks, multiple second sub-feature blocks, and text control information into the denoising network of the single-step diffusion model to obtain multiple local elevation features and multiple local texture features.

[0059] After the multiple first sub-feature blocks and second sub-feature blocks are obtained, feeding the sub-feature blocks of all windows into the denoising network simultaneously as one global batch could cause excessive memory overhead or resource contention during forward inference. To avoid this, a serial inference strategy can be adopted: forward inference is performed on each window in turn, following the spatial arrangement order of the windows.

[0060] Specifically, following a row-first order, starting from the first window in the top left corner, all windows in the same row are traversed from left to right, then the process moves to the next row until the last window in the bottom right corner. The first and second sub-feature blocks corresponding to each window are concatenated and input into the denoising network. The denoising network then performs forward inference in conjunction with text control information and outputs the local elevation features and local texture features corresponding to each window. These local elevation features and local texture features have the same size and the same spatial position in the latent space, meaning they correspond to the exact same local region in the sketch.
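The row-first traversal can be sketched as a simple loop. In the sketch below, `denoise` is a hypothetical callable standing in for the denoising network (text conditioning is omitted), and an identity function is used as a placeholder so that only the traversal order and tensor shapes are exercised.

```python
import numpy as np

def iter_windows(height, width, window, stride):
    """Yield (top, left) offsets in row-first order, top-left to bottom-right."""
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            yield top, left

def run_serial(z_elev, z_tex, denoise, window=64, stride=8):
    """Process one window at a time and collect local elevation/texture features."""
    c_elev, height, width = z_elev.shape
    results = []
    for top, left in iter_windows(height, width, window, stride):
        rows, cols = slice(top, top + window), slice(left, left + window)
        # Concatenate the two sub-feature blocks of this window along channels.
        block = np.concatenate([z_elev[:, rows, cols], z_tex[:, rows, cols]], axis=0)
        out = denoise(block)
        results.append((top, left, out[:c_elev], out[c_elev:]))
    return results

z_elev = np.zeros((1, 256, 256))   # first latent feature (channels first)
z_tex = np.zeros((3, 256, 256))    # second latent feature (channels first)
results = run_serial(z_elev, z_tex, denoise=lambda block: block)
```

Only one 4×64×64 block is held by the network at any time, so peak memory is set by the window size rather than by the size of the full latent.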

[0061] 205. Fuse multiple local elevation features to obtain target elevation features that match the text control information.

[0062] Subsequently, multiple local elevation features and multiple local texture features obtained based on all windows can be fused separately in the latent space to obtain complete target elevation features and target texture features.

[0063] Optionally, fusing the multiple local elevation features to obtain a target elevation feature that matches the text control information includes: determining the overlapping and non-overlapping regions of any two adjacent local elevation features; performing weighted fusion of the two adjacent local elevation features based on the Gaussian weights corresponding to the overlapping region to obtain the elevation features of the fused region; and determining the target elevation feature based on the elevation features of the fused region and the local elevation features corresponding to the non-overlapping regions.

[0064] Specifically, before window segmentation and fusion, a global cumulative feature tensor with the same size as the first latent feature is first created, and all its elements are initialized to zero. At the same time, a cumulative weight value is maintained for each spatial location, initially set to zero.

[0065] For each local elevation feature, its corresponding region in the global cumulative feature tensor is first determined based on its spatial location. Simultaneously, a two-dimensional Gaussian weight matrix of the same size as the local elevation feature is generated. The weight values are determined by the distance from each feature point within the local elevation feature to its center, with the highest weight at the center and weights approaching zero at the edges. Next, the local elevation feature is multiplied element-wise by the Gaussian weight matrix, and the weighted result is accumulated to the corresponding region in the global cumulative feature tensor. The weight values from the Gaussian weight matrix are also accumulated to the cumulative weight value at the corresponding spatial location. After all windows have been processed, for each spatial location in the global cumulative feature tensor, the accumulated feature value is divided by the accumulated weight value at that location to complete normalization, yielding the fused target elevation latent feature.
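The accumulate-and-normalize procedure described above can be sketched as follows. The Gaussian width (`sigma_ratio`) and the small demo sizes are assumptions chosen for illustration; the application does not specify them.

```python
import numpy as np

def gaussian_window(size, sigma_ratio=0.25):
    """2-D Gaussian weights: largest at the window center, decaying to the edges."""
    coords = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-0.5 * (coords / (sigma_ratio * size)) ** 2)
    return np.outer(g, g)

def fuse_windows(blocks, out_shape, window):
    """Fuse (top, left, block) triples into one (C, H, W) latent."""
    acc = np.zeros(out_shape)           # global cumulative feature tensor
    wsum = np.zeros(out_shape[-2:])     # cumulative weight per spatial location
    weight = gaussian_window(window)
    for top, left, block in blocks:
        rows, cols = slice(top, top + window), slice(left, left + window)
        acc[:, rows, cols] += block * weight
        wsum[rows, cols] += weight
    return acc / wsum                   # per-location normalization

# Demo: cut overlapping windows from a known latent and fuse them back.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1, 32, 32))
window, stride = 16, 8
blocks = [(t, l, latent[:, t:t + window, l:l + window])
          for t in range(0, 32 - window + 1, stride)
          for l in range(0, 32 - window + 1, stride)]
fused = fuse_windows(blocks, latent.shape, window)
```

In this demo every window carries the same values in the overlaps, so the fused result reproduces the original latent. When the windows disagree, each location becomes a weighted average that favors the window whose center is closest.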

[0066] For any two adjacent overlapping areas of local elevation features, a Gaussian weighted average method is used for fusion, which can smoothly merge the two local elevation feature blocks and achieve a natural transition.

[0067] In other instances, where any two adjacent local elevation features include a first local elevation feature and a second local elevation feature, fusing the multiple local elevation features to obtain a target elevation feature that matches the text control information includes: determining the overlapping and non-overlapping regions of the two adjacent local elevation features; determining the first Gaussian weight corresponding to the first local elevation feature and the second Gaussian weight corresponding to the second local elevation feature; and, if the first Gaussian weight is greater than the second Gaussian weight, fusing the first local elevation feature with the local elevation feature corresponding to the non-overlapping region of the second local elevation feature, thereby stably obtaining the target elevation feature.

[0068] 206. Fuse multiple local texture features to obtain target texture features that match the text control information.

[0069] The step of fusing the multiple local texture features to obtain a target texture feature that matches the text control information includes: determining the overlapping region and non-overlapping region of any two adjacent local texture features; performing weighted fusion of the two adjacent local texture features based on the Gaussian weights corresponding to the overlapping region to obtain the fused region texture feature; and determining the target texture feature based on the fused region texture feature and the local texture features corresponding to the non-overlapping region.

[0070] The fusion process for multiple local texture features is similar to the fusion process for local elevation features mentioned above. Specifically, before window segmentation and fusion, a global cumulative feature tensor with the same size as the second latent feature is first created, and all its elements are initialized to zero. At the same time, a cumulative weight value is maintained for each spatial location, initially set to zero.

[0071] For each local texture feature, its corresponding region in the global cumulative feature tensor is first determined based on its spatial location. Simultaneously, a two-dimensional Gaussian weight matrix of the same size as the local texture feature is generated. The weight values are determined by the distance from each feature point within the local texture feature to its center, with the highest weight at the center and weights approaching zero at the edges. Next, the local texture feature is multiplied element-wise by the Gaussian weight matrix, and the weighted result is accumulated to the corresponding region in the global cumulative feature tensor. The weight values in the Gaussian weight matrix are also accumulated to the accumulated weight values at the corresponding spatial locations. After all windows have been processed, for each spatial location in the global cumulative feature tensor, the accumulated feature value is divided by the accumulated weight value at that location to complete normalization, yielding the fused target texture latent feature. For the overlapping region of any two adjacent local texture features, a Gaussian weighted average is used for fusion, ensuring smooth blending and a natural transition between the two local texture feature blocks.

[0072] In other instances, fusing the multiple local texture features to obtain target texture features that match the text control information includes: determining the overlapping and non-overlapping regions of any two adjacent local texture features when any two adjacent local texture features include a first local texture feature and a second local texture feature; determining the first Gaussian weight corresponding to the first local texture feature and the second Gaussian weight corresponding to the second local texture feature; and fusing the first local texture feature with the local texture feature corresponding to the non-overlapping region of the second local texture feature when the first Gaussian weight is greater than the second Gaussian weight, thus stably obtaining the target texture feature.

[0073] 207. Decode the target elevation features and target texture features respectively to obtain the target elevation results and target texture results.

[0074] 208. Based on the target elevation and target texture results, generate the target terrain corresponding to the terrain structure sketch and text control information.

[0075] By using this global accumulation and Gaussian weighted fusion method, the local elevation features and local texture features corresponding to each window are smoothly fused in the latent space, which can effectively eliminate window splicing marks and ensure the spatial alignment of the target elevation features and the target texture features.

[0076] The execution of other steps in this embodiment can be referred to the relevant descriptions in the foregoing embodiments, and will not be repeated here.

[0077] Figure 4 is a flowchart of a terrain generation method provided in an embodiment of this application. As shown in Figure 4, when the text control information includes global control information and local control information, the terrain generation method in this embodiment includes the following steps:

401. Obtain a terrain structure sketch and text control information. The text control information includes global control information and local control information. The global control information is used to characterize the global style of the target terrain, and the local control information is used to characterize the local style of the target terrain. The local style is different from the global style.

[0078] In terrain generation applications, there is often a need to achieve multi-style terrain. Therefore, this embodiment introduces local control information to enable diversified control over terrain generation. Here, the text control information includes global control information and local control information. Global control information characterizes the global style of the target terrain, while local control information characterizes the local style. To achieve multi-style terrain generation, the local style is limited to be different from the global style. However, it is understood that in practical applications, the local style and global style can be the same.

[0079] In some embodiments, the sketch can be divided into multiple regions, and text control information of the same or different styles can be assigned to each region. The sketch regions can be divided in the following ways:

User-defined boundaries: The interface provides drawing tools that allow users to directly draw the boundaries of different areas (such as polygons and free curves) on the sketch and assign a corresponding text description (such as "forest", "river", or "snowfield") to each area.

[0080] Semantic segmentation assistance: Input the sketch into the pre-trained semantic segmentation model, automatically identify semantic regions such as ridges, valleys, vegetation areas, and water areas in the sketch, and then bind default text control information to each semantic region, allowing users to modify it later.

[0081] Regular grid division: A fixed grid is used (e.g., dividing a 1024×1024 sketch into regular sub-blocks of 2×2 or 4×4), and each sub-block can be independently assigned text control information. This method is suitable for expressing periodicity or a block-based style.

[0082] It should be noted that this application does not limit the specific method of dividing the sketch area. The above method is only an example, and those skilled in the art can adopt other reasonable division strategies according to actual needs.

[0083] 402. Determine the first and second latent features corresponding to the terrain structure sketch. The first latent feature is used to characterize the elevation information of the terrain structure sketch, and the second latent feature is used to characterize the texture information of the terrain structure sketch.

[0084] 403. Based on global control information, process the first and second latent features to obtain global elevation features and global texture features that match the global control information.

[0085] 404. Based on local control information, process the first latent feature and the second latent feature to obtain local elevation features and local texture features that match the local control information.

[0086] 405. The global elevation features and local elevation features are fused to obtain the target elevation features.

[0087] 406. Fuse global and local texture features to obtain target texture features.

[0088] 407. Decode the target elevation features and target texture features respectively to obtain the target elevation results and target texture results.

[0089] 408. Based on the target elevation and target texture results, generate the target terrain corresponding to the terrain structure sketch and text control information.

[0090] After obtaining global and local control information, the first and second latent features can be processed based on this information. First, the first and second latent features are fused and input into the denoising network of a single-step diffusion model. The encoded global control information is then injected into the denoising network as conditional information. During a single forward propagation, the denoising network outputs global elevation and texture features that match the global control information. Similarly, the first and second latent features, along with the encoded local control information, are input into the same denoising network. The network can identify the effective region of the local control information based on the implicit spatial relationships in the input information (e.g., through masks or attention mechanisms) and infer and output local elevation and texture features that match the local control information. The generated local elevation and texture features can have the same or different dimensions from the global elevation and texture features.

[0091] Next, the global and local elevation features are fused. First, a weight map of the same size as the feature map is constructed. Within the region specified by the local control information, the weights are set to the maximum value (local features completely dominate); in the background region far from the local region, the weights are set to the minimum value (global features dominate); and near the region boundary, the weights transition smoothly. Then, the global and local elevation features are weighted and averaged point-by-point according to this weight map to obtain the fused target elevation feature.
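A weight map of this kind can be sketched by blurring a binary region mask and using the result for a point-by-point weighted average. The Gaussian blur, its `sigma`, and the function names below are assumptions; the application only requires that the weights are maximal inside the local region, minimal far from it, and smooth near the boundary.

```python
import numpy as np

def smooth_mask(mask, sigma=2.0):
    """Blur a binary (H, W) mask with a separable Gaussian to get weights in [0, 1]."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(mask.astype(float), radius, mode="edge")
    # Filter along rows, then along columns; "valid" restores the original size.
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def blend(global_feat, local_feat, weight):
    """Point-by-point weighted average of (C, H, W) features with an (H, W) weight map."""
    return weight * local_feat + (1.0 - weight) * global_feat

mask = np.zeros((64, 64))
mask[16:48, 16:48] = 1.0               # region named by the local control information
weight = smooth_mask(mask)

global_feat = np.zeros((4, 64, 64))    # placeholder global elevation features
local_feat = np.ones((4, 64, 64))      # placeholder local elevation features
target = blend(global_feat, local_feat, weight)
```

The same `weight` array can be reused for the texture features, which keeps the elevation and texture transitions aligned at the region boundary.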

[0092] Similarly, the target texture features are obtained by applying the same weighted average to the global and local texture features using the same weight map.

[0093] Through this fusion method, the elevation and texture features within the area specified by the local control information can be accurately represented without being affected by the global style. The background area (the area constrained only by the global control information) completely follows the overall style of the global control information, ensuring the stability of the macroscopic features of the terrain. Furthermore, by constructing continuously varying smooth weights, feature values can change continuously at the boundary between the local area and the background, eliminating harsh splicing marks and making the overall terrain visually natural and harmonious. In addition, the generation process of the above-mentioned global and local features is always constrained by the structural information of the sketch, which can effectively ensure the spatial consistency of the target elevation and texture features, and together with the text control information, ensure the realism and quality of the generated terrain.

[0094] Optionally, the local control information includes first local control information and second local control information. The first local control information is used to characterize the target terrain style of a first region, and the second local control information is used to characterize the target terrain style of a second region. The first region and the second region have an overlapping region. Before fusing the global elevation features and the local elevation features to obtain the target elevation features, the method further includes: determining the first local elevation features and the first local texture features corresponding to the first region; determining the second local elevation features and the second local texture features corresponding to the second region; determining a first region text weight based on the first local control information, wherein the first region text weight characterizes the degree of influence of the first local control information on the first region; determining a second region text weight based on the second local control information, wherein the second region text weight characterizes the degree of influence of the second local control information on the second region; determining, based on the first and second local control information, a transition text weight corresponding to the overlapping region, wherein the transition text weight characterizes the combined influence of the first and second local control information on the overlapping region; fusing the first local elevation feature and the second local elevation feature based on the first region text weight, the transition text weight, and the second region text weight to obtain the local elevation feature; and fusing the first local texture feature and the second local texture feature based on the first region text weight, the transition text weight, and the second region text weight to obtain the local texture feature.

[0095] In some embodiments, multiple local control information may have one or more overlapping regions. The following explanation uses the example of two adjacent local control information pieces corresponding to one overlapping region. Figure 5 As shown, the first region and the second region in the sketch correspond to the first local control information and the second local control information, respectively, and there is an overlapping area between the first region and the second region. Therefore, before fusing global features and local features, the features corresponding to different regions can be fused first to obtain complete local features.

[0096] Specifically, firstly, the first local elevation features and first local texture features corresponding to the first region are determined, as well as the second local elevation features and second local texture features corresponding to the second region. Then, the text weights for the first region are determined based on the first local control information. These weights can be a two-dimensional distribution with the same size as the feature map. Inside the first region, the weight value is at its maximum (indicating the strongest influence of the first local control information on that region); outside the first region, the weight value is at its minimum (indicating zero influence); near the boundary of the first region, the weight value smoothly decreases from its maximum to its minimum (e.g., using a Gaussian function or distance transform). Similarly, the text weights for the second region are determined based on the second local control information, taking the maximum value inside the second region and the minimum value outside, with a smooth transition at the boundary.

[0097] For the overlapping region formed by the first and second regions, a transition text weight must also be determined. The transition text weight characterizes the joint influence of the first and second local control information on the overlapping region. Specifically, for each pixel position p within the overlapping region, the first region text weight value w1(p) and the second region text weight value w2(p) are obtained, and the transition text weight α(p) is then constructed from them. Here p denotes a pixel in the overlapping region, and α(p) indicates the influence ratio of the first local control information at that pixel; the influence ratio of the second local control information is then 1 − α(p). The transition text weight varies pixel by pixel within the range 0 to 1: it approaches 1 near the center of the first region, approaches 0 near the center of the second region, and equals 0.5 at the boundaries. Next, the first local elevation feature and the second local elevation feature are fused. The fused local elevation feature equals the product of the first local elevation feature and the influence ratio of the first local control information, plus the product of the second local elevation feature and the influence ratio of the second local control information, i.e., α(p)·F1(p) + (1 − α(p))·F2(p), where F1 and F2 denote the first and second local elevation features.

[0098] The process of fusing local texture features is the same as that of local elevation features, and will not be elaborated further here.

[0099] In the above manner, the natural transition of overlapping areas is achieved by combining the text weights of the first region, the second region, and the transition text weights, so that the influence of the first local control information and the second local control information is smoothly connected at the boundary, avoiding hard switching.
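As a non-limiting illustration, the region text weights and the transition text weight described above can be sketched as follows. The sketch rests on assumptions not stated in this application: the region weights are obtained by Gaussian-blurring binary region masks, and the transition weight is taken as the normalized ratio w1 / (w1 + w2). The function names, the `sigma` parameter, and the `eps` constant are hypothetical.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D normalized Gaussian kernel with a radius of 3 * sigma."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def region_text_weight(mask, sigma=2.0):
    """Blur a binary region mask so the weight is close to 1 inside the
    region, 0 outside, and falls off smoothly near the boundary."""
    k = gaussian_kernel(sigma)
    w = mask.astype(float)
    for axis in (0, 1):  # separable blur: rows, then columns
        w = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="same"), axis, w)
    return w

def fuse_local_features(f1, f2, w1, w2, eps=1e-8):
    """Blend two (C, H, W) feature maps with a per-pixel transition weight.

    alpha is an ASSUMED form of the transition text weight (normalized
    ratio); it is only meaningful where w1 + w2 > 0.
    """
    alpha = w1 / (w1 + w2 + eps)
    fused = alpha * f1 + (1.0 - alpha) * f2
    return fused, alpha

# Two regions on a 32 x 32 feature map that overlap in columns 12..19.
mask1 = np.zeros((32, 32)); mask1[:, :20] = 1
mask2 = np.zeros((32, 32)); mask2[:, 12:] = 1
w1, w2 = region_text_weight(mask1), region_text_weight(mask2)
fused, alpha = fuse_local_features(
    np.ones((3, 32, 32)), np.zeros((3, 32, 32)), w1, w2)
```

With this choice, alpha stays near 1 deep inside the first region, near 0 deep inside the second region, and passes through 0.5 in the middle of the overlap, so the same weights can be reused unchanged for the local texture features.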

[0100] Figure 6 is a schematic diagram illustrating the principle of a terrain generation method provided in an embodiment of this application. As shown in Figure 6, this application provides a terrain generation method that can be carried out by a terrain generation model. Specifically, the terrain generation model may include an elevation encoder, a texture encoder, a text encoder, a single-step denoising network, an elevation decoder, and a texture decoder. The elevation encoder and the elevation decoder are linked by skip connections, as are the texture encoder and the texture decoder. On this basis, the terrain generation method can proceed as follows. First, the sketch is input simultaneously into the elevation encoder and the texture encoder; the elevation encoder outputs the first latent feature, and the texture encoder outputs the second latent feature. The text control information is input into the text encoder to obtain the encoded text control information. Next, the first and second latent features are concatenated and input, through extended input channels, into the single-step denoising network (i.e., the Stable Diffusion U-Net in Figure 6, which performs one-step denoising), and the encoded text control information is injected into the network as conditional information. In a single forward inference pass, the network predicts the latent target features and outputs the target elevation features and the target texture features. Both the target elevation features and the target texture features contain structural features from the sketch and style features from the text control information. The target elevation features are then input into the elevation decoder, which outputs the target elevation result based on the target elevation features and the multi-scale elevation features obtained through the skip connections. Simultaneously, the target texture features are input into the texture decoder, which outputs the target texture result based on the target texture features and the multi-scale texture features obtained through the skip connections. The multi-scale features carry detailed information from the sketch, which imposes stronger structural constraints on the generated target elevation and texture results and maintains good overall consistency.

[0101] Finally, the target terrain is rendered based on the target elevation and texture results.

[0102] The aforementioned terrain generation method enables the joint generation of target elevation and target texture results, which are collaboratively controlled by sketch and text control information. This effectively ensures the controllability of the structure and style of terrain generation, as well as overall consistency. Furthermore, by employing a single-step denoising network to simultaneously obtain target elevation and target texture features during a single forward inference process, generation efficiency can be guaranteed.
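As a non-limiting illustration of the data flow only, the channel-wise concatenation of the two latent features, the single forward pass, and the subsequent split can be sketched as follows. The encoder and the denoising network are toy stand-ins, not the networks of this application; the function names, the 4-channel latent, and the stride of 8 are assumptions made for the example.

```python
import numpy as np

def encode(sketch, out_channels=4, stride=8):
    """Stand-in encoder: average-pool the (H, W) sketch and repeat it over
    channels. A real elevation or texture encoder would be a learned network."""
    h, w = sketch.shape
    pooled = sketch.reshape(h // stride, stride, w // stride, stride).mean(axis=(1, 3))
    return np.repeat(pooled[None], out_channels, axis=0)

def single_step_denoise(latent, text_embedding):
    """Stand-in for the single-step denoising network: one forward pass that
    returns a tensor of the same shape, conditioned on the text embedding."""
    return latent + np.tanh(text_embedding.mean())

def generate_latents(sketch, text_embedding):
    z_elev = encode(sketch)                          # first latent feature
    z_tex = encode(sketch)                           # second latent feature
    joint = np.concatenate([z_elev, z_tex], axis=0)  # extended input channels
    out = single_step_denoise(joint, text_embedding) # single forward pass
    c = z_elev.shape[0]
    return out[:c], out[c:]                          # target elevation / texture features

target_elev, target_tex = generate_latents(np.zeros((64, 64)), np.ones(16))
```

Because both branches pass through the network as one tensor, the elevation and texture outputs are produced jointly rather than one after the other.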

[0103] Figure 7 is a flowchart of a terrain generation model training method provided in an embodiment of this application. As shown in Figure 7, before the terrain structure sketch and text control information are obtained, the terrain generation model can first be trained so that the trained model can perform the corresponding terrain generation operations. In this case, the method in this embodiment may also include the following steps: 701. Obtain terrain structure sketch samples and text control samples. The terrain structure sketch samples correspond to standard elevation results and standard texture results that conform to the text control samples.

[0104] 702. Based on text control samples and terrain structure sketch samples, generate predicted elevation results and predicted texture results that match the text control information.

[0105] 703. Based on the terrain structure sketch sample, the predicted elevation result and the standard elevation result, construct the first loss function.

[0106] 704. Based on the terrain structure sketch samples, predicted texture results, and standard texture results, construct a second loss function.

[0107] 705. Construct a joint loss function based on the predicted elevation results, predicted texture results, standard elevation results, and standard texture results.

[0108] 706. The diffusion network is trained based on the first loss function, the second loss function, and the joint loss function to obtain a single-step diffusion model.

[0109] In this embodiment, the training samples can come from public datasets, remote sensing images and DEM registration data, or be obtained through manual annotation. Each terrain structure sketch sample (hereinafter referred to as sketch sample) corresponds to a text control sample and is equipped with a standard elevation result (i.e., a real digital elevation model, DEM) and a standard texture result (i.e., a real RGB image) that conforms to the text control sample.

[0110] Next, the sketch samples are input into the elevation encoder and texture encoder to obtain the first and second latent features, respectively. Simultaneously, the text control samples are input into the text encoder to obtain text-encoded features. The first and second latent features are fused with the text-encoded features and then fed into a single-step diffusion network (U-Net architecture). This network outputs predicted elevation and predicted texture features that match the text control samples. These are then passed through the elevation decoder and texture decoder, respectively, to obtain the predicted elevation and predicted texture results.

[0111] Then, the first loss function, the second loss function, and the joint loss function are constructed respectively.

[0112] The first loss function measures the difference between the predicted elevation and the standard elevation to ensure that the generated elevation numerically approximates the true DEM. Specifically, it can employ pixel-level mean squared error (MSE) loss, mean absolute error (L1 loss), or perceptual loss, etc.

[0113] The second loss function measures the difference between the predicted texture result and the standard texture result to ensure that the generated RGB image closely approximates the real surface texture in terms of color and texture detail. MSE loss, L1 loss, or perceptual loss (e.g., perceptual loss based on a pre-trained VGG network) can also be used.

[0114] The joint loss function is used to constrain the structural consistency between elevation and texture. Specifically, it can calculate the difference in spatial gradient between the predicted elevation result and the standard elevation result, while constraining the feature correlation between the predicted texture result and the standard texture result at the same location.

[0115] By weighted summing of the three loss functions mentioned above, the single-step diffusion network and related encoders and decoders are jointly trained until the loss converges, thus obtaining the trained terrain generation model.
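As a non-limiting illustration, the weighted sum of the three loss functions can be sketched as follows. L1 loss is used for the first and second loss functions, which is one of the options named above. The joint term is simplified under stated assumptions: the spatial-gradient difference is computed with finite differences, and the feature correlation is replaced by a pixel-level correlation. The weights `w_elev`, `w_tex`, and `w_joint` are hypothetical.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error, usable for both the elevation and texture losses."""
    return float(np.mean(np.abs(pred - target)))

def gradient_loss(pred_dem, true_dem):
    """Difference between the spatial gradients of two (H, W) elevation maps."""
    gy_p, gx_p = np.gradient(pred_dem)
    gy_t, gx_t = np.gradient(true_dem)
    return float(np.mean(np.abs(gx_p - gx_t)) + np.mean(np.abs(gy_p - gy_t)))

def correlation_loss(pred_rgb, true_rgb, eps=1e-8):
    """1 - Pearson correlation between two textures (a pixel-level proxy
    for the feature correlation; an assumption of this sketch)."""
    p = pred_rgb - pred_rgb.mean()
    t = true_rgb - true_rgb.mean()
    corr = (p * t).sum() / (np.sqrt((p ** 2).sum() * (t ** 2).sum()) + eps)
    return float(1.0 - corr)

def total_loss(pred_dem, true_dem, pred_rgb, true_rgb,
               w_elev=1.0, w_tex=1.0, w_joint=0.5):
    first = l1_loss(pred_dem, true_dem)    # first loss function (elevation)
    second = l1_loss(pred_rgb, true_rgb)   # second loss function (texture)
    joint = gradient_loss(pred_dem, true_dem) + correlation_loss(pred_rgb, true_rgb)
    return w_elev * first + w_tex * second + w_joint * joint
```

A constant offset in the predicted elevation is penalized by the first loss but not by the gradient term, which is why the two terms are complementary.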

[0116] Optionally, an elevation-texture joint adversarial loss can be introduced by adding a discriminator, to ensure that elevation and texture are strictly aligned in spatial location and naturally matched in semantics. Figure 8 is a schematic diagram of a terrain generation discriminator provided in an embodiment of this application. The terrain generation discriminator consists of an elevation discriminator, a texture discriminator, and a joint multimodal discriminator.

[0117] The specific process for obtaining the elevation-texture joint adversarial loss based on the aforementioned terrain generation discriminator includes: Step P1: Input the sketch sample, the standard elevation result, and the predicted elevation result into the elevation discriminator to obtain the elevation adversarial loss, thereby constraining the plausibility of the elevation and its consistency with the sketch; Step P2: Input the standard texture result and the predicted texture result into the texture discriminator to obtain the texture adversarial loss, thereby constraining texture realism; Step P3: Input the standard and predicted elevation results together with the standard and predicted texture results into the joint multimodal discriminator to obtain the joint adversarial loss, thereby constraining the spatial consistency of elevation and texture; Step P4: Compute a weighted sum of the elevation adversarial loss, the texture adversarial loss, and the joint adversarial loss to obtain the total elevation-texture joint adversarial loss.

[0118] It is worth noting that the above steps P1-P3 can be performed simultaneously, and the training process of the terrain generation discriminator can be performed simultaneously with the training process of the terrain generation model.

[0119] By introducing a terrain generation discriminator consisting of an elevation discriminator, a texture discriminator, and a joint multimodal discriminator, and simultaneously calculating the elevation adversarial loss, texture adversarial loss, and joint adversarial loss and performing a weighted sum, it is possible to achieve strict spatial alignment and natural semantic matching between elevation and texture. This enables the terrain generation model to not only achieve high quality in a single modality, but also to ensure spatial consistency between elevation structure and texture style as a whole, thereby enhancing the realism of the final generated target terrain.
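As a non-limiting illustration, the weighted sum of the three adversarial losses can be sketched as follows. This application does not specify the form of each adversarial loss; binary cross-entropy on discriminator output probabilities is assumed here, and the function names and default weights are hypothetical.

```python
import numpy as np

def bce(prob, label, eps=1e-7):
    """Binary cross-entropy between discriminator probabilities and a label."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(label * np.log(prob) + (1.0 - label) * np.log(1.0 - prob)))

def discriminator_loss(p_real, p_fake):
    """Loss of one discriminator: standard results should score 1,
    predicted results should score 0."""
    return bce(p_real, 1.0) + bce(p_fake, 0.0)

def joint_adversarial_loss(elev_loss, tex_loss, joint_loss, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the elevation, texture, and joint adversarial losses."""
    return sum(w * l for w, l in zip(weights, (elev_loss, tex_loss, joint_loss)))
```

Since the three terms are independent until the final sum, they can be evaluated in parallel, consistent with the remark that steps P1 to P3 can be performed simultaneously.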

[0120] Optionally, before end-to-end joint training of the terrain generation model, the elevation encoder, the texture encoder, and their corresponding decoders can each be pre-trained separately to learn the latent priors and decoding capabilities required for terrain representation, thereby improving the stability and convergence speed of the subsequent joint training.

[0121] Taking the encoder and corresponding decoder of the elevation branch as an example, their independent pre-training process can include: Step P1: Obtain standard elevation samples and input them into the elevation encoder (here, a variational autoencoder is used) to obtain the latent elevation features.

[0122] Step P2: Input the latent elevation features into the elevation decoder to obtain the reconstructed elevation results.

[0123] Step P3: Calculate the elevation reconstruction loss based on the standard elevation sample and the reconstructed elevation results (L1 reconstruction loss can be used to constrain the elevation numerical error).

[0124] Step P4: Input the standard elevation samples and the reconstructed elevation results into the discriminator used in the pre-training stage of the elevation branch (which can be a PatchGAN discriminator), and calculate the adversarial loss to enhance the representation of local terrain details.

[0125] Step P5: Calculate the KL divergence loss for the encoder output distribution to make the elevation latent space smoother and more stable.

[0126] Step P6: Jointly update the elevation encoder and decoder parameters based on the reconstruction loss, the adversarial loss, and the KL divergence loss to obtain an elevation prior and stable decoding capability.

[0127] After pre-training, the elevation encoder can learn robust terrain priors, and the decoder has stable elevation reconstruction capabilities, providing good initial parameters for subsequent end-to-end joint training, which helps to improve the convergence speed of joint training and the final generation quality.
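As a non-limiting illustration, the pre-training objective of steps P3 to P6 can be sketched as follows, assuming the encoder outputs the mean and log-variance of a diagonal Gaussian, as is usual for a variational autoencoder. The adversarial loss is passed in as a number because the discriminator is not modeled here; the weights `w_adv` and `w_kl` are hypothetical.

```python
import numpy as np

def kl_divergence(mu, logvar):
    """KL divergence between N(mu, exp(logvar)) and the standard normal,
    averaged over all latent elements."""
    return float(0.5 * np.mean(mu ** 2 + np.exp(logvar) - 1.0 - logvar))

def reparameterize(mu, logvar, rng):
    """Sample a latent elevation feature from the encoder output distribution."""
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

def pretrain_loss(dem, recon, mu, logvar, adv_loss, w_adv=0.1, w_kl=1e-4):
    """L1 reconstruction loss + weighted adversarial loss + weighted KL loss."""
    recon_l1 = float(np.mean(np.abs(dem - recon)))
    return recon_l1 + w_adv * adv_loss + w_kl * kl_divergence(mu, logvar)
```

A small KL weight is a common choice when the latent space is later consumed by a diffusion network, although this application does not state the weights it uses.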

[0128] The terrain generation apparatus of one or more embodiments of this application will be described in detail below. Those skilled in the art will understand that these apparatuses can all be configured using commercially available hardware components through the steps taught in this solution.

[0129] Figure 9 is a schematic diagram of the structure of a terrain generation device provided in an embodiment of this application. As shown in Figure 9, the device includes an acquisition module 11, a determination module 12, a processing module 13, a decoding module 14, and a generation module 15.

[0130] The acquisition module 11 is used to acquire terrain structure sketches and text control information, wherein the text control information is used to characterize the style of the target terrain.

[0131] The determination module 12 is used to determine a first latent feature and a second latent feature corresponding to the terrain structure sketch. The first latent feature is used to characterize the elevation information of the terrain structure sketch, and the second latent feature is used to characterize the texture information of the terrain structure sketch.

[0132] The processing module 13 is used to process the first latent feature and the second latent feature based on the text control information to obtain target elevation features and target texture features that match the text control information.

[0133] The decoding module 14 is used to decode the target elevation feature and the target texture feature respectively to obtain the target elevation result and the target texture result.

[0134] The generation module 15 is used to generate a target terrain corresponding to the terrain structure sketch and the text control information based on the target elevation result and the target texture result.

[0135] Optionally, the determination module 12 is specifically used to: simultaneously input the terrain structure sketch into the elevation encoder and the texture encoder to obtain the first latent feature output by the elevation encoder and the second latent feature output by the texture encoder.

[0136] Accordingly, the determination module 12 is specifically used for: obtaining multi-scale elevation features output by the elevation encoder while the elevation encoder encodes the terrain structure sketch; obtaining multi-scale texture features output by the texture encoder while the texture encoder encodes the terrain structure sketch; inputting the multi-scale elevation features and the target elevation features into the elevation decoder to obtain the target elevation result output by the elevation decoder, wherein the elevation decoder and the elevation encoder are linked by skip connections; and inputting the multi-scale texture features and the target texture features into the texture decoder to obtain the target texture result output by the texture decoder, wherein the texture decoder and the texture encoder are linked by skip connections.

[0137] Optionally, the device further includes: a segmentation module, configured to, in response to the terrain structure sketch having a size greater than a preset size, segment the first latent feature and the second latent feature respectively to obtain a plurality of first sub-feature blocks corresponding to the first latent feature and a plurality of second sub-feature blocks corresponding to the second latent feature, wherein the first sub-feature blocks and the second sub-feature blocks correspond one-to-one, any two adjacent first sub-feature blocks have an overlapping area, and the first sub-feature blocks and the second sub-feature blocks have the same size.

[0138] Accordingly, the processing module 13 is further configured to: input the plurality of first sub-feature blocks, the plurality of second sub-feature blocks and the text control information into the denoising network of the single-step diffusion model to obtain a plurality of local elevation features and a plurality of local texture features; fuse the plurality of local elevation features to obtain a target elevation feature that matches the text control information; and fuse the plurality of local texture features to obtain a target texture feature that matches the text control information.

[0139] Accordingly, the processing module 13 is further configured to: determine the overlapping region and non-overlapping region of any two adjacent local elevation features; perform weighted fusion of the two adjacent local elevation features based on the Gaussian weight corresponding to the overlapping region to obtain the fused region elevation feature; and determine the target elevation feature based on the fused region feature and the local elevation feature corresponding to the non-overlapping region.

[0140] Accordingly, the processing module 13 is further configured to: determine the overlapping region and non-overlapping region of any two adjacent local texture features; perform weighted fusion of the two adjacent local texture features based on the Gaussian weights corresponding to the overlapping region to obtain the fused region texture features; and determine the target texture features based on the fused region texture features and the local texture features corresponding to the non-overlapping region.
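As a non-limiting illustration, the segmentation into overlapping sub-feature blocks and the Gaussian-weighted fusion of adjacent local features can be sketched as follows. The tile size, overlap, and Gaussian width are hypothetical, `process` stands in for the denoising network applied to each block, and the latent feature is assumed to be at least one tile wide and tall.

```python
import numpy as np

def tile_starts(length, tile, overlap):
    """Start indices of overlapping tiles that cover [0, length)."""
    stride = tile - overlap
    starts = list(range(0, max(length - tile, 0) + 1, stride))
    if starts[-1] + tile < length:   # add a final tile flush with the edge
        starts.append(length - tile)
    return starts

def gaussian_window(tile, sigma_frac=0.25):
    """2-D Gaussian weight that peaks at the tile center."""
    x = np.arange(tile) - (tile - 1) / 2.0
    g = np.exp(-0.5 * (x / (sigma_frac * tile)) ** 2)
    return np.outer(g, g)

def blend_tiles(latent, process, tile=16, overlap=4):
    """Run `process` on overlapping tiles of a (C, H, W) latent and fuse the
    outputs with normalized Gaussian weights."""
    c, h, w = latent.shape
    win = gaussian_window(tile)
    acc = np.zeros_like(latent, dtype=float)
    norm = np.zeros((h, w))
    for y in tile_starts(h, tile, overlap):
        for x in tile_starts(w, tile, overlap):
            out = process(latent[:, y:y + tile, x:x + tile])
            acc[:, y:y + tile, x:x + tile] += out * win
            norm[y:y + tile, x:x + tile] += win
    return acc / norm
```

Where only one tile covers a pixel, the normalization returns that tile's output unchanged; where two tiles overlap, the result is their Gaussian-weighted average, which avoids visible seams. The same routine applies to the local texture features.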

[0141] Optionally, the text control information includes global control information and local control information. The global control information is used to characterize the global style of the target terrain, and the local control information is used to characterize the local style of the target terrain, wherein the local style is different from the global style. The processing module 13 is further configured to: process the first latent feature and the second latent feature based on the global control information to obtain global elevation features and global texture features that match the global control information; process the first latent feature and the second latent feature based on the local control information to obtain local elevation features and local texture features that match the local control information; fuse the global elevation features and the local elevation features to obtain the target elevation feature; and fuse the global texture features and the local texture features to obtain the target texture feature.

[0142] Optionally, the local control information includes first local control information and second local control information. The first local control information is used to characterize the target terrain style of the first region, and the second local control information is used to characterize the target terrain style of the second region. The first region and the second region overlap. Before fusing the global elevation features and the local elevation features to obtain the target elevation features, the processing module 13 is further configured to: determine the first local elevation features and the first local texture features corresponding to the first region; determine the second local elevation features and the second local texture features corresponding to the second region; determine, based on the first local control information, the first region text weight of the first region, where the first region text weight characterizes the degree of influence of the first local control information on the first region; determine, based on the second local control information, the second region text weight of the second region, where the second region text weight characterizes the degree of influence of the second local control information on the second region; determine, based on the first and second local control information, the transition text weight corresponding to the overlapping region, where the transition text weight characterizes the degree of joint influence of the first and second local control information on the overlapping region; fuse the first local elevation feature and the second local elevation feature based on the first region text weight, the transition text weight, and the second region text weight to obtain the local elevation feature; and fuse the first local texture feature and the second local texture feature based on the first region text weight, the transition text weight, and the second region text weight to obtain the local texture feature.

[0143] Optionally, the processing module 13 is specifically used to: fuse the first latent feature and the second latent feature to obtain a fused latent feature; input the fused latent feature and the text control information into the denoising network of the single-step diffusion model to obtain target elevation features and target texture features that match the text control information.

[0144] Optionally, the apparatus further includes a training module for: acquiring terrain structure sketch samples and text control samples, wherein the terrain structure sketch samples correspond to standard elevation results and standard texture results that conform to the text control samples; generating predicted elevation results and predicted texture results that match the text control information based on the text control samples and the terrain structure sketch samples; constructing a first loss function based on the terrain structure sketch samples, the predicted elevation results, and the standard elevation results; constructing a second loss function based on the terrain structure sketch samples, the predicted texture results, and the standard texture results; constructing a joint loss function based on the predicted elevation results, the predicted texture results, the standard elevation results, and the standard texture results; and training the diffusion network based on the first loss function, the second loss function, and the joint loss function to obtain the single-step diffusion model.

[0145] The apparatus shown in Figure 9 can perform the steps of the terrain generation method in the foregoing embodiments. For the detailed execution process and technical effects, please refer to the description in the foregoing embodiments, which will not be repeated here.

[0146] Figure 10 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. As shown in Figure 10, in practice, this electronic device includes a memory 21 and a processor 22.

[0147] Memory 21 is used to store computer programs and can be configured to store various other data to support operation on the electronic device. Examples of this data include instructions for any application or method used to operate on the electronic device, data structures, contact data, phone book data, messages, pictures, videos, etc.

[0148] Processor 22, coupled to memory 21, is used to execute the computer programs stored in memory 21 in order to implement the terrain generation method shown in Figures 1 to 9.

[0149] Furthermore, as shown in Figure 10, the electronic device also includes other components such as a communication component 23, a display 24, a power supply component 25, and an audio component 26. Figure 10 shows only some components schematically, which does not mean that the electronic device includes only the components shown in Figure 10. The electronic device in this embodiment can be a terminal device such as a desktop computer, laptop computer, smartphone, or IoT device, or a server device such as a conventional server, cloud server, or server array.

[0150] The aforementioned memory can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as Static Random-Access Memory (SRAM), Electrically Erasable Programmable Read Only Memory (EEPROM), Erasable Programmable Read Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk.

[0151] The aforementioned communication component is configured to facilitate wired or wireless communication between the device containing the communication component and other devices. The device containing the communication component can access wireless networks based on communication standards, such as 2G, 3G, 4G / LTE, 5G, or combinations thereof. In one exemplary embodiment, the communication component receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel.

[0152] The aforementioned display includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a Touch Panel, the screen can be implemented as a touchscreen to receive input signals from the user. The Touch Panel includes one or more touch sensors to sense touches, swipes, and gestures on the Touch Panel. The touch sensors can sense not only the boundaries of touch or swipe actions but also the duration and pressure associated with the touch or swipe operation.

[0153] The aforementioned power supply components provide power to various components within the device in which they reside. These power supply components may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power to the device in which they reside.

[0154] The aforementioned audio component can be configured to output and / or input audio signals. For example, the audio component includes a microphone (MIC) configured to receive external audio signals when the device containing the audio component is in an operating mode, such as call mode, recording mode, or voice recognition mode. The received audio signals can be further stored in memory or transmitted via a communication component. In some embodiments, the audio component also includes a speaker for outputting audio signals.

[0155] Accordingly, embodiments of this application also provide a computer-readable storage medium storing a computer program, which, when executed by a processor, enables the processor to implement the steps in the above-described method embodiments. The computer-readable storage medium includes volatile or non-volatile components, or a combination thereof, and can be removable or non-removable. Examples of computer-readable storage media include, but are not limited to, phase-change random access memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), flash memory or other memory technologies, CD-ROM, Digital Video Disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium. Accordingly, this application also provides a computer program product, which includes a computer program or instructions that, when executed by a processor, cause the processor to implement the steps in the above method embodiments. It should be understood that each step or combination of steps in the above method flow can be implemented by the computer program or instructions. Furthermore, these computer programs or instructions can be applied to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing device, enabling the processor of the general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to function as an apparatus for implementing the corresponding functions in the above method embodiments.

[0156] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.

Claims

1. A terrain generation method, characterized in that the method comprises: obtaining a terrain structure sketch and text control information, wherein the text control information is used to characterize the style of the target terrain; simultaneously inputting the terrain structure sketch into an elevation encoder and a texture encoder to obtain a first latent feature output by the elevation encoder and a second latent feature output by the texture encoder, wherein the first latent feature is used to characterize the elevation information of the terrain structure sketch, and the second latent feature is used to characterize the texture information of the terrain structure sketch; processing the first latent feature and the second latent feature based on the text control information to obtain target elevation features and target texture features that match the text control information; decoding the target elevation features and the target texture features respectively to obtain a target elevation result and a target texture result; and generating, based on the target elevation result and the target texture result, a target terrain corresponding to the terrain structure sketch and the text control information.

2. The method according to claim 1, characterized in that decoding the target elevation features and the target texture features respectively to obtain the target elevation result and the target texture result includes: obtaining multi-scale elevation features output by the elevation encoder while the elevation encoder encodes the terrain structure sketch; obtaining multi-scale texture features output by the texture encoder while the texture encoder encodes the terrain structure sketch; inputting the multi-scale elevation features and the target elevation features into the elevation decoder to obtain the target elevation result output by the elevation decoder, wherein the elevation decoder and the elevation encoder are linked by skip connections; and inputting the multi-scale texture features and the target texture features into the texture decoder to obtain the target texture result output by the texture decoder, wherein the texture decoder and the texture encoder are linked by skip connections.

3. The method according to claim 1, characterized in that, after simultaneously inputting the terrain structure sketch into the elevation encoder and the texture encoder to obtain the first latent feature output by the elevation encoder and the second latent feature output by the texture encoder, the method further includes: in response to the terrain structure sketch having a size larger than a preset size, segmenting the first latent feature and the second latent feature to obtain a plurality of first sub-feature blocks corresponding to the first latent feature and a plurality of second sub-feature blocks corresponding to the second latent feature, wherein the first sub-feature blocks and the second sub-feature blocks are in one-to-one correspondence, any two adjacent first sub-feature blocks have an overlapping region, and the first sub-feature blocks and the second sub-feature blocks have the same size; and processing the first latent feature and the second latent feature based on the text control information to obtain target elevation features and target texture features that match the text control information includes: inputting the plurality of first sub-feature blocks, the plurality of second sub-feature blocks, and the text control information into a denoising network of a single-step diffusion model to obtain a plurality of local elevation features and a plurality of local texture features; fusing the plurality of local elevation features to obtain the target elevation features that match the text control information; and fusing the plurality of local texture features to obtain the target texture features that match the text control information.
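The segmentation step of claim 3 can be sketched as follows. This is a minimal illustration, assuming single-channel two-dimensional latent grids stored as nested lists, square blocks, and a fixed overlap; the names `tile_starts` and `split_pair` and the parameters `tile` and `overlap` are hypothetical. Cropping both latents with the same windows gives blocks that correspond one-to-one and have the same size.

```python
def tile_starts(length, tile, overlap):
    """Start offsets of blocks of width `tile` that cover [0, length).

    Adjacent blocks share at least `overlap` cells; the last block is
    placed flush with the border.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    if length <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)
    return starts


def split_pair(elev, tex, tile, overlap):
    """Crop both latents with identical windows.

    Returns a list of (row, col, elevation_block, texture_block).
    """
    rows, cols = len(elev), len(elev[0])

    def crop(grid, r, c):
        return [row[c:c + tile] for row in grid[r:r + tile]]

    return [
        (r, c, crop(elev, r, c), crop(tex, r, c))
        for r in tile_starts(rows, tile, overlap)
        for c in tile_starts(cols, tile, overlap)
    ]


elev = [[r * 6 + c for c in range(6)] for r in range(6)]
tex = [[-v for v in row] for row in elev]
blocks = split_pair(elev, tex, tile=4, overlap=2)
```

Recording the `(row, col)` offset with each block is what allows the local features to be placed back and fused after processing.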

4. The method according to claim 3, characterized in that fusing the plurality of local elevation features to obtain the target elevation features that match the text control information includes: determining an overlapping region and non-overlapping regions of any two adjacent local elevation features; performing weighted fusion of the two adjacent local elevation features based on Gaussian weights corresponding to the overlapping region to obtain elevation features of a fused region; and determining the target elevation features based on the elevation features of the fused region and the local elevation features corresponding to the non-overlapping regions.
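The Gaussian-weighted fusion of two adjacent local features can be sketched as below. This is one possible reading, assuming one-dimensional strips of equal length, a Gaussian window centred on each strip, and normalisation by the sum of the two weights in the overlap; the claim does not fix the window shape or width, and `gaussian_window`, `fuse_adjacent`, and `sigma_frac` are hypothetical names. The same routine would apply to local texture features.

```python
import math


def gaussian_window(n, sigma_frac=0.25):
    """Gaussian weights centred on a strip of n cells (sigma = sigma_frac * n)."""
    center = (n - 1) / 2
    sigma = sigma_frac * n
    return [math.exp(-((i - center) ** 2) / (2 * sigma ** 2)) for i in range(n)]


def fuse_adjacent(left, right, overlap):
    """Fuse two equal-length strips whose last/first `overlap` cells coincide."""
    n = len(left)
    if len(right) != n or not 0 < overlap < n:
        raise ValueError("strips must have equal length and 0 < overlap < length")
    w = gaussian_window(n)
    fused = list(left[:n - overlap])              # non-overlapping part of left
    for k in range(overlap):
        i, j = n - overlap + k, k                 # same cell, indexed in each strip
        wl, wr = w[i], w[j]
        fused.append((wl * left[i] + wr * right[j]) / (wl + wr))
    fused.extend(right[overlap:])                 # non-overlapping part of right
    return fused


fused = fuse_adjacent([0.0] * 4, [2.0] * 4, overlap=2)
```

Because each strip's weight decays towards its own border, the fused values move smoothly from the left strip to the right strip across the overlap, which avoids a visible seam.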

5. The method according to claim 3, characterized in that fusing the plurality of local texture features to obtain the target texture features that match the text control information includes: determining an overlapping region and non-overlapping regions of any two adjacent local texture features; performing weighted fusion of the two adjacent local texture features based on Gaussian weights corresponding to the overlapping region to obtain texture features of a fused region; and determining the target texture features based on the texture features of the fused region and the local texture features corresponding to the non-overlapping regions.

6. The method according to claim 1, characterized in that the text control information includes global control information and local control information, the global control information characterizes the global style of the target terrain, the local control information characterizes a local style of the target terrain, and the local style is different from the global style; and processing the first latent feature and the second latent feature based on the text control information to obtain target elevation features and target texture features that match the text control information includes: processing the first latent feature and the second latent feature based on the global control information to obtain global elevation features and global texture features that match the global control information; processing the first latent feature and the second latent feature based on the local control information to obtain local elevation features and local texture features that match the local control information; fusing the global elevation features and the local elevation features to obtain the target elevation features; and fusing the global texture features and the local texture features to obtain the target texture features.

7. The method according to claim 6, characterized in that the local control information includes first local control information and second local control information, the first local control information characterizes the target terrain style of a first region, the second local control information characterizes the target terrain style of a second region, and the first region and the second region have an overlapping region; and before fusing the global elevation features and the local elevation features to obtain the target elevation features, the method further includes: determining first local elevation features and first local texture features corresponding to the first region; determining second local elevation features and second local texture features corresponding to the second region; determining, based on the first local control information, a first region text weight of the first region, wherein the first region text weight characterizes the degree of influence of the first local control information on the first region; determining, based on the second local control information, a second region text weight of the second region, wherein the second region text weight characterizes the degree of influence of the second local control information on the second region; determining, based on the first local control information and the second local control information, a transition text weight corresponding to the overlapping region, wherein the transition text weight characterizes the degree of joint influence of the first local control information and the second local control information on the overlapping region; fusing the first local elevation features and the second local elevation features based on the first region text weight, the transition text weight, and the second region text weight to obtain the local elevation features; and fusing the first local texture features and the second local texture features based on the first region text weight, the transition text weight, and the second region text weight to obtain the local texture features.

8. The method according to any one of claims 1 to 7, characterized in that processing the first latent feature and the second latent feature based on the text control information to obtain target elevation features and target texture features that match the text control information includes: fusing the first latent feature and the second latent feature to obtain a fused latent feature; and inputting the fused latent feature and the text control information into a denoising network of a single-step diffusion model to obtain the target elevation features and the target texture features that match the text control information.

9. The method according to claim 8, characterized in that, before acquiring the terrain structure sketch and the text control information, the method further includes: obtaining terrain structure sketch samples and text control samples, wherein each terrain structure sketch sample corresponds to a standard elevation result and a standard texture result that conform to the text control samples; generating, based on the text control samples and the terrain structure sketch samples, a predicted elevation result and a predicted texture result that match the text control samples; constructing a first loss function based on the terrain structure sketch sample, the predicted elevation result, and the standard elevation result; constructing a second loss function based on the terrain structure sketch sample, the predicted texture result, and the standard texture result; constructing a joint loss function based on the predicted elevation result, the predicted texture result, the standard elevation result, and the standard texture result; and training a diffusion network based on the first loss function, the second loss function, and the joint loss function to obtain the single-step diffusion model.
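Claim 9 names the inputs of the three loss functions but not their form or how they are combined. The following is a sketch under stated assumptions only: the first and second losses are taken to be mean squared errors in which cells lying on sketch strokes receive extra weight, the joint loss is taken to be a plain mean squared error over the concatenated elevation and texture outputs, and the three terms are summed with hypothetical weights `lambdas`. All function and parameter names are illustrative and inputs are flattened lists.

```python
def sketch_weighted_mse(pred, target, sketch, alpha=1.0):
    """MSE in which cells with sketch value 1 are weighted 1 + alpha (assumed form)."""
    num = den = 0.0
    for p, t, s in zip(pred, target, sketch):
        w = 1.0 + alpha * s
        num += w * (p - t) ** 2
        den += w
    return num / den


def joint_mse(pred_e, pred_t, std_e, std_t):
    """MSE over elevation and texture outputs taken together (assumed form)."""
    diffs = [(p - t) ** 2 for p, t in zip(pred_e + pred_t, std_e + std_t)]
    return sum(diffs) / len(diffs)


def total_loss(pred_e, pred_t, std_e, std_t, sketch, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum of the first, second, and joint losses (weights are hypothetical)."""
    first = sketch_weighted_mse(pred_e, std_e, sketch)
    second = sketch_weighted_mse(pred_t, std_t, sketch)
    joint = joint_mse(pred_e, pred_t, std_e, std_t)
    return lambdas[0] * first + lambdas[1] * second + lambdas[2] * joint


loss = total_loss(
    pred_e=[1.0, 0.0], pred_t=[0.0, 0.0],
    std_e=[0.0, 0.0], std_t=[0.0, 0.0],
    sketch=[1, 0],
)
```

A joint term of this simple form adds little beyond the two separate losses; a term that couples elevation and texture errors more directly may be what the application intends, but the claim text does not say.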

10. An electronic device, characterized in that it comprises: a memory, a processor, and a communication interface; wherein the memory stores executable code that, when executed by the processor, causes the processor to perform the method according to any one of claims 1 to 9.

11. A non-transitory machine-readable storage medium, characterized in that the non-transitory machine-readable storage medium stores executable code that, when executed by a processor of an electronic device, causes the processor to perform the method according to any one of claims 1 to 9.

12. A computer program product, characterized in that it comprises: a computer program that, when executed by a processor of an electronic device, causes the processor to perform the method according to any one of claims 1 to 9.