Aerial multi-view data generation method based on identifier nested text-to-image large model

By combining identifier nesting with a text-to-image large model, the method addresses viewpoint and target consistency in UAV aerial photography data generation, enabling high-quality generation and augmentation of multi-view aerial photography data and improving the diversity and robustness of UAV aerial photography data.

CN122199696A · Pending · Publication Date: 2026-06-12 · NAT INNOVATION INST OF DEFENSE TECH PLA ACAD OF MILITARY SCI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NAT INNOVATION INST OF DEFENSE TECH PLA ACAD OF MILITARY SCI
Filing Date
2025-08-04
Publication Date
2026-06-12


Abstract

The application discloses an aerial multi-view data generation method based on an identifier-nested text-to-image large model, comprising: obtaining image samples from a planar perspective and registering the target objects to be generated using a preset identifier nesting module, so that each target object has an identifier consisting of a unique given label and an original category label; fine-tuning a pre-trained text-to-image large model module based on the image samples configured with identifiers, using a preset reconstruction loss function and a preset prior preservation loss function; setting the shooting parameters of the text-to-image large model module based on a preset shooting parameter embedding module; and inputting image description text containing the target identifier, together with the shooting parameters, into the fine-tuned text-to-image large model module to generate image data. The application can convert data shot from a planar perspective into unmanned aerial vehicle aerial data while maintaining the consistency of the generated content, and can control the shooting perspective and height of the generated data, thereby achieving aerial sample augmentation.

Description

Technical Field

[0001] This invention relates to the field of data augmentation, and specifically to a method for generating aerial multi-view data based on an identifier-nested text-to-image large model.

Background Technology

[0002] With the development of the "low-altitude economy," drones are increasingly being used in various smart city applications, such as logistics delivery, traffic monitoring, and law enforcement. In these tasks, drones need to accurately understand urban scenarios. Deep learning-based computer vision technology, due to its significantly superior generalization performance compared to traditional vision algorithms, has gradually become one of the key supporting technologies for drone environmental perception. However, deep learning-based computer vision algorithms often require a large amount of data for training, placing high demands on the scale and diversity of the training data. However, with the implementation of drone "no-fly" orders in various cities, collecting aerial photography data from a drone's perspective has become increasingly difficult. Aerial photography data from a drone's perspective exhibits problems such as limited categories, limited scenes, cluttered perspectives, missing perspectives, and heterogeneous data sources, resulting in poor generalization of aerial visual perception models trained using this data and insufficient robustness to different scenarios and shooting conditions.

[0003] To address the data scarcity issue in deep learning, numerous data augmentation methods have been proposed, all of which aim to increase the amount of available data. Among them, StabilityAI designed the Stable Diffusion model, a large text-to-image generation model. Stable Diffusion uses a latent-space diffusion model as its foundational generation module and introduces a text encoder (CLIP) to encode language generation instructions, guiding the model to generate specified content. This approach enhances the diversity of generated images through text-to-image generation, enabling the acquisition of high-quality image samples. To increase the controllability of the image generation process, researchers have further proposed methods such as T2I-Adapter and ControlNet, which add conditional constraints to generation using additional inputs such as edges, depth maps, and pose maps. These methods improve the controllability of image generation and can generate augmented samples with specified content and layout.

[0004] However, existing data generation methods mainly rely on the layout and perspective of the input image to generate new images, and cannot directly generate drone aerial photography sample data under different perspectives and shooting parameters. In addition, existing methods cannot maintain the consistency of the content of person or vehicle targets in the input image, and the generated images are prone to distortion and to content that does not conform to physical laws. Therefore, the quality and diversity of generated image samples need to be further improved.

Summary of the Invention

[0005] The purpose of this invention is to provide a method for generating aerial multi-view data based on an identifier-nested text-to-image large model. The method transforms existing natural image datasets taken from planar perspectives into drone aerial datasets while maintaining the consistency of the generated content. It also allows control over the shooting angle and altitude of the generated data, ultimately augmenting aerial samples and addressing the problem of insufficient data diversity.

[0006] To achieve the above objectives, the present invention adopts the following technical solution:

[0007] A method for generating aerial multi-view data based on an identifier-nested text-to-image large model, the method comprising:

[0008] acquiring image samples from a planar perspective and registering the target objects to be generated using a preset identifier nesting module, so that each target object has an identifier consisting of a unique given label and an original category label;

[0009] fine-tuning a pre-trained text-to-image large model module based on the image samples configured with identifiers, using a preset reconstruction loss function and a preset prior preservation loss function;

[0010] setting the shooting parameters of the text-to-image large model module based on a preset shooting parameter embedding module;

[0011] inputting image description text containing the target identifier, together with the shooting parameters, into the fine-tuned text-to-image large model module to generate image data.

[0012] Preferably, the text-to-image large model module is a Stable Diffusion XL large model, which includes a variational autoencoder and a Unet, wherein the variational autoencoder includes an encoder and a decoder;

[0013] The Stable Diffusion XL large model was pre-trained using the large dataset LAION-5B.

[0014] Preferably, the reconstruction loss function is as follows:

[0015]

[0016] where $L_{recon}$ denotes the reconstruction loss function; $\mathbb{E}$ denotes the average image reconstruction error over the entire training data distribution; $X_{gen}$ denotes the image generated based on the identifier nesting module; and $X$ denotes the ground-truth image of the target object to be registered.

[0017] Preferably, $X_{gen}$ is given by the following formula:

[0018]

[0019] where the function on the right-hand side denotes the pre-trained text-to-image large model module; $Z_t$ denotes the noise feature map generated by the encoder of the text-to-image large model module at step $t$; $C_{emb} = T1(t_{emb})$ denotes the text constraint processed by the identifier nesting module; $T1(\cdot)$ denotes the pre-trained text-input CLIP model in the text-to-image large model module; and $t_{emb}$ denotes the image description text containing the identifier.

[0020] Preferably, while the text-to-image large model module is fine-tuned with the reconstruction loss function, its denoising optimization loss function for training is as follows:

[0021]

[0022] where $L_{gen}$ denotes the denoising optimization loss function used to train the text-to-image large model module; $\mathbb{E}$ denotes the average image denoising optimization error over the entire training data distribution; $Z_0$ denotes the latent-space intermediate vector obtained when the encoder of the text-to-image large model module downsamples $X$; the step-dependent coefficient is the noise amplitude hyperparameter; $\epsilon_\theta(Z_t, C_{emb})$ denotes the noise predicted by the Unet of the text-to-image large model module at step $t$; and $\epsilon$ denotes the ground-truth noise under the text constraint.

[0023] Preferably, the prior preservation loss function is as follows:

[0024]

[0025] where $L_{preserve}$ denotes the prior preservation loss function; $\mathbb{E}$ denotes the average prior preservation error over the entire training data distribution; $X_{pr\text{-}gen}$ denotes the original-category image generated based on the identifier nesting module; and $X_{pr}$ denotes the original-category ground-truth image of the target object to be registered.

[0026] Preferably, $X_{pr\text{-}gen}$ is given by the following formula:

[0027]

[0028] where the function on the right-hand side denotes the pre-trained text-to-image large model module; $C_{pr} = T1(t_{pr})$ denotes the original-category text constraint processed by the identifier nesting module; and $t_{pr}$ denotes the generation text for the original category.

[0029] Preferably, while the text-to-image large model module is fine-tuned with the prior preservation loss function, its denoising optimization loss function for training is as follows:

[0030]

[0031] where $L_{pr\text{-}gen}$ denotes the denoising optimization loss function used to train the text-to-image large model module; $\mathbb{E}$ denotes the average denoising optimization error over the entire training data distribution; and $\epsilon_\theta(Z_t, C_{pr})$ is the noise term under the text constraint $C_{pr}$, evaluated against the noise ground truth under the text constraint with nested identifiers.

[0032] Preferably, the shooting parameters include shooting viewpoint text and shooting height text; the shooting parameter embedding module is a pre-trained text parameter embedding CLIP model;

[0033] The shooting parameter embedding module sets the shooting parameters of the text-to-image large model module in the following manner:

[0034]

[0035] where $U(\cdot)$ denotes the Unet encoder of the text-to-image large model module; $T2(\cdot)$ denotes the pre-trained shooting parameter embedding module; $W_Q$, $W_K$, and $W_V$ denote the preset mapping matrices; $Q$, $K$, and $V$ denote the query, key, and value of the hybrid attention mechanism; $d_K$ denotes the dimension of the key matrix $K$; $\mathrm{Attention}(\cdot)$ denotes the hybrid attention mechanism; $\mathrm{softmax}(\cdot)$ denotes the softmax function; $T$ denotes the transpose; $t_{angle}$ denotes the shooting viewpoint text input; and $t_{height}$ denotes the shooting height text input.

[0036] Preferably, the identifier nesting module registers the target object to be generated based on Dreambooth technology.

[0037] The advantages of this invention are:

[0038] This invention provides a method for generating multi-view aerial data based on a large-scale text-to-image model with nested identifiers. It transforms existing natural image datasets captured from planar perspectives into UAV aerial datasets while maintaining consistency in the generated content. Furthermore, it allows control over the shooting angle and altitude of the generated data, ultimately achieving aerial sample augmentation.

[0039] Furthermore, based on the identifier nesting module and the text-to-image large model module, this invention can register the appearance of a specified target object using only 3 to 5 data samples. The designed reconstruction loss function and prior preservation loss function are used to fine-tune the text-to-image large model module, enabling the model to remember targets associated with specific identifiers. Subsequently, the specific identifier is used to generate the specified target with a specified pose, achieving content-consistent generation.

[0040] Furthermore, the shooting parameter embedding module designed in this invention includes two conditional embeddings: shooting angle and shooting height. This module uses a cross-attention mechanism to establish a cross-modal association between text and image, enabling text-guided control of the shooting angle and shooting height of the generated image.

Attached Figure Description

[0041] Figure 1 is a schematic diagram of the main process of the aerial multi-view data generation method based on an identifier-nested text-to-image large model according to the present invention;

[0042] Figure 2 is a schematic diagram of the application model of the aerial multi-view data generation method based on an identifier-nested text-to-image large model according to the present invention;

[0043] Figure 3 is a schematic diagram of the structure of the identifier nesting module according to the present invention.

Detailed Implementation

[0044] The overall process of the aerial multi-view data generation method based on an identifier-nested text-to-image large model provided in this embodiment is as follows: First, planar view image samples are collected, and specific identifiers are associated with various target objects in the scene. Then, the identifier nesting method is used to fine-tune and train the text-to-image large model module using the planar view dataset associated with the identifiers, and the various targets in the scene are registered accordingly. Next, the shooting angle and height are set by embedding shooting parameters, and text prompts containing target identifiers and shooting parameters are input into the trained text-to-image large model module to generate consistent image data. Furthermore, consistent video data is generated based on the generated images using an image-to-video method. The invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0045] Referring to Figure 1 and Figure 2, Figure 1 illustrates the main process of the aerial multi-view data generation method based on an identifier-nested text-to-image large model, and Figure 2 is a schematic diagram of the application model of the method. The method provided in this embodiment includes:

[0046] Step S1: Obtain image samples from a planar perspective and register the target objects to be generated using a preset identifier nesting module, so that each target object has an identifier consisting of a unique given label and an original category label.

[0047] Specifically, a key challenge in generating drone aerial images and videos is maintaining consistency in the generated content. To keep scenes and targets consistent, this invention designs an identifier nesting module whose structure is shown in Figure 3. Its main approach is to register the target objects of the generated content using Dreambooth technology. First, 3 to 5 images of each target object to be registered are acquired. A unique identifier is designed for each target object to distinguish it; the identifier consists of a unique given label for the target and an original category label, in the form "a [unique given label][original category label]". The text-to-image large model module is then fine-tuned using the target object images and the identifiers, so that different identifiers represent different generation subjects and target registration is completed. This lets the text-to-image large model module remember each target object through its unique identifier. In subsequent data sample generation, different identifiers can be used to retrieve different target subjects, generating specified objects with consistent content.
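The identifier format described above can be illustrated with a small sketch. It is not the patent's implementation: the class name `RegisteredTarget`, the helper `build_registry`, the example labels `sks` and `zwx`, and the space placed between the two labels are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisteredTarget:
    """One target object registered with the identifier nesting module."""
    unique_label: str    # hypothetical rare token chosen for this target
    category_label: str  # original category of the target, e.g. "car"

    @property
    def identifier(self) -> str:
        # Follows the "a [unique given label][original category label]" form,
        # with a separating space added as an assumption.
        return f"a {self.unique_label} {self.category_label}"

    @property
    def prior_prompt(self) -> str:
        # Category-only text, as used for the original-category constraint.
        return f"a {self.category_label}"


def build_registry(targets):
    """Map each unique label to its target and reject duplicate labels."""
    registry = {}
    for target in targets:
        if target.unique_label in registry:
            raise ValueError(f"duplicate unique label: {target.unique_label}")
        registry[target.unique_label] = target
    return registry


registry = build_registry([
    RegisteredTarget("sks", "car"),
    RegisteredTarget("zwx", "person"),
])
print(registry["sks"].identifier)    # a sks car
print(registry["zwx"].prior_prompt)  # a person
```

Rejecting duplicate labels reflects the requirement that each registered target be retrievable through its own unique identifier.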

[0048] Step S2: Fine-tune the pre-trained text-based image large model module based on image samples configured with identifiers and using preset reconstruction loss function and prior preservation loss function.

[0049] Specifically, the text-to-image large model module encodes the generation instruction (prompt) through the text input model, guiding the Stable Diffusion XL (SDXL) large model to generate images. This model is pre-trained on the large dataset LAION-5B, which contains 5 billion text-image pairs. By fitting this dataset, the SDXL model aligns the text and image modalities and acquires powerful text-to-image generation capability. As shown in Figure 2, the module consists of two parts: a pre-trained Variational Autoencoder (VAE) and a Unet. The VAE comprises an encoder and a decoder. The encoder downsamples the input image into a low-dimensional latent space, while the decoder upsamples the low-dimensional vector to generate the final image. The advantage of using a VAE for downsampling is that image editing is performed in the low-dimensional latent space, which effectively improves computational efficiency. The Unet performs progressive denoising of the input in the low-dimensional latent space to obtain the low-dimensional vector of the final generated image, which is then fed into the VAE decoder.
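To give a sense of why working in the latent space is cheaper, the following sketch computes the latent tensor shape for a given image size. The downsampling factor of 8 and the 4 latent channels are assumptions taken from the publicly released SDXL VAE configuration, not from this document, and the function names are hypothetical.

```python
def latent_shape(height: int, width: int, factor: int = 8, channels: int = 4):
    """Return (channels, height, width) of the latent for an RGB image.

    `factor` and `channels` are assumed values for the SDXL VAE.
    """
    if height % factor or width % factor:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (channels, height // factor, width // factor)


def compression_ratio(height: int, width: int, factor: int = 8, channels: int = 4):
    """Ratio of RGB pixel values to latent values for one image."""
    c, h, w = latent_shape(height, width, factor, channels)
    return (3 * height * width) / (c * h * w)


print(latent_shape(1024, 1024))       # (4, 128, 128)
print(compression_ratio(1024, 1024))  # 48.0
```

Under these assumptions the Unet processes 48 times fewer values per image than it would in pixel space.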

[0050] In this embodiment, the text-to-image large model module uses the reconstruction loss function and the prior preservation loss function to supervise the fine-tuning process.

[0051] The reconstruction loss function is shown in Equation (1):

[0052]

[0053] where $L_{recon}$ denotes the reconstruction loss function; $\mathbb{E}$ denotes the average reconstruction error over the entire training data distribution; $X_{gen}$ denotes the image generated based on the identifier nesting module; and $X$ denotes the ground-truth image of the target object to be registered.
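One form of the reconstruction loss that is consistent with the definitions in [0053] is sketched below. The squared L2 norm is an assumption borrowed from DreamBooth-style fine-tuning; the source text only states that the loss is an average reconstruction error between the generated image and the ground-truth image.

```latex
L_{recon} = \mathbb{E}\left[ \left\lVert X_{gen} - X \right\rVert_2^2 \right]
```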

[0054] $X_{gen}$ is given by formula (2):

[0055]

[0056] where the function on the right-hand side denotes the pre-trained text-to-image large model module; $Z_t$ denotes the noise feature map generated by the encoder of the text-to-image large model module at step $t$; $C_{emb} = T1(t_{emb})$ denotes the text constraint processed by the identifier nesting module; $T1(\cdot)$ denotes the pre-trained text-input CLIP model in the text-to-image large model module; and $t_{emb}$ denotes the image description text containing the identifier. While the text-to-image large model module is fine-tuned with the reconstruction loss function, its denoising optimization loss function for training is given by formula (3):

[0057]

[0058] where $L_{gen}$ denotes the denoising optimization loss function used to train the text-to-image large model module; $\mathbb{E}$ denotes the average denoising optimization error over the entire training data distribution; $Z_0$ denotes the latent vector obtained when the encoder of the text-to-image large model module downsamples $X$; the step-dependent coefficient is the noise amplitude hyperparameter; $\epsilon_\theta(Z_t, C_{emb})$ denotes the noise predicted by the Unet of the text-to-image large model module at step $t$; and $\epsilon$ denotes the ground-truth noise under the text constraint. During training, noise is first added to $Z_0$ to obtain the ground-truth noise value at each step. During denoising, the initial noise feature map $Z_T$ is first sampled from a Gaussian distribution. Then, at each denoising step, the Unet generates the predicted noise under the constraint $C_{emb}$, and the predicted noise $\epsilon_\theta$ is subtracted from the current feature map, starting from $Z_T$, before proceeding to the next denoising step. After $T$ rounds of denoising, a noise-free low-dimensional feature map is obtained, which is sent to the decoder to produce the final image $X_{gen}$. The weights of the variational autoencoder and the Unet in the text-to-image large model module are updated according to the reconstruction loss function during the denoising process.
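The noising and noise-prediction steps described in this paragraph can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the patent's formula: it uses the standard DDPM closed-form noising rule with a single hypothetical cumulative schedule value `alpha_bar_t`, and the function names `add_noise`, `noise_mse`, and `remove_noise` are invented for this example. A real system would use the Unet's prediction in place of the true noise.

```python
import math
import random


def add_noise(z0, eps, alpha_bar_t):
    """Assumed forward noising: Z_t = sqrt(a) * Z_0 + sqrt(1 - a) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * z + b * e for z, e in zip(z0, eps)]


def noise_mse(eps_true, eps_pred):
    """Mean squared error between ground-truth and predicted noise."""
    return sum((t - p) ** 2 for t, p in zip(eps_true, eps_pred)) / len(eps_true)


def remove_noise(z_t, eps_pred, alpha_bar_t):
    """Invert add_noise with a noise estimate to get an estimate of Z_0."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [(z - b * e) / a for z, e in zip(z_t, eps_pred)]


rng = random.Random(0)
z0 = [rng.uniform(-1.0, 1.0) for _ in range(8)]  # stand-in for an encoded latent Z_0
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]    # ground-truth Gaussian noise
alpha_bar_t = 0.5                                # hypothetical schedule value

z_t = add_noise(z0, eps, alpha_bar_t)
perfect = noise_mse(eps, eps)                    # an ideal predictor gives zero loss
zero_guess = noise_mse(eps, [0.0] * len(eps))    # predicting no noise is penalised
recovered = remove_noise(z_t, eps, alpha_bar_t)  # exact noise recovers Z_0
```

The last line shows why an accurate noise prediction matters: subtracting the correct noise from the noisy feature map recovers the clean latent, which the decoder can then turn into an image.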

[0059] The prior preservation loss function is shown in Equation (4):

[0060]

[0061] where $L_{preserve}$ denotes the prior preservation loss function; $\mathbb{E}$ denotes the average prior preservation error over the entire training data distribution; $X_{pr\text{-}gen}$ denotes the original-category image generated based on the identifier nesting module; and $X_{pr}$ denotes the original-category ground-truth image of the target object to be registered.

[0062] $X_{pr\text{-}gen}$ is given by formula (5):

[0063]

[0064] where the function on the right-hand side denotes the pre-trained text-to-image large model module; $Z_t$ denotes the noise feature map generated by the encoder of the text-to-image large model module at step $t$; $C_{pr} = T1(t_{pr})$ denotes the original-category text constraint processed by the identifier nesting module; and $t_{pr}$ denotes the generation text for the original category. While the text-to-image large model module is fine-tuned with the prior preservation loss function, its denoising optimization loss function for training is given by formula (6):

[0065]

[0066] where $L_{pr\text{-}gen}$ denotes the denoising optimization loss function used to train the text-to-image large model module; $\mathbb{E}$ denotes the average denoising optimization error over the entire training data; and $\epsilon_\theta(Z_t, C_{pr})$ is the noise term under the text constraint $C_{pr}$, evaluated against the noise ground truth under the text constraint with nested identifiers. The weights of the variational autoencoder and the Unet in the text-to-image large model module are likewise updated during the denoising process according to the prior preservation loss function.
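How the two supervision terms might be combined during fine-tuning can be sketched as follows. The source states only that both losses supervise the fine-tuning; the weighted sum, the `prior_weight` parameter, the mean-squared-error form, and the function names are assumptions in the style of DreamBooth, shown on toy vectors rather than real images.

```python
def mse(a, b):
    """Mean squared error between two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def fine_tune_loss(x_gen, x, x_pr_gen, x_pr, prior_weight=1.0):
    """Return (total, reconstruction, preservation) loss values.

    total = L_recon + prior_weight * L_preserve (assumed combination).
    """
    recon = mse(x_gen, x)            # identifier prompt vs. registered target image
    preserve = mse(x_pr_gen, x_pr)   # category prompt vs. original-category image
    return recon + prior_weight * preserve, recon, preserve


# Toy values standing in for flattened images.
total, recon, preserve = fine_tune_loss(
    x_gen=[0.5, 0.5], x=[1.0, 0.0],
    x_pr_gen=[0.2, 0.4], x_pr=[0.2, 0.0],
    prior_weight=0.5,
)
print(recon)  # 0.25
```

The preservation term penalises drift on the plain category prompt, which is what keeps the model able to generate ordinary members of the category after a specific target has been registered.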

[0067] Step S3: Set the shooting parameters of the text-to-image large model module based on the preset shooting parameter embedding module.

[0068] Specifically, the shooting parameter embedding module is a pre-trained text parameter embedding CLIP model. The shooting parameters of the text-to-image large model module are set through this module and include shooting viewpoint text and shooting height text. The module controls the shooting viewpoint and shooting height of the generated image in a text-guided manner. Let the shooting viewpoint text input be $t_{angle}$ and the shooting height text input be $t_{height}$. A cross-attention mechanism establishes a cross-modal association between text and image, completing the embedding of the shooting parameters.

[0069] The shooting parameter embedding module sets the shooting parameters of the text-to-image large model module as shown in formula (7):

[0070]

[0071] where $U(\cdot)$ denotes the Unet encoder of the text-to-image large model module; $T2(\cdot)$ denotes the pre-trained shooting parameter embedding module; $W_Q$, $W_K$, and $W_V$ denote the preset mapping matrices; $Q$, $K$, and $V$ denote the query, key, and value of the hybrid attention mechanism; $d_K$ denotes the dimension of the key matrix $K$; $\mathrm{Attention}(\cdot)$ denotes the hybrid attention mechanism; $\mathrm{softmax}(\cdot)$ denotes the softmax function; $T$ denotes the transpose; $t_{angle}$ denotes the shooting viewpoint text input; and $t_{height}$ denotes the shooting height text input. During training, the weights of the text-to-image large model module are frozen and only the weights of the shooting parameter embedding module are optimized, so as to align the shooting parameters.
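The attention computation named in these definitions can be sketched with scaled dot-product cross-attention in plain Python. Several points are assumptions rather than statements from the source: queries are taken from image features (the role of $U(\cdot)$), keys and values from the shooting parameter embeddings (the role of $T2(\cdot)$), a row-vector convention $Q = X W_Q$ is used, and the toy matrices and helper names are invented for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_attention(image_feats, text_feats, w_q, w_k, w_v):
    """Compute softmax(Q K^T / sqrt(d_K)) V and return (output, weights)."""
    q = matmul(image_feats, w_q)   # queries from image features
    k = matmul(text_feats, w_k)    # keys from shooting parameter embeddings
    v = matmul(text_feats, w_v)    # values from shooting parameter embeddings
    d_k = len(k[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
    weights = softmax_rows(scores)
    return matmul(weights, v), weights


identity = [[1.0, 0.0], [0.0, 1.0]]
text_feats = [[1.0, 0.0],    # toy embedding standing in for t_angle
              [0.0, 1.0]]    # toy embedding standing in for t_height
image_feats = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # three latent positions

output, weights = cross_attention(image_feats, text_feats, identity, identity, identity)
```

In this toy setup the first latent position attends mostly to the angle embedding, the second mostly to the height embedding, and the third attends to both equally.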

[0072] Step S4: Input the image description text containing the target identifier, together with the shooting parameters, into the fine-tuned text-to-image large model module to generate image data.

[0073] Specifically, after fine-tuning the training of the text-to-image large model module and the shooting parameter embedding module, by inputting specified shooting parameters and image description text containing specific target identifiers, the system can output images with specified target content, controllable viewpoint and height, thereby converting image samples from a planar viewpoint into drone aerial image samples. Furthermore, the image-to-video algorithm can be used to output aerial video samples, thus completing the construction of the aerial image / video dataset.
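The generation step can be pictured as a sweep over viewpoints and heights for one registered target. The sketch below only assembles the text inputs; the dictionary keys, the function name `build_requests`, and the example identifier, scene, angle, and height strings are all hypothetical, and the call into the fine-tuned model is deliberately left out.

```python
from itertools import product


def build_requests(identifier, scene, angles, heights):
    """Create one generation request per (viewpoint, height) combination."""
    requests = []
    for angle, height in product(angles, heights):
        requests.append({
            "prompt": f"{identifier} {scene}",  # description text with identifier
            "t_angle": angle,                   # shooting viewpoint text
            "t_height": height,                 # shooting height text
        })
    return requests


requests = build_requests(
    identifier="a sks car",
    scene="parked on a city street",
    angles=["top-down view", "45-degree oblique view"],
    heights=["30 meters", "60 meters", "120 meters"],
)
print(len(requests))  # 6
```

Because the identifier is held fixed while the shooting parameters vary, every request refers to the same registered target, which is what makes the resulting set a multi-view sample of one object.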

[0074] In summary, this invention proposes a method for generating aerial multi-view data based on an identifier-nested text-to-image large model. The method transforms data shot from planar viewpoints into UAV aerial photography data, maintaining the consistency of the generated content and controlling the shooting angle and altitude of the generated data. Ultimately, it augments aerial photography samples and addresses the problem of insufficient diversity in aerial photography data.

[0075] Furthermore, the identifier nesting module and the text-to-image large model module of this invention can register the appearance of a specified target using only 3 to 5 data samples. A reconstruction loss function and a prior preservation loss function are designed to fine-tune the text-to-image large model, enabling the model to remember targets associated with specific identifiers. Subsequently, the specified target is generated using the specific identifier, achieving consistent content generation.

[0076] Furthermore, the shooting parameter embedding module includes two conditional embeddings: shooting angle and shooting height. This module uses a cross-attention mechanism to establish a cross-modal association between text and image, enabling text-guided control of the shooting angle and shooting height of the generated image.

[0077] Furthermore, this invention designs a method for generating drone aerial video data, which takes the generated drone aerial multi-view image data as input and generates aerial video data from it, thereby obtaining video data with consistent content.

[0078] Those skilled in the art will recognize that the method steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of electronic hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in electronic hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the invention.

[0079] The term "comprising" or any other similar term is intended to cover non-exclusive inclusion, such that a parameter, method, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to those elements, method, or apparatus.

[0080] The above description describes the preferred embodiments of the present invention and the technical principles applied thereto. For those skilled in the art, any obvious changes such as equivalent transformations or simple substitutions based on the technical solutions of the present invention, without departing from the spirit and scope of the present invention, shall fall within the protection scope of the present invention.

Claims

1. A method for generating aerial multi-view data based on an identifier-nested text-to-image large model, characterized in that the method comprises: acquiring image samples from a planar perspective and registering the target objects to be generated using a preset identifier nesting module, so that each target object has an identifier consisting of a unique given label and an original category label; fine-tuning a pre-trained text-to-image large model module based on the image samples configured with identifiers, using a preset reconstruction loss function and a preset prior preservation loss function; setting the shooting parameters of the text-to-image large model module based on a preset shooting parameter embedding module; and inputting image description text containing the target identifier, together with the shooting parameters, into the fine-tuned text-to-image large model module to generate image data.

2. The aerial multi-view data generation method based on an identifier-nested text-to-image large model as described in claim 1, characterized in that the text-to-image large model module is a Stable Diffusion XL large model, which includes a variational autoencoder and a Unet; the variational autoencoder includes an encoder and a decoder; and the Stable Diffusion XL large model was pre-trained using the large dataset LAION-5B.

3. The aerial multi-view data generation method based on an identifier-nested text-to-image large model as described in claim 2, characterized in that the reconstruction loss function is shown in the following equation: where $L_{recon}$ denotes the reconstruction loss function; $\mathbb{E}$ denotes the average image reconstruction error over the entire training data distribution; $X_{gen}$ denotes the image generated based on the identifier nesting module; and $X$ denotes the ground-truth image of the target object to be registered.

4. The aerial multi-view data generation method based on an identifier-nested text-to-image large model as described in claim 3, characterized in that $X_{gen}$ is given by the following formula: where the function on the right-hand side denotes the pre-trained text-to-image large model module; $Z_t$ denotes the noise feature map generated by the encoder of the text-to-image large model module at step $t$; $C_{emb} = T1(t_{emb})$ denotes the text constraint processed by the identifier nesting module; $T1(\cdot)$ denotes the pre-trained text-input CLIP model in the text-to-image large model module; and $t_{emb}$ denotes the image description text containing the identifier.

5. The aerial multi-view data generation method based on an identifier-nested text-to-image large model as described in claim 4, characterized in that, while the text-to-image large model module is fine-tuned with the reconstruction loss function, its denoising optimization loss function for training is shown in the following formula: where $L_{gen}$ denotes the denoising optimization loss function used to train the text-to-image large model module; $\mathbb{E}$ denotes the average image denoising optimization error over the entire training data distribution; $Z_0$ denotes the latent-space intermediate vector obtained when the encoder of the text-to-image large model module downsamples $X$; the step-dependent coefficient is the noise amplitude hyperparameter; $\epsilon_\theta(Z_t, C_{emb})$ denotes the noise predicted by the Unet of the text-to-image large model module at step $t$; and $\epsilon$ denotes the ground-truth noise under the text constraint.

6. The aerial multi-view data generation method based on an identifier-nested text-to-image large model as described in claim 5, characterized in that the prior preservation loss function is shown in the following equation: where $L_{preserve}$ denotes the prior preservation loss function; $\mathbb{E}$ denotes the average prior preservation error over the entire training data distribution; $X_{pr\text{-}gen}$ denotes the original-category image generated based on the identifier nesting module; and $X_{pr}$ denotes the original-category ground-truth image of the target object to be registered.

7. The aerial multi-view data generation method based on an identifier-nested text-to-image large model as described in claim 6, characterized in that $X_{pr\text{-}gen}$ is given by the following formula: where the function on the right-hand side denotes the pre-trained text-to-image large model module; $C_{pr} = T1(t_{pr})$ denotes the original-category text constraint processed by the identifier nesting module; and $t_{pr}$ denotes the generation text for the original category.

8. The aerial multi-view data generation method based on an identifier-nested text-to-image large model as described in claim 7, characterized in that, while the text-to-image large model module is fine-tuned with the prior preservation loss function, its denoising optimization loss function for training is as follows: where $L_{pr\text{-}gen}$ denotes the denoising optimization loss function used to train the text-to-image large model module; $\mathbb{E}$ denotes the average denoising optimization error over the entire training data distribution; and $\epsilon_\theta(Z_t, C_{pr})$ is the noise term under the text constraint $C_{pr}$, evaluated against the noise ground truth under the text constraint with nested identifiers.

9. The aerial multi-view data generation method based on an identifier-nested text-to-image large model as described in claim 2, characterized in that the shooting parameters include shooting angle text and shooting height text; the shooting parameter embedding module is a pre-trained text parameter embedding CLIP model; and the shooting parameter embedding module sets the shooting parameters of the text-to-image large model module in the following manner: where $U(\cdot)$ denotes the Unet encoder of the text-to-image large model module; $T2(\cdot)$ denotes the pre-trained shooting parameter embedding module; $W_Q$, $W_K$, and $W_V$ denote the preset mapping matrices; $Q$, $K$, and $V$ denote the query, key, and value of the hybrid attention mechanism; $d_K$ denotes the dimension of the key matrix $K$; $\mathrm{Attention}(\cdot)$ denotes the hybrid attention mechanism; $\mathrm{softmax}(\cdot)$ denotes the softmax function; $T$ denotes the transpose; $t_{angle}$ denotes the shooting angle text input; and $t_{height}$ denotes the shooting height text input.

10. The aerial multi-view data generation method based on an identifier-nested text-to-image large model as described in claim 1, characterized in that the identifier nesting module registers the target object to be generated based on Dreambooth technology.