Panorama image generation method and electronic device
By iteratively fine-tuning the cross-attention processing modules of a panoramic image generation model and performing iterative denoising conditioned on text description information, the problems of high computational overhead and high training cost of panoramic image generation models are addressed, and high-quality panoramic images are generated efficiently.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING HUMANOID ROBOTICS INNOVATION CENTER CO LTD
- Filing Date
- 2026-03-23
- Publication Date
- 2026-06-23
Figure CN122265027A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of image processing technology, and more specifically, to a panoramic image generation method and an electronic device.
Background Technology
[0002] Panoramic image generation is an important research direction in the field of computer vision, aiming to create images that can capture a complete 360-degree field of view. Due to the 2:1 aspect ratio and unique spherical distortion features of panoramic images, which have significant structural differences from traditional perspective images, coupled with the scarcity of high-quality panoramic datasets, directly training panoramic image generation models faces enormous challenges.
[0003] In existing technologies, panoramic image generation models typically employ a dual-branch architecture. Specifically, the panoramic image generation model includes a panoramic branch and a perspective branch, which interact with each other through a special cross-branch attention module. During training, all attention layers in both the panoramic and perspective branches are fine-tuned. Simultaneously, the perspective view generated by the perspective branch is used to constrain the output of the panoramic branch.
[0004] However, this approach suffers from high computational overhead and excessively long training time due to its dual-branch architecture, resulting in high training costs.
Summary of the Invention
[0005] The purpose of this application is to provide a panoramic image generation method and electronic device to address the shortcomings of the prior art, thereby solving the problem of high training costs in the prior art.
[0006] To achieve the above objectives, the technical solutions adopted in the embodiments of this application are as follows. In a first aspect, an embodiment of this application provides a panoramic image generation method, the method comprising: acquiring multiple sample panoramic images and generating multiple training data pairs based on the sample panoramic images, wherein each training data pair includes a sample panoramic image and the text description information corresponding to that sample panoramic image; based on the training data pairs, iteratively fine-tuning at least one target weight matrix in each cross-attention processing module of an initial panoramic image generation model to obtain a target panoramic image generation model, wherein the initial panoramic image generation model includes at least a plurality of the cross-attention processing modules; and obtaining the text description information corresponding to a target panoramic image and inputting the text description information into the target panoramic image generation model, whereupon the target panoramic image generation model performs iterative denoising based on a randomly generated initial noise image and the text description information to generate the target panoramic image.
[0007] In a second aspect, another embodiment of this application provides a panoramic image generation apparatus, the apparatus comprising: a generation module, used to acquire multiple sample panoramic images and generate multiple training data pairs based on the sample panoramic images, wherein each training data pair includes a sample panoramic image and the text description information corresponding to that sample panoramic image; a fine-tuning module, used to iteratively fine-tune at least one target weight matrix in each cross-attention processing module of an initial panoramic image generation model according to the training data pairs, so as to obtain a target panoramic image generation model, wherein the initial panoramic image generation model includes at least a plurality of the cross-attention processing modules; and an inference module, used to obtain text description information corresponding to a target panoramic image and input the text description information into the target panoramic image generation model, whereupon the target panoramic image generation model performs iterative denoising based on a randomly generated initial noise image and the text description information to generate the target panoramic image.
[0008] In a third aspect, another embodiment of this application provides an electronic device, including: a processor, a storage medium, and a bus, wherein the storage medium stores machine-readable instructions executable by the processor, and when the electronic device is running, the processor communicates with the storage medium via the bus, and the processor executes the machine-readable instructions to perform the steps of any of the methods described in the first aspect above.
[0009] In a fourth aspect, another embodiment of this application provides a computer-readable storage medium storing a computer program that, when executed by a processor, performs the steps of any of the methods described in the first aspect above.
[0010] The beneficial effects of this application are as follows. Multiple sample panoramic images are acquired, multiple training data pairs are generated based on the sample panoramic images, and at least one target weight matrix in each cross-attention processing module of the initial panoramic image generation model is iteratively fine-tuned based on the training data pairs to obtain a target panoramic image generation model. This allows the target panoramic image generation model to learn the spherical distortion of panoramic images at a lower training cost while retaining the original ability of the initial panoramic image generation model to generate ordinary perspective views. Text description information corresponding to the target panoramic image is then obtained and input into the target panoramic image generation model, which performs iterative denoising based on a randomly generated initial noise image and the text description information to generate the target panoramic image. The method can thus generate visually correct, semantically matched, high-quality panoramic images for a given text prompt. Moreover, the generated panoramic images are not limited to the training data, and the method adapts well to complex and unseen text prompts. Furthermore, the method is independent of the base model and can be extended to higher-resolution base models to generate higher-quality images. In addition, it has advantages such as high data efficiency, fast inference speed, and ease of integration.
Attached Figure Description
[0011] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of this application and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.
[0012] Figure 1 is a schematic flowchart of a panoramic image generation method provided in an embodiment of this application;
Figure 2 is a schematic diagram of the structure of an initial panoramic image generation model in the panoramic image generation method provided in an embodiment of this application;
Figure 3 is a schematic flowchart of obtaining a target panoramic image generation model in the panoramic image generation method provided in an embodiment of this application;
Figure 4 is a schematic flowchart of calculating the prediction noise corresponding to each current training data pair in the panoramic image generation method provided in an embodiment of this application;
Figure 5 is a schematic diagram of another structure of the initial panoramic image generation model in the panoramic image generation method provided in an embodiment of this application;
Figure 6 is a schematic flowchart of determining cross features in the panoramic image generation method provided in an embodiment of this application;
Figure 7 is a schematic diagram of another structure of the initial panoramic image generation model in the panoramic image generation method provided in an embodiment of this application;
Figure 8 is a schematic flowchart of determining cross features in the panoramic image generation method provided in an embodiment of this application;
Figure 9 is a schematic diagram of another structure of the initial panoramic image generation model in the panoramic image generation method provided in an embodiment of this application;
Figure 10 is a schematic flowchart of calculating the output incremental features in the panoramic image generation method provided in an embodiment of this application;
Figure 11 is another schematic flowchart of calculating the output incremental features in the panoramic image generation method provided in an embodiment of this application;
Figure 12 is a schematic diagram of the structure of the electronic device provided in an embodiment of this application.
Detailed Implementation
[0013] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. It should be understood that the accompanying drawings in this application are for illustrative and descriptive purposes only and are not intended to limit the scope of protection of this application. Furthermore, it should be understood that the schematic drawings are not drawn to scale. The flowcharts used in this application illustrate operations implemented according to some embodiments of this application. It should be understood that the operations in the flowcharts may not be implemented in sequence, and steps without logical contextual relationships may be reversed or implemented simultaneously. In addition, those skilled in the art, guided by the content of this application, may add one or more other operations to the flowcharts, or remove one or more operations from the flowcharts.
[0014] Furthermore, the described embodiments are merely some, not all, of the embodiments of this application. The components of the embodiments of this application described and illustrated herein can typically be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of this application provided in the accompanying drawings is not intended to limit the scope of the claimed application, but merely to illustrate selected embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of this application without inventive effort are within the scope of protection of this application.
[0015] It should be noted that the term "comprising" will be used in the embodiments of this application to indicate the presence of the features declared thereafter, but does not exclude the addition of other features.
[0016] It should be understood that panoramic images have a 2:1 aspect ratio and use an equirectangular projection, so their structure differs greatly from that of ordinary perspective images. Furthermore, panoramic image data is scarce (for example, Matterport3D has only about 10,000 images), making it difficult to train a generation model from scratch.
[0017] In existing technologies, panoramic image generation models typically employ a dual-branch architecture. Specifically, the panoramic image generation model includes a panoramic branch and a perspective branch, which interact with each other through a special cross-branch attention module. During training, all attention layers in both the panoramic and perspective branches are fine-tuned. Simultaneously, the perspective view generated by the perspective branch is used to constrain the output of the panoramic branch.
[0018] However, this approach suffers from high computational overhead and excessively long training time due to its dual-branch architecture, resulting in high training costs. Furthermore, it lacks interpretability, cannot separate general-purpose and specialized capabilities, and has limited generalization ability.
[0019] Based on the aforementioned problems, this application proposes a panoramic image generation method. The method acquires multiple sample panoramic images and generates multiple training data pairs based on the sample panoramic images. Then, based on the training data pairs, it iteratively fine-tunes at least one target weight matrix in each cross-attention processing module of the initial panoramic image generation model to obtain a target panoramic image generation model. In this way, the target panoramic image generation model gains panoramic generation capability while only a small part of the model is modified. The method also obtains text description information corresponding to the target panoramic image and inputs this text description information into the target panoramic image generation model, which then performs iterative denoising based on a randomly generated initial noise image and the text description information to generate the target panoramic image. This method achieves high-quality panoramic image generation and can be extended to high-resolution generation. Furthermore, it has advantages such as high training stability, high data efficiency, fast inference speed, and ease of integration.
[0020] First, the relevant concepts involved in the panoramic image generation method provided in the embodiments of this application will be explained.
[0021] A panoramic image is a complete spherical view with a horizontal 360° and a vertical 180° field of view. Panoramic images have an aspect ratio of 2:1.
[0022] Spherical distortion refers to the phenomenon in panoramic images where objects appear increasingly elongated closer to the top and bottom poles. This is caused by the fact that projecting a sphere onto a plane inevitably results in stretching.
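As an illustration of the two concepts above, the following minimal sketch assumes the equirectangular projection mentioned in this application. It maps a pixel to longitude and latitude and evaluates the horizontal stretch factor 1/cos(latitude), which grows without bound toward the poles. The function names `pixel_to_lonlat` and `horizontal_stretch` are hypothetical and are not part of the described method.

```python
import math

def pixel_to_lonlat(x, y, width, height):
    """Map the centre of pixel (x, y) to (longitude, latitude) in degrees.

    Longitude spans 360 degrees across the width and latitude spans
    180 degrees across the height, hence the 2:1 aspect ratio.
    """
    lon = (x + 0.5) / width * 360.0 - 180.0
    lat = 90.0 - (y + 0.5) / height * 180.0
    return lon, lat

def horizontal_stretch(lat_deg):
    """Horizontal stretch of an equirectangular image at a given latitude.

    A circle of latitude has circumference proportional to cos(latitude),
    yet it always fills the full image width, so content is stretched
    by 1 / cos(latitude).
    """
    return 1.0 / math.cos(math.radians(lat_deg))

# Stretch at the equator, at mid latitude, and near a pole.
stretch_samples = {lat: horizontal_stretch(lat) for lat in (0.0, 60.0, 85.0)}
```

At the equator the factor is 1 (no stretch), and at 60° latitude it is about 2, which matches the elongation near the top and bottom of a panoramic image.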
[0023] A diffusion model is a model that starts with random noise, gradually removes noise, and finally generates a clear image; a latent diffusion model is a diffusion model that diffuses in the latent space (compressed feature space) rather than the pixel space.
[0024] Based on this, the panoramic image generation method provided in the embodiments of this application will be described below.
[0025] It is understood that the panoramic image generation method provided in this application embodiment can be applied to any electronic device with processing capabilities, and this application embodiment does not impose any limitations on it.
[0026] Figure 1 is a schematic flowchart of a panoramic image generation method provided in an embodiment of this application. As shown in Figure 1, the method can be executed by any electronic device with processing capabilities, and the method includes:
S101. Acquire multiple sample panoramic images and generate multiple training data pairs based on each sample panoramic image.
[0027] Optionally, multiple sample panoramic images can be acquired, wherein the sample panoramic image refers to a panoramic image used as a sample, and the sample panoramic image can be a real 360° panoramic image.
[0028] Optionally, after the sample panoramic images are obtained, text description information corresponding to each sample panoramic image can be generated based on that sample panoramic image, thereby obtaining multiple training data pairs.
[0029] Each training data pair includes: a sample panoramic image and corresponding text description information.
[0030] For example, after obtaining the sample panoramic image, it can be input into a pre-trained text description generation model, which then generates text description information corresponding to the sample panoramic image based on the sample panoramic image. The text description generation model can be a BLIP-2 model.
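Step S101 can be sketched as follows. This is a minimal illustration only: `TrainingPair`, `build_training_pairs`, and `dummy_captioner` are hypothetical names, and the dummy captioner merely stands in for a pre-trained text description generation model such as BLIP-2, whose actual interface is not shown here.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TrainingPair:
    image_path: str   # the sample panoramic image
    caption: str      # its text description information

def build_training_pairs(
    image_paths: List[str], captioner: Callable[[str], str]
) -> List[TrainingPair]:
    """Pair every sample panoramic image with a generated description."""
    return [TrainingPair(path, captioner(path)) for path in image_paths]

def dummy_captioner(path: str) -> str:
    """Placeholder for a captioning model; derives text from the file name."""
    scene = path.rsplit("/", 1)[-1].split(".")[0]
    return f"a 360 degree panorama of {scene}"

pairs = build_training_pairs(["data/kitchen.jpg", "data/beach.jpg"], dummy_captioner)
```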
[0031] S102. Based on each training data pair, iteratively fine-tune at least one target weight matrix in each cross-attention processing module of the initial panoramic image generation model to obtain the target panoramic image generation model.
[0032] Optionally, after obtaining each training data pair, at least one target weight matrix in each cross-attention processing module of the initial panoramic image generation model can be iteratively fine-tuned using each training data pair, and the target panoramic image generation model can be obtained after the iteration is completed.
[0033] The initial panoramic image generation model includes at least several cross-attention processing modules. The target panoramic image generation model refers to a single-branch latent diffusion model used to generate panoramic images.
[0034] For example, the initial panoramic image generation model can be a Stable Diffusion model. Based on each training data pair, at least one target weight matrix in each cross-attention processing module of the Stable Diffusion model can be iteratively fine-tuned, and after the iteration is completed, the target panoramic image generation model is obtained.
[0035] Fine-tuning refers to further training a pre-trained model using data from a specific task, adjusting some or all of its parameters to adapt the model to the new task. In this application's embodiments, fine-tuning includes partial fine-tuning and parameter-efficient fine-tuning (PEFT).
[0036] For example, at least one target weight matrix in each cross-attention processing module of the Stable Diffusion model can be fine-tuned using LoRA or Adapter through each training data pair.
[0037] The target weight matrix refers to a weight matrix in the cross-attention processing module that is used to achieve spatially aware rendering of panoramic images.
[0038] For example, the target weight matrix includes at least one of the following: a value weight matrix and an output weight matrix. Specifically, the value weight matrix is used to implement spatial modulation of the content of the panoramic image, and the output weight matrix is used to implement adaptive projection of the space of the panoramic image.
[0039] It is worth noting that the cross-attention processing module is implemented based on the cross-attention mechanism, which includes a query matrix, a key matrix, a value matrix, and an output matrix. In the panoramic image generation scenario, which of the query matrix, key matrix, value matrix, and output matrix in the cross-attention processing module serve as target weight matrices can be determined in advance.
[0040] Specifically, the process of determining which of the query matrix, key matrix, value matrix, and output matrix in the cross-attention processing module serve as target weight matrices includes: obtaining an initial panoramic image generation model; performing isolated training on the query matrix, key matrix, value matrix, and output matrix in the initial panoramic image generation model to obtain the isolated training result of each matrix; and determining, based on the isolated training results, at least one first matrix that is able to learn the panoramic structure independently. Isolated training refers to fine-tuning each matrix individually while keeping the original pre-trained weights of the other matrices. The isolated training result of a matrix indicates whether the model can generate a panoramic image with correct spherical distortion when only that matrix is trained. If the isolated training result of a matrix indicates that a panoramic image with correct spherical distortion can be generated, that matrix is used as a first matrix.
[0041] Based on this, the query matrix, key matrix, value matrix, and output matrix in the initial panoramic image generation model are jointly fine-tuned to obtain a valid panoramic image generation model. After the valid panoramic image generation model is obtained, it is analyzed by decomposition. Specifically, in the valid panoramic image generation model, the fine-tuning corresponding to each first matrix is disabled in turn to verify whether that first matrix is a knowledge carrier corresponding to the panoramic image. If so, that first matrix is used as a target weight matrix.
[0042] For example, taking the value weight matrix as the first matrix, verifying whether the first matrix is a knowledge carrier corresponding to the panoramic image includes: in the valid panoramic image generation model, training is performed with the fine-tuning corresponding to the value weight matrix turned off to obtain an intermediate panoramic image generation model, and it is determined whether the intermediate panoramic image generation model retains the panoramic capability. If so, the value weight matrix is determined to be a knowledge carrier corresponding to the panoramic image.
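The two-stage selection described above can be sketched as a simple filter. This is only an outline under stated assumptions: `select_target_matrices` is a hypothetical helper, and the two predicates stand in for the isolated training check and the knowledge carrier verification, each of which would in practice involve fine-tuning runs and inspection of generated images. The boolean values used below are placeholders for demonstration, not experimental results.

```python
from typing import Callable, List

MATRICES = ["query", "key", "value", "output"]

def select_target_matrices(
    learns_panorama_in_isolation: Callable[[str], bool],
    is_knowledge_carrier: Callable[[str], bool],
) -> List[str]:
    """Return the matrices that pass both stages of the selection."""
    # Stage 1: isolated training. Keep each matrix that, when fine-tuned
    # alone with all other matrices at their pre-trained weights, yields
    # panoramic images with correct spherical distortion.
    first_matrices = [m for m in MATRICES if learns_panorama_in_isolation(m)]
    # Stage 2: verification on the jointly fine-tuned model. Keep each
    # first matrix that is confirmed as a knowledge carrier.
    return [m for m in first_matrices if is_knowledge_carrier(m)]

# Placeholder outcomes (assumed for illustration only).
isolated_result = {"query": False, "key": False, "value": True, "output": True}
carrier_result = {"value": True, "output": True}

targets = select_target_matrices(
    isolated_result.__getitem__,
    lambda name: carrier_result.get(name, False),
)
```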
[0043] S103. Obtain the text description information corresponding to the target panoramic image, and input the text description information into the target panoramic image generation model. The target panoramic image generation model performs iterative denoising based on the randomly generated initial noise image and the text description information to generate the target panoramic image.
[0044] Optionally, after obtaining the target panoramic image generation model, the text description information corresponding to the target panoramic image can be obtained and input into the target panoramic image generation model. The target panoramic image generation model then performs iterative denoising inference based on the randomly generated initial noisy image and the text description information to generate the target panoramic image.
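The iterative denoising inference in S103 can be illustrated with a toy loop. The application does not specify a sampler, so this sketch assumes a deterministic DDIM-style update and a linear noise schedule. `toy_noise_predictor` is a hypothetical stand-in for the target panoramic image generation model: it pretends the clean latent is fully determined by the text vector, which makes the loop easy to check but is not how a real model behaves.

```python
import numpy as np

def toy_noise_predictor(x_t, alpha_bar_t, text_vec):
    """Stand-in for the fine-tuned model: returns the noise separating x_t
    from a clean latent determined by the text vector (toy rule)."""
    target = np.broadcast_to(text_vec[:, None, None], x_t.shape)
    return (x_t - np.sqrt(alpha_bar_t) * target) / np.sqrt(1.0 - alpha_bar_t)

def generate_latent(text_vec, shape=(4, 8, 16), num_steps=20, seed=0):
    """Denoise a random latent step by step, conditioned on text_vec."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, 1000)            # assumed noise schedule
    alpha_bar = np.cumprod(1.0 - betas)
    timesteps = np.linspace(999, 0, num_steps).astype(int)

    x = rng.standard_normal(shape)                   # random initial noise image
    for i, t in enumerate(timesteps):
        a_t = alpha_bar[t]
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        eps = toy_noise_predictor(x, a_t, text_vec)              # predicted noise
        x0 = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)       # clean estimate
        x = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps   # less noisy latent
    return x

text_vec = np.array([0.5, -1.0, 0.25, 2.0])   # toy text vector, one value per channel
latent = generate_latent(text_vec)            # latent with a 2:1 spatial layout
```

In a latent diffusion model the resulting latent would still have to be decoded into pixel space to obtain the target panoramic image; that step is omitted here.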
[0045] In this embodiment, multiple sample panoramic images are acquired, and multiple training data pairs are generated based on each sample panoramic image. Based on each training data pair, at least one target weight matrix in each cross-attention processing module of the initial panoramic image generation model is iteratively fine-tuned to obtain the target panoramic image generation model. This allows the target panoramic image generation model to learn about spherical distortion related to panoramic images at a lower training cost. Simultaneously, it retains the original ability of the initial panoramic image generation model to generate ordinary perspective views. Textual description information corresponding to the target panoramic image is acquired and input into the target panoramic image generation model. The panoramic image generation model iteratively denoises based on randomly generated initial noisy images and textual description information to generate the target panoramic image. It can achieve visually correct and semantically matched high-quality panoramic images under any given text prompt. Furthermore, the generated panoramic images are not limited to the training data and have good adaptability to complex and unseen text prompts. Moreover, this method is independent of the base model and can be easily extended to higher resolution base models to generate higher-quality images. In addition, it has advantages such as high data efficiency, fast inference speed, and ease of integration.
[0046] In one possible implementation, the target weight matrix includes a value weight matrix and an output weight matrix.
[0047] Optionally, the value weight matrix and output weight matrix in each cross-attention processing module can be adjusted simultaneously. The value weight matrix is used to modulate the semantic content according to the location, and the output weight matrix is used to adaptively project the modulated content onto the target spatial location, thereby achieving accurate rendering of spherical distortion.
[0048] By simultaneously adjusting the value weight matrix and the output weight matrix in each cross-attention processing module, spatially aware semantic rendering is achieved, thereby realizing distortion-aware feature projection and completing the full mapping from "what to draw" to "where to draw and how to deform".
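The role of the value and output weight matrices can be illustrated with a single-head cross-attention sketch. The dimensions, random weights, and names such as `cross_attention` are assumptions for illustration. The sketch shows one structural fact: the attention map depends only on the query and key matrices, so adjusting only the value and output matrices changes the rendered features while leaving the text-to-position alignment unchanged.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(latent_tokens, text_tokens, W_q, W_k, W_v, W_o):
    """Single-head cross-attention: queries from the image latent,
    keys and values from the text vector."""
    q = latent_tokens @ W_q                            # (N, d_head)
    k = text_tokens @ W_k                              # (M, d_head)
    v = text_tokens @ W_v                              # (M, d_head)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))     # (N, M) attention map
    out = (attn @ v) @ W_o                             # (N, d_latent)
    return out, attn

rng = np.random.default_rng(0)
d_latent, d_text, d_head = 8, 6, 4                     # illustrative sizes
latent_tokens = rng.normal(size=(32, d_latent))        # 32 spatial positions
text_tokens = rng.normal(size=(5, d_text))             # 5 text tokens
W_q = rng.normal(size=(d_latent, d_head))              # frozen
W_k = rng.normal(size=(d_text, d_head))                # frozen
W_v = rng.normal(size=(d_text, d_head))                # target weight matrix
W_o = rng.normal(size=(d_head, d_latent))              # target weight matrix

out_before, attn_before = cross_attention(
    latent_tokens, text_tokens, W_q, W_k, W_v, W_o)

# Simulate fine-tuning by perturbing only the value and output matrices.
W_v_tuned = W_v + 0.1 * rng.normal(size=W_v.shape)
W_o_tuned = W_o + 0.1 * rng.normal(size=W_o.shape)
out_after, attn_after = cross_attention(
    latent_tokens, text_tokens, W_q, W_k, W_v_tuned, W_o_tuned)
```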
[0049] In one possible implementation, Figure 2 is a schematic diagram of the structure of an initial panoramic image generation model in the panoramic image generation method provided in this application embodiment, and Figure 3 is a schematic flowchart illustrating the process of obtaining a target panoramic image generation model in that method. As shown in Figure 2 and Figure 3, the initial panoramic image generation model further includes: a text encoder module, a VAE encoder module, and an attention diffusion module. The attention diffusion module includes multiple attention processing modules, each of which includes at least a self-attention processing module and a cross-attention processing module. In S102 above, iteratively fine-tuning, based on each training data pair, at least one target weight matrix in each cross-attention processing module of the initial panoramic image generation model to obtain the target panoramic image generation model includes:
S301. In the current iteration, select multiple current training data pairs from the training data pairs, input the text description information of each current training data pair into the text encoder module, and generate the text vector corresponding to each current training data pair according to the text description information.
[0050] Optionally, taking the current iteration as an example, multiple current training data pairs can be selected from the training data pairs. A current training data pair is a training data pair used in the current iteration.
[0051] Optionally, the text description information of each current training data pair is input into the text encoder module, which converts each piece of text description information into a vector representation to generate the text vector (text embedding) corresponding to each current training data pair.
[0052] For example, four current training data pairs can be selected from the training data pairs for training in the current iteration.
[0053] S302. Input the sample panoramic image of each current training data pair into the VAE encoder module. The VAE encoder module performs latent space encoding on each sample panoramic image to obtain the real latent features corresponding to each current training data pair. Then, add preset real noise to the real latent features corresponding to each current training data pair to generate the noisy real latent features corresponding to each current training data pair.
[0054] Optionally, the sample panoramic image of each current training data pair is input into the VAE encoder module, which performs latent space encoding on each sample panoramic image, converting it from pixel space to latent space to obtain the real latent features corresponding to each current training data pair. Preset real noise is then added to the real latent features corresponding to each current training data pair to generate the noisy real latent features corresponding to each current training data pair.
[0055] The same real noise can be added to the real latent features of every current training data pair, or different real noise can be added to each real latent feature.
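The noise-adding part of S302 can be sketched as follows. The application only states that preset real noise is added; the specific mixing rule used here, the standard forward process of denoising diffusion models with a cumulative coefficient `alpha_bar_t`, is an assumption, as are the tensor shapes and the name `add_noise`.

```python
import numpy as np

def add_noise(latents, noise, alpha_bar_t):
    """Mix clean latents with noise: sqrt(a) * latent + sqrt(1 - a) * noise.

    latents, noise: arrays of shape (batch, channels, height, width).
    alpha_bar_t: one coefficient in [0, 1] per batch element (assumed form).
    """
    a = np.asarray(alpha_bar_t, dtype=latents.dtype).reshape(-1, 1, 1, 1)
    return np.sqrt(a) * latents + np.sqrt(1.0 - a) * noise

rng = np.random.default_rng(0)
real_latents = rng.normal(size=(4, 4, 8, 16))    # four current training data pairs
real_noise = rng.standard_normal(real_latents.shape)  # different noise per pair
alpha_bar = np.array([0.9, 0.5, 0.2, 0.05])      # illustrative noise levels
noisy_latents = add_noise(real_latents, real_noise, alpha_bar)
```

To add the same real noise to every pair instead, a single noise array of shape (1, 4, 8, 16) could be broadcast across the batch.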
[0056] S303. Input the text vector corresponding to each current training data pair and the noisy real latent features corresponding to each current training data pair into each attention processing module. The self-attention processing module and the cross-attention processing module in each attention processing module calculate the prediction noise corresponding to each current training data pair.
[0057] Optionally, the text vector corresponding to each current training data pair and the noisy real latent features corresponding to each current training data pair are input into each attention processing module. The self-attention processing module and the cross-attention processing module in each attention processing module perform forward propagation based on the frozen parameters of the self-attention processing module, the frozen parameters of the cross-attention processing module, and the current parameters of each target weight matrix in the cross-attention processing module in the current iteration, to obtain the prediction noise corresponding to each current training data pair.
[0058] For example, each attention processing module can perform forward propagation on the text vector and the noisy real latent features corresponding to a current training data pair to obtain the prediction noise corresponding to that current training data pair.
[0059] The attention processing modules can be further subdivided into downsampling blocks, middle blocks, and upsampling blocks according to their positions in the initial panoramic image generation model. Each attention processing module may include a Transformer block, which in turn includes a self-attention processing module and a cross-attention processing module.
[0060] Optionally, each Transformer block may also include multiple normalization layers and a feedforward network. The first normalization layer, the self-attention processing module, the second normalization layer, the cross-attention processing module, and the feedforward network are connected in sequence.
[0061] S304. Based on the predicted noise corresponding to each current training data pair, the real noise corresponding to each current training data pair, and the preset loss function, determine the loss result, and adjust the value weight matrix and output weight matrix in the cross-attention processing module according to the loss result.
[0062] Optionally, after obtaining the prediction noise corresponding to each current training data pair, the loss information corresponding to each current training data pair can be calculated based on the prediction noise, the real noise, and the preset loss function. The loss information corresponding to each current training data pair can then be summed to obtain the loss result.
[0063] Optionally, the value weight matrix and output weight matrix in the cross-attention processing module can be adjusted based on the loss result.
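The loss computation in S304 can be sketched as below. The application refers only to a preset loss function; the mean squared error between the prediction noise and the real noise used here is an assumption, chosen because it is the usual objective for noise-prediction diffusion models. The name `noise_prediction_loss` is hypothetical.

```python
import numpy as np

def noise_prediction_loss(pred_noise, real_noise):
    """Per-pair mean squared error, summed over the current training pairs.

    pred_noise, real_noise: arrays of shape (batch, channels, height, width).
    Returns (loss_result, per_pair_loss).
    """
    batch = pred_noise.shape[0]
    per_pair = ((pred_noise - real_noise) ** 2).reshape(batch, -1).mean(axis=1)
    return per_pair.sum(), per_pair

rng = np.random.default_rng(0)
real_noise = rng.standard_normal((4, 4, 8, 16))
pred_noise = real_noise + 0.1 * rng.standard_normal(real_noise.shape)
loss_result, per_pair_loss = noise_prediction_loss(pred_noise, real_noise)
```

The gradient of `loss_result` would then be propagated only to the value weight matrix and output weight matrix, with all other parameters kept frozen.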
[0064] For example, based on the loss result, the value weight matrix and the output weight matrix in the cross-attention processing module are fine-tuned using LoRA with a rank of 4.
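A rank-4 LoRA update of a single weight matrix can be sketched as follows. The class name `LoRALinear`, the scaling factor `alpha`, the initialization, and the matrix sizes are assumptions for illustration; only the rank of 4 comes from the example above. The frozen weight stays untouched, and only the two low-rank factors `A` and `B` would be trained.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update scale * (A @ B)."""

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        d_in, d_out = weight.shape
        self.weight = weight                               # frozen, pre-trained
        self.A = rng.normal(0.0, 0.01, size=(d_in, rank))  # trainable
        self.B = np.zeros((rank, d_out))                   # trainable, starts at zero
        self.scale = alpha / rank

    def __call__(self, x):
        # Base projection plus the low-rank correction.
        return x @ self.weight + self.scale * (x @ self.A) @ self.B

    def trainable_params(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(1)
W_v = rng.normal(size=(768, 320))       # illustrative value weight matrix size
value_proj = LoRALinear(W_v, rank=4)
text_tokens = rng.normal(size=(5, 768))
projected = value_proj(text_tokens)     # equals text_tokens @ W_v before training
```

With these illustrative sizes the adapter has 768 × 4 + 4 × 320 = 4,352 trainable parameters, compared with 245,760 in the full matrix. Because `B` starts at zero, the model output is unchanged before the first update.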
[0065] By implementing iterative fine-tuning of the target weight matrix in the latent space through the VAE encoder module, the computational load can be greatly reduced. Furthermore, the noise in the latent space is closer to a Gaussian distribution, resulting in better prediction performance, easier denoising, and more stable training.
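The reduction in computational load can be made concrete with a short calculation. The numbers below assume a 512 × 1024 RGB panoramic image and a VAE that downsamples each spatial dimension by a factor of 8 into 4 latent channels, as in commonly used Stable Diffusion VAEs; the application itself does not state these values.

```python
def element_count(height, width, channels):
    """Number of values in an image or latent tensor."""
    return height * width * channels

DOWNSAMPLE = 8        # assumed VAE spatial downsampling factor
LATENT_CHANNELS = 4   # assumed number of latent channels

pixel_elements = element_count(512, 1024, 3)
latent_elements = element_count(512 // DOWNSAMPLE, 1024 // DOWNSAMPLE, LATENT_CHANNELS)
reduction = pixel_elements / latent_elements
```

Under these assumptions the latent holds 32,768 values instead of 1,572,864, a 48-fold reduction in the amount of data each denoising step has to process.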
[0066] In one possible implementation, Figure 4 is a schematic flowchart of calculating the prediction noise corresponding to each current training data pair in the panoramic image generation method provided in this application embodiment. As shown in Figure 4, in S303 above, calculating the prediction noise corresponding to each current training data pair by the self-attention processing module and the cross-attention processing module in each attention processing module includes:
S401. The self-attention processing module determines the self-attention features based on the noisy real latent features.
[0067] Optionally, the self-attention processing module learns the features inside the image based on the noisy real latent features to obtain self-attention features.
[0068] The self-attention feature is used to characterize the relationships within the image. The parameters in the self-attention processing module remain frozen.
[0069] For example, if the attention processing module is the first attention processing module, it processes the noisy real latent features to obtain the self-attention features.
[0070] For example, if the attention processing module is not the first attention processing module, it processes the features output by the previous attention processing module to obtain the self-attention features.
[0071] S402. Perform residual connections between the self-attention features and the noisy real latent features to obtain fused latent features.
[0072] Optionally, the self-attention features and the noisy real latent features are residually connected to obtain fused latent features, thereby ensuring that the gradient can be directly backpropagated and the original information is preserved.
[0073] S403. The cross-attention processing module determines the cross features based on the text vector and the fused latent features.
[0074] Optionally, the cross-attention processing module determines the cross features based on the text vector and the fused latent features.
[0075] Cross features are used to characterize the fusion result of each image pixel and its associated semantic information. Cross features can indicate what content each pixel should present and in what way.
[0076] S404. Based on the self-attention features and cross features, the predicted noise is obtained.
[0077] Optionally, the sum of self-attention features, cross features, and noisy real latent features can be calculated and processed through a feedforward network to obtain the prediction noise.
[0078] For example, if the attention processing module is not the last attention processing module, the sum of the self-attention features, cross features, and noisy real latent features can be used as the input of the next attention processing module, and S401-S404 can continue to be executed.
[0079] For example, if the attention processing module is the last attention processing module, the sum of the self-attention features, cross features, and noisy real latent features can be processed by convolution to obtain the prediction noise.
[0080] The initial panoramic image generation model first determines the self-attention features based on the noisy real latent features and performs residual connections to obtain fused latent features. Then, the cross-attention processing module determines the cross features based on the text vector and the fused latent features. Based on the self-attention features and the cross features, the predicted noise is obtained. This allows the initial panoramic image generation model to process the internal relationships of the image before injecting text information when learning how to generate panoramic images, thereby improving the generation effect of the target panoramic image generation model.
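The order of operations in S401 to S404 can be sketched as below. This is a simplified illustration under several assumptions that do not come from the application: single-head scaled dot-product attention, a two-layer ReLU feedforward network, no normalization layers, and toy dimensions. All function and parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query_in, context, Wq, Wk, Wv, Wo):
    # Single-head scaled dot-product attention followed by an output projection.
    q, k, v = query_in @ Wq, context @ Wk, context @ Wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (weights @ v) @ Wo

def attention_processing_module(noisy_latent, text_vec, params):
    # S401: self-attention over the noisy real latent features.
    self_feat = attention(noisy_latent, noisy_latent, *params["self"])
    # S402: residual connection yields the fused latent features.
    fused = noisy_latent + self_feat
    # S403: cross-attention, queries from fused latents, keys/values from text.
    cross_feat = attention(fused, text_vec, *params["cross"])
    # S404 and [0077]: sum the three features, then apply a feedforward network.
    summed = self_feat + cross_feat + noisy_latent
    return np.maximum(summed @ params["ffn_in"], 0.0) @ params["ffn_out"]

rng = np.random.default_rng(0)
d, d_text, d_ff = 8, 6, 16  # illustrative sizes only

def mat(rows, cols):
    return rng.standard_normal((rows, cols)) * 0.1

params = {
    "self": [mat(d, d) for _ in range(4)],
    "cross": [mat(d, d), mat(d_text, d), mat(d_text, d), mat(d, d)],
    "ffn_in": mat(d, d_ff),
    "ffn_out": mat(d_ff, d),
}
noisy_latent = rng.standard_normal((10, d))   # 10 latent positions
text_vec = rng.standard_normal((5, d_text))   # 5 text tokens
predicted_noise = attention_processing_module(noisy_latent, text_vec, params)
print(predicted_noise.shape)
```

The predicted noise has the same shape as the noisy latent input, which is what allows it to be compared element by element with the real noise in the loss.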
[0081] In one possible implementation, Figure 5 is a schematic diagram of another structure of the initial panoramic image generation model in the panoramic image generation method provided in the embodiments of this application, and Figure 6 is a flowchart of determining the cross features in that method. As shown in Figure 5 and Figure 6, the cross-attention processing module includes: a query unit, a key unit, a value unit, an output unit, and a value increment unit; the value weight matrix includes: the original value matrix corresponding to the value unit and the increment value matrix corresponding to the value increment unit. In S403 above, determining the cross features based on the text vector and the fused latent features includes:
S601. The query attention features are calculated by the query unit based on the fused latent features.
[0082] Optionally, the query unit is configured to generate query attention features based on the input features. The query attention features can be calculated by the query unit based on the fused latent features. The query attention features can be a query matrix.
[0083] S602. The key attention features are calculated by the key unit based on the text vector.
[0084] Optionally, the key unit is configured to generate key attention features based on the input text vector. The key attention features can be calculated by the key unit based on the text vector. The key attention features can be a key matrix.
[0085] S603. The original value attention features are calculated by the value unit based on the original value matrix and the text vector.
[0086] Optionally, the value unit is configured to generate raw value attention features based on the text vector. The raw value attention features can be calculated by the value unit based on the raw value matrix and the text vector.
[0087] The original value matrix corresponding to the value unit is frozen during the fine-tuning process.
[0088] S604. The value attention increment feature is calculated by the value increment unit based on the increment value matrix and the text vector.
[0089] Optionally, the value increment unit is configured to generate value attention increment features based on the text vector. The value attention increment features can be calculated by the value increment unit based on the increment value matrix and the text vector.
[0090] For example, the value increment unit can be implemented using a fine-tuning component, such as a LoRA component. Specifically, the parameters in the LoRA component can be pre-trained.
[0091] S605. Calculate the sum of the original value attention feature and the value attention increment feature, and use it as the total value attention feature.
[0092] Optionally, the sum of the original value attention feature and the value attention increment feature is calculated as the total value attention feature.
[0093] S606. Based on the query attention features, key attention features, and total value attention features, calculate the attention weighting features.
[0094] Optionally, after obtaining the query attention features, key attention features, and total value attention features, attention weights can be calculated using the query attention features and key attention features, and then the attention weights and total value attention features can be weighted and summed to obtain the attention weighted features.
[0095] For example, attention weights are used to indicate the degree of attention each location in the panoramic image pays to each text word in the text vector. Attention-weighted features are used to indicate the information that each location in the panoramic image reads from the text vector.
[0096] S607. The cross features are calculated by the output unit based on the attention-weighted features.
[0097] Optionally, the output unit calculates the cross features by projecting the attention-weighted features.
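Steps S601 to S607 can be sketched as a single function. The following is an illustrative sketch only: it assumes single-head scaled dot-product attention, a LoRA-style value increment built from two low-rank matrices `A_v` and `B_v`, and a zero initialization of `B_v`. None of these specifics, nor the names and dimensions, are stated in the application.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(fused, text, Wq, Wk, Wv, Wo, A_v, B_v):
    q = fused @ Wq                      # S601: query attention features
    k = text @ Wk                       # S602: key attention features
    v_orig = text @ Wv                  # S603: original value features (Wv frozen)
    v_inc = (text @ A_v) @ B_v          # S604: value attention increment features
    v_total = v_orig + v_inc            # S605: total value attention features
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # S606: attention weights
    weighted = weights @ v_total        # S606: attention-weighted features
    return weighted @ Wo, weights       # S607: cross features via output unit

rng = np.random.default_rng(0)
d, d_text, rank = 8, 6, 4               # illustrative sizes; rank 4 follows [0064]
Wq, Wo = rng.standard_normal((d, d)), rng.standard_normal((d, d))
Wk, Wv = rng.standard_normal((d_text, d)), rng.standard_normal((d_text, d))
A_v = rng.standard_normal((d_text, rank)) * 0.01
B_v = np.zeros((rank, d))               # assumed zero init: increment starts at zero
fused = rng.standard_normal((10, d))    # 10 latent positions
text = rng.standard_normal((5, d_text)) # 5 text tokens

cross, weights = cross_attention(fused, text, Wq, Wk, Wv, Wo, A_v, B_v)
print(cross.shape, weights.shape)
```

Each row of `weights` sums to 1 and expresses how strongly one latent position attends to each text token. With `B_v` at zero the result equals that of the unmodified pre-trained module, so fine-tuning starts from the original behavior.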
[0098] In one possible implementation, Figure 7 is a schematic diagram of another structure of the initial panoramic image generation model in the panoramic image generation method provided in the embodiments of this application, and Figure 8 is a flowchart of determining the cross features in that method. As shown in Figure 7 and Figure 8, the cross-attention processing module also includes: an output increment unit; the output weight matrix includes: the original output matrix corresponding to the output unit and the increment output matrix corresponding to the output increment unit. In S607 above, calculating the cross features by the output unit based on the attention-weighted features includes:
S801. The output unit calculates the original output features based on the original output matrix and the attention-weighted features.
[0099] Optionally, the output unit projects the features of the attention space back into the image space based on the original output matrix and the attention-weighted features to calculate the original output features.
[0100] The original output matrix can be obtained through pre-training, and during fine-tuning, the original output matrix corresponding to the output unit is frozen.
[0101] S802. The output increment feature is calculated by the output increment unit based on the increment output matrix and the attention weighted feature.
[0102] Optionally, the output increment unit can adjust the attention weighted features based on the increment output matrix to calculate the output increment features.
[0103] S803. Calculate the sum of the original output features and the output increment features, and use it as the cross feature.
[0104] Optionally, the sum of the original output features and the output incremental features is calculated as a cross feature, thereby superimposing general capabilities and specific capabilities. This retains the basic spatial layout capabilities of the original panoramic image generation model while adding distortion adjustment specific to panoramic images.
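S801 to S803 amount to adding a learned correction to a frozen projection. A minimal sketch follows, assuming a single LoRA-style output increment unit with low-rank matrices `A_o` and `B_o`. These names, the dimensions, and the random values are illustrative and not taken from the application.

```python
import numpy as np

def cross_features(weighted, W_o, A_o, B_o):
    original_output = weighted @ W_o            # S801: frozen original output matrix
    output_increment = (weighted @ A_o) @ B_o   # S802: trainable increment output matrix
    return original_output + output_increment   # S803: sum is the cross feature

rng = np.random.default_rng(0)
d, rank = 8, 4
weighted = rng.standard_normal((10, d))         # attention-weighted features
W_o = rng.standard_normal((d, d))
A_o = rng.standard_normal((d, rank)) * 0.1
B_o = rng.standard_normal((rank, d)) * 0.1

cross = cross_features(weighted, W_o, A_o, B_o)
# Because both paths are linear, the sum equals one projection with a merged matrix.
merged = weighted @ (W_o + A_o @ B_o)
print(np.allclose(cross, merged))
```

Because both paths are linear, a single increment unit can in principle be folded into the original output matrix after training. This no longer holds directly once a routing unit selects among several sub-increment units per input.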
[0105] In one possible implementation, Figure 9 is a schematic diagram of another structure of the initial panoramic image generation model in the panoramic image generation method provided in the embodiments of this application, and Figure 10 is a schematic flowchart of calculating the output increment features in that method. As shown in Figure 9 and Figure 10, the output increment unit includes: a routing unit, a first sub-increment unit, a second sub-increment unit, a third sub-increment unit, and a fourth sub-increment unit; the increment output matrix includes: a first output matrix corresponding to the first sub-increment unit, a second output matrix corresponding to the second sub-increment unit, a third output matrix corresponding to the third sub-increment unit, and a fourth output matrix corresponding to the fourth sub-increment unit. In S802 above, calculating the output increment features based on the increment output matrix and the attention-weighted features includes:
S1001. The routing unit determines the first weight of the first sub-incremental unit, the second weight of the second sub-incremental unit, the third weight of the third sub-incremental unit, and the fourth weight of the fourth sub-incremental unit based on the attention-weighted features.
[0106] Optionally, the routing unit includes a pre-trained linear transformation layer. The routing unit receives attention-weighted features as input, calculates an initial score for each sub-incremental unit through the pre-trained linear transformation layer, and performs softmax normalization on the initial scores corresponding to each sub-incremental unit to obtain the first weight of the first sub-incremental unit, the second weight of the second sub-incremental unit, the third weight of the third sub-incremental unit, and the fourth weight of the fourth sub-incremental unit.
[0107] For example, each sub-incremental unit can be implemented using a fine-tuning component, such as a LoRA component. Specifically, the parameters in the LoRA component can be pre-trained.
[0108] Each sub-incremental unit can be configured to focus on different regions of the panoramic image. For example, the first sub-incremental unit focuses on strong stretching of the top region, the second sub-incremental unit focuses on strong stretching of the bottom region, the third sub-incremental unit focuses on normal perspective of the middle region, and the fourth sub-incremental unit focuses on normal perspective of the middle region.
[0109] In this case, the input dimension of the linear transformation layer is equal to the dimension of the attention-weighted features, and the output dimension of the linear transformation layer is equal to the number of sub-increment units.
[0110] The sum of the first weight, the second weight, the third weight, and the fourth weight is 1. The first weight, the second weight, the third weight, and the fourth weight are used to indicate the importance of the corresponding sub-incremental unit to the attention-weighted features.
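The routing step of S1001 can be sketched as a linear layer followed by softmax. The sketch below assumes that routing is applied independently to each position of the attention-weighted features and that the linear layer has a bias term. The application states neither, and the names `W_r` and `b_r` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def routing_weights(weighted, W_r, b_r):
    # Linear transformation layer: input dimension = feature dimension,
    # output dimension = number of sub-increment units ([0109]).
    scores = weighted @ W_r + b_r
    # Softmax normalization: the four weights of each position sum to 1 ([0110]).
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
d, num_units = 8, 4
W_r = rng.standard_normal((d, num_units))
b_r = np.zeros(num_units)
weighted = rng.standard_normal((10, d))   # attention-weighted features, 10 positions

weights = routing_weights(weighted, W_r, b_r)
print(weights.shape)
```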
[0111] S1002. Determine the first target weight, the first target sub-incremental unit corresponding to the first target weight, the second target weight, and the second target sub-incremental unit corresponding to the second target weight from the first weight of the first sub-incremental unit, the second weight of the second sub-incremental unit, the third weight of the third sub-incremental unit, and the fourth weight of the fourth sub-incremental unit.
[0112] Optionally, the two weights with the largest values are selected from the first weight of the first sub-incremental unit, the second weight of the second sub-incremental unit, the third weight of the third sub-incremental unit, and the fourth weight of the fourth sub-incremental unit as the first target weight and the second target weight.
[0113] Optionally, the sub-incremental units corresponding to the first target weight and the second target weight are determined as the first target sub-incremental unit and the second target sub-incremental unit, respectively.
[0114] Taking LoRA as an example, each sub-incremental unit includes two low-rank matrices (i.e., the increment output matrix). The input features of each sub-incremental unit are first multiplied with the first matrix to achieve dimensionality reduction, and then multiplied with the second matrix to restore the dimension. Finally, the output is an increment feature with the same dimension as the original output features.
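The dimension reduction and restoration described above can be illustrated with concrete shapes. The feature dimension of 320 is an arbitrary choice for illustration, and the rank of 4 follows the example in [0064]. The variable names are hypothetical, and the zero initialization of the second matrix is a common LoRA convention rather than something the application specifies.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 320, 320, 4           # d_in and d_out are assumed sizes

A = rng.standard_normal((d_in, rank)) * 0.01   # first matrix: reduces the dimension
B = np.zeros((rank, d_out))                    # second matrix: restores the dimension

x = rng.standard_normal((10, d_in))       # input features of one sub-incremental unit
compressed = x @ A                        # shape (10, 4)
increment = compressed @ B                # shape (10, 320), same as original output

full_params = d_in * d_out                # parameters of a full weight matrix
lora_params = rank * (d_in + d_out)       # parameters of the two low-rank matrices
print(compressed.shape, increment.shape, full_params, lora_params)
```

Under these assumed sizes, the two low-rank matrices hold 2,560 parameters, compared with 102,400 for a full matrix of the same input and output dimensions.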
[0115] By determining the first target weight and the corresponding first target sub-incremental unit, and the second target weight and the corresponding second target sub-incremental unit, sparse activation is achieved. That is, only some of the sub-incremental units are used for calculation each time, which reduces the amount of computation and forces each sub-incremental unit to focus on a different type of distortion processing. This improves efficiency, generation quality, and generalization ability.
[0116] S1003. The output incremental features are calculated from the first target sub-incremental unit and the second target sub-incremental unit in the first sub-incremental unit, the second sub-incremental unit, the third sub-incremental unit and the fourth sub-incremental unit, based on the output matrix corresponding to the first target sub-incremental unit and the output matrix corresponding to the second target sub-incremental unit, the attention weighted features, the first target weight and the second target weight.
[0117] Optionally, the output incremental features are calculated by weighting and fusing the first target sub-incremental unit and the second target sub-incremental unit based on the attention-weighted features, the first target weight, and the second target weight.
[0118] In one possible implementation, Figure 11 is another flowchart of calculating the output increment features in the panoramic image generation method provided in this application embodiment. As shown in Figure 11, in S1003 above, calculating the output increment features by the first target sub-incremental unit and the second target sub-incremental unit among the first sub-incremental unit, the second sub-incremental unit, the third sub-incremental unit, and the fourth sub-incremental unit, based on the output matrices corresponding to the first target sub-incremental unit and the second target sub-incremental unit, the attention-weighted features, the first target weight, and the second target weight, includes:
S1101. The first output increment feature is calculated by the first target sub-incremental unit based on the output matrix corresponding to the first target sub-incremental unit, the attention-weighted features, and the first target weight.
[0119] Optionally, the first target sub-incremental unit calculates the first incremental feature based on the output matrix corresponding to the first target sub-incremental unit and the attention weighted feature, and calculates the first output incremental feature based on the first incremental feature and the first target weight.
[0120] For example, taking LoRA as an example, the output matrix corresponding to the first target sub-incremental unit includes a first LoRA matrix and a second LoRA matrix. The attention-weighted features are multiplied by the first LoRA matrix to achieve dimensionality compression, resulting in the first compressed intermediate features. Specifically, the first LoRA matrix is used to condense information from the original features and extract the most relevant core features.
[0121] For example, the first compressed intermediate feature is multiplied by the second LoRA matrix to restore the dimension, resulting in the first incremental feature. The second LoRA matrix enables the condensed information to be restored to a complete feature representation, incorporating processing methods unique to the first target sub-incremental unit.
[0122] For example, the product of the first incremental feature and the first target weight is calculated to obtain the first output incremental feature. The first output incremental feature reflects the contribution of the first target sub-incremental unit to the overall output.
[0123] S1102. The second output increment feature is calculated from the second target sub-increment unit based on the output matrix, attention weighted features and second target weights corresponding to the second target sub-increment unit.
[0124] Optionally, the second target sub-incremental unit calculates the second incremental feature based on the output matrix corresponding to the second target sub-incremental unit and the attention weighted feature, and calculates the second output incremental feature based on the second incremental feature and the second target weight.
[0125] For example, taking LoRA as an example, the output matrix corresponding to the second target sub-increment unit includes a third LoRA matrix and a fourth LoRA matrix. The attention-weighted features are multiplied by the third LoRA matrix to achieve dimensionality compression, resulting in the second compressed intermediate features. Specifically, the third LoRA matrix is used to condense information from the original features, extracting the most relevant core features.
[0126] For example, the second compressed intermediate feature is multiplied by the fourth LoRA matrix to restore the dimension, resulting in the second incremental feature. The fourth LoRA matrix enables the condensed information to be restored to a complete feature representation, incorporating processing methods unique to the second target sub-incremental unit.
[0127] For example, the product of the second incremental feature and the second target weight is calculated to obtain the second output incremental feature. This second output incremental feature reflects the contribution of the second target sub-incremental unit to the overall output.
[0128] S1103. Summing the first output increment feature and the second output increment feature yields the output increment feature.
[0129] Optionally, the first output increment feature and the second output increment feature are added element by element to obtain the output increment feature, thereby realizing the fusion of the first target sub-increment unit and the second target sub-increment unit, and also realizing continuous transition, making the distortion transition between different regions smoother.
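Steps S1001 to S1003 and S1101 to S1103 can be combined into one sketch of top-2 routing over four LoRA-style sub-incremental units. The sketch assumes per-position routing and uses the softmax weights of the two selected units directly, without renormalizing them, since the application does not mention renormalization. All names and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def output_increment(weighted, W_r, units, top_k=2):
    weights = softmax(weighted @ W_r)                  # S1001: one weight per unit
    top = np.argsort(-weights, axis=-1)[:, :top_k]     # S1002: two largest weights
    out = np.zeros((weighted.shape[0], units[0][1].shape[1]))
    for n, row in enumerate(weighted):
        for idx in top[n]:
            A, B = units[idx]                          # the two LoRA matrices of the unit
            incremental = (row @ A) @ B                # compress, then restore dimension
            out[n] += weights[n, idx] * incremental    # S1101/S1102 scaled, S1103 summed
    return out, top, weights

rng = np.random.default_rng(0)
d, rank, num_units = 8, 4, 4
W_r = rng.standard_normal((d, num_units))
units = [(rng.standard_normal((d, rank)) * 0.1,
          rng.standard_normal((rank, d)) * 0.1) for _ in range(num_units)]
weighted = rng.standard_normal((6, d))                 # attention-weighted features

increment, top, weights = output_increment(weighted, W_r, units)
print(increment.shape, top[0])
```

Only two of the four sub-incremental units are evaluated for each position, which is the sparse activation described in [0115].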
[0130] In one possible implementation, step S304 above adjusts the value weight matrix and output weight matrix in the cross-attention processing module based on the loss result, including: Based on the loss results, the increment value matrix corresponding to the value increment unit and the increment output matrix corresponding to the output increment unit are adjusted.
[0131] Optionally, the increment value matrix corresponding to the value increment unit and the increment output matrix corresponding to the output increment unit are adjusted according to the loss result. That is, only the value increment unit and the output increment unit in the cross-attention processing module are adjusted according to the loss result, while the original value matrix and the original output matrix remain frozen.
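The idea of adjusting only the increment matrices can be illustrated on a toy linear layer. This sketch is far simpler than the attention module in the application: it assumes a mean squared error loss and one step of plain gradient descent, neither of which the application specifies, and all names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank, lr = 8, 4, 0.01
x = rng.standard_normal((16, d))
target = rng.standard_normal((16, d))
W = rng.standard_normal((d, d)) * 0.3      # original matrix, kept frozen
A = rng.standard_normal((d, rank)) * 0.1   # increment matrices, trainable
B = rng.standard_normal((rank, d)) * 0.1

def forward(x, W, A, B):
    # Frozen original path plus low-rank increment path.
    return x @ W + (x @ A) @ B

def loss_fn(y, target):
    return float(np.mean((y - target) ** 2))

W_before = W.copy()
y = forward(x, W, A, B)
loss_before = loss_fn(y, target)

# Gradients of the mean squared error with respect to A and B only.
grad_y = 2.0 * (y - target) / y.size
grad_A = x.T @ grad_y @ B.T
grad_B = (x @ A).T @ grad_y

# Update the increment matrices; W receives no update.
A = A - lr * grad_A
B = B - lr * grad_B
loss_after = loss_fn(forward(x, W, A, B), target)
print(loss_before, loss_after)
```

The original matrix `W` is never written to, so whatever the pre-trained model learned is preserved while the increment matrices absorb the task-specific adjustment.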
[0132] Based on the same inventive concept, this application also provides a panoramic image generation device corresponding to the panoramic image generation method. Since the principle of the device in this application is similar to the panoramic image generation method described above in this application, the implementation of the device can refer to the implementation of the method, and the repeated parts will not be described again.
[0133] The panoramic image generation device includes: a generation module, a fine-tuning module, and an inference module. The generation module is used to acquire multiple sample panoramic images and generate multiple training data pairs based on each sample panoramic image. Each training data pair includes: the sample panoramic image and the text description information corresponding to the sample panoramic image. The fine-tuning module is used to iteratively fine-tune at least one target weight matrix in each cross-attention processing module of the initial panoramic image generation model based on each training data pair, so as to obtain the target panoramic image generation model, wherein the initial panoramic image generation model includes at least multiple cross-attention processing modules. The inference module is used to obtain the text description information corresponding to the target panoramic image and input the text description information into the target panoramic image generation model. The target panoramic image generation model performs iterative denoising based on the randomly generated initial noisy image and the text description information to generate the target panoramic image.
[0134] Optionally, the target weight matrix includes: a value weight matrix and an output weight matrix.
[0135] Optionally, the initial panoramic image generation model further includes: a text encoder module, a VAE encoder module, and an attention diffusion module. The attention diffusion module includes multiple attention processing modules, and each attention processing module includes at least a self-attention processing module and a cross-attention processing module. Iteratively fine-tuning, based on each training data pair, at least one target weight matrix in each cross-attention processing module of the initial panoramic image generation model to obtain the target panoramic image generation model includes: in the current iteration, determining multiple current training data pairs from the training data pairs; inputting the text description information of each current training data pair into the text encoder module, which generates the text vector corresponding to each current training data pair based on the text description information; inputting the sample panoramic image of each current training data pair into the VAE encoder module, which performs latent space encoding on each sample panoramic image to obtain the real latent features corresponding to each current training data pair; adding preset real noise to the real latent features corresponding to each current training data pair to generate the noisy real latent features corresponding to each current training data pair; inputting the text vector and the noisy real latent features corresponding to each current training data pair into each attention processing module, where the self-attention processing module and the cross-attention processing module in each attention processing module calculate the prediction noise corresponding to each current training data pair; and determining the loss result based on the predicted noise corresponding to each current training data pair, the real noise corresponding to each current training data pair, and the preset loss function, and adjusting the value weight matrix and output weight matrix in the cross-attention processing module according to the loss result.
[0136] Optionally, the prediction noise corresponding to each current training data pair is calculated by the self-attention processing module and the cross-attention processing module in each attention processing module, including: The self-attention processing module determines the self-attention features based on the noisy real latent features; By performing residual connections between self-attention features and noisy real latent features, fused latent features are obtained. The cross-attention processing module determines the cross features based on the text vector and the fused latent features; The predicted noise is obtained based on the self-attention feature and the cross feature.
[0137] Optionally, the cross-attention processing module includes: a query unit, a key unit, a value unit, an output unit, and a value increment unit; the value weight matrix includes: the original value matrix corresponding to the value unit and the increment value matrix corresponding to the value increment unit; Based on the text vectors and fused latent features, the cross features are determined, including: The query attention features are calculated by the query unit based on the fused latent features; Key attention features are calculated from the key units based on text vectors; The original value attention features are calculated from the value unit based on the original value matrix and the text vector; The value attention increment feature is calculated by the value increment unit based on the increment value matrix and the text vector; The sum of the original value attention feature and the value attention increment feature is calculated as the total value attention feature; Based on the query attention features, key attention features, and total value attention features, the attention weighting features are calculated. Cross features are calculated from the output unit based on attention-weighted features.
[0138] Optionally, the cross-attention processing module further includes: an output increment unit; the output weight matrix includes: the original output matrix corresponding to the output unit and the increment output matrix corresponding to the output increment unit; The cross features are calculated by the output unit based on the attention-weighted features, including: The original output features are calculated by the output unit based on the original output matrix and the attention-weighted features; The output increment feature is calculated by the output increment unit based on the increment output matrix and the attention weighted feature; Calculate the sum of the original output features and the output increment features, and use it as the cross feature.
[0139] Optionally, the output increment unit includes: a routing unit, a first sub-increment unit, a second sub-increment unit, a third sub-increment unit, and a fourth sub-increment unit; the increment output matrix includes: a first output matrix corresponding to the first sub-increment unit, a second output matrix corresponding to the second sub-increment unit, a third output matrix corresponding to the third sub-increment unit, and a fourth output matrix corresponding to the fourth sub-increment unit; based on the increment output matrix and attention weighted features, the output increment features are calculated, including: The routing unit determines the first weight of the first sub-incremental unit, the second weight of the second sub-incremental unit, the third weight of the third sub-incremental unit, and the fourth weight of the fourth sub-incremental unit based on attention weighting features. The first target weight, the first target sub-incremental unit corresponding to the first target weight, the second target weight, and the second target sub-incremental unit corresponding to the second target weight are determined from the first weight of the first sub-incremental unit, the second weight of the second sub-incremental unit, the third weight of the third sub-incremental unit, and the fourth weight of the fourth sub-incremental unit. The output incremental features are calculated from the first target sub-increment unit and the second target sub-increment unit in the first sub-increment unit, the second sub-increment unit, the third sub-increment unit and the fourth sub-increment unit, based on the output matrix corresponding to the first target sub-increment unit and the output matrix corresponding to the second target sub-increment unit, the attention weighted features, the first target weight and the second target weight.
[0140] Optionally, the output incremental features are calculated from the first target sub-increment unit and the second target sub-increment unit among the first sub-increment unit, the second sub-increment unit, the third sub-increment unit, and the fourth sub-increment unit, based on the output matrix corresponding to the first target sub-increment unit and the output matrix corresponding to the second target sub-increment unit, the attention weighted features, the first target weight, and the second target weight, including: The first output incremental feature is calculated from the first target sub-incremental unit, based on the output matrix, attention weighted features, and first target weights corresponding to the first target sub-incremental unit; The second output incremental feature is calculated from the second target sub-incremental unit based on the output matrix, attention weighted features, and second target weights corresponding to the second target sub-incremental unit. The first output increment feature and the second output increment feature are summed to obtain the output increment feature.
[0141] Optionally, based on the loss result, the value weight matrix and output weight matrix in the cross-attention processing module are adjusted, including: Based on the loss results, the increment value matrix corresponding to the value increment unit and the increment output matrix corresponding to the output increment unit are adjusted.
[0142] The processing flow of each module in the device and the interaction flow between each module can be referred to the relevant descriptions in the above method embodiments, and will not be detailed here.
[0143] This application also provides an electronic device. Figure 12 is a schematic diagram of the structure of the electronic device provided in this application embodiment. As shown in Figure 12, the electronic device includes: a processor 1201 and a memory 1202, and optionally, a bus 1203. The memory 1202 stores machine-readable instructions executable by the processor 1201. When the electronic device is running, the processor 1201 and the memory 1202 communicate via the bus 1203, and the processor 1201 executes the machine-readable instructions to perform the steps of the panoramic image generation method described above.
[0144] This application also provides a computer-readable storage medium storing a computer program, which, when executed by a processor, performs the steps of the panoramic image generation method described above.
[0145] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems and devices described above can be referred to the corresponding processes in the method embodiments, and will not be repeated here. In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods can be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of modules is only a logical functional division, and in actual implementation, there may be other division methods. Furthermore, multiple modules or components can be combined or integrated into another system, or some features can be ignored or not executed. Another point is that the displayed or discussed mutual coupling or direct coupling or communication connection can be through some communication interfaces; the indirect coupling or communication connection of devices or modules can be electrical, mechanical, or other forms.
[0146] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. If the functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this invention. The aforementioned storage medium includes: USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media capable of storing program code.
[0147] The above are merely specific embodiments of this application, but the scope of protection of this application is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application.
Claims
1. A panoramic image generation method, characterized in that it comprises: acquiring multiple sample panoramic images, and generating multiple training data pairs based on the sample panoramic images, wherein each training data pair includes a sample panoramic image and the text description information corresponding to that sample panoramic image; based on the training data pairs, iteratively fine-tuning at least one target weight matrix in each cross-attention processing module of an initial panoramic image generation model to obtain a target panoramic image generation model, wherein the initial panoramic image generation model includes at least a plurality of the cross-attention processing modules; and obtaining the text description information corresponding to a target panoramic image and inputting the text description information into the target panoramic image generation model, whereupon the target panoramic image generation model performs iterative denoising based on a randomly generated initial noisy image and the text description information to generate the target panoramic image.
2. The panoramic image generation method according to claim 1, characterized in that the target weight matrix includes a value weight matrix and an output weight matrix.
3. The panoramic image generation method according to claim 2, characterized in that the initial panoramic image generation model further includes: a text encoder module, a VAE encoder module, and an attention diffusion module; the attention diffusion module includes multiple attention processing modules, and each attention processing module includes at least a self-attention processing module and the cross-attention processing module; the step of iteratively fine-tuning at least one target weight matrix in each cross-attention processing module of the initial panoramic image generation model based on the training data pairs to obtain the target panoramic image generation model includes: in the current iteration round, determining multiple current training data pairs from among the training data pairs, and inputting the text description information in each current training data pair into the text encoder module, which generates the text vector corresponding to each current training data pair based on the text description information; inputting the sample panoramic image in each current training data pair into the VAE encoder module, which performs latent space encoding on each sample panoramic image to obtain the real latent features corresponding to each current training data pair; adding preset real noise to the real latent features corresponding to each current training data pair to generate the noisy real latent features corresponding to each current training data pair; inputting the text vector and the noisy real latent features corresponding to each current training data pair into each attention processing module, and calculating the predicted noise corresponding to each current training data pair by means of the self-attention processing module and the cross-attention processing module in each attention processing module; and determining a loss result based on the predicted noise corresponding to each current training data pair, the real noise corresponding to each current training data pair, and a preset loss function, and adjusting the value weight matrix and the output weight matrix in the cross-attention processing module according to the loss result.
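As a rough illustration of the noising and loss steps recited in claim 3, the following minimal sketch mixes a latent vector with noise and scores a noise prediction. The claim does not fix the noising formula or the loss function; the square-root mixing with a schedule value `alpha_bar` and the mean squared error used here are assumptions borrowed from common diffusion training practice, and the names `add_noise` and `mse_loss` are hypothetical.

```python
import random

def add_noise(latent, noise, alpha_bar):
    """Mix a clean latent with noise (assumed DDPM-style schedule value alpha_bar)."""
    keep = alpha_bar ** 0.5
    mix = (1.0 - alpha_bar) ** 0.5
    return [keep * x + mix * n for x, n in zip(latent, noise)]

def mse_loss(predicted, target):
    """Mean squared error, an assumed choice for the 'preset loss function'."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

rng = random.Random(0)
latent = [rng.gauss(0.0, 1.0) for _ in range(8)]  # stand-in for VAE-encoded latent features
noise = [rng.gauss(0.0, 1.0) for _ in range(8)]   # stand-in for the real noise
noisy = add_noise(latent, noise, alpha_bar=0.5)   # noisy latent features fed to the model

perfect = mse_loss(noise, noise)           # a perfect noise prediction gives zero loss
zero_guess = mse_loss([0.0] * 8, noise)    # predicting no noise gives a positive loss
```

In an actual training loop the predicted noise would come from the attention processing modules rather than from these placeholder lists.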
4. The panoramic image generation method according to claim 3, characterized in that calculating the predicted noise corresponding to each current training data pair by means of the self-attention processing module and the cross-attention processing module in each attention processing module includes: determining, by the self-attention processing module, self-attention features based on the noisy real latent features; residually connecting the self-attention features and the noisy real latent features to obtain fused latent features; determining, by the cross-attention processing module, cross features based on the text vector and the fused latent features; and obtaining the predicted noise based on the self-attention features and the cross features.
5. The panoramic image generation method according to claim 4, characterized in that the cross-attention processing module includes: a query unit, a key unit, a value unit, an output unit, and a value increment unit; the value weight matrix includes: an original value matrix corresponding to the value unit and an increment value matrix corresponding to the value increment unit; the step of determining the cross features based on the text vector and the fused latent features includes: calculating, by the query unit, query attention features based on the fused latent features; calculating, by the key unit, key attention features based on the text vector; calculating, by the value unit, original value attention features based on the original value matrix and the text vector; calculating, by the value increment unit, value attention increment features based on the increment value matrix and the text vector; calculating the sum of the original value attention features and the value attention increment features as total value attention features; calculating attention-weighted features based on the query attention features, the key attention features, and the total value attention features; and calculating, by the output unit, the cross features based on the attention-weighted features.
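The data flow recited in claim 5 can be sketched as follows. This is only an illustration under stated assumptions: the claim does not specify how the attention-weighted features are computed, so scaled dot-product attention with a softmax is assumed here; the increment value matrix is shown as a full matrix because the claim does not state its structure; and all function and parameter names (`cross_attention`, `w_q`, `delta_w_v`, and so on) are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in a:
        m = max(row)
        e = [math.exp(x - m) for x in row]
        s = sum(e)
        out.append([x / s for x in e])
    return out

def cross_attention(latent, text, w_q, w_k, w_v, delta_w_v, w_o):
    q = matmul(latent, w_q)                       # query unit: from fused latent features
    k = matmul(text, w_k)                         # key unit: from the text vector
    v_total = add(matmul(text, w_v),              # value unit (original value matrix)
                  matmul(text, delta_w_v))        # value increment unit
    d = len(q[0])
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, transpose(k))]
    weighted = matmul(softmax_rows(scores), v_total)  # attention-weighted features
    return matmul(weighted, w_o)                  # output unit: cross features

identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
latent = [[1.0, 0.0], [0.0, 1.0]]                 # two latent tokens
text = [[1.0, 2.0], [0.5, -1.0], [0.0, 1.0]]      # three text tokens

base = cross_attention(latent, text, identity, identity, identity, zeros, identity)
tuned = cross_attention(latent, text, identity, identity, identity,
                        [[0.1, 0.0], [0.0, 0.1]], identity)
```

With a zero increment value matrix the module reduces to ordinary cross-attention; in this toy setup an increment of 0.1 times the identity simply scales the cross features by 1.1.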
6. The panoramic image generation method according to claim 5, characterized in that the cross-attention processing module further includes an output increment unit; the output weight matrix includes: an original output matrix corresponding to the output unit and an increment output matrix corresponding to the output increment unit; calculating, by the output unit, the cross features based on the attention-weighted features includes: calculating, by the output unit, original output features based on the original output matrix and the attention-weighted features; calculating, by the output increment unit, output increment features based on the increment output matrix and the attention-weighted features; and calculating the sum of the original output features and the output increment features as the cross features.
7. The panoramic image generation method according to claim 6, characterized in that the output increment unit includes: a routing unit, a first sub-increment unit, a second sub-increment unit, a third sub-increment unit, and a fourth sub-increment unit; the increment output matrix includes: a first output matrix corresponding to the first sub-increment unit, a second output matrix corresponding to the second sub-increment unit, a third output matrix corresponding to the third sub-increment unit, and a fourth output matrix corresponding to the fourth sub-increment unit; calculating the output increment features based on the increment output matrix and the attention-weighted features includes: determining, by the routing unit, a first weight of the first sub-increment unit, a second weight of the second sub-increment unit, a third weight of the third sub-increment unit, and a fourth weight of the fourth sub-increment unit based on the attention-weighted features; determining, from the first weight, the second weight, the third weight, and the fourth weight, a first target weight together with the first target sub-increment unit corresponding to the first target weight, and a second target weight together with the second target sub-increment unit corresponding to the second target weight; and calculating the output increment features, by the first target sub-increment unit and the second target sub-increment unit among the first, second, third, and fourth sub-increment units, based on the output matrix corresponding to the first target sub-increment unit, the output matrix corresponding to the second target sub-increment unit, the attention-weighted features, the first target weight, and the second target weight.
8. The panoramic image generation method according to claim 7, characterized in that the step of calculating the output increment features, by the first target sub-increment unit and the second target sub-increment unit among the first, second, third, and fourth sub-increment units, based on the output matrix corresponding to the first target sub-increment unit, the output matrix corresponding to the second target sub-increment unit, the attention-weighted features, the first target weight, and the second target weight includes: calculating, by the first target sub-increment unit, a first output increment feature based on the output matrix corresponding to the first target sub-increment unit, the attention-weighted features, and the first target weight; calculating, by the second target sub-increment unit, a second output increment feature based on the output matrix corresponding to the second target sub-increment unit, the attention-weighted features, and the second target weight; and summing the first output increment feature and the second output increment feature to obtain the output increment features.
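The routing described in claims 7 and 8 can be pictured with the sketch below, which picks two of four sub-increment units and sums their weighted outputs. Several points are assumptions rather than claim language: the routing unit is modelled as a linear map followed by a softmax, the two target weights are taken to be the two largest weights, each weight is applied as a plain scalar multiplier without renormalisation, and the names `routed_output_increment`, `router`, and `experts` are hypothetical.

```python
import math

def vecmat(x, m):
    """Row vector times matrix (matrix stored as a list of rows)."""
    return [sum(xi * mij for xi, mij in zip(x, col)) for col in zip(*m)]

def softmax(v):
    mx = max(v)
    e = [math.exp(a - mx) for a in v]
    s = sum(e)
    return [a / s for a in e]

def routed_output_increment(x, router, experts, top_k=2):
    # Routing unit: one weight per sub-increment unit (assumed linear + softmax).
    weights = softmax(vecmat(x, router))
    # Keep the sub-increment units with the largest weights (assumed selection rule).
    chosen = sorted(range(len(experts)), key=lambda i: weights[i], reverse=True)[:top_k]
    increment = [0.0] * len(experts[0][0])
    for i in chosen:
        part = vecmat(x, experts[i])  # this unit's output matrix applied to the features
        increment = [acc + weights[i] * p for acc, p in zip(increment, part)]
    return increment, chosen, weights

x = [1.0, 0.0]                                    # attention-weighted features of one token
router = [[2.0, 0.0, 1.0, -1.0],
          [0.0, 0.0, 0.0, 0.0]]                   # routing logits become [2, 0, 1, -1]
experts = [[[s, 0.0], [0.0, s]] for s in (1.0, 2.0, 3.0, 4.0)]  # four output matrices

increment, chosen, weights = routed_output_increment(x, router, experts)
```

Here the first and third sub-increment units receive the largest weights, so only their output matrices contribute to the output increment features.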
9. The panoramic image generation method according to claim 8, characterized in that the step of adjusting the value weight matrix and the output weight matrix in the cross-attention processing module according to the loss result includes: adjusting, based on the loss result, the increment value matrix corresponding to the value increment unit and the increment output matrix corresponding to the output increment unit.
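Claim 9 limits the adjustment to the increment matrices, which implies that the original value and output matrices stay fixed. A minimal sketch of such a selective update follows; plain gradient descent, the placeholder gradient values, and the parameter names are all assumptions, since the claims do not name an optimizer.

```python
def apply_update(params, grads, trainable, lr=0.01):
    """Gradient-descent step that changes only the matrices listed in `trainable`."""
    updated = {}
    for name, matrix in params.items():
        if name in trainable:
            updated[name] = [[w - lr * g for w, g in zip(w_row, g_row)]
                             for w_row, g_row in zip(matrix, grads[name])]
        else:
            updated[name] = [row[:] for row in matrix]  # frozen: copied unchanged
    return updated

params = {
    "original_value":   [[1.0, 0.0], [0.0, 1.0]],
    "increment_value":  [[0.0, 0.0], [0.0, 0.0]],
    "original_output":  [[1.0, 0.0], [0.0, 1.0]],
    "increment_output": [[0.0, 0.0], [0.0, 0.0]],
}
grads = {name: [[1.0, 1.0], [1.0, 1.0]] for name in params}  # placeholder gradients
trainable = {"increment_value", "increment_output"}

updated = apply_update(params, grads, trainable, lr=0.5)
```

Only the two increment matrices move after the step, while the original matrices keep their pre-trained values.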
10. An electronic device, characterized in that it comprises a processor and a memory, the memory storing machine-readable instructions executable by the processor, wherein, when the electronic device is in operation, the machine-readable instructions are executed by the processor to perform the steps of the panoramic image generation method according to any one of claims 1 to 9.