Image generation method and device based on spatiotemporal data interaction and electronic equipment

By independently encoding visual representations and decoding mask representations with a disentangled model, the method addresses the high computational complexity and low efficiency of non-autoregressive transformers, achieving efficient, high-quality image generation.

CN119295581B · Active · Publication Date: 2026-06-26 · TSINGHUA UNIVERSITY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
TSINGHUA UNIVERSITY
Filing Date
2024-09-23
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Non-autoregressive transformers involve large computational loads and low computational efficiency during image generation, making it impossible to efficiently generate high-quality images.

Method used

Visual representations are encoded independently with self-attention interaction, and mask representations are then decoded against the resulting features by a disentangled model. This removes unnecessary attention interactions and keeps only the key ones: self-attention among visual representations and cross-self-attention from mask representations to visual representations.

Benefits of technology

It achieves efficient image generation, reduces computational load, and improves generation efficiency and image quality.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides an image generation method and apparatus based on spatiotemporal data interaction, and an electronic device. It relates to the technical field of computer vision and aims to realize efficient image generation. The method comprises the following steps: acquiring the previous image representation sequence, where the image representation sequence comprises visual representations and mask representations, the mask representations represent unknown image content, and the visual representations represent known image content and provide image information from which the mask representations infer the unknown image content; encoding the visual representations based on self-attention interaction to obtain visual representation features; decoding the mask representations based on cross-self-attention interaction according to the visual representation features to obtain a new image representation sequence; iterating the generation of image representation sequences multiple times according to the above steps; and generating an image from the new image representation sequence once the new image representation sequence no longer contains any mask representation.

Description

Technical Field

[0001] This application relates to the field of computer vision technology, and in particular to an image generation method, apparatus and electronic device based on spatiotemporal data interaction.

Background Technology

[0002] In recent years, the field of AI-generated content has seen unprecedented growth. In the context of natural language processing, content is usually synthesized by using transformers (attention mechanisms) to generate discrete representations (tokens).

[0003] Token-based generation methods have shown good performance in synthesizing visual content. As a representative method, non-autoregressive transformers can generate multiple image tokens simultaneously and so produce high-quality images in a few steps. However, in each generation step of a non-autoregressive transformer, visual tokens and mask tokens are processed and interact concurrently, meaning that multiple types of data interaction occur simultaneously. Therefore, non-autoregressive transformers suffer from high computational complexity and low computational efficiency.

Summary of the Invention

[0004] In view of the above problems, embodiments of this application provide an image generation method, apparatus and electronic device based on spatiotemporal data interaction, so as to overcome the above problems or at least partially solve the above problems.

[0005] A first aspect of this application discloses an image generation method based on spatiotemporal data interaction, comprising:

[0006] Obtain the previous image representation sequence; the image representation sequence includes visual representation and mask representation, the mask representation represents unknown image content, and the visual representation represents known image content, which is used to provide image information for the mask representation to infer the unknown image content;

[0007] The visual representation is encoded based on self-attention interaction to obtain visual representation features;

[0008] Based on the visual representation features, the mask representation is decoded using a cross-self-attention interaction to obtain a new image representation sequence, which includes newly added visual representations.

[0009] The image representation sequence is generated iteratively multiple times according to the above steps. If the new image representation sequence does not contain a mask representation, an image is generated based on the new image representation sequence.

[0010] Optionally, in two adjacent newly generated image representation sequences, the image representations at the newly added visual representation locations differ significantly, while the image representations at other locations are similar; the visual representation features include the newly added visual representation features; the visual representations are encoded based on self-attention interactions to obtain visual representation features, including:

[0011] New visual representations are determined from the previous image representation sequence;

[0012] The newly added visual representation is encoded based on cross-self-attention interaction to obtain the new visual representation features.

[0013] Optionally, the mask representation is decoded based on the visual representation features using a cross-self-attention interaction to obtain a new image representation sequence, including:

[0014] The features of the previous image representation sequence are concatenated with the newly added visual representation features to obtain the concatenated representation features.

[0015] The mask representation is decoded using the concatenated representation features based on cross-self-attention interaction to obtain a new image representation sequence.
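
The feature reuse described in [0014] can be illustrated with a minimal sketch. It is not the patented implementation: the helper names `concat_with_cache` and `toy_encode` are hypothetical, and the toy encoder simply maps a token to a two-element feature. The point is only that features from the previous step are kept as they are, and the encoder runs on the newly added visual representations alone.

```python
from typing import Callable, List

Feature = List[float]


def concat_with_cache(
    cached_feats: List[Feature],
    new_tokens: List[int],
    encode: Callable[[int], Feature],
) -> List[Feature]:
    """Encode only the newly added tokens and append them to cached features."""
    new_feats = [encode(token) for token in new_tokens]
    return cached_feats + new_feats


encoded_tokens: List[int] = []  # records which tokens were actually encoded


def toy_encode(token: int) -> Feature:
    # Hypothetical stand-in for the encoder; a real one would use attention.
    encoded_tokens.append(token)
    return [float(token), 1.0]


cached = [[3.0, 1.0], [5.0, 1.0]]  # features kept from the previous step
concatenated = concat_with_cache(cached, [8, 2], toy_encode)
print(len(concatenated), encoded_tokens)
```

Only the two new tokens pass through the encoder; the two cached features are reused unchanged.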

[0016] Optionally, the visual representation features are used to decode the mask representation based on cross-self-attention interaction to obtain a new image representation sequence, including:

[0017] Based on the visual representation features and the mask representation features corresponding to the mask representation, generate key representations and value representations;

[0018] The mask representation features are used as query representations;

[0019] A new image representation sequence is obtained by performing attention interactions based on the key representation, the value representation, and the query representation.

[0020] Optionally, the new image representation sequence is generated based on a disentanglement model, which includes a multi-layer encoder and a multi-layer decoder, wherein the number of layers in the multi-layer encoder is greater than the number of layers in the multi-layer decoder.

[0021] The multi-layer encoder is used to encode the visual representation based on self-attention interaction, and the multi-layer decoder is used to decode the mask representation based on cross-self-attention interaction according to the visual representation features.

[0022] Optionally, the disentanglement model is trained according to the following steps:

[0023] Construct an image representation sequence set, which includes multiple image representation sequence samples;

[0024] Randomly mask the image representations in the image representation sequence samples to obtain image representation sequence samples containing masked representations;

[0025] The target training strategy is determined according to the target probability, and the disentanglement model is trained according to the target training strategy to predict the image representation corresponding to the mask representation position based on the image representation sequence sample containing the mask representation.

[0026] Optionally, the target training strategy includes a first training strategy; training the disentanglement model according to the first training strategy to predict the image representation sequence samples based on the image representation sequence samples containing mask representations includes:

[0027] The image representation sequence sample containing the mask representation is used as the current image representation sequence sample, and the image representation sequence sample containing the mask representation is randomly masked to simulate the previous image representation sequence sample;

[0028] The current image representation sequence sample is compared with the previous image representation sequence sample to determine the newly added visual representation sample;

[0029] The features of the previous image representation sequence sample are concatenated with the features of the newly added visual representation sample to obtain a concatenated representation feature sample.

[0030] The image representation corresponding to the position of the mask representation is predicted by decoding the mask representation in the previous image representation sequence sample using the concatenated representation feature sample.
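
The sample construction in [0024], [0027] and [0028] can be sketched as follows. This is a minimal illustration under stated assumptions: `MASK`, `random_mask` and `newly_added` are hypothetical names, tokens are plain integers, and the number of positions masked at each stage is fixed by hand rather than drawn from any schedule described in the source.

```python
import random
from typing import List, Optional

MASK = None  # hypothetical placeholder for a mask representation
Sequence = List[Optional[int]]


def random_mask(seq: Sequence, n_to_mask: int, rng: random.Random) -> Sequence:
    """Replace n_to_mask randomly chosen visible tokens with MASK."""
    visible = [i for i, tok in enumerate(seq) if tok is not MASK]
    chosen = set(rng.sample(visible, n_to_mask))
    return [MASK if i in chosen else tok for i, tok in enumerate(seq)]


def newly_added(current: Sequence, previous: Sequence) -> List[int]:
    """Positions visible in the current sample but masked in the previous one."""
    return [
        i
        for i, (cur, prev) in enumerate(zip(current, previous))
        if cur is not MASK and prev is MASK
    ]


rng = random.Random(0)
full_sample = [11, 12, 13, 14, 15, 16, 17, 18]
current = random_mask(full_sample, 4, rng)  # sample containing mask representations
previous = random_mask(current, 2, rng)     # masked further to simulate the previous step
new_positions = newly_added(current, previous)
print(current, previous, new_positions)
```

Because the simulated previous sample is obtained by masking the current one further, every token visible in the previous sample is also visible in the current one, and the difference between the two is exactly the newly added visual representation sample.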

[0031] A second aspect of this application discloses an image generation apparatus based on spatiotemporal data interaction, comprising:

[0032] An acquisition module is used to acquire the previous image representation sequence; the image representation sequence includes visual representation and mask representation, the mask representation represents unknown image content, and the visual representation represents known image content, and is used to provide image information for the mask representation to infer the unknown image content;

[0033] The encoding module is used to encode the visual representation based on self-attention interaction to obtain visual representation features;

[0034] The decoding module is used to decode the mask representation based on the visual representation features to obtain a new image representation sequence, wherein the new image representation sequence contains newly added visual representations.

[0035] The generation module is used to perform multiple iterative generation of image representation sequences according to the above steps, and to generate an image based on the new image representation sequence when the new image representation sequence does not contain a mask representation.

[0036] A third aspect of this application discloses an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the steps of the image generation method based on spatiotemporal data interaction described in the first aspect of this application.

[0037] A fourth aspect of this application discloses a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of the image generation method based on spatiotemporal data interaction described in the first aspect of this application.

[0038] A fifth aspect of this application discloses a computer program product, including a computer program that, when executed by a processor, implements the steps of the image generation method based on spatiotemporal data interaction described in the first aspect of this application.

[0039] The embodiments of this application have the following advantages:

[0040] In this embodiment, the previous image representation sequence is obtained; the image representation sequence includes visual representation and mask representation, the mask representation represents unknown image content, and the visual representation represents known image content, used to provide image information for the mask representation to infer the unknown image content; and the visual representation is encoded based on self-attention interaction to obtain visual representation features; the mask representation is decoded based on cross-self-attention interaction according to the visual representation features to obtain a new image representation sequence, the new image representation sequence containing newly added visual representations; the above steps are performed to iteratively generate multiple image representation sequences, and if the new image representation sequence does not contain mask representations, an image is generated based on the new image representation sequence.

[0041] Thus, each time an image representation sequence is generated, the currently visible and reliable image information is encoded by independently encoding the visual representations (i.e., visual tokens). Based on the encoded visual representation features, the mask representations (i.e., mask tokens) are decoded so that unknown image content is correctly predicted from the visible and reliable image information, resulting in a new image representation sequence. Because this method performs the different types of attention interaction independently during representation sequence generation, it does not need to execute all types of attention interaction, only the key ones: self-attention interaction among the visual representations and cross-self-attention interaction of the mask representations with respect to the visual representations. Therefore, the amount of computation in the image generation process is reduced, achieving efficient image generation.

Description of the Drawings

[0042] To more clearly illustrate the technical solutions of the embodiments of this application, the drawings used in the description of the embodiments of this application will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0043] Figure 1 is a schematic diagram of the image generation process based on a non-autoregressive transformer;

[0044] Figure 2 is a schematic diagram of the data interactions involved when an image is generated by a non-autoregressive transformer;

[0045] Figure 3 is a flowchart of the steps of an image generation method based on spatiotemporal data interaction provided in an embodiment of this application;

[0046] Figure 4 is a schematic diagram of the structure of a disentanglement model provided in an embodiment of this application;

[0047] Figure 5 is a flowchart of another image generation method based on spatiotemporal data interaction provided in an embodiment of this application;

[0048] Figure 6 is a schematic diagram of an image representation sequence generation method provided in an embodiment of this application;

[0049] Figure 7 is a schematic diagram of another disentanglement model provided in an embodiment of this application;

[0050] Figure 8 is a schematic diagram of the structure of an image generation apparatus based on spatiotemporal data interaction provided in an embodiment of this application;

[0051] Figure 9 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Description

[0052] To make the above-mentioned objectives, features, and advantages of this application more apparent and understandable, the technical solutions in the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0053] To better understand the technical solution of this application, a brief introduction to non-autoregressive transformers is given first:

[0054] As shown in Figure 1, non-autoregressive transformers (NATs) follow a progressive generation paradigm. In each generation step, a certain number of latent image representations of the image being generated are decoded in parallel, and the NAT executes this process iteratively to produce the final, complete map of image representations. Specifically, in each step, the still-unknown latent image representations are represented by mask representations and concatenated with the already decoded image representations (i.e., visual representations). The complete sequence of mask representations and visual representations is then input into a transformer-based model to predict appropriate values for the mask representations, and the most reliable predictions are retained as the visual representations added for the next step. In other words, generation starts from a sequence of image representations consisting entirely of mask representations, decodes multiple image representations at each step, and finally uses a pre-trained decoder to generate the image from the fully decoded sequence.

[0055] However, in each generation step of a non-autoregressive transformer, visual representation and mask representation are processed and interact concurrently, meaning that multiple types of data interaction occur simultaneously. Therefore, non-autoregressive transformers suffer from high computational complexity and low computational efficiency.

[0056] To achieve efficient image generation, this application analyzes the mechanism underlying the effectiveness of the progressive generation process of non-autoregressive transformers. The analysis reveals that, at the spatial level, in each generation step, even though mask representations and visual representations are treated equally in the computational graph of the non-autoregressive transformer, the visual representations naturally learn to provide image information from which the mask representations infer unknown image content, and their corresponding deep features can be constructed without the mask representations.

[0057] Specifically, to better understand this mechanism, an ablation study was conducted on the four types of spatial interaction involved in the image generation process of non-autoregressive transformers: (1) attention from mask representation to visual representation; (2) attention from visual representation to mask representation; (3) attention from visual representation to visual representation; and (4) attention from mask representation to mask representation. These four types of spatial interaction affect the performance of non-autoregressive transformers very differently, as shown in Figure 2, where FID (Fréchet Inception Distance) is a metric for evaluating the quality of generative models (lower is better). The most crucial spatial interaction is the attention from mask representation to visual representation (i.e., information propagation from visual representation to mask representation). Without this attention, the model cannot converge. Furthermore, attention from mask representation to mask representation and from visual representation to visual representation (i.e., self-attention during feature extraction for the mask and visual representations, respectively) moderately improves the model. However, removing the attention from visual representation to mask representation (i.e., information propagation from mask representation to visual representation) only slightly impairs the model's performance.
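
One way to picture such an ablation is as an attention mask that forbids a single interaction type. The sketch below is illustrative only and is not taken from the study itself: the function `interaction_mask` and the `"query->key"` naming are assumptions, with "mask->visual" meaning that a mask representation acts as the query and a visual representation as the key.

```python
from typing import List


def interaction_mask(is_visual: List[bool], blocked: str) -> List[List[bool]]:
    """Return allow[q][k]: may query token q attend to key token k?

    `blocked` names one interaction type as "query->key", for example
    "visual->mask" forbids visual tokens from attending to mask tokens.
    """

    def kind(flag: bool) -> str:
        return "visual" if flag else "mask"

    return [
        [f"{kind(q)}->{kind(k)}" != blocked for k in is_visual]
        for q in is_visual
    ]


# Toy sequence: positions 0 and 2 are visual tokens, 1 and 3 are mask tokens.
is_visual = [True, False, True, False]
allow = interaction_mask(is_visual, blocked="visual->mask")
for row in allow:
    print("".join("x" if ok else "." for ok in row))
```

With "visual->mask" blocked, the rows for the visual tokens only see other visual tokens, while the rows for the mask tokens still see everything. This is the variant that the paragraph above reports as only slightly impairing performance.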

[0058] The unequal importance of these four types of spatial interaction highlights the distinct roles of visual representations and mask representations. Specifically, processing the visual representations primarily involves building internal representations from the currently available and reliable image information and propagating them to the mask representations; furthermore, the deep features corresponding to the visual representations can be built from the visual representations alone. Conversely, the mask representations gradually extract image information from the visual representations in order to predict appropriate representations for the unknown parts of the image. In other words, the roles of visual representations and mask representations separate naturally when the model learns to generate images efficiently, even though they are treated equally in the computational graph of a non-autoregressive transformer.

[0059] Furthermore, the interactions between adjacent generation steps are mainly focused on updating a small number of "key image representations" on top of the previous steps, while the computation of most other image representations is usually repetitive.

[0060] Therefore, in order to achieve efficient image generation, and based on the mechanism underlying the effectiveness of the progressive generation process of non-autoregressive transformers, this application proposes the following technical concept:

[0061] Efficient image generation is achieved by encouraging the inherently key interactions. Specifically, at the spatial level, the computation of visual representations (visual tokens) and mask representations (mask tokens) is decoupled by encoding the visual representations independently. The mask representations are then decoded based on the visual representation features obtained from the encoded visual representations, so that unknown image content is correctly predicted from visible and reliable image information and images are generated rapidly. At the temporal level, the computation of the key representations at each step is prioritized, while previously computed representation features are reused as much as possible to supplement necessary information.

[0062] Based on the above technical concept, this application provides an image generation method based on spatiotemporal data interaction. Figure 3 is a flowchart of the steps of an image generation method based on spatiotemporal data interaction provided in an embodiment of this application. As shown in Figure 3, the method may include steps S310 to S340:

[0063] Step S310: Obtain the previous image representation sequence; the image representation sequence includes visual representation and mask representation, the mask representation represents unknown image content, and the visual representation represents known image content, used to provide image information for the mask representation to infer the unknown image content.

[0064] The image representation sequence refers to the image token sequence, which contains multiple image representations. Image representations are divided into visual representations (i.e., visual tokens) and mask representations (i.e., mask tokens). The previous image representation sequence refers to the previously generated image representation sequence. For the first generation of the image representation sequence, the previous image representation sequence refers to the initial input image representation sequence, which is an image representation sequence composed of mask representations.

[0065] Step S320: Encode the visual representation based on self-attention interaction to obtain visual representation features.

[0066] In this embodiment, visual representation is processed independently of mask representation. By encoding the visual representation based on self-attention interaction, the currently visible and reliable image information is encoded. That is, the internal representation of the visual representation is established based on the currently visible and reliable image information, and the visual representation features are obtained.

[0067] Specifically, encoding the visual representation based on self-attention interaction to obtain visual representation features includes: performing self-attention interaction and feed-forward processing on the visual representation through a multi-layer encoder to obtain the visual representation features. Each encoder layer includes a self-attention interaction layer and a feed-forward layer.
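
A minimal sketch of such an encoder, applied to the visual representations only, is given below. Several details are assumptions that the text does not specify: a single attention head, residual connections, a ReLU feed-forward sub-layer, no normalization, and small random weights in place of trained parameters. All function names are hypothetical.

```python
import math
import random
from typing import Dict, List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def softmax(row: List[float]) -> List[float]:
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    return matmul([softmax(r) for r in scores], v)


def rand_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def encoder_layer(x: Matrix, w: Dict[str, Matrix]) -> Matrix:
    # Self-attention sub-layer: queries, keys and values all come from
    # the visual tokens, so no mask token is involved at any point.
    attn = attention(matmul(x, w["q"]), matmul(x, w["k"]), matmul(x, w["v"]))
    x = add(x, attn)  # residual connection (assumed)
    # Feed-forward sub-layer with a ReLU non-linearity (assumed).
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w["ff1"])]
    return add(x, matmul(hidden, w["ff2"]))


rng = random.Random(0)
dim, hidden_dim, n_layers = 4, 8, 2
layers = [
    {
        "q": rand_matrix(dim, dim, rng),
        "k": rand_matrix(dim, dim, rng),
        "v": rand_matrix(dim, dim, rng),
        "ff1": rand_matrix(dim, hidden_dim, rng),
        "ff2": rand_matrix(hidden_dim, dim, rng),
    }
    for _ in range(n_layers)
]
features = rand_matrix(3, dim, rng)  # embeddings of three visual tokens
for weights in layers:
    features = encoder_layer(features, weights)
print(len(features), len(features[0]))
```

The encoder never receives a mask representation as input, which is what makes the resulting visual representation features independent of how many positions are still masked.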

[0068] Step S330: Decode the mask representation based on cross-self-attention interaction according to the visual representation features to obtain a new image representation sequence, wherein the new image representation sequence contains newly added visual representations.

[0069] In this embodiment, decoding the mask representation refers to predicting the unknown image content based on the current image information (i.e., visual representation features). This involves obtaining image information from the visual representation to predict the appropriate image representation corresponding to the unknown part of the image, thereby obtaining a new visual representation. This new image representation sequence has more visual representations compared to the previous image representation sequence.

[0070] For example, based on the visual representation features, the mask representation is decoded using cross-self-attention interaction (i.e., SC-attention). For each mask representation, a probability distribution is predicted, and the probability distribution is sampled to obtain the prediction result corresponding to the mask representation. Based on the reliable prediction result, the corresponding mask representation in the previous image representation sequence is replaced to obtain a new image representation sequence. The reliable prediction result is the newly added visual representation.
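
The sample-then-keep-reliable step can be sketched as follows. The source does not define how reliability is measured, so this sketch assumes that the probability of the sampled token serves as its confidence and that a fixed number of the most confident predictions is kept per step. `MASK` and `unmask_step` are hypothetical names.

```python
import random
from typing import Dict, List, Optional

MASK = None  # hypothetical placeholder for a mask representation
Sequence = List[Optional[int]]


def unmask_step(
    seq: Sequence,
    probs: Dict[int, List[float]],
    keep: int,
    rng: random.Random,
) -> Sequence:
    """Sample a token for every masked position and keep the `keep` most confident.

    `probs` maps each masked position to its predicted distribution over
    the token vocabulary.
    """
    candidates = []
    for pos, dist in probs.items():
        token = rng.choices(range(len(dist)), weights=dist, k=1)[0]
        candidates.append((dist[token], pos, token))  # (confidence, position, token)
    candidates.sort(reverse=True)
    new_seq = list(seq)
    for _, pos, token in candidates[:keep]:
        new_seq[pos] = token  # this position becomes a newly added visual token
    return new_seq


previous = [7, MASK, 4, MASK]
predicted = {1: [0.0, 1.0, 0.0], 3: [0.5, 0.5, 0.0]}
new_sequence = unmask_step(previous, predicted, keep=1, rng=random.Random(0))
print(new_sequence)
```

Position 1 is predicted with full confidence and is filled in, while position 3 is less certain and stays masked for a later iteration.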

[0071] Specifically, the visual representation features are used to decode the mask representation based on cross-self-attention interaction to obtain a new image representation sequence, including: generating key representations and value representations according to the visual representation features and the mask representation features corresponding to the mask representations; using the mask representation features as query representations; and performing attention interaction based on the key representations, the value representations and the query representations to obtain a new image representation sequence.

[0072] In this way, the encoded visual representation (i.e., the visual representation features) is effectively integrated into the decoding of the mask representation through cross-self-attention interaction, and unknown image content is quickly and correctly predicted from visible and reliable image information, yielding new visual representations and thus a new image representation sequence.
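
The cross-self-attention interaction of [0071] can be sketched in a few lines. For brevity this sketch assumes a single head and omits the learned query, key and value projections, so the features themselves are used directly. The function name `cross_self_attention` is hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_self_attention(visual_feats: Matrix, mask_feats: Matrix) -> Matrix:
    """Queries come from mask features; keys and values from visual + mask features."""
    keys = visual_feats + mask_feats  # concatenation along the sequence axis
    values = keys                     # no separate value projection (assumed)
    dim = len(mask_feats[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in mask_feats:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * val[j] for w, val in zip(weights, values)) for j in range(dim)]
        )
    return outputs


visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
masked = [[0.5, 0.5], [0.0, 0.0]]
decoded = cross_self_attention(visual, masked)
print(len(decoded), len(decoded[0]))
```

Only the mask positions produce outputs, so the visual representation features are read but never recomputed during decoding.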

[0073] Step S340: Perform multiple iterations of image representation sequence generation according to the above steps. If the new image representation sequence does not contain mask representation, generate an image based on the new image representation sequence.

[0074] In this embodiment of the application, multiple image representation sequence iterations are performed according to steps S310 to S330 above. As the image representation sequence is iteratively generated, the number of mask representations decoded in the image representation sequence gradually increases. When a new image representation sequence generated in a certain iteration does not contain mask representations, it means that all image information is known. At this time, the iterative generation ends, and an image is generated according to the current new image representation sequence.
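
The termination condition of this loop can be sketched as below. The step function is a hypothetical stand-in for steps S320 and S330: it simply fills the first two masked positions with their own index, whereas a real step would encode, decode and keep reliable predictions. `MASK`, `generate_tokens` and `toy_step` are assumed names.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # hypothetical placeholder for a mask representation
Sequence = List[Optional[int]]


def generate_tokens(
    length: int,
    step_fn: Callable[[Sequence], Sequence],
    max_steps: int = 100,
) -> Tuple[Sequence, int]:
    """Iterate from an all-mask sequence until no mask representation remains."""
    seq: Sequence = [MASK] * length  # initial input: mask representations only
    steps = 0
    while MASK in seq:
        if steps >= max_steps:
            raise RuntimeError("generation did not finish within max_steps")
        seq = step_fn(seq)
        steps += 1
    return seq, steps


def toy_step(seq: Sequence) -> Sequence:
    # Stand-in for encode + decode: reveal up to two masked positions per step.
    new_seq = list(seq)
    filled = 0
    for i, tok in enumerate(new_seq):
        if tok is MASK and filled < 2:
            new_seq[i] = i
            filled += 1
    return new_seq


tokens, steps = generate_tokens(7, toy_step)
print(tokens, steps)
```

With seven positions and two revealed per step, the loop ends after four iterations, at which point the sequence would be handed to the image decoder.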

[0075] Specifically, generating an image based on the new image representation sequence includes: decoding the new image representation sequence using a decoder to obtain the corresponding image. For example, generating an image based on the new image representation sequence can be represented as:

[0076]

[0077] where D represents the generated image, VQ denotes the decoder, and the decoder input is the new image representation sequence, in which all image representations are visual representations.

[0078] Through the above implementation process, in each generation of the image representation sequence, the current visible and reliable image information is encoded by independently encoding visual representations (i.e., visual tokens). Then, based on the encoded visual representation features, the mask representation (i.e., mask token) is decoded to correctly predict unknown image content based on the visible and reliable image information, thus obtaining a new image representation sequence. Because this method performs different types of attention interactions independently during the generation of the representation sequence, it does not need to execute all types of attention interactions, only the key attention interactions—namely, the self-attention interaction of the visual representation and the cross-self-attention interaction of the mask representation with respect to the visual representation—thus reducing the amount of data computation in the image generation process and achieving efficient image generation.
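
A rough count of attention query-key pairs shows where the saving comes from. This is a back-of-the-envelope sketch under stated assumptions, not a reproduction of any measured cost: it counts attention score pairs only and ignores projections and feed-forward layers. The 256-token sequence, the 64/192 split and the 16-layer budget are illustrative values, and both function names are hypothetical.

```python
def full_attention_pairs(n_visual: int, n_mask: int, layers: int) -> int:
    """Every token attends to every token in every layer."""
    n_total = n_visual + n_mask
    return layers * n_total * n_total


def disentangled_pairs(
    n_visual: int, n_mask: int, enc_layers: int, dec_layers: int
) -> int:
    """Encoder: visual-to-visual only. Decoder: mask queries over visual + mask keys."""
    encoder = enc_layers * n_visual * n_visual
    decoder = dec_layers * n_mask * (n_visual + n_mask)
    return encoder + decoder


# Illustrative step: 256 positions, 64 already decoded, 192 still masked.
baseline = full_attention_pairs(64, 192, layers=16)
proposed = disentangled_pairs(64, 192, enc_layers=15, dec_layers=1)
print(baseline, proposed)  # 1048576 110592
```

The gap is largest early in generation, when few positions are visible, and narrows as more positions are decoded and the encoder has more visual representations to process.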

[0079] In one specific embodiment, the new image representation sequence is generated based on a disentanglement model, which includes a multi-layer encoder and a multi-layer decoder, wherein the number of layers in the multi-layer encoder is greater than the number of layers in the multi-layer decoder; the multi-layer encoder is used to encode the visual representation based on self-attention interaction, and the multi-layer decoder is used to decode the mask representation based on cross-self-attention interaction according to the visual representation features.

[0080] Figure 4 is a schematic diagram of the structure of a disentanglement model provided in an embodiment of this application. As shown in Figure 4, each encoder layer includes a self-attention interaction layer and a feed-forward layer, and each decoder layer includes a cross-self-attention interaction layer and a feed-forward layer.

[0081] Specifically, the disentanglement model independently encodes the visual representations to obtain visual representation features. Then, based on these visual representation features, the mask representations are decoded using cross-self-attention interaction to predict unknown image content, resulting in a new image representation sequence. Compared with generating image representation sequences using non-autoregressive transformers, generating new image representation sequences with the disentanglement model requires only the key attention interactions, namely self-attention interaction among the visual representations and cross-self-attention interaction of the mask representations with respect to the visual representations, thus achieving efficient image generation.

[0082] Furthermore, the performance of the disentanglement model was analyzed by varying the number of layers in the encoder and decoder, as shown in Table 1 (GFLOPs denotes billions of floating-point operations). Increasing the number of encoder layers while decreasing the number of decoder layers improves the performance of the disentanglement model. Therefore, in the disentanglement model, the number of layers in the multi-layer encoder is greater than the number of layers in the multi-layer decoder, so that more computation is allocated to the visual representations, significantly improving performance without sacrificing efficiency. The computation for the mask representations can be reduced to as little as a single decoder layer. This further illustrates the importance of processing the visual representations independently in achieving efficient image generation in the embodiments of this application.

[0083] Table 1: Effect of the number of encoder and decoder layers on the performance of the disentanglement model.

[0084]
Encoder layers | Decoder layers | GFLOPs | FID
8 | 8 | 40.2 | 5.50
12 | 4 | 38.2 | 4.98
15 | 1 | 39.8 | 4.78

[0085] In this embodiment, to understand the new image representation sequence generated in the current iteration, the correlation between successively generated image representation sequences is analyzed by measuring the similarity between the new image representation sequences generated in two adjacent iterations (a pre-calculated similarity is used for this assessment). It was found that the image representations at some "key locations" differ significantly, while the image representations at other locations are highly similar. Here, "key locations" refers to the locations corresponding to the newly added visual representations. In other words, the significance of generating a new image representation sequence in each iteration lies in updating the newly decoded image representations, while the computation for most other image representations is repetitive.
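By way of a non-limiting illustration, the comparison between two adjacent iterations can be sketched as follows. The sketch assumes cosine similarity as the similarity measure and a threshold of 0.9, neither of which is specified in this application; the function names are likewise hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def key_locations(prev_feats, curr_feats, threshold=0.9):
    """Positions whose features changed markedly between adjacent iterations."""
    return [
        i
        for i, (p, c) in enumerate(zip(prev_feats, curr_feats))
        if cosine_similarity(p, c) < threshold
    ]


if __name__ == "__main__":
    # Hypothetical per-position features from two adjacent iterations.
    prev_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    curr_feats = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
    print(key_locations(prev_feats, curr_feats))
```

Positions that are not returned are candidates for feature reuse, since their representations barely change from one iteration to the next.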

[0086] This application provides another image generation method based on spatiotemporal data interaction. Specifically, in the new image representation sequences generated in two adjacent iterations, the image representations at the newly added visual representation locations differ significantly, while the image representations at other locations are similar. Therefore, this method encodes only the newly added visual representations and maximizes the reuse of features from the previous image representation sequence to supplement the necessary information. As shown in Figure 5, the method includes the following steps S510 to S540:

[0087] Step S510: Obtain the previous image representation sequence; the image representation sequence includes visual representation and mask representation, the mask representation represents unknown image content, and the visual representation represents known image content, used to provide image information for the mask representation to infer the unknown image content.

[0088] Step S520: Determine the newly added visual representation from the previous image representation sequence; encode the newly added visual representation based on cross-self-attention interaction to obtain newly added visual representation features.

[0089] Step S530: Concatenate the features of the previous image representation sequence with the newly added visual representation features to obtain concatenated representation features; use the concatenated representation features to decode the mask representation based on cross-self-attention interaction to obtain a new image representation sequence.

[0090] Step S540: Perform multiple iterations of image representation sequence generation according to the above steps. If the new image representation sequence does not contain mask representation, generate an image based on the new image representation sequence.
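By way of a non-limiting illustration, steps S510 to S540 can be sketched as the loop below. The per-position feature cache, the callables `encode_new` and `decode_masks`, the toy stand-ins for them, and the rule of revealing a fixed number of mask positions per iteration are all simplifying assumptions made for this sketch and are not taken from this application.

```python
MASK = None  # placeholder for a mask representation (unknown content)


def generate(seq, encode_new, decode_masks, reveal_per_step=2):
    """Iterate until no mask remains, encoding each visible token only once."""
    seq = list(seq)
    cache = {}         # position -> feature computed in an earlier iteration
    encoded_count = 0  # how many tokens were ever passed to the encoder
    while any(tok is MASK for tok in seq):  # S540: stop when no mask is left
        # S520: only visible positions without a cached feature are new.
        new_pos = [i for i, tok in enumerate(seq)
                   if tok is not MASK and i not in cache]
        new_feats = {i: encode_new(seq[i], cache) for i in new_pos}
        encoded_count += len(new_pos)
        # S530: combine reused and new features, then decode the masks.
        cache.update(new_feats)
        predictions = decode_masks(seq, cache)
        mask_pos = [i for i, tok in enumerate(seq) if tok is MASK]
        for i in mask_pos[:reveal_per_step]:
            seq[i] = predictions[i]
    return seq, encoded_count


def toy_encode(token, cache):
    """Hypothetical encoder: the feature is just the token value."""
    return float(token)


def toy_decode(seq, cache):
    """Hypothetical decoder: predict the rounded mean of the cached features."""
    mean = sum(cache.values()) / len(cache) if cache else 0.0
    return {i: round(mean) for i, tok in enumerate(seq) if tok is MASK}


if __name__ == "__main__":
    result, encoded = generate([5, MASK, MASK, 7, MASK, MASK],
                               toy_encode, toy_decode)
    print(result, encoded)
```

In this toy run the encoder is invoked once per token that becomes visible before the final iteration, rather than once per visible token per iteration, which is the source of the saving described above.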

[0091] In this embodiment, since the image representations at the newly added visual representation positions in two adjacent newly generated image representation sequences differ significantly, while the image representations at other positions are similar, only the newly added visual representation is encoded to inject knowledge about the image, thereby obtaining new visual representation features. Specifically, in step S520, "encoding the newly added visual representation based on cross-self-attention interaction" includes: processing the features of the previous image representation sequence using a lightweight projection network to obtain projection features, and using the projection features to encode the newly added visual representation based on cross-self-attention interaction, so that the newly added visual representation is integrated with the knowledge about the image from the previous steps, thereby obtaining new visual representation features.

[0092] Furthermore, to ensure the quality of the generated image, features from the previous image representation sequence are directly reused to aid the decoding of the mask representation in the current iteration. In some embodiments, the concatenated representation features can be obtained by concatenating the projection features corresponding to the features of the previous image representation sequence with the newly added visual representation features. In step S530, "decoding the mask representation based on cross-self-attention interaction using the concatenated representation features" includes: generating key representations and value representations based on the concatenated representation features and the mask representation features corresponding to the mask representations; using the mask representation features as query representations; and then performing attention interaction based on the key representations, value representations, and query representations to obtain a new image representation sequence.
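By way of a non-limiting illustration, the attention interaction described above can be sketched as follows: the mask representation features serve as queries, while keys and values are built from the context features together with the mask representation features. The sketch uses a single head, identity projections in place of learned weights, and scaled dot-product attention; these choices and the function names are assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_self_attention(context_feats, mask_feats):
    """Mask features act as queries; keys and values come from the
    context features together with the mask features themselves."""
    keys_values = context_feats + mask_feats  # identity projections (assumption)
    dim = len(mask_feats[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in mask_feats:
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in keys_values]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values))
             for j in range(dim)]
        )
    return outputs


if __name__ == "__main__":
    context = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical context features
    masks = [[0.0, 0.0]]                # hypothetical mask feature
    print(cross_self_attention(context, masks))
```

Only the mask positions produce outputs, so the context features are read but never recomputed in this step.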

[0093] The image representation sequence is generated iteratively multiple times according to steps S510 to S530. As the image representation sequence is generated iteratively, the number of mask representations that are decoded in the image representation sequence gradually increases. When a new image representation sequence generated in a certain iteration does not contain a mask representation, it means that all the image information is known. At this time, the iterative generation ends, and an image is generated according to the current new image representation sequence.

[0094] For example, the new image representation sequence generated according to steps S510 to S530 can be represented as follows:

[0095] $z = \mathrm{Forward}(v_{\Delta}, v_{M}, f(z_{\mathrm{prev}}))$,

[0096] where $z$ denotes the features of the new image representation sequence, $\mathrm{Forward}$ denotes forward propagation, $v_{\Delta}$ denotes the newly added visual representation, $v_{M}$ denotes the mask representation, $z_{\mathrm{prev}}$ denotes the features of the previous image representation sequence, and $f(\cdot)$ denotes the lightweight projection module.

[0097] In this embodiment of the application, the new image representation sequence is generated based on a disentanglement model. Referring to Figure 6, which is a schematic diagram of an image representation sequence generation method provided in an embodiment of this application, for the generation of the T-th image representation sequence, only the visual representation newly added in the (T-1)-th generation is encoded, and the features of the (T-1)-th image representation sequence are reused to supplement the necessary information. Specifically, the features of the (T-1)-th image representation sequence are processed by a lightweight projection module and then input into the disentanglement model, and the (T-1)-th image representation sequence is also input into the disentanglement model, so that the disentanglement model generates a new image representation sequence according to steps S510 to S530 described above.

[0098] For example, Figure 7 is a schematic diagram of another disentanglement model provided in an embodiment of this application. The disentanglement model includes an encoder and a decoder. In the encoder, the visual representation newly added at the (T-1)-th iteration is encoded using the projection features based on cross-self-attention interaction, so that the newly added visual representation learns knowledge about the image, resulting in newly added visual representation features. The newly added visual representation features and the projection features are then concatenated to obtain concatenated representation features. In the decoder, the concatenated representation features are used to decode the mask representation based on cross-self-attention interaction to obtain a new image representation sequence.

[0099] Through the above implementation process, in each generation of an image representation sequence, only the newly added visual representation is encoded. The features of the previous image representation sequence are concatenated with the newly added visual representation features, and the resulting concatenated representation features are used to decode the mask representation based on cross-self-attention interaction. In this way, previously computed features are reused to the maximum extent to supplement the necessary information in the encoding and decoding of the current iteration, greatly reducing the computational cost and achieving efficient image generation.

[0100] In an optional embodiment, the disentanglement model is trained according to steps A1 to A3:

[0101] Step A1: Construct an image representation sequence set, which includes multiple image representation sequence samples.

[0102] Step A2: Randomly mask the image representations in the image representation sequence sample to obtain an image representation sequence sample containing the masked representations.

[0103] Step A3: Determine the target training strategy according to the target probability, and train the disentanglement model according to the target training strategy to predict the image representation corresponding to the mask representation position based on the image representation sequence sample containing the mask representation.

[0104] In this embodiment, an encoder $\mathcal{E}_{VQ}$ and a quantizer $Q$ can be used to convert multiple original sample images into image representation sequences, resulting in a set of image representation sequences. For example, converting an original sample image $x$ into an image representation sequence sample can be represented as:

[0105] $v = Q(\mathcal{E}_{VQ}(x))$,

[0106] where $v = [v_i]_{i=1:N}$ denotes the image representation sequence sample, and $N$ denotes the length of the image representation sequence sample.
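By way of a non-limiting illustration, the conversion of a sample image into an image representation sequence can be sketched as an encoder that produces one latent per image patch, followed by a quantizer that maps each latent to the index of its nearest codebook entry. The averaging encoder, the one-dimensional codebook, and the nearest-neighbour rule are hypothetical stand-ins chosen for brevity; this application does not specify the internals of either component.

```python
def toy_encoder(patches):
    """Hypothetical stand-in for the encoder: one scalar latent per patch."""
    return [[sum(patch) / len(patch)] for patch in patches]


def quantize(latents, codebook):
    """Map each latent to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(latent, codebook[k]))
        for latent in latents
    ]


if __name__ == "__main__":
    patches = [[0.0, 0.2], [0.9, 1.1], [0.4, 0.6]]  # hypothetical image patches
    codebook = [[0.0], [0.5], [1.0]]                # hypothetical codebook
    v = quantize(toy_encoder(patches), codebook)
    print(v, len(v))  # the sequence and its length N
```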

[0107] The image representations in the image representation sequence samples can then be randomly masked, resulting in image representation sequence samples containing mask representations. These samples are used as input for training the disentanglement model, enabling the model to predict the image representation at each mask representation position based on the unmasked image representations.

[0108] To enable the disentanglement model both to reuse features from previous image representation sequences and to decode mask representations, it is trained according to a target training strategy selected according to a target probability. The target probability is set according to the specific circumstances; in one example, the target probability is 50%.

[0109] In one specific implementation, the target training strategy includes a first training strategy; training the disentanglement model according to the first training strategy to predict the image representation corresponding to the mask representation position based on the image representation sequence samples containing the mask representation includes steps B1 to B4:

[0110] Step B1: Use the image representation sequence sample containing the mask representation as the current image representation sequence sample, and randomly mask the image representation sequence sample containing the mask representation to simulate the previous image representation sequence sample.

[0111] Step B2: Compare the current image representation sequence sample with the previous image representation sequence sample to determine the newly added visual representation sample.

[0112] Step B3: Concatenate the features of the previous image representation sequence sample with the features of the newly added visual representation sample to obtain a concatenated representation feature sample.

[0113] Step B4: Use the concatenated representation feature sample to decode the mask representation in the image representation sequence sample at the previous time step based on cross-self-attention interaction, so as to predict the image representation corresponding to the mask representation position.
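By way of a non-limiting illustration, steps B1 and B2 can be sketched as follows: the previous sample is simulated by masking a further share of the visible positions of the current sample, and the newly added positions are those visible in the current sample but masked in the simulated previous one. The ratio `extra_ratio`, the uniform sampling, and the function names are assumptions made for this sketch; this application does not state how many positions are additionally masked.

```python
import random

MASK = None  # placeholder for a mask representation


def simulate_previous(current, extra_ratio, rng):
    """B1: mask a further share of the visible tokens of the current sample."""
    visible = [i for i, tok in enumerate(current) if tok is not MASK]
    hidden = set(rng.sample(visible, int(len(visible) * extra_ratio)))
    return [MASK if i in hidden else tok for i, tok in enumerate(current)]


def newly_added(current, previous):
    """B2: positions visible in the current sample but masked in the previous one."""
    return [
        i
        for i, (cur, prev) in enumerate(zip(current, previous))
        if cur is not MASK and prev is MASK
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    current = [3, MASK, 8, 1, MASK, 4]  # hypothetical current sample
    previous = simulate_previous(current, 0.5, rng)
    print(previous, newly_added(current, previous))
```

Because the previous sample is derived from the current one, its masked positions always form a superset of the current masked positions, mirroring how masks only disappear over successive iterations at inference time.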

[0114] In this embodiment, by simulating the previous image representation sequence samples, the disentanglement model can reuse the features of the previous image representation sequence samples to supplement the necessary information in the encoding of the newly added visual representation and the decoding of the mask representation. This allows the trained disentanglement model to reuse previously computed features to the maximum extent, greatly reducing computational cost and achieving efficient image generation.

[0115] In one specific implementation, the target training strategy includes a second training strategy; training the disentanglement model according to the second training strategy to predict the image representation corresponding to the mask representation position based on the image representation sequence samples containing the mask representation specifically includes: with the objective of minimizing the negative log-likelihood loss of the mask representation, using the visual representation in the image representation sequence samples to predict the mask representation, thereby obtaining the image representation corresponding to the mask representation position. In this way, the disentanglement model can quickly learn to decode the mask representation based on the visual representation.
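By way of a non-limiting illustration, a negative log-likelihood loss restricted to the mask representation positions can be sketched as follows. The averaging over mask positions, the layout of `probs` as one probability distribution per position, and the function name are assumptions made for this sketch.

```python
import math


def masked_nll(probs, targets, mask_positions):
    """Mean negative log-likelihood of the target tokens at the masked positions."""
    total = sum(-math.log(probs[i][targets[i]]) for i in mask_positions)
    return total / len(mask_positions)


if __name__ == "__main__":
    # Hypothetical predicted distributions over a two-entry vocabulary.
    probs = [[0.5, 0.5], [0.25, 0.75], [0.9, 0.1]]
    targets = [0, 1, 0]      # ground-truth token index at each position
    mask_positions = [0, 1]  # position 2 is visible and contributes no loss
    print(masked_nll(probs, targets, mask_positions))
```

Visible positions are excluded from the sum, so the gradient signal comes only from the content the model has to infer.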

[0116] Through the above implementation process, the disentanglement model is trained using both the first and second training strategies with the target probability. This enables the disentanglement model to reuse features from previous image representation sequences and to decode mask representations. Thus, the disentanglement model allows for the efficient generation of image representation sequences.
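By way of a non-limiting illustration, choosing between the two training strategies for each training step can be sketched as a single random draw. The sketch assumes that the target probability is the probability of selecting the first training strategy, which this application does not state explicitly; the function name and the string labels are hypothetical.

```python
import random


def pick_training_strategy(target_probability, rng):
    """Return "first" with the given probability, otherwise "second"."""
    return "first" if rng.random() < target_probability else "second"


if __name__ == "__main__":
    rng = random.Random(0)
    picks = [pick_training_strategy(0.5, rng) for _ in range(1000)]
    print(picks.count("first"), picks.count("second"))
```

With a target probability of 50%, roughly half of the training steps would exercise feature reuse and the other half would train plain mask decoding.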

[0117] This application also provides an image generation device based on spatiotemporal data interaction. Referring to Figure 8, which is a schematic diagram of an image generation device based on spatiotemporal data interaction provided in an embodiment of this application, the device includes:

[0118] The acquisition module 810 is used to acquire the previous image representation sequence; the image representation sequence includes visual representation and mask representation, the mask representation represents unknown image content, and the visual representation represents known image content, and is used to provide image information for the mask representation to infer the unknown image content;

[0119] The encoding module 820 is used to encode the visual representation based on self-attention interaction to obtain visual representation features;

[0120] The decoding module 830 is used to decode the mask representation based on the visual representation features to obtain a new image representation sequence, wherein the new image representation sequence contains newly added visual representations.

[0121] The generation module 840 is used to perform multiple iterative generation of image representation sequences according to the above steps, and to generate an image based on the new image representation sequence when the new image representation sequence does not contain a mask representation.

[0122] In one optional embodiment, the image representations at newly added visual representation locations in two adjacent newly generated image representation sequences differ significantly, while the image representations at other locations are similar; the visual representation features include newly added visual representation features; the encoding module includes:

[0123] The first determining module is used to determine the new visual representation from the previous image representation sequence;

[0124] The first encoding submodule is used to encode the newly added visual representation based on cross-self-attention interaction to obtain the newly added visual representation features.

[0125] In one optional embodiment, the decoding module includes:

[0126] The first splicing module is used to concatenate the features of the previous image representation sequence with the newly added visual representation features to obtain concatenated representation features;

[0127] The first decoding submodule is used to decode the mask representation based on cross-self-attention interaction using the concatenated representation features to obtain a new image representation sequence.

[0128] In one optional embodiment, the decoding module includes:

[0129] The representation generation module is used to generate key representations and value representations based on the visual representation features and the mask representation features corresponding to the mask representations;

[0130] The query representation module is used to use the mask representation features as query representations;

[0131] An interaction module is used to perform attention interactions based on the key representation, the value representation, and the query representation to obtain a new image representation sequence.

[0132] In one optional embodiment, the new image representation sequence is generated based on a disentanglement model, which includes a multi-layer encoder and a multi-layer decoder, wherein the number of layers in the multi-layer encoder is greater than the number of layers in the multi-layer decoder.

[0133] The multi-layer encoder is used to encode the visual representation based on self-attention interaction, and the multi-layer decoder is used to decode the mask representation based on cross-self-attention interaction according to the visual representation features.

[0134] In an optional embodiment, the apparatus further includes a training module for training a disentangled model, the training module comprising:

[0135] The construction module is used to construct an image representation sequence set, which includes multiple image representation sequence samples;

[0136] The first random masking module is used to randomly mask the image representations in the image representation sequence sample to obtain an image representation sequence sample containing the masked representations.

[0137] The training prediction module is used to determine the target training strategy according to the target probability, and train the disentanglement model according to the target training strategy to predict the image representation corresponding to the mask representation position based on the image representation sequence sample containing the mask representation.

[0138] In one optional embodiment, the target training strategy includes a first training strategy; the training prediction module includes:

[0139] The second random masking module is used to take the image representation sequence sample containing the mask representation as the current image representation sequence sample, and to perform random masking on the image representation sequence sample containing the mask representation to simulate the previous image representation sequence sample.

[0140] The second determining module is used to compare the current image representation sequence sample with the previous image representation sequence sample to determine the newly added visual representation sample;

[0141] The second splicing module is used to concatenate the features of the previous image representation sequence sample with the features of the newly added visual representation sample to obtain a concatenated representation feature sample;

[0142] The training prediction submodule is used to decode, based on cross-self-attention interaction and using the concatenated representation feature sample, the mask representation in the image representation sequence sample at the previous time, so as to predict the image representation corresponding to the mask representation position.

[0143] This application also provides an electronic device. Referring to Figure 9, which is a schematic diagram of the structure of an electronic device provided in an embodiment of this application, the electronic device 900 includes a memory 910 and a processor 920. The memory 910 and the processor 920 are connected via a bus for communication. The memory 910 stores a computer program that can run on the processor 920 to implement the steps of the image generation method based on spatiotemporal data interaction described in the embodiments of this application.

[0144] This application also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of the image generation method based on spatiotemporal data interaction described in this application.

[0145] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the image generation method based on spatiotemporal data interaction described in this application.

[0146] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.

[0147] This application is described with reference to flowcharts and/or block diagrams of methods, apparatus, and devices according to embodiments of this application. It should be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing terminal device, produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0148] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0149] These computer program instructions can also be loaded onto a computer or other programmable data processing terminal device, causing a series of operational steps to be performed on the computer or other programmable terminal device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0150] Although preferred embodiments of the present application have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of the embodiments of the present application.

[0151] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes said element.

[0152] The foregoing has provided a detailed description of an image generation method, apparatus, and electronic device based on spatiotemporal data interaction provided in this application. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of this application. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this application. Therefore, the content of this specification should not be construed as a limitation of this application.

Claims

1. An image generation method based on spatiotemporal data interaction, characterized in that it comprises: obtaining the previous image representation sequence, wherein the image representation sequence includes a visual representation and a mask representation, the mask representation represents unknown image content, and the visual representation represents known image content and is used to provide image information for the mask representation to infer the unknown image content; encoding the visual representation based on self-attention interaction to obtain visual representation features; decoding the mask representation based on cross-self-attention interaction according to the visual representation features to obtain a new image representation sequence, wherein the new image representation sequence includes newly added visual representations; and performing multiple iterations of image representation sequence generation according to the above steps, and generating an image based on the new image representation sequence if the new image representation sequence does not contain a mask representation; wherein the new image representation sequence is generated based on a disentanglement model, which includes a multi-layer encoder and a multi-layer decoder, the number of layers in the multi-layer encoder being greater than the number of layers in the multi-layer decoder; the multi-layer encoder of the disentanglement model is used to encode the visual representation based on self-attention interaction to obtain the visual representation features; and the multi-layer decoder of the disentanglement model is used to decode the mask representation based on cross-self-attention interaction according to the visual representation features, so as to predict the unknown image content and obtain the new image representation sequence.

2. The method according to claim 1, characterized in that, in the new image representation sequences generated in two adjacent iterations, the image representations at the newly added visual representation locations differ significantly, while the image representations at other locations are similar; the visual representation features include newly added visual representation features; and encoding the visual representation based on self-attention interaction to obtain visual representation features includes: determining the newly added visual representation from the previous image representation sequence; and encoding the newly added visual representation based on cross-self-attention interaction to obtain the newly added visual representation features.

3. The method according to claim 2, characterized in that decoding the mask representation based on cross-self-attention interaction according to the visual representation features to obtain a new image representation sequence includes: concatenating the features of the previous image representation sequence with the newly added visual representation features to obtain concatenated representation features; and decoding the mask representation based on cross-self-attention interaction using the concatenated representation features to obtain the new image representation sequence.

4. The method according to claim 1, characterized in that decoding the mask representation based on cross-self-attention interaction according to the visual representation features to obtain a new image representation sequence includes: generating key representations and value representations based on the visual representation features and the mask representation features corresponding to the mask representation; using the mask representation features as query representations; and performing attention interaction based on the key representations, the value representations, and the query representations to obtain the new image representation sequence.

5. The method according to claim 1, characterized in that the disentanglement model is trained according to the following steps: constructing an image representation sequence set, which includes multiple image representation sequence samples; randomly masking the image representations in the image representation sequence samples to obtain image representation sequence samples containing mask representations; and determining the target training strategy according to the target probability, and training the disentanglement model according to the target training strategy to predict the image representation corresponding to the mask representation position based on the image representation sequence sample containing the mask representation.

6. The method according to claim 5, characterized in that the target training strategy includes a first training strategy, and training the disentanglement model according to the first training strategy to predict the image representation corresponding to the mask representation position based on the image representation sequence samples containing the mask representation includes: using the image representation sequence sample containing the mask representation as the current image representation sequence sample, and randomly masking the image representation sequence sample containing the mask representation to simulate the previous image representation sequence sample; comparing the current image representation sequence sample with the previous image representation sequence sample to determine the newly added visual representation sample; concatenating the features of the previous image representation sequence sample with the features of the newly added visual representation sample to obtain a concatenated representation feature sample; and decoding the mask representation in the image representation sequence sample at the current time using the concatenated representation feature sample, so as to predict the image representation corresponding to the mask representation position.

7. An image generation device based on spatiotemporal data interaction, characterized in that it comprises: an acquisition module, used to acquire the previous image representation sequence, wherein the image representation sequence includes a visual representation and a mask representation, the mask representation represents unknown image content, and the visual representation represents known image content and is used to provide image information for the mask representation to infer the unknown image content; an encoding module, used to encode the visual representation based on self-attention interaction to obtain visual representation features; a decoding module, used to decode the mask representation based on the visual representation features to obtain a new image representation sequence, wherein the new image representation sequence contains newly added visual representations; and a generation module, used to perform multiple iterations of image representation sequence generation and to generate an image based on the new image representation sequence when the new image representation sequence does not contain a mask representation; wherein the new image representation sequence is generated based on a disentanglement model, which includes a multi-layer encoder and a multi-layer decoder, the number of layers in the multi-layer encoder being greater than the number of layers in the multi-layer decoder; the multi-layer encoder is used to encode the visual representation based on self-attention interaction to obtain the visual representation features; and the multi-layer decoder is used to decode the mask representation based on cross-self-attention interaction according to the visual representation features, so as to predict the unknown image content and obtain the new image representation sequence.

8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the steps of the image generation method based on spatiotemporal data interaction as described in any one of claims 1-6.

9. A computer-readable storage medium having a computer program stored thereon, characterized in that, when executed by a processor, the computer program implements the steps of the image generation method based on spatiotemporal data interaction as described in any one of claims 1-6.