Video generation method and apparatus, medium, electronic device, and program product

WO2026123947A1 · PCT designated stage · Publication Date: 2026-06-18 · TENCENT TECHNOLOGY (SHENZHEN) CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
TENCENT TECHNOLOGY (SHENZHEN) CO LTD
Filing Date
2025-10-20
Publication Date
2026-06-18

AI Technical Summary

Technical Problem

Existing content generation models rely on fixed reference conditions for user input, resulting in a lack of stylistic diversity in the generated video content and difficulty in accurately reproducing the inherent stylistic differences between different speakers.

Method used

By acquiring video style parameter distribution data that conforms to a preset probability distribution, and combining it with multimodal parameters for encoding and decoding, a video frame sequence that matches the video style is generated. A Gaussian distribution model is used to fit the distribution state of the video style parameters, thereby improving style diversity.

Benefits of technology

The style diversity of the generated video content has been optimized, which can better reflect the inherent style differences of different speakers and generate more realistic video content.

✦ Generated by Eureka AI based on patent content.


Abstract

A video generation method, executed by an electronic device, and comprising: acquiring video style parameter distribution data matching a preset probability distribution, the video style parameter distribution data being used for describing a display form of the preset probability distribution and being used for determining video style parameters, and the display form of the preset probability distribution describing the distribution of the video style parameters in a preset dimension (S210); acquiring multi-modal parameters used for describing video content of a preset video, the multi-modal parameters including audio parameters and image parameters (S220); and on the basis of the video style parameter distribution data, coding / decoding the multi-modal parameters to obtain a video frame sequence that comprises the video content and matches a video style indicated by the video style parameters (S230).

Description

Video generation method and apparatus, medium, electronic device, and program product

[0001] Related applications

[0002] This application claims priority to Chinese patent application filed on December 11, 2024, with application number 202411827747.4, entitled "Video Generation Method, Apparatus, Medium, Electronic Equipment and Program Product", the entire contents of which are incorporated herein by reference.

Technical Field

[0003] This application belongs to the field of artificial intelligence technology, specifically relating to a video generation method, a video generation device, a computer-readable medium, an electronic device, and a computer program product.

Background Technology

[0004] Generative artificial intelligence (Artificial Intelligence Generated Content, AIGC) refers to technology that uses artificial intelligence techniques, such as generative adversarial networks and large-scale pre-trained models, to learn from and recognize patterns in existing data and then generate relevant content with appropriate generalization capability.

[0005] The core idea of AIGC technology is to use artificial intelligence algorithms to generate content with a certain degree of creativity and quality. By building a content generation model and training it on a large amount of data, AIGC can generate relevant content based on input conditions or guiding information. For example, by inputting keywords, descriptions, or samples, AIGC can generate matching multimedia content such as articles, images, videos, and audio.

[0006] Existing content generation models typically rely on fixed reference conditions or guiding information input by users, resulting in a serious problem of homogenization in the style of the generated content and a lack of stylistic diversity.

Summary of the Invention

[0007] This application provides a video generation method, a video generation apparatus, a computer-readable medium, an electronic device, and a computer program product, with the aim of optimizing the stylistic diversity of video generation content.

[0008] According to one aspect of the embodiments of this application, a video generation method is provided, executed by an electronic device, including:

[0009] Obtain video style parameter distribution data that conforms to a preset probability distribution. The video style parameter distribution data is used to describe the manifestation of the preset probability distribution and to determine the video style parameters. The manifestation of the preset probability distribution describes the distribution of the video style parameters in a preset dimension.

[0010] Obtain multimodal parameters for describing the video content of a preset video, the multimodal parameters including audio parameters and image parameters; and

[0011] Based on the video style parameter distribution data, the multimodal parameters are encoded and decoded to obtain a video frame sequence that includes the video content and matches the video style indicated by the video style parameters.

[0012] According to one aspect of the embodiments of this application, a video generation apparatus is provided, the apparatus comprising:

[0013] The first acquisition module is configured to acquire video style parameter distribution data that conforms to a preset probability distribution. The video style parameter distribution data is used to describe the appearance of the preset probability distribution and to determine the video style parameters. The appearance of the preset probability distribution describes the distribution of the video style parameters in a preset dimension.

[0014] The second acquisition module is configured to acquire multimodal parameters used to describe the video content of a preset video, the multimodal parameters including audio parameters and image parameters; and

[0015] The encoding / decoding module is configured to encode and decode the multimodal parameters based on the video style parameter distribution data to obtain a video frame sequence that includes the video content and matches the video style indicated by the video style parameters.

[0016] According to one aspect of the embodiments of this application, a computer-readable medium is provided having a computer program stored thereon, which, when executed by a processor, implements the video generation method as described in the above technical solutions.

[0017] According to one aspect of the embodiments of this application, an electronic device is provided, the electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to execute the executable instructions to implement the video generation method as described in the above technical solutions.

[0018] According to one aspect of the embodiments of this application, a computer program product is provided, including a computer program that, when executed by a processor, implements the video generation method as described in the above technical solutions.

[0019] Details of one or more embodiments of this application are set forth in the following drawings and description. Other features, objects, and advantages of this application will become apparent from the specification, drawings, and claims.

Attached Figure Description

[0020] To more clearly illustrate the technical solutions in the embodiments of this application, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only embodiments of this application. For those skilled in the art, other drawings can be obtained based on the published drawings without creative effort.

[0021] Figure 1 shows a system architecture block diagram of the application of the technical solution of this application.

[0022] Figure 2 shows a flowchart of a video generation method in one embodiment of this application.

[0023] Figure 3 shows a flowchart of obtaining video style parameters in one embodiment of this application.

[0024] Figure 4 shows a flowchart of extracting style modality features in one embodiment of this application.

[0025] Figure 5 shows a block diagram of a model structure for extracting style modal features from a style reference video in one embodiment of this application.

[0026] Figure 6 shows a flowchart of multimodal feature fusion based on cross-attention mechanism in one embodiment of this application.

[0027] Figure 7 illustrates a schematic diagram of the process of mapping style modality features to a Gaussian distribution model in one embodiment of this application.

[0028] Figure 8 shows a flowchart of training a style mapping activation layer in one embodiment of this application.

[0029] Figure 9 shows a flowchart of the encoding and decoding process of input parameters in one embodiment of this application.

[0030] Figure 10 shows a structural block diagram of a video generation model based on a diffusion model in one embodiment of this application.

[0031] Figure 11 schematically shows a structural block diagram of the video generation apparatus provided in an embodiment of this application.

[0032] Figure 12 schematically illustrates a computer system architecture block diagram suitable for implementing electronic devices according to embodiments of the present application.

Detailed Implementation

[0033] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0034] Furthermore, the described features, structures, or characteristics can be combined in any suitable manner in one or more embodiments. Numerous specific details are provided in the following description to give a thorough understanding of embodiments of this application. However, those skilled in the art will recognize that the technical solutions of this application can be practiced without one or more of the specific details, or other methods, components, apparatuses, steps, etc., can be employed. In other instances, well-known methods, apparatuses, implementations, or operations are not shown or described in detail to avoid obscuring various aspects of this application.

[0035] In this application embodiment, the terms "module" or "unit" refer to a computer program or part of a computer program that has a predetermined function and works with other related parts to achieve a predetermined goal, and can be implemented wholly or partially using software, hardware (such as processing circuitry or memory), or a combination thereof. Similarly, a processor (or multiple processors or memory) can be used to implement one or more modules or units. Furthermore, each module or unit can be part of an overall module or unit that includes the functionality of that module or unit.

[0036] The block diagrams shown in the accompanying drawings are merely functional entities and do not necessarily correspond to physically independent entities. That is, these functional entities can be implemented in software, in one or more hardware modules or integrated circuits, or in different network and / or processor devices and / or microcontroller devices.

[0037] The flowcharts shown in the accompanying drawings are merely illustrative and do not necessarily include all content and operations / steps, nor do they necessarily have to be performed in the described order. For example, some operations / steps can be broken down, while others can be combined or partially combined; therefore, the actual execution order may change depending on the specific circumstances.

[0038] Figure 1 shows a system architecture block diagram of the application of the technical solution of this application.

[0039] As shown in Figure 1, the system architecture applying the technical solution of this application may include a terminal device 110 and a server 130. The terminal device 110 may include various electronic devices such as smartphones, tablets, laptops, desktop computers, smart wearable devices, and smart in-vehicle devices. The server 130 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. Various types of communication media may be included between the terminal device 110 and the server 130 to provide a communication link, such as a wired communication link or a wireless communication link.

[0040] Content generation model 120 is a machine learning model used to perform content generation tasks, such as generating multimedia content (text, images, audio, or video) based on reference conditions or guiding information input by the user. Take a dialogue avatar generation model as an example: by combining artificial intelligence, computer vision, and deep learning technologies, it can convert static images, audio, or video into dynamic avatars that speak and change facial expressions. Such a model can be applied in fields such as virtual reality, digital humans, game development, film and television production, education and training, customer service, social media, and assistive healthcare.

[0041] In one application scenario of this application embodiment, the content generation model 120 can be pre-deployed on the server 130, and the server 130 can train the content generation model 120. During the model training process, the loss error can be determined based on the learning results of the content generation model 120 on the training samples, and then the model parameters of the content generation model 120 can be iteratively updated based on the loss error. Through continuous training, the loss error of the model can be gradually reduced, and the content generation accuracy of the model can be improved.
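
The iterative update described above can be illustrated with a minimal sketch. It assumes a toy linear model, a mean squared error loss, and plain gradient descent as stand-ins; the source does not specify the structure of content generation model 120, its loss function, or its optimizer, and all names below are hypothetical.

```python
import numpy as np

# Toy training samples (hypothetical stand-in for real training data).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w

# Model parameters, updated iteratively based on the loss error.
w = np.zeros(3)
learning_rate = 0.1  # assumed value, chosen only for this toy problem
losses = []

for step in range(100):
    prediction = X @ w                 # model output on the training samples
    error = prediction - y
    loss = np.mean(error ** 2)         # loss error (mean squared error here)
    losses.append(loss)
    gradient = 2.0 * X.T @ error / len(y)
    w -= learning_rate * gradient      # iterative parameter update
```

In this sketch the recorded loss shrinks from step to step, which mirrors the gradual reduction of the loss error described for the training process.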

[0042] Once the content generation model 120 is trained, it can provide content generation services to the terminal device 110. For example, the terminal device 110 can upload user-inputted reference conditions or guidance information (using a reference image as an example in the diagram) to the server 130. The content generation model 120 deployed on the server 130 processes the data based on the reference conditions or guidance information and outputs automatically generated multimedia content (using video generation as an example in the diagram). The server 130 then returns the generated multimedia content to the terminal device 110, which presents the multimedia content to the user or uses the multimedia content to fulfill other application requirements.

[0043] In other application scenarios, the trained content generation model 120 can also be directly deployed to the terminal device 110, enabling the terminal device 110 to run the content generation model locally. When content generation is required, the terminal device 110 can input reference conditions or guidance information into the trained content generation model 120, which will then process the data according to the reference conditions or guidance information and output automatically generated multimedia content.

[0044] In the related technologies of this application, the reference conditions or guidance information input by the user may include various types of reference data such as text, images, audio, and video. Since the input information is usually deterministic and static, the content generated from it is often severely homogenized and lacks stylistic diversity and vividness, which limits its usefulness in real-world scenarios.

[0045] For example, Talking Head Generation (THG), a human-centered task, aims to generate talking head videos under conditional guidance such as speech and images. It has wide applications in scenarios such as digital humans, filmmaking, and virtual reality. THG is one of the most challenging tasks in video generation because it has low tolerance for artifacts and requires high fidelity in aspects such as lip movements, facial expressions, and head movements.

[0046] Following commonly used generative models, the THG method based on Generative Adversarial Networks (GANs) has achieved remarkable results in generating high-resolution videos, particularly in visual quality and lip-sync accuracy, through adversarial training between the generator and discriminator. On the other hand, the THG method based on diffusion models excels in generating high-quality and high-resolution images and videos, and surpasses GANs in terms of the stability and consistency of generated content, thus becoming the mainstream method of THG.

[0047] These methods have largely advanced talking head generation (THG) by reinforcing explicit control conditions such as facial keypoints and head movement sequences. However, they often overlook a crucial fact about talking head videos: when different people speak in real life, their habits and emotions can vary significantly across situations. This variation leads to different attributes in the corresponding talking head videos, including visemes and facial expressions. These habits and emotions are therefore embedded in the intrinsic style of the talking head video. This intrinsic style is highly correlated with the realism of the video but is difficult to infer from previously widely adopted conditions such as facial keypoints. As a result, previous methods struggled to reproduce reality accurately when there was a significant gap in intrinsic style between the speaker in the reference face and the speaker in the style reference video.

[0048] To address the problems existing in the above-mentioned related technologies, this application proposes an implementation scheme for generating video content with diverse styles based on video style parameters that conform to a probability distribution. Since a probabilistic style prior is designed, the video style parameters can be resampled from the predicted probability distribution model, providing sufficient variation for the relevant information of the video style, thereby enabling the video generation model to have strong generalization ability.

[0049] To facilitate the explanation of the implementation details of the solution, the following scenario description of the embodiments of this application mainly uses the generation of dialogue avatars as an example, but this application is not limited to this.

[0050] The following detailed description, in conjunction with specific embodiments, outlines the technical solutions provided in this application, including video generation methods, video generation devices, computer-readable media, electronic devices, and computer program products.

[0051] Figure 2 shows a flowchart of a video generation method according to one embodiment of this application. This video generation method can be executed by the terminal device or server shown in Figure 1, or it can be executed jointly by the terminal device and the server. This embodiment of the application uses a video generation method executed by a terminal device as an example for illustration. As shown in Figure 2, the video generation method may include the following steps S210 to S230.

[0052] S210: Obtain video style parameter distribution data that conforms to a preset probability distribution. The video style parameter distribution data is used to describe the appearance of the preset probability distribution and to determine the video style parameters. The appearance of the preset probability distribution describes the distribution of the video style parameters in a preset dimension.

[0053] Video style parameters are parameters that quantify and represent the style of a video using data. They are obtained by processing the qualitative descriptions of video style. Determined by video style parameter distribution data, they reflect the specific artistic style and emotional atmosphere presented by the video in various aspects such as color, composition, editing rhythm, music selection, use of special effects, and emotional expression of characters. They are used to control the style type of the generated video.

[0054] S220: Obtain multimodal parameters used to describe the video content of the preset video. The multimodal parameters include audio parameters and image parameters.

[0055] Multimodal parameters refer to data parameters used to describe the video content of a preset video, which include data from various information sources or in various forms. In this application, they mainly include audio parameters and image parameters. Through the combined effect of these different modal parameters, the video content can be described more comprehensively and accurately, providing a rich information foundation for video generation.

[0056] S230: Based on the video style parameter distribution data, the multimodal parameters are encoded and decoded to obtain a video frame sequence that includes video content and matches the video style indicated by the video style parameters.

[0057] A video frame sequence is a sequence of video frames arranged chronologically. These frames contain the specific content of the video and match the video style indicated by the video style parameters. The sequence is the final output of video generation: it is produced by encoding and decoding the multimodal parameters in combination with the video style parameter distribution data, and it presents a complete video with a specific style and content.

[0058] In some embodiments, the multimodal parameters can be processed using a pre-built encoding / decoding neural network based on the video style parameter distribution data. This neural network first concatenates and fuses the video style parameter distribution data and the multimodal parameters to obtain a fused feature vector. Then, it performs a nonlinear transformation on the fused feature vector through multiple fully connected layers, with each fully connected layer followed by an activation function (such as the ReLU function) to increase the model's nonlinear expressive power. Finally, the output layer produces a sequence of video frames that includes the video content and matches the video style indicated by the video style parameters.
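
The processing flow in this embodiment (concatenation, several fully connected layers each followed by ReLU, then an output layer) can be sketched as follows. This is a minimal illustration with randomly initialized, untrained weights; the vector sizes, layer widths, frame shape, and all function names are assumptions made for the example and do not come from the source.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_layers(sizes, rng):
    """Create (weight, bias) pairs for consecutive fully connected layers."""
    return [(rng.standard_normal((n_out, n_in)) * 0.1, np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def generate_frames(style_dist, multimodal, layers, num_frames, frame_shape):
    # Concatenate style distribution data and multimodal parameters
    # into a single fused feature vector.
    h = np.concatenate([style_dist, multimodal])
    # Hidden fully connected layers, each followed by ReLU.
    for W, b in layers[:-1]:
        h = relu(W @ h + b)
    # Output layer, reshaped into a sequence of video frames.
    W_out, b_out = layers[-1]
    return (W_out @ h + b_out).reshape(num_frames, *frame_shape)

rng = np.random.default_rng(0)
# Hypothetical inputs: an 8-dim mean and 8-dim variance as distribution data,
# and a 32-dim vector standing in for the audio and image parameters.
style_dist = np.concatenate([np.zeros(8), np.ones(8)])
multimodal = rng.standard_normal(32)

num_frames, frame_shape = 4, (16, 16, 3)
out_dim = num_frames * int(np.prod(frame_shape))
layers = init_layers([style_dist.size + multimodal.size, 64, 64, out_dim], rng)

frames = generate_frames(style_dist, multimodal, layers, num_frames, frame_shape)
```

The result is an array with one entry per frame. A trained network of this kind would need learned weights; the sketch only shows how the data flows through the layers.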

[0059] In the video generation method provided in this application embodiment, by acquiring video style parameter distribution data conforming to a preset probability distribution and multimodal parameters used to describe video content, a video matching the video style type and video content can be generated based on the video style parameters under the guidance of the multimodal parameters. According to the preset probability distribution model, dynamically changing video style parameters can be sampled during the video generation process, solving the problem of severe homogenization of video content style and optimizing the style diversity of video content.

[0060] The following sections provide a detailed explanation of the specific implementation methods for each step of this video generation method.

[0061] In step S210, video style parameter distribution data that conforms to a preset probability distribution is obtained.

[0062] In one embodiment of this application, video style may include stylistic features related to the video production and editing process, such as the specific artistic style and emotional atmosphere conveyed by the creator through visual and auditory elements. Video style can be specifically reflected in various aspects such as the video's color, composition, editing rhythm, music selection, and use of special effects.

[0063] In one embodiment of this application, particularly in dialogue avatar videos, the video style can include the type of emotion expressed by a person during speech through language, facial expressions, and actions, such as anger, contempt, disgust, fear, happiness, sadness, surprise, neutrality, etc. The video style related to the person's emotional type can be characterized through multiple dimensions such as the person's facial expressions, head posture, and mouth movements.

[0064] Video style parameters are obtained by quantifying the qualitative description of video style. Specifically, this can include converting video style features that have multiple representation dimensions into feature vectors of a specified dimension.

[0065] In one embodiment of this application, the preset probability distribution model includes a Gaussian distribution model. The Gaussian distribution model, also known as the normal distribution model, is a very important continuous probability distribution model in statistics and probability theory. The Gaussian distribution model is defined by the mean and variance / standard deviation, where the mean determines the central location of the distribution, and the variance / standard deviation determines the width or dispersion of the distribution. The shape of the Gaussian distribution is a symmetrical bell-shaped curve, and its probability density function reaches its maximum value at the mean and gradually decreases as the distance from the mean increases.

[0066] Given a mean and variance, the Gaussian distribution is the probability distribution with maximum entropy. This means that with minimal known information, the Gaussian distribution has the highest uncertainty, consistent with the maximum entropy principle in information theory. This application utilizes a Gaussian distribution model to fit the distribution of video style parameters. On one hand, this obtains style transformation characteristics that conform to a normal distribution; on the other hand, it increases the uncertainty of style sampling, thereby optimizing the diversity of video styles.
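
As a minimal sketch of how style parameters might be drawn from such a Gaussian model, the example below samples vectors from a per-dimension mean and variance and also evaluates the standard differential entropy of a Gaussian, 0.5 · ln(2πeσ²). The three-dimensional mean and variance values are hypothetical, and the source does not state that sampling is implemented this way.

```python
import numpy as np

def sample_style_parameters(mean, variance, rng):
    """Draw one style parameter vector from N(mean, variance) per dimension."""
    noise = rng.standard_normal(mean.shape)
    return mean + np.sqrt(variance) * noise

def gaussian_entropy(variance):
    """Differential entropy (in nats) of a one-dimensional Gaussian."""
    return 0.5 * np.log(2.0 * np.pi * np.e * variance)

rng = np.random.default_rng(0)
mean = np.array([0.5, -1.0, 2.0])       # hypothetical distribution data
variance = np.array([0.04, 0.25, 1.0])  # hypothetical distribution data

# Each call yields a different style vector, which is the source of diversity.
samples = np.stack([sample_style_parameters(mean, variance, rng)
                    for _ in range(20000)])
entropies = gaussian_entropy(variance)
```

Dimensions with larger variance have higher entropy, so samples along those dimensions vary more from one draw to the next.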

[0067] In some alternative implementations, the preset probability distribution may also include the Boltzmann distribution. The Boltzmann distribution, also known as the Boltzmann-Gibbs distribution, describes the probability distribution of different energy states of a system in thermal equilibrium. The Boltzmann distribution shows that the lower-energy state is always more likely to be occupied. The ratio of the probabilities of two states is called the Boltzmann factor, which depends on the energy difference between the two states.
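
The properties of the Boltzmann distribution mentioned above can be checked numerically. The sketch below uses the standard form in which the probability of a state is proportional to exp(-E / kT); the energy values and the choice kT = 1 are hypothetical, and the source does not describe how energies would be assigned to video style parameters.

```python
import numpy as np

def boltzmann_probabilities(energies, kT=1.0):
    """Probability of each state: proportional to exp(-E / kT), kT > 0."""
    energies = np.asarray(energies, dtype=float)
    # Subtracting the minimum energy avoids overflow and leaves the
    # normalized probabilities unchanged.
    weights = np.exp(-(energies - energies.min()) / kT)
    return weights / weights.sum()

energies = [0.0, 1.0, 2.0]   # hypothetical state energies
probs = boltzmann_probabilities(energies)

# Boltzmann factor between the first two states: depends only on the
# energy difference, exp((E_1 - E_0) / kT).
boltzmann_factor = probs[0] / probs[1]
```

With a positive temperature, the lowest-energy state receives the highest probability, and the ratio between any two states depends only on their energy difference.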

[0068] Figure 3 shows a flowchart of obtaining video style parameters in one embodiment of this application. As shown in Figure 3, based on the above embodiment, the step S210 of obtaining video style parameter distribution data that conforms to a preset probability distribution may further include the following steps S211 to S213.

[0069] In step S211, a style reference video is obtained for extracting the video style type.

[0070] Style reference videos include at least one video of a certain length, the content of which may include, for example, a person speaking. By processing the style reference videos, representative video style types can be extracted.

[0071] In one embodiment of this application, a style reference video for extracting video style types can be obtained through a terminal device with a human-computer interaction interface.

[0072] For example, a content input control can be provided to the user on the terminal device's interactive interface. The user can upload style reference videos saved on the local terminal device through the content input control, or fill in the Uniform Resource Locator (URL) of the style reference videos saved on the remote server through the content input control, so that the corresponding style reference videos can be downloaded from the remote server through the URL.

[0073] For example, the user interface of a terminal device can provide a content selection control, allowing users to choose a style reference video with at least one video style type from a set of candidate videos. The video description information for each candidate video can provide at least one video tag matching its style type, facilitating user selection of a style reference video with a specified style type based on the recommended video tags. Additionally, the video description information for each candidate video can also provide at least one preview of its style type, allowing users to confirm the style type presentation effect of the candidate video by viewing the preview. The preview information can include a still image generated from a single video frame in the candidate video, or a dynamic image (e.g., a GIF) generated from multiple video frames in the candidate video, or a short video generated from a video clip in the candidate video.

[0074] In step S212, style modal features including at least one modal type are extracted from the style reference video.

[0075] A modality refers to a way of expressing or perceiving things; every source or form of information can be called a modality. For example, based on how humans perceive data, it can include various modalities such as touch, vision, hearing, and smell. Based on the medium through which data is transmitted, it can include various modalities such as text, images, audio, and video.

[0076] Style modal features extracted from style reference videos can have one or more modal types. The richer the modal types contained in the style modal features, the more accurate the style modal features will be in representing the style type of the video.

[0077] In one embodiment of this application, the style modal features extracted from the style reference video may include two types: image modality and audio modality. Figure 4 shows a flowchart of the style modal feature extraction process in one embodiment of this application. As shown in Figure 4, based on the above embodiment, step S212 may further include the following steps S2121 to S2123.

[0078] S2121: Perform image encoding processing on the style reference video based on a pre-trained image encoder to obtain image modal features.

[0079] Specifically, after receiving a sequence of video frames from a style reference video, the image encoder first normalizes the video frames, scaling the pixel values to a specific range. Then, it extracts features from the video frames using multiple convolutional layers, employing kernels of varying sizes and numbers to capture features at different scales within the video frames. Next, pooling layers downsample the feature maps output from the convolutional layers, reducing the amount of data while preserving important features. Finally, the output of the pooling layers is passed through a fully connected layer for feature mapping, yielding image modality features with a specified dimension.
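
The four stages described above (normalization, convolution with kernels of different sizes, pooling, and a fully connected mapping) can be sketched in a simplified form. The example assumes grayscale 16x16 frames, two parallel kernel banks (3x3 and 5x5) instead of a deep stack of layers, 2x2 max pooling, and random untrained weights; none of these specifics come from the source.

```python
import numpy as np

def conv2d_valid(image, kernels):
    """Slide each kernel over the image without padding.
    image: (H, W); kernels: (K, kh, kw) -> output: (K, H-kh+1, W-kw+1)."""
    K, kh, kw = kernels.shape
    H, W = image.shape
    out = np.empty((K, H - kh + 1, W - kw + 1))
    for i in range(H - kh + 1):
        for j in range(W - kw + 1):
            patch = image[i:i + kh, j:j + kw]
            out[:, i, j] = np.tensordot(kernels, patch, axes=([1, 2], [0, 1]))
    return out

def max_pool2x2(fmap):
    """Downsample each feature map by taking the maximum over 2x2 blocks."""
    K, H, W = fmap.shape
    H2, W2 = H // 2, W // 2
    trimmed = fmap[:, :H2 * 2, :W2 * 2]
    return trimmed.reshape(K, H2, 2, W2, 2).max(axis=(2, 4))

def encode_frame(frame, kernel_banks, W_fc, b_fc):
    x = frame.astype(float) / 255.0            # normalize pixels to [0, 1]
    pooled = [max_pool2x2(conv2d_valid(x, k)).ravel() for k in kernel_banks]
    flat = np.concatenate(pooled)              # features at both scales
    return W_fc @ flat + b_fc                  # map to the specified dimension

rng = np.random.default_rng(0)
frames = rng.integers(0, 256, size=(3, 16, 16), dtype=np.uint8)
kernel_banks = [rng.standard_normal((4, 3, 3)),   # 3x3 kernels
                rng.standard_normal((4, 5, 5))]   # 5x5 kernels
flat_dim = 4 * 7 * 7 + 4 * 6 * 6   # pooled sizes: 14x14 -> 7x7, 12x12 -> 6x6
W_fc = rng.standard_normal((8, flat_dim)) * 0.01
b_fc = np.zeros(8)

image_features = np.stack([encode_frame(f, kernel_banks, W_fc, b_fc)
                           for f in frames])
```

Each frame is mapped to an 8-dimensional feature vector here; the output dimension is set by the shape of the fully connected weight matrix.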

[0080] S2122: Perform audio encoding processing on the style reference video based on the pre-trained audio encoder to obtain audio modal features.

[0081] Specifically, the audio encoder first segments the audio frame sequence of the style reference video into multiple short frames. Then, windowing is applied to each frame to reduce spectral leakage. Next, a Fast Fourier Transform (FFT) is used to convert the time-domain audio frames into a frequency-domain representation, extracting the spectral features of the audio. Afterward, feature selection and dimensionality reduction operations are performed on the spectral features to remove redundant information. Finally, the processed features are mapped to a specified dimension through a fully connected layer to obtain the audio modal features.
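
The audio pipeline described above can be sketched as follows. The example assumes a 16 kHz mono signal, 400-sample frames with a 160-sample hop, a Hann window, and band averaging as a simple stand-in for the feature selection and dimensionality reduction step; the source does not specify these choices, and the projection weights are random and untrained.

```python
import numpy as np

def frame_signal(signal, frame_len, hop):
    """Split a 1-D signal into overlapping short frames."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop:i * hop + frame_len]
                     for i in range(n_frames)])

def encode_audio(signal, frame_len, hop, n_bands, W_fc, b_fc):
    frames = frame_signal(signal, frame_len, hop)
    windowed = frames * np.hanning(frame_len)          # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed, axis=1))   # time -> frequency domain
    # Dimensionality reduction (assumed): average magnitudes within bands.
    bands = np.array_split(spectrum, n_bands, axis=1)
    reduced = np.stack([band.mean(axis=1) for band in bands], axis=1)
    return reduced @ W_fc.T + b_fc                     # map to specified dimension

rng = np.random.default_rng(0)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2.0 * np.pi * 440.0 * t)   # one second of a 440 Hz test tone

frame_len, hop, n_bands, out_dim = 400, 160, 20, 8
W_fc = rng.standard_normal((out_dim, n_bands)) * 0.1
b_fc = np.zeros(out_dim)

audio_features = encode_audio(signal, frame_len, hop, n_bands, W_fc, b_fc)
```

The output contains one feature vector per short frame, so its first dimension follows from the signal length, frame length, and hop size.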

[0082] S2123: Perform feature fusion processing on image modal features and audio modal features to obtain style modal features that fuse image modality and audio modality.

[0083] Audio typically contains key information about the spoken content and is crucial for extracting the inherent style of a video. However, extracting audio information that complements the visual information contained in video frames is not easy. This application utilizes a specific encoder and feature fusion structure to process this complex data, considering both image and audio information to obtain a stronger style embedding. This application extracts two modal features corresponding to the image and audio modalities from a style reference video using a pre-trained encoder. Then, it fuses these two modal features through a weighted operation. This modal feature extraction method, which involves decomposition followed by fusion, can fully explore the inherent style of the video and improve the ability of style modal features to represent the video's style type.

[0084] Image modal features and audio modal features can be feature vectors with specified dimensions output by the encoder. Methods for feature fusion processing of image modal features and audio modal features can include, for example, adding the two feature vectors to obtain a fusion vector with the same dimension, or concatenating the two feature vectors to obtain a fusion vector with doubled dimensions.
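The two simple fusion options mentioned above can be illustrated with a short sketch. The feature values and the four-dimensional size are made up for illustration only.

```python
import numpy as np

# Hypothetical encoder outputs with the same specified dimension (4).
image_feat = np.array([0.2, -0.5, 1.0, 0.3])
audio_feat = np.array([0.1, 0.4, -1.0, 0.7])

# Option 1: element-wise addition keeps the dimension unchanged.
fused_sum = image_feat + audio_feat

# Option 2: concatenation doubles the dimension.
fused_cat = np.concatenate([image_feat, audio_feat])
```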

[0085] Figure 5 shows a block diagram of a model structure for extracting style modal features from a style reference video in one embodiment of this application.

[0086] As shown in Figure 5, the style reference video includes a sequence of video frames and an audio frame sequence arranged in chronological order.

[0087] By inputting the audio frame sequence into a pre-trained audio encoder, frame-level audio parameters can be extracted. By inputting the sequence of video frames into a pre-trained video encoder, a sequence of facial expression parameters can be extracted. Here, N denotes the number of frames.

[0088] For audio feature extraction, an audio encoder, such as the encoder structure in the Whisper-Tiny model, can be used. Whisper is an open-source end-to-end model for speech recognition developed by OpenAI. Its encoder structure encodes the input audio data into feature parameters in the hidden layer space. The Whisper-Tiny model is the smallest version in the Whisper series, characterized by its low parameter count (approximately 39 million parameters), small size, fast processing speed, and support for multiple languages.

[0089] For video feature extraction, a video encoder structure, such as that found in 3D Morphable Models (3DMMs), can be used. 3DMMs are end-to-end models suitable for 3D face reconstruction and analysis, capable of learning 3D face features from 2D images. The encoder structure in a 3DMM model uses principal component analysis to learn the distribution of face shape and texture, extracting shape coefficients, texture coefficients, and rendering parameters (such as camera position and target scale) representing face features.

[0090] To perform feature fusion between the image and audio modalities, the extracted feature parameters α and β are input into a two-branch Transformer, which transforms them into audio modal features and image modal features belonging to the same dimensional space.

[0091] After fusing the features of the two modalities using the cross attention module, style modality features that fuse the audio and image modalities are obtained.

[0092] Cross-attention is used for information interaction between different inputs, enabling the model to effectively align and focus on contexts from different sources, thus helping the model better capture the correlation between two inputs. In cross-attention, the model uses one input sequence as the query vector and the other input sequence as the key and value vectors, and then calculates the attention weights associated with the two input sequences.

[0093] Figure 6 shows a flowchart of multimodal feature fusion based on a cross-attention mechanism in one embodiment of this application. As shown in Figure 6, based on the above embodiment, step S2123 may further include the following steps S21231 to S21233.

[0094] S21231: Obtain the feature similarity between image modal features and audio modal features.

[0095] In this embodiment, one of the image modal features and the audio modal features can be used as the query vector Q, and the other feature can be used as the key vector K and the value vector V. The feature similarity between the two modal features can be obtained by calculating the dot product of the query vector Q and the key vector K.

[0096] S21232: Map the feature similarity according to the preset activation function to obtain attention weights that conform to the probability distribution.

[0097] After obtaining feature similarity, a function mapping can be performed on the feature similarity using a preset activation function to obtain attention weights that conform to the probability distribution.

[0098] The attention weights in this application can be calculated using the following formula.

$$\mathrm{AttentionWeights}(Q, K) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)$$

[0099] Here, $QK^{T}$ is the dot product of the query vector and the key vector, representing the similarity between the two sequences at different positions, and $d_k$ is the dimension of the key vector, which sets the scaling factor used to avoid excessively large values.

[0100] In this embodiment, the softmax function is used as the activation function to convert feature similarity into a probability distribution, representing the attention weight of the query vector to each key vector.

[0101] S21233: Perform weighted operations on image modal features or audio modal features based on attention weights.

[0102] If image modal features are used as query vectors and audio modal features are used as value vectors, attention weights can be used to weight the audio modal features to obtain style modal features that fuse image and audio modal features.

[0103] If the audio modal features are used as the query vector and the image modal features are used as the value vector, then attention weights can be used to weight the image modal features to obtain style modal features that fuse the image and audio modal features.

[0104] This application embodiment utilizes a cross-attention mechanism to fuse image modal features and audio modal features, which can flexibly capture the dependencies between the two modal features, comprehensively fuse the contextual information of the two modal features, and improve the accuracy of style modal features in expressing the inherent style of the video.
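Steps S21231 to S21233 can be sketched as a single scaled dot-product cross-attention call. This is a minimal sketch under stated assumptions: the learned query, key, and value projections of a full attention layer are omitted, the feature shapes are arbitrary, and the function names are illustrative rather than taken from this application.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, context_feats):
    """Fuse two modalities: one provides the queries, the other the keys and values."""
    d_k = context_feats.shape[-1]
    # S21231: feature similarity as a scaled dot product of queries and keys.
    scores = query_feats @ context_feats.T / np.sqrt(d_k)
    # S21232: softmax turns similarities into weights that sum to 1 per query.
    weights = softmax(scores, axis=-1)
    # S21233: weight the value vectors by the attention weights.
    fused = weights @ context_feats
    return fused, weights

rng = np.random.default_rng(0)
image_feats = rng.standard_normal((8, 16))   # 8 frames, 16-dim image features
audio_feats = rng.standard_normal((8, 16))   # 8 frames, 16-dim audio features

# Image features act as queries; audio features act as keys and values.
style_feats, attn = cross_attention(image_feats, audio_feats)
```

Swapping the two arguments gives the other case described above, in which the audio modal features act as queries and the image modal features are weighted.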

[0105] Referring to Figure 3, in step S213, the style modality features are mapped according to a preset activation function to obtain video style parameter distribution data that conforms to a preset probability distribution.

[0106] In one embodiment of this application, the preset probability distribution model may include a Gaussian distribution model. The method for mapping style modality features to a Gaussian distribution model may include: mapping the style modality features according to a preset activation function to obtain the mean and variance that conform to the probability distribution; and performing a weighted operation on a preset multivariate normal distribution according to the mean and variance to obtain video style parameters that conform to the Gaussian distribution.

[0107] Figure 7 illustrates a schematic diagram of the process of mapping style modality features to a Gaussian distribution model in one embodiment of this application.

[0108] As shown in Figure 7, the embodiments of this application can feed the style modal features into a pre-trained activation layer, where a preset activation function, softmax, is used to process the style modality features. After mapping, the mean $\mu_s$ and variance $\sigma_s$ that conform to the probability distribution are obtained. The specific mapping formula is as follows.

[0109] Here, $W_s$ represents the activation parameters learned through pre-training, and $d$ represents the dimension of the input vector.

[0110] After obtaining the mean and variance, the preset multivariate normal distribution can be weighted according to the following formula to obtain the video style parameters $s$ that conform to a Gaussian distribution: $s = \mu_s + \sigma_s \cdot \epsilon$

[0111] Here, $\epsilon \sim \mathcal{N}(0, I)$ indicates that the random variable $\epsilon$ follows a multivariate normal distribution with a mean of 0 and a covariance matrix equal to the identity matrix $I$. The identity matrix $I$ is a square matrix with diagonal elements of 1 and all other elements of 0. In a multivariate normal distribution, a covariance matrix equal to the identity matrix $I$ indicates that the random variables are independent of each other and that the variance of each variable is 1.

[0112] Since a speaker's emotions change as the video frames progress, it is insufficient to represent the inherent style using a deterministic feature. This application's embodiments train video style parameters that conform to a Gaussian distribution. When it is necessary to incorporate video style into the video to be generated, a better sequential embedding can be obtained through parameter sampling. This helps to model the style prior as a more representative Gaussian distribution.
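The sampling step s = μ_s + σ_s · ε can be sketched as follows. The numeric values of the mean and scale are invented for illustration, and σ_s is treated as a per-dimension scale factor, exactly as it is used in the formula.

```python
import numpy as np

def sample_style(mu_s, sigma_s, rng):
    """Draw one style parameter vector: s = mu_s + sigma_s * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(mu_s.shape)
    return mu_s + sigma_s * eps

rng = np.random.default_rng(0)
mu_s = np.array([0.5, -1.0, 2.0])      # hypothetical mean from the activation layer
sigma_s = np.array([0.1, 0.2, 0.3])    # hypothetical scale from the activation layer

# Each draw gives a different style vector from the same learned distribution.
samples = np.stack([sample_style(mu_s, sigma_s, rng) for _ in range(20000)])
```

Because the randomness is isolated in ε, repeated sampling yields varied style parameters whose empirical mean and spread approach μ_s and σ_s.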

[0113] The encoder and cross-attention module shown in Figure 5, along with the activation layer shown in Figure 7, together constitute a style extractor for extracting the intrinsic style of a video. To optimize the extraction of the intrinsic style of the video, this style extractor can be trained to iteratively update its model parameters.

[0114] In one embodiment of this application, the model parameters of the video encoder and audio encoder are fixed, and the activation layer shown in Figure 7 is trained using pre-collected data samples to iteratively update the corresponding activation parameters $W_s$.

[0115] Figure 8 shows a flowchart of training a style mapping activation layer in one embodiment of this application. As shown in Figure 8, the method for training the style mapping activation layer includes the following steps S810 to S840.

[0116] S810: Obtain pre-collected style modality feature samples.

[0117] S820: Form positive sample pairs from style modal feature samples with the same video style type, and form negative sample pairs from style modal feature samples with different video style types.

[0118] Sample pairs are obtained by combining style modality feature samples according to video style type, and are divided into positive sample pairs and negative sample pairs. Positive sample pairs consist of style modality feature samples with the same video style type, while negative sample pairs consist of style modality feature samples with different video style types. By processing and analyzing sample pairs, similarity loss error can be calculated, and then the activation parameters of the activation function can be iteratively updated to optimize the extraction and generation of video styles.

[0119] S830: Map positive and negative sample pairs according to the preset activation function to obtain the similarity loss error of the predicted values of the sample pairs. The similarity loss error is negatively correlated with the similarity of the predicted values of the positive sample pairs and positively correlated with the similarity of the predicted values of the negative sample pairs.

[0120] S840: Iteratively update the activation parameters of the activation function based on the similarity loss error.

[0121] Essentially, feature samples with similar inherent styles should cluster together in the style space. Therefore, embodiments of this application use contrastive learning of style priors to construct positive sample pairs $(s, s_p)$ with the same identity and emotion, and negative sample pairs $(s, s_n)$ with different identities or emotions. The similarity of the positive sample pairs is then increased while the similarity of the negative sample pairs is reduced.

[0122] The embodiments of this application can calculate the similarity loss error between positive and negative sample pairs according to the following loss function.

[0123] Here, $\tau$ represents a temperature parameter, $S_n$ represents all negative samples corresponding to the video style parameter $s$, and $\zeta$ determines the similarity of a sample pair from the reciprocal of the distance between its two samples. In addition, a fixed constant is added to the similarity to stabilize its numerical range, making the training process more stable.
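A loss with the properties described above can be sketched as follows. This is not the exact loss function of this application. It is an InfoNCE-style assumption that uses a temperature τ and a reciprocal-distance similarity ζ with an added constant. Where the constant is placed, the small `eps` guard against division by zero, and the inclusion of the positive pair in the denominator are all illustrative choices.

```python
import numpy as np

def zeta(a, b, offset=1.0, eps=1e-3):
    """Assumed similarity: reciprocal of the Euclidean distance plus a fixed constant."""
    return 1.0 / (np.linalg.norm(a - b) + eps) + offset

def contrastive_loss(s, s_pos, s_negs, tau=0.1):
    """InfoNCE-style loss: falls as the positive pair gets more similar,
    rises as any negative pair gets more similar."""
    logits = np.array([zeta(s, s_pos)] + [zeta(s, n) for n in s_negs]) / tau
    # log-sum-exp with max subtraction for numerical stability
    m = logits.max()
    log_denominator = m + np.log(np.exp(logits - m).sum())
    return float(log_denominator - logits[0])

anchor = np.array([0.0, 0.0])
near = np.array([0.1, 0.0])
far_a = np.array([3.0, 0.0])
far_b = np.array([0.0, 4.0])

# Positive sample close to the anchor, negatives far away: small loss.
good = contrastive_loss(anchor, near, [far_a, far_b])
# Positive sample far from the anchor, one negative close: large loss.
bad = contrastive_loss(anchor, far_a, [near, far_b])
```

The two evaluations show the stated correlations: the loss is small when the positive pair is close and the negatives are distant, and large in the opposite arrangement.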

[0124] During the training of the intrinsic style extractor, all parameters of this lightweight model can be trained directly.

[0125] In one embodiment of this application, the method for obtaining pre-collected style modality feature samples may include: obtaining image modality feature samples and audio modality feature samples from a pre-collected sample dataset; setting the feature value of one of the image modality feature samples and audio modality feature samples to zero according to a preset probability; and performing a weighted operation on the image modality feature samples and audio modality feature samples to obtain style modality feature samples that fuse image modality and / or audio modality.

[0126] Before performing feature fusion on image modal feature samples and audio modal feature samples, this application embodiment can utilize the technique of random feature dropout to set the feature value of one of the image modal feature samples and audio modal feature samples to zero according to a preset probability, thereby improving the ability of the intrinsic style extractor to extract style priors through a single modality.
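Random feature dropout before fusion can be sketched as follows. The dropout probability, the even split between the two modalities, and the function name are assumptions for illustration. At most one modality is zeroed per sample, so the two are never zero at the same time.

```python
import numpy as np

def modality_dropout(image_feat, audio_feat, drop_prob, rng):
    """With probability drop_prob, zero exactly one of the two modality features."""
    image_feat = image_feat.copy()
    audio_feat = audio_feat.copy()
    if rng.random() < drop_prob:
        # Choose which single modality to drop (assumed to be an even split).
        if rng.random() < 0.5:
            image_feat[:] = 0.0
        else:
            audio_feat[:] = 0.0
    return image_feat, audio_feat

rng = np.random.default_rng(0)
image_sample = np.ones(16)
audio_sample = np.ones(16)
results = [modality_dropout(image_sample, audio_sample, 0.5, rng) for _ in range(1000)]
```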

[0127] In one embodiment of this application, the data samples used to train the intrinsic style extractor may include the Multi-view Emotional Audio-visual Dataset (MEAD), the High Definition Talking Face (HDTF) dataset, and other video data samples collected from the Internet. Facial regions in these video data samples are cropped and resized to a resolution of 512x512, and the total training dataset comprises approximately 300 hours of video.

[0128] The MEAD dataset is a large-scale, high-quality emotional audiovisual dataset designed specifically for generating emotional talking faces. It contains video clips of 60 participants speaking with eight different emotions at three different intensity levels (except for the neutral emotion). The HDTF dataset is a high-resolution audiovisual dataset built for one-shot talking face generation, containing video samples collected from video websites and cropped to obtain facial regions.

[0129] During model training, embodiments of this application treat samples with the same identity and emotion in MEAD as positive samples, and segments from the same video in HDTF as positive samples. Furthermore, embodiments of this application randomly discard either the image feature samples or the audio feature samples, but never set both to zero at the same time.

[0130] Referring again to Figure 2, in step S220, multimodal parameters for describing the video content of the preset video are obtained. The multimodal parameters include audio parameters and image parameters.

[0131] In one embodiment of this application, the method for obtaining audio parameters may include: obtaining a style reference video for extracting video style type; extracting an audio feature sequence from the style reference video; and encoding the audio feature sequence according to a pre-trained audio encoder to obtain audio parameters.

[0132] This application embodiment can utilize the intrinsic style extractor shown in Figure 5 to extract audio feature sequences from a style reference video, and then encode the audio feature sequences using the audio encoder therein to obtain audio parameters.

[0133] In some alternative implementations, audio parameters can also be obtained by capturing the user's voice. For example, the user can capture audio data containing speech content in real time using an audio input device such as a microphone, and then encode the audio data according to an audio encoder to obtain the corresponding audio parameters.

[0134] In one embodiment of this application, the method for obtaining image parameters may include: obtaining a reference image for describing the visual modal content of a video; performing encoding and decoding processing on the reference image according to a pre-trained image reference model to obtain image parameters; the image reference model includes an encoder and a decoder connected in sequence, and the image parameters include image encoding parameters output by the encoder and image decoding parameters output by the decoder.

[0135] This application embodiment utilizes the self-encoding process of the reference image to extract the corresponding image encoding parameters and image decoding parameters. Simultaneously, using the image encoding parameters and image decoding parameters as reference conditions can improve the guidance capability for video content generation, enabling the guided video frames to inherit the content characteristics of the reference image to the greatest extent.

[0136] In one embodiment of this application, a reference image for describing the visual modal content of a video can be obtained through a terminal device with a human-computer interaction interface.

[0137] For example, an image input control can be provided to the user on the interactive interface of the terminal device. The user can upload a reference image saved on the local terminal device through the image input control, or enter the Uniform Resource Locator (URL) of a reference image saved on a remote server through the image input control, so that the corresponding reference image can be downloaded from the remote server through the URL.

[0138] For example, the user interface of a terminal device can provide an image selection control, allowing the user to choose a reference image from a set of candidate images. The image description information for each candidate image can include at least one image tag that matches its characteristics, facilitating the user's selection of a reference image with specific features based on the image tag's recommendations. The image tag could, for example, include information such as the gender, age, and skin condition of the people depicted in the image.

[0139] Referring to Figure 2, in step S230, based on the video style parameter distribution data, the multimodal parameters are encoded and decoded to obtain a video frame sequence that includes video content and matches the video style indicated by the video style parameters.

[0140] Figure 9 shows a flowchart of the encoding and decoding process for input parameters in one embodiment of this application. As shown in Figure 9, based on the above embodiment, step S230 may further include the following steps S231 to S233.

[0141] S231: Sample images from a noise model that conforms to a probability distribution to obtain a noisy image sequence.

[0142] S232: Input video style parameters, multimodal parameters, and noisy image sequences into a pre-trained denoising model. The denoising model includes an encoder for mapping image data to latent space features and a decoder for restoring latent space features to image data.

[0143] Latent space features are the features represented in the latent space after image data is mapped by the encoder. In the video generation model of this application, the encoder of the denoising model converts image data into latent space features, which contain key information and feature representations of the image. The decoder then restores the latent space features back to image data. The use of latent space features helps to maintain visual quality while reducing computational costs, and through the manipulation and processing of latent space features during the diffusion and reverse denoising processes, high-quality video frame sequences can be restored from noise.

[0144] S233: In the denoising model, the video style parameters, multimodal parameters, and noisy image sequences are encoded and decoded to obtain a video frame sequence with noise removed.

[0145] In this embodiment, video style parameters and multimodal parameters are used as reference conditions to guide the denoising model to denoise the sampled noisy image sequence, thereby generating a video frame sequence that conforms to the characteristics of the video style and content. The video style parameters control the video style type, the image parameters in the multimodal parameters control the background and facial features of the characters, and the audio parameters control the movement of the characters' lips. Guided by multiple dynamic reference conditions, video content containing various emotions and facial expressions can be generated, significantly improving the diversity of video styles.

[0146] In one embodiment of this application, before inputting the noisy image sequence into the denoising model, key point features can be added to the noisy image sequence to introduce head motion posture as a reference condition.

[0147] Methods for adding keypoint features to a noisy image sequence may include: acquiring a keypoint image sequence, the keypoint images including keypoints used to describe head motion posture; encoding the keypoint image sequence to obtain a keypoint feature sequence; and fusing the keypoint feature sequence with the noisy image sequence to obtain a noisy image sequence carrying head motion posture information.

[0148] Keypoint feature sequences are feature sequences obtained by encoding keypoint image sequences. Keypoint images contain key points that describe head movement and posture. These keypoint images are arranged into a sequence according to the temporal order of video frames and then encoded. The resulting keypoint feature sequence carries head movement and posture information. In video generation, fusing these features with a sequence of noisy images introduces head movement and posture as a reference condition, further enriching the stylistic expression of the video and making the head posture of the person in the generated video more consistent with expectations.

[0149] The specific acquisition method is to use a key point detection algorithm (such as a deep learning-based facial key point detection model) to process the style reference video or other videos containing head movements, detect facial feature points located in the area above the mouth, map these key points onto a black background to synthesize key point images, and assemble a key point image sequence according to the time order of the video frames.

[0150] In one embodiment of this application, the key points include facial feature points located in the area above the mouth. For example, they may include: two key points located on the left side of the face for locating the left edge, two key points located on the right side of the face for locating the right edge, two key points corresponding to the pupils of both eyes, and two key points corresponding to the top and bottom of the bridge of the nose. The key point image is an image synthesized by mapping the above multiple facial feature points onto a black background.
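Synthesizing a key point image by mapping facial feature points onto a black background can be sketched as follows. The image size, the coordinates of the eight points, and the square marker drawn for each point are hypothetical. In practice the coordinates would come from a facial key point detection model.

```python
import numpy as np

def render_keypoint_image(keypoints, height=64, width=64, radius=1):
    """Draw each (x, y) key point as a small white square on a black canvas."""
    canvas = np.zeros((height, width), dtype=np.uint8)   # black background
    for x, y in keypoints:
        col, row = int(round(x)), int(round(y))
        r0, r1 = max(row - radius, 0), min(row + radius + 1, height)
        c0, c1 = max(col - radius, 0), min(col + radius + 1, width)
        canvas[r0:r1, c0:c1] = 255
    return canvas

# Eight hypothetical points above the mouth, given as (x, y):
keypoints = [
    (8, 20), (8, 36),      # two points locating the left edge of the face
    (56, 20), (56, 36),    # two points locating the right edge of the face
    (24, 24), (40, 24),    # pupils of both eyes
    (32, 22), (32, 34),    # top and bottom of the bridge of the nose
]

image = render_keypoint_image(keypoints)

# A key point image sequence stacks one image per video frame in time order.
sequence = np.stack([render_keypoint_image(keypoints) for _ in range(3)])
```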

[0151] This application uses keypoint image sequences as reference conditions, allowing control of a person's head posture in a video based on keypoints describing head movement. Combined with other reference conditions, head movement and lip movement can work together to further enhance the diversity of video styles.

[0152] The denoising model in this application embodiment can be implemented based on diffusion models. Diffusion models are a type of generative model based on a probability diffusion process. The core idea is to gradually add noise through a forward diffusion process to transform the data into a noise distribution, and then learn the inverse generation process to restore high-quality data samples from the noise.

[0153] The basic idea of diffusion models originates from the diffusion process in physics, describing the movement of particles from regions of high concentration to regions of low concentration in a medium. In machine learning, diffusion models gradually transform data into a noise distribution by introducing random noise, and then gradually restore the data from the noise through a reverse process. Specifically, diffusion models consist of two main processes: a forward process and a reverse process, where the forward process is also called the diffusion process. The forward process gradually adds noise to the original data to create a set of pure noise, while the reverse process restores the input from the set of random noise.

[0154] The main types of diffusion models include Denoising Diffusion Probabilistic Models (DDPMs), which are the most representative diffusion models. They achieve data generation through a progressive denoising method. The main idea of DDPMs is to add Gaussian noise in the forward process so that the data gradually approaches a standard normal distribution, and then to gradually denoise and restore the data by learning the reverse process.
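The forward process of a DDPM can be sketched in closed form. This is a generic illustration rather than the configuration of this application: the linear noise schedule, the 1000 steps, and the function names are assumptions.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for an assumed linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample x_t directly from x_0: x_t = sqrt(ab_t) * x_0 + sqrt(1 - ab_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = np.full(10000, 5.0)                       # toy "data": a constant signal

x_early, _ = forward_diffuse(x0, 0, alpha_bar, rng)                    # barely noised
x_T, _ = forward_diffuse(x0, len(alpha_bar) - 1, alpha_bar, rng)       # nearly pure noise
```

At the last step almost none of the original signal remains, so the samples are close to a standard normal distribution, which is the starting point of the reverse process.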

[0155] The advantage of diffusion models lies in their ability to avoid mode collapse (a common problem in other generative models such as GANs) and to produce high-quality, diverse samples. Furthermore, the training process for diffusion models is relatively simple, requiring no complex adversarial training, and they possess more powerful data generation capabilities than variational autoencoders (VAEs).

[0156] Figure 10 shows a structural block diagram of a video generation model based on a diffusion model in one embodiment of this application.

[0157] As shown in Figure 10, the video generation model may include a HEAD-Kps Guider, a keypoint guidance model for incorporating head motion pose information; a Reference UNet, an image reference model for incorporating image parameters; and a Denoising UNet, a denoising model for performing image denoising. It may also include a VAE Encoder at the input for performing embedding operations on the reference image, and a VAE Decoder at the output for restoring feature vectors from the hidden layer space to the image space.

[0158] In one embodiment of this application, the denoising model includes a self-attention layer for encoding and decoding noisy image sequences, a reference image attention layer for encoding and decoding image parameters, an audio attention layer for encoding and decoding audio parameters, a style attention layer for encoding and decoding video style parameters, and a temporal module for establishing a time series of video frames.

[0159] In one embodiment of this application, a Latent Diffusion Model (LDM) can be used to generate video frames. The LDM performs the diffusion and denoising processes in the latent space of a Variational Autoencoder (VAE). The LDM maps the input image $x$ to the latent space, encoding the image as $z = E(x)$, which helps maintain visual quality while reducing computational cost. During the diffusion process, Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$ is gradually introduced into the latent vector $z$, and after $T$ diffusion steps, it is degraded into complete noise $z_T \sim \mathcal{N}(0, I)$. In the reverse denoising process, the target latent vector $z$ is iteratively recovered from the sampled Gaussian noise using the diffusion model, and then decoded by the VAE decoder into the output image $x = D(z)$.

[0160] The denoising iteration step count refers to the number of steps required to iteratively denoise the sampled Gaussian noise to reconstruct the target latent vector during the reverse denoising process of the diffusion model. In this application, the forward process degrades the latent vector from its initial state into complete noise over $T$ steps. The reverse process then gradually denoises toward the target latent vector, which is finally decoded into the output image by the decoder. The step count is an important parameter that controls the denoising process and affects the quality of the generated video.

[0161] In one embodiment of this application, before encoding and decoding the multimodal parameters based on video style parameter distribution data, the video generation model can be trained using training samples to iteratively update the model parameters. The method for training the video generation model may include: obtaining the denoising loss error obtained by encoding and decoding the training samples using the denoising model; iteratively updating the model parameters of the denoising model based on the denoising loss error, wherein the update range of the model parameters is negatively correlated with the number of iterations.

[0162] During model training, given the latent features $z_0 = E(x_0)$ and the reference condition $c$, the loss function used to calculate the denoising loss error can be expressed as the following formula.

$$L = \mathbb{E}_{z_t, c, \epsilon \sim \mathcal{N}(0, I), t}\left[\left\| \epsilon - \epsilon_\theta(z_t, c, t) \right\|_2^2\right]$$

[0163] Here, $z_t$ represents the latent noise features at iteration $t$, and $\epsilon_\theta$ is the noise predicted by the UNet model, which has been modified using an attention mechanism and has parameters $\theta$. This model employs a cross-attention mechanism to fuse the reference condition $c$ with the latent feature $z_t$, and this fusion guides image generation.
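A noise-prediction loss of this kind can be sketched as follows. This is a toy illustration under stated assumptions. The linear noise schedule, the mean-squared form of the error, and both stand-in predictors are hypothetical, and no UNet or cross-attention is implemented. The reference condition is passed through unused, only to show where it would enter.

```python
import numpy as np

def denoising_loss(z0, cond, t, alpha_bar, predict_noise, rng):
    """Noise z0 to step t, ask the model for the noise, and score the prediction."""
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    eps_pred = predict_noise(z_t, cond, t)
    return float(np.mean((eps - eps_pred) ** 2))

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
z0 = rng.standard_normal((4, 32, 32))     # hypothetical latent features E(x0)
cond = None                               # placeholder for the reference condition c
t = 500

def zero_predictor(z_t, cond, t):
    """Untrained stand-in that always predicts zero noise."""
    return np.zeros_like(z_t)

def oracle_predictor(z_t, cond, t):
    """Ideal stand-in that inverts the noising step using the known z0."""
    return (z_t - np.sqrt(alpha_bar[t]) * z0) / np.sqrt(1.0 - alpha_bar[t])

loss_untrained = denoising_loss(z0, cond, t, alpha_bar, zero_predictor, rng)
loss_ideal = denoising_loss(z0, cond, t, alpha_bar, oracle_predictor, rng)
```

The untrained stand-in scores a loss near 1, the variance of the injected noise, while the ideal stand-in scores near 0, which is the direction training pushes the model.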

[0164] In one embodiment of this application, the method for iteratively updating the model parameters of the denoising model based on the denoising loss error may further include: iteratively updating the model parameters of the denoising model, the keypoint guidance model, and the image reference model based on the denoising loss error until a preset first iteration termination condition is met; the keypoint guidance model is used to input head motion posture information into the denoising model, and the image reference model is used to input image parameters into the denoising model; fixing the model parameters of the keypoint guidance model and the image reference model, and iteratively updating the model parameters of the denoising model based on the denoising loss error until a preset second iteration termination condition is met; fixing the model parameters of the self-attention layer, the reference image attention layer, the audio attention layer, and the temporal module, and iteratively updating the model parameters of the style attention layer based on the denoising loss error until a preset third iteration termination condition is met.

[0165] The preset first iteration termination condition can be: the denoising loss error is less than a preset first threshold, or the number of iterations reaches a preset maximum number of iterations. The preset second iteration termination condition can be: the rate of change of the denoising loss error is less than a preset second threshold, meaning the difference in denoising loss error across multiple consecutive iterations is less than this threshold, or the number of iterations reaches a preset maximum number of iterations. The preset third iteration termination condition can be: the denoising loss error converges to a stable value, meaning the fluctuation range of the denoising loss error across multiple consecutive iterations is less than a preset third threshold, or the number of iterations reaches a preset maximum number of iterations.
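The three kinds of termination condition can be sketched as simple checks over the history of loss values. The window length, thresholds, and function names are illustrative assumptions.

```python
def below_threshold(losses, threshold):
    """First condition: the latest loss is less than a preset threshold."""
    return losses[-1] < threshold

def rate_of_change_small(losses, threshold, window=3):
    """Second condition: every consecutive difference in the recent window is small."""
    recent = losses[-(window + 1):]
    if len(recent) < window + 1:
        return False
    return all(abs(b - a) < threshold for a, b in zip(recent, recent[1:]))

def converged(losses, threshold, window=3):
    """Third condition: the fluctuation range over the recent window is small."""
    recent = losses[-window:]
    if len(recent) < window:
        return False
    return max(recent) - min(recent) < threshold

def should_stop(losses, condition, max_iters):
    """Stop when the condition holds or the maximum number of iterations is reached."""
    return len(losses) >= max_iters or condition(losses)

losses = [1.0, 0.5, 0.3, 0.29, 0.285, 0.284]   # hypothetical loss history
stop_stage_1 = should_stop(losses, lambda h: below_threshold(h, 0.3), max_iters=100)
stop_stage_2 = should_stop(losses, lambda h: rate_of_change_small(h, 0.02), max_iters=100)
stop_stage_3 = should_stop(losses, lambda h: converged(h, 0.01), max_iters=100)
```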

[0166] In the embodiments of this application, the training of the video generation model adopts a three-stage progressive training method to gradually improve the model's generation capability and stability. In each stage, the loss function in the above embodiments is used to calculate the denoising loss error.

[0167] First, the embodiments of this application can train a model to generate single-frame images, wherein the denoising model Denoising UNet, the image reference model Reference UNet, and the head keypoint guidance model HEAD-Kps Guider jointly participate in the training. In this stage, the denoising model Denoising UNet takes a single frame as input, the image reference model Reference UNet processes different frames randomly selected from the same video clip, and the head keypoint guidance model HEAD-Kps Guider incorporates the encoded head keypoint features into the latent space.

[0168] Second, the embodiments of this application can train the model to generate consecutive multi-frame images; the modules trained in this stage include the temporal module and the audio attention layer in the denoising model. In this stage, f consecutive frames are sampled from the video clip, and the model parameters of the image reference model Reference UNet and the head keypoint guidance model HEAD-Kps Guider are frozen.

[0169] The final stage is the transfer of intrinsic style. In this stage, all other modules of the model are frozen, and only the style attention layer is trained. This allows the model to generate corresponding facial expressions and details based on the intrinsic style input during portrait image generation.
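The freezing schedule of the three stages can be sketched as a lookup from stage number to trainable parameter prefixes. The module and parameter names below are hypothetical. Treating the whole denoising model as trainable in stages 1 and 2 follows the wording of the method embodiment above, and is an assumption about sub-layers that this application does not name for those stages.

```python
# Hypothetical parameter-name prefixes; only the grouping follows the description.
STAGE_TRAINABLE = {
    1: {"denoising_unet", "reference_unet", "head_kps_guider"},  # joint single-frame training
    2: {"denoising_unet"},                                       # reference model and guider frozen
    3: {"denoising_unet.style_attention"},                       # only the style attention layer
}


def is_trainable(param_name: str, stage: int) -> bool:
    """Return True if the named parameter is updated in the given training stage."""
    return any(
        param_name == prefix or param_name.startswith(prefix + ".")
        for prefix in STAGE_TRAINABLE[stage]
    )
```

In a deep learning framework, the result of `is_trainable` would typically decide whether gradients are enabled for each parameter before the optimizer for that stage is built.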

[0170] During the multi-frame training phase, the number of consecutive frames f is set to 8. In style projection training, to enhance generalization ability, different emotions can be applied to the same identity on the MEAD dataset (e.g., generating sad video clips from happy reference images).

[0171] For the splitting of the training and test sets, 10 of the 46 identities in MEAD can be selected for testing. For HDTF, 25 videos can be randomly selected for testing. This application embodiment ensures that no character identity appears in both the training set and the test set. A diffusion sampler is a tool used during diffusion-based inference to draw samples from the noise distribution and carry out the reverse denoising process, thereby generating data samples that meet the requirements. During inference, to ensure a fair comparison, this application embodiment uses EulerDiscreteScheduler as the diffusion sampler and sets the number of denoising steps for all diffusion-based methods to 25.
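An identity-disjoint split of the kind described for MEAD can be sketched as follows. The clip representation as `(identity, clip_path)` pairs, the function name, and the fixed random seed are assumptions made for illustration.

```python
import random
from typing import List, Tuple

Clip = Tuple[str, str]  # (identity, clip_path); a hypothetical record layout


def split_by_identity(clips: List[Clip], num_test_identities: int,
                      seed: int = 0) -> Tuple[List[Clip], List[Clip]]:
    """Hold out whole identities so that no identity appears in both sets."""
    identities = sorted({identity for identity, _ in clips})
    rng = random.Random(seed)
    test_ids = set(rng.sample(identities, num_test_identities))
    train = [clip for clip in clips if clip[0] not in test_ids]
    test = [clip for clip in clips if clip[0] in test_ids]
    return train, test
```

Splitting at the identity level, instead of the clip level, is what prevents a speaker seen in training from leaking into the test set.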

[0172] It should be noted that although the steps of the method in this application are described in a specific order in the accompanying drawings, this does not require or imply that the steps must be performed in that specific order, or that all the steps shown must be performed to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one step, and / or one step may be broken down into multiple steps.

[0173] The following describes an embodiment of the apparatus of this application, which can be used to execute the video generation method described in the above embodiments of this application. Figure 11 schematically shows a structural block diagram of the video generation apparatus provided in an embodiment of this application. As shown in Figure 11, the video generation apparatus 1100 includes:

[0174] The first acquisition module 1110 is configured to acquire video style parameter distribution data that conforms to a preset probability distribution. The video style parameter distribution data is used to describe the appearance of the preset probability distribution and to determine the video style parameters. The appearance of the preset probability distribution describes the distribution of the video style parameters in a preset dimension.

[0175] The second acquisition module 1120 is configured to acquire multimodal parameters used to describe the video content of a preset video, the multimodal parameters including audio parameters and image parameters;

[0176] The encoding / decoding module 1130 is configured to encode and decode multimodal parameters based on video style parameter distribution data to obtain a video frame sequence that includes video content and matches the video style indicated by the video style parameters.

[0177] In some embodiments of this application, based on the above technical solutions, the preset probability distribution model includes a Gaussian distribution model.

[0178] In some embodiments of this application, based on the above technical solutions, the first acquisition module 1110 includes:

[0179] The video acquisition module is configured to acquire style reference videos for extracting video style types.

[0180] The feature extraction module is configured to extract style modality features, including at least one modality type, from the style reference video;

[0181] The feature mapping module is configured to map style modality features according to a preset activation function to obtain video style parameter distribution data that conforms to a preset probability distribution.

[0182] In some embodiments of this application, based on the above technical solutions, the feature extraction module includes:

[0183] The image encoding module is configured to perform image encoding processing on the style reference video according to a pre-trained image encoder to obtain image modal features;

[0184] The audio encoding module is configured to perform audio encoding on the style reference video according to a pre-trained audio encoder to obtain audio modal features;

[0185] The feature weighting module is configured to perform feature fusion processing on image modal features and audio modal features to obtain style modal features that fuse image modal features and audio modal features.

[0186] In some embodiments of this application, based on the above technical solutions, the feature weighting module includes:

[0187] The similarity acquisition module is configured to acquire the feature similarity between image modal features and audio modal features;

[0188] The similarity activation module is configured to map feature similarity according to a preset activation function to obtain attention weights that conform to a probability distribution.

[0189] The attention weighting module is configured to perform weighted operations on image modal features or audio modal features based on attention weights.

[0190] In some embodiments of this application, based on the above technical solutions, the preset probability distribution model includes a Gaussian distribution model; the similarity activation module is further configured to: map the style modality features according to the preset activation function to obtain the mean and variance that conform to the probability distribution; and perform a weighted operation on the preset multivariate normal distribution according to the mean and variance to obtain video style parameters that conform to the Gaussian distribution.

[0191] In some embodiments of this application, based on the above technical solutions, the similarity activation module includes:

[0192] The sample acquisition module is configured to acquire pre-collected style modality feature samples;

[0193] The sample pairing module is configured to pair style modality feature samples with the same video style type into positive sample pairs and pair style modality feature samples with different video style types into negative sample pairs.

[0194] The sample mapping module is configured to map positive sample pairs and negative sample pairs according to a preset activation function to obtain the similarity loss error of the predicted values of the sample pairs. The similarity loss error is negatively correlated with the similarity of the predicted values of the positive sample pairs and positively correlated with the similarity of the predicted values of the negative sample pairs.

[0195] The function update module is configured to iteratively update the activation parameters of the activation function based on the similarity loss error.

[0196] In some embodiments of this application, based on the above technical solutions, the sample acquisition module is further configured to: acquire image modal feature samples and audio modal feature samples from a pre-collected sample dataset; set the feature value of one of the image modal feature samples and audio modal feature samples to zero according to a preset probability; and perform feature fusion processing on the image modal feature samples and audio modal feature samples to obtain style modal feature samples that fuse image modality and / or audio modality.

[0197] In some embodiments of this application, based on the above technical solutions, the method for obtaining audio parameters includes: obtaining a style reference video for extracting video style types; extracting an audio feature sequence from the style reference video; and encoding the audio feature sequence according to a pre-trained audio encoder to obtain audio parameters.

[0198] In some embodiments of this application, based on the above technical solutions, the method for obtaining image parameters includes: obtaining a reference image for describing the visual modal content of a video; performing encoding and decoding processing on the reference image according to a pre-trained image reference model to obtain image parameters; the image reference model includes an encoder and a decoder connected in sequence, and the image parameters include image encoding parameters output by the encoder and image decoding parameters output by the decoder.

[0199] In some embodiments of this application, based on the above technical solutions, the encoding / decoding module 1130 includes:

[0200] The sample acquisition module is configured to acquire image samples from a noise model that conforms to a probability distribution, thereby obtaining a sequence of noisy images.

[0201] The parameter input module is configured to input video style parameters, multimodal parameters, and noisy image sequences into a pre-trained denoising model. The denoising model includes an encoder for mapping image data to latent spatial features and a decoder for restoring latent spatial features to image data.

[0202] The parametric denoising module is configured to encode and decode video style parameters, multimodal parameters, and noisy image sequences in the denoising model to obtain a denoised video frame sequence.

[0203] In some embodiments of this application, based on the above technical solutions, the parameter input module is further configured to: acquire a key point image sequence, wherein the key point image includes key points used to describe the head motion posture; encode the key point image sequence to obtain a key point feature sequence; and fuse the key point feature sequence with a noisy image sequence to obtain a noisy image sequence carrying head motion posture information.

[0204] In some embodiments of this application, based on the above technical solutions, key points include facial feature points located in the area above the mouth.

[0205] In some embodiments of this application, based on the above technical solutions, the denoising model includes a self-attention layer for encoding and decoding noisy image sequences, a reference image attention layer for encoding and decoding image parameters, an audio attention layer for encoding and decoding audio parameters, a style attention layer for encoding and decoding video style parameters, and a temporal module for establishing a video frame time series.

[0206] In some embodiments of this application, based on the above technical solutions, the encoding / decoding module further includes:

[0207] The error acquisition module is configured to acquire the denoising loss error obtained by the denoising model through encoding and decoding of training samples.

[0208] The model update module is configured to iteratively update the model parameters of the denoising model based on the denoising loss error, and the magnitude of each update of the model parameters is negatively correlated with the number of iteration rounds.
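One way to make the update magnitude shrink as the iteration round grows is a decaying step size. The sketch below assumes plain gradient descent with an inverse-time decay; the decay formula, the `decay` constant, and the function names are illustrative choices that this application does not specify.

```python
from typing import List


def step_size(base_lr: float, iteration: int, decay: float = 0.01) -> float:
    """Inverse-time decay: the step size decreases monotonically with the iteration round."""
    return base_lr / (1.0 + decay * iteration)


def gradient_update(params: List[float], grads: List[float],
                    base_lr: float, iteration: int) -> List[float]:
    """Apply one gradient-descent step scaled by the decayed step size."""
    lr = step_size(base_lr, iteration)
    return [p - lr * g for p, g in zip(params, grads)]
```

For the same gradient, a later iteration moves the parameters by a smaller amount than an earlier one, which is the negative correlation described above.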

[0209] In some embodiments of this application, based on the above technical solutions, the model update module includes:

[0210] The first iteration module is configured to iteratively update the model parameters of the denoising model, the key point guidance model, and the image reference model based on the denoising loss error, until the preset first iteration termination condition is met; the key point guidance model is used to input head motion posture information into the denoising model, and the image reference model is used to input image parameters into the denoising model;

[0211] The second iteration module is configured to fix the model parameters of the key point guidance model and the image reference model, and iteratively update the model parameters of the denoising model according to the denoising loss error until the preset second iteration termination condition is met.

[0212] The third iteration module is configured to fix the model parameters of the self-attention layer, the reference image attention layer, the audio attention layer, and the temporal module, and iteratively update the model parameters of the style attention layer according to the denoising loss error until the preset third iteration termination condition is met.

[0213] The specific details of the video generation apparatus provided in the various embodiments of this application have been described in detail in the corresponding method embodiments, and will not be repeated here.

[0214] Figure 12 schematically illustrates a computer system architecture block diagram for implementing an electronic device according to an embodiment of the present application.

[0215] It should be noted that the computer system 1200 of the electronic device shown in Figure 12 is only an example and should not impose any limitations on the functionality and scope of use of the embodiments of this application.

[0216] As shown in Figure 12, the computer system 1200 includes a central processing unit (CPU) 1201, which can perform various appropriate actions and processes based on programs stored in read-only memory (ROM) 1202 or programs loaded from storage section 1208 into random access memory (RAM) 1203. The RAM 1203 also stores various programs and data required for system operation. The CPU 1201, ROM 1202, and RAM 1203 are interconnected via a bus 1204. An input / output interface 1205 (I / O interface) is also connected to the bus 1204.

[0217] The following components are connected to the input / output interface 1205: an input section 1206 including a keyboard, mouse, etc.; an output section 1207 including a cathode ray tube (CRT), liquid crystal display (LCD), etc., and speakers, etc.; a storage section 1208 including a hard disk, etc.; and a communication section 1209 including a network interface card such as a local area network card, modem, etc. The communication section 1209 performs communication processing via a network such as the Internet. A drive 1210 is also connected to the input / output interface 1205 as needed. A removable medium 1211, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., is installed on the drive 1210 as needed so that computer programs read from it can be installed into the storage section 1208 as needed.

[0218] Specifically, according to embodiments of this application, the processes described in the various method flowcharts can be implemented as computer software programs. For example, embodiments of this application include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via communication section 1209, and / or installed from removable medium 1211. When the computer program is executed by central processing unit 1201, it performs various functions defined in the system of this application.

[0219] It should be noted that the computer-readable medium shown in the embodiments of this application can be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), flash memory, optical fiber, portable compact disc read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this application, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In this application, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such transmitted data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. The computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to wireless, wired, etc., or any suitable combination thereof.

[0220] In summary, this application provides a video generation method, a video generation apparatus, a computer-readable medium, an electronic device, and a computer program product. During the video generation process, video style parameter distribution data conforming to a preset probability distribution is acquired. This ensures that the acquisition of video style parameters is not a fixed, single value, but rather a random sampling within a certain probability range, exhibiting uncertainty and diversity. Simultaneously, multimodal parameters describing the video content of the preset video are acquired. These multimodal parameters characterize the video content information from multiple dimensions. Encoding and decoding processing is performed on the multimodal parameters based on the video style parameter distribution data. The encoding and decoding process adjusts the expression form of the multimodal parameters according to different video style parameters, thereby ensuring that the final video frame sequence matches the video style indicated by the video style parameters. Due to the diversity of video style parameters, the generated video frame sequence can present a rich variety of styles, avoiding the problem of single style in traditional video generation, improving the flexibility and adaptability of video generation, enhancing the generalization ability of the video generation model under different style requirements, and greatly expanding the range of video style representation.

[0221] Furthermore, the preset probability distribution includes either a Gaussian distribution or a Boltzmann distribution. The Gaussian distribution possesses symmetry and maximum entropy: its probability density function is symmetric about the mean, and among continuous distributions with a given mean and variance it has the largest entropy. This means that when only the mean and variance are specified, the Gaussian distribution carries the greatest uncertainty. Setting the video style parameters to conform to a Gaussian distribution allows for a wider range of parameter values during sampling, introducing more random variations into the video style and resulting in richer and more diverse generated video styles. The Boltzmann distribution is related to the energy state of the system; different energy states correspond to different probabilities. Introducing the Boltzmann distribution into the determination of video style parameters allows for the allocation of style parameters based on the probabilities of different energy states, simulating different style preferences and states. This enables more complex and nuanced style variations in video generation, further improving the quality and diversity of generated video styles.
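The relationship between energy states and probabilities in a Boltzmann distribution can be sketched as follows. How an energy is assigned to a candidate style state is not specified in this application, so the `energies` list, the `temperature` parameter, and the function names are hypothetical.

```python
import math
import random
from typing import List, Optional


def boltzmann_probabilities(energies: List[float], temperature: float = 1.0) -> List[float]:
    """p_i is proportional to exp(-E_i / T); lower energy means higher probability."""
    lowest = min(energies)  # subtract the minimum for numerical stability
    weights = [math.exp(-(e - lowest) / temperature) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]


def sample_state(energies: List[float], temperature: float = 1.0,
                 rng: Optional[random.Random] = None) -> int:
    """Draw the index of one state according to its Boltzmann probability."""
    rng = rng or random.Random()
    probs = boltzmann_probabilities(energies, temperature)
    return rng.choices(range(len(energies)), weights=probs, k=1)[0]
```

A higher temperature flattens the probabilities and yields more varied draws, while a lower temperature concentrates the draws on the lowest-energy state.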

[0222] Furthermore, when obtaining video style parameter distribution data conforming to a preset probability distribution, a style reference video for extracting video style types is first acquired. The style reference video is actual video footage containing authentic style information, making style information extraction based on it more grounded in reality. Next, style modal features, including at least one modality type, are extracted from the style reference video. Different modalities, such as visual and auditory, can reflect the video's style from multiple perspectives. Finally, the style modal features are mapped according to a preset activation function to obtain video style parameter distribution data conforming to the preset probability distribution. The activation function can non-linearly transform the style modal features to conform to the preset probability distribution. This results in video style parameter distribution data that accurately reflects the video's style characteristics. Furthermore, because it conforms to the preset probability distribution, it allows for flexible sampling based on probability in subsequent video generation, generating diverse video styles and improving the accuracy and diversity of video style generation.

[0223] Furthermore, when extracting style modal features from the style reference video, image modal features are obtained by encoding the style reference video using a pre-trained image encoder. The image encoder, trained on a large amount of data, can accurately extract key features from the image, such as color, texture, and shape; these features are important components of the video's visual style. Simultaneously, audio modal features are obtained by encoding the style reference video using a pre-trained audio encoder. The audio encoder can capture information such as pitch, timbre, and rhythm in the audio, which constitutes the video's auditory style. Then, feature fusion processing is performed on the image and audio modal features to obtain style modal features that fuse the image and audio modalities. By fusing features from both image and audio modalities, the visual and auditory styles of the video can be comprehensively considered, avoiding the limitations of single-modal information. This more comprehensively and accurately portrays the overall style of the video, making the generated video more visually and aurally harmonious and unified, thus improving the integrity and realism of the video's style.

[0224] Furthermore, when performing feature fusion processing on image and audio modal features, their feature similarity is first obtained. Calculating feature similarity identifies the correlated parts between image and audio modal features; these correlated parts reflect the co-style information of the video in terms of both visual and auditory senses. Then, the feature similarity is mapped according to a pre-defined activation function to obtain attention weights that conform to a probability distribution. The activation function transforms feature similarity into probabilistic attention weights, which represent the importance of different features in style representation. Finally, the image or audio modal features are weighted according to the attention weights. This weighting process highlights relevant features and suppresses irrelevant features, allowing the model to focus more on style-related parts of the image and audio modal features. This improves the accuracy of style modal features in representing the inherent style of the video, making the generated video style more realistic, reducing style bias, and enhancing the consistency and accuracy of the video style.
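The similarity, activation, and weighting steps above can be sketched with a softmax over dot-product similarities. Using softmax as the activation function, scaling by the square root of the feature dimension, and adding the weighted audio features back onto each image feature are assumptions; the application only states that similarities are mapped to weights that conform to a probability distribution and then used for weighting.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Map raw similarity scores to non-negative weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(image_tokens: List[Vector], audio_tokens: List[Vector]) -> List[Vector]:
    """For each image feature, weight the audio features by similarity and add the result."""
    dim = len(audio_tokens[0])  # image and audio features are assumed to share this dimension
    fused = []
    for img in image_tokens:
        sims = [sum(i * a for i, a in zip(img, aud)) / math.sqrt(dim) for aud in audio_tokens]
        weights = softmax(sims)
        attended = [sum(w * aud[k] for w, aud in zip(weights, audio_tokens)) for k in range(dim)]
        fused.append([i + a for i, a in zip(img, attended)])
    return fused
```

Audio features that are more similar to a given image feature receive larger weights, so they contribute more to the fused style modal feature.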

[0225] Furthermore, when the preset probability distribution model includes a Gaussian distribution model, the style modal features are mapped according to a preset activation function to obtain the mean and variance that conform to the probability distribution. The mean and variance are two key parameters of the Gaussian distribution; the mean determines the central location of the distribution, and the variance determines the degree of dispersion. By mapping the style modal features to obtain these two parameters, the shape of the Gaussian distribution can be adjusted according to the actual style characteristics of the video. Then, a weighted operation is performed on the preset multivariate normal distribution based on the mean and variance to obtain video style parameters that conform to the Gaussian distribution. The multivariate normal distribution considers the correlation between multiple variables, and in the generation of video style parameters, it can integrate multiple style feature variables, making the generated video style parameters more reasonable and natural. The video style parameters generated based on the Gaussian distribution have a certain degree of randomness and continuity, which can produce smooth transitions in style changes during video generation, increasing the naturalness and diversity of the video style.
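Drawing video style parameters from a Gaussian distribution with a predicted mean and variance can be sketched by scaling and shifting a standard normal sample. The sketch assumes a diagonal covariance (each dimension sampled independently), which is a simplification of the multivariate normal distribution mentioned above; the function name is hypothetical.

```python
import math
import random
from typing import List


def sample_style(mean: List[float], variance: List[float], rng: random.Random) -> List[float]:
    """z = mean + sqrt(variance) * eps, with eps drawn from a standard normal distribution."""
    return [
        m + math.sqrt(v) * rng.gauss(0.0, 1.0)
        for m, v in zip(mean, variance)
    ]
```

Repeated calls with the same mean and variance return different style parameters, which is the source of style diversity; a variance of zero collapses the sample to the mean.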

[0226] Furthermore, before mapping the style modality features according to the preset activation function, pre-collected style modality feature samples are acquired. Style modality feature samples with the same video style type are paired as positive samples, and style modality feature samples with different video style types are paired as negative samples. The construction of positive and negative sample pairs provides the model with clear comparative information; positive samples are used to strengthen the association between features of the same style, while negative samples are used to distinguish the differences between different style features. The positive and negative sample pairs are mapped according to the preset activation function to obtain a similarity loss error that conforms to a probability distribution. The similarity loss error reflects the model's ability to distinguish between different style features. By minimizing the similarity loss error, the model can better learn the intrinsic relationships between style features. The activation parameters of the preset activation function are updated based on the similarity loss error. Updating the activation parameters adjusts the mapping relationship of the activation function, enabling it to more accurately map style modality features to video style parameters that conform to a probability distribution, improving the accuracy and stability of video style parameter generation, and thus enhancing the quality of video style generation.
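A loss with the stated behavior, decreasing as positive pairs become more similar and increasing as negative pairs become more similar, can be sketched with a contrastive (InfoNCE-style) formulation. This specific loss, the cosine similarity measure, and the `temperature` value are swapped-in illustrative choices; this application does not name the loss it uses.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def similarity_loss(anchor: Vector, positive: Vector, negatives: List[Vector],
                    temperature: float = 0.1) -> float:
    """Low when the anchor is close to the positive and far from every negative."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))
```

Minimizing this value over many sample pairs would pull together the mapped features of samples with the same video style type and push apart those of different types.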

[0227] Furthermore, when acquiring pre-collected style modality feature samples, pre-collected image modality feature samples and audio modality feature samples are obtained. Images and audio are two important components of video; acquiring feature samples from these two modalities separately can comprehensively obtain the style information of the video. At least one feature value in either the image modality feature sample or the audio modality feature sample is set to zero according to a preset probability value, and then feature fusion processing is performed on the image modality feature sample and the audio modality feature sample. Setting some feature values to zero simulates the information loss situation that may occur in real-world applications. Performing feature fusion processing under this condition can enhance the model's robustness to information loss. The model needs to be able to accurately fuse image and audio modality features even with partial information loss, thereby improving the model's generalization ability and ensuring that the generated video style maintains a certain degree of stability and accuracy under different information conditions.
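The random zeroing of one modality before fusion can be sketched as follows. Zeroing an entire modality, choosing between image and audio with equal chance, and fusing by concatenation are assumptions made for illustration; the function names are hypothetical.

```python
import random
from typing import List, Tuple

Vector = List[float]


def drop_one_modality(image_feat: Vector, audio_feat: Vector, drop_prob: float,
                      rng: random.Random) -> Tuple[Vector, Vector]:
    """With probability drop_prob, replace exactly one modality with zeros."""
    if rng.random() < drop_prob:
        if rng.random() < 0.5:  # equal chance per modality (assumption)
            image_feat = [0.0] * len(image_feat)
        else:
            audio_feat = [0.0] * len(audio_feat)
    return image_feat, audio_feat


def make_style_sample(image_feat: Vector, audio_feat: Vector, drop_prob: float,
                      rng: random.Random) -> Vector:
    """Apply modality dropping, then fuse by concatenation (assumption)."""
    image_feat, audio_feat = drop_one_modality(image_feat, audio_feat, drop_prob, rng)
    return image_feat + audio_feat
```

Because the other modality is always left intact, every training sample still carries style information from at least one modality.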

[0228] Furthermore, the audio parameters are obtained by encoding the style reference video using a pre-trained audio encoder to obtain an audio feature sequence. The audio encoder, trained on a large amount of audio data, can effectively extract key features from the audio. This audio feature sequence contains information such as frequency, amplitude, and rhythm, which are important indicators of the audio style. Then, based on the audio feature sequence, the pre-trained audio encoder performs encoding processing to obtain audio parameters. Audio parameters are a further abstraction and compression of the audio feature sequence, providing a more concise representation of the audio's style information. In video generation, audio parameters can control the video's audio style, such as music rhythm and sound effect types, making the generated video more consistent with the expected style and enhancing the audio experience.

[0229] Furthermore, image parameters are obtained by encoding the reference image using a pre-trained image reference model to obtain encoded parameters describing the reference image. The image reference model, trained on a large amount of image data, can accurately extract key features of the reference image. The encoded parameters contain information such as color, texture, and shape of the reference image. Then, based on the encoded parameters, the pre-trained image reference model performs decoding processing to obtain decoded parameters describing the reference image. The decoded parameters are a restoration and refinement of the encoded parameters, further enriching the image's detailed information. Using the encoded and decoded parameters as reference conditions in video generation, the generated video frames can visually inherit the content characteristics of the reference image to the greatest extent, such as color matching and object shapes, improving the consistency of the video's visual style with the reference image and giving the generated video a specific visual style.

[0230] Furthermore, when encoding and decoding multimodal parameters based on video style parameter distribution data, a noisy image sequence is first acquired. This noisy image sequence is randomly generated image data containing various random pixel information. Then, the video style parameters, multimodal parameters, and noisy image sequence are input into a pre-trained denoising model for denoising. During training, the denoising model learns how to remove noise and generate meaningful images based on the input parameters. The video style parameters provide style guidance for the denoising process, the multimodal parameters provide specific information about the video content, and the noisy image sequence, as the initial input, is gradually transformed into images that conform to the video style and content requirements during the denoising process. Guided by various dynamic reference conditions, the denoising model can generate video frame sequences with different styles and content based on different video style parameters and multimodal parameters. The generated video content can include various emotions and facial expressions, greatly improving the diversity of video styles and the richness of video content.
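The gradual transformation of a noisy input into a clean output can be sketched with a deterministic Euler sampling loop. The linear noise schedule and the `denoise` callable are stand-ins: in the described method the denoising model would also receive the video style parameters and multimodal parameters, and the schedule used by EulerDiscreteScheduler differs from this toy one.

```python
from typing import Callable, List

Vector = List[float]


def linear_sigmas(sigma_max: float, steps: int) -> List[float]:
    """Toy noise schedule: decreases linearly from sigma_max to exactly 0 in `steps` steps."""
    return [sigma_max * (1.0 - i / steps) for i in range(steps + 1)]


def euler_sample(x_init: Vector, sigmas: List[float],
                 denoise: Callable[[Vector, float], Vector]) -> Vector:
    """Step from noise level sigmas[0] down to sigmas[-1] with Euler updates."""
    x = list(x_init)
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        x0_hat = denoise(x, sigma)  # model's estimate of the clean sample
        direction = [(xi - x0i) / sigma for xi, x0i in zip(x, x0_hat)]
        x = [xi + (sigma_next - sigma) * di for xi, di in zip(x, direction)]
    return x
```

With `steps` set to 25, the loop calls the denoiser 25 times, matching the number of denoising steps used during inference in the embodiments above.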

[0231] Furthermore, before inputting the noisy image sequence into the pre-trained denoising model, a pre-collected keypoint image sequence is acquired. This keypoint image sequence contains positional information of key parts in the video, such as facial feature points and body joints. The keypoint image sequence is then encoded to obtain a keypoint feature sequence. This encoding process transforms the keypoint image sequence into a feature form more suitable for model processing, containing information such as the position and motion of keypoints. The keypoint feature sequence is then fused with the noisy image sequence. This fusion process ensures that the noisy image sequence contains keypoint information. In the subsequent denoising process, the denoising model can adjust image generation based on the keypoint feature sequence, introducing reference conditions such as head movement posture into video generation. This results in more natural and accurate postures of people in the generated video, enriching the dynamic performance of the video and increasing the diversity of video styles.

[0232] Furthermore, key points include facial feature points in the area above the mouth. These feature points reflect information such as facial expressions and head movements. In video generation, these key points, combined with other reference conditions, enable the coordination between head and lip movements. For example, when a person is speaking, changes in the facial feature points above the mouth accurately reflect lip movements, resulting in more consistent facial expressions and actions in the generated video. This further enhances the diversity of video styles and overall video quality, increasing the video's realism and vividness.

[0233] Furthermore, the denoising model comprises a self-attention layer, a reference image attention layer, an audio attention layer, a style attention layer, and a temporal module. The self-attention layer focuses on the feature correlations between different locations within a noisy image sequence. By calculating attention weights between different locations, it highlights important features in the image, enabling the model to better capture the overall structure and local details of the image. The reference image attention layer processes the reference image according to image parameters, incorporating its style and content information into the denoising process, making the generated video frames visually closer to the reference image. The audio attention layer controls the lip movements of the characters based on audio parameters, synchronizing the lip movements in the video with the audio, enhancing the audiovisual consistency of the video. The style attention layer adjusts the video style according to video style parameters, stylizing the image during denoising based on different style parameters to ensure the generated video conforms to the expected style. The temporal module establishes the temporal sequence relationship between video frames, ensuring the generated video frames are temporally coherent and avoiding jumps and discontinuities. The collaborative work of multiple layers and modules enables the denoising model to comprehensively consider multiple aspects of video information, accurately generate video frame sequences that meet the requirements, improve the accuracy and stability of video generation, and further enrich the stylistic expression of the video.
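One possible arrangement of these layers is sketched below. It is illustrative only: the ordering of the layers, the residual connections, the use of scaled dot-product attention without learned projections, and the realization of the temporal module as attention across frames are all assumptions, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

def residual(x, update):
    """Add an update to its input, vector by vector."""
    return [[a + b for a, b in zip(x_row, u_row)] for x_row, u_row in zip(x, update)]

def denoising_block(frames, reference, audio, style):
    """frames: list over time of token lists; every token is a vector."""
    out = []
    for tokens in frames:
        h = residual(tokens, attention(tokens, tokens, tokens))  # self-attention
        h = residual(h, attention(h, reference, reference))      # reference image attention
        h = residual(h, attention(h, audio, audio))              # audio attention
        h = residual(h, attention(h, style, style))              # style attention
        out.append(h)
    # Temporal module (assumed form): attention across frames per token position.
    for pos in range(len(out[0])):
        track = [frame[pos] for frame in out]
        mixed = residual(track, attention(track, track, track))
        for t, vec in enumerate(mixed):
            out[t][pos] = vec
    return out
```

In this sketch each condition enters through its own attention step, so changing only the `style` vectors changes the output while the reference image and audio conditions stay in place.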

[0234] Furthermore, before the noisy image sequence is input into the pre-trained denoising model to generate video, the denoising model is further trained. A noisy image sequence is denoised by the pre-trained denoising model to obtain a denoised image sequence, and the similarity loss error between the denoised image sequence and a pre-acquired real image sequence is then obtained. The similarity loss error reflects the degree of difference between the denoised image sequence and the real image sequence; by minimizing this error, the model learns to remove noise more effectively and to generate results close to the real images. The model parameters of the pre-trained denoising model are updated based on the similarity loss error. Updating the model parameters adjusts the mapping learned by the denoising model, making it more accurate in subsequent denoising, which improves the performance of the denoising model and, in turn, the realism and clarity of the generated video.
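The update described in this paragraph can be sketched with a deliberately small example. It assumes a mean squared error as the similarity loss error, plain gradient descent as the update rule, and a hypothetical one-parameter denoiser that multiplies every pixel by a single weight; none of these choices is stated in the source.

```python
def mse(pred, target):
    """Mean squared error, used here as an assumed similarity loss error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def train_step(weight, noisy, real, lr=0.1):
    """One gradient-descent update of a one-parameter toy denoiser.

    The toy denoiser computes denoised = weight * noisy, so the gradient of
    the loss with respect to weight is mean(2 * (denoised - real) * noisy).
    """
    denoised = [weight * x for x in noisy]
    loss = mse(denoised, real)
    grad = sum(2 * (d - r) * x for d, r, x in zip(denoised, real, noisy)) / len(noisy)
    return weight - lr * grad, loss

def train(weight, noisy, real, steps=20):
    """Repeat the update and record the loss measured before each step."""
    losses = []
    for _ in range(steps):
        weight, loss = train_step(weight, noisy, real)
        losses.append(loss)
    return weight, losses
```

With `noisy = [1.0, 2.0, 3.0]` and `real = [0.5, 1.0, 1.5]`, the recorded loss decreases at every step and the weight approaches 0.5, the value at which the denoised sequence matches the real one.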

[0235] Furthermore, a three-stage progressive training method is adopted when updating the model parameters of the pre-trained denoising model based on the similarity loss error. In the first stage, the model parameters of the pre-trained denoising model, the pre-trained keypoint guidance model, and the pre-trained image reference model are updated simultaneously. This joint update allows the three models to learn collaboratively from the initial stage and to adapt together to the video generation task, accelerating convergence and improving overall performance. In the second stage, the model parameters of the pre-trained keypoint guidance model and the pre-trained image reference model are fixed, and only the model parameters of the pre-trained denoising model are updated. At this stage, the keypoint guidance model and the image reference model already provide relatively stable keypoint information and image reference information; focusing on the parameters of the denoising model further improves the denoising effect and the quality of the generated video frames. In the third stage, the model parameters of the pre-trained keypoint guidance model and the pre-trained image reference model remain fixed, the model parameters of the self-attention layer, the reference image attention layer, the audio attention layer, and the temporal module in the pre-trained denoising model are also fixed, and only the model parameters of the style attention layer in the pre-trained denoising model are updated. The style attention layer is responsible for adjusting the style of the video. Updating its parameters separately at this stage fine-tunes the video style, making the generated video styles more diverse and personalized and further improving the overall quality of the video.
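The three-stage schedule can be summarized as a mapping from training stage to the set of components whose parameters are updated. The sketch below is illustrative: the component names are hypothetical labels for the models and layers named in the description, and a real training framework would translate these sets into its own mechanism for freezing parameters.

```python
# Hypothetical labels for the sub-layers of the denoising model.
DENOISER_LAYERS = [
    "self_attention",
    "reference_image_attention",
    "audio_attention",
    "style_attention",
    "temporal_module",
]
# Hypothetical labels for the two auxiliary models.
AUXILIARY_MODELS = ["keypoint_guidance_model", "image_reference_model"]

def trainable_components(stage):
    """Return the components whose parameters are updated in a given stage."""
    if stage == 1:
        # Stage 1: denoising model and both auxiliary models are updated jointly.
        return set(DENOISER_LAYERS) | set(AUXILIARY_MODELS)
    if stage == 2:
        # Stage 2: auxiliary models are fixed; the whole denoising model is updated.
        return set(DENOISER_LAYERS)
    if stage == 3:
        # Stage 3: only the style attention layer is updated.
        return {"style_attention"}
    raise ValueError("stage must be 1, 2, or 3")

def frozen_components(stage):
    """Return the components whose parameters are fixed in a given stage."""
    everything = set(DENOISER_LAYERS) | set(AUXILIARY_MODELS)
    return everything - trainable_components(stage)
```

For example, `frozen_components(3)` contains both auxiliary models and every denoiser layer except `style_attention`, so the set of updated parameters narrows from one stage to the next.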

[0236] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0237] It should be noted that although several modules or units for the device used to perform actions have been mentioned in the detailed description above, this division is not mandatory. In fact, according to the embodiments of this application, the features and functions of two or more modules or units described above can be embodied in one module or unit. Conversely, the features and functions of one module or unit described above can be further divided and embodied by multiple modules or units.

[0238] Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein can be implemented by software or by combining software with necessary hardware. Therefore, the technical solutions according to the embodiments of this application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, external hard drive, etc.) or on a network, including several instructions to cause a computing device (such as a personal computer, server, touch terminal, or network device, etc.) to execute the method according to the embodiments of this application.

[0239] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0240] The embodiments described above merely illustrate several implementations of this application. Although the descriptions are relatively specific and detailed, they should not be construed as limiting the scope of the patent. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application should be determined by the appended claims.

Claims

1. A video generation method, executed by an electronic device, comprising: obtaining video style parameter distribution data that conforms to a preset probability distribution, the video style parameter distribution data being used to describe the manifestation of the preset probability distribution and to determine video style parameters, and the manifestation of the preset probability distribution describing the distribution of the video style parameters in a preset dimension; obtaining multimodal parameters for describing the video content of a preset video, the multimodal parameters including audio parameters and image parameters; and encoding and decoding the multimodal parameters based on the video style parameter distribution data to obtain a video frame sequence that includes the video content and matches the video style indicated by the video style parameters.

2. The video generation method according to claim 1, wherein the preset probability distribution includes a Gaussian distribution or a Boltzmann distribution.

3. The video generation method according to claim 1 or 2, wherein obtaining video style parameter distribution data that conforms to a preset probability distribution comprises: obtaining a style reference video for extracting a video style type; extracting style modality features, including at least one modality type, from the style reference video; and mapping the style modality features according to a preset activation function to obtain the video style parameter distribution data that conforms to the preset probability distribution.

4. The video generation method according to claim 3, wherein extracting style modality features, including at least one modality type, from the style reference video comprises: performing image encoding on the style reference video using a pre-trained image encoder to obtain image modality features; performing audio encoding on the style reference video using a pre-trained audio encoder to obtain audio modality features; and performing feature fusion processing on the image modality features and the audio modality features to obtain style modality features that fuse the image modality and the audio modality.

5. The video generation method according to claim 4, wherein performing feature fusion processing on the image modality features and the audio modality features comprises: obtaining a feature similarity between the image modality features and the audio modality features; mapping the feature similarity according to a preset activation function to obtain attention weights that conform to a probability distribution; and weighting the image modality features or the audio modality features according to the attention weights.

6. The video generation method according to any one of claims 3 to 5, wherein the preset probability distribution includes a Gaussian distribution, and mapping the style modality features according to a preset activation function to obtain video style parameter distribution data that conforms to the preset probability distribution comprises: mapping the style modality features according to the preset activation function to obtain a mean and a variance that conform to the probability distribution; and weighting a preset multivariate normal distribution according to the mean and the variance to obtain video style parameters that conform to the Gaussian distribution.

7. The video generation method according to any one of claims 3 to 6, wherein, before mapping the style modality features according to a preset activation function, the method further comprises: obtaining pre-collected style modality feature samples; grouping style modality feature samples having the same video style type into positive sample pairs, and grouping style modality feature samples having different video style types into negative sample pairs; mapping the positive sample pairs and the negative sample pairs according to the preset activation function to obtain a similarity loss error of the predicted values of the sample pairs, the similarity loss error being negatively correlated with the similarity of the predicted values of the positive sample pairs and positively correlated with the similarity of the predicted values of the negative sample pairs; and iteratively updating the activation parameters of the activation function based on the similarity loss error.

8. The video generation method according to claim 7, wherein obtaining pre-collected style modality feature samples comprises: obtaining image modality feature samples and audio modality feature samples from a pre-collected sample dataset; setting the feature values of one of the image modality feature samples and the audio modality feature samples to zero according to a preset probability; and performing feature fusion processing on the image modality feature samples and the audio modality feature samples to obtain style modality feature samples that fuse the image modality and/or the audio modality.

9. The video generation method according to any one of claims 1 to 8, wherein obtaining the audio parameters comprises: obtaining a style reference video for extracting a video style type; extracting an audio feature sequence from the style reference video; and encoding the audio feature sequence using a pre-trained audio encoder to obtain the audio parameters.

10. The video generation method according to any one of claims 1 to 9, wherein obtaining the image parameters comprises: obtaining a reference image for describing the visual modality content of the video; and encoding and decoding the reference image according to a pre-trained image reference model to obtain the image parameters, wherein the image reference model includes an encoder and a decoder connected in sequence, and the image parameters include image encoding parameters output by the encoder and image decoding parameters output by the decoder.

11. The video generation method according to any one of claims 1 to 10, wherein encoding and decoding the multimodal parameters based on the video style parameter distribution data comprises: sampling images from a noise model that conforms to a probability distribution to obtain a noisy image sequence; inputting the video style parameters, the multimodal parameters, and the noisy image sequence into a pre-trained denoising model, the denoising model including an encoder for mapping image data to latent space features and a decoder for restoring the latent space features to image data; and encoding and decoding, in the denoising model, the video style parameters, the multimodal parameters, and the noisy image sequence to obtain a video frame sequence with the noise removed.

12. The video generation method according to claim 11, wherein, before inputting the noisy image sequence into the denoising model, the method further comprises: acquiring a keypoint image sequence, wherein the keypoint images include keypoints used to describe a head motion posture; encoding the keypoint image sequence to obtain a keypoint feature sequence; and fusing the keypoint feature sequence with the noisy image sequence to obtain a noisy image sequence carrying head motion posture information.

13. The video generation method according to claim 12, wherein the keypoints include facial feature points located in the region above the mouth.

14. The video generation method according to any one of claims 11 to 13, wherein the denoising model comprises a self-attention layer for encoding and decoding the noisy image sequence, a reference image attention layer for encoding and decoding the image parameters, an audio attention layer for encoding and decoding the audio parameters, a style attention layer for encoding and decoding the video style parameters, and a temporal module for establishing a video frame time series.

15. The video generation method according to claim 14, further comprising, before encoding and decoding the multimodal parameters based on the video style parameter distribution data: obtaining a denoising loss error produced by encoding and decoding training samples using the denoising model; and iteratively updating the model parameters of the denoising model based on the denoising loss error, wherein the update magnitude of the model parameters is negatively correlated with the iteration round.

16. The video generation method according to claim 15, wherein iteratively updating the model parameters of the denoising model based on the denoising loss error comprises: iteratively updating the model parameters of the denoising model, a keypoint guidance model, and an image reference model based on the denoising loss error until a preset first iteration termination condition is met, wherein the keypoint guidance model is used to input head motion posture information into the denoising model, and the image reference model is used to input the image parameters into the denoising model; fixing the model parameters of the keypoint guidance model and the image reference model, and iteratively updating the model parameters of the denoising model according to the denoising loss error until a preset second iteration termination condition is met; and fixing the model parameters of the self-attention layer, the reference image attention layer, the audio attention layer, and the temporal module, and iteratively updating the model parameters of the style attention layer according to the denoising loss error until a preset third iteration termination condition is met.

17. A video generation apparatus, comprising: a first acquisition module configured to acquire video style parameter distribution data that conforms to a preset probability distribution, the video style parameter distribution data being used to describe the manifestation of the preset probability distribution and to determine video style parameters, and the manifestation of the preset probability distribution describing the distribution of the video style parameters in a preset dimension; a second acquisition module configured to acquire multimodal parameters used to describe the video content of a preset video, the multimodal parameters including audio parameters and image parameters; and an encoding/decoding module configured to encode and decode the multimodal parameters based on the video style parameter distribution data to obtain a video frame sequence that includes the video content and matches the video style indicated by the video style parameters.

18. A computer-readable medium having a computer program stored thereon, the computer program, when executed by a processor, implementing the video generation method according to any one of claims 1 to 16.

19. An electronic device, comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to execute the executable instructions to implement the video generation method according to any one of claims 1 to 16.

20. A computer program product comprising a computer program that, when executed by a processor, implements the video generation method according to any one of claims 1 to 16.