Image and text based video generation method and apparatus
By acquiring fusion encoding and using a decoder to generate target keypoint sequences, the domain dependency and unstable generation effects of existing GAN methods are solved, realizing a video generation method based on images and text, and the generated video matches the original video semantically.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TSINGHUA UNIVERSITY
- Filing Date
- 2023-04-10
- Publication Date
- 2026-06-23
AI Technical Summary
Existing GAN-based facial image processing methods suffer from problems such as high domain dependence, unstable generation effects, and high training costs, and there is a lack of research on video synthesis that utilizes rich semantic knowledge from images and text.
By acquiring fusion encoding, including image encoding of reference images and text encoding of action text, a target keypoint sequence is generated using a decoder to reconstruct the action described in the action text of the target object in the video. The decoder is trained based on keypoint sequences and fusion encoded samples from the original video samples.
The generated reconstructed video semantically matches the original video, solving the problem of unstable generation effects in existing technologies. It realizes a video generation method based on images and text, integrating the semantic information of reference images and action text.
Figure CN116363563B_ABST
Description
Technical Field
[0001] This disclosure relates to the field of computer vision technology, and in particular to a method and apparatus for video generation based on images and text.
Background Technology
[0002] The rapidly developing digital entertainment industry has driven demand for various facial image processing applications, such as facial makeup, hairstyling, expression editing, and speech-video synthesis. Generative Adversarial Networks (GANs) have achieved significant success in these areas. However, GAN-based methods still have some limitations, such as high domain dependence, unstable generative effects, and high training costs.
[0003] Recently, following the success of the CLIP (Contrastive Language-Image Pre-training) model, which constructs a cross-modal semantic space between images and text, several text-guided facial image generation methods have been proposed in the related art to synthesize images from arbitrary text descriptions. However, research on video synthesis that utilizes the rich semantic knowledge in images and text is still relatively lacking.
Summary of the Invention
[0004] In view of the above problems, this disclosure provides a video generation method and apparatus based on images and text to overcome or at least partially solve the above problems.
[0005] A first aspect of this disclosure provides a video generation method based on images and text, comprising:
[0006] Obtain the fusion encoding, which includes: image encoding of a reference image and text encoding of action text, wherein the reference image includes the target object;
[0007] The fusion encoding is input into the decoder to obtain the target key point sequence;
[0008] Based on the reference image and the target key point sequence, a reconstructed video is obtained. The video content of the reconstructed video is: the target object performing the action described in the action text.
[0009] The decoder is trained based on the first keypoint sequence of the original video sample and the second keypoint sequence corresponding to the reconstructed video sample, as well as the original video encoded sample and the fused encoded sample. The fused encoded sample includes the text encoding of the action text sample and the image encoding of the reference image sample.
[0010] Optionally, the training process of the decoder includes the following steps:
[0011] Acquire sample data pairs, the sample data pairs including the action text sample, the reference image sample and the original video sample, the action text sample is used to describe the action performed by the object sample in the original video sample, and the reference image sample is the first video frame of the original video sample;
[0012] The fused encoded sample is obtained based on the action text sample and the reference image sample;
[0013] The original video sample is input into the key point extractor to obtain the first key point sequence;
[0014] The first keypoint sequence is input into the encoder to be trained to obtain the original video encoded sample;
[0015] The original video encoded sample is input into the decoder to be trained to obtain the second keypoint sequence;
[0016] Based on the first keypoint sequence and the second keypoint sequence, as well as the original video coding sample and the fused coding sample, the encoder and the decoder to be trained are trained to obtain the trained encoder and the decoder.
[0017] Optionally, the step of training the encoder and decoder to be trained based on the first keypoint sequence and the second keypoint sequence, as well as the original video coding sample and the fused coding sample, to obtain the trained encoder and decoder, includes:
[0018] Based on the first keypoint sequence and the second keypoint sequence, a keypoint reconstruction loss function is established;
[0019] Based on the original video coding samples and the fused coding samples, a coding loss function is established;
[0020] The total loss function is determined based on the key point reconstruction loss function and the encoding loss function;
[0021] The encoder and decoder to be trained are trained according to the total loss function to obtain the trained encoder and decoder.
[0022] Optionally, establishing a keypoint reconstruction loss function based on the first keypoint sequence and the second keypoint sequence includes:
[0023] The differences in key points for each frame are determined based on the differences between each element in the first key point sequence and each element with the same index in the second key point sequence.
[0024] Based on the two elements of each pair of adjacent indices in the first keypoint sequence and the two elements of each pair of adjacent indices in the second keypoint sequence, determine the difference in keypoint movement between each pair of adjacent frames.
[0025] Based on the differences in keypoints in each frame and the differences in keypoint movement between two adjacent frames, a keypoint reconstruction loss function is established.
[0026] Optionally, establishing the coding loss function based on the original video coding samples and the fused coding samples includes:
[0027] Based on the original video encoded sample and the fused encoded sample, determine the similarity between the original video encoded sample and the fused encoded sample;
[0028] Based on the similarity, the encoding loss function is established.
[0029] Optionally, obtaining the fusion code includes:
[0030] Obtain the reference image and the action text;
[0031] The reference image and the action text are input into a text image pre-training model to obtain the image encoding of the reference image and the text encoding of the action text;
[0032] The image encoding of the reference image and the text encoding of the action text are fused together to obtain the fused encoding.
[0033] Optionally, obtaining the reconstructed video based on the reference image and the target key point sequence includes:
[0034] Extract the appearance features of the target object from the reference image, as well as multiple key points of the target object related to the action;
[0035] The synthetic flow field is determined based on the multiple key points and the target key point sequence;
[0036] The reconstructed video is synthesized based on the synthesized flow field and the appearance features.
[0037] Optionally, the target object is a face, and the action text is used to describe facial actions.
[0038] Optionally, the method further includes:
[0039] Obtain a verification data pair, which includes verification action text, verification reference image and verification video. The verification action text is used to describe the action performed by the verification object in the verification video, and the verification reference image is the first video frame of the verification video.
[0040] Based on the verification action text and the verification reference image, a reconstructed verification video is obtained;
[0041] Obtain the third key point sequence corresponding to the verification video, and obtain the fourth key point sequence corresponding to the reconstructed verification video;
[0042] Determining the semantic matching degree between the verification video and the reconstructed verification video based on the third keypoint sequence and the fourth keypoint sequence includes:
[0043] The semantic matching degree between the verification video and the reconstructed verification video is determined according to the following formula:
[0044] $$\mathrm{LDTW} = \frac{1}{Q \cdot N} \sum_{q=1}^{Q} \sum_{n=1}^{N} \left\| \hat{y}_{q,n} - y_{q,n} \right\|_2, \qquad (\hat{y}, y) = \mathrm{OM}(\hat{x}, x)$$
[0045] Wherein, LDTW represents the semantic matching degree; OM(·) is the optimal matching algorithm, applied to the fourth keypoint sequence $\hat{x}$ and the third keypoint sequence $x$; Q is the matching sequence length corresponding to the third keypoint sequence and also the matching sequence length corresponding to the fourth keypoint sequence, q=1,2,…,Q; N represents the number of keypoints extracted in a video frame, n=1,2,…,N; $\hat{y}_{q,n}$ denotes the n-th keypoint included in the q-th element of the matching sequence corresponding to the fourth keypoint sequence; $y_{q,n}$ denotes the n-th keypoint included in the q-th element of the matching sequence corresponding to the third keypoint sequence; $\|\cdot\|_2$ represents the 2-norm.
[0046] A second aspect of this disclosure provides a video generation apparatus based on images and text, comprising:
[0047] The acquisition module is used to acquire the fusion encoding, which includes: image encoding of a reference image and text encoding of action text, wherein the reference image includes the target object;
[0048] The input module is used to input the fusion encoding into the decoder to obtain the target key point sequence;
[0049] The reconstruction module is used to obtain a reconstructed video based on the reference image and the target key point sequence. The video content of the reconstructed video is: the target object performing the action described in the action text.
[0050] The decoder is trained based on the first keypoint sequence of the original video sample and the second keypoint sequence corresponding to the reconstructed video sample, as well as the original video encoded sample and the fused encoded sample. The fused encoded sample includes the text encoding of the action text sample and the image encoding of the reference image sample.
[0051] A third aspect of this disclosure provides an electronic device, including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to execute instructions to implement the image and text-based video generation method of the first aspect.
[0052] A fourth aspect of this disclosure provides a computer-readable storage medium that, when instructions in the computer-readable storage medium are executed by a processor of an electronic device, enables the electronic device to perform the image and text-based video generation method of the first aspect.
[0053] The embodiments disclosed herein have the following advantages:
[0054] In this embodiment, the fusion encoding includes image encoding of the reference image and text encoding of the action text. Therefore, the fusion encoding includes semantic information of the reference image and the action text, and consequently, the reconstructed video generated based on the fusion encoding also contains semantic information of the reference image and the action text. The decoder is trained based on the first keypoint sequence of the original video sample and the second keypoint sequence corresponding to the reconstructed video sample, as well as the original video encoded sample and the fusion encoded sample. Therefore, the reconstructed video reconstructed based on the target keypoint sequence obtained by the decoder is semantically close to that of the original video.
Attached Figure Description
[0055] To more clearly illustrate the technical solutions of the embodiments of this disclosure, the accompanying drawings used in the description of the embodiments of this disclosure will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this disclosure. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0056] Figure 1 is a flowchart of the steps of a video generation method based on images and text in an embodiment of this disclosure;
[0057] Figure 2 is a schematic diagram of the process of training the decoder in an embodiment of this disclosure;
[0058] Figure 3 is a schematic diagram of the structure of a video generation device based on images and text in an embodiment of this disclosure.
Detailed Implementation
[0059] To make the above-mentioned objectives, features and advantages of this disclosure more apparent and understandable, the disclosure will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0060] Referring to Figure 1, a flowchart of the steps of a video generation method based on images and text in an embodiment of this disclosure is shown. As shown in Figure 1, the video generation method based on images and text may specifically include steps S11 to S13.
[0061] Step S11: Obtain the fusion encoding, which includes: image encoding of the reference image and text encoding of the action text, wherein the reference image includes the target object;
[0062] Step S12: Input the fusion encoding into the decoder to obtain the target key point sequence;
[0063] Step S13: Based on the reference image and the target key point sequence, a reconstructed video is obtained. The video content of the reconstructed video is: the target object performing the action described in the action text.
[0064] The decoder is trained based on the first keypoint sequence of the original video sample and the second keypoint sequence corresponding to the reconstructed video sample, as well as the original video encoded sample and the fused encoded sample. The fused encoded sample includes the text encoding of the action text sample and the image encoding of the reference image sample.
[0065] The action text is text describing an action, and the reference image includes a target object that can perform the action described in the action text. Optionally, the reference image can be a face reference image, in which the target object is a face, and the action described in the action text is a facial action, such as raising eyebrows, opening the mouth, or lowering the chin.
[0066] First, a reference image and at least one action text can be obtained. The reference image and at least one action text are then input into a pre-trained CLIP model. The CLIP model includes an image encoder that determines the image encoding of the reference image, and a text encoder that determines the text encoding of each action text. The training method for the CLIP model can refer to relevant techniques and is not limited here.
[0067] When there is only one text encoding for action text, the text encoding of the action text is fused with the image encoding of the reference image to obtain the fused encoding. When there are multiple text encodings for action text, the text encodings of the multiple action texts are first randomly concatenated, and then the text encodings of the concatenated multiple action texts are fused with the image encoding of the reference image to obtain the fused encoding.
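The fusion rule in this paragraph can be illustrated with a minimal sketch. It assumes that "fusing" means plain concatenation with the image encoding placed first (the training description later mentions concatenation, but the order is an assumption), and it uses short toy vectors in place of real CLIP encodings. The function name `fuse_encodings` is hypothetical.

```python
import random
from typing import List, Optional, Sequence

Vector = List[float]


def fuse_encodings(
    image_encoding: Sequence[float],
    text_encodings: Sequence[Sequence[float]],
    rng: Optional[random.Random] = None,
) -> Vector:
    """Fuse one image encoding with one or more action-text encodings.

    Assumption: fusion is concatenation, image encoding first.
    """
    if not text_encodings:
        raise ValueError("at least one action-text encoding is required")
    rng = rng or random.Random()
    # Copy so the caller's list is not reordered by the shuffle.
    shuffled = [list(t) for t in text_encodings]
    if len(shuffled) > 1:
        rng.shuffle(shuffled)  # multiple texts are concatenated in random order
    fused: Vector = list(image_encoding)
    for text_vec in shuffled:
        fused.extend(text_vec)
    return fused


# Toy vectors standing in for CLIP outputs.
image_enc = [0.1, 0.2]
raise_brows = [1.0, 0.0]
open_mouth = [0.0, 1.0]

single = fuse_encodings(image_enc, [raise_brows])
multi = fuse_encodings(image_enc, [raise_brows, open_mouth], random.Random(0))
print(single)
print(len(multi))
```

Under this assumption the fused length grows with the number of action texts; the source does not state how that is reconciled with the decoder's input size.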
[0068] The fused encoding is input into the decoder, which outputs the target keypoint sequence. This sequence contains multiple elements; each element contains the multiple keypoints of one video frame, so a video frame can be reconstructed from each element. The training process of the decoder is described in detail later.
[0069] The reference image and the target keypoint sequence are input into a keypoint-based generator. The keypoint-based generator can generate a reconstructed video. The content of the reconstructed video is: the target object in the reference image performing an action described in the text. The keypoint-based generator is pre-trained, and the training method can refer to relevant techniques; no restrictions are imposed here.
[0070] The keypoint-based generator can extract the appearance features of the target object in the reference image, as well as multiple keypoints related to the target object and its actions; determine the synthetic flow field based on the multiple keypoints and the target keypoint sequence; and synthesize and reconstruct the video based on the synthetic flow field and appearance features.
[0071] For example, when the target object is a face, the key points can include the eyes, eyebrows, mouth, nose, etc. Based on the multiple key points and the target key point sequence, the deformation from the multiple key points to the target key point sequence can be implicitly estimated as a synthetic flow field. By warping the appearance features with the synthetic flow field, a reconstructed video with the same target object as the reference image is synthesized. The target object in the reconstructed video performs the actions described in the action text.
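The warping step alone can be pictured with a small sketch. It is not the generator of the disclosure, which is a pre-trained network that estimates the flow implicitly. Here the flow field is simply given, the appearance features are a 2D grid of numbers, and nearest-neighbour backward warping with border clamping is swapped in as an assumed, simplified warping rule. The name `warp_features` is hypothetical.

```python
from typing import List, Sequence, Tuple

Grid = List[List[float]]
Flow = Sequence[Sequence[Tuple[int, int]]]  # per-cell (dx, dy) offsets


def warp_features(features: Sequence[Sequence[float]], flow: Flow) -> Grid:
    """Backward-warp a feature grid: out[y][x] = features[y + dy][x + dx].

    Assumption: integer offsets and clamping at the borders.
    """
    height, width = len(features), len(features[0])
    out: Grid = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            src_x = min(max(x + dx, 0), width - 1)   # clamp to valid columns
            src_y = min(max(y + dy, 0), height - 1)  # clamp to valid rows
            out[y][x] = features[src_y][src_x]
    return out


features = [[1.0, 2.0, 3.0],
            [4.0, 5.0, 6.0]]
# Every output cell samples from the cell one column to its right.
flow = [[(1, 0)] * 3 for _ in range(2)]
warped = warp_features(features, flow)
print(warped)
```

One such warp per element of the target keypoint sequence would give one frame of the reconstructed video; how the flow is derived from the keypoints is left to the pre-trained generator.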
[0072] The training process of the decoder is described below. Figure 2 is a schematic diagram of the decoder training process in this embodiment of the disclosure. During decoder training, the switch in Figure 2 is connected to the training phase. After the decoder is trained, the switch is connected to the inference phase, so that the reconstructed video can be generated directly from the fused encoding.
[0073] First, multiple sample data pairs are acquired. Each sample data pair includes at least one action text sample, a reference image sample, and an original video sample. The action text sample describes the action performed by an object sample in the original video sample. The reference image sample is the first video frame of the original video sample, and the object sample included in the reference image sample is of the same category as the target object included in the reference image. For example, if the reference image is a face reference image including a face, then the reference image sample is a face reference image sample including a face.
[0074] The Facial Action Coding System (FACS) classifies human facial movements based on their visual appearance. Based on the types and movements of facial muscles, FACS can identify different facial action units and action descriptors. FACS as proposed includes 27 facial action units, 25 head and eye position codes, 28 miscellaneous motion codes, and a 5-level (A to E) intensity scale for facial actions. Driven by research demand, a series of facial action video datasets have been collected and labeled using FACS. When the reference image samples are face reference image samples, multiple sample data pairs can be obtained from these facial action video datasets.
[0075] Specifically, the facial motion videos in these datasets can be cut into video clips shorter than 10 seconds. Each video clip contains a series of simple actions and is used as an original video sample. Each frame in the facial motion video datasets has a FACS label, so at least one action text sample corresponding to each original video sample can be obtained. The first frame of each video clip is used as the reference image sample. In this way, sample data pairs can be constructed.
[0076] When the reference image sample is not a face reference image sample, multiple action videos can be acquired. Each action video can be used as an original video sample, and the actions in the action video can be identified to determine the corresponding action text sample. The first frame of the action video can be used as the reference image sample.
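The clip-based construction of sample data pairs described for face videos can be sketched as follows. The sketch assumes that per-frame FACS labels have already been converted into short action phrases, that clips are cut at fixed intervals, and that clips with no label are skipped; none of these details is stated in the source. `SamplePair` and `build_sample_pairs` are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import Any, List, Sequence


@dataclass
class SamplePair:
    action_texts: List[str]  # action text sample(s) for the clip
    reference_image: Any     # first frame of the clip
    video: List[Any]         # the clip itself (original video sample)


def build_sample_pairs(
    frames: Sequence[Any],
    frame_labels: Sequence[Sequence[str]],
    fps: float,
    max_seconds: float = 10.0,
) -> List[SamplePair]:
    """Cut a labeled video into clips shorter than max_seconds."""
    if len(frames) != len(frame_labels):
        raise ValueError("need exactly one label list per frame")
    # Largest frame count whose duration stays strictly below max_seconds.
    clip_len = max(1, math.ceil(max_seconds * fps) - 1)
    pairs: List[SamplePair] = []
    for start in range(0, len(frames), clip_len):
        clip = list(frames[start:start + clip_len])
        texts: List[str] = []
        for labels in frame_labels[start:start + clip_len]:
            for label in labels:
                if label not in texts:  # keep first-seen order, no repeats
                    texts.append(label)
        if texts:  # a pair needs at least one action text
            pairs.append(SamplePair(texts, clip[0], clip))
    return pairs


# Toy data: 50 placeholder "frames" at 2 fps with made-up labels.
frames = list(range(50))
labels = [["raise eyebrows"] if i < 30 else ["open mouth"] for i in range(50)]
pairs = build_sample_pairs(frames, labels, fps=2)
print(len(pairs), pairs[1].action_texts)
```

In each pair the reference image is the first frame of its clip, matching the rule that the reference image sample is the first video frame of the original video sample.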
[0077] After determining the sample data pair, the action text sample and reference image sample in the sample data pair can be input into the pre-trained CLIP model. The CLIP model outputs the text code of the action text sample and the image code of the reference image sample. By concatenating the text code of the action text sample and the image code of the reference image sample, a fused coded sample can be obtained.
[0078] By inputting the original video sample into a pre-trained keypoint extractor, the first keypoint sequence of the original video sample can be obtained. Each element in the first keypoint sequence corresponds to a video frame of the original video sample, and each element includes multiple keypoints in the video frame corresponding to that element.
[0079] When the reference image sample is a face reference image sample, the keypoint extractor can be instantiated as an expression deformation estimation network of Face-vid2vid (a face video generation model). Face-vid2vid proposes a face video synthesis method that models 2D keypoints in a face image as projections of a set of 3D keypoints. The expression deformation estimation network estimates the deformations of action-related 3D keypoints from a neutral 3D face expression. The sequence of these 3D keypoints produces a highly compressed and subject-independent facial action representation. This significantly reduces the workload compared to learning from a series of video frames.
[0080] Inputting the first keypoint sequence into the encoder to be trained yields the original video encoded sample. The encoder can use the action-related 3D keypoints in each frame as keypoints. The first keypoint sequence is mapped by linear projection to the same dimension as the encoded reconstructed video sample. The mapping result is then added to the position embedding and, together with a learnable ensemble token, fed into a stack of standard transformer encoding modules. As is typical for transformer-based tasks, the first output of the last encoding module is used as the original video encoded sample.
[0081] Inputting the original video-encoded samples into the decoder to be trained yields a second keypoint sequence. A video frame can be reconstructed based on each element in the second keypoint sequence, where each element includes multiple keypoints. The decoder contains multiple transformer decoding modules and a linear projection layer. Compared to the encoding module, the decoding module integrates the original video-encoded samples. The original video-encoded samples are used as keys and values for each cross-attention layer, and a positional embedding of length M (the length of the first keypoint sequence) is simply applied as the input query for the decoding module. The output of the decoding module is mapped to the keypoint dimension through linear projection, resulting in the reconstructed 3D keypoint sequence.
[0082] By inputting the second keypoint sequence into a pre-trained keypoint-based generator, reconstructed video samples can be generated. The method for generating reconstructed video samples from the second keypoint sequence using the keypoint generator can be found in the previous section on generating reconstructed video from the target keypoint sequence.
[0083] Based on the first and second keypoint sequences, as well as the original video coding samples and fused coding samples, a loss function can be constructed. The encoder and decoder to be trained are then trained based on the loss function, thus obtaining the trained encoder and decoder.
[0084] Specifically, a keypoint reconstruction loss function can be established based on the first keypoint sequence and the second keypoint sequence; an encoding loss function can be established based on the original video coding samples and the fused coding samples; a total loss function can be determined based on the keypoint reconstruction loss function and the encoding loss function; with the goal of minimizing the total loss function, the encoder and decoder to be trained are trained to obtain the trained encoder and the trained decoder.
[0085] Based on the first keypoint sequence and the second keypoint sequence, a keypoint reconstruction loss function is established, which may include: determining the difference of keypoints in each frame based on the difference between each element in the first keypoint sequence and each element with the same index in the second keypoint sequence; determining the difference of keypoint movement in two adjacent frames based on two elements of every two adjacent indices in the first keypoint sequence and two elements of every two adjacent indices in the second keypoint sequence; and establishing a keypoint reconstruction loss function based on the difference of keypoints in each frame and the difference of keypoint movement in two adjacent frames.
[0086] During the training phase, each video frame of the reconstructed video sample corresponds one-to-one with the video frame of the original video sample that has the same frame number. Correspondingly, each element in the first keypoint sequence corresponds one-to-one with the element in the second keypoint sequence that has the same index. Therefore, based on the difference between each element in the first keypoint sequence and the element with the same index in the second keypoint sequence, the difference in keypoints for each frame can be determined. Based on the two elements at every two adjacent indices in the first keypoint sequence and the two elements at the corresponding two adjacent indices in the second keypoint sequence, the difference in keypoint movement between every two adjacent frames can be determined.
[0087] The keypoint reconstruction loss function $L_{recon}$ can be determined using the following formula:
[0088] $$L_{recon} = \sum_{m=1}^{M} \sum_{n=1}^{N} \left\| \hat{x}_{m,n} - x_{m,n} \right\|_1 + \sum_{m=1}^{M-1} \sum_{n=1}^{N} \left\| \left( \hat{x}_{m+1,n} - \hat{x}_{m,n} \right) - \left( x_{m+1,n} - x_{m,n} \right) \right\|_1$$
[0089] Where M is the length of the first keypoint sequence and also the length of the second keypoint sequence, m = 1, 2, ..., M; N is the number of keypoints included in each element of the first keypoint sequence and also the number of keypoints included in each element of the second keypoint sequence, n = 1, 2, ..., N; $\hat{x}_{m,n}$ denotes the n-th keypoint included in the m-th element of the second keypoint sequence; $x_{m,n}$ denotes the n-th keypoint included in the m-th element of the first keypoint sequence; $\hat{x}_{m+1,n}$ denotes the n-th keypoint included in the (m+1)-th element of the second keypoint sequence; $x_{m+1,n}$ denotes the n-th keypoint included in the (m+1)-th element of the first keypoint sequence; $\|\cdot\|_1$ represents the 1-norm.
[0090] The first term of the formula for the keypoint reconstruction loss function $L_{recon}$ represents the difference in key points in each frame, and the second term represents the difference in the movement of key points between two adjacent frames.
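The two terms described above, the per-frame keypoint difference and the adjacent-frame movement difference, can be sketched in plain Python. The sketch assumes unnormalized sums and equal weighting of the two terms, which the source does not specify, and represents each keypoint as a tuple of coordinates. The function names are hypothetical.

```python
from typing import Sequence

Keypoint = Sequence[float]          # e.g. (x, y, z)
Frame = Sequence[Keypoint]          # N keypoints of one video frame
KeypointSequence = Sequence[Frame]  # M frames


def l1_distance(a: Keypoint, b: Keypoint) -> float:
    """1-norm of the difference between two keypoints."""
    return sum(abs(u - v) for u, v in zip(a, b))


def keypoint_reconstruction_loss(
    original: KeypointSequence, reconstructed: KeypointSequence
) -> float:
    """Per-frame term plus adjacent-frame movement term (unnormalized)."""
    if len(original) != len(reconstructed):
        raise ValueError("both sequences must have the same length M")
    # Term 1: keypoint difference at the same frame index.
    frame_term = 0.0
    for frame, frame_hat in zip(original, reconstructed):
        for point, point_hat in zip(frame, frame_hat):
            frame_term += l1_distance(point_hat, point)
    # Term 2: difference in keypoint movement between adjacent frames.
    motion_term = 0.0
    for m in range(len(original) - 1):
        for n in range(len(original[m])):
            move = [b - a for a, b in zip(original[m][n], original[m + 1][n])]
            move_hat = [
                b - a
                for a, b in zip(reconstructed[m][n], reconstructed[m + 1][n])
            ]
            motion_term += l1_distance(move_hat, move)
    return frame_term + motion_term


# One keypoint moving right by 1 per frame.
original = [[(0.0, 0.0)], [(1.0, 0.0)], [(2.0, 0.0)]]
shifted = [[(0.5, 0.0)], [(1.5, 0.0)], [(2.5, 0.0)]]  # same motion, offset
frozen = [[(0.0, 0.0)], [(0.0, 0.0)], [(0.0, 0.0)]]   # no motion at all

print(keypoint_reconstruction_loss(original, shifted))
print(keypoint_reconstruction_loss(original, frozen))
```

A constant offset is penalized only by the first term, while a sequence that does not move is penalized by both, which is the role the movement term plays.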
[0091] Based on the original video coded samples and the fused coded samples, a coding loss function is established. This can include: determining the similarity between the original video coded samples and the fused coded samples; and establishing the coding loss function based on the similarity. The coding loss function can encourage the original video coded samples to simultaneously contain temporal motion information from action text samples and spatial information from reference image samples.
[0092] The coding loss function $L_{align}$ can be determined using the following formula:
[0093] $$L_{align} = 1 - \cos(f_{TI}, f_v)$$
[0094] Where $f_{TI}$ is the fused encoded sample, $f_v$ is the original video encoded sample, and $\cos(f_{TI}, f_v)$ is the cosine similarity between the features $f_{TI}$ and $f_v$.
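A similarity-based coding loss of this kind can be sketched as below. The form "one minus cosine similarity" is an assumption chosen so that minimizing the loss pulls the two encodings together; the source only states that the loss is built from the cosine similarity. `coding_loss` is a hypothetical name.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def coding_loss(f_ti: Sequence[float], f_v: Sequence[float]) -> float:
    """Assumed form: 1 - cos(f_TI, f_v); 0 when the encodings align."""
    return 1.0 - cosine_similarity(f_ti, f_v)


aligned = coding_loss([1.0, 0.0], [2.0, 0.0])      # same direction
orthogonal = coding_loss([1.0, 0.0], [0.0, 3.0])   # unrelated directions
opposite = coding_loss([1.0, 0.0], [-1.0, 0.0])    # opposite directions
print(aligned, orthogonal, opposite)
```

Because cosine similarity ignores vector length, this loss compares only the directions of the fused encoded sample and the original video encoded sample.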
[0095] The total loss function $L$ can be determined by the following formula:
[0096] $$L = L_{recon} + \lambda L_{align}$$
[0097] Here, λ is the scale factor, which can be set according to requirements; the meanings of the other symbols are as given above.
[0098] In this embodiment, the encoder and decoder are trained simultaneously. Through backpropagation, the keypoint reconstruction loss is propagated first through the decoder and then through the encoder, so both are optimized against it. This allows the encoder output to also carry temporal information. Optimizing the encoder solely through the encoding loss function cannot achieve the same effect, because CLIP extracts features only from images and text, which carry no temporal information.
[0099] After obtaining the trained decoder, in order to generate the reconstructed video based on the fusion encoding, there is no need to use a keypoint extractor and a trained encoder. Simply input the fusion encoding into the decoder to obtain the target keypoint sequence, and input the target keypoint sequence and the reference image into the keypoint-based generator to obtain the reconstructed video.
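The inference path just described can be summarized as a short sketch. All four components (CLIP image encoder, CLIP text encoder, trained decoder, keypoint-based generator) are passed in as callables and replaced here by toy stand-ins, so the code shows only the data flow, not any real model. Concatenation as the fusion rule is an assumption, and `generate_video` is a hypothetical name.

```python
from typing import Any, Callable, List, Sequence

Vector = List[float]
KeypointSequence = List[List[Sequence[float]]]


def generate_video(
    reference_image: Any,
    action_text: str,
    encode_image: Callable[[Any], Sequence[float]],
    encode_text: Callable[[str], Sequence[float]],
    decoder: Callable[[Vector], KeypointSequence],
    generator: Callable[[Any, KeypointSequence], List[Any]],
) -> List[Any]:
    """Inference only: no keypoint extractor and no encoder are involved."""
    image_code = list(encode_image(reference_image))
    text_code = list(encode_text(action_text))
    fused = image_code + text_code        # assumed fusion: concatenation
    target_keypoints = decoder(fused)     # target keypoint sequence
    return generator(reference_image, target_keypoints)


# Toy stand-ins for the pre-trained components (not real models).
def toy_encode_image(image: str) -> Vector:
    return [float(len(image))]


def toy_encode_text(text: str) -> Vector:
    return [float(len(text.split()))]


def toy_decoder(fused: Vector) -> KeypointSequence:
    # Three frames, one 2D keypoint each, drifting along the first axis.
    return [[(fused[0] + t, fused[1])] for t in range(3)]


def toy_generator(image: Any, keypoints: KeypointSequence) -> List[Any]:
    return [(image, frame) for frame in keypoints]


video = generate_video(
    "face.png", "raise eyebrows",
    toy_encode_image, toy_encode_text, toy_decoder, toy_generator,
)
print(len(video))
```

Passing the components as parameters keeps the sketch independent of any particular CLIP or generator implementation.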
[0100] The image and text-based video generation method proposed in this disclosure only requires the reconstructed video to semantically match the action description, rather than strictly matching the original video frame by frame. For example, frame-level metrics are unsuitable for two videos displaying the same facial actions but at different speeds. To address this issue, this disclosure also proposes an evaluation metric for the reconstructed video called LDTW (landmark dynamic time warping). LDTW can evaluate the performance of the image and text-based video generation method proposed in this disclosure, and LDTW characterizes the semantic matching degree between the verification video and the reconstructed verification video.
[0101] First, verification data pairs are obtained. Each verification data pair includes verification action text, a verification reference image, and a verification video. The verification action text describes the action performed by the verification object in the verification video, and the verification reference image is the first frame of the verification video. Verification data pairs can be obtained in the same way as sample data pairs, and sample data pairs can themselves be used as verification data pairs. Alternatively, the verification data pairs can be determined based on the reference image, the action text, and the corresponding original video.
[0102] Based on the verification action text and the verification reference image, a reconstructed verification video is obtained, including: acquiring the text encoding of the verification action text and the image encoding of the verification reference image, and then fusing them to obtain a verification fusion encoding; inputting the verification fusion encoding into a trained decoder to obtain the fourth keypoint sequence corresponding to the reconstructed verification video; generating the reconstructed verification video based on the fourth keypoint sequence and the verification reference image; inputting the verification video into a pre-trained keypoint extractor to obtain the third keypoint sequence corresponding to the verification video; and obtaining the semantic matching degree between the verification video and the reconstructed verification video based on the third keypoint sequence and the fourth keypoint sequence.
[0103] The semantic matching degree between the verification video and the reconstructed verification video can be determined using the following formula:
[0104] $$\mathrm{LDTW} = \frac{1}{Q \cdot N} \sum_{q=1}^{Q} \sum_{n=1}^{N} \left\| \hat{y}_{q,n} - y_{q,n} \right\|_2, \qquad (\hat{y}, y) = \mathrm{OM}(\hat{x}, x)$$
[0105] Wherein, LDTW represents the semantic matching degree; OM(·) is the optimal matching algorithm, applied to the fourth keypoint sequence $\hat{x}$ and the third keypoint sequence $x$; Q is the matching sequence length corresponding to the third keypoint sequence and also the matching sequence length corresponding to the fourth keypoint sequence, q=1,2,…,Q; N represents the number of keypoints extracted in a video frame, n=1,2,…,N; $\hat{y}_{q,n}$ denotes the n-th keypoint included in the q-th element of the matching sequence corresponding to the fourth keypoint sequence; $y_{q,n}$ denotes the n-th keypoint included in the q-th element of the matching sequence corresponding to the third keypoint sequence; $\|\cdot\|_2$ represents the 2-norm.
[0106] The matching sequences corresponding to the third and fourth keypoint sequences are determined using an optimal matching algorithm. Q is neither less than the length of the third keypoint sequence nor greater than twice the length of the third keypoint sequence, because each index from the third keypoint sequence must match one or more indices from the fourth keypoint sequence, and the first indices of both sequences must match, as must their last indices. To evaluate the average performance across the entire validation set, the absolute differences are summed and divided by Q*N.
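As an illustration of the matching described above, the following minimal Python sketch computes an LDTW-style score, with classic dynamic time warping standing in for the optimal matching algorithm OM. The function names, the choice of step pattern (diagonal, vertical, horizontal), the use of the mean per-keypoint distance as the alignment cost, and the representation of a keypoint sequence as a list of frames of (x, y) tuples are all assumptions made for the example; the disclosure only states that OM produces matching sequences of length Q whose first and last indices coincide.

```python
import math


def frame_distance(frame_a, frame_b):
    """Mean 2-norm distance between corresponding keypoints of two frames."""
    return sum(math.dist(p, q) for p, q in zip(frame_a, frame_b)) / len(frame_a)


def optimal_matching(seq_a, seq_b):
    """Align two keypoint sequences with dynamic time warping.

    Returns a list of (i, j) index pairs. The first pair is (0, 0), the last
    pair is (len(seq_a) - 1, len(seq_b) - 1), and every index of each
    sequence appears in at least one pair.
    """
    len_a, len_b = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (len_b + 1) for _ in range(len_a + 1)]
    cost[0][0] = 0.0
    for i in range(1, len_a + 1):
        for j in range(1, len_b + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the last cell to the first to recover the matching.
    i, j = len_a, len_b
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    path.reverse()
    return path


def l_dtw(third_seq, fourth_seq):
    """Return (score, Q): summed keypoint distances over the matching, divided by Q * N."""
    path = optimal_matching(third_seq, fourth_seq)
    total = sum(frame_distance(third_seq[i], fourth_seq[j]) for i, j in path)
    # frame_distance already divides by N, so dividing by Q gives 1 / (Q * N).
    return total / len(path), len(path)
```

In this sketch, a generated sequence that performs the same motion as the verification sequence but holds one pose for an extra frame still scores zero, which is the relaxation of strict frame-by-frame alignment that LDTW is meant to provide.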
[0107] LDTW can be seen as a relaxed form of the keypoint reconstruction loss function. When a reconstructed verification video is generated based on the verification action text and the verification reference image, it is desirable for the reconstructed verification video to reflect the semantics of the input verification action text, but its frames are not required to be strictly aligned with the frames of the verification video. Therefore, LDTW can be used to evaluate video generation methods based on images and text.
[0108] LDTW is not used as a loss function during decoder training. During training, the decoder decodes the original video encoded sample to obtain the second keypoint sequence, and the reconstructed video sample generated from the second keypoint sequence is expected to be identical to the input original video sample, so no relaxation is needed. During testing or evaluation, however, the input to the decoder is the verification fusion encoding, which is derived only from the verification action text and the verification reference image, so the reconstructed verification video cannot be required to be identical to the verification video. Furthermore, the optimal matching algorithm OM is non-differentiable; therefore, LDTW cannot be used as a loss function.
[0109] It should be noted that, for the sake of simplicity, the method embodiments are all described as a series of actions. However, those skilled in the art should understand that the embodiments of this disclosure are not limited to the described order of actions, because according to the embodiments of this disclosure, some steps can be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of this disclosure.
[0110] Figure 3 is a schematic diagram of the structure of a video generation device based on images and text according to an embodiment of this disclosure. As shown in Figure 3, the device includes an acquisition module, an input module, and a reconstruction module, wherein:
[0111] The acquisition module is used to acquire the fusion encoding, which includes: image encoding of a reference image and text encoding of action text, wherein the reference image includes the target object;
[0112] The input module is used to input the fusion encoding into the decoder to obtain the target key point sequence;
[0113] The reconstruction module is used to obtain a reconstructed video based on the reference image and the target key point sequence. The video content of the reconstructed video is: the target object performing the action described in the action text.
[0114] The decoder is trained based on the first keypoint sequence of the original video sample and the second keypoint sequence corresponding to the reconstructed video sample, as well as the original video encoded sample and the fused encoded sample. The fused encoded sample includes the text encoding of the action text sample and the image encoding of the reference image sample.
[0115] Optionally, the training process of the decoder includes the following steps:
[0116] Acquire sample data pairs, the sample data pairs including the action text sample, the reference image sample and the original video sample, the action text sample is used to describe the action performed by the object sample in the original video sample, and the reference image sample is the first video frame of the original video sample;
[0117] The fused encoded sample is obtained based on the action text sample and the reference image sample;
[0118] The original video sample is input into the key point extractor to obtain the first key point sequence;
[0119] The first keypoint sequence is input into the encoder to be trained to obtain the original video encoded sample;
[0120] The original video encoded sample is input into the decoder to be trained to obtain the second keypoint sequence;
[0121] Based on the first keypoint sequence and the second keypoint sequence, as well as the original video encoded sample and the fused encoded sample, the encoder and the decoder to be trained are trained to obtain the trained encoder and the trained decoder.
[0122] Optionally, the step of training the encoder and the decoder to be trained based on the first keypoint sequence and the second keypoint sequence, as well as the original video encoded sample and the fused encoded sample, to obtain the trained encoder and the trained decoder, includes:
[0123] Based on the first keypoint sequence and the second keypoint sequence, a keypoint reconstruction loss function is established;
[0124] Based on the original video encoded sample and the fused encoded sample, an encoding loss function is established;
[0125] The total loss function is determined based on the key point reconstruction loss function and the encoding loss function;
[0126] The encoder and decoder to be trained are trained according to the total loss function to obtain the trained encoder and decoder.
[0127] Optionally, establishing a keypoint reconstruction loss function based on the first keypoint sequence and the second keypoint sequence includes:
[0128] The differences in key points for each frame are determined based on the differences between each element in the first key point sequence and each element with the same index in the second key point sequence.
[0129] Based on the two elements of each pair of adjacent indices in the first keypoint sequence and the two elements of each pair of adjacent indices in the second keypoint sequence, determine the difference in keypoint movement between each pair of adjacent frames.
[0130] Based on the differences in keypoints in each frame and the differences in keypoint movement between two adjacent frames, a keypoint reconstruction loss function is established.
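The two terms described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the loss used in the disclosure: the 2-norm distance, the mean reduction, the additive combination, and the hypothetical `motion_weight` parameter are choices made for the example, as are the function names. Keypoint sequences are assumed to be lists of frames, each frame a list of (x, y) tuples.

```python
import math


def mean_point_distance(seq_a, seq_b):
    """Mean 2-norm distance between same-index points of same-index frames."""
    total, count = 0.0, 0
    for frame_a, frame_b in zip(seq_a, seq_b):
        for p, q in zip(frame_a, frame_b):
            total += math.dist(p, q)
            count += 1
    # An empty input (e.g. motion of a one-frame sequence) contributes nothing.
    return total / count if count else 0.0


def motion(seq):
    """Displacement of every keypoint between each pair of adjacent frames."""
    return [
        [(q[0] - p[0], q[1] - p[1]) for p, q in zip(prev_frame, next_frame)]
        for prev_frame, next_frame in zip(seq, seq[1:])
    ]


def keypoint_reconstruction_loss(first_seq, second_seq, motion_weight=1.0):
    """Per-frame keypoint difference plus adjacent-frame movement difference."""
    position_term = mean_point_distance(first_seq, second_seq)
    motion_term = mean_point_distance(motion(first_seq), motion(second_seq))
    return position_term + motion_weight * motion_term
```

In this sketch the movement term penalizes a reconstruction whose keypoints sit near the right positions but move differently from frame to frame, while a constant offset of the whole sequence is penalized only by the per-frame term.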
[0131] Optionally, establishing the encoding loss function based on the original video encoded sample and the fused encoded sample includes:
[0132] Based on the original video encoded sample and the fused encoded sample, determine the similarity between the original video encoded sample and the fused encoded sample;
[0133] Based on the similarity, the encoding loss function is established.
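A similarity-based encoding loss of this kind, and its combination with a keypoint reconstruction loss into a total loss, might look like the following Python sketch. Cosine similarity, the form `1 - similarity`, the weighted sum, and the hypothetical `encoding_weight` parameter are assumptions for illustration; this section of the disclosure states only that the encoding loss is established from a similarity between the two encoded samples and that the total loss is determined from both loss functions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero encoding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def encoding_loss(video_encoded_sample, fused_encoded_sample):
    """Small when the video encoding and the fused encoding point the same way."""
    return 1.0 - cosine_similarity(video_encoded_sample, fused_encoded_sample)


def total_loss(reconstruction_loss, enc_loss, encoding_weight=1.0):
    """Weighted sum of the keypoint reconstruction loss and the encoding loss."""
    return reconstruction_loss + encoding_weight * enc_loss
```

Minimizing such a term pulls the encoder output for the original video toward the fused image-text encoding, which is what allows the decoder to be driven by the fused encoding alone at generation time.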
[0134] Optionally, the acquisition module is specifically used to perform:
[0135] Obtain the reference image and the action text;
[0136] The reference image and the action text are input into a text image pre-training model to obtain the image encoding of the reference image and the text encoding of the action text;
[0137] The image encoding of the reference image and the text encoding of the action text are fused together to obtain the fused encoding.
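The fusion step can be illustrated with a short Python sketch. The fusion operator is not specified in this section, so L2-normalizing each encoding and concatenating the results is purely an assumption for the example, and `fuse_encodings` is a hypothetical name; the image and text encodings are taken as plain lists of floats that a text image pre-training model is assumed to have produced already.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_encodings(image_encoding, text_encoding):
    """Fuse by concatenating the normalized image and text encodings."""
    return l2_normalize(image_encoding) + l2_normalize(text_encoding)
```

Normalizing before concatenation keeps either modality from dominating the fused vector merely because its encoding has a larger magnitude.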
[0138] Optionally, the reconstruction module is specifically used to perform:
[0139] Extract the appearance features of the target object from the reference image, as well as multiple key points of the target object related to the action;
[0140] The synthesized flow field is determined based on the multiple key points and the target key point sequence;
[0141] The reconstructed video is synthesized based on the synthesized flow field and the appearance features.
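The role of the flow field can be illustrated with a deliberately simplified Python sketch. It is not the synthesis network of the disclosure: here a dense backward flow is interpolated from sparse keypoint displacements with Gaussian weights, and raw pixel values of the reference image are warped with nearest-neighbor sampling in place of learned appearance features. The function names, the `sigma` parameter, and the image representation (a list of rows of pixel values) are assumptions.

```python
import math


def synthesize_flow(reference_kps, target_kps, height, width, sigma=2.0):
    """Dense backward flow: flow[y][x] is the (dx, dy) offset from output
    pixel (x, y) to the reference-image location it is sampled from."""
    flow = []
    for y in range(height):
        row = []
        for x in range(width):
            # Weight each keypoint by its proximity to this output pixel.
            weights = [
                math.exp(-((x - tx) ** 2 + (y - ty) ** 2) / (2 * sigma ** 2))
                for tx, ty in target_kps
            ]
            total = sum(weights)
            if total < 1e-12:
                row.append((0.0, 0.0))  # too far from every keypoint: no motion
                continue
            pairs = list(zip(weights, reference_kps, target_kps))
            dx = sum(w * (r[0] - t[0]) for w, r, t in pairs) / total
            dy = sum(w * (r[1] - t[1]) for w, r, t in pairs) / total
            row.append((dx, dy))
        flow.append(row)
    return flow


def warp(image, flow):
    """Sample the reference image along the flow (nearest neighbor, clamped)."""
    height, width = len(image), len(image[0])
    warped = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = flow[y][x]
            src_x = min(max(int(round(x + dx)), 0), width - 1)
            src_y = min(max(int(round(y + dy)), 0), height - 1)
            row.append(image[src_y][src_x])
        warped.append(row)
    return warped


def synthesize_video(image, reference_kps, target_kp_sequence):
    """One warped frame per element of the target keypoint sequence."""
    height, width = len(image), len(image[0])
    return [
        warp(image, synthesize_flow(reference_kps, kps, height, width))
        for kps in target_kp_sequence
    ]
```

When the target keypoints equal the reference keypoints the flow is zero and the frame reproduces the reference image; as the target keypoints move, nearby image content follows them.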
[0142] Optionally, the target object is a face, and the action text is used to describe facial actions.
[0143] Optionally, the device further includes:
[0144] The verification acquisition module is used to acquire verification data pairs, which include verification action text, verification reference image and verification video. The verification action text is used to describe the action performed by the verification object in the verification video, and the verification reference image is the first video frame of the verification video.
[0145] The verification reconstruction module is used to obtain a reconstructed verification video based on the verification action text and the verification reference image;
[0146] The sequence acquisition module is used to acquire the third key point sequence corresponding to the verification video and the fourth key point sequence corresponding to the reconstructed verification video.
[0147] The determining module is configured to determine the semantic matching degree between the verification video and the reconstructed verification video based on the third keypoint sequence and the fourth keypoint sequence, including:
[0148] The semantic matching degree between the verification video and the reconstructed verification video is determined according to the following formula:
[0149] $$L_{DTW} = \frac{1}{Q \cdot N} \sum_{q=1}^{Q} \sum_{n=1}^{N} \left\| \hat{x}_{q,n} - x_{q,n} \right\|_2, \qquad (\hat{X}, X) = \mathrm{OM}(\hat{K}, K)$$
[0150] Wherein, LDTW represents the semantic matching degree; OM(·) is the optimal matching algorithm, which maps the fourth keypoint sequence $\hat{K}$ and the third keypoint sequence $K$ to their matching sequences $\hat{X}$ and $X$; Q is the length of the matching sequence corresponding to the third keypoint sequence and also the length of the matching sequence corresponding to the fourth keypoint sequence, q=1,2,…,Q; N represents the number of keypoints extracted in a video frame, n=1,2,…,N; $\hat{x}_{q,n}$ denotes the n-th keypoint of the q-th element in the matching sequence corresponding to the fourth keypoint sequence; $x_{q,n}$ denotes the n-th keypoint of the q-th element in the matching sequence corresponding to the third keypoint sequence; ||·||2 represents the 2-norm.
[0151] It should be noted that, because the device embodiments are similar to the method embodiments, they are described only briefly. For relevant details, please refer to the method embodiments.
[0152] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.
[0153] Those skilled in the art will understand that embodiments of this disclosure can be provided as methods, apparatus, or computer program products. Therefore, embodiments of this disclosure can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, embodiments of this disclosure can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0154] This disclosure describes embodiments of methods, apparatus, electronic devices, and computer program products according to embodiments of this disclosure with reference to flowchart illustrations and/or block diagrams. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0155] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0156] These computer program instructions can also be loaded onto a computer or other programmable data processing terminal device, causing a series of operational steps to be performed on the computer or other programmable terminal device to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0157] While preferred embodiments of the present disclosure have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including both the preferred embodiments and all changes and modifications falling within the scope of the present disclosure.
[0158] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes the element.
[0159] The foregoing has provided a detailed description of a video generation method and apparatus based on images and text provided in this disclosure. Specific examples have been used to illustrate the principles and implementation methods of this disclosure. The descriptions of the above embodiments are only for the purpose of helping to understand the method and its core ideas. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this disclosure. Therefore, the content of this specification should not be construed as a limitation of this disclosure.
Claims
1. A video generation method based on images and text, characterized in that the method includes: obtaining a fusion encoding, which includes an image encoding of a reference image and a text encoding of action text, wherein the reference image includes the target object, the image encoding of the reference image contains semantic information of the reference image, and the text encoding of the action text contains semantic information of the action text; inputting the fusion encoding into the decoder to obtain the target key point sequence; and obtaining a reconstructed video based on the reference image and the target key point sequence, wherein the video content of the reconstructed video is the target object performing the action described by the action text, and the reconstructed video contains semantic information of the reference image and the action text; wherein the decoder is trained based on the first keypoint sequence of the original video sample and the second keypoint sequence corresponding to the reconstructed video sample, as well as the original video encoded sample and the fused encoded sample, so that the reconstructed video reconstructed from the target keypoint sequence obtained by the decoder is semantically close to the original video; the fused encoded sample includes the text encoding of the action text sample and the image encoding of the reference image sample; and wherein obtaining the fusion encoding includes: obtaining the reference image and the action text; inputting the reference image and the action text into a text image pre-training model, which is used to construct a cross-modal semantic space between the image and the text, to obtain the image encoding of the reference image and the text encoding of the action text; and fusing the image encoding of the reference image and the text encoding of the action text to obtain the fusion encoding.
2. The method according to claim 1, characterized in that the training process of the decoder includes the following steps: acquiring sample data pairs, the sample data pairs including the action text sample, the reference image sample and the original video sample, wherein the action text sample is used to describe the action performed by the object sample in the original video sample, and the reference image sample is the first video frame of the original video sample; obtaining the fused encoded sample based on the action text sample and the reference image sample; inputting the original video sample into the key point extractor to obtain the first key point sequence; inputting the first keypoint sequence into the encoder to be trained to obtain the original video encoded sample; inputting the original video encoded sample into the decoder to be trained to obtain the second keypoint sequence; and training the encoder and the decoder to be trained based on the first keypoint sequence and the second keypoint sequence, as well as the original video encoded sample and the fused encoded sample, to obtain the trained encoder and the trained decoder.
3. The method according to claim 2, characterized in that the step of training the encoder and the decoder to be trained based on the first keypoint sequence and the second keypoint sequence, as well as the original video encoded sample and the fused encoded sample, to obtain the trained encoder and the trained decoder, includes: establishing a keypoint reconstruction loss function based on the first keypoint sequence and the second keypoint sequence; establishing an encoding loss function based on the original video encoded sample and the fused encoded sample; determining the total loss function based on the keypoint reconstruction loss function and the encoding loss function; and training the encoder and the decoder to be trained according to the total loss function to obtain the trained encoder and the trained decoder.
4. The method according to claim 3, characterized in that the step of establishing a keypoint reconstruction loss function based on the first keypoint sequence and the second keypoint sequence includes: determining the differences in key points for each frame based on the differences between each element in the first key point sequence and the element with the same index in the second key point sequence; determining the difference in keypoint movement between each pair of adjacent frames based on the two elements of each pair of adjacent indices in the first keypoint sequence and the two elements of each pair of adjacent indices in the second keypoint sequence; and establishing the keypoint reconstruction loss function based on the differences in keypoints in each frame and the differences in keypoint movement between each pair of adjacent frames.
5. The method according to claim 3, characterized in that the step of establishing an encoding loss function based on the original video encoded sample and the fused encoded sample includes: determining the similarity between the original video encoded sample and the fused encoded sample; and establishing the encoding loss function based on the similarity.
6. The method according to any one of claims 1-5, characterized in that the step of obtaining the reconstructed video based on the reference image and the target key point sequence includes: extracting the appearance features of the target object from the reference image, as well as multiple key points of the target object related to the action; determining the synthesized flow field based on the multiple key points and the target key point sequence; and synthesizing the reconstructed video based on the synthesized flow field and the appearance features.
7. The method according to any one of claims 1-5, characterized in that the target object is a face, and the action text is used to describe facial actions.
8. The method according to any one of claims 1-5, characterized in that the method further includes: obtaining a verification data pair, which includes verification action text, a verification reference image and a verification video, wherein the verification action text is used to describe the action performed by the verification object in the verification video, and the verification reference image is the first video frame of the verification video; obtaining a reconstructed verification video based on the verification action text and the verification reference image; obtaining the third key point sequence corresponding to the verification video, and obtaining the fourth key point sequence corresponding to the reconstructed verification video; and determining the semantic matching degree between the verification video and the reconstructed verification video based on the third keypoint sequence and the fourth keypoint sequence, including: determining the semantic matching degree between the verification video and the reconstructed verification video according to the following formula: $$L_{DTW} = \frac{1}{Q \cdot N} \sum_{q=1}^{Q} \sum_{n=1}^{N} \left\| \hat{x}_{q,n} - x_{q,n} \right\|_2, \qquad (\hat{X}, X) = \mathrm{OM}(\hat{K}, K)$$; wherein, LDTW represents the semantic matching degree; OM(·) is the optimal matching algorithm, which maps the fourth keypoint sequence $\hat{K}$ and the third keypoint sequence $K$ to their matching sequences $\hat{X}$ and $X$; Q is the length of the matching sequence corresponding to the third keypoint sequence and also the length of the matching sequence corresponding to the fourth keypoint sequence, q=1,2,…,Q; N represents the number of keypoints extracted in a video frame, n=1,2,…,N; $\hat{x}_{q,n}$ denotes the n-th keypoint of the q-th element in the matching sequence corresponding to the fourth key point sequence; $x_{q,n}$ denotes the n-th keypoint of the q-th element in the matching sequence corresponding to the third key point sequence; ||·||2 represents the 2-norm.
9. A video generation device based on images and text, characterized in that the device includes: an acquisition module, used to acquire the fusion encoding, which includes the image encoding of a reference image and the text encoding of the action text, wherein the reference image includes the target object, the image encoding of the reference image contains the semantic information of the reference image, and the text encoding of the action text contains the semantic information of the action text; an input module, used to input the fusion encoding into the decoder to obtain the target key point sequence; and a reconstruction module, used to obtain a reconstructed video based on the reference image and the target key point sequence, wherein the video content of the reconstructed video is the target object performing the action described by the action text, and the reconstructed video contains semantic information of the reference image and the action text; wherein the decoder is trained based on the first keypoint sequence of the original video sample and the second keypoint sequence corresponding to the reconstructed video sample, as well as the original video encoded sample and the fused encoded sample, so that the reconstructed video reconstructed from the target keypoint sequence obtained by the decoder is semantically close to the original video; the fused encoded sample includes the text encoding of the action text sample and the image encoding of the reference image sample; and wherein the acquisition module is specifically used to perform: obtaining the reference image and the action text; inputting the reference image and the action text into a text image pre-training model, which is used to construct a cross-modal semantic space between the image and the text, to obtain the image encoding of the reference image and the text encoding of the action text; and fusing the image encoding of the reference image and the text encoding of the action text to obtain the fusion encoding.